To continue the test-drive of DuckDB, now I’ll try count distinct
. Link to first and second posts for some context.
Count distinct benchmark
This time it will be pretty straightforward1. Just try this simple query on DuckDB and PostgreSQL, using the same data as before (tiny, small, mid and year), and try querying a few different columns that differ in cardinality and data type.
SELECT COUNT(DISTINCT(col))
FROM table_name
Results
As expected, DuckDB is again consistently faster than Postgres in most queries, except for the timestamp columns, where there’s a similar issue as the previous post and DuckDB ends up being slower than postgres. In this case, however, the difference is not so big.
Here the plots summarizing the {bench}mark.
There are eye-popping differences such as 3 minutes for DuckDB to complete the query vs. almost 2 hours for Postgres. The plot below summarizes all the benchmarks showing relative times 2 to see how much faster/slower can DuckDB be for this kind of queries. In most cases, DuckDB is between 10x and 20x times faster than postgres, and for high-cardinality columns, can be even faster than that. Nice!. DuckDB still struggles with timestamp columns as seen in the previous post, but for this kind of queries, it does not seem to be so problematic as it was for the count aggregate queries (here it is only 1x to 3x times slower than postgres).
Wrap-up
Some non-exaustive TODOs:
-
I see DuckDB has native support to count distinct elements using HyperLogLog (see
approx_count_distinct(x)
) That’s cool!. In Postgres there’s certainly an extension for that (also this post). I’ve use it before (rarely, though, as most of the time I do need exact counts), so it’s nice to have it in DuckDB readily available!. Perhaps try it out sometime. -
…
-
No comparisons for collecting or not the query results, neither for using dplyr/sql to interface with DuckDB and Postgres. This time is plain sql and also no comparisons for Arrow (until I checked why it did not really work for the large dataset). ↩︎
-
Comparing the median of the timings for each backend and expressing them relative to the faster backend. ↩︎