To continue the test-drive of DuckDB, now I’ll try count distinct. Link to first and second posts for some context.

Count distinct benchmark

This time it will be pretty straightforward¹. Just try this simple query on DuckDB and PostgreSQL, using the same data as before (tiny, small, mid and year), and try querying a few different columns that differ in cardinality and data type.

SELECT COUNT(DISTINCT(col))
FROM table_name

Results

As expected, DuckDB is again consistently faster than Postgres in most queries, except for the timestamp columns, where there’s a similar issue as the previous post and DuckDB ends up being slower than postgres. In this case, however, the difference is not so big.

Here the plots summarizing the {bench}mark.

There are eye-popping differences such as 3 minutes for DuckDB to complete the query vs. almost 2 hours for Postgres. The plot below summarizes all the benchmarks showing relative times ² to see how much faster/slower can DuckDB be for this kind of queries. In most cases, DuckDB is between 10x and 20x times faster than postgres, and for high-cardinality columns, can be even faster than that. Nice!. DuckDB still struggles with timestamp columns as seen in the previous post, but for this kind of queries, it does not seem to be so problematic as it was for the count aggregate queries (here it is only 1x to 3x times slower than postgres).

Wrap-up

Some non-exaustive TODOs:

I see DuckDB has native support to count distinct elements using HyperLogLog (see approx_count_distinct(x)) That’s cool!. In Postgres there’s certainly an extension for that (also this post). I’ve use it before (rarely, though, as most of the time I do need exact counts), so it’s nice to have it in DuckDB readily available!. Perhaps try it out sometime.
…

No comparisons for collecting or not the query results, neither for using dplyr/sql to interface with DuckDB and Postgres. This time is plain sql and also no comparisons for Arrow (until I checked why it did not really work for the large dataset). ↩︎
Comparing the median of the timings for each backend and expressing them relative to the faster backend. ↩︎

Count distinct: DuckDB vs. PostgreSQL

DuckDB around 10x-20x faster for count distinct queries

Count distinct benchmark

Results

Wrap-up