
Count aggregate queries on DuckDB, Arrow and PostgreSQL

DuckDB for the win! Arrow querying parquet follows closely but chokes on larger datasets

Let’s continue the test-drive of DuckDB and Arrow as alternatives to my beloved PostgreSQL, for in-laptop data analysis. See the first post here for some context.

Now that the raw CSVs have been ingested into DuckDB, Arrow/Parquet and PostgreSQL, we can start querying the data and see how each of them perform.

Count aggregate

I will start with a basic count aggregate query that you can get using plain SQL as SELECT COUNT(*) AS cnt, col FROM tbl GROUP BY col; or using {dplyr} with dplyr::count(tbl, col).

How to query each backend?

To query data ingested into arrow/parquet files, I will use the {dplyr} interface to Arrow Datasets, which now supports arrow-native aggregation queries (see the release notes for Arrow 6.0). “Arrow no longer eagerly evaluates” and you need to “call compute() or collect() to evaluate the query”. Although I would not expect major differences, let’s also throw a collect()/compute() comparison into the mix to get a sense of the overhead associated with collecting the query results. So, the code to benchmark looks like this:

tbl |>
  dplyr::count(col) |>
  dplyr::collect() # and dplyr::compute()

Where tbl is a pointer to the data obtained by calling arrow::open_dataset(path, format = "arrow") or format = "parquet".
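To make that concrete, here is a minimal sketch of the Arrow side, assuming hypothetical paths and using col as a stand-in for the actual group-by column:

library(arrow)
library(dplyr)

# Hypothetical paths; the actual dataset locations differ
ds_arrow   <- open_dataset("path/to/arrow_dataset",   format = "arrow")
ds_parquet <- open_dataset("path/to/parquet_dataset", format = "parquet")

# Lazily builds the query; nothing is evaluated yet
q <- ds_parquet |>
  count(col)

collect(q)  # evaluates the query and pulls the result into an R tibble
compute(q)  # evaluates the query but keeps the result as an Arrow Table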

To query data ingested into DuckDB and PostgreSQL, I will use plain SQL queries sent over a connection established via the {DBI} package (using the {odbc} driver for PostgreSQL and the native {duckdb} driver). To mimic the collect()/compute() comparison above, let’s use a CREATE TABLE AS statement (which would be similar to compute()) and simply omit it to mimic collect(). So the code to benchmark looks like this:

CREATE TABLE temptbl AS
SELECT col, COUNT(*) AS cnt
FROM table_name
GROUP BY col
;
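For reference, a minimal sketch of how the connections could be set up and the query sent with {DBI} (connection details, table and column names are placeholders):

library(DBI)

# Hypothetical connection details
con_duck <- dbConnect(duckdb::duckdb(), dbdir = "path/to/db.duckdb")
con_pg   <- dbConnect(odbc::odbc(), dsn = "my_postgres_dsn")

# compute()-like: materialize the aggregate as a table in the database
dbExecute(con_duck, "
  CREATE TABLE temptbl AS
  SELECT col, COUNT(*) AS cnt
  FROM table_name
  GROUP BY col
")

# collect()-like: pull the aggregated rows into R
res <- dbGetQuery(con_duck, "
  SELECT col, COUNT(*) AS cnt
  FROM table_name
  GROUP BY col
")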

Thanks to {dbplyr} it’s also straightforward to use exactly the same {dplyr} interface to query DuckDB and PostgreSQL, so I will throw that comparison into the mix as well (although I would not expect major differences, because translating such a simple query from dplyr to SQL should generate equivalent queries and query plans in the RDBMS).
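As a sketch, and reusing the hypothetical connection from above, the database version of the same pipeline would look like this:

library(dplyr)

# {dbplyr} translates the pipeline to SQL and runs it in the database
tbl(con_duck, "table_name") |>
  count(col) |>
  collect()   # or compute() to materialize a temporary table server-side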

Thus, the labels of the expressions benchmarked will follow the pattern backend_interface, where possible values are [pg, duckdb, arrow, parquet]_[sql, dplyr]. Backend combines how the data are stored and who does the heavy-lifting of actually crunching the data.

And there is one additional case. DuckDB supports directly querying parquet files. Here I will use the same SQL syntax as above, but using parquet_scan() in the FROM clause of the query.

CREATE TABLE temptbl AS
SELECT col, COUNT(*) AS cnt
FROM parquet_scan('parquet_file_originally_ingested_by_arrow.parquet')
GROUP BY col
;
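From R, under the same assumptions as above, this could be sent through the existing DuckDB connection:

# DuckDB reads the Arrow-written parquet file directly via parquet_scan()
dbExecute(con_duck, "
  CREATE TABLE temptbl AS
  SELECT col, COUNT(*) AS cnt
  FROM parquet_scan('parquet_file_originally_ingested_by_arrow.parquet')
  GROUP BY col
")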

So in this case the data are stored in the parquet file originally created by Arrow, but the heavy-lifting for the query will be done by DuckDB.¹ The label for this case will be parquet_duckdb_sql.

Columns to group by

I wanted to use group-by columns with different characteristics, just to see if there is any benefit/penalty associated with that. This dataset is not very diverse in terms of columns, so the only differences I could get were in data type and cardinality (number of unique values), as follows: three columns ingested as integer and three columns ingested as text, in each case selecting one column with very low cardinality (2-4 unique values), one with in-between cardinality (1k-3k unique values) and one with higher cardinality (16M unique values).

In addition, I also used one column ingested as TIMESTAMP (because its values in the CSV files are like 2018-07-28 00:00:00.000), even though it is really a DATE column (and has 365 unique values). Of course, that’s something that could have been fixed, but so far I have been using the defaults, including the data type detection. Similarly, the exercise includes no optimization whatsoever of the tables/queries (e.g. no partitions were defined for parquet files, no indexes created in PostgreSQL or DuckDB).

The queries will be run using each of the datasets ingested in the previous post (tiny, small, mid and whole year data).
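Putting the pieces together, a minimal sketch of what one benchmark cell could look like with {bench} (all object names are the placeholders from the snippets above):

library(bench)

sql_count <- "SELECT col, COUNT(*) AS cnt FROM table_name GROUP BY col"

bm <- bench::mark(
  duckdb_sql    = DBI::dbGetQuery(con_duck, sql_count),
  pg_sql        = DBI::dbGetQuery(con_pg, sql_count),
  arrow_dplyr   = ds_arrow   |> dplyr::count(col) |> dplyr::collect(),
  parquet_dplyr = ds_parquet |> dplyr::count(col) |> dplyr::collect(),
  check = FALSE,        # result classes / row order differ across backends
  min_iterations = 5
)

ggplot2::autoplot(bm)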

Who’s good at counting?

In a nutshell, DuckDB!

I ran benchmarks for the different combinations of backend, interface, group-by column and dataset, and in most of them DuckDB completes the query fastest, very closely followed by parquet (Arrow querying a parquet file). Arrow querying an arrow file is several times slower and also seems to be more variable in its performance. DuckDB querying the parquet file is also not as performant. The slowest, as expected, is PostgreSQL.

Below are the plots summarizing the results of the benchmarks (using {bench} and its default autoplot(), with some tweaks). They all tell a similar story, so they are pooled together in the carousel below.

And although DuckDB is most of the time the fastest, there is one notable exception: the TIMESTAMP column. When grouping by that column, DuckDB’s performance degrades substantially and turns out to be several times slower than PostgreSQL.

The plots above show the results for the tiny, small and mid data. Unfortunately, Arrow did not really work for the whole year data, neither querying parquet nor arrow files: it crashed the R session again and again. I am not sure what is going on and did not have the time/was not in the mood to debug that. But it is worth noting that “Grouped aggregation and (especially) joins should be considered somewhat experimental in this release. We expect them to work, but they may not be well optimized for all workloads.” So let’s just move on and run the benchmarks with the larger data using only DuckDB and PostgreSQL.

Results, again, show DuckDB is way faster than Postgres for this kind of query. For some queries it is 5x-7x faster; for others, up to 25x faster. For example, some queries take ~25 seconds to complete with DuckDB, while taking more than 3 minutes in PostgreSQL.

But the issue with the TIMESTAMP column persists and actually becomes super problematic with this larger dataset. PostgreSQL completes that query in about 5 minutes, but DuckDB takes almost 2 hours!!!

Finally, as expected, neither the interface (dplyr/sql) nor collecting the data has an effect on query times that could change the ranking substantially. Those parameters can induce some differences in performance, in particular when the result set is large, but overall, nothing to worry about.

Session Info
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] conflicted_1.0.4  tarchetypes_0.4.0 targets_0.9.0     odbc_1.3.2       
##  [5] arrow_6.0.0.2     duckdb_0.3.0      DBI_1.1.2         ggplot2_3.3.5    
##  [9] tidyr_1.1.4       dplyr_1.0.7       magrittr_2.0.1   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7        compiler_4.1.2    pillar_1.6.4      tools_4.1.2      
##  [5] digest_0.6.29     bit_4.0.4         lifecycle_1.0.1   tibble_3.1.6     
##  [9] gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.12      igraph_1.2.10    
## [13] rstudioapi_0.13   cli_3.1.0         yaml_2.2.1        xfun_0.29        
## [17] fastmap_1.1.0     knitr_1.37        withr_2.4.3       fs_1.5.2         
## [21] generics_0.1.1    vctrs_0.3.8       hms_1.1.1         bit64_4.0.5      
## [25] grid_4.1.2        tidyselect_1.1.1  data.table_1.14.2 glue_1.6.0       
## [29] R6_2.5.1          processx_3.5.2    fansi_0.5.0       pacman_0.5.1     
## [33] purrr_0.3.4       callr_3.7.0       blob_1.2.2        codetools_0.2-18 
## [37] scales_1.1.1      ps_1.6.0          ellipsis_0.3.2    assertthat_0.2.1 
## [41] colorspace_2.0-2  utf8_1.2.2        munsell_0.5.0     cachem_1.0.6     
## [45] crayon_1.4.2

Wrap-up

My take home messages

  • DuckDB is fast in these count aggregate queries.
  • Most of the time, Arrow querying a parquet file is as fast as DuckDB.
  • While Arrow is unbelievably fast to ingest the data in its native format (see previous post), querying that very same data is not that impressive.
  • Arrow could not handle the whole year data.
  • DuckDB querying parquet files works, but is not as performant as querying data ingested into DuckDB. That’s no surprise of course, but the performance penalty is substantial (and perhaps greater than I expected).
  • DuckDB has an issue with TIMESTAMP columns. I wonder if that was also the underlying cause of the performance degradation in ingesting the data documented in the previous post.

Some non-exhaustive TODOs:

  • File an issue in DuckDB about the timestamp thing
  • See if that also happens with DATE
  • See if timestamp can be related to ingesting issues
  • Look closer into the Arrow-DuckDB integration. How does it work under the hood? Can DuckDB also query files in the arrow native format, or only parquet?

  1. I am aware of the DuckDB-Arrow integration (to_duckdb(), to_arrow() and so on). I haven’t looked into it, but I suppose it ultimately does something like creating a view on the parquet file using parquet_scan().
