Whenever I have to analyze largish 1 data on my laptop, I instinctively fire up my good old PostgreSQL server running locally, ingest the data there and breezily wrangle my way through it using SQL. Postgres may not be the fastest alternative out there, but it is feature-rich: you can throw anything at it and it just works. It’s very hard to make it choke with an out-of-memory error and, by default, it runs just fine in resource-constrained environments (e.g. on a Raspberry Pi).
I’ve kept an eye on DuckDB and Apache Arrow as other open-source alternatives for the workflow outlined above. Both share a number of features that make them very appealing for analytical workloads in general, and for in-laptop analytics in particular 2:
- Columnar format and vectorized operations
- In-process libraries (no need to install a server)
- Handle larger-than-memory datasets
- APIs available for several languages, including R and Python
- Open-source
I did try both DuckDB and Arrow some time ago (very quickly, though), but I do not use them routinely. At least not just yet. When I first tried them, Arrow did not support grouped aggregations or joins, and that was a no-go for me. DuckDB did not have support for grouping sets or filtered aggregation (or at least I could not get it to work), and I found it slow at ingesting data from CSV files (much slower than Postgres there, even though queries were indeed faster).
But both projects move at a very fast pace, and now I see release news where Arrow 6.0.0 “adds grouped aggregation and joins in the dplyr interface, on top of the new Arrow C++ query engine” and DuckDB announced “support for GROUPING SETS, ROLLUP, CUBE”. So it is time to take both projects for a spin again (and try to document the test-drive here).
The test-drive
Ultimately, I would like to run an existing data-wrangling pipeline using DuckDB and Arrow as backends, and compare running times against PostgreSQL as the backend. For this, DuckDB is particularly appealing because it uses the PostgreSQL parser and supports a similar feature set. So, ideally, the pipeline could run on DuckDB simply by changing the connection, without having to change the queries. That did not really work out so smoothly, so I will be going step by step: i) ingest the data, ii) try some aggregate queries, iii) joins, …, and let’s see how it goes.
The data
For this test it was key to use real data, meaning a large data set that I actually have to use / have used for analysis on my laptop. Of course, I could simply have used the popular NYC Taxi trip record data or any other large, publicly available data set, but that would have missed the point. I wanted to keep this exercise as close as possible to an actual use case for me. After all, this is just an informal test to see how to integrate DuckDB and/or Arrow into my workflow and to assess the potential benefits, for example, in terms of processing time 3.
I will use a few samples of a larger data set 4 that contains 25 columns (2 numeric, 2 timestamp, 10 integer and the rest text, most of them categorical variables, some with low cardinality (e.g. sex), others with millions of distinct values). I will use one round of data (~486M rows) stored in a csv file (82GB uncompressed), testing first on smaller samples (tiny, small, mid) before trying a whole round of data.
- Tiny data: 1M rows, csv uncompressed size ~170MB
- Small data: 10M rows, csv uncompressed size ~1.7GB
- Mid data: 60M rows, csv uncompressed size ~10GB
- Round data: ~486M rows, csv uncompressed size ~82GB
The code and software
I will be running this test from R, using {targets} to orchestrate the different steps and {bench} to benchmark code snippets and measure execution time.
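Just to give an idea of what each measurement looks like, a {bench} call boils down to something like this (a sketch only; the ingest_*() wrappers are hypothetical stand-ins for the actual functions in the repo):

# Sketch: ingest_arrow()/ingest_duckdb() are hypothetical wrappers around each ingestion step
timings <- bench::mark(
  arrow  = ingest_arrow(src_file, arrow_path),
  duckdb = ingest_duckdb(src_file, duckdb_path),
  iterations = 10, # repeat each ingestion 10 times
  check = FALSE    # the expressions return different objects, so skip result comparison
)

The versions I used: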
- DuckDB version 0.3.0 [I did not pay attention and ran this using 0.3.0, and then I noticed that 0.3.1 is already available, which enables multi-threading by default; so perhaps I should re-run this with 0.3.1 to see if it changes the results]
- Arrow version 6.0.0
- PostgreSQL version 14.1
Initially I will be using DuckDB and Arrow without fiddling with any configuration (no custom memory_limit or threads; just the defaults). In PostgreSQL, however, I typically tweak a couple of configuration parameters, in this case shared_buffers=2GB and work_mem=1GB (see this link for some useful rules of thumb).
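Just for reference, here is a minimal sketch of how one could change those settings from R (paths and values are placeholders; as said, DuckDB ran with the defaults in this exercise):

library(DBI)

# DuckDB: settings can be changed per connection via PRAGMAs
con_duck <- dbConnect(duckdb::duckdb(), dbdir = "test.duckdb")
dbExecute(con_duck, "PRAGMA memory_limit='2GB'")
dbExecute(con_duck, "PRAGMA threads=4")

# PostgreSQL: work_mem can be set per session; shared_buffers lives in
# postgresql.conf and requires a server restart
# dbExecute(con_pg, "SET work_mem = '1GB'")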
The code for the whole exercise is available in this GitHub repo.
A poorman’s laptop
Running this may take a while, so I cannot really tie up my daily-driver laptop. But I do have an old laptop lying around that can be used for this: a 10-year-old Dell XPS 17. I guess at the time it was a good machine (Intel i7-2630QM processor, 8GB DDR3 RAM, SATA SSD), nowadays running Windows 10 Home. But by today’s standards, it can be considered a resource-constrained environment.
So it’s important to keep in mind the kind of machine running this exercise. I am not sure about this, but I would consider the results a lower bound on the performance improvements that DuckDB and Arrow could bring (e.g. Arrow explicitly tries to take advantage of “modern hardware”). It will be interesting to see how much of a difference modern hardware would make. Hopefully soon, I will get my hands on a new fancy MacBook Pro to replicate this test.
Ingesting data
What will be tested
I will compare the time it takes to ingest the raw data from the csv files into each of the backends.
Arrow
In the case of Arrow, let’s try the two key file formats: i) parquet and ii) arrow (aka ipc/feather; see ?arrow::open_dataset). It boils down to using a couple of functions from the {arrow} R package, open_dataset() and write_dataset(), like this:
csv_ds <- arrow::open_dataset(
  sources = src_file,
  format = "text",
  delimiter = ";"
)

arrow::write_dataset(
  dataset = csv_ds,
  path = arrow_path,
  format = "arrow" # and "parquet"
)
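As a quick sanity check after writing, the resulting dataset can be opened lazily and queried with dplyr verbs; a sketch, assuming the same arrow_path and that summarise()/n() are among the aggregations supported by this arrow version:

library(dplyr)

# open the freshly written dataset lazily and count rows without loading it all into memory
ds <- arrow::open_dataset(arrow_path, format = "arrow")

ds %>%
  summarise(n = n()) %>%
  collect()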
DuckDB
For DuckDB, probably the simplest way is to use a CREATE TABLE AS statement with a query reading directly from the csv file using read_csv_auto(). This lets DuckDB detect everything (column names, delimiter, data types, etc.).
CREATE TABLE table_name AS
SELECT *
FROM read_csv_auto('src_file.csv')
;
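Since everything is driven from R, that statement is simply sent through a DBI connection; a minimal sketch (database path, table and file names are placeholders):

library(DBI)

# connect to (or create) a DuckDB database file and run the ingestion statement
con <- dbConnect(duckdb::duckdb(), dbdir = "ingest_test.duckdb")

dbExecute(con, "
  CREATE TABLE table_name AS
  SELECT *
  FROM read_csv_auto('src_file.csv')
")

dbDisconnect(con, shutdown = TRUE)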
An alternative is the COPY statement. You would have to type a bit more, “manually” creating the table and deciding on the data type for each column with something like CREATE TABLE table_name ("col1" TEXT, ...);. After that, just copy the csv into that table:
COPY table_name
FROM 'src_file.csv' ( DELIMITER ';', HEADER )
;
A third alternative described in the documentation (hereafter labelled as copy2) is a slight variation of the COPY statement to let DuckDB detect the format “and omit the otherwise required configuration options.”
COPY table_name
FROM 'src_file.csv' ( AUTO_DETECT TRUE )
;
Let’s compare these three approaches, just to see whether there is any overhead associated with auto-detecting the csv format and data types.
PostgreSQL
For Postgres, let’s also compare the COPY statement approach, which uses a syntax very similar to DuckDB’s:
COPY table_name
FROM 'src_file.csv' (FORMAT 'csv', DELIMITER ';', HEADER )
;
And an alternative using a foreign table with the csv foreign data wrapper (file_fdw):
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER IF NOT EXISTS csv_src FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE table_name_fg (
  col1 TEXT,
  ...
)
SERVER csv_src
OPTIONS (
  FILENAME 'src_file.csv',
  FORMAT 'csv',
  HEADER 'TRUE',
  DELIMITER ';',
  ENCODING 'UTF-8'
);

CREATE TABLE table_name AS SELECT * FROM table_name_fg;
Results: Who is faster ingesting such data?
Well, for this exercise Arrow is fast! Way faster than Postgres, but also faster than DuckDB. Arrow’s native file format is also faster than parquet. The tests using tiny and small data show very similar results: the arrow file format is the fastest, and compared to that it takes around:
- 1.6x longer to ingest into parquet
- 5x longer to ingest into DuckDB
- 15x longer to ingest into PostgreSQL
There are no major differences associated with the approach used to ingest into each backend (so, no major overhead for auto-detecting the file format and data types in DuckDB).
{bench} details for tiny data
expression | min | median | itr/sec | mem_alloc | gc/sec | n_itr | n_gc | total_time | result | memory | time | gc |
---|---|---|---|---|---|---|---|---|---|---|---|---|
arrow | 1.3s | 1.31s | 0.7604814 | NA | 0.4345608 | 7 | 4 | 9.21s | NULL | NULL | 1.836621, 1.467352, 1.328626, 1.314380, 1.420574, 1.326711, 1.312669, 1.297491, 1.298764, 1.326055 | 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 |
parquet | 2.09s | 2.13s | 0.4681122 | NA | 0.0000000 | 10 | 0 | 21.36s | NULL | NULL | 2.272508, 2.125310, 2.138580, 2.101764, 2.098363, 2.136137, 2.157207, 2.126433, 2.114051, 2.092045 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
duckdb | 5.15s | 6.79s | 0.1488356 | NA | 0.0000000 | 10 | 0 | 1.12m | NULL | NULL | 5.153599, 6.783404, 6.725932, 6.822941, 6.752167, 6.830131, 6.721660, 6.902978, 6.793911, 7.701503 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
duckdb_copy | 8.33s | 8.42s | 0.1179548 | NA | 0.0000000 | 10 | 0 | 1.41m | NULL | NULL | 8.416875, 8.519529, 8.361855, 8.623358, 8.420646, 8.392340, 8.334223, 8.698937, 8.412829, 8.597660 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
duckdb_copy2 | 6.47s | 7.15s | 0.1311416 | NA | 0.0145713 | 9 | 1 | 1.14m | NULL | NULL | 8.828867, 8.365383, 8.991698, 8.479016, 6.709015, 6.474999, 6.769448, 7.149739, 6.826154, 6.742778 | 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
pg_foreign | 19.36s | 25.09s | 0.0431928 | NA | 0.0047992 | 9 | 1 | 3.47m | NULL | NULL | 21.76691, 19.67458, 19.53117, 19.52908, 19.36306, 25.09045, 26.06149, 26.48230, 26.44349, 26.19228 | 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
pg_copy | 22.02s | 22.07s | 0.0452669 | NA | 0.0050297 | 9 | 1 | 3.31m | NULL | NULL | 22.01945, 22.23197, 22.09389, 22.03953, 22.18977, 22.06471, 22.11497, 22.01474, 22.03280, 22.13380 | 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
{bench} details for small data
expression | min | median | itr/sec | mem_alloc | gc/sec | n_itr | n_gc | total_time | result | memory | time | gc |
---|---|---|---|---|---|---|---|---|---|---|---|---|
arrow | 12.5s | 16.27s | 0.0658646 | NA | 0.0000000 | 10 | 0 | 2.53m | NULL | NULL | 14.26708, 12.57470, 12.76333, 12.50327, 16.12645, 16.42233, 16.75937, 16.68181, 16.63274, 17.09561 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
parquet | 26.76s | 27.37s | 0.0362535 | NA | 0.0000000 | 10 | 0 | 4.6m | NULL | NULL | 28.99158, 27.96898, 26.95793, 26.76077, 27.31003, 27.87820, 27.42645, 26.93932, 28.47725, 27.12517 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
duckdb | 1.15m | 1.42m | 0.0120938 | NA | 0.0013438 | 9 | 1 | 12.4m | NULL | NULL | 96.78378, 85.15403, 85.53791, 85.18231, 68.82363, 73.90329, 85.23597, 86.54822, 87.04882, 86.74796 | 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
duckdb_copy | 1.19m | 1.23m | 0.0129047 | NA | 0.0000000 | 10 | 0 | 12.91m | NULL | NULL | 86.58523, 87.16100, 71.94344, 71.47411, 72.32618, 72.20082, 73.62987, 73.58680, 75.32237, 90.67873 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
duckdb_copy2 | 1.26m | 1.51m | 0.0116235 | NA | 0.0000000 | 10 | 0 | 14.34m | NULL | NULL | 90.50274, 91.91102, 91.19395, 91.24822, 91.36625, 92.39341, 84.57364, 75.66488, 75.84726, 75.62813 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
pg_foreign | 2.98m | 4m | 0.0045031 | NA | 0.0005003 | 9 | 1 | 33.31m | NULL | NULL | 233.7513, 240.4563, 239.7256, 196.6095, 202.6231, 241.5512, 241.1700, 186.1951, 178.4919, 240.6790 | 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
pg_copy | 2.72m | 3.55m | 0.0049937 | NA | 0.0000000 | 10 | 0 | 33.38m | NULL | NULL | 219.6726, 215.4499, 163.2316, 184.3710, 221.4437, 219.7841, 185.1851, 163.6718, 210.6791, 219.0370 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
Interestingly, when the data get somewhat larger, at some point DuckDB’s performance degrades and falls behind Postgres. While the other relative times remain in the same ballpark (1.6x for parquet, ~12x for Postgres), it now takes almost 25x longer for DuckDB to ingest the data compared to arrow. In absolute numbers, while Arrow takes less than 2 minutes to ingest the data, DuckDB takes about 40 minutes (and Postgres less than 20 minutes).
expression | min | median | total_time | n_itr | n_gc |
---|---|---|---|---|---|
arrow | 1.58m | 1.61m | 13.02m | 8 | 4 |
parquet | 2.85m | 2.98m | 30.54m | 10 | 0 |
duckdb | 37.6m | 39.75m | 6.66h | 10 | 0 |
duckdb_copy | 37.43m | 39.21m | 6.5h | 10 | 0 |
duckdb_copy2 | 38.52m | 39.34m | 5.93h | 9 | 1 |
pg_foreign | 18.77m | 19.26m | 2.91h | 9 | 1 |
pg_copy | 17.64m | 17.89m | 2.75h | 9 | 1 |
{bench} details for mid data
expression | min | median | itr/sec | mem_alloc | gc/sec | n_itr | n_gc | total_time | result | memory | time | gc |
---|---|---|---|---|---|---|---|---|---|---|---|---|
arrow | 1.58m | 1.61m | 0.0102385 | NA | 0.0051192 | 8 | 4 | 13.02m | NULL | NULL | 115.88758, 103.35154, 100.03381, 97.96673, 101.98887, 97.18950, 95.27126, 94.86438, 95.85733, 94.87548 | 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 |
parquet | 2.85m | 2.98m | 0.0054566 | NA | 0.0000000 | 10 | 0 | 30.54m | NULL | NULL | 175.7699, 183.1974, 209.0024, 176.9291, 171.2705, 176.0256, 171.7868, 191.0403, 181.2983, 196.3368 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
duckdb | 37.6m | 39.75m | 0.0004170 | NA | 0.0000000 | 10 | 0 | 6.66h | NULL | NULL | 2497.150, 2255.792, 2297.318, 2384.956, 2308.614, 2379.006, 2589.718, 2469.575, 2384.735, 2415.811 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
duckdb_copy | 37.43m | 39.21m | 0.0004274 | NA | 0.0000000 | 10 | 0 | 6.5h | NULL | NULL | 2354.093, 2389.514, 2415.575, 2351.088, 2245.535, 2364.025, 2281.866, 2282.393, 2388.673, 2321.990 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
duckdb_copy2 | 38.52m | 39.34m | 0.0004216 | NA | 0.0000468 | 9 | 1 | 5.93h | NULL | NULL | 2311.412, 2445.555, 2360.389, 2359.646, 2393.704, 2371.317, 2354.011, 2467.153, 2351.584, 2379.501 | 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
pg_foreign | 18.77m | 19.26m | 0.0008593 | NA | 0.0000955 | 9 | 1 | 2.91h | NULL | NULL | 1249.875, 1142.859, 1145.987, 1125.913, 1171.138, 1182.262, 1204.312, 1131.437, 1214.214, 1155.513 | 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
pg_copy | 17.64m | 17.89m | 0.0009091 | NA | 0.0001010 | 9 | 1 | 2.75h | NULL | NULL | 1062.921, 1107.936, 1058.463, 1080.978, 1109.508, 1061.997, 1071.106, 1073.457, 1138.315, 1216.212 | 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |
The results of ingesting a whole round of data 5 (~486 million rows in a single csv file, 82GB) are mostly consistent with the times shown above for the mid data: DuckDB took more than 5 hours to ingest the whole file, followed by PostgreSQL at a bit less than 3 hours, then parquet at about 1 hour; the fastest is, again, Arrow, taking only a bit more than 16 minutes.
Session Info
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] conflicted_1.0.4 tarchetypes_0.4.0 targets_0.9.0 odbc_1.3.2
## [5] arrow_6.0.0.2 duckdb_0.3.0 DBI_1.1.2 ggplot2_3.3.5
## [9] tidyr_1.1.4 dplyr_1.0.7 magrittr_2.0.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 compiler_4.1.2 pillar_1.6.4 tools_4.1.2
## [5] digest_0.6.29 bit_4.0.4 lifecycle_1.0.1 tibble_3.1.6
## [9] gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.12 igraph_1.2.10
## [13] rstudioapi_0.13 cli_3.1.0 yaml_2.2.1 xfun_0.29
## [17] fastmap_1.1.0 knitr_1.37 withr_2.4.3 fs_1.5.2
## [21] generics_0.1.1 vctrs_0.3.8 hms_1.1.1 bit64_4.0.5
## [25] grid_4.1.2 tidyselect_1.1.1 data.table_1.14.2 glue_1.6.0
## [29] R6_2.5.1 processx_3.5.2 fansi_0.5.0 pacman_0.5.1
## [33] purrr_0.3.4 callr_3.7.0 blob_1.2.2 codetools_0.2-18
## [37] scales_1.1.1 ps_1.6.0 ellipsis_0.3.2 assertthat_0.2.1
## [41] colorspace_2.0-2 utf8_1.2.2 munsell_0.5.0 cachem_1.0.6
## [45] crayon_1.4.2
Wrap-up
My take-home messages:
- Arrow is indeed (unbelievably) fast to parse a csv and ingest data, particularly, into the arrow file format.
- DuckDB looked fine at the beginning, but got slow when the data got larger. I would need to check why that happens (it should be faster than Postgres).
- Despite being fast, Arrow did struggle with encoding.
- I could not quickly find how to tell Arrow to read/ingest just some rows and not the whole csv file. It seems there is only support for skip_rows (before the column names) and skip_rows_after_names, but not something like max_rows. This would be handy to inspect huge files and maybe write and test queries on a small sample before running them on the whole data. In DuckDB, even though read_csv() does not have such an option, you can work around this by using a LIMIT statement while querying the csv file, as sketched after this list (but watch out: if you are trying to avoid invalid lines, that approach can bite you).
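For completeness, here is a sketch of that DuckDB workaround (the file name is a placeholder):

library(DBI)

con <- dbConnect(duckdb::duckdb()) # an in-memory database is enough for a peek

# pull only the first 1000 rows of a huge csv, without ingesting the whole file
sample_rows <- dbGetQuery(con, "
  SELECT *
  FROM read_csv_auto('src_file.csv')
  LIMIT 1000
")

dbDisconnect(con, shutdown = TRUE)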
Some non-exhaustive TODOs:
- Try to find the underlying cause of DuckDB getting slower
- Perhaps fiddle with config options for each backend and ingest method. So far I used only the defaults.
- Take a look at the gains that setting optimal data types would bring
- Compare size on disk
- Compare aggregate queries
- Compare joins
- Sometime, try and run this kind of thing on a not-so-poorman’s laptop
- What about an alter table add column, in a columnar format?
- …
1. Meaning it does not fit comfortably in my laptop’s 32GB of RAM. ↩︎
2. Vaex has also been on my watch-list, and it includes most of these features. But it is Python-centric. ↩︎
3. So this is not really a formal benchmark. Both DuckDB and Arrow have extensive testing and routinely benchmark their software, and this exercise by no means aims to oppose or replace that. ↩︎
4. The whole data comprise 12 csv files, each with 400M-900M rows and an uncompressed size between 50GB and 140GB. ↩︎
5. No bench::mark() here, though, ’cause it would have taken ~4-5 days to run. ↩︎