
Test-driving DuckDB and Arrow in a poorman's laptop

Arrow is incredibly fast at ingesting data into its native format

Whenever I have to analyze largish 1 data on my laptop, I instinctively fire up my good old PostgreSQL server running locally, ingest the data there and breezily wrangle my way through it using SQL. Postgres may not be the fastest alternative out there, but it is feature-rich: you can throw anything at it, and it just works. It is very hard to make it choke with an out-of-memory error and, by default, it runs just fine in resource-constrained environments (e.g. on a Raspberry Pi).

I’ve kept an eye on DuckDB and Apache Arrow as other open-source alternatives for the workflow outlined above. Both share a number of features that make them very appealing for analytical workloads in general, and for in-laptop analytics in particular 2:

  • Columnar format and vectorized operations
  • In-process libraries (no need to install a server)
  • Handle larger-than-memory datasets
  • APIs available for several languages, including R and Python
  • Open-source

I did try both DuckDB and Arrow some time ago (very quickly, though), but I do not use them routinely. At least not just yet. When I first tried them, Arrow did not support grouped aggregations or joins, and that was a no-go for me. DuckDB did not support grouping sets or filtered aggregation (or at least I could not get it to work), and I found it slow at ingesting data from CSV files (much slower than Postgres at that, even though queries were indeed faster).

But both projects move at a very fast pace, and now I see release news where Arrow 6.0.0 “adds grouped aggregation and joins in the dplyr interface, on top of the new Arrow C++ query engine” and DuckDB announces “support for GROUPING SETS, ROLLUP, CUBE”. So it is time to take both projects for a spin again (and to document the test-drive here).

The test-drive

Ultimately, I would like to run an existing data wrangling pipeline using DuckDB and Arrow as backends, and compare running times against PostgreSQL as the backend. For this, DuckDB is particularly appealing because it uses the PostgreSQL parser and supports a similar feature set. So, ideally, the pipeline could run on DuckDB simply by changing the connection, without having to change the queries. That did not really work out so smoothly, so I will proceed step by step: i) ingest the data, ii) try some aggregate queries, iii) joins, …, and let’s see how it goes.

The data

For this test it was key to use real data, meaning a large data set that I actually have to use (or have used) for analysis on my laptop. Of course I could have simply used the popular NYC Taxi trip record data or any other large, publicly available data set, but that would have missed the point. I wanted to keep this exercise as close as possible to an actual use case of mine. After all, this is just an informal test to see how to integrate DuckDB and/or Arrow into my workflow and to assess the potential benefits, for example, in terms of processing time 3.

I will use a few samples of a larger data set 4 that contains 25 columns (2 numeric, 2 timestamp, 10 integer and the rest text; most of them categorical variables, some with low cardinality (e.g. sex), others with millions of distinct values). I will use one round of data (~486M rows) stored in a csv file (82GB uncompressed), testing first on smaller samples (tiny, small, mid) before trying a whole round of data:

  • Tiny data: 1M rows, csv uncompressed size ~170MB
  • Small data: 10M rows, csv uncompressed size ~1.7GB
  • Mid data: 60M rows, csv uncompressed size ~10GB
  • Round data: ~486M rows, csv uncompressed size ~82GB

The code and software

I will be running this test from R, using {targets} to orchestrate the different steps and {bench} to benchmark code snippets and measure execution time.

  • DuckDB version 0.3.0 [I did not pay attention and ran this with 0.3.0, and then noticed 0.3.1 is already available, which enables multi-threading by default; so perhaps I should re-run this with 0.3.1 to see whether it changes the results]
  • Arrow version 6.0.0
  • PostgreSQL version 14.1

Initially I will be using DuckDB and Arrow without fiddling with any configuration (no custom memory_limit or threads; just the defaults). In PostgreSQL, however, I typically tweak a couple of configuration parameters, in this case, shared_buffers=2GB and work_mem=1GB (see this link for some useful rules-of-thumb).
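Just for reference, tweaking those knobs from R would look roughly like this (a minimal sketch, assuming DBI connections named con_duck and con_pg, which are placeholders; shared_buffers itself has to be set in postgresql.conf and requires a server restart):

  # DuckDB: defaults are used in this exercise, but limits could be set like this
  DBI::dbExecute(con_duck, "PRAGMA memory_limit='2GB'")
  DBI::dbExecute(con_duck, "PRAGMA threads=4")
  
  # Arrow: number of CPU threads used by the C++ engine
  arrow::set_cpu_count(4)
  
  # PostgreSQL: work_mem can be changed per session;
  # shared_buffers goes in postgresql.conf (restart needed)
  DBI::dbExecute(con_pg, "SET work_mem = '1GB'")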

The code for the whole exercise is available in this GitHub repo.

A poorman’s laptop

Running this may take a while, so I cannot really use my daily-driver laptop. But I do have an old laptop lying around that can be used for this: a 10-year-old Dell XPS 17. I guess at the time it was a good machine (Intel i7-2630QM processor, 8GB DDR3 RAM, SATA SSD), nowadays running Windows 10 Home. But by today’s standards, it can be considered a resource-constrained environment.

So it’s important to keep in mind the kind of machine running this exercise. I am not entirely sure about this, but I would consider the results a lower bound on the performance improvements that DuckDB and Arrow could bring (e.g. Arrow explicitly tries to take advantage of “modern hardware”). It will be interesting to see how much of a difference modern hardware makes. Hopefully soon, I will get my hands on a fancy new MacBook Pro to replicate this test.

Ingesting data

What will be tested

I will compare the time it takes to ingest the raw data from the csv files into each of the backends.
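Just to fix ideas, each ingestion approach gets wrapped in a function and handed to bench::mark(), along these lines (a minimal sketch; the ingest_*() helpers and the connection objects are hypothetical placeholders, not the actual code in the repo):

  # Time each ingestion approach over several iterations with {bench}
  # (ingest_*() and the con_* connections are hypothetical placeholders)
  timings <- bench::mark(
    arrow   = ingest_arrow(src_file, arrow_path),
    parquet = ingest_parquet(src_file, parquet_path),
    duckdb  = ingest_duckdb(con_duck, src_file),
    pg_copy = ingest_postgres(con_pg, src_file),
    iterations = 10,
    check = FALSE # each call returns a different object; only the timing matters
  )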

Arrow

In the case of Arrow, let’s try the two key file formats: i) parquet and ii) arrow (aka IPC/feather; see ?arrow::open_dataset). It boils down to using a couple of functions from the {arrow} R package, open_dataset() and write_dataset(), like this:

  # Expose the csv file as a (lazy) Arrow dataset
  csv_ds <- arrow::open_dataset(
    sources = src_file,
    format = "text",
    delimiter = ";"
  )
  
  # Write it out in Arrow's native file format (or parquet)
  arrow::write_dataset(
    dataset = csv_ds,
    path = arrow_path,
    format = "arrow" # and "parquet"
  )
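Once written, the resulting dataset can be opened lazily and queried with dplyr verbs without loading it all into memory, e.g. (a quick sketch; the column name and value are placeholders):

  # Open the ingested arrow dataset lazily and pull only what is needed
  library(dplyr)
  ds <- arrow::open_dataset(arrow_path, format = "arrow")
  ds %>%
    filter(some_column == "some_value") %>% # placeholder column and value
    collect()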

DuckDB

For DuckDB, probably the simplest way is to use a CREATE TABLE AS statement with a query reading directly from the csv file using read_csv_auto(). This lets DuckDB detect everything (column names, delimiter, data types, etc.).

CREATE TABLE table_name AS
SELECT * 
FROM read_csv_auto('src_file.csv')
;

An alternative is the COPY statement. You would have to type more to “manually” create the table and decide on the data type for each column, doing something like CREATE TABLE table_name ("col1" TEXT, ...);, and after that just copy the csv into that table:

COPY table_name
FROM 'src_file.csv' ( DELIMITER ';', HEADER )
;

A third alternative described in the documentation (hereafter labelled as copy2) is a slight variation of the COPY statement to let DuckDB detect the format “and omit the otherwise required configuration options.”

COPY table_name
FROM 'src_file.csv' ( AUTO_DETECT TRUE )
;

Let’s compare these three approaches, just to see whether there is any overhead associated with auto-detecting the csv format and data types.
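For the record, all of these statements are run from R through {DBI} and the {duckdb} driver, roughly like this (a sketch; the database and file paths are placeholders):

  library(DBI)
  # Persistent DuckDB database stored in a file (paths are placeholders)
  con_duck <- dbConnect(duckdb::duckdb(), dbdir = "test.duckdb")
  dbExecute(con_duck, "
    CREATE TABLE table_name AS
    SELECT * FROM read_csv_auto('src_file.csv')
  ")
  dbDisconnect(con_duck, shutdown = TRUE)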

PostgreSQL

For Postgres, let’s also compare the COPY statement approach, which uses a very similar syntax to DuckDB’s:

COPY table_name
FROM 'src_file.csv' (FORMAT 'csv', DELIMITER ';', HEADER )
;

And an alternative using a foreign table with the csv foreign data wrapper (file_fdw):

CREATE EXTENSION IF NOT EXISTS file_fdw; 
CREATE SERVER IF NOT EXISTS csv_src FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE table_name_fg (
  col1 TEXT,
  ...
)
SERVER csv_src
OPTIONS (
  FILENAME 'src_file.csv',
  FORMAT 'csv',
  HEADER 'TRUE',
  DELIMITER ';',
  ENCODING 'UTF-8'
)
;

CREATE TABLE table_name AS SELECT * FROM table_name_fg;
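The Postgres statements are likewise executed from R via {DBI} (a sketch, assuming an existing connection con_pg, e.g. through {odbc}; note that COPY ... FROM 'file' reads from the server’s filesystem, so the path must be visible to the PostgreSQL server process):

  # Run the server-side COPY through an existing DBI connection (con_pg)
  DBI::dbExecute(con_pg, "
    COPY table_name
    FROM 'src_file.csv' (FORMAT 'csv', DELIMITER ';', HEADER)
  ")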

Results: Who is faster ingesting such data?

Well, for this exercise Arrow is fast: way faster than Postgres, but also faster than DuckDB. Arrow’s native file format is also faster than parquet. The tests using tiny and small data show very similar results, with the arrow file format being the fastest; compared to that, it takes around:

  • 1.6x longer to ingest into parquet
  • 5x longer to ingest into DuckDB
  • 15x longer to ingest into PostgreSQL

There are no major differences associated with the approach used to ingest into each backend (so, no major overhead for auto-detecting the file format and data types in DuckDB).

{bench} details for tiny data
expression     min      median   itr/sec    mem_alloc  gc/sec     n_itr  n_gc  total_time
arrow          1.3s     1.31s    0.7604814  NA         0.4345608      7     4       9.21s
parquet        2.09s    2.13s    0.4681122  NA         0.0000000     10     0      21.36s
duckdb         5.15s    6.79s    0.1488356  NA         0.0000000     10     0       1.12m
duckdb_copy    8.33s    8.42s    0.1179548  NA         0.0000000     10     0       1.41m
duckdb_copy2   6.47s    7.15s    0.1311416  NA         0.0145713      9     1       1.14m
pg_foreign    19.36s   25.09s    0.0431928  NA         0.0047992      9     1       3.47m
pg_copy       22.02s   22.07s    0.0452669  NA         0.0050297      9     1       3.31m
{bench} details for small data
expression     min      median   itr/sec    mem_alloc  gc/sec     n_itr  n_gc  total_time
arrow         12.5s    16.27s    0.0658646  NA         0.0000000     10     0       2.53m
parquet      26.76s    27.37s    0.0362535  NA         0.0000000     10     0        4.6m
duckdb        1.15m     1.42m    0.0120938  NA         0.0013438      9     1       12.4m
duckdb_copy   1.19m     1.23m    0.0129047  NA         0.0000000     10     0      12.91m
duckdb_copy2  1.26m     1.51m    0.0116235  NA         0.0000000     10     0      14.34m
pg_foreign    2.98m        4m    0.0045031  NA         0.0005003      9     1      33.31m
pg_copy       2.72m     3.55m    0.0049937  NA         0.0000000     10     0      33.38m

Interestingly, when the data get somewhat larger, at some point DuckDB’s performance degrades and falls behind Postgres. While the other relative times remain in the same ballpark (1.6x for parquet, ~12x for Postgres), it now takes almost 25x longer for DuckDB to ingest the data, compared to arrow. In absolute numbers, while Arrow takes less than 2 minutes to ingest the data, DuckDB takes about 40 minutes (and Postgres less than 20 minutes).

expression min median total_time n_itr n_gc
arrow 1.58m 1.61m 13.02m 8 4
parquet 2.85m 2.98m 30.54m 10 0
duckdb 37.6m 39.75m 6.66h 10 0
duckdb_copy 37.43m 39.21m 6.5h 10 0
duckdb_copy2 38.52m 39.34m 5.93h 9 1
pg_foreign 18.77m 19.26m 2.91h 9 1
pg_copy 17.64m 17.89m 2.75h 9 1
{bench} details for mid data
expression      min      median   itr/sec    mem_alloc  gc/sec     n_itr  n_gc  total_time
arrow          1.58m     1.61m    0.0102385  NA         0.0051192      8     4      13.02m
parquet        2.85m     2.98m    0.0054566  NA         0.0000000     10     0      30.54m
duckdb         37.6m    39.75m    0.0004170  NA         0.0000000     10     0       6.66h
duckdb_copy   37.43m    39.21m    0.0004274  NA         0.0000000     10     0        6.5h
duckdb_copy2  38.52m    39.34m    0.0004216  NA         0.0000468      9     1       5.93h
pg_foreign    18.77m    19.26m    0.0008593  NA         0.0000955      9     1       2.91h
pg_copy       17.64m    17.89m    0.0009091  NA         0.0001010      9     1       2.75h

The results of ingesting a whole round of data 5 (~486 million rows in a single csv file, 82GB) are mostly consistent with the times shown above for mid data: DuckDB took more than 5 hours to ingest the whole file, PostgreSQL a bit less than 3 hours, parquet about 1 hour, and the fastest was, again, the arrow format, taking only a bit more than 16 minutes.


Session Info
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] conflicted_1.0.4  tarchetypes_0.4.0 targets_0.9.0     odbc_1.3.2       
##  [5] arrow_6.0.0.2     duckdb_0.3.0      DBI_1.1.2         ggplot2_3.3.5    
##  [9] tidyr_1.1.4       dplyr_1.0.7       magrittr_2.0.1   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7        compiler_4.1.2    pillar_1.6.4      tools_4.1.2      
##  [5] digest_0.6.29     bit_4.0.4         lifecycle_1.0.1   tibble_3.1.6     
##  [9] gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.12      igraph_1.2.10    
## [13] rstudioapi_0.13   cli_3.1.0         yaml_2.2.1        xfun_0.29        
## [17] fastmap_1.1.0     knitr_1.37        withr_2.4.3       fs_1.5.2         
## [21] generics_0.1.1    vctrs_0.3.8       hms_1.1.1         bit64_4.0.5      
## [25] grid_4.1.2        tidyselect_1.1.1  data.table_1.14.2 glue_1.6.0       
## [29] R6_2.5.1          processx_3.5.2    fansi_0.5.0       pacman_0.5.1     
## [33] purrr_0.3.4       callr_3.7.0       blob_1.2.2        codetools_0.2-18 
## [37] scales_1.1.1      ps_1.6.0          ellipsis_0.3.2    assertthat_0.2.1 
## [41] colorspace_2.0-2  utf8_1.2.2        munsell_0.5.0     cachem_1.0.6     
## [45] crayon_1.4.2

Wrap-up

My take-home messages

  • Arrow is indeed (unbelievably) fast at parsing a csv and ingesting the data, particularly into the arrow file format.
  • DuckDB looked fine at the beginning, but got slow when the data got larger. I would need to check why that happens (it should be faster than Postgres).
  • Despite being fast, Arrow did struggle with encoding.
  • I could not quickly find how to tell arrow to read/ingest just some rows rather than the whole csv file. It seems there is only support for skip_rows before the column names and skip_rows_after_names, but nothing like a max_rows. This would be handy for inspecting huge files and maybe writing and testing queries on a small sample before running them on the whole data. In DuckDB, even though read_csv() does not have such an option, you can work around it with a LIMIT clause while querying the csv file, as sketched right after this list (but watch out: if you are trying to avoid invalid lines, that approach can bite you).
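As a rough illustration of that DuckDB workaround (a sketch; the file name, row count and connection object are placeholders):

  # Peek at the first rows of a huge csv without ingesting the whole file
  sample_rows <- DBI::dbGetQuery(con_duck, "
    SELECT * FROM read_csv_auto('src_file.csv') LIMIT 1000
  ")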

Some non-exhaustive TODOs:

  • Try to find the underlying cause of DuckDB getting slower
  • Perhaps fiddle with the config options for each backend and ingestion method. So far I have used only the defaults.
  • Take a look at the gains that setting optimal data types would bring
  • Compare size on disk
  • Compare aggregate queries
  • Compare joins
  • Sometime, try and run this kind of thing on a not-so-poorman’s laptop
  • What about an ALTER TABLE ... ADD COLUMN in a columnar format?

  1. meaning it does not fit comfortably in my laptop’s 32GB of RAM. ↩︎

  2. Vaex has also been on my watch-list, and it includes most of these features. But it is Python-centric. ↩︎

  3. So this is not really a formal benchmark. Both DuckDB and Arrow have extensive testing and routinely benchmark their software, and this exercise by no means aims to oppose or replace that. ↩︎

  4. The whole data comprise 12 csv files, each with 400M - 900M rows and uncompressed size between 50GB and 140GB. ↩︎

  5. No bench::mark() here, though, ’cause it would have taken ~4-5 days to run. ↩︎
