
A second look at DuckDB's data ingestion

Cleaning-up seems to be the underlying issue

In the first post of this (already a) series, there was some performance degradation when ingesting larger files into DuckDB. For the larger csv files, DuckDB ended up being ~2x slower than Postgres, even though it was ~4x-5x faster on the somewhat smaller files.

So let’s take a closer look and try a couple of things to diagnose the issue.

  • Read more carefully the docs/source code
  • Try DuckDB’s latest version 0.3.1
  • Try in not-so-poorman’s laptop
  • Is it related to the timestamp issue?
  • Take a look at query plan and profiling info.

Recap

The figure below summarizes the findings of the first post, showing how many seconds it takes to ingest 1 million rows using each backend and data size (tiny, small, mid, year). There is evidently a sizeable performance degradation in DuckDB moving from small data (~10M rows) to mid data (~60M rows). While the performance of Postgres, for example, remains stable at around 20 secs per million rows ingested, in the case of DuckDB it jumps from ~8 secs to more than 40 secs per million rows ingested, resulting in a whopping ~6 hours to ingest ~483M rows (year data).

So, what is going on?

Any clue in the docs?

Taking a closer look at the read_csv and copy docs, I cannot find an evident reason for the slow-down, nor does there seem to be any promising argument to tweak. Of course, choosing optimal data types could help (using, for example, enum, or casting to date columns loaded as timestamp), but I’ll leave that for another time.

Perhaps tweaking DuckDB’s config parameters (see configuration and pragmas) such as threads, memory_limit or checkpoint_threshold / wal_autocheckpoint could help. I am not sure, though, whether the threads argument applies to reading the csv in parallel, or only to the query engine once the data are already in DuckDB.
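
For reference, these knobs can be set via PRAGMA statements before running the ingestion. A minimal sketch (the values below are arbitrary, and I have not verified whether any of them actually makes a difference for the csv reader):

import duckdb
con = duckdb.connect(database = '/backend/duckdb/backend.duckdb')
# number of threads the engine is allowed to use
con.execute("PRAGMA threads=4;")
# cap on DuckDB's memory usage
con.execute("PRAGMA memory_limit='8GB';")
# let the WAL grow larger before it is checkpointed into the database file
con.execute("PRAGMA wal_autocheckpoint='1GB';")
con.close()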

Quickly browsing the source code, the only hint I get is that DuckDB uses chunks of 1024 rows, and that may have something to do with it. But this chunk size is not exposed in read_csv (only sample_size is, and it only applies to the sample used for the automatic detection of delimiter, data types and so on) or in the pragmas.
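
Just for completeness, this is how sample_size is passed (it only changes how much of the file is sampled for type detection, so it should not fix the slow-down; -1 means the whole file):

import duckdb
con = duckdb.connect(database = '/backend/duckdb/backend.duckdb')
drop_res = con.execute("DROP TABLE IF EXISTS tmp;")
# sample the whole file for the automatic detection of delimiter and types
con.execute("""
  CREATE TABLE tmp AS
  SELECT *
  FROM read_csv_auto('D:/tiny_data.txt', sample_size=-1)
  ;
""")
con.close()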

So I am still kind of clueless about what is going on. Let’s try a couple of things.

DuckDB’s latest version and another laptop

Originally, the test in the first post ran using DuckDB 0.3, but 0.3.1 is already available. In addition, it also ran on a very old laptop. So let’s also try on a not-so-old laptop1, using the latest available version of DuckDB. The figure below summarizes the results and shows:

  1. Roughly similar times (on the poorman’s laptop) as in the original exercise, so no major differences due to DuckDB’s version.
  2. Running the thing on a not-so-poor laptop helps; it’s faster overall, and the performance degradation occurs mostly in the year data, and not right away in the mid data as on the poorman’s laptop.
  3. But the performance degradation is still there. So neither old hardware nor an outdated version of DuckDB seems to be the main cause.

Is it the timestamp issue?

In the second and third posts of this series, an issue with timestamp columns came up: count aggregate queries on timestamp columns were way slower than on other data types (and ended up being slower than Postgres as well). So I wondered whether that might have something to do with the performance degradation observed when ingesting the data. To check this, I compared auto-detection, explicitly ingesting those columns as timestamp, and ingesting those columns as varchar, so they are stored like any other text. The figure below shows the results. Long story short: the timestamp issue does not seem to have anything to do with the performance drop when ingesting the data. In fact, ingesting as timestamp actually helps compared to ingesting as text (which is expected, because the timestamp has an underlying numeric representation).
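
For reference, this is roughly how column types can be forced using read_csv instead of read_csv_auto (a sketch; the delimiter, column names and types below are just placeholders for the actual schema):

import duckdb
con = duckdb.connect(database = '/backend/duckdb/backend.duckdb')
drop_res = con.execute("DROP TABLE IF EXISTS tmp;")
# explicit types via read_csv: pass a struct of column name -> type
# (swap 'TIMESTAMP' for 'VARCHAR' to store those columns as plain text)
con.execute("""
  CREATE TABLE tmp AS
  SELECT *
  FROM read_csv('D:/tiny_data.txt',
                delim=',',
                header=True,
                columns={'pickup_datetime': 'TIMESTAMP',
                         'dropoff_datetime': 'TIMESTAMP',
                         'passenger_count': 'INTEGER'})
  ;
""")
con.close()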

Query plan and profiling info

The query execution plan is key to understanding the performance of a SQL query, and the first step when trying to optimize it. EXPLAIN and EXPLAIN ANALYZE are your best friends. For this particular case, though, I did not think it would be of much use: after all, it should be a pretty straightforward plan, a sequential scan of the csv file dumping the data into DuckDB’s data structures. But let’s try anyway.

DuckDB supports the explain statement2.

import duckdb
con = duckdb.connect(database = '/backend/duckdb/backend.duckdb')
drop_res = con.execute("DROP TABLE IF EXISTS tmp;")
# EXPLAIN returns rows of (plan type, plan text); print the textual plan
print(con.execute("""
  EXPLAIN 
  CREATE TABLE tmp AS 
  SELECT * 
  FROM read_csv_auto('D:/tiny_data.txt')
  ;
""").fetchall()[0][1])
## ┌───────────────────────────┐
## │      CREATE_TABLE_AS      │
## └─────────────┬─────────────┘                             
## ┌─────────────┴─────────────┐
## │       READ_CSV_AUTO       │
## └───────────────────────────┘
con.close()

And as expected, nothing really useful here.

For the “run-time profiling” (similar to Postgres’ EXPLAIN ANALYZE), you need to enable it via a PRAGMA statement (PRAGMA enable_profiling;).

import duckdb
con = duckdb.connect(database = '/backend/duckdb/backend.duckdb')
drop_res = con.execute("DROP TABLE IF EXISTS tmp;")
con.execute("""
  PRAGMA enable_profiling;  -- profiling output is printed to the console by default
  CREATE TABLE tmp AS 
  SELECT * 
  FROM read_csv_auto('D:/tiny_data.txt')
  ;
""")
con.close()
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
   PRAGMA enable_profiling;   CREATE TABLE tmp AS    SELECT *    FROM read_csv_auto('D:/tiny_data.txt')   ; 
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 2.00s         ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      CREATE_TABLE_AS      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             1             │
│          (0.82s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       READ_CSV_AUTO       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           999999          │
│          (1.17s)          │
└───────────────────────────┘                             

Ok, that can be interesting. You can see which operator (create_table_as, read_csv_auto) takes more time, and compare the total time measured by the profiler with the time observed so far, i.e. the elapsed time between sending the query and DuckDB returning control to the caller.

Taking a look at the profiling information for the combinations of data/laptop, the read_csv operator takes most of the time. But the interesting part is that the total time reported in the profiling information is always lower than the time experienced by the client. And that seems to explain the performance degradation (see figure below). The time reported by the profiler remains relatively stable as the data size increases, while the time the client actually has to wait before DuckDB returns control, as seen before, skyrockets after some data size threshold.
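
A rough way to see that gap (a sketch; if I am not mistaken, the profiling_output pragma redirects the profiler report to a file instead of the console) is to time the execute() call on the client side and compare it against the Total Time in the report:

import time
import duckdb
con = duckdb.connect(database = '/backend/duckdb/backend.duckdb')
drop_res = con.execute("DROP TABLE IF EXISTS tmp;")
con.execute("PRAGMA enable_profiling;")
# send the profiler report to a file instead of the console
con.execute("PRAGMA profiling_output='/backend/duckdb/profile.log';")
start = time.perf_counter()
con.execute("""
  CREATE TABLE tmp AS
  SELECT *
  FROM read_csv_auto('D:/tiny_data.txt')
  ;
""")
client_elapsed = time.perf_counter() - start
con.close()
# client_elapsed includes everything until control returns to the caller;
# compare it with the 'Total Time' reported in profile.log
print(f"client-side elapsed: {client_elapsed:.2f}s")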

So it seems to be the clean-up work that takes much longer for larger datasets. Apparently DuckDB finishes reading and ingesting the data, goes ahead and writes the profiling file, but then keeps busy cleaning up after itself, removing temporary files3 and so on. For the year data on the poorman’s laptop, that means reading and creating the table takes about 1.5 hours, but the whole process, until DuckDB is done, takes almost 6 hours!

Wrap-up

My take-home messages

  • Clean-up work seems to be the root of the performance degradation. That may involve removing temp files or moving data from the WAL to the final DuckDB file. I wonder if there is an equivalent in DuckDB to UNLOGGED tables in Postgres.

Some non-exhaustive TODOs:

  • Set optimal data types for all columns (including using ENUM)
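
As a pointer for that TODO, something along these lines should work (an illustrative sketch; the column name and category values are made up, and ENUM support may require a newer DuckDB version):

import duckdb
con = duckdb.connect(database = '/backend/duckdb/backend.duckdb')
# create an ENUM type for a low-cardinality text column (values are made up)
con.execute("CREATE TYPE payment_type_enum AS ENUM ('cash', 'card', 'other');")
# then cast that column while copying from the ingested table
con.execute("""
  CREATE TABLE trips_typed AS
  SELECT * REPLACE (CAST(payment_type AS payment_type_enum) AS payment_type)
  FROM tmp
  ;
""")
con.close()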

  1. Not a high-end device, though. i7-7500U, 32GB DDR4 and a somewhat faster SSD drive (but still SATA). Also, it does not have much disk space available, so I had to put the raw data on a USB drive. ↩︎

  2. I have been running all this from R, mostly using the {DBI} package to interact with DuckDB. But “by default, profiling information is printed to the console” and it seems {DBI} does not capture that at all. You can also send the profiling info to a file, but it seems just easier to use Python here. ↩︎

  3. In one of those lengthy runs, the laptop restarted half-way through and there were almost 1 million temp files (256 KB in size). Just deleting them takes quite some time. That’s why I think that’s what’s keeping DuckDB busy. ↩︎
