There is a lot to like here, but once metadata is in the novel DuckLake format, it's hard to picture how you get good query parallelism, which you need for large datasets. Iceberg is already well supported by lots of heavy-duty query engines, and that support matters once you have lots and lots of data.
You don't need to store the metadata in DuckDB; it can live in your own PostgreSQL/MySQL, similar to the Iceberg REST Catalog. They solve query parallelism by allowing you to perform computations on the edge, enabling horizontal scaling of the compute layer.
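As a rough sketch, this is what a Postgres-backed DuckLake catalog looks like from the DuckDB side. The `ducklake` extension's ATTACH string, the Postgres DSN, and the bucket path below are illustrative and may not match the exact released syntax:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")

# Metadata lives in your own PostgreSQL; the actual Parquet data files live
# in object storage. Both the DSN and the bucket path are placeholders.
con.sql("""
    ATTACH 'ducklake:postgres:dbname=ducklake_meta host=pg.internal' AS lake
        (DATA_PATH 's3://my-bucket/lake/')
""")

# From here on, 'lake' behaves like a regular DuckDB catalog.
con.sql("CREATE TABLE lake.events (id BIGINT, col VARCHAR)")
```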
They don't focus on solving the scalability problem in the metadata layer; you might need to scale your PostgreSQL independently once you have many DuckDB compute nodes running on the edge.
Even though it's in your own SQL DB, there's still some sort of layout for the metadata. That's the thing that Trino/BigQuery/whatever won't understand (yet?).
> They solve query parallelism by allowing you to perform computations on the edge, enabling horizontal scaling of the compute layer.
Hmm, I don't understand this one. How do you horizontally scale a query that scans all data to do `select count(*), col from huge_table group by col`, for example? In a traditional map-reduce engine, that turns into parallel execution over chunks of data, which are later merged for the final result. In DuckDB, doesn't that necessarily get done by a single node that has to inspect every row all by itself?
You're right, you can only scale vertically for single-query execution, but given that cloud providers now offer beefy machines with TBs of memory and hundreds of CPUs, it's not a blocker unless you are querying petabyte-scale raw data. Trino/Spark are still unbeatable there, but in most cases you partition the data and use predicate pushdown anyway.
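For illustration, here's roughly what "partition the data and use predicate pushdown" looks like with plain DuckDB over Parquet; the directory layout, table, and column names are made up:

```python
import duckdb

con = duckdb.connect()

# Hive-style partitioned layout, e.g. data/events/dt=2024-01-01/part-0.parquet
# The filter on the partition column `dt` lets DuckDB skip whole directories,
# and the filter on `col` is pushed down into the Parquet scan, so row groups
# whose min/max statistics can't match are never read.
rows = con.sql("""
    SELECT col, count(*) AS n
    FROM read_parquet('data/events/*/*.parquet', hive_partitioning = true)
    WHERE dt BETWEEN '2024-01-01' AND '2024-01-31'
      AND col IS NOT NULL
    GROUP BY col
""").fetchall()
```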
You can scale horizontally pretty well if you have many concurrent queries; that's what I was referring to.
You're correct that DuckDB doesn't do any multi-node map-reduce. However, DuckDB utilizes all available cores on a node quite effectively to parallelize scanning, and node sizes nowadays go up to 192 vCPUs.
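To make that concrete, a small sketch of the intra-node parallelism: DuckDB splits the scan and aggregation across threads and merges the per-thread partial results, which is essentially the map-reduce pattern inside one process (the file glob is a placeholder):

```python
import duckdb

con = duckdb.connect()

# DuckDB uses all available cores by default; set explicitly here for clarity.
con.sql("SET threads TO 32")

# The GROUP BY from upthread: each worker thread scans a subset of row groups,
# builds a partial aggregate, and the partials are merged for the final result.
con.sql("""
    SELECT col, count(*)
    FROM read_parquet('huge_table/*.parquet')
    GROUP BY col
""").show()
```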
A single node can scan through several gigabytes of data per second. With the column data compressed by various algorithms, that means billions of rows per second.
Someone correct me if I'm wrong, but from my understanding DuckDB will always be the query engine, so you will have access to DuckDB's query parallelism (single-node, but multithreaded with disk spilling etc.) plus the statistics-based optimizations offered by DuckLake, like file pruning and predicate pushdown. I think DuckLake is heavily coupled to DuckDB (which is good for our use case). Again, this is my understanding; correct me if wrong.
It seems to me that, by publishing the spec, other non-DuckDB implementations could be built?
It's currently DuckDB-specific only because the initial implementation that supports this new catalog is a DuckDB extension.
From my perspective the issue is analytics support. You'll need a step that turns it into something supported by BI tools. Obviously, if something like Trino picks up the format, it's not an issue.
DuckDB has support for ODBC/ADBC and even JDBC (so Trino should be easy, https://trino.io/docs/current/connector/duckdb.html).
I agree with everything you said. I just mean that a single node may be slow when processing those Parquet files in a complex aggregation, bottlenecked on network I/O, CPU, or available memory.
If the thesis here is that most datasets are small, fair enough - but then why use a lake instead of a big Postgres, y'know?
That's the part I don't really get. In the Manifesto they talk about scaling to hundreds of terabytes and thousands of compute nodes. But DuckDB compute nodes, however performant, are in the end single nodes, so even if your lakehouse contains TBs of data, you are limited by your biggest client's capacity (I know DuckDB works well with larger-than-memory data, but I suppose it still hits limits at some point). In the end, I think DuckLake is aimed at lakehouses of "reasonable" size, the same way DuckDB is intended for data of "reasonable" size.
Huge "it depends", but typically organizations are not querying all of their data at once. Usually, they're processing it in some time-based increments.
Even if it's in the TB-range, we're at the point where high-spec laptops can handle it (my own benchmarking: https://ibis-project.org/posts/1tbc/). When I tried to go up to 10TB TPC-H queries on large cloud VMs I did hit some malloc (or other memory) issues, but that was a while ago and I imagine DuckDB can fly past that these days too. Single-node definitely has limits, but it's hard to see how 99%+ of organizations really need distributed computing in 2025.
You can run a fleet of DuckDB instances and process data in a partitioned way.
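A hand-wavy sketch of that pattern, with one DuckDB instance per partition and a plain process pool standing in for whatever orchestrator you'd actually use (paths and schema are invented):

```python
import duckdb
from concurrent.futures import ProcessPoolExecutor

# One partition per worker, e.g. a month of data each.
PARTITIONS = [f"2024-{m:02d}" for m in range(1, 13)]

def process_partition(month: str) -> list[tuple]:
    # Each worker is an independent DuckDB instance scanning only its slice.
    con = duckdb.connect()
    return con.sql(f"""
        SELECT col, count(*) AS n
        FROM read_parquet('data/events/dt={month}-*/*.parquet')
        GROUP BY col
    """).fetchall()

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(process_partition, PARTITIONS))

    # Merge the per-partition partial counts, map-reduce style.
    totals: dict[str, int] = {}
    for rows in partials:
        for col, n in rows:
            totals[col] = totals.get(col, 0) + n
    print(totals)
```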
Yes, there must be some use cases where you need all the data loaded up and addressable seamlessly across a cluster, but those are rare and typically FAANG-class problems.