I agree with everything you said. I just mean that a single node may be slow when processing those parquet files in a complex aggregation, bottlenecked on network IO or CPU or available memory.
If the thesis here is that most datasets are small, fair enough - but then why use a lake instead of a big Postgres, y'know?
That's the part I don't really get. In the Manifesto they talk about scaling to hundreds of terabytes and thousands of compute nodes. But DuckDB compute nodes, however performant they are, are still single nodes, so even if your lakehouse holds terabytes of data, you're limited by the capacity of your biggest client machine (I know DuckDB handles larger-than-memory data well, but I suppose it still hits a limit at some point). In the end, I think DuckLake is aimed at lakehouses of "reasonable" size the same way DuckDB is intended for data of "reasonable" size.
Huge "it depends", but typically organizations are not querying all of their data at once. Usually, they're processing it in some time-based increments.
Even if it's in the TB-range, we're at the point where high-spec laptops can handle it (my own benchmarking: https://ibis-project.org/posts/1tbc/). When I tried to go up to 10TB TPC-H queries on large cloud VMs I did hit some malloc (or other memory) issues, but that was a while ago and I imagine DuckDB can fly past that these days too. Single-node definitely has limits, but it's hard to see how 99%+ of organizations really need distributed computing in 2025.
You can run a fleet of DuckDB instances and process data in a partitioned way.
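Right - something like this sketch, where each worker is its own in-process DuckDB handling one partition (the partition layout and column names are hypothetical):

    import duckdb
    from multiprocessing import Pool

    MONTHS = [f"2025-{m:02d}" for m in range(1, 13)]  # one partition per worker task

    def process_partition(month: str):
        # Each worker is an independent single-node DuckDB instance.
        con = duckdb.connect()
        path = f"lake/events/month={month}/*.parquet"  # hypothetical layout
        return con.execute(
            f"SELECT '{month}' AS month, count(*) AS events, sum(amount) AS total "
            f"FROM read_parquet('{path}')"
        ).fetchone()

    if __name__ == "__main__":
        with Pool(processes=4) as pool:
            results = pool.map(process_partition, MONTHS)
        for row in results:
            print(row)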
Yes, there must be some use cases where you need all the data loaded up and addressable seamlessly across a cluster, but those are rare and typically FAANG-class problems.