spenczar5 6 days ago

Even though it's in your own SQL DB, there's still some sort of layout for the metadata. That's the thing that trino/bigquery/whatever won't understand (yet?).

> They solve query parallelism by allowing you to perform computations on the edge, enabling horizontal scaling the compute layer.

Hmm, I don't understand this one. How do you horizontally scale a query that scans all data to do `select count(*), col from huge_table group by col`, for example? In a traditional map-reduce engine, that turns into parallel execution over chunks of data, which get later merged for the final result. In DuckDB, doesn't that necessarily get done by a single node which has to inspect every row all by itself?
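
To make the contrast concrete, here's roughly what a map-reduce engine does under the hood (`chunk_0`/`chunk_1` are hypothetical partitions standing in for splits of `huge_table`):

```sql
-- Map phase: each worker aggregates only its own chunk, in parallel.
CREATE TEMP TABLE partial_0 AS
  SELECT col, count(*) AS cnt FROM chunk_0 GROUP BY col;
CREATE TEMP TABLE partial_1 AS
  SELECT col, count(*) AS cnt FROM chunk_1 GROUP BY col;

-- Reduce phase: merge the small partial results into the final answer.
SELECT col, sum(cnt) AS cnt
FROM (SELECT * FROM partial_0 UNION ALL SELECT * FROM partial_1)
GROUP BY col;
```

With a single DuckDB node there's nowhere to ship those partials from; one machine has to touch every row itself.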

buremba 5 days ago

You're right: a single query can only scale vertically. But given that cloud providers now offer beefy machines with TBs of memory and hundreds of CPUs, that's not a blocker unless you're querying petabyte-scale raw data. Trino/Spark are still unbeatable there, but in most cases you partition the data and rely on predicate pushdown anyway.
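
As a sketch of that pattern in DuckDB (the bucket layout and column names here are made up for illustration), a partition filter plus pushdown means only a sliver of the raw data is actually scanned:

```sql
-- Hypothetical hive-partitioned layout:
--   s3://bucket/events/dt=2024-01-01/part-0.parquet, ...
-- The dt filter prunes whole partitions before any bytes are read,
-- and the user_id predicate is pushed down into the parquet scan.
SELECT count(*)
FROM read_parquet('s3://bucket/events/*/*.parquet', hive_partitioning = true)
WHERE dt = '2024-01-01' AND user_id = 42;
```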

What I was referring to is concurrency: with many concurrent queries you can scale horizontally quite well, since each query is independent and can run on its own node.

nojvek 6 days ago

You're correct that DuckDB doesn't do any multi-node map-reduce. However, DuckDB utilizes all available cores on a node quite effectively to parallelize scanning, and node sizes nowadays go up to 192 vCPUs.

A single node can scan through several gigabytes of data per second, and since the column data is compressed, that works out to billions of rows per second.
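
If you want to see or pin that parallelism, something like this should work in DuckDB (the thread count is just an example; it defaults to the number of cores, and `huge_table` is from the query upthread):

```sql
-- Cap or raise the number of threads DuckDB uses.
SET threads TO 32;

-- Verify the setting took effect.
SELECT current_setting('threads');

-- A full scan like this is then split across all 32 threads.
SELECT count(*), col FROM huge_table GROUP BY col;
```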