szarnyasg 6 days ago

Great!

> About the COPY statement, it means we can drop Parquet files ourselves in the blob storage ?

Dropping the Parquet files into the blob storage yourself will not work: you have to COPY them through DuckLake so that the catalog database is updated with the required catalog and metadata information.
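
Roughly, that looks like the sketch below, assuming the ducklake extension's ATTACH syntax; the catalog path, bucket, and table names are made up:

    INSTALL ducklake;
    LOAD ducklake;
    -- attach a DuckLake catalog; data files are written under DATA_PATH
    ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://my-bucket/lake/');
    -- load an existing Parquet file *through* DuckLake, so the catalog records it
    CREATE TABLE lake.events AS SELECT * FROM 'events.parquet';
    -- or append into an existing DuckLake table
    COPY lake.events FROM 'more_events.parquet';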

jonstewart 5 days ago

Ah, drat. I have an application that uses DuckDB to output parquet files. That application is by necessity disconnected from any sense of a data lake. But, I would love to have a good way of then pushing them up to S3 and integrating into a data lake. I’ve been looking into Iceberg and I’ve had the thought, “this is great but I hate the idea of what all these little metadata files will do to latency.”
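
(For context, the producing side is just vanilla DuckDB writing Parquet, e.g. something like the snippet below; the query and output file name are made up.)

    -- standalone export, with no knowledge of any catalog or data lake
    COPY (SELECT * FROM staging_events)
        TO 'events-2024-06-01.parquet' (FORMAT PARQUET);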

tracnar 4 days ago

I was also thinking about this use case when reading the announcement. Let's say you have a bunch of Parquet files already (on local FS, HTTPS, S3, ...) that you can assume are immutable (or maybe append-only). It would be great if you could attach them to the DuckLake without copying them! From the design doc, it seems this should essentially work: you would read those Parquet files to compute the metadata, and insert references to them into the catalog instead of copying them into the storage you manage. Basically you want to create the catalog independently from the underlying data (see the sketch below).
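
In other words, something like the following hypothetical call; the function name and arguments are entirely invented, and nothing like it is confirmed by the announcement:

    -- purely hypothetical: register an existing, immutable Parquet file in the
    -- catalog without rewriting it; DuckLake would scan it once to collect the
    -- schema and column statistics, then store only a reference to the file
    CALL ducklake_add_data_files('lake', 'events',
        's3://existing-bucket/2024/06/events.parquet');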