It looks very promising, especially knowing the DuckDB team is behind it. However, I really don't understand how to insert data into it. Are we supposed to use a DuckDB INSERT statement combined with some function that reads external files, or some other data source? Looks very cool though.
Yes, you can use standard SQL constructs such as INSERT statements and COPY to load data into DuckLake.
(disclaimer: I work at DuckDB Labs)
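For example, something along these lines should work once the extension is installed (the catalog file name, data path and table are made up for illustration; see the DuckLake docs for the exact ATTACH options):

    INSTALL ducklake;
    LOAD ducklake;
    -- the catalog lives in a DuckDB file here; Postgres/SQLite/MySQL catalogs are also supported
    ATTACH 'ducklake:my_catalog.ducklake' AS my_lake (DATA_PATH 'lake_files/');
    -- regular DDL/DML against the attached catalog; DuckLake writes the Parquet files for you
    CREATE TABLE my_lake.events (id INTEGER, ts TIMESTAMP);
    INSERT INTO my_lake.events VALUES (1, '2025-05-28 10:00:00');

The DATA_PATH can also point at blob storage (e.g. an s3:// URL) if you have httpfs and credentials configured.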
Thank you for your work! We use DuckDB with dbt-duckdb in production (because we're on-prem and because we don't need tens of thousands of nodes) and we love it! About the COPY statement: does it mean we can drop Parquet files ourselves into the blob storage? From my understanding, DuckLake was responsible for managing the files on the storage layer.
Great!
> About the COPY statement: does it mean we can drop Parquet files ourselves into the blob storage?
Dropping the Parquet files on the blob storage will not work – you have to COPY them through DuckLake so that the catalog database is updated with the required catalog entries and metadata.
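In other words, if you already have Parquet files produced elsewhere, the way to get them in is to read them with DuckDB and write them into the DuckLake table, so the data passes through the catalog. Roughly (file names and paths are placeholders):

    ATTACH 'ducklake:my_catalog.ducklake' AS my_lake (DATA_PATH 'lake_files/');
    -- load existing Parquet files into a new DuckLake table
    CREATE TABLE my_lake.events AS SELECT * FROM read_parquet('exports/*.parquet');
    -- or append them to an existing table
    COPY my_lake.events FROM 'exports/events_2025_05.parquet' (FORMAT PARQUET);

This rewrites the data under DuckLake's data path rather than referencing your original files in place.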
Ah, drat. I have an application that uses DuckDB to output Parquet files. That application is by necessity disconnected from any sense of a data lake. But I would love to have a good way of then pushing those files up to S3 and integrating them into a data lake. I’ve been looking into Iceberg and I’ve had the thought, “this is great but I hate the idea of what all these little metadata files will do to latency.”
I was also thinking about this use case when reading the announcement. Let's say you already have a bunch of Parquet files (on local FS, HTTPS, S3, ...) that you can assume are immutable (or maybe append-only). It would be great if you could attach them to a DuckLake without copying them! From the design doc, it seems like this should essentially work: you would read those Parquet files to compute the metadata and insert references to them instead of copying them into the storage you manage. Basically, you want to create the catalog independently of the underlying data.
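Purely as a sketch of the kind of call I have in mind (the function name below is hypothetical, it's not something the announcement promises):

    ATTACH 'ducklake:my_catalog.ducklake' AS my_lake (DATA_PATH 's3://my-bucket/lake/');
    -- hypothetical call: read the existing file's footer for schema and column statistics,
    -- then record a reference to it in the catalog instead of rewriting the data
    CALL register_existing_parquet('my_lake', 'events', 's3://my-bucket/raw/events.parquet');

All the information the catalog would need (schema, row counts, min/max column stats) is already in the Parquet footers, so in principle registration could be metadata-only.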