Maybe I'm misunderstanding something about how ducklake works, but isn't that the purpose of the 'catalog database'? To store the metadata about all the files to optimize the query?
In theory, going off of the schema diagram they have, all your files are listed in `data_file`, the timestamp range for that file would be in `file_column_stats`, and that information could be used to decide what files to _actually_ read based on your query.
Whether duckdb's query engine takes advantage of this is a different story, but even if it doesn't Yet it should be possible to do so Eventually.
Yes, and this is how basically every “lake” thing works. But all the lake solutions add a lot more complexity than just improving the parquet filename scheme, and all of them require that all the readers and all the writers agree on a particular “lake”.
That's fair! I guess I see it as trading technical complexity with the human complexity of getting everyone on board with an update to the standard, and getting that standard implemented across the board. It's a lot easier to get my coworkers to just use duckdb as a reader/writer with ducklake than to change the system.
Frankly, I'm not entirely sure what the process of proposing that change to the hive file scheme would even look like
> Frankly, I'm not entirely sure what the process of proposing that change to the hive file scheme would even look like
Maybe convince DuckDB and/or clickhouse-local and/or polars.scan_parquet to implement it as a pilot? If it's a success, other tools might follow suit.
Or maybe something like DuckLake could have an option to put column statistics in the filenames. I raised this as a discussion:
I'm not super sure about it being in the filename, if only because my understanding is that some of the lakes use it for partitioning and other metadata (metameta-data?).
Imo range is probably the most useful statistic in a folder/file name anyways for partitioning purposes. My vote would be for `^` as the range separator to minimize risk of collision and confusion. i.e. `timestamp=2025-03-27T00:00:00-0800^2025-03-30-0700` or `hour=0^12`,`hour=12^24`. `^` is valid across all systems, and I'd be very surprised if it was commonly used as a property/column name. Only collision I can think of is that its start-of-line in regex
Too late to edit buuut
There's a standard! (for time intervals, and I could see it working here)[0]
> Section 3.2.6 of ISO 8601-1:2019 notes that "A solidus may be replaced by a double hyphen ["--"] by mutual agreement of the communicating partners",
So forget what I said; why exacerbate the standards problem?[1]
[0]https://en.wikipedia.org/wiki/ISO_8601#Time_intervals [1]https://xkcd.com/927/