Because the footer metadata lives inside the Parquet file itself, which is already far too late for an efficient query.
If I have an S3 bucket containing five years' worth of Parquet files, each covering a few days' worth of rows, and I tell my favorite query tool (DuckDB, etc.) about that bucket, then the tool will need to do a partial read of ~500 files (which is multiple operations per file, I think, since it first has to locate the footer and then read it) just to find out which ones contain the data of interest. A good query plan would be a single list operation on the bucket to find the file names, followed by reading only the file or files needed to answer my query.
Iceberg and Delta Lake (I think -- I haven't actually tried them) can do this, but plain Parquet plus Hive partitioning can't, and I'm not aware of any other well-supported lightweight scheme that can. My personal little query tool (which predates Parquet) does it just fine by the simple expedient of reading directory names.
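To make that concrete, here's a minimal sketch of the directory-name trick, assuming a made-up layout where each directory name encodes the date range of the file inside it (the `--` separator, the `data` root, and `rows.parquet` are all hypothetical):

```python
from datetime import date
from pathlib import Path

ROOT = Path("data")  # hypothetical root: one subdirectory per file

def overlaps(dirname: str, lo: date, hi: date) -> bool:
    """True if the range encoded as 'START--END' overlaps [lo, hi]."""
    start_s, end_s = dirname.split("--")
    start, end = date.fromisoformat(start_s), date.fromisoformat(end_s)
    return start <= hi and lo <= end

# One directory listing replaces ~500 footer reads; only files whose
# names overlap the query range ever get opened.
wanted = [d / "rows.parquet"
          for d in ROOT.iterdir()
          if d.is_dir() and overlaps(d.name, date(2022, 6, 1), date(2022, 6, 30))]
```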
Maybe I'm misunderstanding something about how DuckLake works, but isn't that the purpose of the 'catalog database'? To store metadata about all the files in order to optimize queries?
In theory, going off the schema diagram they have, all your files are listed in `data_file`, the timestamp range for each file would be in `file_column_stats`, and that information could be used to decide which files to _actually_ read based on your query.
Whether DuckDB's query engine takes advantage of this is a different story, but even if it doesn't _yet_, it should be possible to do so _eventually_.
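As a sketch of the kind of pruning the catalog enables, assuming hypothetical column names (I'm only going off the diagram, so the real tables almost certainly differ in detail):

```python
import duckdb

con = duckdb.connect("catalog.db")  # the DuckLake catalog database

# path, data_file_id, column_name, min_value, max_value are all guesses
# based on the schema diagram, not the actual DuckLake schema.
files = con.execute("""
    SELECT f.path
    FROM data_file AS f
    JOIN file_column_stats AS s USING (data_file_id)
    WHERE s.column_name = 'ts'
      AND s.min_value <= '2025-03-30'
      AND s.max_value >= '2025-03-27'
""").fetchall()
# Only the files whose stats overlap the query range would then be
# fetched from object storage.
```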
Yes, and this is how basically every “lake” thing works. But all the lake solutions add a lot more complexity than just improving the Parquet filename scheme would, and all of them require that all the readers and all the writers agree on a particular “lake”.
That's fair! I guess I see it as trading technical complexity for the human complexity of getting everyone on board with an update to the standard, and getting that standard implemented across the board. It's a lot easier to get my coworkers to just use DuckDB as a reader/writer with DuckLake than to change the system.
Frankly, I'm not entirely sure what the process of proposing that change to the Hive file scheme would even look like.
> Frankly, I'm not entirely sure what the process of proposing that change to the Hive file scheme would even look like.
Maybe convince DuckDB and/or clickhouse-local and/or polars.scan_parquet to implement it as a pilot? If it's a success, other tools might follow suit.
Or maybe something like DuckLake could have an option to put column statistics in the filenames. I raised this as a discussion:
I'm not super sure about it being in the filename, if only because my understanding is that some of the lakes use it for partitioning and other metadata (meta-metadata?).
IMO a range is probably the most useful statistic in a folder/file name anyway for partitioning purposes. My vote would be for `^` as the range separator, to minimize the risk of collision and confusion, i.e. `timestamp=2025-03-27T00:00:00-0800^2025-03-30-0700`, or `hour=0^12`, `hour=12^24`. `^` is valid across all common filesystems, and I'd be very surprised if it were commonly used as a property/column name. The only collision I can think of is that it's the start-of-line anchor in regex.
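For illustration, parsing that proposed syntax is trivial (the function name here is made up):

```python
def parse_range_partition(segment: str) -> tuple[str, str, str]:
    """Split a path segment like 'hour=0^12' into (key, lo, hi)."""
    key, _, value = segment.partition("=")
    lo, _, hi = value.partition("^")
    return key, lo, hi

print(parse_range_partition("hour=0^12"))
# ('hour', '0', '12')
print(parse_range_partition("timestamp=2025-03-27T00:00:00-0800^2025-03-30-0700"))
# ('timestamp', '2025-03-27T00:00:00-0800', '2025-03-30-0700')
```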
Too late to edit buuut
There's a standard! (for time intervals, and I could see it working here)[0]
> Section 3.2.6 of ISO 8601-1:2019 notes that "A solidus may be replaced by a double hyphen ["--"] by mutual agreement of the communicating partners".
So forget what I said; why exacerbate the standards problem?[1]
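Concretely, a path segment using that notation might look something like this (purely illustrative):

```python
# ISO 8601-1:2019 §3.2.6 allows "--" in place of the solidus as the
# interval separator, and "--" is filesystem-safe where "/" is not.
segment = "date=2025-03-27--2025-03-30"
key, interval = segment.split("=")
start, end = interval.split("--")  # ('2025-03-27', '2025-03-30')
```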
[0] https://en.wikipedia.org/wiki/ISO_8601#Time_intervals
[1] https://xkcd.com/927/