Item 44109784

bitbang • 5 days ago

Why is the footer metadata not sufficient for this need? The metadata should contain the min and max timestamp values for the column of interest, so when executing a query, the query tool can read the metadata and decide whether that Parquet file needs to be read at all for the time range in the query.

amluto • 5 days ago

Because the footer metadata is in the Parquet file, which is already far too late to give an efficient query.

If I have an S3 bucket containing five years' worth of Parquet files, each covering a few days' worth of rows, and I tell my favorite query tool (DuckDB, etc.) about that bucket, then the tool will need to do a partial read (which is multiple operations, I think, since it will need to find the footer and then read the footer) of ~500 files just to find out which ones contain the data of interest. A good query plan would be to do a single list operation on the bucket to find the file names and then to read only the file or files needed to answer my query.

Iceberg and Delta Lake (I think -- I haven't actually tried it) can do this, but plain Parquet plus Hive partitioning can't, and I'm not aware of any other lightweight scheme that is well supported that can do it. My personal little query tool (which predates Parquet) can do it just fine by the simple expedient of reading directory names.
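A minimal sketch of that plan, assuming a hypothetical naming scheme where each object key ends in `<start>_<end>.parquet` with ISO dates and a half-open range. The keys and the `key_range` / `select_keys` helpers are made up for illustration, and no real S3 or DuckDB API is involved:

```python
from datetime import date

# Hypothetical result of a single "list objects" call on the bucket.
# The <start>_<end>.parquet naming scheme is an assumption for illustration.
listed_keys = [
    "events/2024-01-01_2024-01-04.parquet",
    "events/2024-01-04_2024-01-07.parquet",
    "events/2024-01-07_2024-01-10.parquet",
]


def key_range(key: str) -> tuple[date, date]:
    """Parse the half-open [start, end) date range out of an object key."""
    stem = key.rsplit("/", 1)[-1].removesuffix(".parquet")
    start, end = stem.split("_")
    return date.fromisoformat(start), date.fromisoformat(end)


def select_keys(keys: list[str], query_start: date, query_end: date) -> list[str]:
    """Return the keys whose range overlaps [query_start, query_end)."""
    selected = []
    for key in keys:
        start, end = key_range(key)
        # Two half-open ranges overlap when each starts before the other ends.
        if start < query_end and query_start < end:
            selected.append(key)
    return selected


# Only these keys would need to be opened; no footer reads are needed to decide.
to_read = select_keys(listed_keys, date(2024, 1, 5), date(2024, 1, 6))
```

The pruning decision here uses nothing but the listing, which is the whole point: the footers of the skipped files are never fetched.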

1 reply

Jarwain • 5 days ago

Maybe I'm misunderstanding something about how ducklake works, but isn't that the purpose of the 'catalog database'? To store the metadata about all the files to optimize the query?

In theory, going off of the schema diagram they have, all your files are listed in `data_file`, the timestamp range for that file would be in `file_column_stats`, and that information could be used to decide what files to _actually_ read based on your query.

Whether DuckDB's query engine takes advantage of this is a different story, but even if it doesn't Yet, it should be possible to do so Eventually.
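To make the idea concrete, here is a rough sketch using SQLite as a stand-in catalog. The two tables borrow the names `data_file` and `file_column_stats` from the diagram, but their columns (`path`, `column_name`, `min_value`, `max_value`) are simplified assumptions, not the real DuckLake schema, and `files_for_range` is a hypothetical helper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified, hypothetical stand-ins for the catalog tables;
    -- the real schema has more columns and may differ.
    CREATE TABLE data_file (data_file_id INTEGER PRIMARY KEY, path TEXT);
    CREATE TABLE file_column_stats (
        data_file_id INTEGER,
        column_name  TEXT,
        min_value    TEXT,
        max_value    TEXT
    );
    INSERT INTO data_file VALUES
        (1, 'events/a.parquet'),
        (2, 'events/b.parquet'),
        (3, 'events/c.parquet');
    INSERT INTO file_column_stats VALUES
        (1, 'ts', '2024-01-01T00:00:00', '2024-01-03T23:59:59'),
        (2, 'ts', '2024-01-04T00:00:00', '2024-01-06T23:59:59'),
        (3, 'ts', '2024-01-07T00:00:00', '2024-01-09T23:59:59');
    """
)


def files_for_range(column: str, lo: str, hi: str) -> list[str]:
    """Return paths of files whose [min, max] for `column` overlaps [lo, hi]."""
    rows = conn.execute(
        """
        SELECT f.path
        FROM data_file AS f
        JOIN file_column_stats AS s USING (data_file_id)
        WHERE s.column_name = ? AND s.max_value >= ? AND s.min_value <= ?
        ORDER BY f.path
        """,
        (column, lo, hi),
    ).fetchall()
    return [path for (path,) in rows]


# Same-format ISO timestamps compare correctly as strings.
paths = files_for_range("ts", "2024-01-05T00:00:00", "2024-01-05T12:00:00")
```

One catalog query replaces hundreds of footer reads, at the cost of every writer having to keep the stats table up to date.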

1 reply

amluto • 4 days ago

Yes, and this is how basically every “lake” thing works. But all the lake solutions add a lot more complexity than just improving the parquet filename scheme, and all of them require that all the readers and all the writers agree on a particular “lake”.

1 reply

Jarwain • 3 days ago

That's fair! I guess I see it as trading off technical complexity against the human complexity of getting everyone on board with an update to the standard, and getting that standard implemented across the board. It's a lot easier to get my coworkers to just use DuckDB as a reader/writer with DuckLake than to change the system.

Frankly, I'm not entirely sure what the process of proposing that change to the hive file scheme would even look like

1 reply

amluto • 3 days ago

> Frankly, I'm not entirely sure what the process of proposing that change to the hive file scheme would even look like

Maybe convince DuckDB and/or clickhouse-local and/or polars.scan_parquet to implement it as a pilot? If it's a success, other tools might follow suit.

Or maybe something like DuckLake could have an option to put column statistics in the filenames. I raised this as a discussion:

https://github.com/duckdb/ducklake/discussions/92

1 reply

Jarwain • 3 days ago

I'm not super sure about it being in the filename, if only because my understanding is that some of the lakes use it for partitioning and other metadata (metameta-data?).

Imo range is probably the most useful statistic in a folder/file name anyways for partitioning purposes. My vote would be for `^` as the range separator to minimize risk of collision and confusion, e.g. `timestamp=2025-03-27T00:00:00-0800^2025-03-30-0700` or `hour=0^12`, `hour=12^24`. `^` is valid across all systems, and I'd be very surprised if it were commonly used as a property/column name. The only collision I can think of is that it's start-of-line in regex.
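A quick sketch of what reading such a segment could look like. `parse_range_segment` and `find_range` are hypothetical helpers, and treating `hour=0^12` as the half-open range [0, 12) is my assumption, not something any tool defines today:

```python
from typing import Optional


def parse_range_segment(segment: str, sep: str = "^") -> tuple[str, str, str]:
    """Split a 'column=low^high' path segment into (column, low, high)."""
    column, _, value = segment.partition("=")
    low, found, high = value.partition(sep)
    if not column or not found or not low or not high:
        raise ValueError(f"not a range segment: {segment!r}")
    return column, low, high


def find_range(path: str, column: str) -> Optional[tuple[str, str]]:
    """Return the (low, high) strings for `column` in a path, if present."""
    for segment in path.split("/"):
        if "=" in segment and "^" in segment:
            name, low, high = parse_range_segment(segment)
            if name == column:
                return low, high
    return None


paths = [
    "data/hour=0^12/part-0.parquet",
    "data/hour=12^24/part-0.parquet",
]

# Keep only the paths whose [low, high) hour range contains hour 13.
wanted_hour = 13
matching = []
for p in paths:
    bounds = find_range(p, "hour")
    if bounds is not None and int(bounds[0]) <= wanted_hour < int(bounds[1]):
        matching.append(p)
```

The bounds come back as strings on purpose: interpreting them (integers, timestamps with offsets, and so on) depends on the column type, which the path alone doesn't tell you.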

1 reply

Jarwain • 2 days ago

Too late to edit buuut

There's a standard! (for time intervals, and I could see it working here)[0]

> Section 3.2.6 of ISO 8601-1:2019 notes that "A solidus may be replaced by a double hyphen ["--"] by mutual agreement of the communicating partners",

So forget what I said; why exacerbate the standards problem?[1]

[0] https://en.wikipedia.org/wiki/ISO_8601#Time_intervals

[1] https://xkcd.com/927/

dugmartin • 5 days ago

This can also be done using row group metadata within the Parquet file. The row group metadata can include the min/max range of values for ordered columns, so you can "partition" on timestamps without having to have a file per time range.

2 replies

amluto • 5 days ago

But I want a file per range! I’m already writing out an entire chunk of rows, and that chunk is a good size for a Parquet file, and that chunk doesn’t overlap the previous chunk.

Sure, metadata in the Parquet file handles this, but a query planner has to read that metadata, whereas a sensible way to stick the metadata in the file path would allow avoiding reading the file at all.

1 reply

mrlongroots • 5 days ago

I have the same gripe. You want a canonical standard that's like "hive partitioning" but defines the range [val1, val2) as column=val1_val2. It's a trivial addition on top of Parquet.
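On the writer side it could be as small as this. A sketch only: `range_dir` and `overlaps` are hypothetical names, and the compact timestamp format is my choice so that `_` can't appear inside a value:

```python
from datetime import datetime


def range_dir(column: str, start: datetime, end: datetime) -> str:
    """Name a directory for the half-open range [start, end)."""
    if not start < end:
        raise ValueError("start must be before end")
    # Compact format with no '_' inside, so the separator stays unambiguous.
    fmt = "%Y%m%dT%H%M%S"
    return f"{column}={start.strftime(fmt)}_{end.strftime(fmt)}"


def overlaps(a: tuple[datetime, datetime], b: tuple[datetime, datetime]) -> bool:
    """True if half-open ranges a and b share at least one instant."""
    return a[0] < b[1] and b[0] < a[1]


# Two back-to-back chunks: the end of one is the start of the next.
chunk_a = (datetime(2024, 1, 1), datetime(2024, 1, 4))
chunk_b = (datetime(2024, 1, 4), datetime(2024, 1, 7))

dirs = [range_dir("ts", *chunk) for chunk in (chunk_a, chunk_b)]
adjacent_overlap = overlaps(chunk_a, chunk_b)
```

Because the ranges are half-open, adjacent chunks can share a boundary value in their names without either one claiming the other's rows.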

1 reply

amluto • 5 days ago

That would do the trick, as would any other spelling of the same thing.

simlevesque • 5 days ago

I wish we had more control of the row group metadata when writing Parquet files with DuckDB.