Jarwain 3 days ago

That's fair! I guess I see it as trading technical complexity for the human complexity of getting everyone on board with an update to the standard, and then getting that standard implemented across the board. It's a lot easier to get my coworkers to just use DuckDB as a reader/writer with DuckLake than to change the system.

Frankly, I'm not entirely sure what the process of proposing that change to the hive file scheme would even look like

amluto 3 days ago

> Frankly, I'm not entirely sure what the process of proposing that change to the hive file scheme would even look like

Maybe convince DuckDB and/or clickhouse-local and/or polars.scan_parquet to implement it as a pilot? If it's a success, other tools might follow suit.

Or maybe something like DuckLake could have an option to put column statistics in the filenames. I raised this as a discussion:

https://github.com/duckdb/ducklake/discussions/92

Jarwain 3 days ago

I'm not super sure about it being in the filename, if only because my understanding is that some of the lakes use it for partitioning and other metadata (metameta-data?).

Imo range is probably the most useful statistic in a folder/file name anyway, for partitioning purposes. My vote would be for `^` as the range separator to minimize risk of collision and confusion, e.g. `timestamp=2025-03-27T00:00:00-0800^2025-03-30-0700` or `hour=0^12`, `hour=12^24`. `^` is valid across all systems, and I'd be very surprised if it was commonly used as a property/column name. The only collision I can think of is that it's the start-of-line anchor in regex.
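
To make the `^` idea concrete, here's a rough Python sketch of how a reader might parse a range-valued partition segment and use it to skip files. Everything in it is hypothetical: `RangePartition`, `parse_range_partition`, and `overlaps` are made-up names, no existing tool implements this, and treating the range as half-open (`low` inclusive, `high` exclusive) is just my assumption based on the `hour=0^12`, `hour=12^24` example.

```python
from typing import NamedTuple


class RangePartition(NamedTuple):
    """Hypothetical parsed form of a `column=low^high` path segment."""
    column: str
    low: str
    high: str


def parse_range_partition(segment: str, sep: str = "^") -> RangePartition:
    """Split `column=low^high` into its three parts (values stay as strings)."""
    column, eq, value = segment.partition("=")
    low, found, high = value.partition(sep)
    if not eq or not found:
        raise ValueError(f"not a range partition segment: {segment!r}")
    return RangePartition(column, low, high)


def overlaps(part: RangePartition, query_low: int, query_high: int) -> bool:
    """Assumes integer values and half-open ranges: [low, high)."""
    return int(part.low) < query_high and query_low < int(part.high)


# Only the second directory can contain rows with 13 <= hour < 14.
parts = [parse_range_partition(s) for s in ["hour=0^12", "hour=12^24"]]
candidates = [p for p in parts if overlaps(p, 13, 14)]
```

The point is just that a reader could prune directories from the path alone, without opening any Parquet footers.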

Jarwain 2 days ago

Too late to edit buuut

There's a standard! (for time intervals, and I could see it working here)[0]

> Section 3.2.6 of ISO 8601-1:2019 notes that "A solidus may be replaced by a double hyphen ["--"] by mutual agreement of the communicating partners",

So forget what I said; why exacerbate the standards problem?[1]

[0] https://en.wikipedia.org/wiki/ISO_8601#Time_intervals

[1] https://xkcd.com/927/
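
For what it's worth, the double-hyphen form is easy to parse. A minimal sketch, assuming a hive-style `column=start--end` segment with both endpoints in extended ISO 8601 format; `parse_interval_partition` is a made-up name, and the segment layout is my assumption rather than anything an existing tool reads or writes. It also assumes the `:` characters are acceptable in your paths, which isn't true on every filesystem.

```python
from datetime import datetime
from typing import Tuple


def parse_interval_partition(segment: str) -> Tuple[str, datetime, datetime]:
    """Parse a hypothetical `column=start--end` segment into (column, start, end)."""
    column, eq, value = segment.partition("=")
    # ISO 8601 dates only contain single hyphens, so the first "--" is the separator.
    start_text, sep, end_text = value.partition("--")
    if not eq or not sep:
        raise ValueError(f"not an interval partition segment: {segment!r}")
    return column, datetime.fromisoformat(start_text), datetime.fromisoformat(end_text)


column, start, end = parse_interval_partition(
    "timestamp=2025-03-27T00:00:00-08:00--2025-03-30T00:00:00-07:00"
)
```

Both endpoints come back as timezone-aware `datetime` objects, so they can be compared directly even though the offsets differ.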