Dowwie 5 days ago

Time series data is naturally difficult to work with, but much of that difficulty is avoidable. One solution is to not query raw time series data files. Instead, segment your time series data before you store it, normalizing the timestamps as part of event processing. Sliding window observations will help you find where an event begins; then you adjust the offset until you find where the series returns to its zero-like baseline. That's your event.
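A minimal sketch of the sliding-window approach described above, assuming a 1-D signal with a near-zero baseline; the window size, threshold, and smoothing choice (a moving average via convolution) are all hypothetical parameters, not anything prescribed by the comment:

```python
import numpy as np

def segment_events(values, window=5, threshold=0.1):
    """Return [start, end) index pairs for detected events.

    An event begins where the sliding-window mean of |values| rises
    above `threshold`, and ends where it returns to its zero-like
    baseline (falls back below the threshold).
    """
    kernel = np.ones(window) / window
    # Sliding-window mean, same length as the input
    smoothed = np.convolve(np.abs(values), kernel, mode="same")
    active = smoothed > threshold

    segments = []
    start = None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i                      # event onset
        elif not flag and start is not None:
            segments.append((start, i))    # returned to baseline
            start = None
    if start is not None:                  # event runs to end of series
        segments.append((start, len(values)))
    return segments
```

Each returned pair is one segment, which can then be stored with normalized timestamps instead of querying the raw series.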

amluto 5 days ago

Segmenting data is exactly what writing it into non-overlapping Parquet files does. My point is that many tools can read a bucket full of these segments, and most of them can handle a scheme where each file corresponds to a single value of a column. But none of them agree on how to efficiently segment the data when each segment contains a range, short of inventing a new column for the purpose, at which point every query has to add complexity to map onto that kind of segmentation.

There’s nothing conceptually or algorithmically difficult about what I want to do. All that’s needed is to encode a range of times into the path of a segment. But Hive didn’t do this, and everyone implemented Hive’s naming scheme, and that’s the status quo now.
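To illustrate the idea, here is a sketch of what range-in-path naming could look like. The path format is entirely hypothetical (no existing tool standardizes it, which is the commenter's complaint); the point is that a reader can prune segments to a query window without opening any files:

```python
from datetime import datetime

# Hypothetical scheme: encode each segment's [start, end) time range
# directly into its file name, e.g.
#   data/20240101T000000_20240101T060000.parquet
FMT = "%Y%m%dT%H%M%S"

def parse_range(path):
    """Recover the (start, end) datetimes encoded in a segment path."""
    stem = path.rsplit("/", 1)[-1].removesuffix(".parquet")
    start_s, end_s = stem.split("_")
    return datetime.strptime(start_s, FMT), datetime.strptime(end_s, FMT)

def files_overlapping(paths, query_start, query_end):
    """Keep only segments whose time range intersects the query window."""
    out = []
    for p in paths:
        s, e = parse_range(p)
        if s < query_end and e > query_start:  # half-open interval overlap
            out.append(p)
    return out
```

Hive-style `key=value` partitioning can only express "this file holds this one value", so achieving the same pruning there requires a synthetic bucket column and extra predicates in every query.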