[HN Gopher] Time-Series Compression Algorithms
___________________________________________________________________

Time-Series Compression Algorithms

Author : todsacerdoti
Score  : 114 points
Date   : 2022-05-14 16:08 UTC (6 hours ago)

(HTM) web link (www.timescale.com)
(TXT) w3m dump (www.timescale.com)

  | [deleted]

  | infogulch wrote:
  | I found TerminusDB's white paper on graph compression interesting,
  | and I think it could be useful for time-series compression:
  |
  | Succinct Data Structures and Delta Encoding for Modern Databases
  | - https://assets.terminusdb.com/research/succinct-data-structu...

  | sujayakar wrote:
  | Navarro's book Compact Data Structures is an excellent survey of
  | this literature!
  | https://www.cambridge.org/core/books/compact-data-structures...
  |
  | I love the perspective of computing the information-theoretic
  | lower bound for a data structure and then trying to achieve it
  | with as little overhead as possible while still supporting
  | efficient operations on the structure.

  | naturalplasmoid wrote:
  | One notable omission from this piece is a technique to compress
  | integer time series with both positive and _negative_ values.
  |
  | If you naively apply bit-packing using the Simple8b algorithm,
  | you'll find that negative integers are not compressed. This is
  | due to how signed integers are represented in modern computers:
  | negative integers have their most significant bit set [1].
  |
  | Zigzag encoding is a neat transform that circumvents this issue.
  | It works by mapping signed integers to unsigned integers so that
  | numbers with a small absolute value can be encoded using a small
  | number of bits. Put another way, it encodes negative numbers
  | using the least significant bit for the sign. [2]
  |
  | If you're looking for a quick way to experiment with various
  | time-series compression algorithms, I highly recommend Daniel
  | Lemire's FastPFor repository [3] (as linked in the article). I've
  | used the Python bindings [4] to quickly evaluate various
  | compression algorithms with great success.
  |
  | Finally, I'd like to humbly mention my own tiny contribution [5],
  | an adaptation of Lemire's C++ Simple8b implementation (including
  | basic methods for delta and zigzag encoding/decoding).
  |
  | I used C++ templates to make the encoding and decoding routines
  | generic over integer bit-width, which expands support up to
  | 64-bit integers and offers efficient usage with smaller integers
  | (e.g. 16-bit). I made a couple of other minor tweaks, including
  | support for arrays up to 2^64 in length, and adjusted the
  | API/method signatures so they can be used in a more functional
  | style. This implementation is slightly simpler to invoke via FFI,
  | and I intend to add examples showing how to compile it for use
  | from JS (WebAssembly), Python, and C#. I threw my code up quickly
  | in order to share it with you all; hopefully someone finds it
  | useful. I intend to expand the usage examples/test cases/etc.,
  | and am looking forward to any comments or contributions.
  |
  | [1] https://en.wikipedia.org/wiki/Signed_number_representation
  |
  | [2] https://en.wikipedia.org/wiki/Variable-length_quantity#Zigza...
  |
  | [3] https://github.com/lemire/FastPFor
  |
  | [4] https://github.com/searchivarius/PyFastPFor
  |
  | [5] https://github.com/naturalplasmoid/simple8b-timeseries-compr...

  | jart wrote:
  | I love zigzag encoding. The way it uses shift and xor is
  | beautiful. It gives it a performance edge of 3 cycles on my CPU
  | versus signed-LEB encoding.
  |
  | One thing I did once, to encode Unicode lookup tables in a
  | smaller form (to reduce Python binary size): I found it can be
  | very profitable to (1) delta encode, then (2) zigzag encode, and
  | finally (3) apply deflate. For example:
  |
  |       _PyUnicode_PhrasebookOffset2: size is 178,176 bytes
  |       _PyUnicode_PhrasebookOffset2: packed size is 100,224 bytes
  |       _PyUnicode_PhrasebookOffset2: rle size is 282,216 bytes
  |       _PyUnicode_PhrasebookOffset2: deflate size is 52,200 bytes
  |       _PyUnicode_PhrasebookOffset2: bz2 size is 76,876 bytes
  |       _PyUnicode_PhrasebookOffset2: dleb size is 47,198 bytes
  |       _PyUnicode_PhrasebookOffset2: dzd size is 12,748 bytes
  |
  | It's so cool.
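A minimal sketch of the delta + zigzag + deflate pipeline described in the
comments above, assuming plain Python 3 with the standard struct and zlib
modules; the offsets array is invented for illustration and is not one of
the actual CPython tables:

    import struct, zlib

    def zigzag(n):
        # Map signed -> unsigned: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
        return (n << 1) if n >= 0 else ((-n) << 1) - 1

    def delta_zigzag_deflate(values):
        # (1) delta encode, (2) zigzag encode, (3) deflate the packed bytes.
        deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
        zz = [zigzag(d) for d in deltas]
        packed = struct.pack("<%dI" % len(zz), *zz)  # fixed 32-bit words, for simplicity
        return zlib.compress(packed, 9)

    # Invented example: a slowly growing offsets table with occasional dips,
    # so the deltas are small and sometimes negative.
    offsets = [i * 3 + (i % 5) for i in range(10000)]
    raw = struct.pack("<%dI" % len(offsets), *offsets)
    print(len(zlib.compress(raw, 9)), "->", len(delta_zigzag_deflate(offsets)))

The idea is that delta encoding turns a large, slowly varying series into
small (possibly negative) numbers, zigzag keeps those small numbers small as
unsigned values, and deflate then sees long runs of near-identical bytes.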
  | spullara wrote:
  | For Wavefront I got this down to about 1.5 bytes per metric
  | reported (on average across customers), where each metric was a
  | tuple of:
  |
  | name, source, timestamp (s), value (int) and a set of name/value
  | pairs
  |
  | Key to optimizing it was having a database that supported
  | different compression strategies based on the data.

  | londons_explore wrote:
  | That actually sounds pretty bad to me, when many metrics will be
  | constant or very rarely changing, or perfectly correlated with
  | another metric.
  |
  | Were you arithmetic coding stuff, or still doing bit-twiddling
  | tricks?

  | spullara wrote:
  | If you think that sounds bad, I would love to see what you think
  | is good; show your work and then they can just add it as one of
  | the options. It does basically all the tricks in the article and
  | then some. This is a mean, not a median. The median is likely
  | much lower, but no one cares about the median when calculating
  | how much space you need.
  |
  | RE: correlated with another metric. That would be an extremely
  | expensive one to calculate. You need an algorithm that can
  | operate while ingesting millions of points per second, 24/7.

  | awild wrote:
  | Question for people who've implemented these kinds of compression
  | schemes: some of these mechanisms require local context (RLE and
  | Simple8b), so how do you handle random access in these cases? Is
  | the metadata embedded as a sort of keyframe you have to advance
  | to before decoding the value, or stored as a separate block in
  | the headers?
  |
  | RLE especially sounds like it could degenerate into a binary
  | search per lookup, maybe aided by caching the bounds of the last
  | run and assuming locality and linear-ish usage?
  |
  | This might sound obvious, but it wasn't mentioned in the article:
  | you can also apply integer-based compression schemes to
  | dictionary-encoded data.
  |
  | And floating-point values don't always need those 64 bits when
  | the use case and input data don't require that level of
  | precision.

  | nu11ptr wrote:
  | I was thinking the same thing as I was reading this. I doubt you
  | could retain 100% random access. Instead, I think you would
  | generally create "blocks" that are compressed over a time range
  | that is compatible with your application (i.e. 1-day blocks or
  | 1-week blocks, etc.), and then when picking dates/times you take
  | X blocks and further refine the query after decompression
  | (example: load the May 5th block --> May 5th 8:10AM-11:13AM). At
  | least, that is what I have done in the past in my own home-grown
  | apps. Essentially, each block starts and terminates compression;
  | the number of blocks is a trade-off between compression
  | efficiency and granularity.
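A rough sketch of the block-per-time-range layout described in the comment
above, assuming Python with zlib and JSON standing in for a real columnar
codec; the BlockStore class, block size, and sample data are invented for
illustration:

    import bisect, json, zlib

    BLOCK_POINTS = 1024  # points per compressed block; purely a tuning knob

    class BlockStore:
        """Append-only store keeping one compressed block per time range."""

        def __init__(self):
            self.start_ts = []  # first timestamp of each block, ascending
            self.blocks = []    # compressed lists of (timestamp, value) pairs

        def append_block(self, points):
            # points: list of (timestamp, value) pairs, ascending by timestamp
            self.start_ts.append(points[0][0])
            self.blocks.append(zlib.compress(json.dumps(points).encode()))

        def query(self, t0, t1):
            # Locate the first block that could contain t0, then scan forward,
            # decompressing only blocks that overlap [t0, t1] and refining
            # point by point inside each block.
            i = max(bisect.bisect_right(self.start_ts, t0) - 1, 0)
            out = []
            while i < len(self.blocks) and self.start_ts[i] <= t1:
                for ts, value in json.loads(zlib.decompress(self.blocks[i])):
                    if t0 <= ts <= t1:
                        out.append((ts, value))
                i += 1
            return out

    store = BlockStore()
    store.append_block([(t, float(t % 7)) for t in range(0, BLOCK_POINTS)])
    store.append_block([(t, float(t % 7)) for t in range(BLOCK_POINTS, 2 * BLOCK_POINTS)])
    print(len(store.query(1000, 1100)))  # 101 points, spanning both blocks

A range lookup then costs one binary search plus decompression of only the
blocks that overlap the query window; the block size trades compression
ratio against that decompression granularity.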
  | arinlen wrote:
  | > (...) how do you handle random access in these cases?
  |
  | Given we're discussing time series, random access is something
  | that's not required at all. There is no use case where you're
  | looking for a single, specific data point of a time series. Time
  | series are used to store and retrieve batches of data points in
  | order to process and/or analyze/visualize them in batches. Time
  | series queries consist mainly of "get me all the data points of a
  | time series reported between timestamps A and B".

  | [deleted]

  | londons_explore wrote:
  | Most applications don't require true random access.
  |
  | Instead you're typically reading a range of data, and then you
  | can decompress just the blocks required for the data you want to
  | see.
  |
  | Caching of partial queries can also help substantially. For
  | example, if many queries involve taking the max() of some
  | per-second data grouped by minute, it is well worth caching that
  | rather than reading the source data every time to calculate the
  | max().
  |
  | Typically the query engine can keep counts of every subquery, how
  | frequently it's used, and how many data points it involves, to
  | decide how long to cache it for. As far as I'm aware no
  | open-source TSDB does this, despite it being a massive, simple
  | win, especially for alerting systems and dashboards that run very
  | similar queries frequently.

  | anonymoushn wrote:
  | Are there similar algorithms useful for compressing stuff like "a
  | list of the top 100 scores in a game"? We already store these as
  | diffs of the list, but we could probably do better than zstd-ing
  | the list-diffs.

  | pizza wrote:
  | You could look into multiset sequence compression methods, where
  | the multiset elements are the sequences max_score(i, t) over time
  | t, one for each rank i: https://arxiv.org/abs/1401.6410

  | tyingq wrote:
  | This discussion about a Barclays Bank JSON list of 74,000 numbers
  | might have some ideas:
  | https://news.ycombinator.com/item?id=28326036
  |
  | I was able to get it from a 1.3 MB JSON file (uncompressed) to a
  | 151 KB uncompressed / 4 KB compressed file using mostly deltas.
  | https://news.ycombinator.com/item?id=28348965

  | anonymoushn wrote:
  | This is great! I'll have to think about how to combine "storing
  | deltas within the list" and "storing deltas of the list", but
  | there should be some significant gains available.
___________________________________________________________________
(page generated 2022-05-14 23:00 UTC)