[HN Gopher] Time-Series Compression Algorithms
       ___________________________________________________________________
        
       Time-Series Compression Algorithms
        
       Author : todsacerdoti
       Score  : 114 points
       Date   : 2022-05-14 16:08 UTC (6 hours ago)
        
 (HTM) web link (www.timescale.com)
 (TXT) w3m dump (www.timescale.com)
        
       | infogulch wrote:
        | I found TerminusDB's white paper on graph compression
        | interesting, and I think it could be useful for time-series
        | compression:
       | 
       | Succinct Data Structures and Delta Encoding for Modern Databases
       | - https://assets.terminusdb.com/research/succinct-data-structu...
        
         | sujayakar wrote:
         | Navarro's book Compact Data Structures is an excellent survey
         | of this literature!
         | https://www.cambridge.org/core/books/compact-data-structures...
         | 
          | I love the perspective of computing the information-theoretic
          | lower bounds for a data structure and then trying to achieve
         | that with as little overhead as possible while still supporting
         | efficient operations on the structure.
        
       | naturalplasmoid wrote:
       | One notable omission from this piece is a technique to compress
       | integer time series with both positive and _negative_ values.
       | 
       | If you naively apply bit-packing using the Simple8b algorithm,
        | you'll find that negative integers are not compressed. This is
        | due to how signed integers are represented in modern computers:
        | in two's complement, negative integers have their most
        | significant bit set [1].
       | 
       | Zigzag encoding is a neat transform that circumvents this issue.
       | It works by mapping signed integers to unsigned integers so that
       | numbers with a small absolute value can be encoded using a small
        | number of bits. Put another way, it stores the sign in the
        | least significant bit. [2]
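        | 
        | For illustration, a rough Python sketch of the transform (not
        | the C++ implementation linked below):
        | 
        |     def zigzag_encode(n: int) -> int:
        |         # 0,-1,1,-2,2,... -> 0,1,2,3,4,...; the sign ends up
        |         # in the least significant bit
        |         return (n << 1) ^ (n >> 63)  # assumes n fits in 64 bits
        | 
        |     def zigzag_decode(z: int) -> int:
        |         return (z >> 1) ^ -(z & 1)
        | 
        |     assert [zigzag_encode(n) for n in (0, -1, 1, -2, 2)] \
        |         == [0, 1, 2, 3, 4]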
       | 
        | If you're looking for a quick way to experiment with various
        | time-series compression algorithms, I highly recommend Daniel
        | Lemire's
       | FastPFor repository [3] (as linked in the article). I've used the
       | Python bindings [4] to quickly evaluate various compression
       | algorithms with great success.
       | 
       | Finally I'd like to humbly mention my own tiny contribution [5],
       | an adaptation of Lemire's C++ Simple8b implementation (including
       | basic methods for delta & zigzag encoding/decoding).
       | 
       | I used C++ templates to make the encoding and decoding routines
       | generic over integer bit-width, which expands support up to 64
       | bit integers, and offers efficient usage with smaller integers
        | (e.g. 16 bit). I made a couple of other minor tweaks, including
        | support for arrays up to 2^64 in length and changes to the
        | API/method signatures so they can be used in a more functional
        | style. This implementation is slightly simpler to invoke via
        | FFI, and I intend to add examples showing how to compile it for
        | use from JS (WebAssembly), Python, and C#. I threw my code up
        | quickly in order to share it with you all; hopefully someone
        | finds it useful. I intend to expand the usage examples, test
        | cases, etc., and am looking forward to any comments or
        | contributions.
       | 
       | [1] https://en.wikipedia.org/wiki/Signed_number_representation
       | 
       | [2] https://en.wikipedia.org/wiki/Variable-
       | length_quantity#Zigza...
       | 
       | [3] https://github.com/lemire/FastPFor
       | 
       | [4] https://github.com/searchivarius/PyFastPFor
       | 
       | [5] https://github.com/naturalplasmoid/simple8b-timeseries-
       | compr...
        
         | jart wrote:
          | I love zigzag encoding. The way it uses shift and xor is
          | beautiful, and it gives it a performance edge of 3 cycles on
          | my CPU versus signed LEB128 encoding. One thing I found out
          | once, while encoding Unicode lookup tables in a smaller form
          | (to reduce the Python binary size), is that it can be very
          | profitable to (1) delta encode, then (2) zigzag encode, and
          | finally (3) apply deflate. For example:
          |     _PyUnicode_PhrasebookOffset2: size         is 178,176 bytes
          |     _PyUnicode_PhrasebookOffset2: packed size  is 100,224 bytes
          |     _PyUnicode_PhrasebookOffset2: rle size     is 282,216 bytes
          |     _PyUnicode_PhrasebookOffset2: deflate size is  52,200 bytes
          |     _PyUnicode_PhrasebookOffset2: bz2 size     is  76,876 bytes
          |     _PyUnicode_PhrasebookOffset2: dleb size    is  47,198 bytes
          |     _PyUnicode_PhrasebookOffset2: dzd size     is  12,748 bytes
         | 
         | It's so cool.
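          | 
          | In Python, the pipeline is roughly this (a sketch, assuming
          | the zigzagged deltas fit in 32 bits; the real tables are C
          | arrays):
          | 
          |     import zlib
          | 
          |     def zigzag(n):
          |         return (n << 1) ^ (n >> 63)
          | 
          |     def dzd_compress(table):
          |         # (1) delta encode, (2) zigzag encode, (3) deflate
          |         deltas = [table[0]] + [b - a for a, b in
          |                                zip(table, table[1:])]
          |         zz = [zigzag(d) for d in deltas]
          |         raw = b"".join(z.to_bytes(4, "little") for z in zz)
          |         return zlib.compress(raw, level=9)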
        
       | spullara wrote:
       | For Wavefront I got this down to about 1.5 bytes per metric
       | reported (on average across customers) where each metric was a
       | tuple of:
       | 
       | name, source, timestamp (s), value (int) and a set of name/value
       | pairs
       | 
        | The key to optimizing it was having a database that supported
        | different compression strategies based on the data.
        
         | londons_explore wrote:
          | That actually sounds pretty bad to me, given that many
          | metrics will be constant, very rarely changing, or perfectly
          | correlated with another metric.
         | 
         | Were you arithmetic coding stuff, or still doing bit-twiddling
         | tricks?
        
           | spullara wrote:
            | If you think that sounds bad, I would love to see what you
            | think is good (and show your work); then they can just add it
            | to one of the options. It does basically all the tricks in the
           | article and then some. This is a mean, not a median. The
           | median is likely much lower but no one cares about the median
           | when calculating how much space you need.
           | 
            | RE: correlated with another metric: that would be an
            | extremely expensive one to calculate. You need an algorithm
            | that can operate while ingesting millions of points per
            | second 24/7.
        
       | awild wrote:
        | Question for people who've implemented these kinds of
        | compression schemes: some of these mechanisms require local
        | context (RLE and Simple8b); how do you handle random access in
        | these cases? Is the
       | metadata embedded as a sort of key frame to which we have to
       | advance before decoding the value? Or as a separate block in the
       | headers?
       | 
       | RLE especially sounds like it could degenerate into a binary
       | search per lookup, maybe aided by caching the bounds of the last
       | run and assuming locality and linear-ish usage?
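        | 
        | Rough sketch of what I have in mind (a hypothetical layout:
        | runs stored as (length, value) pairs plus a prefix array of
        | run start positions):
        | 
        |     import bisect
        | 
        |     runs = [(5, "a"), (3, "b"), (7, "c")]  # (run length, value)
        |     starts = [0]
        |     for length, _ in runs:
        |         starts.append(starts[-1] + length)  # [0, 5, 8, 15]
        | 
        |     def value_at(i):
        |         # binary search per lookup; caching the last run's
        |         # bounds would make linear-ish scans O(1) per step
        |         run = bisect.bisect_right(starts, i) - 1
        |         return runs[run][1]
        | 
        |     assert value_at(6) == "b"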
       | 
        | This might sound obvious, but it wasn't mentioned in the
        | article: you can also apply integer-based compression schemes
        | to dictionary-encoded data.
        | 
        | And floating-point values don't always need all 64 bits when
        | the use case and input data don't require that level of
        | precision.
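        | 
        | E.g. a tiny sketch of that last point: if three decimal places
        | are enough (an assumption that depends on the input data), the
        | readings can be scaled to fixed-point integers, which the
        | integer schemes above then handle well:
        | 
        |     def quantize(values, decimals=3):
        |         scale = 10 ** decimals
        |         return [round(v * scale) for v in values]
        | 
        |     def dequantize(ints, decimals=3):
        |         scale = 10 ** decimals
        |         return [i / scale for i in ints]
        | 
        |     # [21.3751, 21.3749, ...] -> [21375, 21375, ...]: small,
        |     # repetitive deltas instead of noisy 64-bit doubles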
        
         | nu11ptr wrote:
         | I was thinking the same thing as I was reading this. I doubt
         | you could retain 100% random access. Instead, I think you would
          | generally create "blocks" that are compressed over a time
          | range compatible with your application (i.e. 1-day blocks,
          | 1-week blocks, etc.), and then when picking date/times you
          | take X blocks and further refine the query after
          | decompression (example: load the May 5th block --> May 5th
          | 8:10AM - 11:13AM). At least, that is what I have done in the
          | past in my own home-grown apps. Essentially each block starts
          | and terminates compression independently; the number of
          | blocks is a trade-off between compression efficiency and
          | granularity.
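          | 
          | For concreteness, a rough sketch of that block scheme (1-day
          | blocks, with zlib/JSON standing in for whatever codec and
          | layout you actually use):
          | 
          |     import json, zlib
          | 
          |     BLOCK = 86_400  # seconds; block size trades compression
          |                     # efficiency for granularity
          | 
          |     def build_blocks(points):
          |         # points: list of (unix_ts, value), sorted by time
          |         blocks = {}
          |         for ts, value in points:
          |             key = ts - ts % BLOCK
          |             blocks.setdefault(key, []).append((ts, value))
          |         # each block starts/terminates compression on its own
          |         return {k: zlib.compress(json.dumps(v).encode())
          |                 for k, v in blocks.items()}
          | 
          |     def query(blocks, t0, t1):
          |         out = []
          |         for start, blob in blocks.items():
          |             if start + BLOCK <= t0 or start > t1:
          |                 continue  # block doesn't overlap the range
          |             for ts, value in json.loads(zlib.decompress(blob)):
          |                 if t0 <= ts <= t1:  # refine inside the block
          |                     out.append((ts, value))
          |         return sorted(out)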
        
         | arinlen wrote:
         | > (...) how do you handle random access in these cases?
         | 
         | Given we're discussing time series, random access is something
         | that's not required at all. There is no use case where you're
         | looking for a single, specific data point of a time series.
          | Time series are used to store and retrieve batches of data
          | points in order to process and/or analyze/visualize them in
          | batches. Time-series queries consist mainly of "get me all
          | the data points of a time series reported between timestamps
          | A and B".
        
         | londons_explore wrote:
         | Most applications don't require true random access.
         | 
         | Instead you're typically reading a range of data, and then you
         | can decompress just the blocks required for the data you want
         | to see.
         | 
         | Caching of partial queries can also help substantially. For
         | example, if many queries involve querying the max() of some
         | per-second data grouped by minute, it is well worth caching
         | that rather than reading the source data every time to
         | calculate the max().
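          | 
          | A minimal sketch of that kind of cached rollup (per-minute
          | max over per-second points; the names here are made up):
          | 
          |     from collections import defaultdict
          | 
          |     def max_per_minute(points):
          |         # points: iterable of (unix_ts_seconds, value);
          |         # compute once, then serve repeated dashboard or
          |         # alerting queries from this rollup instead of
          |         # re-reading the source data
          |         rollup = defaultdict(lambda: float("-inf"))
          |         for ts, value in points:
          |             minute = ts - ts % 60
          |             rollup[minute] = max(rollup[minute], value)
          |         return dict(rollup)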
         | 
          | Typically the query engine can keep counts of every subquery,
          | how frequently it's used, and how many data points it
          | involves to decide how long to cache it for. As far as I'm
          | aware no open-source TSDB does this, despite it being a
          | massive, simple win, especially for alerting systems and
          | dashboards that run very similar queries frequently.
        
       | anonymoushn wrote:
       | Are there similar algorithms useful for compressing stuff like "a
       | list of the top 100 scores in a game?" We already store these as
       | diffs of the list, but we could probably do better than zstd-ing
       | the list-diffs.
        
         | pizza wrote:
          | You could look into multiset sequence compression methods,
          | where the multiset elements are the sequences max_score(i, t)
          | over time t, one for each rank i:
          | https://arxiv.org/abs/1401.6410
        
         | tyingq wrote:
          | This discussion about a Barclays Bank JSON list of 74,000
          | numbers might have some ideas:
         | https://news.ycombinator.com/item?id=28326036
         | 
          | I was able to get it from a 1.3 MB JSON file (uncompressed)
          | to a 151 kB uncompressed but 4 kB compressed file using
          | mostly deltas.
         | https://news.ycombinator.com/item?id=28348965
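          | 
          | The core of the trick, sketched in Python (zlib here rather
          | than the exact encoding used in that thread):
          | 
          |     import json, zlib
          | 
          |     def delta_compress(numbers):
          |         # store the first value plus successive differences;
          |         # for slowly varying or sorted series the deltas are
          |         # small and repetitive, so a general-purpose
          |         # compressor squeezes them far better than the raw
          |         # list
          |         deltas = [numbers[0]] + [b - a for a, b in
          |                                  zip(numbers, numbers[1:])]
          |         return zlib.compress(json.dumps(deltas).encode())
          | 
          |     def delta_decompress(blob):
          |         out, total = [], 0
          |         for d in json.loads(zlib.decompress(blob)):
          |             total += d
          |             out.append(total)
          |         return out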
        
           | anonymoushn wrote:
           | This is great! I'll have to think about how to combine
           | "storing deltas within the list" and "storing deltas of the
            | list," but there should be some significant gains available.
        
       ___________________________________________________________________
       (page generated 2022-05-14 23:00 UTC)