[HN Gopher] Making all your integers positive with zigzag encoding
       ___________________________________________________________________
        
       Making all your integers positive with zigzag encoding
        
       Author : jjgreen
       Score  : 44 points
       Date   : 2022-11-25 17:55 UTC (5 hours ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
       | bugfix-66 wrote:
       | A really tremendous varint/VLQ encoder (using a zig-zag encoding
       | and an generalized base):
       | 
       | https://bugfix-66.com/2c1df73cab89ec76d6fa10caf8a27c1fbe4d16...
       | 
       | and the decoder:
       | 
       | https://bugfix-66.com/1efa93a5eb0cc12b3de7cd1dab8e471a2cc95e...
       | 
       | The common varint, which you see in applications everywhere
       | (e.g., git), is just base-128 version of the above general
       | scheme!
       | 
       | But base-32 or base-8 or base-64 varints can be a big win for
       | some purposes. Remove the zig-zag encoding if negative integers
       | don't occur.
        
       | bo1024 wrote:
       | I have to fill out a captcha just to read an article / blog post?
       | (Edit: and enable cookies)
        
         | pwg wrote:
         | Or install Ublock Origin in Firefox blocking the javascript and
         | read the entire article without a captcha getting in the way.
        
           | bo1024 wrote:
           | I had uBlock Origin and uMatrix installed. Not sure what
           | changed, but I can access the article now.
        
       | nynx wrote:
       | Used in the [postcard](https://docs.rs/postcard/latest/postcard/)
       | crate for encoding variably sized signed integers.
        
         | phoboslab wrote:
         | I first encountered this in the Flac source[1]. They do the bit
         | twiddling a bit differently than in this article:
         | uval = (val<<1) ^ (val>>31);
         | 
         | Variations of this have probably been used countless of times
         | in other libraries.
         | 
         | [1]
         | https://github.com/LordJZ/libflac/blob/master/src/libFLAC/bi...
        
           | jamesmunns wrote:
           | Author of postcard here, I don't remember exactly where I got
           | my bit twiddling example from, lemme see if I have notes from
           | that.
           | 
           | postcard's zigzag encoding matches phoboslab's psuedocode.
           | 
           | Edit, not totally sure, but this wiki page rings a bell, and
           | is probably where I got my impl from:
           | https://en.wikipedia.org/wiki/Variable-
           | length_quantity#Zigza...
           | 
           | Edit 2: I also explain why I do this (it compresses better)
           | in my wire format specification:
           | https://postcard.jamesmunns.com/wire-format.html#signed-
           | inte...
        
       | deleterofworlds wrote:
       | A nice bijection from integers to natural numbers. Mappings exist
       | for rationals -> natural numbers and other countable sets, but
       | may be less practical.
        
         | fluoridation wrote:
         | Easy enough. If you have a rational class represented as a pair
         | of irreducible int32_ts, you can simply do ((u64)numerator <<
         | 32) | (u64)denominator.
        
           | Someone wrote:
           | That either doesn't map to rationals (making 1/1 and 2/2 two
           | different numbers) or not a bijection (0x0000000100000001 and
           | 0x0000000200000002 both map back to 1) and doesn't work on
           | "the integers".
        
             | fluoridation wrote:
             | Most people just care about having a lossless round-trip
             | conversion, from the source representation to a storage
             | representation and then back to the source representation,
             | not about having every possible bit sequence in the storage
             | representation being valid. Univocally mapping the
             | rationals onto a set of contiguous naturals is extremely
             | tricky, computationally.
        
       | bob1029 wrote:
       | It's pretty cool how a simple encoding scheme can have profound
       | effects on compression. This is a big part of what makes DCT-
       | style block compression work as well as it does. You rig the game
       | such that all of your higher frequency components wind up at the
       | end - ideally as zeroes after quantization. So, a simple RLE
       | scheme can do a lot more work than it otherwise could.
        
         | benj111 wrote:
         | Sorry. Are you inferring this is how DCT works or are you
         | making a comparison?
         | 
         | Ive just read up about DCT and don't really have a clue, but it
         | doesn't seems applicable?
         | 
         | That being so. I'm not sure how RLE at the bit level would
         | help. Surely encoding run lengths of 5ish bits isn't going to
         | compress much of anything.
        
           | bob1029 wrote:
           | Zigzag is a phase _after_ DCT+quantization where you iterate
           | through the resulting matrix in a particular order. Nothing
           | to do with the DCT itself.
           | 
           | https://commons.wikimedia.org/wiki/File:JPEG_example_zigzag..
           | ..
        
           | oakwhiz wrote:
           | The idea behind this is to scan diagonally by frequency so
           | that big numbers are grouped together at the start. You can
           | choose to zero out some of the bits near the end since those
           | only contain high frequency components that are harder to
           | notice. RLE then has the ability to work on the trailing
           | zeroes, or other methods can be used instead.
        
       | IanWard1 wrote:
       | reminds me of this video on using "-2" as a base for numbers (can
       | represent all positive and negative integers without a sign bit)
       | https://www.youtube.com/watch?v=GWA_NraxOUw
        
       | raphlinus wrote:
       | The `fast_decode` as written is undefined behavior for some
       | inputs (INT_MAX for example) because of the integer overflow in
       | `2*x`. This can be fixed by casting to an unsigned int first.
       | 
       | Also a warning, right shifting a negative number gives the
       | expected result (arithmetic right shift) in C++20, but is still
       | implementation defined even in the C23 draft spec. Likely you're
       | good, but the spec doesn't guarantee it.
        
       | eloff wrote:
       | Protobuf was the first major project I was aware of that uses
       | zigzag encoding for integers. They then encode those integers
       | using the typical 7 bit encoding scheme where the msb indicates
       | there's another byte following. I'm sure they weren't the first
       | by any means though.
       | 
       | I'm currently using this in a binary format for serializing
       | dynamic JSON-like values that I invented and am implementing in
       | Rust. I will release it as open source sometime next year.
        
         | tikhonj wrote:
         | I first ran into this encoding scheme in Avro, which you could
         | also describe as a format for dynamic JSON-like values :)
        
         | [deleted]
        
         | ramranch wrote:
         | Does your protocol provide any advantages over bincode?
        
       | DontchaKnowit wrote:
       | Found this recently in propreitary code base for (in abstract
       | terms) storing enormous numbers of price values. Very interesting
        
       | shoo wrote:
       | Can anyone suggest high-quality resources for learning about
       | compression?
        
       | k0k0r0 wrote:
       | They use a variant of that in most Modern SAT solvers for the
       | literal encoding by the way, i.e. in order to encode negation of
       | variables, which traditionally are represented with negative
       | integers. Mostly, because they then use the encoded literal as
       | indeces in an array. Historically, this also had performance
       | reasons if I remember correctly. I feel it has no benefit
       | anymore, and people simply got used to it (and never touch that
       | part of the code anyway). But I might be wrong. I did not yet
       | benchmarked any alternative, but it is on my todo-list (woth low
       | priority).
        
         | Ferrotin wrote:
         | Wouldn't they use a plain bitmask instead, the lsb indicating
         | negation, with 3 being the negation of 2? That's zigzag
         | encoding with a negative zero.
        
           | k0k0r0 wrote:
           | If I understand you correctly, then that exactly what I am
           | talking about. You might see this as bitmask. That's correct.
        
       ___________________________________________________________________
       (page generated 2022-11-25 23:00 UTC)