[HN Gopher] Making all your integers positive with zigzag encoding ___________________________________________________________________ Making all your integers positive with zigzag encoding Author : jjgreen Score : 44 points Date : 2022-11-25 17:55 UTC (5 hours ago) (HTM) web link (lemire.me) (TXT) w3m dump (lemire.me) | bugfix-66 wrote: | A really tremendous varint/VLQ encoder (using a zig-zag encoding | and an generalized base): | | https://bugfix-66.com/2c1df73cab89ec76d6fa10caf8a27c1fbe4d16... | | and the decoder: | | https://bugfix-66.com/1efa93a5eb0cc12b3de7cd1dab8e471a2cc95e... | | The common varint, which you see in applications everywhere | (e.g., git), is just base-128 version of the above general | scheme! | | But base-32 or base-8 or base-64 varints can be a big win for | some purposes. Remove the zig-zag encoding if negative integers | don't occur. | bo1024 wrote: | I have to fill out a captcha just to read an article / blog post? | (Edit: and enable cookies) | pwg wrote: | Or install Ublock Origin in Firefox blocking the javascript and | read the entire article without a captcha getting in the way. | bo1024 wrote: | I had uBlock Origin and uMatrix installed. Not sure what | changed, but I can access the article now. | nynx wrote: | Used in the [postcard](https://docs.rs/postcard/latest/postcard/) | crate for encoding variably sized signed integers. | phoboslab wrote: | I first encountered this in the Flac source[1]. They do the bit | twiddling a bit differently than in this article: | uval = (val<<1) ^ (val>>31); | | Variations of this have probably been used countless of times | in other libraries. | | [1] | https://github.com/LordJZ/libflac/blob/master/src/libFLAC/bi... | jamesmunns wrote: | Author of postcard here, I don't remember exactly where I got | my bit twiddling example from, lemme see if I have notes from | that. | | postcard's zigzag encoding matches phoboslab's psuedocode. | | Edit, not totally sure, but this wiki page rings a bell, and | is probably where I got my impl from: | https://en.wikipedia.org/wiki/Variable- | length_quantity#Zigza... | | Edit 2: I also explain why I do this (it compresses better) | in my wire format specification: | https://postcard.jamesmunns.com/wire-format.html#signed- | inte... | deleterofworlds wrote: | A nice bijection from integers to natural numbers. Mappings exist | for rationals -> natural numbers and other countable sets, but | may be less practical. | fluoridation wrote: | Easy enough. If you have a rational class represented as a pair | of irreducible int32_ts, you can simply do ((u64)numerator << | 32) | (u64)denominator. | Someone wrote: | That either doesn't map to rationals (making 1/1 and 2/2 two | different numbers) or not a bijection (0x0000000100000001 and | 0x0000000200000002 both map back to 1) and doesn't work on | "the integers". | fluoridation wrote: | Most people just care about having a lossless round-trip | conversion, from the source representation to a storage | representation and then back to the source representation, | not about having every possible bit sequence in the storage | representation being valid. Univocally mapping the | rationals onto a set of contiguous naturals is extremely | tricky, computationally. | bob1029 wrote: | It's pretty cool how a simple encoding scheme can have profound | effects on compression. This is a big part of what makes DCT- | style block compression work as well as it does. You rig the game | such that all of your higher frequency components wind up at the | end - ideally as zeroes after quantization. So, a simple RLE | scheme can do a lot more work than it otherwise could. | benj111 wrote: | Sorry. Are you inferring this is how DCT works or are you | making a comparison? | | Ive just read up about DCT and don't really have a clue, but it | doesn't seems applicable? | | That being so. I'm not sure how RLE at the bit level would | help. Surely encoding run lengths of 5ish bits isn't going to | compress much of anything. | bob1029 wrote: | Zigzag is a phase _after_ DCT+quantization where you iterate | through the resulting matrix in a particular order. Nothing | to do with the DCT itself. | | https://commons.wikimedia.org/wiki/File:JPEG_example_zigzag.. | .. | oakwhiz wrote: | The idea behind this is to scan diagonally by frequency so | that big numbers are grouped together at the start. You can | choose to zero out some of the bits near the end since those | only contain high frequency components that are harder to | notice. RLE then has the ability to work on the trailing | zeroes, or other methods can be used instead. | IanWard1 wrote: | reminds me of this video on using "-2" as a base for numbers (can | represent all positive and negative integers without a sign bit) | https://www.youtube.com/watch?v=GWA_NraxOUw | raphlinus wrote: | The `fast_decode` as written is undefined behavior for some | inputs (INT_MAX for example) because of the integer overflow in | `2*x`. This can be fixed by casting to an unsigned int first. | | Also a warning, right shifting a negative number gives the | expected result (arithmetic right shift) in C++20, but is still | implementation defined even in the C23 draft spec. Likely you're | good, but the spec doesn't guarantee it. | eloff wrote: | Protobuf was the first major project I was aware of that uses | zigzag encoding for integers. They then encode those integers | using the typical 7 bit encoding scheme where the msb indicates | there's another byte following. I'm sure they weren't the first | by any means though. | | I'm currently using this in a binary format for serializing | dynamic JSON-like values that I invented and am implementing in | Rust. I will release it as open source sometime next year. | tikhonj wrote: | I first ran into this encoding scheme in Avro, which you could | also describe as a format for dynamic JSON-like values :) | [deleted] | ramranch wrote: | Does your protocol provide any advantages over bincode? | DontchaKnowit wrote: | Found this recently in propreitary code base for (in abstract | terms) storing enormous numbers of price values. Very interesting | shoo wrote: | Can anyone suggest high-quality resources for learning about | compression? | k0k0r0 wrote: | They use a variant of that in most Modern SAT solvers for the | literal encoding by the way, i.e. in order to encode negation of | variables, which traditionally are represented with negative | integers. Mostly, because they then use the encoded literal as | indeces in an array. Historically, this also had performance | reasons if I remember correctly. I feel it has no benefit | anymore, and people simply got used to it (and never touch that | part of the code anyway). But I might be wrong. I did not yet | benchmarked any alternative, but it is on my todo-list (woth low | priority). | Ferrotin wrote: | Wouldn't they use a plain bitmask instead, the lsb indicating | negation, with 3 being the negation of 2? That's zigzag | encoding with a negative zero. | k0k0r0 wrote: | If I understand you correctly, then that exactly what I am | talking about. You might see this as bitmask. That's correct. ___________________________________________________________________ (page generated 2022-11-25 23:00 UTC)