[HN Gopher] Parallelising Huffman decoding and x86 disassembly b...
___________________________________________________________________

Parallelising Huffman decoding and x86 disassembly by synchronising
prefix codes

Author : hasheddan
Score  : 51 points
Date   : 2022-07-30 16:47 UTC (6 hours ago)

web link (dougallj.wordpress.com)

| kyleplum wrote:
| Without taking a concrete stance on the current state of patents -
| I developed something similar using GPU compute shaders while
| employed at AMD/ATI:
|
| https://patentimages.storage.googleapis.com/ef/d2/2c/b54bbdf...
|
| Shameless self-promotion aside, I think the patent application
| includes some helpful diagrams to understand what is being done
| here - specifically pages 25/26.

| londons_explore wrote:
| Could the same be done with arithmetic codes?
|
| Wouldn't they also self-synchronize, but after far more bits?

| hinkley wrote:
| These days arithmetic coding has been surpassed by asymmetric
| numeral systems (ANS), except that those are slower still.
|
| Parallelizing ANS would be huge.
|
| Notable uses of ANS: zstandard, JPEG XL. Apple and Dropbox also
| use it.

| jhj wrote:
| ANS is super fast and trivially parallelizable, faster than
| Huffman or especially arithmetic encoding, and is superior to
| Huffman at least as far as compression ratios are concerned
| (symbol probabilities are not restricted to powers of 2, though
| unlike arithmetic encoding one is usually limited to something
| close to multiples of 2^-9 to 2^-13 for symbol probabilities, the
| limitation being due to the table sizes required for rANS or tANS
| in high-speed memories or in L1 cache).
|
| It is fast because it can be machine-word oriented (you read/write
| whole machine words at a time, not arbitrary variable-bit-length
| sequences), and as a result you can interleave any number of
| independent (parallel) encoders in the same stream with just a
| (parallel) prefix sum to figure out where to write state values
| (as whole machine words) when renormalization needs to occur in
| individual encoders. This makes it very SIMD friendly (see
| https://arxiv.org/pdf/1402.3392.pdf).
|
| I for one got up to 400 GB/s throughput on A100 GPUs in my
| implementation (https://github.com/facebookresearch/dietgpu),
| which to my knowledge is/was the fastest entropy coder out there
| on a commercially available hardware platform, so fast that we're
| using it to losslessly compress data before sending it over
| PCIe/NVLink/Infiniband/etc. during large-scale neural network
| training, and it is still a win. Nvidia's nvCOMP library also
| recently added similar techniques.
|
| ANS can self-synchronize as well, but chunking data into fixed
| >=4 KiB segments has fairly minimal compression-size overhead (you
| need an index to tell you where each variable-sized compressed
| chunk begins) and is faster.

| londons_explore wrote:
| When decoding a large data block (i.e. megabytes), it really
| doesn't matter if a few tens of bytes get decoded twice to find a
| sync point.
|
| An obvious solution would be to SIMD-decode with, say, 16 lanes,
| each starting to decode 1/16th of the input data. When each
| decoder gets to the end of its chunk, continue a little further.
| Then use non-SIMD code to verify that every chunk has reached a
| sync point with the following chunk, and splice together all the
| decoded data.
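A minimal scalar sketch of the overlap-and-splice scheme described in
the comment above. Phase 1 decodes every chunk independently (that loop
is what would be spread across SIMD lanes or threads); phase 2 finds,
for each pair of neighbouring chunks, the first bit position on which
both decoders agree (the sync point) and splices the outputs there. The
single-symbol decoder huff_decode_symbol(), the chunk count, and the
overrun size are hypothetical stand-ins, not taken from the linked post.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define NCHUNKS      16
    #define OVERRUN_BITS 256  /* how far past its chunk a lane keeps decoding */

    /* Hypothetical single-symbol decoder: reads one prefix code starting at
     * *bitpos, advances *bitpos past it, and returns the decoded byte. */
    uint8_t huff_decode_symbol(const uint8_t *bits, size_t *bitpos);

    struct lane {
        size_t  *start;   /* bit position at which each decoded symbol began */
        uint8_t *sym;     /* the decoded symbols themselves */
        size_t   n;
    };

    /* Linear search for a symbol in lane l whose code started at bit pos. */
    static size_t lane_find(const struct lane *l, size_t pos)
    {
        for (size_t i = 0; i < l->n; i++)
            if (l->start[i] == pos)
                return i;
        return (size_t)-1;
    }

    size_t decode_overlap_splice(const uint8_t *bits, size_t nbits,
                                 uint8_t *out, size_t out_cap)
    {
        struct lane lanes[NCHUNKS] = {{0}};
        size_t off[NCHUNKS + 1];
        for (size_t c = 0; c <= NCHUNKS; c++)
            off[c] = nbits * c / NCHUNKS;

        /* Phase 1: every chunk decoded independently (the parallel part).
         * All chunks except the first start mid-codeword and emit garbage
         * until the decoder happens to resynchronise. */
        for (size_t c = 0; c < NCHUNKS; c++) {
            size_t cap = 64, pos = off[c];
            size_t limit = (c + 1 < NCHUNKS) ? off[c + 1] + OVERRUN_BITS : nbits;
            if (limit > nbits)
                limit = nbits;
            lanes[c].start = malloc(cap * sizeof *lanes[c].start);
            lanes[c].sym   = malloc(cap);
            while (pos < limit) {
                if (lanes[c].n == cap) {
                    cap *= 2;
                    lanes[c].start = realloc(lanes[c].start,
                                             cap * sizeof *lanes[c].start);
                    lanes[c].sym   = realloc(lanes[c].sym, cap);
                }
                lanes[c].start[lanes[c].n] = pos;
                lanes[c].sym[lanes[c].n++] = huff_decode_symbol(bits, &pos);
            }
        }

        /* Phase 2 (serial, cheap): splice.  `first` is the index of the
         * first trustworthy symbol in the current lane. */
        size_t n = 0, first = 0;
        for (size_t c = 0; c < NCHUNKS; c++) {
            size_t stop = lanes[c].n, next_first = 0;
            if (c + 1 < NCHUNKS) {
                stop = (size_t)-1;
                for (size_t i = first; i < lanes[c].n; i++) {
                    if (lanes[c].start[i] < off[c + 1])
                        continue;
                    size_t j = lane_find(&lanes[c + 1], lanes[c].start[i]);
                    if (j != (size_t)-1) { stop = i; next_first = j; break; }
                }
                if (stop == (size_t)-1)
                    return 0;   /* no sync point within the overrun window:
                                   a real implementation would fall back to
                                   decoding across this boundary serially */
            }
            for (size_t i = first; i < stop && n < out_cap; i++)
                out[n++] = lanes[c].sym[i];
            first = next_first;
        }
        for (size_t c = 0; c < NCHUNKS; c++) {
            free(lanes[c].start);
            free(lanes[c].sym);
        }
        return n;
    }

Because a prefix decoder's behaviour is a pure function of its bit
position, once two lanes visit the same position their outputs agree
from there on, so matching a single recorded start position is enough
to splice safely.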
| hinkley wrote:
| Before I gave up compression as a hobby, my last big observation
| was that many algorithms try to juggle several concerns at once,
| and wouldn't it be better to double down on the bzip2 (or, for
| that matter, optimizing-compiler) strategy of making multiple
| passes? That's not, strictly speaking, parallel decoding, but it
| can be modeled with coroutines or streams, and it can involve
| quite a few cores if you so choose.
|
| And these days we are not as concerned about memory usage as we
| were when a computer with a modem might have only 384K of RAM, so
| a block structure with a length prefix is not such a bad strategy.
| Though should your blocks all be the same size, or should you
| start a block every time the window needs to be reset? One is
| smaller, but the other has more predictable performance.

| londons_explore wrote:
| Does parallelizing Huffman decoding on a modern CPU actually speed
| up decoding?
|
| I would assume that the memory read and write bandwidth of a CPU
| would be the limiting factor for any vaguely efficient
| implementation, not the actual processing.

| foota wrote:
| A single core can load less memory per second than the whole CPU,
| because each core has only a limited number of resources for
| loading memory.

| mananaysiempre wrote:
| It would be, if only you could load the CPU properly. Huffman
| decoding is _highly_ serialized: an iteration per symbol in the
| FSM implementation is somewhat better than the loop iteration per
| bit in the naive one, but every iteration still depends on
| basically everything in the previous one, so you're using a
| fraction of the scalar processing power of the core. Thus modern
| compressors resort to format-incompatible tricks like interleaved
| streams, but how much to interleave depends on which class of CPU
| you'll want to decode on.

| wolf550e wrote:
| Modern compression formats are optimized for decoding Huffman
| streams in parallel. See, for example, Fabian Giesen explaining
| RAD's codecs and Yann Collet's zstandard:
| https://fgiesen.wordpress.com/2021/08/30/entropy-coding-in-o...
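To make mananaysiempre's point concrete, here is a sketch of the
per-symbol decode loop, assuming an illustrative 12-bit lookup table
(symbol plus code length per entry) and an MSB-first bitstream with two
bytes of slack padding; none of this is any particular codec's actual
format. Every iteration's table index depends on the bit position
produced by the previous iteration, so the loop is one long dependency
chain. The second function shows the interleaved-streams trick the
comments above refer to: two independent bitstreams decoded in one loop
give the CPU two unrelated chains to overlap, at the cost of a format
change.

    #include <stddef.h>
    #include <stdint.h>

    struct entry {
        uint8_t sym;   /* decoded symbol */
        uint8_t len;   /* length of its code in bits */
    };

    /* Peek the next 12 bits at bit offset pos (MSB-first; the buffer is
     * assumed to have at least two slack bytes past the last real bit). */
    static uint32_t peek12(const uint8_t *in, size_t pos)
    {
        size_t byte = pos >> 3, shift = pos & 7;
        uint32_t w = (uint32_t)in[byte] << 16
                   | (uint32_t)in[byte + 1] << 8
                   |  in[byte + 2];
        return (w >> (12 - shift)) & 0xFFF;
    }

    /* One stream: each trip depends on `pos` from the previous trip. */
    size_t decode_serial(const struct entry table[1 << 12],
                         const uint8_t *in, size_t nbits, uint8_t *out)
    {
        size_t pos = 0, n = 0;
        while (pos < nbits) {
            uint32_t idx = peek12(in, pos);
            out[n++] = table[idx].sym;
            pos += table[idx].len;      /* the serial dependency */
        }
        return n;
    }

    /* Two independent streams in one loop: the `pa` and `pb` chains do
     * not depend on each other, so a superscalar core can overlap them. */
    void decode_interleaved(const struct entry table[1 << 12],
                            const uint8_t *a, size_t abits, uint8_t *aout,
                            const uint8_t *b, size_t bbits, uint8_t *bout)
    {
        size_t pa = 0, pb = 0, na = 0, nb = 0;
        while (pa < abits && pb < bbits) {
            uint32_t ia = peek12(a, pa);
            uint32_t ib = peek12(b, pb);
            aout[na++] = table[ia].sym;
            bout[nb++] = table[ib].sym;
            pa += table[ia].len;
            pb += table[ib].len;
        }
        /* Finish whichever stream is longer with the serial loop. */
        while (pa < abits) {
            uint32_t ia = peek12(a, pa);
            aout[na++] = table[ia].sym;
            pa += table[ia].len;
        }
        while (pb < bbits) {
            uint32_t ib = peek12(b, pb);
            bout[nb++] = table[ib].sym;
            pb += table[ib].len;
        }
    }

Real codecs interleave more than two streams (and renormalise whole
words at a time); two streams is just the smallest example of the
dependency-breaking idea.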