[HN Gopher] Parallelising Huffman decoding and x86 disassembly b...
       ___________________________________________________________________
        
       Parallelising Huffman decoding and x86 disassembly by synchronising
       prefix codes
        
       Author : hasheddan
       Score  : 51 points
       Date   : 2022-07-30 16:47 UTC (6 hours ago)
        
 (HTM) web link (dougallj.wordpress.com)
 (TXT) w3m dump (dougallj.wordpress.com)
        
       | kyleplum wrote:
       | Without taking a concrete stance on the current state of patents
       | - I developed something similar using GPU compute shaders while
       | employed at AMD/ATI
       | 
       | https://patentimages.storage.googleapis.com/ef/d2/2c/b54bbdf...
       | 
        | Shameless self-promotion aside, I think the patent application
        | includes some helpful diagrams for understanding what is being
        | done here, specifically page 25/26.
        
       | londons_explore wrote:
       | Could the same be done with arithmetic codes?
       | 
        | Wouldn't they also self-synchronize, just after far more bits?
        
         | hinkley wrote:
          | These days arithmetic coding has been surpassed by asymmetric
          | numeral systems (ANS), except that those are slower still.
          | 
          | Parallelizing ANS would be huge.
          | 
          | Notable uses of ANS: Zstandard, JPEG XL. Apple and Dropbox
          | also use it.
        
           | jhj wrote:
            | ANS is super fast and trivially parallelizable, faster than
            | Huffman and especially arithmetic encoding, and is superior
            | to Huffman at least as far as compression ratios are
            | concerned (symbol probabilities are not restricted to
            | negative powers of two, but unlike arithmetic encoding you
            | are usually limited to something close to multiples of 2^-9
            | to 2^-13 for symbol probabilities, the limitation coming
            | from the table sizes required for rANS or tANS in high-speed
            | memories or in L1 cache).
           | 
            | It is fast because it can be machine-word oriented (you can
            | read/write whole machine words at a time, not
            | arbitrary/variable-length bit sequences), and as a result
            | you can interleave any number of independent (parallel)
            | encoders in the same stream with just a (parallel) prefix
            | sum to figure out where to write state values (as whole
            | machine words) when renormalization needs to occur in
            | individual encoders. This makes it super SIMD friendly (see
            | https://arxiv.org/pdf/1402.3392.pdf).
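            | 
            | As a concrete (and purely illustrative) sketch of the
            | interleaving idea on the decode side, assuming a 32-bit
            | state renormalized with 16-bit words and 12-bit quantized
            | probabilities (the names and table layout below are made up
            | for the example, not taken from dietgpu):
            | 
            |   /* Sketch: N-way interleaved rANS decode, word renorm.
            |    * st[] holds the NWAY final encoder states read from
            |    * the stream header; in points at the renorm words in
            |    * the order the encoder emitted them. */
            |   #include <stddef.h>
            |   #include <stdint.h>
            | 
            |   #define PROB_BITS 12
            |   #define PROB_MASK ((1u << PROB_BITS) - 1)
            |   #define RANS_LOW  (1u << 16)  /* renorm threshold */
            |   #define NWAY      4           /* interleaved states */
            | 
            |   typedef struct {
            |       uint16_t freq[256], cum[256];
            |       uint8_t  slot2sym[1u << PROB_BITS];
            |   } rans_tab;
            | 
            |   /* Lane (i % NWAY) decodes symbol i. Each lane depends
            |    * only on itself, so the NWAY steps per round can run
            |    * in SIMD lanes. */
            |   static void decode_nway(const rans_tab *t,
            |                           const uint16_t *in,
            |                           uint32_t st[NWAY],
            |                           uint8_t *out, size_t n)
            |   {
            |       for (size_t i = 0; i < n; i++) {
            |           uint32_t *x = &st[i % NWAY];
            |           uint32_t slot = *x & PROB_MASK;
            |           uint8_t  s = t->slot2sym[slot];
            |           out[i] = s;
            |           *x = t->freq[s] * (*x >> PROB_BITS)
            |              + slot - t->cum[s];
            |           if (*x < RANS_LOW)  /* pull one whole word */
            |               *x = (*x << 16) | *in++;
            |       }
            |   }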
           | 
            | I for one got up to 400 GB/s throughput on A100 GPUs in my
            | implementation (https://github.com/facebookresearch/dietgpu),
            | which to my knowledge is/was the fastest entropy coder out
            | there on a commercially available hardware platform, so fast
            | that we're using it to losslessly compress data before
            | sending it over PCIe/NVLink/InfiniBand/etc. during
            | large-scale neural network training, and it is still a win.
            | Nvidia's nvCOMP library also recently added similar
            | techniques.
           | 
            | ANS can self-synchronize as well, but chunking the data into
            | fixed >=4 KiB segments has fairly minimal compression-size
            | overhead (you need an index to tell you where each
            | variable-sized compressed chunk begins) and is faster.
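            | 
            | For what it's worth, the chunked container can be as simple
            | as a per-chunk size table plus a prefix sum over it; a rough
            | sketch (layout and names are illustrative, not dietgpu's
            | actual format):
            | 
            |   /* Header: each chunk's compressed size; a prefix sum
            |    * over the sizes gives every chunk's start offset, so
            |    * all chunks can be decoded in parallel. */
            |   #include <stdint.h>
            | 
            |   typedef struct {
            |       uint32_t num_chunks;   /* number of chunks */
            |       uint32_t chunk_bytes;  /* uncompressed size each */
            |       /* then num_chunks x uint32_t compressed sizes,
            |        * then the concatenated compressed chunks */
            |   } chunk_header;
            | 
            |   /* offsets[i] = byte offset of chunk i in the data */
            |   static void chunk_offsets(const uint32_t *comp_size,
            |                             uint32_t n,
            |                             uint64_t *offsets)
            |   {
            |       uint64_t off = 0;
            |       for (uint32_t i = 0; i < n; i++) {
            |           offsets[i] = off;
            |           off += comp_size[i];
            |       }
            |   }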
        
       | londons_explore wrote:
        | When decoding a large data block (i.e. megabytes), it really
        | doesn't matter if a few tens of bytes get decoded twice to find
        | a sync point.
        | 
        | An obvious solution would appear to be to SIMD-decode with, say,
        | 16 lanes, each starting to decode 1/16th of the input data. When
        | each decoder gets to the end of its chunk, it continues a little
        | further. Then use non-SIMD code to verify that every chunk has
        | reached a sync point with the following chunk, and splice
        | together all the decoded data.
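        | 
        | To sketch the verify step (a hypothetical helper, not code from
        | the article): have each lane record the bit position at which
        | every symbol it decodes starts. Lane i keeps decoding past its
        | chunk boundary, and if its position list shares an entry with
        | the positions of lane i+1's first few symbols, the two decoders
        | are in sync from that bit onward:
        | 
        |   /* Both lists are sorted (positions only increase), so a
        |    * merge-style scan finds the first common position. */
        |   #include <stddef.h>
        |   #include <stdint.h>
        | 
        |   /* Returns the first bit position both lanes hit, or -1 if
        |    * they never met (lane i must then decode further). */
        |   static int64_t find_sync(const uint64_t *tail, size_t na,
        |                            const uint64_t *head, size_t nb)
        |   {
        |       size_t a = 0, b = 0;
        |       while (a < na && b < nb) {
        |           if (tail[a] == head[b]) return (int64_t)tail[a];
        |           if (tail[a] < head[b])  a++;
        |           else                    b++;
        |       }
        |       return -1;
        |   }
        | 
        | At the sync position you keep lane i's output up to that point
        | and lane i+1's output from that point on, discarding whatever
        | garbage lane i+1 produced before it re-synchronized.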
        
         | hinkley wrote:
          | Before I gave up compression as a hobby, my last big
          | observation was that many algorithms try to juggle several
          | concerns at once; wouldn't it be better to double down on the
          | bzip2 (or, for that matter, optimizing-compiler) strategy of
          | making multiple passes? That's not strictly speaking parallel
          | decoding, but it can be modeled with coroutines or streams,
          | and it can involve quite a few cores if you so choose.
         | 
          | And these days we are not as concerned about memory usage as
          | we were when a computer with a modem might have only 384K of
          | RAM, so a block structure with a length prefix is not such a
          | bad strategy. Though should your blocks all be the same size,
          | or should you start a new block every time the window needs
          | to be reset? One is smaller, but the other has more
          | predictable performance.
        
       | londons_explore wrote:
       | Does parallelizing Huffman decoding on a modern CPU actually
       | speed up decoding?
       | 
       | I would assume that the memory read and write bandwidth of a CPU
       | would be the limiting factor for any vaguely efficient
       | implementation, not the actual processing.
        
         | [deleted]
        
         | foota wrote:
          | A single core can load less data per second than the whole
          | CPU can, because each core has only a limited number of
          | resources for outstanding memory accesses.
        
         | mananaysiempre wrote:
          | It would be, if only you could keep the CPU busy. Huffman
          | decoding is _highly_ serialized: one iteration per symbol in
          | the FSM/table implementation is somewhat better than one loop
          | iteration per bit in the naive one, but every iteration still
          | depends on basically everything in the previous one, so
          | you're using only a fraction of the scalar processing power
          | of the core. That's why modern compressors resort to
          | format-incompatible tricks like interleaved streams, but how
          | much to interleave depends on which class of CPU you'll want
          | to decode on.
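          | 
          | To make the dependency concrete, here is a rough sketch of a
          | single-stream table-driven decode loop (table layout and
          | names are illustrative, not any particular codec's):
          | 
          |   /* table has 1 << LOOKUP_BITS entries: one probe decodes
          |    * one whole symbol, but the next probe's index cannot be
          |    * formed until this symbol's length is known. Codes
          |    * longer than LOOKUP_BITS would need a fallback path,
          |    * omitted here. */
          |   #include <stddef.h>
          |   #include <stdint.h>
          | 
          |   #define LOOKUP_BITS 12
          |   #define LOOKUP_MASK ((1u << LOOKUP_BITS) - 1)
          | 
          |   typedef struct { uint8_t sym, len; } huf_entry;
          | 
          |   static size_t huf_decode(const huf_entry *table,
          |                            const uint8_t *bits,
          |                            size_t nbits,
          |                            uint8_t *out, size_t max_out)
          |   {
          |       size_t bitpos = 0, n = 0;
          |       /* tail (last <24 bits) omitted in this sketch */
          |       while (bitpos + 24 <= nbits && n < max_out) {
          |           size_t   byte  = bitpos >> 3;
          |           unsigned shift = bitpos & 7;
          |           /* peek 24 bits, MSB-first, from current byte */
          |           uint32_t window = ((uint32_t)bits[byte] << 16)
          |                           | ((uint32_t)bits[byte + 1] << 8)
          |                           |  (uint32_t)bits[byte + 2];
          |           /* take LOOKUP_BITS bits at bit offset shift;
          |            * 12 == 24 - LOOKUP_BITS */
          |           uint32_t idx = (window >> (12u - shift))
          |                        & LOOKUP_MASK;
          |           huf_entry e = table[idx];
          |           out[n++] = e.sym;
          |           bitpos += e.len;  /* loop-carried dependency */
          |       }
          |       return n;
          |   }
          | 
          | Nothing in one iteration can start until the previous
          | iteration's bit position is known, which is exactly why
          | modern formats split the data into several independently
          | decodable streams.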
        
         | wolf550e wrote:
        | Modern compression formats are optimized for decoding Huffman
        | streams in parallel. See, for example, Fabian Giesen explaining
        | RAD's codecs and Yann Collet's Zstandard:
        | https://fgiesen.wordpress.com/2021/08/30/entropy-coding-in-o...
        
       ___________________________________________________________________
       (page generated 2022-07-30 23:01 UTC)