[HN Gopher] Computing Adler32 Checksums at 41 GB/s
       ___________________________________________________________________
        
       Computing Adler32 Checksums at 41 GB/s
        
       Author : wooosh
       Score  : 66 points
       Date   : 2022-08-07 16:17 UTC (6 hours ago)
        
 (HTM) web link (wooo.sh)
 (TXT) w3m dump (wooo.sh)
        
       | pizza wrote:
       | Ooh now that is very interesting. I would really love to see how
       | this speeds up the run-time of fpng as a whole, if you have any
       | numbers. It looks like fjxl [0] and fpnge [1] (which also uses
       | AVX2) are at the Pareto front for lossless image compression
       | right now [2], but if this speeds things significantly then it's
       | possible there'll be a huge shakeup!
       | 
       | [0]
       | https://github.com/libjxl/libjxl/tree/main/experimental/fast...
       | 
       | [1] https://github.com/veluca93/fpnge
       | 
       | [2] https://twitter.com/richgel999/status/1485976101692358656
        
         | bob1029 wrote:
         | If image encode/decode speed is the _only_ concern,
         | libjpeg-turbo is going to be orders of magnitude faster than
         | any of these lossless schemes. With JPEG, you could encode
         | 1080p bitmaps in <10 milliseconds (per thread) on any consumer
         | PC made in the last decade.
         | 
         | The frequency domain is a really powerful place to operate in
         | when you are dealing with this amount of data.
        
           | pizza wrote:
           | That's not true. libjpeg-turbo is ~50 MB/s last I tried -
           | plus it's not lossless. fjxl and fpnge are basically an order
           | of magnitude faster than that. libjpeg-turbo isn't even the
           | fastest JPEG codec - you should check out the (relatively
           | obscure) libmango - roughly 1 gbps decode on a 2020 MacBook
           | Pro - or nvJPEG for GPU-based JPEG decoding. Supposedly
           | there are even faster GPU-based decoders than nvJPEG, too.
        
             | bob1029 wrote:
             | > GPU-based
             | 
             | How does this impact the overall latency of encoding a
             | single image?
        
               | pizza wrote:
               | Probably quite a bit, I don't know. The typical use case
               | is to load up thousands of JPEGs at once to get good
               | throughput despite copy overhead. You can see here the
               | benchmark against jpeg-turbo:
               | https://developer.nvidia.com/blog/leveraging-hardware-jpeg-d...
        
               | averne_ wrote:
               | I've written an open-source driver for the decoding side
               | of the nvjpg module found in the Tegra X1 (i.e. an earlier
               | hardware revision than the one in the A100).
               | 
               | I did some quick benchmarks against libjpeg-turbo, if
               | that can give you an idea. I expect encoding performance
               | would be similar.
               | 
               | https://github.com/averne/oss-nvjpg#performance
        
         | wooosh wrote:
         | Unfortunately I haven't had the time to do a proper benchmark,
         | and the fpng test executable only decodes/encodes a single
         | image which produces very noisy/inconclusive results. However,
         | I'm under the impression that it doesn't make a large
         | difference in terms of overall time.
         | 
         | fpnge (which I wasn't aware of until now) appears to already be
         | using a very similar (identical?) algorithm, so I suspect the
         | relative performance of fpng and fpnge would not be
         | significantly impacted by this change.
        
           | Nyan wrote:
           | As someone who has been recently optimising fpnge, Adler32
           | computation is pretty much negligible regarding overall
           | runtime. The Huffman coding and filter search take up most of
           | the time. (IIRC fpng doesn't do any filter search, but
           | Huffman encoding isn't vectorized, so I'd expect that to
           | dominate fpng's runtime)
        
       | dougall wrote:
       | Nice! (I've been meaning to write up this Apple M1 ~60GB/s
       | version, which I think is similar:
       | https://gist.github.com/dougallj/66151f1c509484a42fe0abd0d84... )
        
       | jiggawatts wrote:
       | I hope this brilliant work has been merged into the relevant open
       | source libraries.
       | 
       | Something that's unfair about the world is that work like this
       | could reach billions of people and save a million dollars worth
       | of time and electricity annually but is being done gratis.
       | 
       | It would be amazing if there were charities that rewarded high-
       | impact open source contributions like this proportionally to the
       | benefits to humanity...
        
       | daniel-cussen wrote:
       | I love this kind of writeup. This is my idea of fun: speedups.
        
       | TAForObvReasons wrote:
       | While micro-optimizations are interesting, there are two
       | questions left unanswered:
       | 
       | - Does this change noticeably affect the total runtime? The
       | checksum seems simple enough that the slight difference here
       | wouldn't show up in PNG benchmarks.
       | 
       | - The proposed solution uses AVX2, which is not currently used in
       | the original codebase. Would any other part of the processing
       | benefit from using newer instructions?
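
For a sense of how simple the checksum is, here is a minimal scalar sketch of Adler-32 in Python, following the definition in the zlib format (RFC 1950). It is not the AVX2 code from the article; the name adler32_scalar is made up for illustration, and zlib.adler32 from the standard library is used only as a cross-check.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16


def adler32_scalar(data: bytes) -> int:
    """Byte-at-a-time Adler-32 (illustrative, not optimized)."""
    a, b = 1, 0  # a: running byte sum (starts at 1), b: running sum of a
    for byte in data:
        a = (a + byte) % MOD_ADLER
        b = (b + a) % MOD_ADLER
    return (b << 16) | a  # b in the high 16 bits, a in the low 16 bits


if __name__ == "__main__":
    sample = b"Wikipedia"
    print(hex(adler32_scalar(sample)))
    # Cross-check against the standard library implementation.
    assert adler32_scalar(sample) == zlib.adler32(sample)
```

Vectorized versions generally compute the same two sums, but process many bytes per step and defer the modulo reduction.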
        
         | londons_explore wrote:
         | If checksum calculation was any substantial portion of image
         | decoding, I think that would be a strong case for simply not
         | checking the checksum.
         | 
         | If you put corrupted data into a PNG decoder, I don't think
         | it's awfully important to most users whether they get a decode
         | error or a garbled image out.
        
           | wooosh wrote:
           | This was actually considered, and other libraries do ignore
           | checksums, or at least have options to:
           | 
           | https://github.com/richgel999/fpng/issues/9
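
To illustrate what ignoring the checksum means for the zlib stream inside a PNG, here is a small Python sketch using only the standard zlib module. It assumes a stream with the usual 2-byte header and no preset dictionary. The helper name inflate_unchecked is hypothetical, and this is not how fpng or any particular library implements the option.

```python
import zlib


def inflate_unchecked(stream: bytes) -> bytes:
    """Inflate a zlib stream without verifying its Adler-32 trailer."""
    # Negative wbits selects raw deflate: no zlib header, no checksum check.
    raw = zlib.decompressobj(-zlib.MAX_WBITS)
    # Skip the 2-byte zlib header; the 4-byte trailer lands in unused_data.
    return raw.decompress(stream[2:])


if __name__ == "__main__":
    payload = b"example scanline data " * 64
    stream = bytearray(zlib.compress(payload))
    stream[-1] ^= 0xFF  # corrupt the stored Adler-32, leave the data intact

    try:
        zlib.decompress(bytes(stream))
    except zlib.error as exc:
        print("checked decode failed:", exc)

    # The unchecked path still recovers the original bytes.
    print(inflate_unchecked(bytes(stream)) == payload)
```

In this sketch only the trailer is damaged, so the unchecked decode returns the original data. If the compressed bytes themselves were damaged, the same call could return garbled output or raise an error.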
        
       ___________________________________________________________________
       (page generated 2022-08-07 23:00 UTC)