[HN Gopher] Computing Adler32 Checksums at 41 GB/s
___________________________________________________________________
Computing Adler32 Checksums at 41 GB/s

Author : wooosh
Score  : 66 points
Date   : 2022-08-07 16:17 UTC (6 hours ago)

(HTM) web link (wooo.sh)
(TXT) w3m dump (wooo.sh)

| pizza wrote:
| Ooh now that is very interesting. I would really love to see how
| this speeds up the run-time of fpng as a whole, if you have any
| numbers. It looks like fjxl [0] and fpnge [1] (which also uses
| AVX2) are at the Pareto front for lossless image compression
| right now [2], but if this speeds things up significantly then
| it's possible there'll be a huge shakeup!
|
| [0] https://github.com/libjxl/libjxl/tree/main/experimental/fast...
|
| [1] https://github.com/veluca93/fpnge
|
| [2] https://twitter.com/richgel999/status/1485976101692358656
| bob1029 wrote:
| If image encode/decode speed is the _only_ concern,
| libjpeg-turbo is going to be orders of magnitude faster than any
| of these lossless schemes. With JPEG, you could encode 1080p
| bitmaps in <10 milliseconds (per thread) on any consumer PC
| made in the last decade.
|
| The frequency domain is a really powerful place to operate in
| when you are dealing with this amount of data.
| pizza wrote:
| That's not true. libjpeg-turbo is ~50 MB/s last I tried -
| plus it's not lossless. fjxl and fpnge are basically an order
| of magnitude faster than that. libjpeg-turbo isn't even the
| fastest JPEG codec - you should check out the (relatively
| obscure) libmango - roughly 1 gbps decode on a 2020 MacBook
| Pro - or nvJPEG for GPU-based JPEG decoding. Supposedly
| there are even faster GPU-based decoders than nvJPEG, too.
| bob1029 wrote:
| > GPU-based
|
| How does this impact the overall latency of encoding a
| single image?
| pizza wrote:
| Probably quite a bit, I don't know. The typical use case
| is to load up thousands of JPEGs at once to get good
| throughput despite copy overhead.
| You can see here the benchmark against jpeg-turbo:
| https://developer.nvidia.com/blog/leveraging-hardware-jpeg-d...
| averne_ wrote:
| I've written an open-source driver for the decoding side
| of the nvjpg module found in the Tegra X1 (i.e. an earlier
| hardware revision than the one in the A100).
|
| I did some quick benchmarks against libjpeg-turbo, if
| that can give you an idea. I expect encoding performance
| would be similar.
|
| https://github.com/averne/oss-nvjpg#performance
| wooosh wrote:
| Unfortunately I haven't had the time to do a proper benchmark,
| and the fpng test executable only decodes/encodes a single
| image, which produces very noisy/inconclusive results. However,
| I'm under the impression that it doesn't make a large
| difference in terms of overall time.
|
| fpnge (which I wasn't aware of until now) appears to already be
| using a very similar (identical?) algorithm, so I suspect the
| relative performance of fpng and fpnge would not be
| significantly impacted by this change.
| Nyan wrote:
| As someone who has been recently optimising fpnge, Adler32
| computation is pretty much negligible regarding overall
| runtime. The Huffman coding and filter search take up most of
| the time. (IIRC fpng doesn't do any filter search, but
| Huffman encoding isn't vectorized, so I'd expect that to
| dominate fpng's runtime)
| dougall wrote:
| Nice! (I've been meaning to write up this Apple M1 ~60GB/s
| version, which I think is similar:
| https://gist.github.com/dougallj/66151f1c509484a42fe0abd0d84... )
| jiggawatts wrote:
| I hope this brilliant work has been merged into the relevant open
| source libraries.
|
| Something that's unfair about the world is that work like this
| could reach billions of people and save a million dollars worth
| of time and electricity annually but is being done gratis.
|
| It would be amazing if there were charities that rewarded
| high-impact open source contributions like this proportionally
| to the benefits to humanity...
| daniel-cussen wrote:
| I love this kind of writeup. This is my idea of fun: speedups.
| TAForObvReasons wrote:
| While micro-optimizations are interesting, there are two
| questions left unanswered:
|
| - Does this change noticeably affect the total runtime? The
| checksum seems simple enough that the slight difference here
| wouldn't show up in PNG benchmarks.
|
| - The proposed solution uses AVX2, which is not currently used in
| the original codebase. Would any other part of the processing
| benefit from using newer instructions?
| londons_explore wrote:
| If checksum calculation was any substantial portion of image
| decoding, I think that would be a strong case for simply not
| checking the checksum.
|
| If you put corrupted data into a PNG decoder, I don't think
| it's awfully important to most users whether they get a decode
| error or a garbled image out.
| wooosh wrote:
| This was actually considered, and other libraries do ignore
| checksums, or at least have options to:
|
| https://github.com/richgel999/fpng/issues/9
___________________________________________________________________
(page generated 2022-08-07 23:00 UTC)