[HN Gopher] The QOI File Format Specification ___________________________________________________________________ The QOI File Format Specification Author : Aissen Score : 164 points Date : 2021-12-20 14:10 UTC (8 hours ago) (HTM) web link (phoboslab.org) (TXT) w3m dump (phoboslab.org) | Aissen wrote: | The experiment that led to this file format (QOI: Lossless Image | Compression in O(n) Time) was discussed on HN: | | https://news.ycombinator.com/item?id=29328750 | pcwalton wrote: | Neat, this is impressive! | | The first thing that comes to mind is that it's not very amenable | to parallel encoding or decoding, even less so than PNG. The fact | that QOI_OP_INDEX could reach arbitrarily far back basically | eliminates parallel decompression. But I guess the simplest | workaround for that would be to just break up your images into | tiles that are each decoded in parallel. You could imagine a | simple wrapper format that specifies how to reconstruct a larger | image out of such tiles, which would be amenable to parallelism | (albeit still not easily implementable on GPU, though I guess you | could wring a little bit of parallelism out of it by using | one GPU SIMD lane per RGBA channel). | | Of course, it's entirely possible that your decompression goes so | fast it doesn't matter that it's sequential and you're | bottlenecked on I/O in any reasonable use case... | CodesInChaos wrote: | The end marker feels awkward: | | * It shouldn't be necessary, since having encoded all pixels | already implies you're at the end. | | * It complicates the encoding/decoding of the indexing operation. | edflsafoiewq wrote: | It doesn't really complicate the encoder. Any real encoder | would never emit 7 OP_INDEX to the same index, it would use an | RLE instead. | formerly_proven wrote: | This is very nice. | | If I had to nitpick something, I'd say choosing all-big-endian is | kinda odd in 2021 but it's probably not a biggy even for | something like this.
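Concretely, the "index" pcwalton mentions above is the spec's 64-entry running array: every pixel the decoder produces is written into it at a hashed position, so a QOI_OP_INDEX byte can refer to a color first seen arbitrarily far back in the stream. The hash below is taken straight from the specification (the struct and function names are illustrative, not the reference implementation's):

```c
#include <stdint.h>

/* Position of a color in QOI's 64-entry running index, as defined in
   the specification: (r*3 + g*5 + b*7 + a*11) % 64. Every decoded
   pixel is stored at this position, so an index op may resurrect a
   color from anywhere earlier in the stream -- which is what defeats
   parallel decoding. */
typedef struct { uint8_t r, g, b, a; } qoi_rgba_t;

static int qoi_color_hash(qoi_rgba_t c) {
    return (c.r * 3 + c.g * 5 + c.b * 7 + c.a * 11) % 64;
}
```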
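On the big-endian nitpick: the only multi-byte fields in a QOI file are the 32-bit width and height in the header, and reading them byte-by-byte is portable regardless of host endianness. A minimal sketch (the function name is mine, not from the reference decoder):

```c
#include <stdint.h>

/* Endian-independent read of a big-endian 32-bit value, e.g. the
   width/height fields of the QOI header. Assembling the value one
   byte at a time gives the same result on little- and big-endian
   hosts, so no explicit swap is ever needed. */
static uint32_t read_u32_be(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```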
| phoboslab wrote: | Well, it's easier to read in a hex editor and seemed like the | right thing to do for a "wire format". | | The only places where it matters are the header's width/height | and the order of RGB(A) data. Storing RGBA as ABGR is not a great | choice, imho. If you want to be endian independent, you have to | read one byte at a time anyway (or swap). | wheybags wrote: | Storing rgba as abgr is a _terrible_ choice, but are you sure | they're actually doing that? | phoboslab wrote: | Sorry, I see how my previous comment was misleading. To | clarify: No, QOI is not doing that. QOI stores RGBA as | R,G,B,A because it is big endian. | | Little endian would imply that RGBA is stored as A,B,G,R. | Which was my argument against using LE for a file format. | Taywee wrote: | I don't believe it would imply that, because those are | independent elements. RGBA would still be stored as R, G, | B, A unless you're considering each pixel as a single | 32-bit integer. A C structure for the colors would even | live in memory in the order you expect. Just like a | string is still stored byte-by-byte even in little-endian | contexts. | | LE vs BE would only affect multi-byte integers and | floats, not arrays or structures of single-byte elements. | jandrese wrote: | ARGB would be nice if you want to integrate with Cairo. | mgaunard wrote: | why are channels interleaved at all if speed of decompression | was the goal? | user-the-name wrote: | Because almost everywhere you would use the data will expect it | as interleaved. Separating the channels would require a | second pass to re-interleave the data. | gpvos wrote: | The bytes in the compressed data aren't going to be aligned to | 2-, 4- or 8-byte offsets, so the overhead is going to be | nonexistent or negligible. | thechao wrote: | Support for 16/32b and fp-formats would've been nice -- the | compressor is actually orthogonal to the bit width of the | channels.
Custom fp-compression usually just involves moving | the sign bit into the LSB, and delta encoding the mantissa | separately from the exponent -- although the latter isn't as | important. | adgjlsfhk1 wrote: | This was my initial thought, but given how the format is | designed, you would pretty much have to make a whole new set | of tags for 16 or 32 bit images. | thechao wrote: | Yeah... seems like a small family would be in the right | vein. I'm immediately drawn to a block format, but that'd | probably double the complexity. | MisterTea wrote: | Honestly no one should worry about endianness in 2021. | cassepipe wrote: | https://justine.lol/endian.html | MisterTea wrote: | Yup, as soon as you see anything that checks platform | endianness, uses memcpy, or assumes memory layout, run. | jonsneyers wrote: | See also: | https://twitter.com/jonsneyers/status/1472959101416226823?t=... | | On my laptop, QOI encodes a test corpus at about 78 MP/s to a | density of about 8.0 bits per pixel. Fast png encoding with fpng | is 71 MP/s and reaches about 7.8 bpp. Fast lossless jxl encoding | can be done at 162 MP/s (single-core, faster with more cores) and | reaches about 6.8 bpp. It can also be decoded in parallel, unlike | QOI and PNG. It doesn't have a single-page spec though :) | adgjlsfhk1 wrote: | Note that the reference QOI implementation is not speed | focused. Some of the other ones are almost twice as fast. | IshKebab wrote: | This is pretty cool. Though... | | > all data is now encoded as big-endian. | | Why? If the goal is to be fast and simple then surely little | endian would make more sense given that basically all processors | are little endian. Big endian just means you need to swap | everything twice. | | Edit: Seems like I'm not the only person that thought this was a | super weird choice: https://github.com/phoboslab/qoi/issues/36 | | The reasoning is: | | > The file format for QOI will not change anymore.
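The sign-bit trick thechao describes above amounts to a plain bit rotation: move the IEEE-754 sign bit from the MSB down to the LSB so that small values of either sign map to nearby integers, which delta-codes better. A generic sketch of that preconditioning step (not anything QOI specifies):

```c
#include <stdint.h>

/* Generic float-preconditioning sketch, not part of QOI: rotate the
   IEEE-754 sign bit from bit 31 to bit 0 before delta coding, and
   rotate it back when decoding. Values near zero of either sign then
   encode as small integers. */
static uint32_t sign_to_lsb(uint32_t bits) {
    return (bits << 1) | (bits >> 31);
}

static uint32_t sign_from_lsb(uint32_t coded) {
    return (coded >> 1) | (coded << 31);
}
```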
| mahkoh wrote: | Looking at the spec, the only multi-byte data are the | width/height in the header. So only two 4-byte swaps per file. | politician wrote: | The link contains a link to the discussion on QOI v2, | https://github.com/nigeltao/qoi2-bikeshed/issues. | Zababa wrote: | I think refusing a 5% speedup to ensure that the format will | not change fits really well with the goals of the project. | ape4 wrote: | Since it's simple and there have been vulnerabilities in other | image readers... a mathematically proven reader might be a nice | thing to sell it. | Zababa wrote: | There is, I think, one thing missing: a comprehensive test suite | that checks all the possible edge cases. | | Edit: as you can see below, the benchmark also checks for | correctness so you can disregard this. | phoboslab wrote: | What exactly do you mean by "edge cases"? For what it's | worth, the decoder has been fuzzed[1] and the benchmark | suite[2] en- and decodes its 2800 images without losing any | pixels, as checked by qoibench[3]. | | [1] https://github.com/phoboslab/qoi/blob/master/qoifuzz.c | | [2] https://qoiformat.org/benchmark/ | | [3] | https://github.com/phoboslab/qoi/blob/master/qoibench.c#L440... | Zababa wrote: | Oh, I didn't realize that the benchmark also checked for | correctness. | fleabitdev wrote: | I've dealt with image decoding for a game engine before. The | images in question were large pixel-art texture atlases, stored | as PNG files and loaded at startup. Their slow decoding speed | caught me by surprise, given that the file format is 25 years | old! | | The reason turned out to be that Deflate's design makes it very | hard to parallelise or SIMD-accelerate. Even the best available | decoders are basically forced to process a byte at a time, in a | straight line from start to finish, which obviously isn't a good | match for modern CPUs.
The 3x to 4x decompression speedup here is | nice, but I can't help but feel a little sad about how poorly | this format is taking advantage of available compute resources. | (The ultimate dream would be a format which is so parallel- | friendly that you can just send the binary blob to the GPU and | decompress it in a compute shader!) | | Even a rule like "each row is compressed separately, with a table | of row lengths at the beginning of the file" might have been | enough - this would have made compression ratios slightly worse, | but complexity wouldn't have suffered too much. With only six | different chunk types, we could perhaps imagine a branchless | decoder where each row's decoding state is stored in its own SIMD | lane, and the results for several rows are all emitted at the | same time... | asmar wrote: | An interesting article regarding SIMD in libPNG (although not | in deflate as complained about): https://optidash.ai/blog/the- | case-of-the-missing-simd-code | wongarsu wrote: | Apple apparently ships a PNG encoder that completely flushes | the deflate stream every now and then, giving you additional | offsets where you can start decompressing without dependencies | on previous data. The offsets are then saved as a separate | chunk in the PNG. Decoders aware of this can parallelize | decoding using these offsets, everyone else can read it as | normal. | | The proper solution would of course be to use something more | modern than deflate, the PNG format even has a metadata field | that specifies the used compression algorithm. But anything but | deflate wouldn't be backwards compatible, so nobody seems to | use that option. | ByThyGrace wrote: | > you can just send the binary blob to the GPU and decompress | it in a compute shader | | Surely this exists already? | wolf550e wrote: | I think texture compression is always lossy, so it's not | directly comparable to this or to PNG. So I don't think it | exists. See ASTC and BC7. 
| | Texture compression can be very advanced: | https://cbloomrants.blogspot.com/2020/06/oodle-texture- | slash... | fleabitdev wrote: | Reading that blog post, I was surprised to learn that many | modern games dedicate more than half of their filesize to | textures. I haven't played an AAA game in more than a | decade, but I would have thought that meshes and | (particularly) audio would use up more space. | | It sounds like developers are stuck between a rock and a | hard place. They need one of the specific compressed pixel | formats which can be efficiently read by the GPU, but those | formats are about ten times larger than a similar JPEG | file, they don't losslessly compress well (modulo RAD Game | Tools' discoveries here), and recompressing raw pixels to a | GPU-addressable format at load time would be orders-of- | magnitude too slow. | | RAD Game Tools' approach here is clever, but it feels like | a bit of a hack. The obvious next step would be a lossy | compressed image format which can decompress directly to | BC7, exploiting spatial-correlation tricks similar to those | which PNG uses to get better results than a gzipped BMP | file. Has anybody tried this already? | modeless wrote: | I wouldn't call Oodle Texture a hack. But there's also | https://github.com/BinomialLLC/crunch and | https://github.com/BinomialLLC/basis_universal. The | latter has the advantage that it can be decoded to | multiple different compressed texture formats, so that | all GPUs can be supported from a single file (different | GPUs support different formats so you can't necessarily | send the same compressed texture to any GPU). | fleabitdev wrote: | Exactly what I was looking for, thanks! :) | pcwalton wrote: | > The 3x to 4x decompression speedup here is nice, but I can't | help but feel a little sad about how poorly this format is | taking advantage of available compute resources. 
(The ultimate | dream would be a format which is so parallel-friendly that you | can just send the binary blob to the GPU and decompress it in a | compute shader!) | | You can't usefully decompress the _entire_ thing on the GPU, | but you can do half of it there. You can decompress DEFLATE | right before sending the data over the bus and do the PNG | prediction (filtering) in a compute shader. I actually | implemented this once (PNG prediction is amenable to wavefront | parallelism) and it worked fine, though it wasn't any faster | than using SIMD on the CPU: because each row can specify a | different prediction method, you end up with a lot of | divergence in the shader. Still, it could get some load off | the CPU if your loading is CPU-bound. | robalni wrote: | When looking at the benchmark result and image size, it seems | like this format is good at images that have horizontal | structures or where the same color is repeated often. It's not | good at vertical structures, including images with enlarged | pixels (where one row is identical to the one above). I think the | main reason for that is that PNG has a filter pass that can look | at the row of pixels above, while this format only looks to the | left. | | Example of image with vertical gradients: | https://qoiformat.org/benchmark/images/icon_512/apps-prefere... - | qoi compresses to 186kb and libpng to 25kb | | Example of image with horizontal structure and repeated colors: | https://qoiformat.org/benchmark/images/textures_pk/mod_walld... - | qoi compresses to 21kb and libpng to 33kb | adgjlsfhk1 wrote: | One thing worth noting is that QOI can be pretty easily wrapped | with something else to change pixel traversal order. If you | want higher compression, I'd recommend wrapping QOI with ZSTD | (or similar) compression on the output, and something to change | the pixel traversal order on the input.
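robalni's observation above comes down to PNG's per-row filters. The "Up" filter (filter type 2 in the PNG spec) stores each byte as its difference from the byte directly above, so vertical gradients and repeated rows collapse into runs of zeros and small residuals that deflate then squeezes well; QOI has no equivalent vertical predictor. A sketch of that filter:

```c
#include <stddef.h>
#include <stdint.h>

/* PNG "Up" filter (filter type 2): replace each byte with its
   difference, mod 256, from the byte directly above it in the
   previous row. Rows that repeat or vary smoothly vertically become
   near-zero residuals -- exactly the structure QOI's left-looking
   ops cannot exploit. */
static void png_filter_up(uint8_t *row, const uint8_t *prev, size_t n) {
    for (size_t i = 0; i < n; i++) {
        row[i] = (uint8_t)(row[i] - prev[i]);
    }
}
```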
| Akronymus wrote: | > and something to change the pixel traversal order on the | input. | | I could see a variety of space filling curves working for | that. | ErikCorry wrote: | I experimented with adding this to the embedded system Toit. I | wrote up the experience here | https://medium.com/@erik_68861/trying-out-the-quite-ok-image... | and also used it for this little demo on a 2 inch screen: | https://twitter.com/erikcorry/status/1470023885026467848 | | It was pretty good, but in the end I don't think I'll be making | it part of the graphics system. I would really like a bit of | random access into the image without having to divide it up into | multiple tiles. And the spec took a direction that didn't really | suit me - I wanted to use it for GUI-like textures which have | slabs of colours and anti-aliased edges with varying alpha. In | the final version of the spec there's no way to code "the alpha | changed, but the color is the same" without spending 5 bytes on | a single pixel. Previously that took 2 bytes. | | So I'm still looking for a format with: | | * Very small amount of decode state (GIF's 16k is probably too | much - PNG's 32k is certainly too much). | | * Fast decode (these embedded chips are not as fast as desktops). | | * Alpha channel (not just on-off like GIF) | | * A bit of random access like the parallel PNG, but hopefully | less bolted-on. | FullyFunctional wrote: | All these points are spot on, but I want to reorder the | priority. I cannot believe that a format is introduced in this | millennium that doesn't have a single thought to aiding | parallel access/decode. The single-thread focus will not age | well; even low-end uProcessors are going multicore today, and | going forward that will only accelerate. | kevingadd wrote: | Even ignoring parallelism, images increasingly need to go | straight to the GPU, so improving the performance of access | on-GPU (i.e.
using a GPU compressed texture format) can have | a bigger positive impact on the user experience than making | the file on disk smaller. Adopting Basis to compress your | images could in practice be much better than adopting QOI, | and it should be benchmarked against options like that along | with PNG or JPEG. | modeless wrote: | Have you considered compressed texture formats, perhaps DXT5? | Fixed compression ratio (1 byte per pixel for DXT5), arbitrary | random access, decompress the pixels you need on the fly. Can | be further compressed for efficient storage to about 1-2 bits | per pixel with https://github.com/BinomialLLC/crunch. | choeger wrote: | This is going to be super useful for educational purposes. I'd | really like to see more file formats like this. | | There's already a bunch of minimal programming languages and even | some operating systems. It's great value for learning stuff when | you can implement a quite OK version of something in dozens of | hours or less. | ape4 wrote: | I like how simple it is. Of course, getting a new format adopted | is pretty difficult. ___________________________________________________________________ (page generated 2021-12-20 23:00 UTC)