[HN Gopher] The QOI File Format Specification
       ___________________________________________________________________
        
       The QOI File Format Specification
        
       Author : Aissen
       Score  : 164 points
       Date   : 2021-12-20 14:10 UTC (8 hours ago)
        
 (HTM) web link (phoboslab.org)
 (TXT) w3m dump (phoboslab.org)
        
       | Aissen wrote:
       | The experiment that led to this file format (QOI: Lossless Image
       | Compression in O(n) Time) was discussed on HN:
       | 
       | https://news.ycombinator.com/item?id=29328750
        
       | pcwalton wrote:
       | Neat, this is impressive!
       | 
       | The first thing that comes to mind is that it's not very amenable
       | to parallel encoding or decoding, even less so than PNG. The fact
       | that QOI_OP_INDEX could reach arbitrarily far back basically
       | eliminates parallel decompression. But I guess the simplest
       | workaround for that would be to just break up your images into
       | tiles that are each decoded in parallel. You could imagine a
       | simple wrapper format that specifies how to reconstruct a larger
       | image out of such tiles, which would be amenable to parallelism
       | (albeit still not easily implementable on GPU, though I guess you
       | could wring a little bit of parallelism out of it by using
       | one GPU SIMD lane per RGBA channel).
       | 
       | Of course, it's entirely possible that your decompression goes so
       | fast it doesn't matter that it's sequential and you're
       | bottlenecked on I/O in any reasonable use case...
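       |
       | A rough sketch of what such a hypothetical tile wrapper could
       | look like (nothing here is part of QOI itself; all names are
       | made up):
       |
       |   #include <stdint.h>
       |
       |   /* Hypothetical container: a grid of independently
       |      decodable QOI tiles, so each tile can be handed to
       |      its own thread. */
       |   struct qoit_header {
       |       char     magic[4];   /* e.g. "qoit" */
       |       uint32_t width;      /* full image width in pixels */
       |       uint32_t height;     /* full image height in pixels */
       |       uint32_t tile_size;  /* tile edge length, e.g. 256 */
       |       /* followed by one uint32_t byte offset per tile, so
       |          a decoder can seek to any tile and decode it in
       |          parallel with the others */
       |   };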
        
       | CodesInChaos wrote:
       | The end marker feels awkward:
       | 
       | * It shouldn't be necessary, since having encoded all pixels
       | already implies you're at the end.
       | 
       | * It complicates the encoding/decoding of the indexing operation.
        
         | edflsafoiewq wrote:
         | It doesn't really complicate the encoder. Any real encoder
         | would never emit seven OP_INDEX chunks for the same index; it
         | would use a run (RLE) instead.
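         |
         | For reference, the byte values in question (from the spec):
         |
         |   #define QOI_OP_INDEX 0x00 /* 00iiiiii: 6-bit index     */
         |   #define QOI_OP_RUN   0xc0 /* 11rrrrrr: run of 1..62 px */
         |   /* The end marker is seven 0x00 bytes followed by one
         |      0x01 byte, which is what a string of repeated
         |      "index 0" chunks would collide with -- and why an
         |      encoder emits QOI_OP_RUN for repeated pixels. */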
        
       | formerly_proven wrote:
       | This is very nice.
       | 
       | If I had to nitpick something, I'd say choosing all-big-endian is
       | kinda odd in 2021 but it's probably not a biggy even for
       | something like this.
        
         | phoboslab wrote:
         | Well, it's easier to read in a hex editor and seemed like the
         | right thing to do for a "wire format".
         | 
         | The only places where it matters are the header's width/height
         | and order of RGB(A) data. Storing RGBA as ABGR is not a great
         | choice, imho. If you want to be endian independent, you have to
         | read one byte at a time anyway (or swap).
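         |
         | Reading the 32-bit header fields one byte at a time works
         | the same on any host; a minimal sketch (not the reference
         | code):
         |
         |   #include <stdint.h>
         |
         |   /* Read a big-endian 32-bit value (e.g. the width or
         |      height in the QOI header) independent of host
         |      endianness. */
         |   static uint32_t read_be32(const uint8_t *p) {
         |       return ((uint32_t)p[0] << 24) |
         |              ((uint32_t)p[1] << 16) |
         |              ((uint32_t)p[2] <<  8) |
         |               (uint32_t)p[3];
         |   }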
        
           | wheybags wrote:
           | Storing rgba as abgr is a _terrible_ choice, but are you
           | sure they're actually doing that?
        
             | phoboslab wrote:
             | Sorry, I see how my previous comment was misleading. To
             | clarify: No, QOI is not doing that. QOI stores RGBA as
             | R,G,B,A because it is big endian.
             | 
             | Little endian would imply that RGBA is stored as A,B,G,R.
             | Which was my argument against using LE for a file format.
        
               | Taywee wrote:
               | I don't believe it would imply that, because those are
               | independent elements. RGBA would still be stored as R, G,
               | B, A unless you're considering each pixel as a single
               | 32-bit integer. A C structure for the colors would even
               | live in memory in the order you expect. Just like a
               | string is still stored byte-by-byte even in little-endian
               | contexts.
               | 
               | LE vs BE would only affect multi-byte integers and
               | floats, not arrays or structures of single-byte elements.
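                |
                | A tiny illustration of the distinction (my own
                | sketch, not from the QOI code):
                |
                |   #include <stdint.h>
                |
                |   /* Same layout on any host: four single-byte
                |      fields are stored in declaration order. */
                |   struct pixel { uint8_t r, g, b, a; };
                |
                |   /* Endianness only matters if you reinterpret
                |      the pixel as one 32-bit integer, e.g.
                |      0xRRGGBBAA, and write that integer out. */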
        
           | jandrese wrote:
           | ARGB would be nice if you want to integrate with Cairo.
        
           | mgaunard wrote:
           | why are channels interleaved at all if speed of decompression
           | was the goal?
        
             | user-the-name wrote:
             | Because almost everything you would feed the data to
             | expects it interleaved. Separating the channels would
             | require a second pass to re-interleave the data.
        
         | gpvos wrote:
         | The bytes in the compressed data aren't going to be aligned to
         | 2-, 4- or 8-byte offsets, so the overhead is going to be
         | nonexistent or negligible.
        
         | thechao wrote:
         | Support for 16/32b and fp-formats would've been nice -- the
         | compressor is actually orthogonal to the bit width of the
         | channels. Custom fp-compression usually just involves moving
         | the sign bit into the LSB, and delta encoding the mantissa
         | separately from the exponent -- although the latter isn't as
         | important.
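         |
         | A minimal sketch of that transform (my own illustration,
         | not anything QOI specifies):
         |
         |   #include <stdint.h>
         |   #include <string.h>
         |
         |   /* Rotate the sign bit into the LSB, then split off the
         |      exponent so exponent and mantissa can be delta coded
         |      as separate streams. */
         |   static void split_f32(float f, uint8_t *exp_out,
         |                         uint32_t *man_out) {
         |       uint32_t u;
         |       memcpy(&u, &f, sizeof u);       /* bit-cast */
         |       u = (u << 1) | (u >> 31);       /* sign -> LSB */
         |       *exp_out = (uint8_t)(u >> 24);  /* 8-bit exponent */
         |       *man_out = u & 0x00ffffff;      /* mantissa + sign */
         |   }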
        
           | adgjlsfhk1 wrote:
           | This was my initial thought, but given how the format is
           | designed, you would pretty much have to make a whole new set
           | of tags for 16 or 32 bit images.
        
             | thechao wrote:
             | Yeah... seems like a small family would be in the right
             | vein. I'm immediately drawn to a block format, but that'd
             | probably double the complexity.
        
         | MisterTea wrote:
         | Honestly no one should worry about endianness in 2021.
        
           | cassepipe wrote:
           | https://justine.lol/endian.html
        
             | MisterTea wrote:
             | Yup, as soon as you see anything that checks platform
             | endianness, uses memcpy, or assumes memory layout, run.
        
       | jonsneyers wrote:
       | See also:
       | https://twitter.com/jonsneyers/status/1472959101416226823?t=...
       | 
       | On my laptop, QOI encodes a test corpus at about 78 MP/s to a
       | density of about 8.0 bits per pixel. Fast png encoding with fpng
       | is 71 MP/s and reaches about 7.8 bpp. Fast lossless jxl encoding
       | can be done at 162 MP/s (single-core, faster with more cores) and
       | reaches about 6.8 bpp. It can also be decoded in parallel, unlike
       | QOI and PNG. It doesn't have a single-page spec though :)
        
         | adgjlsfhk1 wrote:
         | Note that the reference QOI implementation is not speed
         | focused. Some of the other ones are almost twice as fast.
        
       | IshKebab wrote:
       | This is pretty cool. Though...
       | 
       | > all data is now encoded as big-endian.
       | 
       | Why? If the goal is to be fast and simple then surely little
       | endian would make more sense given that basically all processors
       | are little endian. Big endian just means you need to swap
       | everything twice.
       | 
       | Edit: Seems like I'm not the only person that thought this was a
       | super weird choice: https://github.com/phoboslab/qoi/issues/36
       | 
       | The reasoning is:
       | 
       | > The file format for QOI will not change anymore.
        
         | mahkoh wrote:
         | Looking at the spec, the only multi-byte data are the
         | width/height in the header. So only two 4-byte swaps per file.
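         |
         | For reference, the whole 14-byte header from the spec, as a
         | C struct (width and height are the only multi-byte fields
         | in the entire file):
         |
         |   #include <stdint.h>
         |
         |   struct qoi_header {
         |       char     magic[4];   /* "qoif" */
         |       uint32_t width;      /* big-endian */
         |       uint32_t height;     /* big-endian */
         |       uint8_t  channels;   /* 3 = RGB, 4 = RGBA */
         |       uint8_t  colorspace; /* 0 = sRGB, 1 = all linear */
         |   };
         |
         | (In practice you read it field by field; a raw fread of the
         | struct isn't portable because of padding and endianness.)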
        
         | politician wrote:
         | The linked issue points to the discussion on QOI v2,
         | https://github.com/nigeltao/qoi2-bikeshed/issues.
        
         | Zababa wrote:
         | I think refusing a 5% speedup to ensure that the format will
         | not change fits really well with the goals of the project.
        
       | ape4 wrote:
       | Since it's simple and there have been vulnerabilities in other
       | image readers... a mathematically proven reader might be a nice
       | selling point.
        
       | Zababa wrote:
       | There is, I think, one thing missing: a comprehensive test suite
       | that checks all the possible edge cases.
       | 
       | Edit: as you can see below, the benchmark also checks for
       | correctness so you can disregard this.
        
         | phoboslab wrote:
         | What exactly do you mean by "edge cases"? For what it's
         | worth, the decoder has been fuzzed[1] and the benchmark
         | suite[2] of 2800 images encodes and decodes without losing
         | any pixels, as checked by qoibench[3].
         | 
         | [1] https://github.com/phoboslab/qoi/blob/master/qoifuzz.c
         | 
         | [2] https://qoiformat.org/benchmark/
         | 
         | [3]
         | https://github.com/phoboslab/qoi/blob/master/qoibench.c#L440...
        
           | Zababa wrote:
           | Oh, I didn't realize that the benchmark also checked for
           | correctness.
        
       | fleabitdev wrote:
       | I've dealt with image decoding for a game engine before. The
       | images in question were large pixel-art texture atlases, stored
       | as PNG files and loaded at startup. Their slow decoding speed
       | caught me by surprise, given that the file format is 25 years
       | old!
       | 
       | The reason turned out to be that Deflate's design makes it very
       | hard to parallelise or SIMD-accelerate. Even the best available
       | decoders are basically forced to process a byte at a time, in a
       | straight line from start to finish, which obviously isn't a good
       | match for modern CPUs. The 3x to 4x decompression speedup here is
       | nice, but I can't help but feel a little sad about how poorly
       | this format is taking advantage of available compute resources.
       | (The ultimate dream would be a format which is so parallel-
       | friendly that you can just send the binary blob to the GPU and
       | decompress it in a compute shader!)
       | 
       | Even a rule like "each row is compressed separately, with a table
       | of row lengths at the beginning of the file" might have been
       | enough - this would have made compression ratios slightly worse,
       | but complexity wouldn't have suffered too much. With only six
       | different chunk types, we could perhaps imagine a branchless
       | decoder where each row's decoding state is stored in its own SIMD
       | lane, and the results for several rows are all emitted at the
       | same time...
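       |
       | Even a layout as simple as this would let a decoder assign
       | one worker (thread or SIMD lane) per row (purely hypothetical,
       | not something QOI defines):
       |
       |   #include <stdint.h>
       |
       |   /* Hypothetical row-parallel container: the row_len table
       |      up front tells each worker where "its" row starts, so
       |      rows can be decoded independently. */
       |   struct rowwise_header {
       |       uint32_t width;
       |       uint32_t height;
       |       /* uint32_t row_len[height]; -- compressed size of
       |          each row; row i starts at the prefix sum of
       |          row_len[0..i-1] after this table */
       |   };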
        
         | asmar wrote:
         | An interesting article regarding SIMD in libPNG (although not
         | about Deflate, which was the complaint above):
         | https://optidash.ai/blog/the-case-of-the-missing-simd-code
        
         | wongarsu wrote:
         | Apple apparently ships a PNG encoder that completely flushes
         | the deflate stream every now and then, giving you additional
         | offsets where you can start decompressing without dependencies
         | on previous data. The offsets are then saved as a separate
         | chunk in the PNG. Decoders aware of this can parallelize
         | decoding using these offsets, everyone else can read it as
         | normal.
         | 
         | The proper solution would of course be to use something more
         | modern than deflate, the PNG format even has a metadata field
         | that specifies the compression algorithm used. But anything
         | other than deflate wouldn't be backwards compatible, so
         | nobody seems to use that option.
        
         | ByThyGrace wrote:
         | > you can just send the binary blob to the GPU and decompress
         | it in a compute shader
         | 
         | Surely this exists already?
        
           | wolf550e wrote:
           | I think texture compression is always lossy, so it's not
           | directly comparable to this or to PNG. So I don't think it
           | exists. See ASTC and BC7.
           | 
           | Texture compression can be very advanced:
           | https://cbloomrants.blogspot.com/2020/06/oodle-texture-
           | slash...
        
             | fleabitdev wrote:
             | Reading that blog post, I was surprised to learn that many
             | modern games dedicate more than half of their filesize to
             | textures. I haven't played an AAA game in more than a
             | decade, but I would have thought that meshes and
             | (particularly) audio would use up more space.
             | 
             | It sounds like developers are stuck between a rock and a
             | hard place. They need one of the specific compressed pixel
             | formats which can be efficiently read by the GPU, but those
             | formats are about ten times larger than a similar JPEG
             | file, they don't losslessly compress well (modulo RAD Game
             | Tools' discoveries here), and recompressing raw pixels to a
             | GPU-addressable format at load time would be orders-of-
             | magnitude too slow.
             | 
             | RAD Game Tools' approach here is clever, but it feels like
             | a bit of a hack. The obvious next step would be a lossy
             | compressed image format which can decompress directly to
             | BC7, exploiting spatial-correlation tricks similar to those
             | which PNG uses to get better results than a gzipped BMP
             | file. Has anybody tried this already?
        
               | modeless wrote:
               | I wouldn't call Oodle Texture a hack. But there's also
               | https://github.com/BinomialLLC/crunch and
               | https://github.com/BinomialLLC/basis_universal. The
               | latter has the advantage that it can be decoded to
               | multiple different compressed texture formats, so that
               | all GPUs can be supported from a single file (different
               | GPUs support different formats so you can't necessarily
               | send the same compressed texture to any GPU).
        
               | fleabitdev wrote:
               | Exactly what I was looking for, thanks! :)
        
         | pcwalton wrote:
         | > The 3x to 4x decompression speedup here is nice, but I can't
         | help but feel a little sad about how poorly this format is
         | taking advantage of available compute resources. (The ultimate
         | dream would be a format which is so parallel-friendly that you
         | can just send the binary blob to the GPU and decompress it in a
         | compute shader!)
         | 
         | You can't usefully decompress the _entire_ thing on the GPU,
         | but you can do half of it there. You can decompress DEFLATE
         | right before sending the data over the bus and do the PNG
         | prediction (filtering) in a compute shader. I actually
         | implemented this once (PNG prediction is amenable to wavefront
         | parallelism) and it worked fine, though it wasn't any faster
         | than using SIMD on the CPU because the fact that each row can
         | specify a different prediction method means that you end up
         | with a lot of divergence in the shader. Still, it could get
         | some load off the CPU if your loading is CPU-bound.
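         |
         | For anyone wondering where the dependency comes from: PNG's
         | Paeth filter predicts each byte from the left, above and
         | above-left neighbours, which is what forces the wavefront
         | shape. The predictor itself (per the PNG spec):
         |
         |   #include <stdlib.h>
         |
         |   /* a = left, b = above, c = above-left, all already
         |      defiltered, so pixel (x, y) depends on (x-1, y),
         |      (x, y-1) and (x-1, y-1). */
         |   static unsigned char paeth(int a, int b, int c) {
         |       int p  = a + b - c;
         |       int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
         |       if (pa <= pb && pa <= pc) return (unsigned char)a;
         |       if (pb <= pc)             return (unsigned char)b;
         |       return (unsigned char)c;
         |   }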
        
       | robalni wrote:
       | When looking at the benchmark result and image size, it seems
       | like this format is good at images that have horizontal
       | structures or where the same color is repeated often. It's not
       | good at vertical structures, including images with enlarged
       | pixels (where one row is identical to the one above). I think the
       | main reason for that is that PNG has a filter pass that can look
       | at the row of pixels above, while this format only looks to the
       | left.
       | 
       | Example of image with vertical gradients:
       | https://qoiformat.org/benchmark/images/icon_512/apps-prefere... -
       | qoi compresses to 186kb and libpng to 25kb
       | 
       | Example of image with horizontal structure and repeated colors:
       | https://qoiformat.org/benchmark/images/textures_pk/mod_walld... -
       | qoi compresses to 21kb and libpng to 33kb
        
         | adgjlsfhk1 wrote:
         | One thing worth noting is that QOI can be pretty easily wrapped
         | with something else to change pixel traversal order. If you
         | want higher compression, I'd recommend wrapping QOI with ZSTD
         | (or similar) compression on the output, and something to change
         | the pixel traversal order on the input.
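         |
         | Something like this, sketched against the reference qoi.h
         | and libzstd (error handling omitted, and I'm assuming
         | qoi_encode(data, desc, &len) as the API):
         |
         |   #include <stdlib.h>
         |   #include <zstd.h>
         |   #include "qoi.h"
         |
         |   /* QOI first, then a generic compressor on its output:
         |      QOI removes the cheap redundancy in O(n), zstd picks
         |      up the longer-range repetition QOI can't see. */
         |   void *qoi_then_zstd(const void *pix,
         |                       const qoi_desc *desc,
         |                       size_t *out_len) {
         |       int qoi_len = 0;
         |       void *qoi_buf = qoi_encode(pix, desc, &qoi_len);
         |       size_t bound = ZSTD_compressBound((size_t)qoi_len);
         |       void *dst = malloc(bound);
         |       *out_len = ZSTD_compress(dst, bound, qoi_buf,
         |                                (size_t)qoi_len, 3);
         |       free(qoi_buf);
         |       return dst;
         |   }
         |
         | The traversal-order change would go in front of qoi_encode,
         | e.g. feeding it tiles or a different scan order of the
         | pixels.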
        
           | Akronymus wrote:
           | > and something to change the pixel traversal order on the
           | input.
           | 
           | I could see a variety of space filling curves working for
           | that.
        
       | ErikCorry wrote:
       | I experimented with adding this to the embedded system Toit. I
       | wrote up the experience here
       | https://medium.com/@erik_68861/trying-out-the-quite-ok-image...
       | and also used it for this little demo on a 2 inch screen:
       | https://twitter.com/erikcorry/status/1470023885026467848
       | 
       | It was pretty good, but in the end I don't think I'll be making
       | it part of the graphics system. I would really like a bit of
       | random access into the image without having to divide it up into
       | multiple tiles. And the spec took a direction that didn't really
       | suit me - I wanted to use it for GUI-like textures which have
       | slabs of colours and anti-aliased edges with varying alpha. In
       | the final version of the spec there's no way to code "the alpha
       | changed, but the color is the same" without spending 5 bytes on
       | a single pixel. Previously that took 2 bytes.
       | 
       | So I'm still looking for a format with:
       | 
       | * Very small amount of decode state (GIF's 16k is probably too
       | much - PNG's 32k is certainly too much).
       | 
       | * Fast decode (these embedded chips are not as fast as desktops).
       | 
       | * Alpha channel (not just on-off like GIF)
       | 
       | * A bit of random access like the parallel PNG, but hopefully
       | less bolted-on.
        
         | FullyFunctional wrote:
         | All these points are spot on, but I want to reorder the
         | priority. I cannot believe that a format introduced in this
         | millennium doesn't give a single thought to aiding parallel
         | access/decode. The single-thread focus will not age well;
         | even low-end uProcessors are going multicore today, and
         | going forward that will only accelerate.
        
           | kevingadd wrote:
           | Even ignoring parallelism, images increasingly need to go
           | straight to the GPU, so improving the performance of access
           | on-GPU (i.e. using a GPU compressed texture format) can have
           | a bigger positive impact on the user experience than making
           | the file on disk smaller. Adopting Basis to compress your
           | images could in practice be much better than adopting QOI,
           | and it should be benchmarked against options like that along
           | with PNG or JPEG.
        
         | modeless wrote:
         | Have you considered compressed texture formats, perhaps DXT5?
         | Fixed compression ratio (1 byte per pixel for DXT5), arbitrary
         | random access, decompress the pixels you need on the fly. Can
         | be further compressed for efficient storage to about 1-2 bits
         | per pixel with https://github.com/BinomialLLC/crunch.
        
       | choeger wrote:
       | This is going to be super useful for educational purposes. I'd
       | really like to see more file formats like this.
       | 
       | There's already a bunch of minimal programming languages and even
       | some operating systems. It's great value for learning stuff when
       | you can implement a quite OK version of something in dozens of
       | hours or less.
        
       | ape4 wrote:
       | I like how simple it is. Of course, getting a new format adopted
       | is pretty difficult.
        
       ___________________________________________________________________
       (page generated 2021-12-20 23:00 UTC)