[HN Gopher] Zstandard Worked Example
___________________________________________________________________
 
  Zstandard Worked Example
 
  Author : stargrave
  Score  : 54 points
  Date   : 2022-05-17 15:00 UTC (1 day ago)
 
  (HTM) web link (nigeltao.github.io)
  (TXT) w3m dump (nigeltao.github.io)
 
  | dietrichepp wrote:
  | Nice writeup! FSE, one of the parts of Zstandard, is a very
  | elegant entropy coder. If you're following this section of the
  | article, you might be wondering how it's encoded, since the
  | encoding process is somewhat less obvious than for Huffman
  | coding. My intuition tells me that there's probably a
  | homomorphism from Huffman coding to FSE that preserves the
  | encoded size exactly, but I haven't done the math to check.
  |
  | What causes trouble is simply, "Once you've emitted a symbol,
  | how do you know that the next symbol is in the range
  | [BL, BL+2^NB)?" The answer is that the encoding is performed
  | backwards, starting from the end of the input, and the table is
  | constructed so that for any symbol emitted, you can find a
  | _previous_ symbol such that the _current_ symbol is in the
  | previous symbol's [BL, BL+2^NB) range.
  |
  | There's another writeup about FSE here, which goes into more
  | depth about FSE itself (rather than talking about Zstandard in
  | general):
  |
  | http://fastcompression.blogspot.com/2013/12/finite-state-ent...
 
  | keyle wrote:
  | Interesting, I've never heard of it before. I was aware that
  | there were better standards than zip, but adoption was always
  | an issue because of licenses. If this is just as open, is it
  | purely an issue of wide adoption?
  |
  | We're stuck in the C++ paradigm (I'm sure there is a better
  | name for it), where everyone agrees this isn't "ideal" and that
  | there is something better, but it's not widely adopted enough.
  | learndeeply wrote:
  | Your questions are answered in the first sentence of the
  | article.
  |
  | > Zstd or Zstandard (RFC 8478, first released in 2015) is a
  | > popular modern compression algorithm.
It's smaller (better compression
  | > ratio) and faster than the ubiquitous Zlib/Deflate (RFC 1950
  | > and RFC 1951, first released in 1995).
 
  | c0balt wrote:
  | The article already mentions zstd, but for an IMO simple and
  | accurate overview take a look at the recommendations from
  | https://jolynch.github.io/posts/use_fast_data_algorithms/ .
 
  | wolf550e wrote:
  | Unless you have to use deflate/zlib/gzip/zip for compatibility,
  | or you're very constrained in the amount of memory, you should
  | always prefer zstd over deflate - it's both faster and
  | compresses better, so there is no tradeoff; using deflate is
  | just wasteful.
  |
  | Google's brotli is a close competitor and got supported in
  | Chrome first, so HTTP uses it more than it uses zstd.
  |
  | zstd has a very wide range of compression levels to trade off
  | compression ratio for CPU time, much wider than the difference
  | between gzip -1, gzip -6 (the default) and gzip -9. zstd -3 is
  | the default. zstd -1 is very fast, and zstd -19 compresses
  | really well.
  |
  | zstd also has a long-range mode, for archives that contain the
  | same file a gigabyte away. gzip's window is only 32KB, so it
  | can't notice any repetition farther away than that.
  |
  | LZ4, by the same author as zstd, aims at even faster
  | compression at the cost of a worse compression ratio. It makes
  | sense over a fast network. Google's Snappy and LZO are the
  | usual competitors.
  |
  | When you want better compression than zstd (at the cost of CPU
  | time, both during compression and during decompression), use
  | PPMD for natural-language text or source code and LZMA (xz)
  | for other things.
 
  | londons_explore wrote:
  | This post talks about literals still being encoded with
  | Huffman tables and so on.
|
  | However, it appears this must be optional, because short
  | strings compressed with zstd pass through "unencoded", apart
  | from an added header and footer:
  |
  |     $ echo "hello world" | zstd | hexdump -C
  |     00000000  28 b5 2f fd 04 58 61 00  00 68 65 6c 6c 6f 20 77  |(./..Xa..hello w|
  |     00000010  6f 72 6c 64 0a 8c 6d 7d  20                       |orld..m} |
  |
  | As an aside, I'm fairly disappointed that zstd didn't use a
  | single-byte header for incompressible strings, so that they
  | could guarantee that the compressed data will never be more
  | than 1 byte larger than the source data. That property is very
  | useful where lots of tiny strings are being compressed, such
  | as in a database.
  | wolf550e wrote:
  | This is an uncompressed block.
  |
  | For database records or log lines, you should use a dictionary
  | trained on likely data. Then short records compress well.
  |
  | See what facebook does with zstd (they employ its author):
  | https://engineering.fb.com/2018/12/19/core-data/zstandard/
  | pcwalton wrote:
  | The first 4 bytes are the magic number and the last 4 bytes
  | are the checksum [1], both of which you could always just chop
  | off if you wanted (it's legal to omit the checksum, see the
  | spec). That would get the total overhead down to 5 bytes.
  |
  | [1]:
  | https://github.com/facebook/zstd/blob/dev/doc/zstd_compressi...
___________________________________________________________________
(page generated 2022-05-18 23:00 UTC)
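The backwards encoding that dietrichepp describes is easiest to see
in a toy ANS coder. The sketch below is a big-integer range-ANS
variant, a simpler cousin of Zstandard's table-driven tANS/FSE; the
symbol alphabet, frequencies, and unbounded-integer state are
illustrative assumptions, not Zstandard's actual format, which uses a
precomputed state table and emits NB-bit chunks incrementally:

```python
# Toy range-ANS (rANS) coder, a big-integer cousin of FSE/tANS.
# Illustrative sketch only: shows why ANS-family coders encode the
# input BACKWARDS so the decoder can emit symbols forwards.

L = 16  # total of all frequencies (a power of two, like FSE's table size)

freqs = {"a": 8, "b": 4, "c": 4}  # toy frequencies; must sum to L
cum = {}                          # cumulative frequency: each symbol's
running = 0                       # starting slot in [0, L)
for sym, f in freqs.items():
    cum[sym] = running
    running += f

def encode(msg):
    x = L  # initial state
    for sym in reversed(msg):     # NOTE: backwards, as in FSE
        f, c = freqs[sym], cum[sym]
        # Push the symbol onto the state: the state grows by a factor
        # of roughly L/f, i.e. cheap symbols (large f) cost fewer bits.
        x = (x // f) * L + c + (x % f)
    return x

def decode(x, n):
    out = []
    for _ in range(n):
        # The low bits of the state select a slot; the slot's owner is
        # the next symbol (this is the table lookup in real FSE).
        slot = x % L
        sym = next(s for s in freqs if cum[s] <= slot < cum[s] + freqs[s])
        out.append(sym)
        # Pop the symbol, exactly inverting the encode step.
        x = freqs[sym] * (x // L) + slot - cum[sym]
    return "".join(out)

msg = "abacabca"
assert decode(encode(msg), len(msg)) == msg
```

Because each encode step depends on the state produced by the symbols
after it, the loop walks the message in reverse, and the decoder then
recovers symbols in forward order - the same reason FSE constructs its
table so that the previous symbol's [BL, BL+2^NB) range always
contains the current state.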