[HN Gopher] Data Compression Drives the Internet. Here's How It ...
       ___________________________________________________________________
        
       Data Compression Drives the Internet. Here's How It Works
        
       Author : the-mitr
       Score  : 82 points
       Date   : 2023-06-09 00:36 UTC (22 hours ago)
        
 (HTM) web link (www.quantamagazine.org)
 (TXT) w3m dump (www.quantamagazine.org)
        
       | fguerraz wrote:
       | Quanta magazine and all these pop science websites need to be
       | stopped.
       | 
       | (Not because popular science is bad, but because they do it
       | badly, and the clickbait is insufferable)
        
       | sgt101 wrote:
       | Caches drive the internet.
       | 
       | They are orders of magnitude more important than compression.
       | 99.99% of requests hit a local cache. (1)
       | 
       | Compression is important too.
       | 
       | (1) I worked in a telco. Check it out for yourself!
        
         | Solvency wrote:
         | Ok how do they work then?
        
           | parl_match wrote:
            | There's lots of information about this on the open web, and
            | tons of resources for varying levels of skill.
        
       | joshsabol46 wrote:
       | Please pardon my ignorance...
       | 
        | My understanding is that there are 1000s of different compression
        | algorithms, each with their own pros/cons depending on the type
        | and characteristics of the file. And yet we still try to pick the
        | "generically best" codec for a given file type (e.g. PNG) and
        | then use that everywhere.
       | 
       | Why don't we have context-dependent compression instead?
       | 
        | I'm imagining a system that scans objects before compression,
        | selects the optimal algorithm, and then encodes the file. An
        | identifier for the selected algorithm could be prefixed to the
        | output for easy decompression.
       | 
        | Compare a single black image that's 1x1 to one that's 1000x1000.
        | As PNGs they're roughly 128 bytes and 6KB, respectively. However,
        | Run Length Encoding would compress the latter to a size
        | comparable to the former.
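        | 
        | A minimal sketch of the idea in Python (the one-byte tags, the
        | codec list, and the function names are just illustrative
        | choices):
        | 
        |     import bz2, lzma, zlib
        | 
        |     # Candidate codecs, each tagged with one byte so the
        |     # decompressor knows which one was picked.
        |     CODECS = {
        |         b"z": (zlib.compress, zlib.decompress),
        |         b"b": (bz2.compress, bz2.decompress),
        |         b"x": (lzma.compress, lzma.decompress),
        |     }
        | 
        |     def smart_compress(data: bytes) -> bytes:
        |         # Try every codec, keep the shortest output (the
        |         # extra CPU this burns is the usual objection).
        |         best = min((tag + comp(data)
        |                     for tag, (comp, _) in CODECS.items()),
        |                    key=len)
        |         # Store raw if nothing actually shrank the data.
        |         return best if len(best) <= len(data) + 1 else b"r" + data
        | 
        |     def smart_decompress(blob: bytes) -> bytes:
        |         tag, payload = blob[:1], blob[1:]
        |         return payload if tag == b"r" else CODECS[tag][1](payload)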
        
         | deepsun wrote:
         | > selects the optimal algorithm
         | 
          | Here's the catch: how does the "system" know which algorithm
          | would be best? It could try encoding with multiple algorithms
          | and see which output is shortest, but that's extra CPU.
         | 
          | And the "system" could itself be called a compression algorithm.
        
           | U2EF1 wrote:
            | I mean yeah, that's basically what high-compression
            | solutions like paq have done: depending on the desired
            | compression level, apply increasingly speculative and
            | computationally intensive models to the block and pick
            | whichever one works best.
        
       | denvaar wrote:
       | > The first step involves yet another clever strategy for
       | identifying repetition and thereby compressing message size, but
       | the second step is to take the resulting compressed message and
       | run it through the Huffman process.
       | 
       | I wonder if this "first step" is Burrows-Wheeler Transform?
       | 
       | Side note: In Silicon Valley (the show), I'm pretty sure that
       | Richard has a picture of David Huffman by his bedside.
        
         | Scaevolus wrote:
         | No, BWT is largely unused in modern compression codecs.
         | 
          | Lempel-Ziv is the basis for almost all modern general-purpose
          | compression. It works roughly like this: keep a hash table
          | mapping 3- or 4-byte fragments to their positions, walk
          | through the input byte by byte checking the hash table for
          | matches, and insert the latest fragments and positions into
          | the table.
          | 
          | BWT has nearly identical speed compressing and decompressing,
          | whereas with Lempel-Ziv, searching for matches to compress is
          | _much_ slower than simply copying data according to
          | instructions to decompress.
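          | 
          | A toy version of that loop in Python (real codecs use
          | rolling hashes, bounded windows, and lazy matching; this
          | sketch just yields literal bytes or (distance, length)
          | back-references):
          | 
          |     def lz_tokens(data: bytes, min_match: int = 4):
          |         table = {}  # 4-byte fragment -> last position seen
          |         i = 0
          |         while i < len(data):
          |             frag = data[i:i + min_match]
          |             j = table.get(frag)
          |             if j is not None:
          |                 # Match: extend it as far as it will go.
          |                 length = min_match
          |                 while (i + length < len(data) and
          |                        data[j + length] == data[i + length]):
          |                     length += 1
          |                 yield (i - j, length)  # (distance, length)
          |                 for k in range(i, i + length):
          |                     table[data[k:k + min_match]] = k
          |                 i += length
          |             else:
          |                 table[frag] = i
          |                 yield data[i]  # literal byte value
          |                 i += 1
          | 
          | DEFLATE (zlib, gzip, PNG, zip) then Huffman-codes those
          | literals and back-references, which is exactly the two-step
          | process the article describes.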
        
         | pseudotrash wrote:
         | I came here to find Silicon Valley references and wasn't
         | disappointed.
        
           | devsegal wrote:
           | Both of us have been satisfied
        
             | hinkley wrote:
             | at the same time?
        
           | ThrowawayTestr wrote:
            | I don't like that show because whenever I watch it I start
            | imagining a world with middle-out compression and I get sad
            | that we'll never have it.
        
       | gfody wrote:
        | Missed opportunity to explain how compression and prediction
        | are related: the better you can predict the next token, the
        | better your compression gets. Then the article even gets to
        | mention GPT, hey.
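        | 
        | The link is Shannon's: an ideal coder spends -log2(p) bits on
        | a symbol its model assigned probability p, so a sharper
        | predictor means a shorter output. A quick sketch with a
        | deliberately dumb order-0 model (just overall byte
        | frequencies):
        | 
        |     import math
        |     from collections import Counter
        | 
        |     def ideal_bits(data: bytes) -> float:
        |         # Sum of -log2(p) over the message: the size an
        |         # ideal coder would achieve under this model.
        |         freq = Counter(data)
        |         return sum(-math.log2(freq[b] / len(data))
        |                    for b in data)
        | 
        |     text = b"abracadabra abracadabra"
        |     print(ideal_bits(text) / 8, "bytes vs", len(text), "raw")
        | 
        | A model with context (up to and including a language model)
        | assigns a higher p to each next token, and an arithmetic coder
        | turns those probabilities into output very near that bound.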
        
         | optimalsolver wrote:
            | Apparently compression and intelligence are equivalent:
         | 
         | https://mattmahoney.net/dc/rationale.html
        
           | valenterry wrote:
            | I think in a sense compression is harder, because not only
            | do you want to correctly predict the next token, you also
            | want to do it fast, with a minimal but efficient algorithm
            | that doesn't require much space or a big dictionary.
            | 
            | You could think of it as taking a "snapshot" of an AI and
            | then optimizing the hell out of it for a specific case,
            | and you end up with a good compression algorithm.
        
       | hackcasual wrote:
        | Huffman trees are really cool and make for a neat little
        | example program to put together.
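        | 
        | For example, a minimal sketch in Python (the tiebreaker field
        | only keeps equal weights from ever comparing the dicts):
        | 
        |     import heapq
        |     from collections import Counter
        | 
        |     def huffman_codes(data: str) -> dict:
        |         # One leaf per symbol, weighted by its count.
        |         heap = [(w, i, {sym: ""}) for i, (sym, w)
        |                 in enumerate(Counter(data).items())]
        |         heapq.heapify(heap)
        |         tiebreak = len(heap)
        |         while len(heap) > 1:
        |             # Merge the two rarest subtrees; one side gets a
        |             # leading "0" on its codes, the other a "1".
        |             w1, _, lo = heapq.heappop(heap)
        |             w2, _, hi = heapq.heappop(heap)
        |             merged = {s: "0" + c for s, c in lo.items()}
        |             merged.update({s: "1" + c for s, c in hi.items()})
        |             heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        |             tiebreak += 1
        |         return heap[0][2]
        | 
        |     codes = huffman_codes("mississippi")
        |     print(codes)  # e.g. {'s': '0', 'm': '100', 'p': '101', 'i': '11'}
        |     print(sum(len(codes[c]) for c in "mississippi"), "bits vs 88 raw")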
        
       ___________________________________________________________________
       (page generated 2023-06-09 23:01 UTC)