[HN Gopher] How to get your backup to half of its size - ZSTD support
       ___________________________________________________________________
        
       How to get your backup to half of its size - ZSTD support
        
       Author : marceloaltmann
       Score  : 35 points
       Date   : 2022-11-17 20:02 UTC (2 hours ago)
        
 (HTM) web link (www.percona.com)
 (TXT) w3m dump (www.percona.com)
        
       | kkielhofner wrote:
        | A while back I switched from lz4 to zstd compression with borg
        | for backups. I've always appreciated zstd, but borg's
        | deduplication combined with zstd makes such a dramatic difference
        | in storage space. My home "server" is a tiny NUC-like AMD Ryzen 7
        | 4800U and it flies with this configuration.
       | 
       | Impressive!
        
       | bostonsre wrote:
       | I don't say this often but... thank you facebook.
       | 
       | https://facebook.github.io/zstd/
        
         | glogla wrote:
         | This and Presto are about the only things that Facebook did
         | that benefited the world.
         | 
         | Maybe React too, I'm not a front-end person.
        
           | zitterbewegung wrote:
           | Their research in transformers for conversational AI is good.
           | See https://parl.ai/
        
           | orangepurple wrote:
           | "All right, but apart from the sanitation, the medicine,
           | education, wine, public order, irrigation, roads, a fresh
           | water system, and public health, what have the Romans ever
           | done for us?"
        
             | glogla wrote:
             | Pros: zstd, presto, react
             | 
             | Cons: created a post-truth world which destroyed democracy
             | and killed millions of people by spreading antivaxx
             | propaganda
             | 
             | Eh.
        
           | sneak wrote:
           | So far, the societal benefits from Facebook, WhatsApp, and
           | Instagram have been net positive, because the massive amounts
           | of data they're collecting have not yet been bulk misused for
           | mass torture, society-wide oppression, or a large-scale war.
           | 
           | The game isn't over yet, however.
        
             | glogla wrote:
             | Not in the West, but check Facebook and Myanmar, for
             | example.
        
       | donatj wrote:
        | So we've got a system that backs up to an XML file nightly. I
        | noticed a couple of weeks ago that the bucket was using a not-
        | insignificant amount of disk space.
        | 
        | These XML files are full backups rather than deltas - so each one
        | contains the full previous file plus the new additional data.
        | 
        | My assumption was that if I grabbed a swath of the old ones, say
        | a year's worth, and compressed them together, I would get really
        | decent savings.
        | 
        | I was pretty disappointed - gzip knocked 10 GB off of about 100
        | GB of data. I started doing some research and found people saying
        | 7zip and its sliding dictionary size options were the answer.
        | After multiple tries, each run of 7zip taking multiple days, I
        | was able to get it down to about 70 GB from 100. Better than
        | gzip, but frankly nowhere near what I would expect.
        | 
        | Does there exist a compression scheme that could better handle
        | this sort of expanding document?
        
         | mkl wrote:
         | > Does there exist a compression that could better handle this
         | sort of expanding documents?
         | 
         | Yes: storing deltas. Git might work well. A backup system that
         | does subfile deduplication may work too (e.g. Restic, Borg).
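
          As a toy illustration of the delta idea (not from the thread):
          assuming the dumps really are append-only as described above,
          i.e. each nightly file is the previous file plus new data at the
          end, storing only the appended tail might look like the Python
          sketch below. Real tools such as Restic and Borg instead use
          content-defined chunking, which also survives edits in the
          middle of a file.

              import hashlib

              def make_delta(prev: bytes, curr: bytes) -> dict:
                  """Store only what was appended since the previous dump."""
                  if curr[:len(prev)] == prev:
                      return {"base": hashlib.sha256(prev).hexdigest(),
                              "appended": curr[len(prev):]}
                  # Fall back to a full copy if the append-only assumption
                  # ever breaks.
                  return {"base": None, "appended": curr}

              def apply_delta(prev: bytes, delta: dict) -> bytes:
                  """Rebuild the current dump from the previous one."""
                  if delta["base"] is None:
                      return delta["appended"]
                  assert hashlib.sha256(prev).hexdigest() == delta["base"]
                  return prev + delta["appended"]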
        
           | bombcar wrote:
            | That would be my idea: store the files in git and then back
            | up the whole git repository.
        
             | pavel_lishin wrote:
             | They mention that the files _start_ at 700mb:
             | https://news.ycombinator.com/item?id=33646345
             | 
             | I was under the impression that git didn't do well with
             | large files.
        
               | glogla wrote:
                | Yeah. We used to store PowerDesigner files in git a while
                | back - XMLs usually smaller than 100 MB - and that was
                | already pretty much a disaster, with clones taking an
                | hour. I can't imagine 700 MB working.
        
               | bombcar wrote:
               | Good point - I wonder if the XML really is "diffable" or
               | if it's basically encoded binary data and has to be
               | stored as a blob anyway.
        
         | lazide wrote:
         | That doesn't make much sense, window size or not. Even base64
         | encoded random data would be about that bad.
         | 
         | Is the XML wrapping a bunch of other random data or something?
        
           | cdavid wrote:
           | base64 random data would be that bad only because of base64.
           | Random data does not compress at all on average.
           | 
            | As an example, I have zstd enabled on a zfs pool. The
            | client-side encrypted Time Machine backups on it don't even
            | compress by 1%, as expected.
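
            A quick way to see both points, as a minimal sketch assuming
            the python-zstandard package is installed (exact numbers will
            vary): truly random bytes will not shrink at all, the base64
            copy compresses back to roughly the size of the underlying
            random data, and repetitive XML-ish text collapses to a small
            fraction.

                import base64
                import os
                import zstandard  # assumes the python-zstandard package

                cctx = zstandard.ZstdCompressor(level=3)

                random_bytes = os.urandom(1_000_000)
                samples = [
                    ("random bytes", random_bytes),
                    ("base64 of random", base64.b64encode(random_bytes)),
                    ("repetitive XML-ish",
                     b"<row><id>1</id><name>alice</name></row>\n" * 25_000),
                ]
                for label, data in samples:
                    ratio = len(cctx.compress(data)) / len(data)
                    print(f"{label:>18}: {ratio:.2f} of original size")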
        
             | lazide wrote:
             | Yes, that's why I said that?
             | 
              | You can't embed raw binary data in XML, so a common pattern
              | is embedding it as base64 or some similar wrapping.
              | 
              | Technically it's possible to escape it in other ways, but
              | that's error-prone.
              | 
              | Either way, XML containing text strings or other typical
              | document data (anything that ISN'T something like base64-
              | encoded random data) should compress dramatically better
              | than what the poster was describing.
             | 
             | So, what the hell is in your XML anyway?
        
         | 323 wrote:
          | You need what's called "solid compression", meaning that the
          | files are concatenated together before compression; otherwise
          | compression restarts from zero for each new file -
          | https://en.wikipedia.org/wiki/Solid_compression
          | 
          | Some compressors have options for solid compression, or you can
          | use tar to concatenate the files first. In both cases you need
          | to sort the files so that they are compressed in chronological
          | order, otherwise the common information will drop out of the
          | dictionary.
          | 
          | And how big is one file, and how much RAM do you have? You
          | might need to increase the "dictionary size" of the compression
          | algorithm; both 7zip and zstd support multi-gigabyte sizes.
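
          For a concrete picture, a minimal sketch of that approach in
          Python, assuming the python-zstandard package is installed and
          that the dumps sit in one directory (paths are hypothetical):
          tar the files oldest-first and compress the whole stream as a
          single unit.

              import os
              import tarfile
              import zstandard  # assumes the python-zstandard package

              def solid_compress(dump_dir, out_path, level=19):
                  """Tar the dumps oldest-first and zstd-compress the whole
                  stream, so later files can reuse matches from earlier
                  ones."""
                  files = sorted(
                      (os.path.join(dump_dir, n) for n in os.listdir(dump_dir)),
                      key=os.path.getmtime,  # chronological order, as above
                  )
                  cctx = zstandard.ZstdCompressor(level=level, threads=-1)
                  with open(out_path, "wb") as raw:
                      with cctx.stream_writer(raw) as zfh:
                          # mode "w|" streams the tar, so no seeking needed
                          with tarfile.open(fileobj=zfh, mode="w|") as tar:
                              for path in files:
                                  tar.add(path,
                                          arcname=os.path.basename(path))

              solid_compress("backups/", "backups.tar.zst")

          Note that with files in the 1-2 GB range, matches can only reach
          as far back as the compressor's window; the zstd command-line
          tool's long-distance matching mode (--long) can stretch that
          window into the gigabyte range.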
        
           | donatj wrote:
            | The files start at around 700 MB and approach 1.8 GB by the
            | end of the series.
            | 
            | I've got 64 GB of RAM and 10 cores to work with, but it seems
            | single-core compression might be in order?
        
             | 323 wrote:
             | 7zip can use 2 cores (maybe more now), zstd can use all of
             | them, but zstd doesn't support solid compression - you need
             | to use tar with it.
        
           | guipsp wrote:
              | In addition to this, you can pretrain a dictionary on
              | sample data, and then use that when compressing files
              | individually.
        
             | cogman10 wrote:
             | zstd supports using a dict across compression actions,
             | which is what you'd want for this.
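
              To make that concrete, a small sketch using the
              python-zstandard bindings (an assumption; the zstd CLI can
              do the same with --train): train a dictionary once on
              existing dumps, then reuse it for every file compressed on
              its own. File names are hypothetical, and since dictionaries
              pay off most for many small, similar inputs, multi-GB dumps
              would be trained on in chunks.

                  import glob
                  import zstandard  # assumes python-zstandard

                  # Hypothetical sample set: existing dumps (or chunks).
                  samples = [open(p, "rb").read()
                             for p in sorted(glob.glob("backups/*.xml"))]

                  # Train a shared dictionary once (112640 bytes is the
                  # zstd CLI's default --maxdict size).
                  dict_data = zstandard.train_dictionary(112_640, samples)

                  # Reuse it for each file compressed individually.
                  cctx = zstandard.ZstdCompressor(level=19,
                                                  dict_data=dict_data)
                  dctx = zstandard.ZstdDecompressor(dict_data=dict_data)

                  blob = cctx.compress(samples[0])
                  assert dctx.decompress(blob) == samples[0]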
        
       | PeterZaitsev wrote:
        | Zstd rocks. Probably the best universal compression algorithm.
        | Yes, LZ4 can be faster and Brotli can offer better compression,
        | but both have other tradeoffs.
        
       | mips_avatar wrote:
        | What's surprising to me is that zstd beat out lz4 on
        | compress/decompress speed. LZ4 is supposed to be purpose-built to
        | optimize for those metrics. Great job by the zstd developers for
        | getting perf that good!
        
         | ctur wrote:
         | The lz4 developers are the zstd developers :) In this case I
         | suspect the IO reductions on the output are what matter --
         | while lz4 is faster, it also produces more output (of course)
         | which you then have to write to disk. This can make the wall
         | time take longer... basically you become IO bound on output, be
         | it network or disk.
         | 
         | zstd also can tune itself and adjust ratios to saturate output
         | bandwidth, which is pretty cool (--adapt).
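
            A rough way to check that trade-off on your own data, sketched
            under the assumption that the python lz4 and zstandard
            packages are installed (the input path is hypothetical):

                import time
                import lz4.frame   # assumes the python lz4 package
                import zstandard   # assumes python-zstandard

                def measure(name, compress, data):
                    start = time.perf_counter()
                    out = compress(data)
                    secs = time.perf_counter() - start
                    print(f"{name}: {len(out) / len(data):.2f} "
                          f"of original in {secs:.2f}s")

                data = open("backup.xml", "rb").read()
                measure("lz4 ", lz4.frame.compress, data)
                measure("zstd",
                        zstandard.ZstdCompressor(level=3).compress, data)

            If the destination cannot absorb lz4's larger output fast
            enough, the smaller zstd stream can still win on wall time.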
        
           | marceloaltmann wrote:
            | That is exactly it. The difference comes from the compression
            | ratio: ZSTD compresses the data more, so you need to write
            | less back to disk. When streaming, the difference is even
            | bigger, as the data has to go over the WAN to S3 (the
            | provider used in the blog's test).
        
       | marceloaltmann wrote:
       | Today we are glad to introduce support for a new compression
       | algorithm in Percona XtraBackup 8.0.30 - Zstandard (ZSTD).
       | 
        | The results show that ZSTD not only beat LZ4 on all tests, it
        | also brought the backup down to half of its original size.
        | 
        | When streaming is added to the mix we see the biggest difference
        | between the two algorithms, with ZSTD beating LZ4 by an even
        | bigger margin.
        | 
        | This can bring users and organizations huge savings in backup
        | storage, either on-premises or, especially, in the cloud, where
        | we are charged for each GB of storage we use.
        
         | woleium wrote:
         | Nice! I've always had a lot of respect for the Percona team. If
         | you haven't checked out the product I suggest you do :)
        
       ___________________________________________________________________
       (page generated 2022-11-17 23:00 UTC)