[HN Gopher] How to get your backup to half of its size - ZSTD su...
___________________________________________________________________

How to get your backup to half of its size - ZSTD support

Author : marceloaltmann
Score  : 35 points
Date   : 2022-11-17 20:02 UTC (2 hours ago)

(HTM) web link (www.percona.com)
(TXT) w3m dump (www.percona.com)

| kkielhofner wrote:
| A while back I switched from lz4 to zstd compression with borg
| for backups. I've always appreciated zstd, but the deduplication
| of borg + zstd makes such a dramatic difference in storage
| space. My home "server" is a tiny NUC-like AMD Ryzen 7 4800U and
| it flies with this configuration.
|
| Impressive!

| bostonsre wrote:
| I don't say this often but... thank you facebook.
|
| https://facebook.github.io/zstd/

| glogla wrote:
| This and Presto are about the only things that Facebook did
| that benefited the world.
|
| Maybe React too, I'm not a front-end person.

| zitterbewegung wrote:
| Their research in transformers for conversational AI is good.
| See https://parl.ai/

| orangepurple wrote:
| "All right, but apart from the sanitation, the medicine,
| education, wine, public order, irrigation, roads, a fresh
| water system, and public health, what have the Romans ever
| done for us?"

| glogla wrote:
| Pros: zstd, presto, react
|
| Cons: created a post-truth world which destroyed democracy
| and killed millions of people by spreading antivaxx
| propaganda
|
| Eh.

| sneak wrote:
| So far, the societal benefits from Facebook, WhatsApp, and
| Instagram have been net positive, because the massive amounts
| of data they're collecting have not yet been bulk misused for
| mass torture, society-wide oppression, or a large-scale war.
|
| The game isn't over yet, however.

| glogla wrote:
| Not in the West, but check Facebook and Myanmar, for example.

| donatj wrote:
| So we've got a system that backs up to an XML file nightly. I
| noticed a couple of weeks ago that the bucket was using a
| not-insignificant amount of disk space.
|
| These XML files are full backups rather than deltas - so each
| one contains the full previous file plus the new additional
| data.
|
| My assumption was that if I grabbed a swath of the old ones,
| say a year's worth, and compressed them together, I would get
| really decent savings.
|
| I was pretty disappointed - gzip knocked 10 GB off of about
| 100 GB of data. I started doing some research and found people
| saying 7zip and its sliding dictionary size options were the
| answer. After multiple tries, each run of 7zip taking multiple
| days, I was able to get it down to about 70 GB from 100.
| Better than gzip, but frankly nowhere near what I would
| expect.
|
| Does there exist a compression scheme that could better handle
| this sort of expanding document?

| mkl wrote:
| > Does there exist a compression scheme that could better
| > handle this sort of expanding document?
|
| Yes: storing deltas. Git might work well. A backup system that
| does subfile deduplication may work too (e.g. Restic, Borg).

| bombcar wrote:
| That would be my idea: store the files in git and then back up
| the whole git repository.

| pavel_lishin wrote:
| They mention that the files _start_ at 700 MB:
| https://news.ycombinator.com/item?id=33646345
|
| I was under the impression that git didn't do well with large
| files.

| glogla wrote:
| Yeah. We used to store PowerDesigner files in git a while
| back, which are XMLs usually smaller than 100 MB, and that was
| pretty much a disaster; clones were already taking an hour. I
| can't imagine 700 MB working.
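A minimal sketch of the delta idea mkl suggests above, assuming the
nightly snapshots are plain-text XML files in a backups/ directory
(the file names and layout here are hypothetical). The standard
library's difflib keeps the first snapshot in full and stores every
later one only as a unified diff against the previous night:

    import difflib
    from pathlib import Path

    # Hypothetical layout: backups/backup-2022-11-01.xml, ...
    snapshots = sorted(Path("backups").glob("backup-*.xml"))

    prev_lines = None
    for snap in snapshots:
        cur_lines = snap.read_text().splitlines(keepends=True)
        if prev_lines is None:
            # Keep the very first snapshot in full.
            Path(snap.name + ".full").write_text("".join(cur_lines))
        else:
            # Later snapshots are stored only as the diff against the
            # previous night, i.e. what actually changed.
            delta = difflib.unified_diff(prev_lines, cur_lines,
                                         fromfile="previous",
                                         tofile=snap.name)
            Path(snap.name + ".patch").write_text("".join(delta))
        prev_lines = cur_lines

The patches can then be compressed individually; restoring a given
day means replaying the patches in order, which is essentially what
git or a deduplicating backup tool would automate.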
| bombcar wrote:
| Good point - I wonder if the XML really is "diffable" or if
| it's basically encoded binary data and has to be stored as a
| blob anyway.

| lazide wrote:
| That doesn't make much sense, window size or not. Even
| base64-encoded random data would be about that bad.
|
| Is the XML wrapping a bunch of other random data or something?

| cdavid wrote:
| base64 random data would be that bad only because of base64.
| Random data does not compress at all on average.
|
| As an example, I have zstd enabled on some ZFS pool. The
| client-side encrypted Time Machine backups do not even
| compress 1%, as expected.

| lazide wrote:
| Yes, that's why I said that?
|
| You can't embed binary data in XML raw, so a common pattern is
| embedding it as base64, or a similar type of wrapping.
|
| Technically, it's possible to escape it in other ways, but
| it's error-prone.
|
| Either way, XML which has text strings or whatever typical XML
| document data that ISN'T something like base64-encoded random
| data should compress dramatically better than what the poster
| was talking about.
|
| So, what the hell is in your XML anyway?

| 323 wrote:
| You need what's called "solid compression", meaning that the
| files are concatenated together before compression, because
| otherwise compression restarts from zero for each new file -
| https://en.wikipedia.org/wiki/Solid_compression
|
| Some compressors have options for solid compression, or you
| can use tar to first concatenate the files. For both, you need
| to sort the files first, so that they are compressed in
| chronological order, otherwise the common information will
| drop out of the dictionary.
|
| And how big is one file, and how much RAM do you have? Because
| you might need to increase the "dictionary size" of the
| compression algorithm; both 7zip and zstd support
| multi-gigabyte sizes.

| donatj wrote:
| The files start at around 700 MB and approach 1.8 GB by the
| end of the series.
|
| I've got 64 GB of RAM and 10 cores to work with, but it seems
| single-core compression might be in order?

| 323 wrote:
| 7zip can use 2 cores (maybe more now), zstd can use all of
| them, but zstd doesn't support solid compression - you need to
| use tar with it.

| guipsp wrote:
| In addition to this, you can pretrain a dictionary on sample
| data, and then use that when compressing files individually.

| cogman10 wrote:
| zstd supports using a dict across compression actions, which
| is what you'd want for this.
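A minimal sketch of the pretrained-dictionary approach guipsp and
cogman10 describe, using the third-party zstandard Python bindings
for libzstd (pip install zstandard). The directory layout,
dictionary size, and compression level are assumptions, and zstd
dictionaries are aimed mainly at many small inputs, so for
multi-gigabyte snapshots the tar-based solid approach above may
still win:

    import zstandard  # third-party bindings for libzstd
    from pathlib import Path

    snapshots = sorted(Path("backups").glob("backup-*.xml"))

    # Train a shared dictionary (1 MB here) on the existing files.
    samples = [p.read_bytes() for p in snapshots]
    dictionary = zstandard.train_dictionary(1024 * 1024, samples)

    # Compress each file individually, but with the shared
    # dictionary, so boilerplate common to every snapshot only has
    # to be described once.
    cctx = zstandard.ZstdCompressor(level=19, dict_data=dictionary)
    for p in snapshots:
        Path(p.name + ".zst").write_bytes(cctx.compress(p.read_bytes()))

    # Decompression needs the same dictionary, so keep it with the
    # archives.
    Path("backups.dict").write_bytes(dictionary.as_bytes())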
| PeterZaitsev wrote:
| Zstd rocks. Probably the best universal compression algorithm.
| Yes, LZ4 can be faster and Brotli can offer better compression,
| but both have other tradeoffs.

| mips_avatar wrote:
| What's surprising to me is that zstd beat out lz4 on
| compress/decompress speed. LZ4 is supposed to be purpose-built
| to optimize on those metrics. Great job, zstd developers, to
| get perf that good!

| ctur wrote:
| The lz4 developers are the zstd developers :) In this case I
| suspect the IO reductions on the output are what matter --
| while lz4 is faster, it also produces more output (of course),
| which you then have to write to disk. This can make the wall
| time take longer... basically you become IO-bound on output,
| be it network or disk.
|
| zstd also can tune itself and adjust ratios to saturate output
| bandwidth, which is pretty cool (--adapt).

| marceloaltmann wrote:
| That is exactly it. The difference comes from compression
| ratio. ZSTD is compressing the data more, so you need to write
| less back to disk.
|
| Also, when streaming, the difference is even bigger, as the
| data has to go over the WAN to S3 (the provider used in the
| blog's test).

| marceloaltmann wrote:
| Today we are glad to introduce support for a new compression
| algorithm in Percona XtraBackup 8.0.30 - Zstandard (ZSTD).
|
| Results show that ZSTD not only beat LZ4 on all tests, but
| also brought the backup down to half of its original size.
|
| When streaming is added to the mix, we see the biggest
| difference between the two algorithms, with ZSTD beating LZ4
| by an even bigger margin.
|
| This can bring users and organizations a huge amount of
| savings in backup storage, both on-premises and especially in
| the cloud, where we are charged for each GB of storage we use.

| woleium wrote:
| Nice! I've always had a lot of respect for the Percona team.
| If you haven't checked out the product, I suggest you do :)
___________________________________________________________________
(page generated 2022-11-17 23:00 UTC)