[HN Gopher] Why are tar.xz files 15x smaller when using Python's...
___________________________________________________________________

Why are tar.xz files 15x smaller when using Python's tar compared to macOS tar?

Author : personjerry
Score  : 177 points
Date   : 2021-03-14 21:06 UTC (1 hour ago)

(HTM) web link (superuser.com)
(TXT) w3m dump (superuser.com)

| gwern wrote:
| Sorting the files in an archive is a great trick to know for efficiency ( https://www.gwern.net/Archiving-URLs#sort-key-compression-tr... ); the more redundant your files are, the better it works, and the savings can be enormous. Two near-duplicate files next to each other in a compressor's stream are basically free; space them by a few dozen megabytes, and you pay full price...
| tayo42 wrote:
| Isn't the whole point of compression algorithms to find strings of bytes that are the same? Why do they struggle if the duplicates are out of order or not next to each other? Seems simple to detect and handle?
| dmw_ng wrote:
| It depends a lot on the compressor. Ancient junk like gzip/deflate only has a 32 KB window; more modern compressors like zstd can do 512 MB or more.
| A larger window means potentially much more work during compression, and larger RAM requirements during decompression, so it's not exactly free.
| bscphil wrote:
| Makes me wonder how tar+zstd would perform for the StackOverflow user. They were using xz, which is hardly ancient junk.
| diegoperini wrote:
| Simple but time-consuming to detect, I presume.
| mrec wrote:
| Due to limited memory, I think most compression algos have a window of "stuff they've seen recently" which they use to detect duplicates. If the duplicate is of something too far back, it's fallen out of the window and been forgotten.
| jtolmar wrote:
| Compression formats have to be small themselves, so they make tradeoffs on what data they can point at to make copies of. Most formats prioritize recent data, since that tends to be the most similar.
| jtvjan wrote:
| The 'window size' dictates how far back the algorithm can look for redundant data. If you set it too high, people without enough core memory might not be able to decompress your archive.
| duskwuff wrote:
| And the window size for gzip is only 32 kB -- it "forgets" context pretty quickly. gzip was written in 1992, so it had to run under much tighter memory constraints than make sense today.
| Modern compressors like xz (LZMA) use _much_ more memory for context. Under extreme settings and when working with large files, they can use gigabytes of memory.
| dmurray wrote:
| Is this really true? Surely they should be able to decompress it with virtual memory - you tell them exactly where to look for each operation. On the other hand, it will take you much longer to _compress_ the file if your lookback window is larger than your available memory.
| LeoPanthera wrote:
| A good ugly hack is to reverse the filenames before sorting them, which has the side effect of grouping all the same file extensions together.
| duskwuff wrote:
| 7za will do this for you when running with the -mqs option.
| 0xdky wrote:
| From what I know, compression works by de-duplication, and finding duplicate patterns is limited to a window.
| If similar patterns are close to each other, there is a higher probability of finding such duplicates within the window, leading to better compression.
| When the files are not sorted, files with similar patterns end up randomly distributed beyond the compression window, leading to poor compression.
| If there is an option to increase the size of the window, that would be a good experiment.
| This is very similar to the `git repack` window and depth parameters: the larger the window and depth, the better compressed the packs you get.
| I wonder if a sort based on diffs (grouping similar files together) would help get the best compression. The cost of such sorting might outweigh the benefits.
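A minimal sketch of the sorting trick discussed above, using Python's tarfile (the module from the headline). The data/ directory and the reversed-name sort key are illustrative assumptions; how much it helps depends entirely on how redundant the files are.

    import os
    import tarfile

    def make_archive(out_path, root, names):
        # "w:xz" produces a solid .tar.xz: one xz stream over the whole
        # tar, so member order decides which files share the window.
        with tarfile.open(out_path, "w:xz") as tar:
            for name in names:
                tar.add(os.path.join(root, name), arcname=name)

    root = "data"              # hypothetical directory of many similar files
    names = os.listdir(root)   # whatever order the OS happens to return
    make_archive("unsorted.tar.xz", root, names)

    # LeoPanthera's hack: sort on the reversed name so files sharing an
    # extension (and usually a format) end up adjacent in the stream.
    make_archive("sorted.tar.xz", root, sorted(names, key=lambda n: n[::-1]))

    for f in ("unsorted.tar.xz", "sorted.tar.xz"):
        print(f, os.path.getsize(f), "bytes")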
| mxcrossb wrote:
| There's a lot of research in the field of scientific computing aimed at exactly this kind of data. You can definitely gain a lot if your floating-point data is varying slowly.
| jaynetics wrote:
| tl;dr: the Python lib adds files to the archive sorted by the last modification date by default. This happened to be better in this case, but macOS tar had the same results when using the appropriate sorting flag.
| Makes me wonder: considering the speed of modern SSDs, would it make sense for compression tools to try various sortings by default, or to compare files up front?
| masklinn wrote:
| > macOS tar had the same results when using the appropriate sorting flag.
| No, macOS has the same results when using gnutar, which provides a sorting option.
| bsdtar (which macOS provides out of the box) has no such option. It just stores files in whichever order you provide them on the command line, or in whichever order it finds them (for recursive tar, so whichever order the directory returns them in).
| js2 wrote:
| > macOS tar had the same results when using the appropriate sorting flag
| They had to install and use GNU tar to gain the `--sort` option. macOS (BSD) tar doesn't have it. (You could emulate the behavior by using `find` and passing the member names in on the command line.)
| tomatocracy wrote:
| Another option is to use something like lrzip as the compressor - not sure if it would have helped in this instance, but it incorporates a step which looks for long-range redundancy, which can make a huge difference in these sorts of scenarios.
| JulianWasTaken wrote:
| Seems better to have some separate, higher-level metacompression tool which does the heuristic searching across many possible compression formats and parameters.
| duskwuff wrote:
| This is precisely the approach taken by the Hydra metacompressor in the (closed-source) Oodle compression library:
| http://cbloomrants.blogspot.com/2017/02/oodle-hydra.html
| formerly_proven wrote:
| Compression is generally CPU-bound for most of the typical algorithms (LZMA, zlib/deflate) unless your drive is very, very slow (<10 MB/s). There are some quite fast algorithms where this could make sense, basically using something like LZ4 as a proxy for compressibility. This is the approach some tools use when you tell them to "compress, but only if it makes sense".
| userbinator wrote:
| _If I compare the two tar archives directly, they seem different_
| I like the fact that this user considered whether the filesystem was lying about the size.
| It's interesting to see the problem-solving approach; my next step would be to hexdump both and see if one isn't actually compressed as much as it could be, then decompress and inspect the tars.
| I'm not sure whether it's true here, but one situation that leads to seemingly randomly ordered files is when the archiver tries to be faster by using multiple threads.
| The XZ block size is another variable to consider.
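Following userbinator's suggestion to inspect the tars: a small sketch that lists the member order of each archive without extracting anything (the archive names are hypothetical stand-ins for the two files from the question).

    import tarfile

    def member_order(path):
        # "r:*" transparently handles xz/gz/bz2; getnames() returns the
        # members in the order they occur in the archive, which is also
        # the order the compressor saw them in.
        with tarfile.open(path, "r:*") as tar:
            return tar.getnames()

    python_order = member_order("archive-python.tar.xz")
    macos_order = member_order("archive-macos.tar.xz")

    print("same members:", sorted(python_order) == sorted(macos_order))
    print("same order:  ", python_order == macos_order)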
| TheGuyWhoCodes wrote:
| This isn't really specific to tar. For example, if you save data to Parquet, sorting the data before persisting it can make a huge difference in the size of the file AND in predicate-pushdown performance (as you can skip more pages, especially if you sort by the column you filter on).
| war1025 wrote:
| Unrelated to the issue in the post, but a few years back we ran into an annoying oddity in the Python zipfile module [1].
| By default, when you create a new zipfile object to create a .zip archive with, it stores the data uncompressed.
| Only noticed the issue when a 100 MB zip file suddenly became 3 MB after I had edited one of the files and re-saved the archive.
| [1] https://docs.python.org/3/library/zipfile.html
| masklinn wrote:
| Yes, it's clearly documented, but very much unexpected the first time 'round.
| The reason for the default is probably that Python can be built with none of zlib, bzip2 or lzma support:
| > If ZIP_DEFLATED, ZIP_BZIP2 or ZIP_LZMA is specified but the corresponding module (zlib, bz2 or lzma) is not available, RuntimeError is raised.
| in which case only STORED is available, so that's the only possible default; any other would lead to zipfile raising errors out of the box for some people.
| It would probably have been a better idea to just not use a default and require the compression method to always be specified, at least when opening zipfiles with modes other than "r".
| duskwuff wrote:
| TBH, Python's zipfile should probably just require zlib. Without that module, it's basically useless (can't open most ZIP files, can't emit compressed ZIP files).
| Jiocus wrote:
| This isn't an oddity or issue with Python zip creation, it's expected behaviour as per ISO/IEC WD 21320-1 [1].
| > The compression method shall be either 0 ("stored") or 8 ("deflated").
| The format does not prescribe a compression algorithm (many could be available for use). It's merely a container.
| [1] ISO/IEC WD 21320-1, Document Container File - Part 1: Core - https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39...
| bawolff wrote:
| What defaults Python chooses is an oddity within Python.
| Jiocus wrote:
| Using defaults aimed at compatibility with other zip archivers, today and tomorrow, doesn't strike me as odd.
| jasode wrote:
| Fyi... not mentioned explicitly in that Q&A is the related concept: https://en.wikipedia.org/wiki/Solid_compression
| [EDIT to reply]: _> The difference was just the order in which the files were concatenated._
| Yes, right. My link just emphasizes the file-concatenation aspect (tar is one example) for people who base their intuition on the ZIP format, in which the order of the files doesn't matter because each file is compressed _separately_.
| simias wrote:
| Note that in both cases the archive was first put into a single tar file, then compressed. That's how all tar formats function. The difference was just the order in which the files were concatenated.
| masklinn wrote:
| The point GP is making is that tars are "solid" archives, so compression can span files, which means putting files with lots of redundancy between them next to one another will yield much better compression ratios than orderings which are essentially random.
| That is somewhat unintuitive to people used to zip archives, because zip contents are compressed independently, so the ordering really doesn't matter (at least to the compression ratio).
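A quick way to see both points at once, sketched with hypothetical near-duplicate files: the tar.xz is one solid compressed stream, the zip compresses each member independently, and Python's zipfile additionally has to be told to compress at all, since its default is ZIP_STORED (the oddity war1025 describes above).

    import os
    import tarfile
    import zipfile

    files = ["page1.html", "page2.html", "page3.html"]  # hypothetical near-duplicates

    # Solid archive: a single xz stream over the concatenated tar, so
    # repeated content in later files can reference earlier files.
    with tarfile.open("pages.tar.xz", "w:xz") as tar:
        for f in files:
            tar.add(f)

    # Zip: every member is compressed on its own, so cross-file redundancy
    # is never exploited. Note the explicit ZIP_DEFLATED; without it,
    # zipfile stores the data uncompressed.
    with zipfile.ZipFile("pages.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            zf.write(f)

    for archive in ("pages.tar.xz", "pages.zip"):
        print(archive, os.path.getsize(archive), "bytes")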
| nomy99 wrote:
| Also a bit unrelated, but at my workplace we reimplemented the tar algorithm so we could run it in a lambda/state machine. We used S3 multipart upload to help with the file-transfer bit. The performance was phenomenal: on average we were tarring 20 GB in 15 seconds or so, a lot faster than on my PC.
| OskarS wrote:
| Tar (as opposed to gzip) is almost entirely I/O-bound; it's just reading from the filesystem, adding some headers and then outputting it again. Whatever speed your I/O runs at, you will basically tar at that speed as well.
| banana_giraffe wrote:
| Yep. One of the biggest speed improvements I got out of a process that creates large tarballs was moving it to use pbzip2 to allow the compression to use multiple cores.
| slivanes wrote:
| pigz is another package.
| gnulinux wrote:
| I don't understand this comment. You can run arbitrary code in AWS Lambda; you can just run "tar". Why did you need to reimplement it?
| 411111111111111 wrote:
| It wasn't always like that, though. I think they added custom runtimes around two years ago... Your options were very limited before that.
| rescbr wrote:
| I suppose they are streaming the tarball directly to S3 using multipart uploads, maybe parallelizing this task at the same time.
| You wouldn't be able to do it using regular tar.
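A rough sketch of the streaming setup rescbr describes: write the tar stream into a pipe from one thread and feed the other end straight to S3. This is an illustration, not the poster's actual implementation; boto3, the bucket, key and paths are all assumptions here, and boto3's upload_fileobj is what performs the multipart upload behind the scenes for large streams.

    import os
    import tarfile
    import threading

    import boto3  # assumption: boto3 is available in the runtime

    def stream_tar_to_s3(paths, bucket, key):
        # Producer/consumer over a pipe: one thread writes a gzipped tar
        # stream, the main thread uploads it without ever holding the
        # whole tarball on disk or in memory.
        read_fd, write_fd = os.pipe()
        reader = os.fdopen(read_fd, "rb")
        writer = os.fdopen(write_fd, "wb")

        def produce():
            # "w|gz" writes a non-seekable stream, which is what a pipe needs.
            with tarfile.open(fileobj=writer, mode="w|gz") as tar:
                for path in paths:
                    tar.add(path)
            writer.close()  # signal EOF to the uploader

        producer = threading.Thread(target=produce)
        producer.start()
        boto3.client("s3").upload_fileobj(reader, bucket, key)
        producer.join()
        reader.close()

    # Hypothetical usage:
    # stream_tar_to_s3(["/data/logs"], "some-bucket", "backups/logs.tar.gz")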
| sgt wrote:
| Even better when you switch on the --lossy option in gzip.
| Aachen wrote:
| Was hoping for an Easter egg, but it just says unrecognized option? At least on my phone it does; IIRC that's a Debian Buster install that I haven't updated in three years.
| ObscureScience wrote:
| I assume you're joking... I've never heard of lossy compression in these kinds of formats, as there's no general way to determine what can be discarded without affecting the utility of the data. There's no such option in any implementation of gzip I've found...
| dheera wrote:
| What version of gzip has a lossy option?
| ben509 wrote:
| It's a joke. Lossy compression can only work with specific formats where information loss is acceptable - theoretically anything, but in practice usually audio and video.
| If gzip tried to do lossy compression, not knowing what the data is, it would have to randomly corrupt data.
| gruez wrote:
| At that point you might as well store your files in /dev/null.
| wanderer2323 wrote:
| If /dev/null is fast and web scale, I will use it. ( https://devnull-as-a-service.com/ )
| JdeBP wrote:
| It's time to mention the name of the NABOB archiver, it seems. (-:
| * https://news.ycombinator.com/item?id=26460317
| suprfsat wrote:
| Does /dev/null support sharding?
| fhars wrote:
| Maybe try mangodb https://github.com/dcramer/mangodb
| [deleted]
| LeoPanthera wrote:
| Although you are joking, gifsicle does have a "--lossy" switch which allows changes to a GIF file in order to make the LZW compression more efficient.
| masklinn wrote:
| Another example is pngquant, which performs "lossy" compression of PNGs (by quantizing them).
| LeoPanthera wrote:
| pngquant reduces the size by removing information from the image _before_ compression.
| gifsicle --lossy actually allows slightly lossy LZW dictionary matches to improve the compression itself.
| zinckiwi wrote:
| It was the best and worst of times.
| krrrh wrote:
| It was the best and best of times.
___________________________________________________________________
(page generated 2021-03-14 23:00 UTC)