[HN Gopher] Why are tar.xz files 15x smaller when using Python's...
       ___________________________________________________________________
        
       Why are tar.xz files 15x smaller when using Python's tar compared
       to macOS tar?
        
       Author : personjerry
       Score  : 177 points
        Date   : 2021-03-14 21:06 UTC (1 hour ago)
        
 (HTM) web link (superuser.com)
 (TXT) w3m dump (superuser.com)
        
       | gwern wrote:
       | Sorting the files in an archive is a great trick to know for
       | efficiency ( https://www.gwern.net/Archiving-URLs#sort-key-
       | compression-tr... ); the more redundant your files are, the
       | better it works and the savings can be enormous. Two near-
       | duplicate files next to each other in a compressor's stream are
        | basically free; space them by a few dozen megabytes, and you
        | pay full price...
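        | 
        | A minimal sketch of the trick with Python's tarfile (the
        | directory and sort key here are made up):
        | 
        |     import tarfile
        |     from pathlib import Path
        | 
        |     # group similar names so near-duplicates sit together
        |     paths = sorted(Path("data").rglob("*"), key=lambda p: p.name)
        |     with tarfile.open("sorted.tar.xz", "w:xz") as tar:
        |         for p in paths:
        |             if p.is_file():
        |                 tar.add(p)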
        
         | tayo42 wrote:
          | Isn't the whole point of compression algorithms to find
          | strings of bytes that are the same? Why does it struggle if
          | they're out of order or not next to each other? Seems simple
          | to detect and handle?
        
           | dmw_ng wrote:
            | It depends a lot on the compressor. Ancient junk like
            | gzip/deflate only has a 32kb window; more modern compressors
            | like zstd can do 512mb or more.
            | 
            | A larger window means potentially much more work during
            | compression, and larger RAM requirements during
            | decompression, so it's not exactly free.
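            | 
            | The window effect is easy to see from Python's stdlib
            | (sizes here are arbitrary):
            | 
            |     import lzma, os, zlib
            | 
            |     block = os.urandom(16 * 1024)  # 16 KB, incompressible
            |     gap = os.urandom(64 * 1024)    # wider than deflate's
            |                                    # 32 KB window
            |     data = block + gap + block     # repeat sits ~80 KB back
            | 
            |     print(len(zlib.compress(data, 9)))  # ~96 KB: no match
            |     print(len(lzma.compress(data)))     # ~80 KB: match found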
        
             | bscphil wrote:
             | Makes me wonder how tar+zstd would perform for the
             | StackOverflow user. They were using xz, which is hardly
             | ancient junk.
        
           | diegoperini wrote:
           | Simple but time consuming to detect I presume.
        
           | mrec wrote:
           | Due to limited memory, I think most compression algos have a
           | window of "stuff they've seen recently" which they use to
           | detect duplicates. If the duplicate is of something too far
           | back, it's fallen out of the window and been forgotten.
        
           | jtolmar wrote:
           | Compression formats have to be small themselves, so they make
           | tradeoffs on what data they can point at to make copies of.
           | Most formats prioritize recent data, since that tends to be
           | the most similar.
        
           | jtvjan wrote:
           | The 'window size' dictates how far back the algorithm can
           | look for redundant data. If you set it too high, people
           | without enough core memory might not be able to decompress
           | your archive.
        
             | duskwuff wrote:
             | And the window size for gzip is only 32 kB -- it "forgets"
              | context pretty quickly. gzip was written in 1992, so it
              | had to run under far tighter memory constraints than make
              | sense today.
             | 
             | Modern compressors like xz (LZMA) use _much_ more memory
             | for context. Under extreme settings and when working with
             | large files, they can use gigabytes of memory.
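              | 
              | The stdlib exposes that knob directly; "big.tar" is a
              | stand-in, and a 256 MB dictionary means the decompressor
              | needs roughly that much RAM too:
              | 
              |     import lzma
              | 
              |     filt = [{"id": lzma.FILTER_LZMA2, "preset": 9,
              |              "dict_size": 256 * 1024 * 1024}]
              |     with open("big.tar", "rb") as f:
              |         xz = lzma.compress(f.read(),
              |                            format=lzma.FORMAT_XZ,
              |                            filters=filt)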
        
             | dmurray wrote:
             | Is this really true? Surely they should be able to
             | decompress it with virtual memory - you tell them exactly
             | where to look for each operation. On the other hand, it
             | will take you much longer to _compress_ the file if your
             | lookback window is larger than your available memory.
        
         | LeoPanthera wrote:
         | A good ugly hack is to reverse the filenames before sorting
         | them, which has the side effect of grouping all the same file
         | extensions together.
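          | 
          | In Python terms (toy names), the hack is just a reversed
          | sort key:
          | 
          |     names = ["a.html", "b.css", "a.css", "b.html"]
          |     print(sorted(names, key=lambda n: n[::-1]))
          |     # ['a.html', 'b.html', 'a.css', 'b.css']
          |     # same extensions end up adjacent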
        
           | duskwuff wrote:
           | 7za will do this for you when running with the -mqs option.
        
       | 0xdky wrote:
        | From what I know, compression works by de-duplication, and
        | finding duplicate patterns is limited to a window.
        | 
        | If similar patterns are close to each other, there is a higher
        | probability of finding such duplicates within the window,
        | leading to better compression.
        | 
        | When the files are not sorted, files with similar patterns end
        | up randomly distributed beyond the compression window, leading
        | to poor compression.
        | 
        | If there is an option to increase the size of the window, that
        | would be a good experiment.
        | 
        | This is very similar to `git repack`'s window and depth
        | parameters: the larger the window and depth, the better
        | compressed the packs.
       | 
       | Wonder if a sort based on diffs (group similar files together)
       | would help get best compression. The cost of such sorting might
       | outweigh the benefits.
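        | 
        | For reference, the repack knobs mentioned above look like this
        | (the repo path is a placeholder; git must be on PATH):
        | 
        |     import subprocess
        | 
        |     # larger --window/--depth let delta compression consider
        |     # more candidates per object, at the cost of repack time
        |     subprocess.run(["git", "-C", "/path/to/repo", "repack",
        |                     "-a", "-d", "--window=250", "--depth=250"],
        |                    check=True)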
        
       | mxcrossb wrote:
       | There's a lot of research in the field of scientific computing
        | aimed at exactly this kind of data. You can definitely gain a
        | lot if your floating point data is varying slowly.
        
       | jaynetics wrote:
       | tl;dr: the python lib adds files to the archive sorted by the
       | last modification date by default. this happened to be better in
       | this case, but macOS tar had the same results when using the
       | appropriate sorting flag.
       | 
       | Makes me wonder: considering the speed of modern SSDs, would it
       | make sense for compression tools to try various sortings by
       | default, or compare files up front?
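          | 
          | A crude sketch of that experiment with the stdlib (the
          | directory and the candidate orderings are made up):
          | 
          |     import io, tarfile
          |     from pathlib import Path
          | 
          |     def packed_size(paths):
          |         buf = io.BytesIO()
          |         with tarfile.open(fileobj=buf, mode="w:xz") as tar:
          |             for p in paths:
          |                 tar.add(p)
          |         return len(buf.getvalue())
          | 
          |     files = [p for p in Path("data").rglob("*")
          |              if p.is_file()]
          |     orderings = {
          |         "name": sorted(files),
          |         "ext": sorted(files, key=lambda p: p.suffix),
          |         "mtime": sorted(files,
          |                         key=lambda p: p.stat().st_mtime),
          |     }
          |     sizes = {k: packed_size(v) for k, v in orderings.items()}
          |     print(min(sizes, key=sizes.get))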
        
         | masklinn wrote:
         | > macOS tar had the same results when using the appropriate
         | sorting flag.
         | 
          | No, macOS has the same results when using gnutar, which
          | provides a sorting option.
          | 
          | bsdtar (which macOS provides out of the box) has no such
          | option. It just stores files in whichever order you provide
          | them (on the command line) or it finds them (for recursive
          | tar, whichever order the directory returns them in).
        
         | js2 wrote:
         | > macOS tar had the same results when using the appropriate
         | sorting flag
         | 
         | They had to install and use GNU tar to gain the `--sort`
         | option. macOS (BSD) tar doesn't have it. (You could emulate the
         | behavior by using `find` and passing the member names in on the
         | command-line.)
        
         | tomatocracy wrote:
         | Another option is to use something like lrzip as the compressor
         | - not sure if it would have helped in this instance but it
         | incorporates a step which looks for long-range redundancy which
         | can make a huge difference in these sorts of scenarios.
        
         | JulianWasTaken wrote:
         | Seems better to have some separate higher level metacompression
         | tool which does the heuristic searching across many possible
         | compression formats and parameters.
        
           | duskwuff wrote:
           | This is precisely the approach taken by the Hydra
           | metacompressor in the (closed-source) Oodle compression
           | library:
           | 
           | http://cbloomrants.blogspot.com/2017/02/oodle-hydra.html
        
         | formerly_proven wrote:
         | Compression is generally CPU bound for most of the typical
         | algorithms (LZMA, zlib/deflate) unless your drive is very very
         | slow (<10 MB/s). There are some quite fast algorithms where
         | this could make sense, basically using something like LZ4 as a
          | proxy for compressibility. This is the approach some tools use
         | when you tell them to "compress, but only if it makes sense".
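          | 
          | A sketch of that "proxy" idea; zlib at level 1 stands in for
          | LZ4 here since LZ4 isn't in the stdlib, and the threshold is
          | arbitrary:
          | 
          |     import zlib
          | 
          |     def worth_compressing(sample: bytes, threshold=0.9):
          |         # if even a fast, low-effort pass barely shrinks
          |         # the sample, skip the expensive codec entirely
          |         small = len(zlib.compress(sample, 1))
          |         return small < threshold * len(sample)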
        
       | userbinator wrote:
       | _If I compare the two tar archives directly, they seem different_
       | 
       | I like the fact that this user considered whether the filesystem
       | was lying about the size.
       | 
       | It's interesting to see the problem-solving approach; my next
       | step would be to hexdump both and see if one isn't actually
       | compressed as much as it could be, then decompress and inspect
       | the tars.
       | 
       | I'm not sure whether it's true here, but one situation that leads
       | to seemingly randomly-ordered files is when the archiver tries to
       | be faster by using multiple threads.
       | 
       | The XZ blocksize is another variable to consider.
        
       | TheGuyWhoCodes wrote:
        | This isn't really specific to tar. For example, if you save
        | data to parquet, sorting the data before persisting it can make
        | a huge difference in the size of the file AND the predicate
        | pushdown performance (you can skip more pages, especially if
        | you sort by the column you filter on).
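        | 
        | A minimal sketch with pyarrow (file and column names invented;
        | Table.sort_by needs a reasonably recent pyarrow):
        | 
        |     import pyarrow.parquet as pq
        | 
        |     table = pq.read_table("events.parquet")
        |     # sorting by the filtered column tightens per-page min/max
        |     # stats, so readers can skip more pages
        |     pq.write_table(table.sort_by("user_id"), "sorted.parquet")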
        
       | war1025 wrote:
       | Unrelated to the issue in the post, but a few years back we ran
       | into an annoying oddity in the Python zipfile module [1].
       | 
        | By default, when you create a new zipfile object to write a
        | .zip archive, it stores the data uncompressed.
       | 
       | Only noticed the issue when a 100MB zip file suddenly became 3MB
       | after I had edited one of the files and re-saved the archive.
       | 
       | [1] https://docs.python.org/3/library/zipfile.html
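        | 
        | The fix is a one-liner, for anyone hitting the same thing:
        | 
        |     import zipfile
        | 
        |     # ZIP_STORED (no compression) is the default
        |     with zipfile.ZipFile("out.zip", "w",
        |                          compression=zipfile.ZIP_DEFLATED) as zf:
        |         zf.write("report.csv")  # placeholder file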
        
         | masklinn wrote:
         | Yes, it's clearly documented but very much unexpected the first
         | time 'round.
         | 
         | The reason for the default is probably that python can be built
         | with none of zlib, bzip2 or lzma support:
         | 
         | > If ZIP_DEFLATED, ZIP_BZIP2 or ZIP_LZMA is specified but the
         | corresponding module (zlib, bz2 or lzma) is not available,
         | RuntimeError is raised.
         | 
          | in which case only STORED is available, so that's the only
          | possible default; any other would lead to zipfile raising
          | errors out of the box for some people.
         | 
         | It would probably have been a better idea to just not use a
         | default, and require the compression method to always be
         | specified, at least when opening zipfiles with modes other than
         | "r".
        
           | duskwuff wrote:
           | TBH, Python's zipfile should probably just require zlib.
           | Without that module, it's basically useless (can't open most
           | ZIP files, can't emit compressed ZIP files).
        
         | Jiocus wrote:
          | This isn't an oddity or issue with Python zip creation; it's
          | expected behaviour as per ISO/IEC WD 21320-1[1].
         | 
          | > The compression method shall be either 0 ("stored") or
          | 8 ("deflated").
         | 
         | The format does not prescribe a compression algorithm (many
         | could be available for use). It's merely a container.
         | 
         | [1] ISO/IEC WD 21320-1, Document Container File - Part 1: Core
         | -
         | https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39...
        
           | bawolff wrote:
           | What defaults python chooses is an oddity within python.
        
             | Jiocus wrote:
              | Using defaults aimed at compatibility with other zip
              | archivers, today and tomorrow, doesn't strike me as odd.
        
       | jasode wrote:
       | Fyi... not mentioned explicitly in that Q&A is the related
       | concept : https://en.wikipedia.org/wiki/Solid_compression
       | 
       | [EDIT to reply] : _> The difference was just the order in which
       | the files were concatenated._
       | 
        | Yes, right. My link just emphasizes the file concatenation
        | aspect (e.g. tar is one example) for people who base their
        | intuition on the ZIP format, in which files' order doesn't
        | matter because each file is compressed _separately_.
        
         | simias wrote:
         | Note that in both cases the archive was first put into a single
         | tar file, then compressed. That's how all tar formats function.
         | The difference was just the order in which the files were
         | concatenated.
        
           | masklinn wrote:
           | The point GP is making is that tars are "solid" archives, so
           | compression can span files, which means putting files with
           | lots of redundancy between them next to one another will
           | yield much better compression ratios than orderings which are
           | essentially random.
           | 
           | That is somewhat unintuitive to people used to zip archives,
           | because zip contents are compressed independently, so the
           | ordering really doesn't matter (at least to the compression
           | ratio).
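            | 
            | Easy to reproduce from the stdlib (sizes arbitrary):
            | identical files are nearly free in a solid .tar.xz but
            | cost full price in a .zip:
            | 
            |     import io, os, tarfile, zipfile
            | 
            |     member = os.urandom(256 * 1024)  # incompressible alone
            |     files = {f"copy{i}.bin": member for i in range(8)}
            | 
            |     tbuf = io.BytesIO()
            |     with tarfile.open(fileobj=tbuf, mode="w:xz") as tar:
            |         for name, data in files.items():
            |             info = tarfile.TarInfo(name)
            |             info.size = len(data)
            |             tar.addfile(info, io.BytesIO(data))
            | 
            |     zbuf = io.BytesIO()
            |     with zipfile.ZipFile(zbuf, "w",
            |                          zipfile.ZIP_DEFLATED) as zf:
            |         for name, data in files.items():
            |             zf.writestr(name, data)
            | 
            |     # ~260 KB for the solid tar.xz vs ~2 MB for the zip
            |     print(len(tbuf.getvalue()), len(zbuf.getvalue()))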
        
       | nomy99 wrote:
        | Also a bit unrelated, but at my workplace we reimplemented the
        | tar algorithm so we could run it in a lambda/state machine. We
        | used S3 multipart upload to help with the file transfer bit.
        | The performance was phenomenal: on average we were tar'ing 20gb
        | in 15 seconds or so, a lot faster than on my pc.
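        | 
        | A rough sketch of how that streaming might look with boto3
        | (bucket, key, and file list are placeholders; error handling
        | and abort_multipart_upload are omitted):
        | 
        |     import io, tarfile
        |     import boto3
        | 
        |     s3 = boto3.client("s3")
        |     BUCKET, KEY = "my-bucket", "archive.tar"
        | 
        |     buf = io.BytesIO()
        |     mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
        |     parts, n = [], 1
        | 
        |     def flush():
        |         global n
        |         resp = s3.upload_part(Bucket=BUCKET, Key=KEY,
        |                               UploadId=mpu["UploadId"],
        |                               PartNumber=n,
        |                               Body=buf.getvalue())
        |         parts.append({"ETag": resp["ETag"], "PartNumber": n})
        |         n += 1
        |         buf.seek(0); buf.truncate()
        | 
        |     with tarfile.open(fileobj=buf, mode="w") as tar:
        |         for path in ["a.bin", "b.bin"]:
        |             tar.add(path)
        |             # S3 parts (except the last) must be >= 5 MB
        |             if buf.tell() >= 8 * 1024 * 1024:
        |                 flush()
        |     flush()  # final part, includes the tar trailer
        |     s3.complete_multipart_upload(
        |         Bucket=BUCKET, Key=KEY, UploadId=mpu["UploadId"],
        |         MultipartUpload={"Parts": parts})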
        
         | OskarS wrote:
          | Tar (as opposed to gzip) is almost entirely I/O bound: it's
          | just reading from the filesystem, adding some headers and then
          | outputting it again. Whatever speed your I/O runs at, you will
          | basically tar at that speed as well.
        
           | banana_giraffe wrote:
           | Yep. One of the biggest speed improvements I got out of a
            | process that creates large tarballs was moving it to use
           | pbzip2 to allow the compression to use multiple cores.
        
             | slivanes wrote:
             | pigz is another package.
        
         | gnulinux wrote:
         | I don't understand this comment. You can run arbitrary code in
         | AWS lambda, you can just run "tar". Why did you need to
         | reimplement it?
        
           | 411111111111111 wrote:
           | It wasn't always like that though. I think they added custom
           | runtimes around 2yrs ago... Your options were very limited
           | before that.
        
           | rescbr wrote:
           | I suppose they are streaming the tarball directly to S3 using
           | multipart uploads, maybe parallelizing this task at the same
           | time.
           | 
           | You wouldn't be able to do it using regular tar.
        
       | sgt wrote:
       | Even better when you switch on the --lossy option in gzip.
        
         | Aachen wrote:
         | Was hoping for an Easter egg, but it just says unrecognized
         | option? At least on my phone it does, iirc that's a Debian
         | Buster install that I haven't updated in three years.
        
         | ObscureScience wrote:
          | I assume you're joking... I've never heard of lossy
          | compression in these kinds of formats, as there's no general
          | way to determine what can be discarded without affecting the
          | utility of the data. There's no such option in any
          | implementation of gzip I've found...
        
         | dheera wrote:
         | What version of gzip has a lossy option?
        
           | ben509 wrote:
            | It's a joke. Lossy compression can only work with specific
            | formats where information loss is acceptable: theoretically
            | anything, but in practice usually audio and video.
           | 
           | If gzip tried to do lossy compression, not knowing what the
           | data is, it would have to randomly corrupt data.
        
         | gruez wrote:
         | At that point you might as well store your files in /dev/null
        
           | wanderer2323 wrote:
           | if /dev/null is fast and web scale I will use it. (
           | https://devnull-as-a-service.com/ )
        
           | JdeBP wrote:
           | It's time to mention the name of the NABOB archiver, it
           | seems. (-:
           | 
           | * https://news.ycombinator.com/item?id=26460317
        
           | suprfsat wrote:
           | Does /dev/null support sharding?
        
             | fhars wrote:
             | Maybe try mangodb https://github.com/dcramer/mangodb
        
             | [deleted]
        
         | LeoPanthera wrote:
         | Although you are joking, gifsicle does have a "--lossy" switch
         | which allows changes to a gif file in order to make the LZW
         | compression more efficient.
        
           | masklinn wrote:
            | Another example is pngquant, which performs "lossy"
            | compression of PNGs (by quantizing them).
        
             | LeoPanthera wrote:
             | pngquant reduces the size by removing information from the
             | image _before_ compression.
             | 
             | gifsicle --lossy actually allows slightly lossy LZW
             | dictionary matches to improve the compression itself.
        
         | zinckiwi wrote:
         | It was the best and worst of times.
        
           | krrrh wrote:
           | It was the best and best of times.
        
       ___________________________________________________________________
       (page generated 2021-03-14 23:00 UTC)