[HN Gopher] pigz: A parallel implementation of gzip for multi-co...
       ___________________________________________________________________
        
       pigz: A parallel implementation of gzip for multi-core machines
        
       Author : firloop
       Score  : 113 points
       Date   : 2022-10-17 19:19 UTC (3 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | sitkack wrote:
       | If you really want to enable all cores for compression and
       | decompression, give pbzip2 a try. pigz isn't as parallel as
       | pbzip2
       | 
       | http://compression.ca/pbzip2/
       | 
       | *edit, as ac29 mentions below, just use zstdmt. In my quick
       | testing it is approximately 8x faster than pbzip2 and gives
       | better compression ratios. Wall clock time went from 41s to 3.5s
       | for a 3.6GB tar of source, pdfs and images AND the resulting file
        | was smaller.
        | 
        |   megs
        |   3781    test.tar
        |   3041    test.tar.zstd (default compression 3, 3.5s)
        |   3170    test.tar.bz2 (default compression, 8 threads, 40s)
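        | 
        | For reference, the zstdmt behaviour is just zstd with all worker
        | threads enabled; a minimal equivalent invocation (the file name
        | is simply the example above) would be something like:
        | 
        |   zstd -T0 -3 test.tar -o test.tar.zstd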
        
         | walrus01 wrote:
         | on the other hand, bzip2 is pretty much obsoleted now by xzip
        
           | booi wrote:
           | What is xzip? are you talking about xz?
        
             | walrus01 wrote:
             | yes, xz
             | 
             | section 3.6 here
             | 
             | https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Ma
             | r...
             | 
             | https://en.wikipedia.org/wiki/XZ_Utils
        
               | chasil wrote:
                | The author of lzip has published pointed criticism of
                | the design choices of xz.
               | 
               | I generally use lzip for data that is important to me.
               | 
               | https://www.nongnu.org/lzip/xz_inadequate.html
        
         | ac29 wrote:
         | bzip2 is very very slow though. Some types of data compress
         | quite well with bzip, but if high compression is needed, xz is
         | usually as good or better and natively has multithreading
         | available.
         | 
          | For everything else, there's zstd (also natively multithreaded).
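          | 
          | As a rough sketch of what "natively multithreaded" looks like
          | on the command line (levels and file name here are just
          | examples):
          | 
          |   xz -T0 -6 big.tar      # use all cores, default level
          |   zstd -T0 -19 big.tar   # use all cores, high compression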
        
           | sitkack wrote:
           | Interesting https://docs.rs/zstd/latest/zstd/stream/write/str
           | uct.Encoder...
        
             | iruoy wrote:
              | Decompression is multithreaded by default; compression is
              | multithreaded with an argument. Either way, it's built in.
        
       | ericbarrett wrote:
       | We used this to great effect at Facebook for MySQL backups in the
       | early 2010s. The backup hosts had far more CPU than needed so it
       | was a very nice speed-up over gzip. Eventually we switched to
       | zstd, of course, but pigz never failed us.
        
         | mackman wrote:
         | Hey Eric! Hope you're well!
        
         | antisthenes wrote:
         | Same, except we were at a small e-commerce boutique running
         | Magento circa 2011-2013.
         | 
         | SQL backups were simply a bash script using Pigz, running on a
         | cron job. Simple times!
        
         | Xorlev wrote:
         | Pretty similar to that, we used pigz and netcat to bring up new
         | MySQL read replicas in a chain at line speeds.
         | 
         | I recall learning the technique from Tumblr's eng blog.
         | 
         | https://engineering.tumblr.com/post/7658008285/efficiently-c...
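          | 
          | The general shape of that kind of pipeline (hosts, port and
          | paths here are made up, and it assumes the source datadir is
          | quiesced) is something like:
          | 
          |   # on the new replica
          |   nc -l 9999 | unpigz | tar -xf - -C /var/lib/mysql
          | 
          |   # on the source
          |   tar -cf - -C /var/lib/mysql . | pigz | nc replica.example.com 9999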
        
           | evanelias wrote:
           | I wrote that Tumblr eng blog post, glad to see it's still
           | making the rounds! I later joined FB's mysql team a few years
           | after that, although I can't quite remember if FB was still
           | using pigz by that time. (also, hi Eric!)
           | 
           | Separately, at Tumblr I vaguely remember examining some
           | alternative to pigz that was consistently faster at the time
           | (11 years ago) because pigz couldn't parallelize
           | decompression. Can't quite remember the name of the
           | alternative, but it had licensing restrictions which made it
           | less attractive than pigz.
           | 
           | Edit: the old fast alternative I was thinking of is qpress,
           | formerly hosted at http://www.quicklz.com/ but that's no
           | longer online. Googling it now, there are some mirrors and
           | also looks like Percona tools used/bundled it. Not sure if
           | they still do or if they've since switched to zstd.
        
       | walrus01 wrote:
       | Would not recommend using this in 2022, use zstandard or xzip
       | instead.
       | 
        | zstandard is faster and gives slightly better compression at
        | speed settings that are equivalent to gzip, in addition to
        | having the ability to optionally compress stuff at a much
        | greater ratio if you allow it to take more time and cpu
        | resources.
       | 
       | https://gregoryszorc.com/blog/2017/03/07/better-compression-...
        
         | dspillett wrote:
          | pigz has the advantage of producing output that can be read by
          | standard gzip processing tools (including, of course,
          | gzip/gunzip), which are available by default on just about
          | every OS out there, so you get the faster archive creation
          | speed without adding requirements for those who might be
          | accessing the results later.
         | 
          | It works because gzip streams can be tacked together into a
          | single stream: at the start of each block is an instruction to
          | reset the compression dictionary as if it were the start of a
          | file/stream (which in practice it is), so you just have to
          | concatenate the parts coming out of the parallel threads in the
          | right order. These resets cause a drop in overall compression
          | ratio, but it is small and can be minimised by using large
          | enough blocks.
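          | 
          | You can see the concatenation property with plain gzip (the
          | file names here are just an illustration):
          | 
          |   gzip -c part1 >  whole.gz
          |   gzip -c part2 >> whole.gz
          |   gunzip -c whole.gz    # outputs part1 followed by part2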
        
           | walrus01 wrote:
            | Yes, one consideration is whether you're creating archives
            | for your own later use (or internal use, where you also have
            | zstandard and xz handling tools), or sending them somewhere
            | else for wider use on unknown platforms.
        
             | dspillett wrote:
             | Aye, pick the right tool for the target audience. If you
             | are the target or you know everyone else who needs to read
             | the output will have the ability to read zstd, go with
              | that. If not, consider pigz. If writing a script that
              | others may run, have it default to gzip but use pigz if
              | available (unless you really don't want that small % drop
              | in compression).
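              | 
              | In a script that can be as little as (only a sketch, the
              | names are examples):
              | 
              |   GZ=$(command -v pigz || command -v gzip)
              |   tar -cf - somedir | "$GZ" > somedir.tar.gz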
        
       | ananonymoususer wrote:
       | I use this all the time. It's a big time saver on multi-core
       | machines (which is pretty much every desktop made in the past 20
       | years). It's available in all the repos, but not included by
       | default (at least in Ubuntu/Mint). It is most useful for
       | compressing disk images on-the-fly while backing them up to
       | network storage. It's usually a good idea to zero unused space
       | first:
       | 
       | (unprivileged commands follow)
       | 
       | dd if=/dev/zero of=~/zeros bs=1M; sync; rm ~/zeros
       | 
        | Compressing on the fly can be slower than your network
        | bandwidth, depending on your network speed, your processor(s)
        | speed, and the compression level, so you typically tune the
        | compression level (the other two variables are not so easy to
        | change). Example backup:
       | Example backup:
       | 
       | (privileged commands follow)
       | 
       | pv < /dev/sda | pigz -9 | ssh user@remote.system dd
       | of=compressed.sda.gz bs=1M
       | 
       | (Note that on slower systems the ssh encryption can also slow
       | things down.)
       | 
       | Some sharp people may notice that it's not necessarily a good
       | idea to back up a live system this way because the filesystem is
       | changing while the system runs. It's usually just fine on an
       | unloaded system that uses a journaling filesystem.
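        | 
        | Restoring is just the same pipeline in reverse (again privileged
        | commands, and only a sketch: adjust names and devices to suit):
        | 
        | ssh user@remote.system 'cat compressed.sda.gz' | unpigz | dd of=/dev/sda bs=1M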
        
         | CGamesPlay wrote:
         | Alternative way of zeroing unused space without consuming all
         | disk space:
         | https://manpages.ubuntu.com/manpages/trusty/man8/zerofree.8....
        
       | rcarmo wrote:
       | I chuckled at the name, since out-of-order results are a typical
       | output of parallelization. Kudos.
        
         | b33j0r wrote:
         | Ah yes, no guarantee of concurrency or ordering (in the
         | headline, lol).
         | 
         | That'd be a pretty funny compression algorithm. You listen to a
         | .mpfoo file, and you'll hear the whole song, we promise!
        
         | XCSme wrote:
         | I also thought the name was clever, but your comment made it
          | even more interesting. Also, my first thought was, "is this
          | safe to use?" I've heard of gzip vulnerabilities before, and a
          | parallel implementation sounds a lot easier to get wrong.
        
           | dspillett wrote:
            | Gzip streams support dictionary resets, which means you can
            | concatenate individually compressed blocks together to make
            | a whole stream.
            | 
            | This is what pigz is doing: splitting the input into blocks,
            | spreading the compression of these blocks over different
            | threads so multiple cores can be used, then joining the
            | results together in the right order.
           | 
           | It is the very same property of the format that gzip's own
           | --rsyncable option makes use of to stop small changes forcing
           | a full file send when rsync (or similar) is used to transfer
           | updated files.
           | 
            | The idea is as simple as it is clever, one of those "why did
            | I not think of that?" ideas that are obvious once someone
            | else has thought of it, so it adds little or no extra risk.
            | A vulnerability that uses gzip (a "compression bomb") or
            | that can cause a gzip tool to errantly run arbitrary code is
            | no more likely to affect pigz than the standard gzip builds.
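            | 
            | You can approximate the same end result by hand with split
            | and xargs (very much a sketch; block size and names are
            | arbitrary):
            | 
            |   split -b 128M -d bigfile part.
            |   ls part.* | xargs -n 1 -P "$(nproc)" gzip
            |   cat part.*.gz > bigfile.gz   # readable by ordinary gunzip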
        
             | apetresc wrote:
             | Given that, why wouldn't this just be upstreamed into gzip?
             | If it's a clean, simple solution that's just expanding the
             | use of a technique that's already in the core binary?
        
               | cldellow wrote:
               | gzip is a pretty old, pretty core program, so I imagine
               | it's largely in maintenance mode, and that there is a lot
               | of friction to pushing large changes into it. At one
               | point, pigz required the pthreads library to build. If it
               | still does, the gzip people would need to consider if
               | that was appropriate for them, and if not, rewrite it to
               | be buildable without it.
               | 
               | There are multiple implementations of zlib that are
               | faster than the one that ships with GNU gzip, and yet
               | they haven't been incorporated.
               | 
               | There are also just better algorithms if compatibility
               | with gzip isn't needed. zstd, for example, supports
               | parallel compression, and is both faster and compresses
               | better than gzip.
        
       | bbertelsen wrote:
       | Warning for the uninitiated. Be cautious using this on a
       | production machine. I recently caused a production system to
       | crash because disk throughput was so high that it started
       | delaying read/writes on a PostgreSQL server. There was panic!
        
       | necovek wrote:
       | Any comparative benchmarks or a write-up on the approach (other
       | than "uses zlib and pthreads" from the README)?
        
         | 331c8c71 wrote:
         | I used it and it was noticeably faster. I didn't write down by
         | how much.
        
         | chasil wrote:
         | Single-threaded gzip can outperform pigz, or at least come very
         | close, when used with GNU xargs on separate files with no
         | dependencies.
         | 
         | https://www.linuxjournal.com/content/parallel-shells-xargs-u...
         | 
         | https://news.ycombinator.com/item?id=26178257
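          | 
          | The xargs form amounts to something like this (a sketch, not
          | the exact command from the article):
          | 
          |   find . -type f -print0 | xargs -0 -n 1 -P "$(nproc)" gzip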
        
           | Xorlev wrote:
           | pigz is most useful on a single stream of data, vs. the more
           | obviously parallel case of files without dependencies.
        
       | xfalcox wrote:
        | One interesting bit of trivia is that since ~2020 Docker will
       | transparently use pigz for decompressing container image layers
       | if it's available on the host. This was a nice speedup for us,
       | since we use large container images and automatic scaling for
       | incoming traffic surges.
        
         | chasil wrote:
         | I think dracut also uses pigz to create the initrd when
         | installing a new Linux kernel rpm package.
        
         | danuker wrote:
         | Have you optimized the low-hanging fruit in your image size?
         | 
         | Because compression programs are as high-hanging fruit as you
         | can get, and parallelizing them can only be done once.
        
       | jaimehrubiks wrote:
       | I used this recently with -0 (no compression) to pack* billions
       | of files into a tar file before sending them over the network. It
        | worked amazingly well.
        
         | anderskaseorg wrote:
         | Why use tar | pigz -0 when you can just use tar?
        
           | jaimehrubiks wrote:
           | I used tar --use-compress-program="pigz" to create the tar
           | out of billions of files
        
             | ac29 wrote:
             | Tar is the archiver here (putting multiple files into one
              | file), pigz with no compression isn't doing anything besides
             | wasting CPU time.
        
             | richard_todd wrote:
             | But what's confusing everyone is that tar cf - will create
             | the tar without any external compression program needed.
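                | 
                | i.e. for an uncompressed archive the whole pipeline can
                | just be tar (paths here are only examples):
                | 
                |   tar -cf - somedir/ | ssh remote 'cat > somedir.tar'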
        
               | koolba wrote:
               | Even the "f -" option is unneeded as the default is to
               | stream to stdout. Though it's always a bit scary to not
               | explicitly specify the destination in case your finger
               | slips and the first target is itself a writeable file.
        
               | jaimehrubiks wrote:
               | I could definitely be wrong here, apologies for the
               | confusion. I run many of these tasks automated, in some
               | cases I used low compression, in others zero compression.
                | For low compression, that command really shines; for
                | zero compression, I would have bet I also got an
                | improvement over regular tar without compression, but
                | again, I could be wrong here. I'll test it again.
        
             | donatj wrote:
             | If you're not going to compress at all, you don't need a
             | compressor at all. All you needed was a .tar and not a
             | .tar.gz
        
         | dividuum wrote:
         | Maybe I'm missing something, but why send the tar generated
         | stream through a non-compressing compressor when you could just
         | send the tar directly?
        
           | jaimehrubiks wrote:
           | I didn't have the tar, I created it using:
           | 
           | tar --use-compress-program="pigz -0" ...
        
             | gruez wrote:
             | But if you don't specify the -z flag when using tar, then
             | it won't be compressed. Why type all that out when omitting
             | one flag does the same thing?
        
       | jiggawatts wrote:
       | Funny this comes up again so soon after I needed it! I recently
       | did a proof-of-concept related to bioinformatics (gene assembly,
       | etc...), and one quirk of that space is that they work with
       | _enormous_ text files. Think tens of gigabytes being a  "normal"
       | size. Just compressing and copying these around is a pain.
       | 
       | One trick I discovered is that tools like pigz can be used to
       | both accelerate the compression step and also copy to cloud
        | storage in parallel! E.g.:
        | 
        |   pigz input.fastq -c | azcopy copy --from-to PipeBlob "https://myaccountname.blob.core.windows.net/inputs/input.fastq.gz?..."
       | 
       | There is a similar pipeline available for s3cmd as well with the
       | same benefit of overlapping the compression and the copy.
       | 
       | However, if your tools support zstd, then it's more efficient to
       | use that instead. Try the "zstd -T0" option or the "pzstd" tool
        | for even higher throughput, but with some minor caveats.
       | 
       | PS: In case anyone here is working on the above tools, I have a
       | small request! What would be awesome is to _automatically_ tune
       | the compression ratio to match the available output bandwidth.
       | With the  '-c' output option, this is easy: just keep increasing
       | the compression level by one notch whenever the output buffer is
       | full, and reduce it by one level whenever the output buffer is
       | empty. This will automatically tune the system to get the maximum
       | total throughput given the available CPU performance and network
       | bandwidth.
        
       | ByThyGrace wrote:
       | On Linux would it Just Work(tm) if you aliased pigz to gzip as a
       | drop-in replacement?
        
         | ndsipa_pomu wrote:
         | In theory, most stuff should work as it's 99% compatible, but
         | there might well be something that breaks. Rather than
         | symlinking it or some such, it's better to configure the
         | necessary tools to use the pigz command instead and then you'll
         | at least find out what works.
         | 
         | FWIW, I configure BackupPC to use pigz instead of gzip without
         | any issues.
        
       | _joel wrote:
       | Use this all the time (or did when I was doing more sysadminy
       | stuff). Useful in all sorts of backup pipelines
        
       | josnyder wrote:
       | This was great in 2012. In 2022, most use-cases should be using
       | parallelized zstd.
        
       | lxe wrote:
       | Protip: if you're on a massively-multicore system and need to
       | tar/gzip a directory full of node_modules, use pigz via `tar -I
       | pigz` or a pipe. The performance increase is incredible.
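        | 
        | A full invocation might look like this (the path is just an
        | example):
        | 
        |   tar -I pigz -cf node_modules.tar.gz node_modules/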
        
       | omoikane wrote:
       | The bit I found most interesting was actually:
       | 
       | https://github.com/madler/pigz/blob/master/try.h
       | 
       | https://github.com/madler/pigz/blob/master/try.c
       | 
       | which implements try/catch for C99.
        
         | dima_vm wrote:
         | But why? Most modern languages try to get rid of exceptions
         | (Go, Kotlin, Rust).
        
           | resoluteteeth wrote:
           | > But why? Most modern languages try to get rid of exceptions
           | (Go, Kotlin, Rust).
           | 
            | All three of those languages actually have exceptions; they
            | just don't encourage catching exceptions as a normal way of
            | error handling.
           | 
           | Also, while the trend now seems to be for newer languages to
           | encourage use of things like result types, one of the main
           | reasons for that is that in current languages it is easier to
            | show that functions can potentially fail in the type
           | system using result types rather than exceptions.
           | 
            | Otherwise, there isn't necessarily a strong inherent reason
           | to prefer one or the other, and it's possible that future
           | languages will go back to exceptions but have a way to
           | express that in the type system using effects, etc.
        
           | jallmann wrote:
           | Golang has panic / recover / defer which are functionally
           | similar to exceptions. It's actually a fun exercise to
           | implement a pseudo-syntax for try/catch/finally in terms of
           | those primitives.
        
             | makapuf wrote:
              | Go _has_ exceptions, but it's definitely not advised to
              | use them as an error mechanism. Recover is really a
              | last-chance effort at recovery, not a standard
              | error-catching method.
        
           | Genbox wrote:
           | Kotlin does have exceptions[1]
           | 
           | [1] https://kotlinlang.org/docs/exceptions.html#java-
           | interoperab...
        
       ___________________________________________________________________
       (page generated 2022-10-17 23:00 UTC)