[HN Gopher] pigz: A parallel implementation of gzip for multi-co...
___________________________________________________________________
 
pigz: A parallel implementation of gzip for multi-core machines
 
Author : firloop
Score  : 113 points
Date   : 2022-10-17 19:19 UTC (3 hours ago)
 
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
 
| sitkack wrote:
| If you really want to enable all cores for compression and decompression, give pbzip2 a try. pigz isn't as parallel as pbzip2.
|
| http://compression.ca/pbzip2/
|
| *edit: as ac29 mentions below, just use zstdmt. In my quick testing it is approximately 8x faster than pbzip2 and gives better compression ratios. Wall clock time went from 41s to 3.5s for a 3.6GB tar of source, pdfs and images AND the resulting file was smaller.
|
|       megs
|       3781  test.tar
|       3041  test.tar.zstd  (default compression 3, 3.5s)
|       3170  test.tar.bz2   (default compression, 8 threads, 40s)
| walrus01 wrote:
| on the other hand, bzip2 is pretty much obsoleted now by xzip
| booi wrote:
| What is xzip? Are you talking about xz?
| walrus01 wrote:
| yes, xz
|
| section 3.6 here
|
| https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Mar...
|
| https://en.wikipedia.org/wiki/XZ_Utils
| chasil wrote:
| The author of lzip has published pointed criticism of the design choices of xz.
|
| I generally use lzip for data that is important to me.
|
| https://www.nongnu.org/lzip/xz_inadequate.html
| ac29 wrote:
| bzip2 is very, very slow though. Some types of data compress quite well with bzip, but if high compression is needed, xz is usually as good or better and natively has multithreading available.
|
| For everything else, there's zstd (also natively multithreaded).
| sitkack wrote:
| Interesting: https://docs.rs/zstd/latest/zstd/stream/write/struct.Encoder...
| iruoy wrote:
| Decompression is multithreaded by default; compression with an argument. However, it is built in.
| ericbarrett wrote:
| We used this to great effect at Facebook for MySQL backups in the early 2010s. The backup hosts had far more CPU than needed, so it was a very nice speed-up over gzip. Eventually we switched to zstd, of course, but pigz never failed us.
| mackman wrote:
| Hey Eric! Hope you're well!
| antisthenes wrote:
| Same, except we were at a small e-commerce boutique running Magento circa 2011-2013.
|
| SQL backups were simply a bash script using pigz, running on a cron job. Simple times!
| Xorlev wrote:
| Pretty similar to that, we used pigz and netcat to bring up new MySQL read replicas in a chain at line speeds.
|
| I recall learning the technique from Tumblr's eng blog.
|
| https://engineering.tumblr.com/post/7658008285/efficiently-c...
| evanelias wrote:
| I wrote that Tumblr eng blog post, glad to see it's still making the rounds! I later joined FB's mysql team a few years after that, although I can't quite remember if FB was still using pigz by that time. (also, hi Eric!)
|
| Separately, at Tumblr I vaguely remember examining some alternative to pigz that was consistently faster at the time (11 years ago) because pigz couldn't parallelize decompression. Can't quite remember the name of the alternative, but it had licensing restrictions which made it less attractive than pigz.
|
| Edit: the old fast alternative I was thinking of is qpress, formerly hosted at http://www.quicklz.com/ but that's no longer online. Googling it now, there are some mirrors and it also looks like Percona tools used/bundled it. Not sure if they still do or if they've since switched to zstd.
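A rough sketch of the pigz-plus-netcat replica cloning trick described a few comments up. The hostname, port, and MySQL data directory path here are illustrative, not taken from the original post:

      # On the new replica: listen, decompress across all cores, unpack
      # (listener flag syntax varies slightly between netcat implementations)
      nc -l 9999 | pigz -d | tar -xf - -C /var/lib/mysql

      # On the source host, with MySQL stopped or a consistent snapshot mounted:
      # tar the data directory, compress on all cores, stream it over the wire
      tar -cf - -C /var/lib/mysql . | pigz | nc new-replica.example.com 9999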
| walrus01 wrote:
| Would not recommend using this in 2022, use zstandard or xzip instead.
|
| zstandard is faster and gives slightly better compression at speed settings equivalent to gzip, in addition to having the ability to compress stuff at a much greater ratio, optionally, if you allow it to take more time and cpu resources.
|
| https://gregoryszorc.com/blog/2017/03/07/better-compression-...
| dspillett wrote:
| pigz has the advantage of producing output that can be read by standard gzip processing tools (including, of course, gzip/gunzip), which are available by default on just about every OS out there, so you get the faster archive creation speed without adding requirements for those who might be accessing the results later.
|
| It works because gzip streams can be tacked together as a single stream: at the start of each block is an instruction to reset the compression dictionary as if it were the start of a file/stream (which in practise it is), so you just have to concatenate the parts coming out of the parallel threads in the right order. These resets cause a small drop in overall compression ratio, but it can be minimised by using large enough blocks.
| walrus01 wrote:
| yes, one consideration is whether you're creating archives for your own later use, or internal use where you also have zstandard and xz handling tools, or to send somewhere else for wider use on unknown platforms.
| dspillett wrote:
| Aye, pick the right tool for the target audience. If you are the target, or you know everyone else who needs to read the output will have the ability to read zstd, go with that. If not, consider pigz. If writing a script that others may run, have it default to gzip but use pigz if available (unless you really don't want that small % drop in compression).
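The "concatenate independently compressed pieces" property described above can be seen with stock gzip. A minimal sketch (as far as I understand, pigz itself emits a single gzip stream with per-block dictionary resets rather than separate members, but the practical effect for readers is the same); the file names are arbitrary:

      # Compress two chunks independently...
      printf 'hello ' | gzip > part1.gz
      printf 'world\n' | gzip > part2.gz

      # ...and the byte-wise concatenation is still valid gzip input
      cat part1.gz part2.gz | gunzip
      # prints: hello world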
| ananonymoususer wrote:
| I use this all the time. It's a big time saver on multi-core machines (which is pretty much every desktop made in the past 20 years). It's available in all the repos, but not included by default (at least in Ubuntu/Mint). It is most useful for compressing disk images on-the-fly while backing them up to network storage. It's usually a good idea to zero unused space first:
|
| (unprivileged commands follow)
|
|       dd if=/dev/zero of=~/zeros bs=1M; sync; rm ~/zeros
|
| Compressing on the fly can be slower than your network bandwidth depending on your network speed, your processor(s) speed, and the compression level, so you typically tune the compression level (because the other two variables are not so easy to change). Example backup:
|
| (privileged commands follow)
|
|       pv < /dev/sda | pigz -9 | ssh user@remote.system dd of=compressed.sda.gz bs=1M
|
| (Note that on slower systems the ssh encryption can also slow things down.)
|
| Some sharp people may notice that it's not necessarily a good idea to back up a live system this way because the filesystem is changing while the system runs. It's usually just fine on an unloaded system that uses a journaling filesystem.
| CGamesPlay wrote:
| Alternative way of zeroing unused space without consuming all disk space: https://manpages.ubuntu.com/manpages/trusty/man8/zerofree.8....
| rcarmo wrote:
| I chuckled at the name, since out-of-order results are a typical output of parallelization. Kudos.
| b33j0r wrote:
| Ah yes, no guarantee of concurrency or ordering (in the headline, lol).
|
| That'd be a pretty funny compression algorithm. You listen to a .mpfoo file, and you'll hear the whole song, we promise!
| XCSme wrote:
| I also thought the name was clever, but your comment made it even more interesting. Also, my first thought was, "is this safe to use?" I've heard of gzip vulnerabilities before, and a parallel implementation sounds a lot easier to get wrong.
| dspillett wrote:
| Gzip streams support dictionary resets, which means you can concatenate individually compressed blocks together to make a whole stream.
|
| This is what pigz is doing: splitting the input into blocks, spreading the compression of these blocks over different threads so multiple cores can be used, then joining the results together in the right order.
|
| It is the very same property of the format that gzip's own --rsyncable option makes use of to stop small changes forcing a full file send when rsync (or similar) is used to transfer updated files.
|
| The idea is as simple as it is clever, one of those "why did I not think of that?" ideas that are obvious once someone else has thought of them, so it adds little or no extra risk. A vulnerability that uses gzip (a "compression bomb") or can cause a gzip tool to errantly run arbitrary code is no more likely to affect pigz than the standard gzip builds.
| apetresc wrote:
| Given that, why wouldn't this just be upstreamed into gzip, if it's a clean, simple solution that's just expanding the use of a technique that's already in the core binary?
| cldellow wrote:
| gzip is a pretty old, pretty core program, so I imagine it's largely in maintenance mode, and that there is a lot of friction to pushing large changes into it. At one point, pigz required the pthreads library to build. If it still does, the gzip people would need to consider whether that was appropriate for them, and if not, rewrite it to be buildable without it.
|
| There are multiple implementations of zlib that are faster than the one that ships with GNU gzip, and yet they haven't been incorporated.
|
| There are also just better algorithms if compatibility with gzip isn't needed. zstd, for example, supports parallel compression, and is both faster and compresses better than gzip.
| bbertelsen wrote:
| Warning for the uninitiated: be cautious using this on a production machine. I recently caused a production system to crash because disk throughput was so high that it started delaying reads/writes on a PostgreSQL server. There was panic!
| necovek wrote:
| Any comparative benchmarks or a write-up on the approach (other than "uses zlib and pthreads" from the README)?
| 331c8c71 wrote:
| I used it and it was noticeably faster. I didn't write down by how much.
| chasil wrote:
| Single-threaded gzip can outperform pigz, or at least come very close, when used with GNU xargs on separate files with no dependencies.
|
| https://www.linuxjournal.com/content/parallel-shells-xargs-u...
|
| https://news.ycombinator.com/item?id=26178257
| Xorlev wrote:
| pigz is most useful on a single stream of data, vs. the more obviously parallel case of files without dependencies.
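A sketch of the xargs approach mentioned above, for the case of many independent files rather than one big stream. The directory and file pattern are placeholders:

      # One plain gzip process per core, one file per process
      find /var/log/archive -type f -name '*.log' -print0 \
        | xargs -0 -P "$(nproc)" -n 1 gzip -9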
| xfalcox wrote:
| One interesting bit of trivia is that since ~2020 Docker will transparently use pigz for decompressing container image layers if it's available on the host. This was a nice speedup for us, since we use large container images and automatic scaling for incoming traffic surges.
| chasil wrote:
| I think dracut also uses pigz to create the initrd when installing a new Linux kernel rpm package.
| danuker wrote:
| Have you optimized the low-hanging fruit in your image size?
|
| Because compression programs are as high-hanging fruit as you can get, and parallelizing them can only be done once.
| jaimehrubiks wrote:
| I used this recently with -0 (no compression) to pack* billions of files into a tar file before sending them over the network. It worked amazingly well.
| anderskaseorg wrote:
| Why use tar | pigz -0 when you can just use tar?
| jaimehrubiks wrote:
| I used tar --use-compress-program="pigz" to create the tar out of billions of files
| ac29 wrote:
| Tar is the archiver here (putting multiple files into one file); pigz with no compression isn't doing anything besides wasting CPU time.
| richard_todd wrote:
| But what's confusing everyone is that tar cf - will create the tar without any external compression program needed.
| koolba wrote:
| Even the "f -" option is unneeded, as the default is to stream to stdout. Though it's always a bit scary to not explicitly specify the destination in case your finger slips and the first target is itself a writeable file.
| jaimehrubiks wrote:
| I could definitely be wrong here, apologies for the confusion. I run many of these tasks automated; in some cases I used low compression, in others zero compression. For low compression, that command really shines. For zero compression, I would have bet I also got an improvement over regular tar without compression, but again, I could be wrong here. I'll test it again.
| donatj wrote:
| If you're not going to compress at all, you don't need a compressor at all. All you needed was a .tar and not a .tar.gz
| dividuum wrote:
| Maybe I'm missing something, but why send the tar-generated stream through a non-compressing compressor when you could just send the tar directly?
| jaimehrubiks wrote:
| I didn't have the tar, I created it using:
|
|       tar --use-compress-program="pigz -0" ...
| gruez wrote:
| But if you don't specify the -z flag when using tar, then it won't be compressed. Why type all that out when omitting one flag does the same thing?
| jiggawatts wrote:
| Funny this comes up again so soon after I needed it! I recently did a proof-of-concept related to bioinformatics (gene assembly, etc...), and one quirk of that space is that they work with _enormous_ text files. Think tens of gigabytes being a "normal" size. Just compressing and copying these around is a pain.
|
| One trick I discovered is that tools like pigz can be used to both accelerate the compression step and also copy to cloud storage in parallel! E.g.:
|
|       pigz input.fastq -c | azcopy copy --from-to PipeBlob "https://myaccountname.blob.core.windows.net/inputs/input.fastq.gz?..."
|
| There is a similar pipeline available for s3cmd as well, with the same benefit of overlapping the compression and the copy.
|
| However, if your tools support zstd, then it's more efficient to use that instead. Try the "zstd -T0" option or the "pzstd" tool for even higher throughput, but with some minor caveats.
|
| PS: In case anyone here is working on the above tools, I have a small request! What would be awesome is to _automatically_ tune the compression ratio to match the available output bandwidth. With the '-c' output option, this is easy: just keep increasing the compression level by one notch whenever the output buffer is full, and reduce it by one level whenever the output buffer is empty. This will automatically tune the system to get the maximum total throughput given the available CPU performance and network bandwidth.
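For what it's worth, zstd already ships something close to the auto-tuning idea above: its --adapt option adjusts the compression level on the fly based on how quickly the output is being drained. A minimal sketch under that assumption; the directory, host, and file names are made up:

      # Let zstd raise or lower its level to keep up with the link speed
      tar -cf - genomes/ | zstd -T0 --adapt -c \
        | ssh user@storage.example.com 'cat > genomes.tar.zst'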
| ByThyGrace wrote:
| On Linux, would it Just Work(tm) if you aliased pigz to gzip as a drop-in replacement?
| ndsipa_pomu wrote:
| In theory, most stuff should work as it's 99% compatible, but there might well be something that breaks. Rather than symlinking it or some such, it's better to configure the necessary tools to use the pigz command instead, and then you'll at least find out what works.
|
| FWIW, I configure BackupPC to use pigz instead of gzip without any issues.
| _joel wrote:
| Use this all the time (or did when I was doing more sysadminy stuff). Useful in all sorts of backup pipelines.
| josnyder wrote:
| This was great in 2012. In 2022, most use-cases should be using parallelized zstd.
| lxe wrote:
| Protip: if you're on a massively-multicore system and need to tar/gzip a directory full of node_modules, use pigz via `tar -I pigz` or a pipe. The performance increase is incredible.
| omoikane wrote:
| The bit I found most interesting was actually:
|
| https://github.com/madler/pigz/blob/master/try.h
|
| https://github.com/madler/pigz/blob/master/try.c
|
| which implements try/catch for C99.
| dima_vm wrote:
| But why? Most modern languages try to get rid of exceptions (Go, Kotlin, Rust).
| resoluteteeth wrote:
| > But why? Most modern languages try to get rid of exceptions (Go, Kotlin, Rust).
|
| All three of those languages actually have exceptions, they just don't encourage catching exceptions as a normal way of error handling.
|
| Also, while the trend now seems to be for newer languages to encourage the use of things like result types, one of the main reasons for that is that in current languages it is easier to show in the type system that functions can potentially fail using result types rather than exceptions.
|
| Otherwise, there isn't necessarily a strong inherent reason to prefer one or the other, and it's possible that future languages will go back to exceptions but have a way to express that in the type system using effects, etc.
| jallmann wrote:
| Golang has panic / recover / defer, which are functionally similar to exceptions. It's actually a fun exercise to implement a pseudo-syntax for try/catch/finally in terms of those primitives.
| makapuf wrote:
| Go _has_ exceptions, but it's definitely not advised to use them as an error mechanism. Recover is really a last-chance effort for recovery, not a standard error catching method.
| Genbox wrote:
| Kotlin does have exceptions[1]
|
| [1] https://kotlinlang.org/docs/exceptions.html#java-interoperab...
___________________________________________________________________
(page generated 2022-10-17 23:00 UTC)