# Get More Done at the Same Time with GNU Parallel

Do you ever get the funny feeling that your computer isn't quite as fast as it should be? I used to feel that way, and then I found GNU Parallel.

GNU Parallel is a shell utility for executing jobs in parallel. It can parse multiple inputs, thereby running your script or command against sets of data at the same time. You can use *all* your CPU at last!

If you've ever used `xargs`, then you already know how to use parallel. If you haven't, then this article will teach you, along with many other use cases.

## Installing GNU Parallel

GNU Parallel may not come pre-installed on your Linux or BSD computer. Install it from your repository or ports collection. For example, on Fedora:

    $ sudo dnf install parallel

Or on NetBSD:

    # pkg_add parallel

If all else fails, refer to the project home page at https://www.gnu.org/software/parallel.

## From serial to parallel

As its name suggests, parallel's strength is that it runs jobs in parallel rather than, as many of us still do, sequentially.

When you run one command against many objects, you're inherently creating a queue. Some number of objects can be processed by the command, and all the other objects just stand around and wait their turn. It's inefficient. Given enough data, there's always going to be a queue, but instead of having just one queue, why not have lots of small queues?

Imagine you have a folder full of images you want to convert from JPEG to PNG. There are many ways to do this. There's the manual way of opening each image in GIMP and exporting it to the new format. That's usually the worst possible way. It's not only time-intensive, it's labour-intensive.

A pretty neat variation on this theme is the shell-based solution:

    $ convert 001.jpeg 001.png
    $ convert 002.jpeg 002.png
    $ convert 003.jpeg 003.png
    ... and so on ...

It's a neat trick when you first learn it, and at first it's a vast improvement. No need for a GUI and constant clicking.
But it's still labour-intensive. Better still:

    $ for i in *jpeg; do convert "$i" "$i.png" ; done

This, at least, sets the job(s) in motion and frees you up to do more productive things. The problem is, it's still a serial process. One image gets converted, and then the next one in the queue steps up for conversion, and so on, until the queue has been emptied.

With parallel:

    $ find . -name "*jpeg" | parallel -I% --max-args 1 convert % %.png

This is a combination of two commands: the `find` command, which gathers the objects you want to operate on, and the `parallel` command, which sorts through the objects and makes sure everything gets processed as required.

* `find . -name "*jpeg"` finds all files in the current directory and its subdirectories whose names end in `jpeg`.
* `parallel` invokes GNU Parallel.
* `-I%` creates a placeholder, called `%`, to stand in for whatever `find` hands over to parallel. You use this because otherwise you'd have to manually write a new command for each result of `find`, and that's exactly what you're trying to avoid.
* `--max-args 1` sets how many objects from the queue parallel hands to each job. Since the command parallel is running requires only one file, you set it to 1. Were you doing a more complex command that required two files (such as `cat 001.txt 002.txt > new.txt`), you would set it to 2.
* `convert % %.png` is the command you want to run in parallel.

The result of that command is that `find` gathers all relevant files and hands them over to parallel, which launches a job and immediately requests the next in line. Parallel continues to do this for as long as it is safe for it to launch new jobs without crippling your computer (by default, it runs one job per CPU core), and as old jobs are completed it replaces them with new ones, until all the data being provided to it has been processed. What took 10 minutes before might take only 5 or 3 with parallel.

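If you want to try the pipeline above without touching real images, here is a minimal sketch. It assumes nothing from the article beyond the `find | parallel` command itself: the file names are made up, `cp` stands in for `convert` so that ImageMagick isn't required, and the serial loop is used as a fallback if GNU Parallel isn't installed.

```shell
#!/bin/sh
# Sketch: the find | parallel pipeline, with cp standing in for convert.
# The dummy file names and the temporary directory are illustrative only.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Create three dummy "images" to act as the queue.
for n in 001 002 003; do
    printf 'fake image %s\n' "$n" > "$n.jpeg"
done

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # One file per job; % is replaced by each path that find prints.
    find . -name "*jpeg" | parallel -I% --max-args 1 cp % %.png
else
    # Fallback when GNU Parallel is not installed: the serial loop.
    for i in *jpeg; do cp "$i" "$i.png"; done
fi

# Count the results; either branch should produce one .png per .jpeg.
count=$(find . -name "*.png" | wc -l | tr -d ' ')
echo "converted: $count"

# Clean up the temporary directory.
cd /
rm -rf "$workdir"
```

Swap `cp` back to `convert` and point `find` at a real folder once you're happy with how the placeholder expands.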
## Multiple inputs

The `find` command is an excellent gateway to `parallel` as long as you're familiar with `find` and `xargs` (collectively called GNU Find Utilities, or findutils). It provides a flexible interface that many Linux users are already comfortable with, and that's pretty easy to learn if you're a newcomer.

The `find` command is fairly straightforward: you provide `find` with a path to a directory that you want to search, and then some portion of the filename you want to search for. You can use wildcard characters to cast your net wider; in this example, the asterisk indicates *anything*, so `find` would find all files that end with the string "searchterm":

    $ find /path/to/directory -name "*searchterm"

By default, `find` returns the results of its search one item at a time, with one item per line:

    $ find ~/graphics -name "*jpg"
    /home/seth/graphics/001.jpg
    /home/seth/graphics/cat.jpg
    /home/seth/graphics/penguin.jpg
    /home/seth/graphics/IMG_0135.jpg

When you pipe the results of `find` to `parallel`, each item on each line is treated as one argument to the command that `parallel` is arbitrating. If, on the other hand, you need to process more than one argument in one command, you can split up the way the data in the queue is handed over to `parallel`.

Here's a simple, unrealistic example, which I'll later turn into something more useful. You can follow along with this example, as long as you have GNU Parallel installed.

Assume you have four files. List them, one per line, to see exactly what you have:

    $ echo ada > ada ; echo lovelace > lovelace
    $ echo richard > richard ; echo stallman > stallman
    $ ls -1
    ada
    lovelace
    richard
    stallman

You want to combine two files into a third, containing the contents of both files. This requires that parallel has access to two files, so the `-I%` variable won't work in this case.
The default behaviour of parallel is basically invisible:

    $ ls -1 | parallel echo
    ada
    lovelace
    richard
    stallman

Now tell parallel that you want to get two objects per job:

    $ ls -1 | parallel --max-args=2 echo
    ada lovelace
    richard stallman

You see now that lines have been combined. Specifically, *two* results from `ls -1` are passed to parallel all at once. That's the right number of arguments for this task, but they're effectively one argument right now: "ada lovelace" and "richard stallman". What you actually want is to get two distinct arguments per job.

Luckily, that technicality is handled by parallel itself. With two arguments per job, you get two variables, `{1}` and `{2}`, representing the first and second parts of the argument. Setting `--jobs` to `2` tells parallel to run two jobs at a time:

    $ ls -1 | parallel --max-args=2 --jobs 2 cat {1} {2} ">" {1}_{2}.person

In this command, the variable `{1}` is `ada` or `richard` (depending on which job you look at) and `{2}` is either `lovelace` or `stallman`. The contents of the files are redirected with a redirect symbol *in quotes* (the quotes protect the redirect symbol from Bash so that parallel can use it) and placed into new files called `ada_lovelace.person` and `richard_stallman.person`.

    $ ls -1
    ada
    ada_lovelace.person
    lovelace
    richard
    richard_stallman.person
    stallman
    $ cat ada_*person
    ada
    lovelace
    $ cat ri*person
    richard
    stallman

If you spend all day parsing log files that are hundreds of megabytes in size, you might see how parallelized text parsing could be useful to you, but otherwise this is mostly a demonstrative exercise.

However, this kind of processing is invaluable for more than just text parsing. Here's a real-life example from the film world. Consider a directory of video files and audio files that need to be joined together.

    $ ls -1
    12_LS_establishing-manor.avi
    12_wildsound.flac
    14_butler-dialogue-mixed.flac
    14_MS_butler.avi
    ...and so on...

Using the same principles, a simple command can be created so that the files are combined *in parallel*:

    $ ls -1 | parallel --max-args=2 --jobs 2 ffmpeg -i {1} -i {2} -vcodec copy -acodec copy {1}.mkv

## Brute. Force.

All this fancy input and output parsing isn't to everyone's taste. If you prefer a more direct approach, you can throw commands at parallel and walk away.

First, create a text file with one command on each line:

    $ cat jobs2run
    bzip2 oldstuff.tar
    oggenc music.flac
    opusenc ambiance.wav
    convert bigfile.tiff small.jpeg
    ffmpeg -i foo.avi -b:v 12000k foo.mp4
    xsltproc --output build/tmp.fo style/dm.xsl src/tmp.xml
    bzip2 archive.tar

Then hand the file over to parallel:

    $ parallel --jobs 6 < jobs2run

And now all jobs in your file are run in parallel. If more jobs exist than jobs allowed, a queue is formed and maintained by parallel until all jobs have run.

## Much much more

GNU Parallel is a powerful and flexible tool, with far more use cases than can fit into this article. Its man page provides examples of really cool things you can do with it, from remote execution over SSH to incorporating Bash functions into your parallel commands. There's even an extensive demonstration series on [YouTube](https://www.youtube.com/watch?v=OpaiGYxkSuQ&list=PL284C9FF2488BC6D1), so you can learn from the GNU Parallel team directly.

GNU Parallel has the power to change the way you compute, and if it doesn't do that, it will at the very least change the time your computer spends computing. Try it today!