# Get More Done at the Same Time with GNU Parallel

Do you ever get the funny fealing that your computer isn't quite as
fast as it should be? I used to feel that way, and then I found GNU
Parallel.

GNU Parallel is a shell utility for executing jobs in parallel. It can
parse multiple inputs, thereby running your script or command against
sets of data at the same time. You can use *all* your CPU at last!

If you've ever used `xargs`, then you already know how to use
parallel. If you don't, then this article will teach you, along with
many other use cases.


## Installing GNU Parallel

GNU Parallel may not come pre-installed on your Linux or BSD
computer. Install it from your repository or ports collection. For
example, on Fedora:

    $ sudo dnf install parallel

Or on NetBSD:

    # pkg_add parallel

If all else fails, refer to the project home page at
https://www.gnu.org/software/parallel.


## From serial to parallel

As its name suggests, parallel's strength is that it runs jobs in
parallel rather than, as many of us still do, sequentially.

When you run one command against many objects, you're inherently creating a queue. Some number of objects can be processed by the command, and all the other objects just stand around and wait their turn. It's inefficient. Given enough data, there's always going to be a queue, but instead of having just one queue, why not have lots of small queues? 

Imagine you have a folder full of images you want to convert from JPEG to PNG. There are many ways to do this. There's the manual way of opening each image in GIMP and exporting it to the new format. That's usually the worst possible way. It's not only time-intensive, it's labour intensive.

A pretty neat variation on this theme is the shell-based solution:

    $ convert 001.jpeg 001.png
    $ convert 002.jpeg 002.png
    $ convert 003.jpeg 003.png
    ... and so on ...

It's a neat trick when you first learn it, and at first it's a vast improvement. No need for a GUI and constant clicking. But it's still labour-intensive.

Better still:

    $ for i in *jpeg; do convert $i $i.png ; done

This, at least, sets the job(s) in motion and frees you up to do more productive things. The problem is, it's still a serial process. One image gets converted, and then the next one in the queue steps up for conversion, and so on, until the queue has been emptied.

With parallel:

    $ find . -name "*jpeg" | parallel -I% --max-args 1 convert % %.png

This is a combination of two commands: the `find` command, which gathers the objects you want to operate on, and the `parallel` command, which sorts through the objects and makes sure everything gets processed as required.

  * `find . -name "*jpeg"` finds all files in the current directory that end in `jpeg`.

* `parallel` invokes GNU Parallel.

  * `-I%` creates a placeholder, called `%`, to stand in for whatever `find` hands over to parallel. You use this because otherwise you'd have to manually write a new command for each result of `find`, and that's exactly what you're trying to avoid.

  * `--max-args 1` limits the rate at which parallel requests a new object from the queue. Since the command parallel is running only requires one file, you limit the rate to 1. Were you doing a more complex command that required two files (such as `cat 001.txt 002.txt > new.txt`) then you would limit the rate to 2.

  * `convert % %.png` is the command you want to run in parallel.

The result of that command is that `find` gathers all relevant files and hands them over to parallel, which launches the job and immediately requests the next in line. Parallel continues to do this for as long as it is safe for it to launch new jobs without crippling your computer, and as old jobs are completed it replaces them with new ones, until all the data being provided to it has been processed. What took 10 minutes before might take only 5 or 3 with parallel.


## Multiple inputs

The `find` command is an excellent gateway to `parallel` as long as you're familiar with `find` and `xargs` (collectively called GNU Find Utilities, or findutils). It provides a flexible interface that many Linux users are already comfortable with, but that's pretty easy to learn if you're a newcomer.

The `find` command is fairly straight-forward: you provide find with a path to a directory that you want to search, and then some portion of the filename you want to search for. You can use wildcard characters to cast your net wider; in this example, the asterisk indicates *anything*, so `find` would find all files that end with the string "searchterm":

    $ find /path/to/directory -name "*searchterm"

By default, `find` returns the results of its search one item at a time, with one item per line:

    $ find ~/graphics -name "*jpg"
    /home/seth/graphics/001.jpg
    /home/seth/graphics/cat.jpg
    /home/seth/graphics/penguin.jpg
    /home/seth/graphics/IMG_0135.jpg

When you pipe the results of `find` to `parallel`, each item on each line is treated as one argument to the command that `parallel` is arbitrating. If, on the other hand, you need to process more than one argument in one command, you can split up way the data in queue is handed over to `parallel`.

Here's a simple, unrealistic example, which I'll later turn into something more useful. You can follow along with this example, as long as you have GNU Parallel installed.

Assume you have four files. List them, one per line, to see exactly what you have:

    $ echo ada > ada ; echo lovelace > lovelace
    $ echo richard > richard ; echo stallman > stallman
    $ ls -1
    ada
    lovelace
    richard
    stallman

You want to combine two files into a third, containing the contents of both files. This requires that parallel has access to two files, so the `-I%` variable won't work in this case.

The default behaviour of parallel is basically invisible:

    $ ls -1 | parallel echo
    ada
    lovelace
    richard
    stallman

Now tell parallel that you want to get two objects per job:

    $ ls -1 | parallel --max-args=2 echo
    ada lovelace
    richard stallman

You see now that lines have been combined. Specifically, *two* results from `ls -1` are passed to parallel all at once. That's the right number of arguments for this task, but they're effectively one argument right now; "ada lovelace" and "richard stallman". What you actually want is to get two distinct arguments per job.

Luckily, that technicality is parsed by parallel itself. If you set `--jobs` to `2`, you get two variables, `{1}` and `{2}`, representing the first and second parts of the argument: 

    $ ls -1 | parallel --max-args=2 --jobs 2 cat {1} {2} ">" {1}_{2}.person

In this command, the variable `{1}` is ada or richard (depending on which job you look at) and `{2}` is either `lovelace` or `stallman`. The contents of the files are redirected with a redirect symbol *in quotes* (the quotes grab the redirect symbol from Bash so that parallel can use it) and placed into new files called ada_lovelace.person and richard_stallman.person.

    $ ls -1
    ada
    ada_lovelace.person
    lovelace
    richard
    richard_stallman.person
    stallman

    $ cat ada_*person
    ada lovelace
    $ cat ri*person
    richard stallman

If you spend all day parsing log files that are hundreds of megabytes in size, you might see how parallelized text parsing could be useful to you, but otherwise this is mostly a demonstrative exercise.

However, this kind of processing is invaluable for more than just text parsing. Here's a real life example from the film world. Consider a directory of video files and audio files that need to be joined together.

    $ ls -1
    12_LS_establishing-manor.avi
    12_wildsound.flac
    14_butler-dialogue-mixed.flac
    14_MS_butler.avi
    ...and so on...

Using the same principles, a simple command can be created so that the files are combined *in parallel*:

    $ ls -1 | parallel --max-args=2 --jobs 2 ffmpeg -i {1} -i {2} -vcodec copy -acodec copy {1}.mkv


## Brute. Force.

All this fancy input and output parsing isn't to everyone's taste. If you prefer a more direct approach, you can throw commands at parallel and walk away.

First, create a text file with one command on each line:

    $ cat jobs2run
    bzip2 oldstuff.tar
    oggenc music.flac
    opusenc ambiance.wav
    convert bigfile.tiff small.jpeg
    ffmepg -i foo.avi -v:b 12000k foo.mp4
    xsltproc --output build/tmp.fo style/dm.xsl src/tmp.xml
    bzip2 archive.tar

Then hand the file over to parallel:

    $ parallel --jobs 6 < jobs2run

And now all jobs in your file are run in parallel. If more jobs exist than jobs allowed, a queue is formed and maintained by parallel until all jobs have run.


## Much much more

GNU Parallel is a powerful and flexible tool, with far more use cases than can fit into this article. Its man page provides examples of really cool things you can do with it, from remote execution over SSH to incorporating Bash functions into your parallel commands. There's even an extensive demonstration series on [youtube](https://www.youtube.com/watch?v=OpaiGYxkSuQ&list=PL284C9FF2488BC6D1), so you can learn from the GNU Parallel team directly.

GNU Parallel has the power to change the way you compute, and if doesn't do that, it will at the very least change the time your computer spends computing. Try it today!