codemadness.org

# xargs: an example for parallel batch jobs

Last modification on 2023-12-17

This describes a simple shell script programming pattern to process a list of
jobs in parallel. Each example script is contained in one file.

# Simple but less optimal example

#!/bin/sh
maxjobs=4

# fake program for example purposes.
someprogram() {
	echo "Yep yep, I'm totally a real program!"
	sleep "$1"
}

# run(arg1, arg2)
run() {
	echo "[$1] $2 started" >&2
	someprogram "$1" >/dev/null
	status="$?"
	echo "[$1] $2 done" >&2
	return "$status"
}

# process the jobs.
j=1
for f in 1 2 3 4 5 6 7 8 9 10; do
	run "$f" "something" &

	jm=$((j % maxjobs)) # shell arithmetic: modulo
	test "$jm" = "0" && wait
	j=$((j+1))
done
wait

# Why is this less optimal

This is less optimal because it waits until all jobs in the same batch are
finished before starting the next batch (each batch contains $maxjobs items).
A single slow job therefore leaves the other job slots idle.

For example with 2 items per batch and 4 total jobs it could be:

* Job 1 is started.
* Job 2 is started.
* Job 2 is done.
* Job 1 is done.
* Wait: wait on process status of all background processes.
* Job 3 in new batch is started.

This could be optimized to:

* Job 1 is started.
* Job 2 is started.
* Job 2 is done.
* Job 3 in new batch is started (immediately).
* Job 1 is done.
* ...

It also does not handle signals such as SIGINT (^C). However, the xargs example
below does:

# Example

#!/bin/sh
maxjobs=4

# fake program for example purposes.
someprogram() {
	echo "Yep yep, I'm totally a real program!"
	sleep "$1"
}

# run(arg1, arg2)
run() {
	echo "[$1] $2 started" >&2
	someprogram "$1" >/dev/null
	status="$?"
	echo "[$1] $2 done" >&2
	return "$status"
}

# child process job.
if test "$CHILD_MODE" = "1"; then
	run "$1" "$2"
	exit "$?"
fi

# generate a list of jobs for processing.
list() {
	for f in 1 2 3 4 5 6 7 8 9 10; do
		printf '%s\0%s\0' "$f" "something"
	done
}

# process jobs in parallel.
list | CHILD_MODE="1" xargs -r -0 -P "${maxjobs}" -L 2 "$(readlink -f "$0")"

# Run and timings

Although the above example is kind of silly, it already shows that queueing
the jobs is more efficient than running them in fixed batches.

Script 1:

time ./script1.sh
[...snip snip...]
real 0m22.095s

Script 2:

time ./script2.sh
[...snip snip...]
real 0m18.120s
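
The difference between the two timings can be checked with a little
arithmetic. The sketch below is not part of the original scripts. It assumes
that job N takes N seconds (as in the examples above), ignores process startup
overhead, and uses the hypothetical function names batched and queued to
estimate the wall-clock time of both strategies.

```shell
#!/bin/sh
# Assumption: job N sleeps N seconds, as in the example scripts.
maxjobs=4
durations="1 2 3 4 5 6 7 8 9 10"

# Strategy 1: fixed batches. Each batch takes as long as its slowest job.
batched() {
	total=0 max=0 j=0
	for d in $durations; do
		test "$d" -gt "$max" && max="$d"
		j=$((j + 1))
		if test "$((j % maxjobs))" = "0"; then
			total=$((total + max))
			max=0
		fi
	done
	# add the last, possibly incomplete, batch.
	echo "$((total + max))"
}

# Strategy 2: a queue. Each job starts as soon as any slot is free.
queued() {
	# one "free at" timestamp per slot, all free at time 0.
	slots=""
	i=0
	while test "$i" -lt "$maxjobs"; do
		slots="$slots 0"
		i=$((i + 1))
	done
	for d in $durations; do
		# find the slot that becomes free first.
		min=""
		for s in $slots; do
			if test -z "$min" || test "$s" -lt "$min"; then
				min="$s"
			fi
		done
		# occupy that slot until min + d.
		new="" replaced=0
		for s in $slots; do
			if test "$replaced" = "0" && test "$s" = "$min"; then
				s=$((min + d))
				replaced=1
			fi
			new="$new $s"
		done
		slots="$new"
	done
	# the total time is when the last slot becomes free.
	end=0
	for s in $slots; do
		test "$s" -gt "$end" && end="$s"
	done
	echo "$end"
}

echo "batched: $(batched)s"
echo "queued: $(queued)s"
```

Under these assumptions it prints 22 seconds for the batched approach and 18
seconds for the queue, which agrees with the measured timings above.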

# How it works

The parent process:

* The parent, using xargs, manages the queue of jobs and schedules each job to
run as a child process.
* The list function writes the parameters to stdout, separated by NUL bytes.
The NUL byte is used as the separator because it cannot occur in filenames
(which can contain spaces or even newlines) and cannot occur in text (the NUL
byte terminates a string buffer).
* The -L option must match the number of arguments that are specified for each
job. It splits the parameters per job.
* The expression "$(readlink -f "$0")" gets the absolute path to the
shell script itself. This is passed to xargs as the executable to run.
* xargs calls the script itself with the specified parameters it is being fed.
The environment variable $CHILD_MODE is set to indicate to the script itself
it is run as a child process of the script.
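
To illustrate why the NUL separator matters, here is a minimal sketch (not
from the original script) in which the fields contain a space and a newline.
It uses an inline sh -c child instead of re-executing the script, and -n 2
instead of -L 2 to group two arguments per job. Both are assumptions made to
keep the sketch self-contained.

```shell
#!/bin/sh
# Emit two jobs of two fields each. The first field of each job contains
# characters that would break whitespace-separated input.
list() {
	printf '%s\0%s\0' "file one" "first"
	printf '%s\0%s\0' "file
two" "second"
}

# -0: fields are NUL-separated. -n 2: pass two arguments per invocation.
# The trailing "sh" becomes $0 of the inline child, the fields become $1, $2.
out=$(list | xargs -0 -n 2 sh -c 'printf "<%s> <%s>\n" "$1" "$2"' sh)
printf '%s\n' "$out"
```

Each invocation receives exactly two intact arguments, including the one with
the embedded newline.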

The child process:

* The command-line arguments are passed by the parent using xargs.

* The environment variable $CHILD_MODE is set to indicate to the script itself
it is run as a child process of the script.

* The script itself (run in child mode) only executes the task and reports its
exit status back to xargs and the parent.

* The exit status of the child program is reported to xargs. This could be
used, for example, to stop on the first failure (this example does not do
that). If the program is killed, stopped or exits with status 255, then xargs
stops running as well.
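
The exit status 255 behaviour can be tried with a small sketch like the one
below. It assumes a GNU or BSD xargs and uses a hypothetical inline child that
exits with 255 on the second job. The jobs run one at a time (no -P) so the
order is predictable.

```shell
#!/bin/sh
# Inline child for illustration: print the job, fail hard on job 2.
child='
echo "job $1"
test "$1" = "2" && exit 255
exit 0
'

# Four jobs are queued, but xargs stops after the child exits with 255.
out=$(printf '%s\0' 1 2 3 4 | xargs -0 -n 1 sh -c "$child" sh 2>/dev/null)
status=$?

printf '%s\n' "$out"
echo "xargs exit status: $status" >&2
```

Only jobs 1 and 2 are run and xargs itself exits with a non-zero status. A
child could use this to abort the whole queue on the first failure.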

# Description of used xargs options

(HTM) From the OpenBSD man page: »https://man.openbsd.org/xargs«

xargs - construct argument list(s) and execute utility

Options explained:

* -r: Do not run the command if there are no arguments. Normally the command
is executed at least once even if there are no arguments.
* -0: Change xargs to expect NUL ('\0') characters as separators, instead of
spaces and newlines.
* -P maxprocs: Parallel mode: run at most maxprocs invocations of utility
at once.
* -L number: Call utility for every number of non-empty lines read. A line
ending in unescaped white space and the next non-empty line are considered
to form one single line. If EOF is reached and fewer than number lines have
been read then utility will be called with the available lines.
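
A small sketch of the -r and -0 options in isolation, assuming an xargs that
supports both. It uses -n 1 (one argument per invocation), which is not in the
list above, to keep the grouping simple. The -P option is left out because it
makes the output order unpredictable.

```shell
#!/bin/sh
# Empty input: with -r the utility is not run at all, so nothing is printed.
empty=$(printf '' | xargs -r -0 echo "ran")

# Three NUL-separated fields (one contains a space), one invocation each.
fields=$(printf '%s\0' "a b" "c" "d" | xargs -r -0 -n 1 echo "arg:")

printf 'empty input: [%s]\n' "$empty"
printf '%s\n' "$fields"
```

The field "a b" is passed as a single argument because only the NUL byte
separates fields.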

# xargs options -0 and -P, portability and historic context

Some of the options, like -P, are non-POSIX as of writing (2023):
(HTM) https://pubs.opengroup.org/onlinepubs/9699919799/utilities/xargs.html.
However, many systems have supported this useful extension for many years.

The specification even mentions implementations which support parallel
operations:

"The version of xargs required by this volume of POSIX.1-2017 is required to
wait for the completion of the invoked command before invoking another command.
This was done because historical scripts using xargs assumed sequential
execution. Implementations wanting to provide parallel operation of the invoked
utilities are encouraged to add an option enabling parallel invocation, but
should still wait for termination of all of the children before xargs
terminates normally."

Some historic context:

The xargs -0 option was added on 1996-06-11 by Theo de Raadt, about a year
after the NetBSD import (over 27 years ago at the time of writing):

(HTM) CVS log

On OpenBSD the xargs -P option was added on 2003-12-06 by syncing the FreeBSD
code:

(HTM) CVS log

Looking at the imported git history log of GNU findutils (which has xargs), the
very first commit already had the -0 and -P options:

(HTM) git log

commit c030b5ee33bbec3c93cddc3ca9ebec14c24dbe07
Author: Kevin Dalley <kevin@seti.org>
Date: Sun Feb 4 20:35:16 1996 +0000

Initial revision

# xargs: some incompatibilities found

* With the -0 option, empty fields are handled differently in different
implementations.
* The -n and -L options do not work correctly in many of the BSD
implementations. Some count empty fields, some don't. Early implementations in
FreeBSD and OpenBSD only processed the first line. In OpenBSD this was improved
around 2017.

Depending on what you want to do, a workaround could be to use the -0 option
with a single field per job and use the -n flag. Then, in each child program
invocation, split the field by a separator.
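
A minimal sketch of this workaround, under the assumption that a tab never
occurs in the first value so it can serve as the separator. The inline child
and the variable names are hypothetical. In the real script the split would be
done in the CHILD_MODE branch.

```shell
#!/bin/sh
# Each job is ONE NUL-terminated field: two values joined by a tab.
list() {
	for f in 1 2 3; do
		printf '%s\t%s\0' "$f" "some thing"
	done
}

# The child splits the single field on the first tab.
child='
tab=$(printf "\t")
first=${1%%"$tab"*}	# everything before the first tab
rest=${1#*"$tab"}	# everything after the first tab
printf "[%s] [%s]\n" "$first" "$rest"
'

# -n 1: exactly one field per invocation, so no grouping by -L is needed.
out=$(list | xargs -r -0 -n 1 sh -c "$child" sh)
printf '%s\n' "$out"
```

Because every job is a single field, the differences in how implementations
count fields for -L no longer matter.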

# References

(HTM) * xargs: »https://man.openbsd.org/xargs«
(HTM) * printf: »https://man.openbsd.org/printf«
(HTM) * ksh, wait: »https://man.openbsd.org/ksh#wait«
(HTM) * wait(2): »https://man.openbsd.org/wait«