       
       
       # xargs: an example for parallel batch jobs
       
       Last modification on 2023-12-17
       
       This describes a simple shell script pattern for processing a list of jobs in
       parallel. The whole example is contained in a single script file.
       
       
       # Simple but less optimal example
       
               #!/bin/sh
               maxjobs=4
               
               # fake program for example purposes.
               someprogram() {
                       echo "Yep yep, I'm totally a real program!"
                       sleep "$1"
               }
               
               # run(arg1, arg2)
               run() {
                       echo "[$1] $2 started" >&2
                       someprogram "$1" >/dev/null
                       status="$?"
                       echo "[$1] $2 done" >&2
                       return "$status"
               }
               
               # process the jobs.
               j=1
               for f in 1 2 3 4 5 6 7 8 9 10; do
                       run "$f" "something" &
               
                       jm=$((j % maxjobs)) # shell arithmetic: modulo
                       test "$jm" = "0" && wait
                       j=$((j+1))
               done
               wait
       
       
       # Why is this less optimal
       
       This is less optimal because it waits until every job in the current batch has
       finished before it starts the next batch (each batch contains $maxjobs items).
       
       For example, with 2 items per batch and 4 jobs in total, the order could be:
       
       * Job 1 is started.
       * Job 2 is started.
       * Job 2 is done.
       * Job 1 is done.
       * Wait: wait on process status of all background processes.
       * Job 3 in new batch is started.
       
       
       This could be optimized to:
       
       * Job 1 is started.
       * Job 2 is started.
       * Job 2 is done.
       * Job 3 in new batch is started (immediately).
       * Job 1 is done.
       * ...
       
       
       It also does not handle signals such as SIGINT (^C). The xargs example below
       does:
       
       
       # Example
       
               #!/bin/sh
               maxjobs=4
               
               # fake program for example purposes.
               someprogram() {
                       echo "Yep yep, I'm totally a real program!"
                       sleep "$1"
               }
               
               # run(arg1, arg2)
               run() {
                       echo "[$1] $2 started" >&2
                       someprogram "$1" >/dev/null
                       status="$?"
                       echo "[$1] $2 done" >&2
                       return "$status"
               }
               
               # child process job.
               if test "$CHILD_MODE" = "1"; then
                       run "$1" "$2"
                       exit "$?"
               fi
               
               # generate a list of jobs for processing.
               list() {
                       for f in 1 2 3 4 5 6 7 8 9 10; do
                               printf '%s\0%s\0' "$f" "something"
                       done
               }
               
               # process jobs in parallel.
               list | CHILD_MODE="1" xargs -r -0 -P "${maxjobs}" -L 2 "$(readlink -f "$0")"
       
       
       # Run and timings
       
       Although the above example is kind of contrived, it already shows that queueing
       the jobs is more efficient than running them in batches.
       
       Script 1:
       
               time ./script1.sh
               [...snip snip...]
               real    0m22.095s
       
       Script 2:
       
               time ./script2.sh
               [...snip snip...]
               real    0m18.120s
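
       The two timings can be roughly reproduced on paper. The following is a
       minimal sketch, not part of the original scripts: it assumes job N takes N
       seconds (as in the fake program above) and ignores process startup overhead.
       The variable names are made up for this illustration.

```shell
#!/bin/sh
# Estimate the wall-clock time of both scripts without running any sleeps.
# Assumption: job N takes N seconds and scheduling overhead is ignored.
maxjobs=4
durations="1 2 3 4 5 6 7 8 9 10"

# Script 1 (batches): a batch takes as long as its slowest job.
batch_total=0
slowest=0
j=1
for d in $durations; do
        test "$d" -gt "$slowest" && slowest="$d"
        if test "$((j % maxjobs))" = "0"; then
                batch_total=$((batch_total + slowest))
                slowest=0
        fi
        j=$((j + 1))
done
batch_total=$((batch_total + slowest)) # the last, partial batch

# Script 2 (queue): each job takes the slot that becomes free first.
# The positional parameters hold the time at which each slot is free again.
set --
i=0
while test "$i" -lt "$maxjobs"; do
        set -- "$@" 0
        i=$((i + 1))
done

for d in $durations; do
        # find the earliest time at which a slot is free.
        min="$1"
        for t in "$@"; do
                test "$t" -lt "$min" && min="$t"
        done
        # rebuild the slot list, giving the job to the first slot free at $min.
        taken=0
        i=0
        while test "$i" -lt "$maxjobs"; do
                t="$1"
                shift
                if test "$taken" = "0" && test "$t" = "$min"; then
                        t=$((min + d))
                        taken=1
                fi
                set -- "$@" "$t"
                i=$((i + 1))
        done
done

# the queue is finished when the last slot becomes free.
queue_total=0
for t in "$@"; do
        test "$t" -gt "$queue_total" && queue_total="$t"
done

echo "batches: ${batch_total}s, queue: ${queue_total}s"
```

       This prints "batches: 22s, queue: 18s", which is in line with the measured
       22.095s and 18.120s above.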
       
       
       # How it works
       
       The parent process:
       
       * The parent, using xargs, handles the queue of jobs and schedules the jobs to
         execute as a child process.
       * The list function writes the parameters to stdout. These parameters are
         separated by the NUL byte. The NUL byte is used because it cannot occur in
         filenames (which can contain spaces or even newlines) and cannot occur in
         a command-line argument (the NUL byte terminates a C string).
       * The -L option must match the number of arguments that are specified for
         each job. xargs uses it to split the list of parameters into jobs.
       * The expression "$(readlink -f "$0")" gets the absolute path to the
         shellscript itself. This is passed as the executable to run for xargs.
       * xargs calls the script itself with the specified parameters it is being fed.
         The environment variable $CHILD_MODE is set to indicate to the script itself
         it is run as a child process of the script.
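
       The grouping done by xargs can be tried in isolation. This is a minimal
       sketch and not part of the script above: the inline "sh -c" command and the
       field values are made up for illustration, and it uses -n 2 (at most two
       arguments per invocation) where the script above uses -L 2.

```shell
#!/bin/sh
# emit two NUL-terminated fields per job, like list() in the script above.
list() {
        printf '%s\0%s\0' "1" "first job"
        printf '%s\0%s\0' "2" "second job"
}

# -n 2: pass two arguments per invocation (assumption: used instead of -L 2).
# The trailing "sh" becomes $0 of the inline script, so the fields are $1 and $2.
out="$(list | xargs -r -0 -n 2 sh -c 'printf "[%s] [%s]\n" "$1" "$2"' sh)"
printf '%s\n' "$out"
```

       Each output line shows one invocation with its two fields. The space in
       "first job" does not split the field, because only the NUL byte separates
       fields.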
       
       
       The child process:
       
       * The command-line arguments are passed by the parent using xargs.
       
       * The environment variable $CHILD_MODE is set to indicate to the script itself
         it is run as a child process of the script.
       
       * The script itself (run in child mode) only executes the task and reports
         its status back to xargs and the parent.
       
       * The exit status of the child program is reported to xargs. This could be
         handled, for example to stop on the first failure (this example does not).
         If the program is killed or stopped by a signal, or exits with status 255,
         xargs stops running as well.
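
       The effect of exit status 255 can be shown with a small sketch. The inline
       job below is hypothetical: it fails with 255 on the second item. The jobs
       run one at a time here (no -P), so the output order is fixed.

```shell
#!/bin/sh
# Three jobs; the second one exits with 255, which makes xargs stop
# processing the remaining input.
out="$(printf '%s\0' 1 2 3 | xargs -0 -n 1 sh -c '
        echo "job $1"
        test "$1" = "2" && exit 255
        exit 0
' sh 2>/dev/null)"
status="$?" # exit status of xargs itself

printf '%s\n' "$out"
echo "xargs exit status is non-zero: $(test "$status" -ne 0 && echo yes || echo no)"
```

       Job 3 is never started. The exact non-zero exit status of xargs depends on
       the implementation.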
       
       
       # Description of used xargs options
       
       From the OpenBSD man page: »https://man.openbsd.org/xargs«
       
               xargs - construct argument list(s) and execute utility
       
       Options explained:
       
       * -r: Do not run the command if there are no arguments. Normally the command
         is executed at least once even if there are no arguments.
       * -0: Change xargs to expect NUL ('\0') characters as separators, instead of
         spaces and newlines.
       * -P maxprocs: Parallel mode: run at most maxprocs invocations of utility
         at once.
       * -L number: Call utility for every number of non-empty lines read. A line
         ending in unescaped white space and the next non-empty line are considered
         to form one single line. If EOF is reached and fewer than number lines have
         been read then utility will be called with the available lines.
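
       The -0 and -r options can be checked with a short sketch. The input data
       and the inline "sh -c" command are made up for illustration.

```shell
#!/bin/sh
# -0: a field containing a space and a newline stays one argument,
# so the two NUL-terminated fields below arrive as two arguments.
count="$(printf 'a b\nc\0d\0' | xargs -0 sh -c 'echo "$#"' sh)"

# -r: empty input means the command is not run, so nothing is printed.
empty="$(printf '' | xargs -r echo "ran")"

echo "arguments: $count"
echo "output on empty input: '$empty'"
```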
       
       
       # xargs options -0 and -P, portability and historic context
       
       Some of the options, like -P, are non-POSIX at the time of writing (2023):
       https://pubs.opengroup.org/onlinepubs/9699919799/utilities/xargs.html
       However, many systems have supported this useful extension for many years.
       
       The specification even mentions implementations which support parallel
       operations:
       
       "The version of xargs required by this volume of POSIX.1-2017 is required to
       wait for the completion of the invoked command before invoking another command.
       This was done because historical scripts using xargs assumed sequential
       execution. Implementations wanting to provide parallel operation of the invoked
       utilities are encouraged to add an option enabling parallel invocation, but
       should still wait for termination of all of the children before xargs
       terminates normally."
       
       
       Some historic context:
       
       The xargs -0 option was added on 1996-06-11 by Theo de Raadt, about a year
       after the NetBSD import (over 27 years ago at the time of writing):
       
       CVS log
       
       On OpenBSD the xargs -P option was added on 2003-12-06 by syncing the FreeBSD
       code:
       
       CVS log
       
       
       Looking at the imported git history log of GNU findutils (which has xargs), the
       very first commit already had the -0 and -P options:
       
       git log
       
               commit c030b5ee33bbec3c93cddc3ca9ebec14c24dbe07
               Author: Kevin Dalley <kevin@seti.org>
               Date:   Sun Feb 4 20:35:16 1996 +0000
               
                   Initial revision
       
       
       # xargs: some incompatibilities found
       
       * With the -0 option, empty fields are handled differently by different
         implementations.
       * The -n and -L options do not work correctly in many of the BSD
         implementations. Some count empty fields, some do not. Early
         implementations in FreeBSD and OpenBSD only processed the first line. In
         OpenBSD this was improved around 2017.
       
       Depending on what you want to do, a workaround could be to use the -0 option
       with a single field per job together with the -n flag. Then, in each child
       program invocation, split the field by a separator.
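
       The workaround above can be sketched as follows. The tab separator, the
       use of -n 1 and the inline child script are assumptions made for this
       illustration; pick a separator that cannot occur in the values.

```shell
#!/bin/sh
# One NUL-terminated field per job; the two values are joined by a tab.
list() {
        printf '%s\t%s\0' "1" "first job"
        printf '%s\t%s\0' "2" "second job"
}

# child side: split the single field on the first tab.
child='
tab="$(printf "\t")"
id=${1%%"$tab"*}   # everything before the first tab
name=${1#*"$tab"}  # everything after the first tab
printf "[%s] [%s]\n" "$id" "$name"
'

# -n 1: one field (one job) per invocation.
out="$(list | xargs -r -0 -n 1 sh -c "$child" sh)"
printf '%s\n' "$out"
```

       Because every job is exactly one field, the result no longer depends on how
       an implementation groups several fields per invocation.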
       
       
       # References
       
       * xargs: »https://man.openbsd.org/xargs«
       * printf: »https://man.openbsd.org/printf«
       * ksh, wait: »https://man.openbsd.org/ksh#wait«
       * wait(2): »https://man.openbsd.org/wait«