This describes a simple shell script programming pattern to process a list of
jobs in parallel. The example script is contained in one file.


# Simple but less optimal example

        #!/bin/sh
        maxjobs=4

        # fake program for example purposes.
        someprogram() {
                echo "Yep yep, I'm totally a real program!"
                sleep "$1"
        }

        # run(arg1, arg2)
        run() {
                echo "[$1] $2 started" >&2
                someprogram "$1" >/dev/null
                status="$?"
                echo "[$1] $2 done" >&2
                return "$status"
        }

        # process the jobs.
        j=1
        for f in 1 2 3 4 5 6 7 8 9 10; do
                run "$f" "something" &

                jm=$((j % maxjobs)) # shell arithmetic: modulo
                test "$jm" = "0" && wait
                j=$((j+1))
        done
        wait


# Why is this less optimal

This is less optimal because it waits until all jobs in the same batch are
finished (each batch contains $maxjobs items) before it starts the next batch.

For example with 2 items per batch and 4 total jobs it could be:

* Job 1 is started.
* Job 2 is started.
* Job 2 is done.
* Job 1 is done.
* Wait: wait on process status of all background processes.
* Job 3 in new batch is started.


This could be optimized to:

* Job 1 is started.
* Job 2 is started.
* Job 2 is done.
* Job 3 is started (immediately).
* Job 1 is done.
* ...


It also does not handle signals such as SIGINT (^C). However the xargs example
below does:


# Example

        #!/bin/sh
        maxjobs=4

        # fake program for example purposes.
        someprogram() {
                echo "Yep yep, I'm totally a real program!"
                sleep "$1"
        }

        # run(arg1, arg2)
        run() {
                echo "[$1] $2 started" >&2
                someprogram "$1" >/dev/null
                status="$?"
                echo "[$1] $2 done" >&2
                return "$status"
        }

        # child process job.
        if test "$CHILD_MODE" = "1"; then
                run "$1" "$2"
                exit "$?"
        fi

        # generate a list of jobs for processing.
        list() {
                for f in 1 2 3 4 5 6 7 8 9 10; do
                        printf '%s\0%s\0' "$f" "something"
                done
        }

        # process jobs in parallel.
        list | CHILD_MODE="1" xargs -r -0 -P "${maxjobs}" -L 2 "$(readlink -f "$0")"


# Run and timings

Although the above example is kind of stupid, it already shows that queueing
the jobs is more efficient than running them in batches.

Script 1:

        time ./script1.sh
        [...snip snip...]
        real    0m22.095s

Script 2:

        time ./script2.sh
        [...snip snip...]
        real    0m18.120s


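The difference between the two timings can be worked out without running the
jobs. The following is a rough sketch that models both scheduling strategies
for the example jobs (sleeps of 1 to 10 seconds, 4 slots). The function names
batch_time and queue_time are made up for this illustration, and process
startup overhead is ignored:

```sh
#!/bin/sh
maxjobs=4
jobs="1 2 3 4 5 6 7 8 9 10" # duration of each job in seconds

# batch model (script 1): a batch takes as long as its slowest job.
batch_time() {
	total=0
	max=0
	n=0
	for d in $jobs; do
		if test "$d" -gt "$max"; then max="$d"; fi
		n=$((n + 1))
		if test $((n % maxjobs)) -eq 0; then
			total=$((total + max)) # batch is full: wait for it
			max=0
		fi
	done
	echo $((total + max)) # add the last, partially filled batch
}

# queue model (script 2): a job starts as soon as any slot is free.
queue_time() {
	slots="" # the time at which each slot becomes free
	i=0
	while test "$i" -lt "$maxjobs"; do
		slots="$slots 0"
		i=$((i + 1))
	done
	for d in $jobs; do
		# find the slot that becomes free first.
		min=""
		for s in $slots; do
			if test -z "$min" || test "$s" -lt "$min"; then min="$s"; fi
		done
		# put the job in that slot.
		new=""
		placed=0
		for s in $slots; do
			if test "$placed" -eq 0 && test "$s" -eq "$min"; then
				s=$((min + d))
				placed=1
			fi
			new="$new $s"
		done
		slots="$new"
	done
	# the total time is when the last slot becomes free.
	max=0
	for s in $slots; do
		if test "$s" -gt "$max"; then max="$s"; fi
	done
	echo "$max"
}

echo "batch: $(batch_time)s"
echo "queue: $(queue_time)s"
```

This prints 22 seconds for the batch model and 18 seconds for the queue model,
which lines up with the measured times above once a little overhead is added.
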
# How it works

The parent process:

* The parent, using xargs, handles the queue of jobs and schedules each job to
  execute as a child process.
* The list function writes the parameters to stdout. These parameters are
  separated by the NUL byte. The NUL byte is used as the separator because it
  cannot occur in filenames (which can contain spaces or even newlines) and
  cannot occur in text strings (the NUL byte terminates a C string).
* The -L option must match the number of arguments that are specified for the
  job. It splits the specified parameters per job.
* The expression "$(readlink -f "$0")" gets the absolute path to the
  shell script itself. This is passed as the executable to run for xargs.
* xargs calls the script itself with the parameters it is being fed.
  The environment variable $CHILD_MODE is set to indicate to the script itself
  that it is run as a child process of the script.


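To see why the NUL byte is a safe separator, here is a small sketch that feeds
fields containing a space and a newline through xargs. Note that it uses -n 2
instead of -L 2 to group two fields per invocation; that is a substitution for
this illustration only, and the demo function is a made-up name. The -0 option
itself is a non-POSIX extension:

```sh
#!/bin/sh
# two jobs with two fields each. The fields contain a space and a newline
# on purpose: only the NUL byte ends a field.
list() {
	printf '%s\0%s\0' "file one" "first
job"
	printf '%s\0%s\0' "file two" "second job"
}

# print the two arguments each invocation receives, in brackets.
demo() {
	list | xargs -0 -n 2 sh -c 'printf "[%s] [%s]\n" "$1" "$2"' sh
}

demo
```

Each invocation should receive exactly two arguments, with the space and the
newline kept intact inside the fields.
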
The child process:

* The command-line arguments are passed by the parent using xargs.

* The script checks the environment variable $CHILD_MODE to detect that it is
  run as a child process of the script.

* The script itself (run in child mode) only executes the task and
  signals its status back to xargs and the parent.

* The exit status of the child program is signaled to xargs. This could be
  handled, for example to stop on the first failure (in this example it is not).
  For example if the program is killed, stopped or the exit status is 255 then
  xargs stops running also.


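As a sketch of how stopping on the first failure could look: the child maps a
failure to exit status 255, which makes xargs stop processing the remaining
input. The function name stop_on_failure and the failing job 2 are made up for
this illustration, and it runs the jobs sequentially (no -P) to keep the
output predictable:

```sh
#!/bin/sh
# three jobs; job 2 pretends to fail and exits with status 255.
stop_on_failure() {
	printf '%s\0' 1 2 3 |
	xargs -0 -n 1 sh -c '
		echo "job $1"
		# hypothetical failure: job 2 fails.
		if test "$1" = "2"; then
			exit 255
		fi
		exit 0
	' sh
}

stop_on_failure
echo "xargs exit status: $?"
```

Job 3 should never be started, and xargs itself should exit with a non-zero
status. The exact status and the diagnostic message differ per implementation.
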
# Description of used xargs options

From the OpenBSD man page: <https://man.openbsd.org/xargs>

        xargs - construct argument list(s) and execute utility

Options explained:

* -r: Do not run the command if there are no arguments. Normally the command
  is executed at least once even if there are no arguments.
* -0: Change xargs to expect NUL ('\0') characters as separators, instead of
  spaces and newlines.
* -P maxprocs: Parallel mode: run at most maxprocs invocations of utility
  at once.
* -L number: Call utility for every number of non-empty lines read. A line
  ending in unescaped white space and the next non-empty line are considered
  to form one single line. If EOF is reached and fewer than number lines have
  been read then utility will be called with the available lines.


# xargs options -0 and -P, portability and historic context

Some of the options, like -P, are non-POSIX as of writing (2023):
<https://pubs.opengroup.org/onlinepubs/9699919799/utilities/xargs.html>.
However, many systems have supported this useful extension for many years now.

The specification even mentions implementations which support parallel
operations:

"The version of xargs required by this volume of POSIX.1-2017 is required to
wait for the completion of the invoked command before invoking another command.
This was done because historical scripts using xargs assumed sequential
execution. Implementations wanting to provide parallel operation of the invoked
utilities are encouraged to add an option enabling parallel invocation, but
should still wait for termination of all of the children before xargs
terminates normally."


Some historic context:

The xargs -0 option was added to OpenBSD on 1996-06-11 by Theo de Raadt, about
a year after the NetBSD import (over 27 years ago at the time of writing):

[CVS log](http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/xargs/xargs.c?rev=1.2&content-type=text/x-cvsweb-markup)

On OpenBSD the xargs -P option was added on 2003-12-06 by syncing the FreeBSD
code:

[CVS log](http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/xargs/xargs.c?rev=1.14&content-type=text/x-cvsweb-markup)


Looking at the imported git history log of GNU findutils (which has xargs), the
very first commit already had the -0 and -P options:

[git log](https://savannah.gnu.org/git/?group=findutils)

        commit c030b5ee33bbec3c93cddc3ca9ebec14c24dbe07
        Author: Kevin Dalley <kevin@seti.org>
        Date:   Sun Feb 4 20:35:16 1996 +0000

            Initial revision


# xargs: some incompatibilities found

* Using the -0 option, empty fields are handled differently in different
  implementations.
* The -n and -L options don't work correctly in many of the BSD implementations.
  Some count empty fields, some don't.  In early implementations in FreeBSD and
  OpenBSD only the first line was processed.  In OpenBSD this was improved
  around 2017.

Depending on what you want to do, a workaround could be to use the -0 option
with a single field and the -n flag.  Then in each child program invocation
split the field by a separator.


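A minimal sketch of that workaround could look like this. It assumes "|" as
the separator inside the field and that the first parameter never contains it;
both the separator and the split_demo function name are choices made for this
illustration:

```sh
#!/bin/sh
# one NUL-terminated field per job; the two parameters are joined by "|".
# assumption: the first parameter never contains "|".
list() {
	for f in 1 2 3; do
		printf '%s|%s\0' "$f" "something"
	done
}

# each invocation gets exactly one field and splits it itself.
split_demo() {
	list | xargs -0 -n 1 sh -c '
		first="${1%%|*}"  # text before the first "|"
		second="${1#*|}"  # text after the first "|"
		printf "[%s] [%s]\n" "$first" "$second"
	' sh
}

split_demo
```

Because every job is a single field, xargs only has to count one argument per
invocation, which sidesteps the differences in how -L and -n group multiple
fields.
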
# References

* xargs: <https://man.openbsd.org/xargs>
* printf: <https://man.openbsd.org/printf>
* ksh, wait: <https://man.openbsd.org/ksh#wait>
* wait(2): <https://man.openbsd.org/wait>