[HN Gopher] Benchmarking shell pipelines and the Unix "tools" ph...
       ___________________________________________________________________
        
       Benchmarking shell pipelines and the Unix "tools" philosophy
        
       Author : weinzierl
       Score  : 50 points
        Date   : 2020-01-06 11:31 UTC (1 day ago)
        
 (HTM) web link (blog.plover.com)
 (TXT) w3m dump (blog.plover.com)
        
       | tuldia wrote:
       | Thanks for this!
       | 
        | Another nice thing about /usr/bin/time is the --verbose
        | flag, which gives:
        | 
        |       Command being timed: "ls"
        |       User time (seconds): 0.00
        |       System time (seconds): 0.00
        |       Percent of CPU this job got: 0%
        |       Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
        |       Average shared text size (kbytes): 0
        |       Average unshared data size (kbytes): 0
        |       Average stack size (kbytes): 0
        |       Average total size (kbytes): 0
        |       Maximum resident set size (kbytes): 1912
        |       Average resident set size (kbytes): 0
        |       Major (requiring I/O) page faults: 0
        |       Minor (reclaiming a frame) page faults: 112
        |       Voluntary context switches: 1
        |       Involuntary context switches: 1
        |       Swaps: 0
        |       File system inputs: 0
        |       File system outputs: 0
        |       Socket messages sent: 0
        |       Socket messages received: 0
        |       Signals delivered: 0
        |       Page size (bytes): 4096
        |       Exit status: 0
        | 
        | :)
        
       | justinsaccount wrote:
        | 'sort | uniq -c | sort -n' is an interesting pipeline. It
        | will always work, and it does a great job with
        | high-cardinality data on low-memory systems.
        | 
        | However, if you have the RAM, or know the data set has low
        | cardinality (like HTTP status codes or filenames instead of
        | IP addresses), then something that works in memory will be
        | much more efficient.
       | 
        | I threw 144,000,000 'hello' and 'world' lines into a file:
        | 
        |       justin@box:~$ ls -lh words
        |       -rw-r--r-- 1 justin justin 824M Jan  7 15:21 words
        |       justin@box:~$ wc -l words
        |       144000000 words
        |       justin@box:~$ time (sort <words | uniq -c)
        |       72000000 hello
        |       72000000 world
        |       real    0m22.831s
        |       user    0m32.999s
        |       sys     0m4.675s
       | 
        | Compared to doing it in memory with awk:
        | 
        |       justin@box:~$ time awk '{words[$1]++} END {for (w in words) printf("%s %d\n", w, words[w])}' < words
        |       hello 72000000
        |       world 72000000
        |       real    0m10.639s
        |       user    0m9.736s
        |       sys     0m0.876s
        | 
        | So: half the time and a third of the CPU.
        
         | crystaldev wrote:
         | All of your examples work in memory.
        
         | tuldia wrote:
          | This is because in the first example you are invoking two
          | programs: the first sorts the content of the file, and the
          | second counts how many consecutive lines are equal.
          | 
          | The awk example, by contrast, builds a hash table keyed on
          | the words, increments the count for each key, and then
          | prints the table.
          | 
          | There is no sorting, and the printing may be buffered.
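          | Since "for (w in words)" iterates in hash order, the
          | in-memory version comes out unsorted; sorting the now-tiny
          | result at the end restores the 'uniq -c | sort -n' shape.
          | A sketch with made-up input:

```shell
# Count duplicates in memory with awk, then sort only the small
# result set -- same output shape as `sort | uniq -c | sort -n`.
printf 'b\na\nb\n' \
  | awk '{ c[$0]++ } END { for (k in c) printf "%7d %s\n", c[k], k }' \
  | sort -n
# prints:
#       1 a
#       2 b
```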
        
           | justinsaccount wrote:
           | Thanks for explaining my own comment to me.
        
       | skywhopper wrote:
       | "What if Unix had less compositionality but I could use it with
       | less memorized trivia? Would that be an improvement? I don't
       | know."
       | 
       | The answer is "no" here, because the alternative doesn't exist.
       | Could it be created? Maybe in theory, but I suspect that the
       | amount of stuff that you'd need to memorize (or learn to look up)
       | to use it effectively would be about the same for any system that
       | allowed a similar variety of work to be accomplished. If you are
       | willing to trade off functionality for simplicity, then sure, it
       | can be done. You can get it today by just not using all these
       | tools at all, I suppose.
        
         | jessant wrote:
         | That is what the author says in the next paragraph.
         | 
         | > I don't know. I rather suspect that there's no way to
         | actually reach that hypothetical universe.
        
         | wahern wrote:
         | There would be less trivia to memorize if the command behaviors
         | and options were more consistent. You may not be able to
         | achieve that at the edges, where new commands and options are
         | added, but you can always go back and clean things up.
         | 
          | For example, the cut(1) command is intended to do precisely
          | what his f script does. But it's inconvenient because,
          | unlike many other commands, (1) it doesn't obey $IFS and
          | (2) its -d delimiter option only takes a single character.
          | This could and should be remedied with a new, simple
          | option.
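          | The difference bites on repeated whitespace: cut treats
          | every occurrence of the delimiter as a field boundary,
          | while awk's default splitting collapses runs of blanks.
          | A two-space example (illustrative input):

```shell
# Two spaces between "a" and "b": cut sees an empty second field,
# while awk's default FS collapses the run of blanks and sees "b".
printf 'a  b\n' | cut -d ' ' -f 2     # prints an empty line
printf 'a  b\n' | awk '{ print $2 }'  # prints "b"
```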
         | 
         | I think the only thing preventing that change is that there's
         | not enough interest in moving POSIX forward faster; certainly
         | not like JavaScript.
         | 
          | Another problem is the GNU tools. They have many great
          | features but _OMG_ are they a nightmare of inconsistency.
          | BSD extensions tend to be much better thought through,
          | perhaps because GNU tools tend to be led by a single
          | developer while BSD tools tend to be more team oriented.
         | 
         | So the way forward isn't to replace the organic evolution, it's
         | to layer on processes that refine the proven extensions. And we
         | already have some of those processes in place; we just need to
         | imbue them with more authority, and that starts by not rolling
         | our eyes at standardization and portability.
        
       | Neil44 wrote:
       | I got excited when I saw the 'f' and 'count' commands, but
       | they're just scripts he has on his system. Like doing grep
       | 'plover' blah.log | cut -d ' ' -f 11 | sort | uniq -c | sort -n.
        | Personally I'd rather use the ubiquitous commands that work
        | everywhere than rely on having custom scripts on my system,
        | but they are nice.
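        | For the curious, plausible one-liner versions (guesses only;
        | the author's actual scripts may differ) would be:

```shell
# Hypothetical sketches of the "f" and "count" helpers:
# f N   -- print the Nth whitespace-separated field of each line
# count -- tally identical lines, least frequent first
f()     { awk -v n="$1" '{ print $n }'; }
count() { sort | uniq -c | sort -n; }

echo a b c | f 2    # prints "b"
```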
        
         | tuldia wrote:
         | > [...] but they're just scripts he has on his system.
         | Personally I'd prefer to use the ubiquitous commands that work
         | everywhere than rely on having custom scripts on my system
         | [...]
         | 
          | It's okay for one to have their own tools.
          | 
          |       $ f() { printf "\$%s" "$1"; }
          |       $ echo a b c | awk "{ print $(f 2) }"
          |       b
         | 
         | His system is not very different from mine or yours. He just
         | chose to combine the tools in a specific way.
        
         | juped wrote:
         | Most people who use Unix directly build up some stuff in ~/bin
         | (often a misnomer because it's shell scripts and not binaries,
         | although mine is less of a misnomer than most because so much
         | is in C rather than shell). The trick is to build them _out of_
         | the standard portable components that exist everywhere. (This
         | means, among other things, no #! /bin/bash.)
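          | An illustrative skeleton for such a script (not from the
          | comment above; just one way to stay portable):

```shell
#!/bin/sh
# POSIX sh only -- no bashisms, so it runs unchanged wherever
# /bin/sh exists (Linux, the BSDs, busybox, ...).
set -eu
for arg in "$@"; do
    printf '%s\n' "$arg"
done
```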
        
           | tuldia wrote:
           | sed 's| no | not only |'
        
         | amp108 wrote:
         | That's the whole _point_ of shell scripting, to take a series
         | of minimal programs and tie them together into something that
         | does a more complex task. There 's no reason to distrust a
         | shell script simply because it _is_ a script any more than
         | there is to trust a binary simply because it 's a binary.
        
           | tasogare wrote:
           | That's the theory but frankly the syntax is so cumbersome,
           | irregular and needs so many googling for "easy" things like
           | conditional, substring, etc. that I now use a real
           | programming language if a script needs to be anything more
           | than a list of commands without any logic (besides variables
           | substitution).
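            | Substrings are a fair example of the trivia: even plain
            | POSIX sh needs non-obvious parameter-expansion forms for
            | them, e.g.:

```shell
# POSIX parameter expansion -- the "substring" trivia in question.
s='hello world'
printf '%s\n' "${s#* }"   # strip shortest prefix "* "  -> world
printf '%s\n' "${s%% *}"  # strip longest suffix " *"   -> hello
case $s in hello*) echo 'starts with hello' ;; esac
```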
        
           | cholmon wrote:
           | Sure, but relying on custom shell scripts as unix primitives
           | can be problematic if you find yourself frequently
           | managing/troubleshooting systems that you don't own, and you
           | don't want to (or aren't allowed to) put those handy scripts
           | in place. Then when you're on any given system, you forget
           | whether you can use "f", or if you have to fall back on awk.
           | 
           | I think it's less about not trusting custom scripts than it
           | is about ensuring that your unix muscle memory doesn't
           | atrophy.
        
       ___________________________________________________________________
       (page generated 2020-01-07 23:00 UTC)