[HN Gopher] Benchmarking shell pipelines and the Unix "tools" ph...
___________________________________________________________________

Benchmarking shell pipelines and the Unix "tools" philosophy

Author : weinzierl
Score  : 50 points
Date   : 2020-01-06 11:31 UTC (1 days ago)

(HTM) web link (blog.plover.com)
(TXT) w3m dump (blog.plover.com)

| tuldia wrote:
| Thanks for this!
|
| Another nice thing about /usr/bin/time is the --verbose flag
| which gives:
|     Command being timed: "ls"
|     User time (seconds): 0.00
|     System time (seconds): 0.00
|     Percent of CPU this job got: 0%
|     Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
|     Average shared text size (kbytes): 0
|     Average unshared data size (kbytes): 0
|     Average stack size (kbytes): 0
|     Average total size (kbytes): 0
|     Maximum resident set size (kbytes): 1912
|     Average resident set size (kbytes): 0
|     Major (requiring I/O) page faults: 0
|     Minor (reclaiming a frame) page faults: 112
|     Voluntary context switches: 1
|     Involuntary context switches: 1
|     Swaps: 0
|     File system inputs: 0
|     File system outputs: 0
|     Socket messages sent: 0
|     Socket messages received: 0
|     Signals delivered: 0
|     Page size (bytes): 4096
|     Exit status: 0
|
| :)
| justinsaccount wrote:
| 'sort | uniq -c | sort -n' is an interesting pipeline. It will
| always work and does a great job with large cardinality data on
| low memory systems.
|
| However, if you have the RAM, or know the data set has a low
| cardinality (like HTTP status codes or filenames instead of IP
| addresses) then something that works in memory will be much more
| efficient.
|
| I threw 144,000,000 'hello' and 'world' into a file:
|     justin@box:~$ ls -lh words
|     -rw-r--r-- 1 justin justin 824M Jan 7 15:21 words
|     justin@box:~$ wc -l words
|     144000000 words
|     justin@box:~$ time (sort <words|uniq -c)
|     72000000 hello
|     72000000 world
|     real 0m22.831s
|     user 0m32.999s
|     sys 0m4.675s
|
| Compared to doing it in memory with awk:
|     justin@box:~$ time awk '{words[$1]++} END {for (w in words) printf("%s %d\n", w, words[w])}' < words
|     hello 72000000
|     world 72000000
|     real 0m10.639s
|     user 0m9.736s
|     sys 0m0.876s
|
| so, half the time and 1/3 the cpu.
| crystaldev wrote:
| All of your examples work in memory.
| tuldia wrote:
| This is because in the first example you are invoking two
| programs. The first one sorts the content of the file, the
| second counts how many lines are equal.
|
| While in the awk example it is creating a hash table with all
| words and incrementing by the key and then printing.
|
| There is no sorting, plus printing may be buffered.
| justinsaccount wrote:
| Thanks for explaining my own comment to me.
| skywhopper wrote:
| "What if Unix had less compositionality but I could use it with
| less memorized trivia? Would that be an improvement? I don't
| know."
|
| The answer is "no" here, because the alternative doesn't exist.
| Could it be created? Maybe in theory, but I suspect that the
| amount of stuff that you'd need to memorize (or learn to look up)
| to use it effectively would be about the same for any system that
| allowed a similar variety of work to be accomplished. If you are
| willing to trade off functionality for simplicity, then sure, it
| can be done. You can get it today by just not using all these
| tools at all, I suppose.
| jessant wrote:
| That is what the author says in the next paragraph.
|
| > I don't know. I rather suspect that there's no way to
| > actually reach that hypothetical universe.
| wahern wrote:
| There would be less trivia to memorize if the command behaviors
| and options were more consistent.
| You may not be able to achieve that at the edges, where new
| commands and options are added, but you can always go back and
| clean things up.
|
| For example, the cut(1) command is intended to do precisely
| what his f script does. But it's inconvenient because unlike
| many other commands it (1) doesn't obey $IFS and (2) the -d
| delimiter option only takes a single character. This could and
| should be remediated with a new, simple option.
|
| I think the only thing preventing that change is that there's
| not enough interest in moving POSIX forward faster; certainly
| not like JavaScript.
|
| Another problem is GNU tools. They have many great features
| but _OMG_ are they a nightmare of inconsistency. BSD extensions
| tend to be much better thought through, perhaps because GNU
| tools tend to be led by a single developer while BSD tools
| tend to be more team oriented.
|
| So the way forward isn't to replace the organic evolution, it's
| to layer on processes that refine the proven extensions. And we
| already have some of those processes in place; we just need to
| imbue them with more authority, and that starts by not rolling
| our eyes at standardization and portability.
| Neil44 wrote:
| I got excited when I saw the 'f' and 'count' commands, but
| they're just scripts he has on his system. Like doing
|     grep 'plover' blah.log | cut -d ' ' -f 11 | sort | uniq -c | sort -n
| Personally I'd prefer to use the ubiquitous commands that work
| everywhere than rely on having custom scripts on my system, but
| they are nice.
| tuldia wrote:
| > [...] but they're just scripts he has on his system.
| > Personally I'd prefer to use the ubiquitous commands that work
| > everywhere than rely on having custom scripts on my system
| > [...]
|
| It is okay for one to have their own tools.
|     $ f() { printf "\$%s" "$1"; }
|     $ echo a b c | awk '{ print $(f 2) }'
|
| His system is not very different from mine or yours. He just
| chose to combine the tools in a specific way.
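For readers who do not have the article's helpers, the 'f' and 'count' commands discussed in the comments above can be approximated from standard tools. The sketch below is an assumption based on Neil44's description (extract one field, then sort | uniq -c | sort -n); it is not the article author's actual scripts, and the function bodies are hypothetical stand-ins. Using awk for the field split sidesteps the single-character -d limitation of cut(1) that wahern mentions, since awk splits on runs of whitespace by default.

```shell
# Hypothetical stand-ins for the article's "f" and "count" helpers.
# These are illustrative guesses, not the author's real scripts.

# f N: print the N-th whitespace-separated field of each input line.
f() {
    awk -v n="$1" '{ print $n }'
}

# count: tally identical input lines and list them, least frequent first.
count() {
    sort | uniq -c | sort -n
}

# Possible usage, mirroring the pipeline quoted in the thread:
#   grep 'plover' blah.log | f 11 | count
```

Defining these as shell functions in a startup file, rather than as separate scripts, is only one option; either way they are built from components that exist on any POSIX system.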
| juped wrote:
| Most people who use Unix directly build up some stuff in ~/bin
| (often a misnomer because it's shell scripts and not binaries,
| although mine is less of a misnomer than most because so much
| is in C rather than shell). The trick is to build them _out of_
| the standard portable components that exist everywhere. (This
| means, among other things, no #! /bin/bash.)
| tuldia wrote:
| sed 's| no | not only |'
| amp108 wrote:
| That's the whole _point_ of shell scripting, to take a series
| of minimal programs and tie them together into something that
| does a more complex task. There's no reason to distrust a
| shell script simply because it _is_ a script any more than
| there is to trust a binary simply because it's a binary.
| tasogare wrote:
| That's the theory, but frankly the syntax is so cumbersome and
| irregular, and needs so much googling for "easy" things like
| conditionals, substrings, etc., that I now use a real
| programming language if a script needs to be anything more
| than a list of commands without any logic (besides variable
| substitution).
| cholmon wrote:
| Sure, but relying on custom shell scripts as unix primitives
| can be problematic if you find yourself frequently
| managing/troubleshooting systems that you don't own, and you
| don't want to (or aren't allowed to) put those handy scripts
| in place. Then when you're on any given system, you forget
| whether you can use "f", or if you have to fall back on awk.
|
| I think it's less about not trusting custom scripts than it
| is about ensuring that your unix muscle memory doesn't
| atrophy.
___________________________________________________________________
(page generated 2020-01-07 23:00 UTC)