[HN Gopher] Performance comparison: counting words in Python, C/...
       ___________________________________________________________________
        
       Performance comparison: counting words in Python, C/C++, Awk, Rust,
       and more
        
       Author : pcr910303
       Score  : 202 points
       Date   : 2022-07-24 15:31 UTC (7 hours ago)
        
 (HTM) web link (benhoyt.com)
 (TXT) w3m dump (benhoyt.com)
        
       | gxt wrote:
       | > a simple interview problem
       | 
       | I've been asking different versions of that question since I
       | started interviewing and it is a surprisingly good filter.
       | 
       | But with articles like this, now I need new questions..
        
       | staticassertion wrote:
       | It's a bit of a shame that Rust's stdlib doesn't have a "fast but
       | less safe" hasher. It's one of the easiest performance
       | optimizations in my experience because it's very rare that an
       | attacker controls input to your hashmap in a way that can
       | actually lead to a DOS.
       | 
       | Not that it doesn't happen, it's just not something I've run
       | into.
       | 
       | Just today I was testing out the performance difference hashing
       | some Scylla queries, and got nearly 2x faster hashing by moving to
       | fxhash.
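       | 
       | For reference, the swap is just a different hasher type parameter on
       | the map. A minimal sketch, assuming the fxhash crate as a dependency
       | (other fast hashers like fnv follow the same pattern):
       | 
       |     use std::collections::HashMap;
       |     use std::hash::BuildHasherDefault;
       | 
       |     use fxhash::FxHasher; // assumes the fxhash crate in Cargo.toml
       | 
       |     // Same std HashMap API, just a non-SipHash hasher: faster on short
       |     // keys, but no HashDoS resistance.
       |     type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<FxHasher>>;
       | 
       |     fn main() {
       |         let mut counts: FastMap<String, u64> = FastMap::default();
       |         for word in ["the", "quick", "the"] {
       |             *counts.entry(word.to_string()).or_insert(0) += 1;
       |         }
       |         assert_eq!(counts["the"], 2);
       |     }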
        
       | anonymoushn wrote:
       | Recently there was a contest to optimize the performance of a
       | similar program[0] and a Zoom discussion of the optimizations[1].
       | 
       | The programs are not comparable in the following ways:
       | 
       | - Case: TFA requires (at least) ASCII lowercasing but the contest
       | problem required no lowercasing.
       | 
       | - Ordering: TFA does not require sorting, but the contest problem
       | required sorting.
       | 
       | - Memory: TFA imposes a requirement phrased as "don't read the
       | whole file into memory" and this sounds like it's a resource-
       | saving constraint, but it's usually a constraint that requires
       | the program to spend additional resources. You could just mmap
       | the file and store pointers into the mapped region. It costs
       | extra to do copies instead of no copies.
       | 
       | - Text: TFA is unclear on what assumptions may be made about the
       | lengths of words. For the contest problem, the Hungarian
       | wikipedia input's longest word is around 80k.
       | 
       | - Safe, Hashing, Stdlib: TFA imposes some restrictions on what
       | constructs may be used that are not imposed in the contest
       | problem.
       | 
       | For the contest version of this problem, it seems like you can
       | tokenize, hash, and count strings at around 1GB/s. Adapting a
       | solution to solve TFA's problem (but not to conform to its
       | Safe/Hashing/Stdlib requirements) would probably not carry too
       | large of a penalty, since it's like 3 instructions to ASCII-
       | lowercase 32 bytes and 1 string copy per unique string should
       | take negligible time compared to the hash table lookups. So there
       | is some room for the optimized solutions to go a little faster,
       | if more optimizations are permitted.
       | 
       | [0]: https://easyperf.net/blog/2022/05/28/Performance-analysis-
       | an...
       | 
       | [1]: https://www.youtube.com/watch?v=R_yX0XjdSBY
        
       | fractalb wrote:
       | grep is indeed fast!
        
       | unhammer wrote:
       | Hm, I tried comparing the simple.hs vs simple.py (with ghc 8.10.7
       | which I happened to have here, their stack.yaml uses 8.10.4) and
       | it's 5.7s vs 1.9s, not quite as stark a difference. Maybe because
       | I'm on Linux?
        
       | mustache_kimono wrote:
       | Re: the Rust optimized implementation, I was able to get ~20-25%
       | better performance by rewriting the for loops as iterators, using
       | Rust's byte range pattern matching, and a buffered writer, which
       | seems crazy, but it's true. I chalked it up to some crazy
       | ILP/SIMD tricks the compiler is doing.
       | 
       | I even submitted a PR[0], but Ben was tired of maintaining it and
       | decided to archive the project (which, fair enough!).
       | 
       | [0]: https://github.com/benhoyt/countwords/pull/115
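       | 
       | For a flavour of those changes, here is a rough sketch (not the code
       | from the PR): a byte-range pattern for the ASCII lowercasing, an
       | iterator pipeline for tokenizing, and a BufWriter for output. Note it
       | reads everything into memory, which the article's rules forbid; it
       | only illustrates the techniques.
       | 
       |     use std::collections::HashMap;
       |     use std::io::{self, BufWriter, Read, Write};
       | 
       |     fn main() -> io::Result<()> {
       |         // Illustration only: slurps all of stdin into memory.
       |         let mut input = Vec::new();
       |         io::stdin().read_to_end(&mut input)?;
       | 
       |         // Byte-range pattern: ASCII-lowercase in place, no Unicode machinery.
       |         for b in input.iter_mut() {
       |             if let b'A'..=b'Z' = *b {
       |                 *b |= 0x20;
       |             }
       |         }
       | 
       |         // Iterator pipeline instead of an index-based loop.
       |         let mut counts: HashMap<&[u8], u64> = HashMap::new();
       |         for word in input
       |             .split(|b| b.is_ascii_whitespace())
       |             .filter(|w| !w.is_empty())
       |         {
       |             *counts.entry(word).or_insert(0) += 1;
       |         }
       | 
       |         let mut pairs: Vec<_> = counts.into_iter().collect();
       |         pairs.sort_by(|a, b| b.1.cmp(&a.1));
       | 
       |         // Buffered writer: avoid one write syscall per output line.
       |         let stdout = io::stdout();
       |         let mut out = BufWriter::new(stdout.lock());
       |         for (word, n) in pairs {
       |             out.write_all(word)?;
       |             writeln!(out, " {}", n)?;
       |         }
       |         Ok(())
       |     }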
        
         | grumpyprole wrote:
         | Iterators are potentially a much better design too as they
         | allow for more modular code.
        
         | codeflo wrote:
         | Very nice. That's in fact really clean, and (at least to me)
         | indeed surprising.
         | 
         | As rejection reasons go, "I wanted to name-drop person X" is
         | interesting. But as you said, that's the maintainer's decision
         | to make.
        
           | [deleted]
        
           | mustache_kimono wrote:
           | Can't win them all. I think I came along when they were just
           | really sick of maintaining it. Ben and Andrew have all my
           | respect.
        
       | codeflo wrote:
       | For those curious about Rust optimization, the "optimized"
       | version makes three changes compared to the "idiomatic" version:
       | 
       | * It uses byte strings instead of UTF-8 strings. In my opinion,
       | that's not an optimization, that's changing the problem.
       | Depending on the question you're asking, only one of the two can
       | be correct.
       | 
       | * It uses a faster hash algorithm. It's not the first time this
       | came up in a benchmark article. Rust's decision to use a DOS-safe
       | hash by default (and not provide a fast algorithm in the std,
       | like other languages do) really seems to hurt it in that kind of
       | microbenchmark.
       | 
       | * It uses get_mut+insert instead of the more convenient
       | HashMap::entry method, because the latter would require
       | redundantly allocating the key even in the repeat case. I've hit
       | this problem in the past as well. Maybe the upcoming
       | HashMap::raw_entry_mut will make this kind of optimization
       | cleaner.
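       | 
       | To illustrate the third point, a small sketch (not the repo's code):
       | `entry` needs an owned key up front, so the String is allocated even
       | when the word is already present, while get_mut+insert only allocates
       | for unseen words.
       | 
       |     use std::collections::HashMap;
       | 
       |     // Convenient, but allocates a String for every word, hit or miss.
       |     fn count_entry(counts: &mut HashMap<String, u64>, word: &str) {
       |         *counts.entry(word.to_string()).or_insert(0) += 1;
       |     }
       | 
       |     // Allocates only on the first occurrence of a word, at the cost of
       |     // hashing twice in the miss case.
       |     fn count_get_mut(counts: &mut HashMap<String, u64>, word: &str) {
       |         if let Some(n) = counts.get_mut(word) {
       |             *n += 1;
       |         } else {
       |             counts.insert(word.to_string(), 1);
       |         }
       |     }
       | 
       |     fn main() {
       |         let mut counts = HashMap::new();
       |         count_get_mut(&mut counts, "the");
       |         count_get_mut(&mut counts, "the");
       |         count_entry(&mut counts, "quick");
       |         assert_eq!(counts["the"], 2);
       |     }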
        
         | saghm wrote:
         | > It uses byte strings instead of UTF-8 strings. In my opinion,
         | that's not an optimization, that's changing the problem.
         | Depending on the question you're asking, only one of the two
         | can be correct.
         | 
         | Sure, but is it changing the problem to something easier than
         | what the other languages are already doing, or to something
         | more similar? I'd imagine the C code is basically just using
         | byte arrays as well, for instance.
        
         | tialaramex wrote:
         | The problem as specified _declares_ that the words we're counting
         | are ASCII:
         | 
         | > ASCII: it's okay to only support ASCII for the whitespace
         | handling and lowercase operation
         | 
         | UTF-8 (quite deliberately) is a superset of ASCII. So a UTF-8
         | solution is correct for ASCII, but a bytes-as-ASCII solution
         | works fine in Rust _if_ you only need ASCII.
         | 
         | This is why Rust provides ASCII variants of a lot of functions
         | on strings, and the same functions are available on byte slices
         | [u8] where ASCII could be what you have (whereas their Unicode
         | cousins are not available on byte slices).
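         | 
         | For example (a small sketch, not from the repo): the ASCII helpers
         | exist directly on byte slices, so a bytes-as-ASCII solution can
         | lowercase and split without touching any Unicode machinery.
         | 
         |     fn main() {
         |         let mut line = b"The QUICK brown Fox".to_vec();
         | 
         |         // ASCII-only lowercasing, available on [u8] as well as on str.
         |         line.make_ascii_lowercase();
         | 
         |         // Split on ASCII whitespace bytes only.
         |         let words: Vec<&[u8]> = line
         |             .split(|b| b.is_ascii_whitespace())
         |             .filter(|w| !w.is_empty())
         |             .collect();
         | 
         |         assert_eq!(
         |             words,
         |             [&b"the"[..], &b"quick"[..], &b"brown"[..], &b"fox"[..]]
         |         );
         |     }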
        
           | codeflo wrote:
           | >> ASCII: it's okay to only support ASCII for the whitespace
           | handling and lowercase operation
           | 
           | That's a somewhat specific list -- at least I didn't read
           | that as a general "the program can assume that the input is
           | only ASCII".
           | 
           | But then, the author seems to have accepted solutions that
           | crash on non-UTF8 sequences and ones that byte-compare them,
           | so probably either behavior was meant to be fine. I just
           | don't get that from this rule.
        
             | anonymoushn wrote:
             | > That's a somewhat specific list -- at least I didn't read
             | that as a general "the program can assume that the input is
             | only ASCII".
             | 
             | I don't think it assumes the input is only ASCII. If the
             | problem is "given UTF-8 text, split on ASCII whitespace and
             | convert ASCII uppercase letters to lowercase," you can do
             | that correctly (and produce correct UTF-8 output) without
             | really being UTF-8 aware. For why, see here:
             | https://en.wikipedia.org/wiki/UTF-8#Encoding
             | 
             | > But then, the author seems to have accepted solutions
             | that crash on non-UTF8 sequences and ones that byte-compare
             | them, so probably either behavior was meant to be fine. I
             | just don't get that from this rule.
             | 
             | That's a separate concern right? The rules are only about
             | the behavior when the program is given UTF-8 input.
        
         | olalonde wrote:
         | > * It uses byte strings instead of UTF-8 strings. In my
         | opinion, that's not an optimization, that's changing the
         | problem. Depending on the question you're asking, only one of
         | the two can be correct.
         | 
         | To be fair, that's what the C version does as well.
        
         | stewbrew wrote:
         | Ad uft8: it would change the problem only if the definition of
         | word breaks changed. E.g., if a word break is defined by
         | whitespace, then it probably doesn't.
        
           | Conlectus wrote:
           | What happens when the ASCII character for space is embedded
           | within a multibyte UTF-8 codepoint?
        
             | Kiwikwi wrote:
             | Such "overlong" encodings are not valid UTF-8.
        
             | anyfoo wrote:
             | If UTF-8 allowed that, then a solution would split words
             | wrong according to UTF-8 rules, but it still would not
             | matter, since the problem statement says that "it's okay to
             | only support ASCII for the whitespace handling and
             | lowercase operation".
             | 
             | So if your solution gets word splitting (or lowercasing)
             | wrong _for non-ASCII input_, it still gets a pass
             | according to my reading.
        
             | kd5bjo wrote:
             | UTF-8 is designed so that this doesn't happen: All bytes in
             | a multibyte codepoint have the high bit set, placing them
             | outside of the ASCII range.
        
             | bitwize wrote:
             | This never happens. In UTF-8, bit 7 is set in all valid
             | multibyte representations; ASCII characters _cannot_ appear
             | in them. Now, you can have an ASCII character in the middle
             | of a multibyte sequence, but this isn't valid UTF-8 (and
             | should be treated as an error by the decoder). "Words"
             | generated by a splitter would also have invalid UTF-8 if
             | they were split on a space coming in between bytes in a
             | multibyte sequence, but these can be detected. Word splitting
             | will not generate invalid words from valid input, or vice
             | versa; and if you know you have valid input, splitting on
             | space, \t, \n, \r, etc. is perfectly cromulent in UTF-8.
             | 
             | Now if you want to also split on U+200B (ZERO WIDTH SPACE),
             | U+202F (NARROW NO-BREAK SPACE), etc... hoo boy.
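             | 
             | A tiny way to see that property (just a sketch): every byte of a
             | multibyte sequence has the high bit set, so splitting the raw
             | UTF-8 bytes on the ASCII space byte can never cut a code point
             | in half.
             | 
             |     fn main() {
             |         let text = "añejo café"; // contains two-byte sequences (ñ, é)
             |         let bytes = text.as_bytes();
             | 
             |         // Split on the ASCII space byte only.
             |         let words: Vec<&[u8]> = bytes.split(|&b| b == b' ').collect();
             | 
             |         // Every piece is still valid UTF-8: no sequence was cut.
             |         assert!(words.iter().all(|w| std::str::from_utf8(w).is_ok()));
             |         assert_eq!(words.len(), 2);
             |     }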
        
         | coldtea wrote:
         | > _It uses byte strings instead of UTF-8 strings. In my
         | opinion, that's not an optimization, that's changing the
         | problem. Depending on the question you're asking, only one of
         | the two can be correct._
         | 
         | For tons of questions, both can be correct.
        
         | SAI_Peregrinus wrote:
         | The question as posed only cares about ASCII.
        
       | benreesman wrote:
       | This was always my approach when I had do "leetcode whiteboard
       | interviews".
       | 
       | Start with something not too far north of FizzBuzz but with a ton
       | of scope to progressively enrich the problem until the clock runs
       | out.
       | 
       | At $BIG_CO du jour people tend to talk about how much "signal" an
       | interview produces, and starting somewhere that a meaningful
       | number of candidates can't do anything with offers very little
       | "signal", likewise very little Shannon information.
       | 
       | Great article!
        
       | jeffbee wrote:
       | Funny how you can take the "optimized" C++ and make it
       | significantly faster (vs. gnu/linux standard libraries) with
       | little effort. A lookup table for to_lower, instead of bit math,
       | helps slightly. Building with the LLVM-libc string functions
       | helps a lot, because the GNU std::string_view::operator== is
       | calling an AVX-optimized memcmp via the PLT, which is a terrible
       | strategy for short strings. LLVM-libc has an inlined bcmp (note:
       | bcmp can be significantly faster than memcmp, but on GNU they are
       | aliases) that is resolved for the target platform at build time
       | instead of run time.
       | 
       | Edit:
       | 
       | Even if you ignore LLVM-libc, just slapping this into the
       | optimized.cpp and replacing the critical `==` with `our_bcmp`
       | makes it 10% faster. IFUNC calls to micro-optimized SIMD
       | functions are counterproductive. It is far, far better that the
       | compiler can see all the code at build time.
       | 
       |     int our_bcmp(const char* a, const char* b, size_t sz) {
       |       for (size_t i = 0; i < sz; i++) {
       |         if (a[i] != b[i]) return 1;
       |       }
       |       return 0;
       |     }
        
       | svnpenn wrote:
       | You included Awk in the title, but omitted Go, really?
        
         | hu3 wrote:
         | Indeed. The original title is:
         | 
         | > Performance comparison: counting words in Python, Go, C++, C,
         | AWK, Forth, and Rust.
         | 
         | Which means the submitter actively chose Go to be removed from
         | the title. Judging from their post history they are aligned
         | with the Rust community.
        
           | yakubin wrote:
           | The original title is 2 characters over the HN title length
           | limit. I suspect they dropped "Go" first, because it's the
           | language which has a 2-letter name, and then they
           | editorialised it further to hint that there are more
           | languages.
        
             | hu3 wrote:
               | Leaving Awk in over the vastly more popular Go hints at a more
               | probable and frequent behaviour: bias.
             | 
             | Removing any of them would suffice.
        
               | e4m2 wrote:
               | The author maintains an AWK implementation:
               | 
               | https://github.com/benhoyt/goawk
        
               | hu3 wrote:
               | Ben Hoyt isn't the same person as the HN submitter.
        
               | [deleted]
        
       | ppeetteerr wrote:
       | Curious if anyone knows why the swift version is slow? Is it the
       | casting from subsequence to a string?
        
       | BiteCode_dev wrote:
       | In this type of blog post with comparisons including many
       | languages, you usually see some unnatural approach for your
       | favorite language.
       | 
       | However, in this case, the Python example is idiomatic.
       | 
       | Since the article calls for better approaches, I would suggest
       | taking the opportunity to use the walrus operator and rsplit()
       | for the optimized version. Something like this:
       | 
       |     remaining = ""
       |     c = Counter()
       |     while (chunk := sys.stdin.read(64 * 1024)):
       |         pre, post = chunk.lower().rsplit('\n', 1)
       |         c.update((remaining + pre).split())
       |         remaining = post
       | 
       | It should not affect performance too much, and the code gets more
       | expressive.
       | 
       | However, the performance will be quite different depending on
       | the Python version you use. Interestingly, the Python 3.11 beta,
       | which comes with a lot of performance tweaks, is slower for this
       | exercise, while being reported to be faster on real-life tasks.
        
         | eesmith wrote:
         | Your code assumes there is at least one newline in each 64K
         | chunk, and assumes there are no words past the final newline.
         | 
         | It will fail on "abc" and give the wrong answer for "abc\ndef".
         | 
         | I prefer rpartition over rsplit to handle first case, and the
         | loop-and-a-half construct instead of the while+walrus operator
         | to handle the second, as in this modified version of your code:
         | 
         |     remaining = ""
         |     c = Counter()
         |     while True:
         |         chunk = sys.stdin.read(64 * 1024)
         |         if not chunk:
         |             if not remaining:
         |                 break
         |             pre = post = ""
         |         else:
         |             pre, mid, post = chunk.lower().rpartition("\n")
         |         c.update((remaining + pre).split())
         |         remaining = post
        
           | BiteCode_dev wrote:
           | Indeed, rpartition is better suited for this task.
        
       | raffraffraff wrote:
       | I love using the UNIX shell (pipes and programs that do "one
       | thing well") to the point where it hurts my ability to improve at
       | other languages. But sometimes all you need is a throwaway
       | command to do something once, not a beautifully optimized
       | function that will be written, improved, compiled, profiled,
       | reoptimized and then executed a billion times. The utter terseness
       | of a bash one-liner just tickles me.
        
       | FpUser wrote:
       | I would not call it a performance comparison at all. When Python
       | calls functions written in C, it is not Python's performance. Write
       | those functions using plain Python and then see the results. Sure
       | for this basic example in the article it does not matter from a
       | practical standpoint. But when you need to step away from canned
       | cases suddenly Python's performance sucks big time.
        
         | __marvin_the__ wrote:
         | Python's `collections.Counter` is written in Python and is a
         | subclass of the builtin `dict` type. I don't think it's
         | comparable to something like using `pandas` to solve the
         | problem.
         | 
         | https://github.com/python/cpython/blob/main/Lib/collections/...
        
         | kortex wrote:
         | Can we move past the whole "it is not really python to use
         | libraries written in C", especially when talking pure stdlib
         | python?
         | 
         | Python is basically a DSL for C extensions. That is the whole
         | point. It would be like criticizing any compiled language for
         | essentially being a DSL for machine code, and not "Real
         | Instructions".
         | 
         | Python's ability to interop with pre-built, optimized libraries
         | with a lightweight interface is arguably its greatest selling
         | point. Everyone and their dog knows purely interpreted CPython
         | is slow. It doesn't need to be pointed out every single time
         | python performance is brought up unless literally discussing
         | optimizing the CPython vm.
        
           | igouy wrote:
           | > It doesn't need to be pointed out ...
           | 
           | Apparently there will be someone who feels the need to point
           | it out; and someone who feels the need to point out that it
           | doesn't need to be pointed out; and ...
        
           | [deleted]
        
           | FpUser wrote:
           | I am not criticizing Python. It does well enough what it was
           | made to do. It just makes no sense to call that particular
           | example a "language performance comparison". It is anything
           | but.
        
             | kortex wrote:
             | But you are still wrong. As mentioned, dicts are incredibly
             | efficient data structures in Python (because they underpin
             | _everything_) and the Counter class is pure Python. That's
             | 100% pure Python. Saying dicts "don't count" because they
             | are implemented in C would disqualify the entire language
             | of CPython, as virtually everything under the hood is a
             | PyObject struct pointer. It just so happens that "counting
             | abstract objects" is the class of problem* CPython does
             | basically all the time in normal VM execution anyways, so
             | this task is easy and fast.
             | 
             | * looking up a reference of an attribute with a string key
             | underpins the meat and potatoes of python's execution model
        
       | ribit wrote:
       | These days I tend to use R for this type of problem. It's slow,
       | but vectors as first-class data types simplify things a lot:
       | 
       | readLines(file) |> strsplit(" +") |> unlist() |> table() |>
       | sort(decreasing = TRUE)
        
       | jerryjerryjerry wrote:
       | Our distributed NewSQL database project actually uses three of
       | them, Go, Rust, and C++, for different layers. There are pros and
       | cons for each of them, but it is still a challenge to maintain three
       | languages in one system. Performance-wise, C++ and Rust are good;
       | coding-wise, Go is the easiest, but Rust kind of enforces best
       | practices.
        
       | Karellen wrote:
       | Doesn't the Unix Shell implementation break the "threading"
       | constraint, in that each program in the pipe will have a main
       | thread each, and all the programs will run concurrently?
       | 
       | Or should the "threading" constraint really have been stated as
       | "you can use multiple threads, but only if they all have their
       | own address space"?
       | 
       | Alternatively, given the multi-core capabilities of even
       | budget/mobile/small systems these days (even the Raspberry Pi 4
       | has a quad-core CPU), isn't restricting the implementation to a
       | single thread a weird artificial limitation in 2022? Also, a GPU-
       | based implementation would be interesting, and GPUs get most of
       | their power from their massively parallel core architecture.
        
         | slaymaker1907 wrote:
         | If I were doing the benchmark, I would just run it on a VM with
         | a single physical core. There are good reasons for comparing
         | single threaded performance since it may not be necessary or
         | desirable to scale that way. For example, suppose it's a wc API
         | or something; limiting each request to a single thread can be
         | better since you get fairer scheduling and less data sharing
         | which is almost always better for multiprocessing. Even shared
         | immutable data can cause problems since the CPU may not realize
         | the data is actually immutable.
         | 
         | I agree GPUs would probably do well on this problem, though it
         | depends on how expensive the memory copying to/from the GPU
         | ends up being. However, if you are going for all-out
         | performance, it seems like you could use SIMD like memchr or
         | something.
        
           | Karellen wrote:
           | > If I were doing the benchmark, I would just run it on a VM
           | with a single physical core.
           | 
           | I think that's definitely an environment worth benchmarking
           | on - but I don't think that it should be the _only_
           | environment to benchmark on.
           | 
           | Also, I don't think it's a good reason to limit
           | implementations to a single thread, even if that is your
           | benchmark environment. It can be worth seeing how well an
           | implementation that's capable of taking advantage of multiple
           | cores/CPUs, does when it's only given one core to work with.
           | 
           | It's probably worth optimising for and benchmarking different
           | threading strategies too - does your implementation create
           | one thread per "work unit" and let the scheduler work them
           | all out, or does it create a thread pool with one thread per
           | core (or maybe one thread per core, plus one), and assign work
           | units to each thread until they're done? And
           | how do those strategies work if they're only given a single
           | core?
           | 
           | The single core case is definitely worth testing, but it
           | seems odd to limit implementations to a single thread because
           | of it. If you think you can go faster with a threaded
           | implementation, you should be able to try that out.
        
       | fallingfrog wrote:
       | Your Perl implementation is almost as fast as c++? Something
       | weird is going on here. Maybe you are bottlenecked on disk read
       | speeds or something.
        
         | theonemind wrote:
         | Not really surprising. Perl approaches a domain specific
         | language for text processing with a heavily-optimized 34 year
         | old implementation.
        
         | bjoli wrote:
         | Why is that surprising? The Perl interpreter is fast and
         | dispatches just about everything to heavily optimized C. Just
         | like Python does, only it is better at it.
         | 
         | This is the kind of thing where perl really shines.
        
       | josefx wrote:
       | No need to decode the C++ names when you use
       | valgrind/callgrind/kcachegrind; what works for C works just as well
       | for C++.
        
       | gibsonf1 wrote:
       | What code was used for Common Lisp? I'm guessing a well written
       | version would be in the top contenders for speed.
        
         | Jtsummers wrote:
         | https://github.com/benhoyt/countwords/blob/master/simple.lis...
         | 
         | From Hoyt's repository itself.
        
           | gibsonf1 wrote:
           | Which lisp compiler did you use?
        
             | Jtsummers wrote:
             | I haven't run it, just tracked down the source because I
             | was similarly curious about the slow runtime. If I do run
             | it it will be with SBCL. Looking at the repo, Hoyt did as
             | well.
        
               | gibsonf1 wrote:
               | That code is slow because it's working with a really long
               | list instead of a very fast hash table
        
           | gibsonf1 wrote:
           | Right, this came up a year ago with a version that ranks
           | among the top here:
           | https://news.ycombinator.com/item?id=26465857
           | 
           |     (defmethod performance-count ((path-file string))
           |       (let ((map (make-hash-table :test 'equal)))
           |         (with-open-file (stream path-file :direction :input
           |                          :if-does-not-exist nil)
           |           (when stream
           |             (loop for line = (read-line stream nil 'end)
           |                   until (eq line 'end)
           |                   do (let ((split (split-string #\space (string-downcase line))))
           |                        (dolist (word split)
           |                          (let ((index (gethash word map)))
           |                            (if index
           |                                (setf (gethash word map) (incf index))
           |                                (setf (gethash word map) 1))))))
           |             (let ((keys (sort (alexandria:hash-table-keys map)
           |                               (lambda (x y) (> (gethash x map) (gethash y map))))))
           |               (dolist (key keys)
           |                 (format t "~A ~A~%" key (gethash key map))))))))
        
             | Jtsummers wrote:
             | You don't need those extra newlines if you'd prefix every
             | code line with two spaces, by the way. It makes it much
             | more readable if you do that.
        
               | gibsonf1 wrote:
               | Nice! I didn't know that trick.
        
       | Twirrim wrote:
       | As they call out in the article, accurately counting words is a
       | bit more complicated than splitting by space. You need to account
       | for all sorts of punctuation etc.
       | 
       | When I've been asked this in programming interviews, I've almost
       | always been expected to produce something like the code in the
       | article (and do), but usually I'd point to something like the
       | NLTK library as a better approach. It's polished and highly
       | capable, and handles probably just about every edge case there
       | is.
        
       | chrismorgan wrote:
       | > _Incidentally, this problem set the scene for a wizard duel
       | between two computer scientists several decades ago. In 1986, Jon
       | Bentley asked Donald Knuth to show off "literate programming"
       | with a solution to this problem, and he came up with an
       | exquisite, ten-page Knuthian masterpiece. Then Doug McIlroy (the
       | inventor of Unix pipelines) replied with a one-liner Unix shell
       | version using tr, sort, and uniq._
       | 
       | Since this writing and other linked resources present only one
       | side of the affair, I will mention: the presentation in _Pearls_
       | was quite unfair to Knuth, and the conclusions commonly made of
       | it somewhere between moderately and entirely unsound. (That is,
       | as conclusions _from the paper_. I speak of unsoundness of logic
       | only: these conclusions may be supportable from other sources.)
       | 
       | Knuth was challenged to demonstrate literate programming, and so
       | _that was what he demonstrated_. He showed building something
       | from the ground up assuming a very minimal environment, with
       | problem analysis and all: of course that ended up more verbose
       | than a Unix pipeline that glosses over problem analysis and
       | starts with most of the tools you need to make a probably-
       | acceptable solution!
       | 
       | McIlroy himself admitted a few years later that he had been "a
       | little unfair" to criticise on engineering grounds what had only
       | been an illustration of technique. Even in the paper, Bentley did
       | admit a degree of culpability for this mismatch in his "criticism
       | of programs" paragraph. ("He admires the execution of the
       | solution, but faults the problem on engineering grounds. (That
       | is, of course, my responsibility as problem assigner; Knuth
       | solved the problem he was given on grounds that are important to
       | most engineers-the paychecks provided by their problem
       | assigners.)")
       | 
       | But I do wish that Knuth had been given the opportunity to write
       | a review of McIlroy's review, for simultaneous publication.
        
         | mistrial9 wrote:
         | 1986 was very early days for modern operating systems. I
         | suspect that the hardware environment (low resource) would be
         | surprising to many phone owners here. So the efficacy had to do
         | with the tools' low-level bindings, too.
        
           | drfuchs wrote:
           | Indeed. In 1986, Knuth was exclusively using the SAIL DEC10,
           | running the WAITS OS. This environment limited him to
           | approximately half a megabyte of data space (heap plus stack
           | plus globals), and half a megabyte of code space. All
           | statically linked, so all library and runtime stuff had to
           | fit.
           | 
           | Also of note is that most CPUs of the day didn't execute
           | instructions orders of magnitude faster than bandwidth to
           | main memory, as is the case today; it was early days for
           | worrying about making algorithms "cache aware". So, for
           | instance, there was no big penalty for "pointer chasing" in
           | linked lists as there is today. Additionally, the severe
           | memory size limits affected various time vs. space tradeoffs
           | in algorithm design.
        
             | mistrial9 wrote:
             | and, is it a compiled language at all? versus a script
             | language of some kind.
        
               | Jtsummers wrote:
               | Is what a compiled language? For Knuth's program in the
               | Programming Pearls article? He wrote it in Pascal, so not
               | a scripting language.
        
         | [deleted]
        
         | [deleted]
        
       | mustache_kimono wrote:
       | ... (2021). Great read and fun project. See the discussion from
       | when it was originally posted by the author.[0]
       | 
       | [0]: https://news.ycombinator.com/item?id=26463967
        
       | claytongulick wrote:
       | My takeaway is that unoptimized c++ is pretty close to the same
       | performance as JavaScript.
       | 
       | Wild.
        
         | igouy wrote:
         | _fwiw_ https://benchmarksgame-
         | team.pages.debian.net/benchmarksgame/...
        
           | [deleted]
        
       | bsaul wrote:
       | Surprised to see Swift 4 times as slow as non-optimized Go.
       | 
       | Does anyone have an explanation?
       | 
       | I know the Swift string type is really complex, but I always
       | assumed it was at least performing well.
        
         | igouy wrote:
         | _fwiw_ https://benchmarksgame-
         | team.pages.debian.net/benchmarksgame/...
        
         | msbarnett wrote:
         | Reading anything at all into the relative performance of these
         | programs written by different people which are all doing
         | radically different things seems like a gigantic waste of time.
         | 
         | Some of these "simple" versions are custom-specifying
         | allocators to minimize memory overhead and doing highly
         | optimizable ASCII value manipulations and doing byte-equality
         | on strings, rather than go through slower Unicode libraries.
         | The "simple" Swift version is doing nothing at all to avoid
         | generating many allocations/deallocations and running fully
         | Unicode-aware algorithms to lower the input and perform
         | Unicode-normalized string comparisons, it's totally apples-to-
         | oranges stuff.
        
           | bsaul wrote:
           | I was wondering if this was due to Unicode indeed. I had
           | hoped that the benchmarks required languages to perform the
           | same kinds of string checks, otherwise there's really no
           | point.
        
       | lrem wrote:
       | This site makes Chrome (Android, current) stop responding.
        
         | mellosouls wrote:
         | Yeah, mine too. Ironic considering the subject...
        
       | yakubin wrote:
       | I like the OCaml version. It's both very easy to read and comes
       | out really well in the benchmark.
       | 
       | I also looked at the "simple" Zig version, which came out really
       | well in the benchmark, and to me it didn't look simple at all. It
       | seems you need to make a lot of low level details explicit.
       | 
       | But IMHO AWK takes the crown here. :)
        
         | OskarS wrote:
         | > But IMHO AWK takes the crown here. :)
         | 
         | Agreed! AWK is still the king of this stuff. For tasks like
         | this, I kinda think AWK is nigh-unbeatable: so simple to write,
         | so obvious what's going on (even if you've never seen any AWK
         | program before, you're probably going to be able to figure out
         | what's going on there), and decently performant.
         | 
         | AWK is the bee's knees.
        
           | moffkalast wrote:
           | On the other hand, I'm rather surprised that the Go version
           | is almost as long and opaque as C.
        
             | benhoyt wrote:
             | Are you looking at the "simple" or the "optimized"
             | versions? For the optimized, yes, the Go one is very
             | similar to the C. For the simple, idiomatic version, the Go
             | version [1] is much simpler than the C one [2]: 40 very
             | straight-forward LoC vs 93 rather more complex ones
             | including pointer arithmetic, tricky manual memory
             | management, and so on.
             | 
             | [1] https://github.com/benhoyt/countwords/blob/c66dd01d868a
             | a83dc... [2] https://github.com/benhoyt/countwords/blob/c66
             | dd01d868aa83dc...
        
         | tannhaeuser wrote:
         | TFA only talks about gawk, though, making this comparison
         | meaningless, i.e. you can't benchmark _languages_ but only
         | _language implementations_. Same goes for other programming
         | languages ofc unless those have only a single implementation.
         | 
         | mawk is an order of magnitude faster than gawk, and gawk isn't
         | even the default on many Linuxen.
        
           | benhoyt wrote:
           | > TFA only talks about gawk, though, making this comparison
           | meaningless
           | 
           | Not exactly! To quote from TFA (I'm the author):
           | 
           | > Another "optimization" is to run it using mawk, a faster
           | AWK interpreter than gawk. In this case it's about 1.7 times
           | as fast as gawk -b. I'm using mawk in the benchmarks for the
           | optimized version.
        
       | [deleted]
        
       | osigurdson wrote:
       | C# really needs a GetOrAdd method on the regular Dictionary class
       | (as exists on ConcurrentDictionary). This common pattern (from
       | the optimized version) requires the hash function on "word" to be
       | evaluated twice (once for TryGetValue and once for Add). Then, if
       | there is a collision on the 32 bit hash then the equality check
       | needs to be run twice as well. Not good as all of these are O(N)
       | operations (where N is the size of "word"). I suspect the
       | optimized version would be materially faster if they implemented
       | their own dictionary that includes this.
       | 
       |     if (!counts.TryGetValue(word, out var wordCountRef))
       |     {
       |         counts.Add(word, new Ref<int>(1));
       |     }
        
       | nr2x wrote:
       | Some of these efficiency gains are remarkable. Most of the code I
       | write is just the quickest way to write a mundane function.
       | Really curious what the energy savings would be if you went into a
       | huge tech co and devoted ~10% of SWE time solely to optimizing the
       | code base.
        
         | jeffbee wrote:
         | You can't walk into a huge tech co and make a dent because huge
         | tech co already has hundreds of engineers on that job full
         | time. Just look at the papers coming out of huge tech co's.
         | They will publish if their technique saves them 0.2%. That's
         | how tight things are already wired in the big leagues.
         | 
         | The real money is in huge non-tech co, or medium tech co. You
         | could walk into a goldman sachs type of operation and point out
         | savings in CPU time that would be 90-99%. But they're not going
         | to care, because their corporate structures aren't wired to
         | observe and act on such things.
        
       | anonu wrote:
       | q solution in kdb, runs in about 2.5s on my machine vs simple.py
       | which takes 4.7s. Rough interpolation with the results table puts
       | this on par with Rust/Go/OCaml or other compiled languages.
       | 
       |     \t desc count each group `$lower " " vs raze read0 `:kjvbible_x10.txt
       | 
       | It's also 1 line!
       | 
       | I am sure q pros or k purists can optimize this even more...
       | 
       | EDIT: Moving the lower earlier brings this down to 1.9s
       | 
       |     desc count each group `$ " " vs lower raze read0 `:kjvbible_x10.txt
        
       | gary17the wrote:
       | Wow. Swift, touted as "safe by design and (...) runs lightning-
       | fast"[1] is more of a screw-up than I thought. Almost twice as
       | slow as Lua and behind even Pascal and Forth.
       | 
       | [1] https://developer.apple.com/swift/
        
         | sk0g wrote:
         | It's a single implementation, I wouldn't put too much on it. I
         | can definitely write a C++ implementation that's slower and
         | less optimised than most of those results, particularly if I
         | was using Boost.
         | 
         | iPhone and Mac apps run pretty well in my experience, and Swift
         | is definitely faster (in general) than Python at least. There
         | were some serious considerations to port over ML libraries to
         | Swift due to its ease of use, similar to Python, while
         | providing much better execution speed.
        
           | fractalb wrote:
           | > There were some serious considerations to port over ML
           | libraries to Swift due to its ease of use, similar to Python,
           | while providing much better execution speed.
           | 
           | Google abandoned that Swift work (if that's what you're
           | referring to).
        
             | PossiblyKyle wrote:
             | Google and abandoning things, name a more iconic duo
        
         | tialaramex wrote:
         | I don't know that I'd characterise it as a "screw up" based on
         | one random person's solution to one particular problem.
         | 
         | In particular the "simple" solutions, which is all that's
         | offered for Swift, will be whatever was most idiomatic/obvious
         | to a programmer of potentially quite variable quality.
         | 
         | It's unfortunate, I think, that there are "simple" solutions
         | for some languages which have had performance revisions. If you
         | needed an hour, or a friend, or even a community to "hint" how
         | to improve it that wasn't the simple solution.
         | 
         | [ For example I feel like the Rust might be faster asking for
         | ASCII lowercase, and using the unstable (not order-preserving)
         | sort algorithm, but even if I'm correct such a change would
         | logically be made to the optimised Rust, not the simple ]
        
         | mayoff wrote:
         | Note that only an unoptimized Swift solution was provided, and
         | Swift String operations (like `lowercased`) are fully Unicode-
         | compliant.
        
         | machinekob wrote:
         | Swift is in the same or better performance bracket than C#/Java,
         | and unsafe usage can be close to the performance of C (maybe
         | 1.3-2x slower). I'm not sure about ranting over poor single test
         | results :P
        
           | metaltyphoon wrote:
           | The optimized version of C# is not really optimized, as there
           | are things left on the table that could bring it much closer to
           | C++.
        
           | gernb wrote:
           | Lattner himself (creator of Swift) said Swift had speed issues
           | because of ARC that they were hoping to solve with better
           | code analysis. Did they?
        
             | [deleted]
        
             | georgelyon wrote:
             | They have been improving this steadily (I would argue it is
             | rarely an issue nowadays, and there have always been simple
             | workarounds), but there's no reason a solution to the
             | problem mentioned in the article should need to use ARC,
             | and should use structs/protocols instead (which don't use
             | reference counting). I've seen a lot of Benchmarks that
             | rate Swift poorly because of this.
        
           | guenthert wrote:
           | No need for hand-waving, there are publicly available
           | benchmark results: https://benchmarksgame-
           | team.pages.debian.net/benchmarksgame/...
           | 
           | It's not the only benchmark in town and it needs to be taken
           | with a grain of salt, but so does every benchmark.
        
         | bhauer wrote:
         | I only skimmed the article, but I suspect they are including
         | the time necessary to spin up the executable and any runtime
         | environment prior to executing the relevant code.
         | 
         | "Lightning fast" can mean a lot of things, and it might _not_
         | mean  "executables start quickly."
        
       | ijidak wrote:
       | > We usually think of I/O as expensive, but I/O isn't the
       | bottleneck here...The tokenization and hash table operations are
       | the bottleneck by a long shot
       | 
       | Interesting. This is something I'll have to keep in mind.
       | 
       | Best practice has always been to consider I/O the slowest.
       | 
       | But, it's true... times have changed.
       | 
       | Maybe we shouldn't make this assumption anymore.
        
         | FridgeSeal wrote:
         | Here's a couple of articles on the performance of IO and
         | slowness of other stuff that I find particularly interesting:
         | 
         | https://itnext.io/modern-storage-is-plenty-fast-it-is-the-ap...
         | 
         | https://gregoryszorc.com/blog/2021/04/06/surprisingly-slow/
        
       | svat wrote:
       | See also a different comparison (on ~the same problem) at
       | StackExchange:
       | https://codegolf.stackexchange.com/questions/188133/bentleys...
       | 
       | (Copying from my comment the last time this was posted:
       | https://news.ycombinator.com/item?id=26467684)
       | 
       | There's also a nice book "Exercises in Programming Style" about
       | just this problem.
        
         | igouy wrote:
         | See also
         | 
         | http://www.bagley.org/~doug/shootout/bench/wordfreq/
        
       | compumike wrote:
       | How about extending this to comparing programming language
       | efficiency from a developer's standpoint, looking at bytes of
       | code, versus the execution time?
       | 
       | I took the source code file size from the repository, and the
       | runtime from the blog post.
       | 
       | Then I made an arbitrary overall "PAIN SCORE" (lower is better)
       | by multiplying code size * runtime. I suggest this is a
       | worthwhile metric simply because lower is better on both axes,
       | but of course, in the "real world" there will be different
       | economic costs to CPU time and developer time depending on the
       | use case. Here's the sorted results, from least "pain" to most:
       | 
       |     LANGUAGE   FILENAME        CODE SIZE    RUNTIME   PAIN SCORE (LOWER IS BETTER)
       |     Shell      optimized.sh      75 bytes    1.83 s    137.25
       |     Crystal    simple.cr        240 bytes    1.29 s    309.6
       |     Nim        simple.nim       424 bytes    0.77 s    326.48
       |     Python     simple.py        208 bytes    2.21 s    459.68
       |     Ruby       simple.rb        175 bytes    3.17 s    554.75
       |     Go         optimized.go    1514 bytes    0.40 s    605.6
       |     Python     optimized.py     464 bytes    1.33 s    617.12
       |     Zig        optimized.zig   2688 bytes    0.24 s    645.12
       |     Go         simple.go        688 bytes    1.12 s    770.56
       |     Zig        simple.zig      1394 bytes    0.55 s    766.7
       |     Nim        optimized.nim   1683 bytes    0.49 s    824.67
       |     Shell      simple.sh         60 bytes   14.81 s    888.6
       |     Ruby       optimized.rb     401 bytes    2.47 s    990.47
       |     JavaScript simple.js        532 bytes    1.88 s   1000.16
       |     C          optimized.c     4360 bytes    0.23 s   1002.80
       |     Rust       optimized.rs    3065 bytes    0.43 s   1317.95
       |     Swift      simple.swift     317 bytes    4.23 s   1340.91
       |     JavaScript optimized.js    1501 bytes    1.10 s   1651.1
       |     C          simple.c        2735 bytes    0.96 s   2625.6
       |     Rust       simple.rs       2239 bytes    1.38 s   3089.82
       | 
       | Sorting only by code size, the most concise implementations are:
       | Shell, Ruby, Python, Crystal. Nobody was aiming to play code golf
       | (i.e. minimize source code size), so these are fairly
       | straightforward, idiomatic, readable implementations.
       | 
       | I am definitely a Crystal fan, in fact this afternoon I'm
       | continuing to implement suggestions from my recent Show HN
       | https://news.ycombinator.com/item?id=32081943 comments. :)
        
         | adsharma wrote:
         | It'd be interesting to write code in a concise, easy to
         | learn/grok language and then transpile to a faster systems
         | language and recompute the PAIN score. I spent a few months
         | last year pursuing that approach (py2many on github).
         | 
         | We'd need a stdlib that works across languages for the approach
         | to be successful. nimpylib and similar libraries in other
         | languages are a step in that direction.
         | 
         | Alternatively, the python stdlib itself could be rewritten in
         | python and transpiled to other languages. Perhaps it'd help
         | alternative python implementations in terms of compatibility.
        
         | jandrese wrote:
         | I would expect the AWK version to have very low pain despite
         | its reputation. This is the sort of problem that AWK is well
         | optimized for.
        
         | bestinterest wrote:
         | Also a Crystal fan but from afar. The compile time speeds are a
         | real killer of productivity.
        
         | igouy wrote:
         | _fwiw_ https://benchmarksgame-
         | team.pages.debian.net/benchmarksgame/...
         | 
         | https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
        
         | optionalsquid wrote:
         | It would probably be prudent to strip comments before measuring
         | the code sizes:
         | 
         | I did not check all files, but noticed that the simple rust
         | code consists of about 60% comments (in terms of bytes), mostly
         | due to a verbose blurb. It also spends about 15% of the non-
         | comment bytes on printing nicer error messages, which is an
         | interesting choice for a micro-benchmark. The simple C and
         | FORTH versions similarly have a lot of comments.
         | 
         | Meanwhile a lot of the other files have very few or no
         | comments.
        
       | djhworld wrote:
       | EDIT: I think the use of the term "allocation" might be
       | overloaded here from the original article. After thinking about
       | this a bit more, I think the author is referring to a different
       | way of accessing the map. If you look in godbolt
       | https://godbolt.org/z/Yf64rodW6 you can see that the pointer
       | version uses
       | 
       |     CALL runtime.mapaccess2_fast64(SB)
       | 
       | whereas the 'raw' version uses
       | 
       |     CALL runtime.mapassign_fast64(SB)
       | 
       | When reading the Go solution this bit stood out to me
       | 
       | > To reduce the allocations, we'll use a map[string]*int instead
       | of map[string]int so we only have to allocate once per unique
       | word, instead of for every increment
       | 
       | I just tried benchmarking this with a simple setup and I get zero
       | allocations for both approaches, although the "pointer" approach
       | is _slightly_ faster:
       | 
       |     package main
       | 
       |     import (
       |         "testing"
       |     )
       | 
       |     func BenchmarkIncrementMapRawInt(b *testing.B) {
       |         var data = make(map[int]int)
       |         b.ResetTimer()
       |         for i := 0; i < b.N; i++ {
       |             key := i % 1000
       |             data[key]++
       |         }
       |     }
       | 
       |     func BenchmarkIncrementMapPointerInt(b *testing.B) {
       |         var data = make(map[int]*int)
       |         b.ResetTimer()
       |         for i := 0; i < b.N; i++ {
       |             key := i % 1000
       |             increment(data, key)
       |         }
       |     }
       | 
       |     func increment(counts map[int]*int, value int) {
       |         if p, ok := counts[value]; ok {
       |             *p++
       |             return
       |         }
       |         n := 1
       |         counts[value] = &n
       |     }
       | 
       |     $ go test -bench=. -benchmem .
       |     goos: darwin
       |     goarch: arm64
       |     BenchmarkIncrementMapRawInt-8        112288002    10.57 ns/op    0 B/op    0 allocs/op
       |     BenchmarkIncrementMapPointerInt-8    139728302     8.586 ns/op   0 B/op    0 allocs/op
        
         | dan-robertson wrote:
         | Presumably this would allocate if the ints weren't "primitive",
         | e.g. if you had 128-bit ints implemented as a tuple of two ints.
        
       | HarHarVeryFunny wrote:
       | This is a rather meaningless comparison, since the differences
       | are going to be dominated by:
       | 
       | 1) The choice of libraries/datatypes used for strings and
       | word->count map
       | 
       | 2) How the source file is split into words - probably library
       | function again, although in C/C++ one _could_ choose to implement
       | a super-optimized low level version that would blow the others
       | away
       | 
       | IMO a performance comparison between languages is only meaningful
       | if the runtime is not dominated by library functions, or if one
       | admits it's really a standard library performance comparison, not
       | a language comparison.
        
         | kitd wrote:
         | I understand what you're saying. However if you look at it not
         | from the point of view of the performance figures _per se_, but
         | rather which language, when written idiomatically, can solve a
         | realistic problem performantly, then I think it makes more
         | sense.
        
         | ashvardanian wrote:
         | Agreed. It's a common desire to boil benchmarks down to a
         | single number.
         | 
         | Some of those languages, however, are far more complex than the
         | others. And as any complex tool, it requires certain
         | methodology. If you are to use heavy machinery like C++, forget
         | iostreams and character-level processing on your hot paths.
        
         | oconnor663 wrote:
         | Do you have a particular example problem in mind that might
         | demonstrate language performance better? One of the Benchmark
         | Game programs maybe?
        
       | habibur wrote:
       | You need to measure memory usage with it, take the product of the
       | two (memory and speed), and thus get a very good number on
       | performance for comparison.
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2022-07-24 23:00 UTC)