[HN Gopher] Performance comparison: counting words in Python, C/...
___________________________________________________________________
 
Performance comparison: counting words in Python, C/C++, Awk,
Rust, and more
 
Author : pcr910303
Score  : 202 points
Date   : 2022-07-24 15:31 UTC (7 hours ago)
 
(HTM) web link (benhoyt.com)
(TXT) w3m dump (benhoyt.com)
 
| gxt wrote:
| > a simple interview problem
|
| I've been asking different versions of that question since I
| started interviewing and it is a surprisingly good filter.
|
| But with articles like this, now I need new questions..
| staticassertion wrote:
| It's a bit of a shame that Rust's stdlib doesn't have a "fast but
| less safe" hasher. It's one of the easiest performance
| optimizations in my experience, because it's very rare that an
| attacker controls input to your hashmap in a way that can
| actually lead to a DoS.
|
| Not that it doesn't happen, it's just not something I've run
| into.
|
| Just today I was testing out the performance difference hashing
| some Scylla queries and got nearly 2x faster hashing by moving to
| fxhash.
| anonymoushn wrote:
| Recently there was a contest to optimize the performance of a
| similar program[0] and a Zoom discussion of the optimizations[1].
|
| The programs are not comparable in the following ways:
|
| - Case: TFA requires (at least) ASCII lowercasing, but the
| contest problem required no lowercasing.
|
| - Ordering: TFA does not require sorting, but the contest problem
| required sorting.
|
| - Memory: TFA imposes a requirement phrased as "don't read the
| whole file into memory", and this sounds like it's a resource-
| saving constraint, but it's usually a constraint that requires
| the program to spend additional resources. You could just mmap
| the file and store pointers into the mapped region. It costs
| extra to do copies instead of no copies.
|
| - Text: TFA is unclear on what assumptions may be made about the
| lengths of words.
| For the contest problem, the Hungarian
| Wikipedia input's longest word is around 80k.
|
| - Safe, Hashing, Stdlib: TFA imposes some restrictions on what
| constructs may be used that are not imposed in the contest
| problem.
|
| For the contest version of this problem, it seems like you can
| tokenize, hash, and count strings at around 1GB/s. Adapting a
| solution to solve TFA's problem (but not to conform to its
| Safe/Hashing/Stdlib requirements) would probably not carry too
| large of a penalty, since it's like 3 instructions to
| ASCII-lowercase 32 bytes, and 1 string copy per unique string
| should take negligible time compared to the hash table lookups.
| So there is some room for the optimized solutions to go a little
| faster, if more optimizations are permitted.
|
| [0]: https://easyperf.net/blog/2022/05/28/Performance-analysis-
| an...
|
| [1]: https://www.youtube.com/watch?v=R_yX0XjdSBY
| fractalb wrote:
| grep is indeed fast!
| unhammer wrote:
| Hm, I tried comparing the simple.hs vs simple.py (with GHC
| 8.10.7, which I happened to have here; their stack.yaml uses
| 8.10.4) and it's 5.7s vs 1.9s, not quite as stark a difference.
| Maybe because I'm on Linux?
| mustache_kimono wrote:
| Re: the Rust optimized implementation, I was able to get ~20-25%
| better performance by rewriting the for loops as iterators, using
| Rust's byte range pattern matching, and a buffered writer, which
| seems crazy, but it's true. I chalked it up to some crazy
| ILP/SIMD tricks the compiler is doing.
|
| I even submitted a PR[0], but Ben decided he was tired of
| maintaining it and chose to archive the project (which, fair
| enough!).
|
| [0]: https://github.com/benhoyt/countwords/pull/115
| grumpyprole wrote:
| Iterators are potentially a much better design too, as they
| allow for more modular code.
| codeflo wrote:
| Very nice. That's in fact really clean, and (at least to me)
| indeed surprising.
|
| As rejection reasons go, "I wanted to name-drop person X" is
| interesting. But as you said, that's the maintainer's decision
| to make.
| [deleted]
| mustache_kimono wrote:
| Can't win them all. I think I came along when they were just
| really sick of maintaining it. Ben and Andrew have all my
| respect.
| codeflo wrote:
| For those curious about Rust optimization, the "optimized"
| version makes three changes compared to the "idiomatic" version:
|
| * It uses byte strings instead of UTF-8 strings. In my opinion,
| that's not an optimization, that's changing the problem.
| Depending on the question you're asking, only one of the two can
| be correct.
|
| * It uses a faster hash algorithm. It's not the first time this
| has come up in a benchmark article. Rust's decision to use a
| DoS-safe hash by default (and not provide a fast algorithm in the
| std, like other languages do) really seems to hurt it in this
| kind of microbenchmark.
|
| * It uses get_mut+insert instead of the more convenient
| HashMap::entry method, because the latter would require
| redundantly allocating the key even in the repeat case. I've hit
| this problem in the past as well. Maybe the upcoming
| HashMap::raw_entry_mut will make this kind of optimization
| cleaner.
| saghm wrote:
| > It uses byte strings instead of UTF-8 strings. In my opinion,
| that's not an optimization, that's changing the problem.
| Depending on the question you're asking, only one of the two
| can be correct.
|
| Sure, but is it changing the problem to something easier than
| what the other languages are already doing, or to something
| more similar? I'd imagine the C code is basically just using
| byte arrays as well, for instance.
| tialaramex wrote:
| The problem as specified _declares_ that the words we're counting
| are ASCII:
|
| > ASCII: it's okay to only support ASCII for the whitespace
| handling and lowercase operation
|
| UTF-8 (quite deliberately) is a superset of ASCII.
So a UTF-8 | solution is correct for ASCII, but a bytes-as-ASCII solution | works fine in Rust _if_ you only need ASCII. | | This is why Rust provides ASCII variants of a lot of functions | on strings, and the same functions are available on byte slices | [u8] where ASCII could be what you have (whereas their Unicode | cousins are not available on byte slices). | codeflo wrote: | >> ASCII: it's okay to only support ASCII for the whitespace | handling and lowercase operation | | That's a somewhat specific list -- at least I didn't read | that as a general "the program can assume that the input is | only ASCII". | | But then, the author seems to have accepted solutions that | crash on non-UTF8 sequences and ones that byte-compare them, | so probably either behavior was meant to be fine. I just | don't get that from this rule. | anonymoushn wrote: | > That's a somewhat specific list -- at least I didn't read | that as a general "the program can assume that the input is | only ASCII". | | I don't think it assumes the input is only ASCII. If the | problem is "given UTF-8 text, split on ASCII whitespace and | convert ASCII uppercase letters to lowercase," you can do | that correctly (and produce correct UTF-8 output) without | really being UTF-8 aware. For why, see here: | https://en.wikipedia.org/wiki/UTF-8#Encoding | | > But then, the author seems to have accepted solutions | that crash on non-UTF8 sequences and ones that byte-compare | them, so probably either behavior was meant to be fine. I | just don't get that from this rule. | | That's a separate concern right? The rules are only about | the behavior when the program is given UTF-8 input. | olalonde wrote: | > * It uses byte strings instead of UTF-8 strings. In my | opinion, that's not an optimization, that's changing the | problem. Depending on the question you're asking, only one of | the two can be correct. | | To be fair, that's what the C version does as well. 
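The byte-string vs UTF-8 distinction discussed above can be made concrete with a small sketch (in Python here, purely for illustration; the thread's programs are in Rust and C): byte-level lowercasing only maps the ASCII range, so it cannot corrupt a multibyte UTF-8 sequence, while a Unicode-aware lowercase also maps non-ASCII letters.

```python
# ASCII-only lowercasing on raw bytes: only bytes 0x41-0x5A ('A'-'Z')
# are mapped. The two bytes encoding 'Ö' (0xC3 0x96) have the high
# bit set, so they pass through untouched -- no multibyte UTF-8
# sequence can be broken, matching the comments above.
raw = "HELLO WÖRLD".encode("utf-8")
assert raw.lower().decode("utf-8") == "hello wÖrld"

# A fully Unicode-aware lowercase, by contrast, also maps 'Ö':
assert "HELLO WÖRLD".lower() == "hello wörld"
```

This is why a bytes-as-ASCII solution passes under the challenge's "ASCII-only lowercasing" rule while still emitting valid UTF-8 output for UTF-8 input.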
|
| stewbrew wrote:
| Ad utf8: it would change the problem only if the definition of
| word breaks changed. E.g., if a word break is defined by
| whitespace, then it probably doesn't.
| Conlectus wrote:
| What happens when the ASCII character for space is embedded
| within a multibyte UTF-8 codepoint?
| Kiwikwi wrote:
| Such "overlong" encodings are not valid UTF-8.
| anyfoo wrote:
| If UTF-8 allowed that, then a solution would split words
| wrong according to UTF-8 rules, but it still would not
| matter, since the problem statement says that "it's okay to
| only support ASCII for the whitespace handling and
| lowercase operation".
|
| So if your solution gets word splitting (or lowercasing)
| wrong _for non-ASCII input_, it still gets a pass
| according to my reading.
| kd5bjo wrote:
| UTF-8 is designed so that this doesn't happen: all bytes in
| a multibyte codepoint have the high bit set, placing them
| outside of the ASCII range.
| bitwize wrote:
| This never happens. In UTF-8, bit 7 is set in all valid
| multibyte representations; ASCII characters _cannot_ appear
| in them. Now, you can have an ASCII character in the middle
| of a multibyte sequence, but this isn't valid UTF-8 (and
| should be treated as an error by the decoder). "Words"
| generated by a splitter would also contain invalid UTF-8 if
| they were split on a space coming in between bytes in a
| multibyte sequence, but these can be detected. Word splitting
| will not generate invalid words from valid input, or vice
| versa; and if you know you have valid input, splitting on
| space, \t, \n, \r, etc. is perfectly cromulent in UTF-8.
|
| Now if you want to also split on U+200B (ZERO WIDTH SPACE),
| U+202F (NARROW NO-BREAK SPACE), etc... hoo boy.
| coldtea wrote:
| > _It uses byte strings instead of UTF-8 strings. In my
| opinion, that's not an optimization, that's changing the
| problem.
| Depending on the question you're asking, only one of
| the two can be correct._
|
| For tons of questions, both can be correct.
| SAI_Peregrinus wrote:
| The question as posed only cares about ASCII.
| benreesman wrote:
| This was always my approach when I had to do "leetcode whiteboard
| interviews".
|
| Start with something not too far north of FizzBuzz but with a ton
| of scope to progressively enrich the problem until the clock runs
| out.
|
| At $BIG_CO du jour people tend to talk about how much "signal" an
| interview produces, and starting somewhere that a meaningful
| number of candidates can't do anything with offers very little
| "signal", likewise very little Shannon information.
|
| Great article!
| jeffbee wrote:
| Funny how you can take the "optimized" C++ and make it
| significantly faster (vs. GNU/Linux standard libraries) with
| little effort. A lookup table for to_lower, instead of bit math,
| helps slightly. Building with the LLVM-libc string functions
| helps a lot, because the GNU std::string_view::operator== is
| calling an AVX-optimized memcmp via the PLT, which is a terrible
| strategy for short strings. LLVM-libc has an inlined bcmp (note:
| bcmp can be significantly faster than memcmp, but on GNU they are
| aliases) that is resolved for the target platform at build time
| instead of run time.
|
| Edit:
|
| Even if you ignore LLVM-libc, just slapping this into the
| optimized.cpp and replacing the critical `==` with `our_bcmp`
| makes it 10% faster. IFUNC calls to micro-optimized SIMD
| functions are counterproductive. It is far, far better that the
| compiler can see all the code at build time.
|
|     int our_bcmp(const char* a, const char* b, size_t sz) {
|         for (size_t i = 0; i < sz; i++) {
|             if (a[i] != b[i]) return 1;
|         }
|         return 0;
|     }
| svnpenn wrote:
| You included Awk in the title, but omitted Go, really?
| hu3 wrote:
| Indeed. The original title is:
|
| > Performance comparison: counting words in Python, Go, C++, C,
| AWK, Forth, and Rust.
|
| Which means the submitter actively chose Go to be removed from
| the title. Judging from their post history, they are aligned
| with the Rust community.
| yakubin wrote:
| The original title is 2 characters over the HN title length
| limit. I suspect they dropped "Go" first, because it's the
| language which has a 2-letter name, and then they
| editorialised it further to hint that there are more
| languages.
| hu3 wrote:
| Leaving Awk over the vastly more popular Go hints at a more
| probable and frequent behaviour: bias.
|
| Removing any of them would suffice.
| e4m2 wrote:
| The author maintains an AWK implementation:
|
| https://github.com/benhoyt/goawk
| hu3 wrote:
| Ben Hoyt isn't the same person as the HN submitter.
| [deleted]
| ppeetteerr wrote:
| Curious if anyone knows why the Swift version is slow? Is it the
| casting from subsequence to a string?
| BiteCode_dev wrote:
| In this type of blog post with comparisons including many
| languages, you usually see some unnatural approach for your
| favorite language.
|
| However, in this case, the Python example is idiomatic.
|
| Since the article calls for better approaches, I would suggest
| taking the opportunity to use the walrus operator and rsplit()
| for the optimized version. Something like this:
|
|     remaining = ""
|     c = Counter()
|     while (chunk := sys.stdin.read(64 * 1024)):
|         pre, post = chunk.lower().rsplit('\n', 1)
|         c.update((remaining + pre).split())
|         remaining = post
|
| It should not affect performance too much, and the code gets more
| expressive.
|
| However, performance will be quite different depending on the
| Python version you use. Interestingly, Python 3.11 beta, which
| comes with a lot of performance tweaks, is slower for this
| exercise, while being reported to be faster on real-life tasks.
| eesmith wrote:
| Your code assumes there is at least one newline in each 64K
| chunk, and assumes there are no words past the final newline.
|
| It will fail on "abc" and give the wrong answer for "abc\ndef".
|
| I prefer rpartition over rsplit to handle the first case, and the
| loop-and-a-half construct instead of the while+walrus operator
| to handle the second, as in this modified version of your code:
|
|     remaining = ""
|     c = Counter()
|     while True:
|         chunk = sys.stdin.read(64 * 1024)
|         if not chunk:
|             if not remaining:
|                 break
|             pre = post = ""
|         else:
|             pre, mid, post = chunk.lower().rpartition("\n")
|         c.update((remaining + pre).split())
|         remaining = post
| BiteCode_dev wrote:
| Indeed, rpartition is better suited for this task.
| raffraffraff wrote:
| I love using the UNIX shell (pipes and programs that do "one
| thing well") to the point where it hurts my ability to improve at
| other languages. But sometimes all you need is a throwaway
| command to do something once, not a beautifully optimized
| function that will be written, improved, compiled, profiled,
| reoptimized and then executed a billion times. The utter
| terseness of a bash one-liner just tickles me.
| FpUser wrote:
| I would not call it a performance comparison at all. When Python
| calls functions written in C, it is not Python's performance.
| Write those functions using plain Python and then see the
| results. Sure, for this basic example in the article it does not
| matter from a practical standpoint. But when you need to step
| away from canned cases, suddenly Python's performance sucks big
| time.
| __marvin_the__ wrote:
| Python's `collections.Counter` is written in Python and is a
| subclass of the builtin `dict` type. I don't think it's
| comparable to something like using `pandas` to solve the
| problem.
|
| https://github.com/python/cpython/blob/main/Lib/collections/...
| kortex wrote:
| Can we move past the whole "it is not really python to use
| libraries written in C", especially when talking pure stdlib
| python?
|
| Python is basically a DSL for C extensions. That is the whole
| point.
| It would be like criticizing any compiled language for
| essentially being a DSL for machine code, and not "Real
| Instructions".
|
| Python's ability to interop with pre-built, optimized libraries
| through a lightweight interface is arguably its greatest selling
| point. Everyone and their dog knows purely interpreted CPython
| is slow. It doesn't need to be pointed out every single time
| Python performance is brought up, unless literally discussing
| optimizing the CPython VM.
| igouy wrote:
| > It doesn't need to be pointed out ...
|
| Apparently there will be someone who feels the need to point
| it out; and someone who feels the need to point out that it
| doesn't need to be pointed out; and ...
| [deleted]
| FpUser wrote:
| I am not criticizing Python. It does well enough what it was
| made to do. It just makes no sense to call that particular
| example a "language performance comparison". It is anything
| but.
| kortex wrote:
| But you are still wrong. As mentioned, dicts are incredibly
| efficient data structures in Python (because they underpin
| _everything_) and the Counter class is pure Python. That's
| 100% pure Python. Saying dicts "don't count" because they
| are implemented in C would disqualify the entire language
| of CPython, as virtually everything under the hood is a
| PyObject struct pointer. It just so happens that "counting
| abstract objects" is the class of problem* CPython does
| basically all the time in normal VM execution anyway, so
| this task is easy and fast.
|
| * looking up a reference to an attribute with a string key
| underpins the meat and potatoes of Python's execution model
| ribit wrote:
| These days I tend to use R for this type of problem.
| It's slow,
| but vectors as first-class data types simplify things a lot:
|
|     readLines(file) |> strsplit(" +") |> unlist() |> table() |> sort(desc = TRUE)
| jerryjerryjerry wrote:
| Our distributed NewSQL database project actually uses three of
| them, Go, Rust, and C++, for different layers. There are pros and
| cons to each of them, but it is still a challenge to maintain
| three languages in one system's development. Performance-wise,
| C++ and Rust are good; coding-wise, Go is the easiest, but Rust
| comes closest to best practice.
| Karellen wrote:
| Doesn't the Unix shell implementation break the "threading"
| constraint, in that each program in the pipe will have a main
| thread of its own, and all the programs will run concurrently?
|
| Or should the "threading" constraint really have been stated as
| "you can use multiple threads, but only if they all have their
| own address space"?
|
| Alternatively, given the multi-core capabilities of even
| budget/mobile/small systems these days (even the Raspberry Pi 4
| has a quad-core CPU), isn't restricting the implementation to a
| single thread a weird artificial limitation in 2022? Also, a GPU-
| based implementation would be interesting, and GPUs get most of
| their power from their massively parallel core architecture.
| slaymaker1907 wrote:
| If I were doing the benchmark, I would just run it on a VM with
| a single physical core. There are good reasons for comparing
| single-threaded performance, since it may not be necessary or
| desirable to scale that way. For example, suppose it's a wc API
| or something; limiting each request to a single thread can be
| better, since you get fairer scheduling and less data sharing,
| which is almost always better for multiprocessing. Even shared
| immutable data can cause problems, since the CPU may not realize
| the data is actually immutable.
|
| I agree GPUs would probably do well on this problem, though it
| depends on how expensive the memory copying to/from the GPU
| ends up being.
| However, if you are going for all-out
| performance, it seems like you could use SIMD like memchr or
| something.
| Karellen wrote:
| > If I were doing the benchmark, I would just run it on a VM
| with a single physical core.
|
| I think that's definitely an environment worth benchmarking
| on - but I don't think that it should be the _only_
| environment to benchmark on.
|
| Also, I don't think it's a good reason to limit
| implementations to a single thread, even if that is your
| benchmark environment. It can be worth seeing how well an
| implementation that's capable of taking advantage of multiple
| cores/CPUs does when it's only given one core to work with.
|
| It's probably worth optimising for and benchmarking different
| threading strategies too - does your implementation create
| one thread per "work unit" and let the scheduler work them
| all out, or does it create a thread pool with one thread per
| core (or maybe one thread per core, plus one) and assign
| work units to each thread until they're done? And how do
| those strategies work if they're only given a single core?
|
| The single-core case is definitely worth testing, but it
| seems odd to limit implementations to a single thread because
| of it. If you think you can go faster with a threaded
| implementation, you should be able to try that out.
| fallingfrog wrote:
| Your Perl implementation is almost as fast as C++? Something
| weird is going on here. Maybe you are bottlenecked on disk read
| speeds or something.
| theonemind wrote:
| Not really surprising. Perl approaches a domain-specific
| language for text processing, with a heavily-optimized
| 34-year-old implementation.
| bjoli wrote:
| Why is that surprising? The Perl interpreter is fast and
| dispatches just about everything to heavily optimized C. Just
| like Python does; it is only better at it.
|
| This is the kind of thing where Perl really shines.
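The "work unit per worker" strategies debated above can be sketched in a few lines (Python here, purely as an illustration; `parallel_count` and its chunking scheme are invented for this sketch, and with CPython's GIL a thread pool will not actually speed up CPU-bound counting — the point is only the shape of the split/count/merge approach):

```python
# Sketch of the chunked strategy: split the input into word-aligned
# chunks, count each chunk in its own work unit with a private hash
# table (no sharing between workers), then merge the partial counts.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(chunk):
    # each work unit builds its own private Counter
    return Counter(chunk.lower().split())

def parallel_count(text, workers=4):
    words = text.split()
    step = max(1, len(words) // workers)
    # coarse chunks: roughly one work unit per worker keeps
    # scheduling overhead low, as discussed above
    chunks = [" ".join(words[i:i + step])
              for i in range(0, len(words), step)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)  # merge step, done on one thread
    return total
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` would give real parallelism on multiple cores, at the cost of pickling each chunk across process boundaries — one concrete version of the trade-offs raised in the comment above.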
|
| josefx wrote:
| No need to decode the C++ names when you use
| valgrind/callgrind/kcachegrind for C; that works just as well for
| C++.
| gibsonf1 wrote:
| What code was used for Common Lisp? I'm guessing a well-written
| version would be in the top contenders for speed.
| Jtsummers wrote:
| https://github.com/benhoyt/countwords/blob/master/simple.lis...
|
| From Hoyt's repository itself.
| gibsonf1 wrote:
| Which Lisp compiler did you use?
| Jtsummers wrote:
| I haven't run it, just tracked down the source because I
| was similarly curious about the slow runtime. If I do run
| it, it will be with SBCL. Looking at the repo, Hoyt did as
| well.
| gibsonf1 wrote:
| That code is slow because it's working with a really long
| list instead of a very fast hash table
| gibsonf1 wrote:
| Right, this came up a year ago with a version that ranks
| among the top here:
| https://news.ycombinator.com/item?id=26465857
|
|     (defmethod performance-count ((path-file string))
|       (let ((map (make-hash-table :test 'equal)))
|         (with-open-file (stream path-file :direction :input
|                                 :if-does-not-exist nil)
|           (when stream
|             (loop for line = (read-line stream nil 'end)
|                   until (eq line 'end)
|                   do (let ((split (split-string #\space (string-downcase line))))
|                        (dolist (word split)
|                          (let ((index (gethash word map)))
|                            (if index
|                                (setf (gethash word map) (incf index))
|                                (setf (gethash word map) 1))))))
|             (let ((keys (sort (alexandria:hash-table-keys map)
|                               (lambda (x y) (> (gethash x map) (gethash y map))))))
|               (dolist (key keys)
|                 (format t "~A ~A~%" key (gethash key map))))))))
| Jtsummers wrote:
| You don't need those extra newlines if you'd prefix every
| code line with two spaces, by the way. It makes it much
| more readable if you do that.
| gibsonf1 wrote:
| Nice! I didn't know that trick.
| Twirrim wrote:
| As they call out in the article, accurately counting words is a
| bit more complicated than splitting by space. You need to account
| for all sorts of punctuation etc.
|
| When I've been asked this in programming interviews, I've almost
| always been expected to produce something like the code in the
| article (and do), but usually I'd point to something like the
| NLTK library as a better approach. It's polished and highly
| capable, and handles probably just about every edge case there
| is.
| chrismorgan wrote:
| > _Incidentally, this problem set the scene for a wizard duel
| between two computer scientists several decades ago. In 1986, Jon
| Bentley asked Donald Knuth to show off "literate programming"
| with a solution to this problem, and he came up with an
| exquisite, ten-page Knuthian masterpiece. Then Doug McIlroy (the
| inventor of Unix pipelines) replied with a one-liner Unix shell
| version using tr, sort, and uniq._
|
| Since this writing and other linked resources present only one
| side of the affair, I will mention: the presentation in _Pearls_
| was quite unfair to Knuth, and the conclusions commonly drawn
| from it are somewhere between moderately and entirely unsound.
| (That is, as conclusions _from the paper_. I speak of unsoundness
| of logic only: these conclusions may be supportable from other
| sources.)
|
| Knuth was challenged to demonstrate literate programming, and so
| _that was what he demonstrated_. He showed building something
| from the ground up assuming a very minimal environment, with
| problem analysis and all: of course that ended up more verbose
| than a Unix pipeline that glosses over problem analysis and
| starts with most of the tools you need to make a probably-
| acceptable solution!
|
| McIlroy himself admitted a few years later that he had been "a
| little unfair" to criticise on engineering grounds what had only
| been an illustration of technique. Even in the paper, Bentley did
| admit a degree of culpability for this mismatch in his "criticism
| of programs" paragraph. ("He admires the execution of the
| solution, but faults the problem on engineering grounds.
| (That
| is, of course, my responsibility as problem assigner; Knuth
| solved the problem he was given on grounds that are important to
| most engineers - the paychecks provided by their problem
| assigners.)")
|
| But I do wish that Knuth had been given the opportunity to write
| a review of McIlroy's review, for simultaneous publication.
| mistrial9 wrote:
| 1986 was very early days for modern operating systems. I
| suspect that the hardware environment (low resource) would be
| surprising to many phone owners here. So the efficacy had to do
| with the tools' low-level bindings, too
| drfuchs wrote:
| Indeed. In 1986, Knuth was exclusively using the SAIL DEC10,
| running the WAITS OS. This environment limited him to
| approximately half a megabyte of data space (heap plus stack
| plus globals), and half a megabyte of code space. All
| statically linked, so all library and runtime stuff had to
| fit.
|
| Also of note is that most CPUs of the day didn't execute
| instructions orders of magnitude faster than bandwidth to
| main memory, as is the case today; it was early days for
| worrying about making algorithms "cache aware". So, for
| instance, there was no big penalty for "pointer chasing" in
| linked lists as there is today. Additionally, the severe
| memory size limits affected various time vs. space tradeoffs
| in algorithm design.
| mistrial9 wrote:
| And was it a compiled language at all, versus a script
| language of some kind?
| Jtsummers wrote:
| Is what a compiled language? Knuth's program in the
| Programming Pearls article? He wrote it in Pascal, so not
| a scripting language.
| [deleted]
| [deleted]
| mustache_kimono wrote:
| ... (2021). Great read and fun project. See the discussion from
| when it was originally posted by the author.[0]
|
| [0]: https://news.ycombinator.com/item?id=26463967
| claytongulick wrote:
| My takeaway is that unoptimized C++ is pretty close to the same
| performance as JavaScript.
|
| Wild.
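For readers who haven't seen it, the shape of McIlroy's famous pipeline (tr to split and lowercase, sort, uniq -c, sort -rn) can be transcribed step by step into Python. A sketch for illustration only — `mcilroy_count` is a name invented here, and this is a rendering of the pipeline's structure, not McIlroy's actual code:

```python
# Each stage of the classic tr | tr | sort | uniq -c | sort -rn
# pipeline mapped to a Python equivalent.
import re
from collections import Counter

def mcilroy_count(text, top_n=None):
    # tr -cs A-Za-z '\n' + tr A-Z a-z : one lowercase word per line
    words = re.findall(r"[a-z]+", text.lower())
    # sort | uniq -c : group identical words and count them
    counts = Counter(words)
    # sort -rn (| sed ${1}q) : most frequent first, optionally truncated
    return counts.most_common(top_n)
```

For example, `mcilroy_count("The quick brown fox and the dog", top_n=1)` returns `[("the", 2)]`.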
| igouy wrote:
| _fwiw_ https://benchmarksgame-
| team.pages.debian.net/benchmarksgame/...
| [deleted]
| bsaul wrote:
| Surprised to see Swift 4 times as slow as non-optimized Go.
|
| Anyone have an explanation?
|
| I know the Swift string type is really complex, but I always
| assumed it was at least performing well..
| igouy wrote:
| _fwiw_ https://benchmarksgame-
| team.pages.debian.net/benchmarksgame/...
| msbarnett wrote:
| Reading anything at all into the relative performance of these
| programs, written by different people and all doing radically
| different things, seems like a gigantic waste of time.
|
| Some of these "simple" versions are custom-specifying
| allocators to minimize memory overhead, doing highly
| optimizable ASCII value manipulations, and doing byte-equality
| on strings rather than going through slower Unicode libraries.
| The "simple" Swift version is doing nothing at all to avoid
| generating many allocations/deallocations, and it runs fully
| Unicode-aware algorithms to lower the input and perform
| Unicode-normalized string comparisons. It's totally
| apples-to-oranges stuff.
| bsaul wrote:
| I was wondering if this was due to Unicode indeed. I had
| hoped that the benchmarks required languages to perform the
| same kinds of string checks, otherwise there's really no
| point.
| lrem wrote:
| This site makes Chrome (Android, current) stop responding.
| mellosouls wrote:
| Yeah, mine too. Ironic considering the subject...
| yakubin wrote:
| I like the OCaml version. It's both very easy to read and comes
| out really well in the benchmark.
|
| I also looked at the "simple" Zig version, which came out really
| well in the benchmark, and to me it didn't look simple at all. It
| seems you need to make a lot of low-level details explicit.
|
| But IMHO AWK takes the crown here. :)
| OskarS wrote:
| > But IMHO AWK takes the crown here. :)
|
| Agreed! AWK is still the king of this stuff.
| For tasks like
| this, I kinda think AWK is nigh-unbeatable: so simple to write,
| so obvious what's going on (even if you've never seen an AWK
| program before, you're probably going to be able to figure out
| what's going on there), and decently performant.
|
| AWK is the bee's knees.
| moffkalast wrote:
| On the other hand, I'm rather surprised that the Go version
| is almost as long and opaque as C.
| benhoyt wrote:
| Are you looking at the "simple" or the "optimized"
| versions? For the optimized, yes, the Go one is very
| similar to the C. For the simple, idiomatic version, the Go
| version [1] is much simpler than the C one [2]: 40 very
| straightforward LoC vs 93 rather more complex ones,
| including pointer arithmetic, tricky manual memory
| management, and so on.
|
| [1] https://github.com/benhoyt/countwords/blob/c66dd01d868aa83dc...
| [2] https://github.com/benhoyt/countwords/blob/c66dd01d868aa83dc...
| tannhaeuser wrote:
| TFA only talks about gawk, though, making this comparison
| meaningless, i.e. you can't benchmark _languages_ but only
| _language implementations_. The same goes for other programming
| languages, of course, unless those have only a single
| implementation.
|
| mawk is an order of magnitude faster than gawk, and gawk isn't
| even the default on many Linuxen.
| benhoyt wrote:
| > TFA only talks about gawk, though, making this comparison
| meaningless
|
| Not exactly! To quote from TFA (I'm the author):
|
| > Another "optimization" is to run it using mawk, a faster
| AWK interpreter than gawk. In this case it's about 1.7 times
| as fast as gawk -b. I'm using mawk in the benchmarks for the
| optimized version.
| [deleted]
| osigurdson wrote:
| C# really needs a GetOrAdd method on the regular Dictionary class
| (as exists on ConcurrentDictionary). This common pattern (from
| the optimized version) requires the hash function on "word" to be
| evaluated twice (once for TryGetValue and once for Add).
| Then, if
| there is a collision on the 32-bit hash, the equality check
| needs to be run twice as well. Not good, as all of these are O(N)
| operations (where N is the size of "word"). I suspect the
| optimized version would be materially faster if they implemented
| their own dictionary that includes this.
|
|     if (!counts.TryGetValue(word, out var wordCountRef))
|     {
|         counts.Add(word, new Ref<int>(1));
|     }
| nr2x wrote:
| Some of these efficiency gains are remarkable. Most of the code I
| write is just the quickest way to write a mundane function.
| Really curious what the energy savings would be if you went into
| a huge tech co and devoted ~10% of SWE time solely to
| optimization of the code base.
| jeffbee wrote:
| You can't walk into a huge tech co and make a dent, because huge
| tech co already has hundreds of engineers on that job full
| time. Just look at the papers coming out of huge tech cos.
| They will publish if their technique saves them 0.2%. That's
| how tight things are already wired in the big leagues.
|
| The real money is in huge non-tech co, or medium tech co. You
| could walk into a Goldman Sachs type of operation and point out
| savings in CPU time that would be 90-99%. But they're not going
| to care, because their corporate structures aren't wired to
| observe and act on such things.
| anonu wrote:
| q solution in kdb; runs in about 2.5s on my machine vs simple.py,
| which takes 4.7s. Rough interpolation with the results table puts
| this on par with Rust/Go/OCaml and other compiled languages.
|
|     \t desc count each group `$lower " " vs raze read0 `:kjvbible_x10.txt
|
| It's also 1 line!
|
| I am sure q pros or k purists can optimize this even more...
|
| EDIT: Moving the lower earlier brings this down to 1.9s:
|
|     desc count each group `$ " " vs lower raze read0 `:kjvbible_x10.txt
| gary17the wrote:
| Wow. Swift, touted as "safe by design and (...) runs lightning-
| fast"[1], is more of a screw-up than I thought.
Almost twice as slow as Lua and behind even Pascal and Forth.
 |
 | [1] https://developer.apple.com/swift/
 | sk0g wrote:
 | It's a single implementation, I wouldn't put too much weight on
 | it. I can definitely write a C++ implementation that's slower
 | and less optimised than most of those results, particularly if
 | I was using Boost.
 |
 | iPhone and Mac apps run pretty well in my experience, and Swift
 | is definitely faster (in general) than Python at least. There
 | were some serious considerations to port over ML libraries to
 | Swift due to its ease of use, similar to Python, while providing
 | much better execution speed.
 | fractalb wrote:
 | > There were some serious considerations to port over ML
 | libraries to Swift due to its ease of use, similar to Python,
 | while providing much better execution speed.
 |
 | Google abandoned that Swift work (if that's what you're
 | referring to)
 | PossiblyKyle wrote:
 | Google and abandoning things, name a more iconic duo
 | tialaramex wrote:
 | I don't know that I'd characterise it as a "screw up" based on
 | one random person's solution to one particular problem.
 |
 | In particular the "simple" solutions, which is all that's
 | offered for Swift, will be whatever was most idiomatic/obvious
 | to a programmer of potentially quite variable quality.
 |
 | It's unfortunate, I think, that there are "simple" solutions for
 | some languages which have had performance revisions. If you
 | needed an hour, or a friend, or even a community to "hint" how
 | to improve it, that wasn't the simple solution.
 |
 | [ For example I feel like the Rust might be faster asking for
 | ASCII lowercase, and using the unstable (not order-preserving)
 | sort algorithm, but even if I'm correct such a change would
 | logically be made to the optimised Rust, not the simple ]
 | mayoff wrote:
 | Note that only an unoptimized Swift solution was provided, and
 | Swift String operations (like `lowercased`) are fully
 | Unicode-compliant.
| machinekob wrote:
 | Swift is in the same or better performance bracket than C#/Java,
 | and unsafe usage can be close to the performance of C (maybe
 | 1.3-2x slower). I'm not sure about ranting over poor single test
 | results :P
 | metaltyphoon wrote:
 | The optimized version of C# is not really optimized, as there
 | are things left on the table that could bring it much closer to
 | C++.
 | gernb wrote:
 | Lattner himself (creator of Swift) said Swift had speed issues
 | because of ARC that they were hoping to solve with better code
 | analysis. Did they?
 | [deleted]
 | georgelyon wrote:
 | They have been improving this steadily (I would argue it is
 | rarely an issue nowadays, and there have always been simple
 | workarounds), but there's no reason a solution to the problem
 | mentioned in the article should need to use ARC; it should use
 | structs/protocols instead (which don't use reference counting).
 | I've seen a lot of benchmarks that rate Swift poorly because of
 | this.
 | guenthert wrote:
 | No need for hand-waving, there are publicly available benchmark
 | results:
 | https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
 |
 | It's not the only benchmark in town and it needs to be taken
 | with a grain of salt, but so does every benchmark.
 | bhauer wrote:
 | I only skimmed the article, but I suspect they are including the
 | time necessary to spin up the executable and any runtime
 | environment prior to executing the relevant code.
 |
 | "Lightning fast" can mean a lot of things, and it might _not_
 | mean "executables start quickly."
 | ijidak wrote:
 | > We usually think of I/O as expensive, but I/O isn't the
 | bottleneck here... The tokenization and hash table operations
 | are the bottleneck by a long shot
 |
 | Interesting. This is something I'll have to keep in mind.
 |
 | Best practice has always been to consider I/O the slowest.
 |
 | But, it's true... times have changed. Maybe we shouldn't make
 | this assumption anymore.
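One way to sanity-check the article's claim is to time a plain byte scan and the tokenize-plus-count stage separately over an in-memory buffer, so disk speed cannot be the confound. A rough, illustrative Go sketch (the corpus and repeat count are made up for the example; exact timings will vary by machine):

```go
package main

import (
	"bytes"
	"fmt"
	"time"
)

// tokenizeAndCount splits the corpus on whitespace and tallies
// words in a hash map, returning total and unique word counts.
func tokenizeAndCount(corpus []byte) (total, unique int) {
	counts := make(map[string]int)
	for _, w := range bytes.Fields(corpus) {
		counts[string(w)]++
		total++
	}
	return total, len(counts)
}

func main() {
	// In-memory corpus so we measure CPU work, not disk latency.
	corpus := bytes.Repeat([]byte("the quick brown fox jumps over the lazy dog "), 200000)

	// Stage 1: a plain pass over the bytes, a stand-in for the
	// "I/O-shaped" part of the work.
	start := time.Now()
	spaces := 0
	for _, c := range corpus {
		if c == ' ' {
			spaces++
		}
	}
	scanTime := time.Since(start)

	// Stage 2: tokenizing and hashing, which the article found
	// dominates the runtime.
	start = time.Now()
	total, unique := tokenizeAndCount(corpus)
	countTime := time.Since(start)

	fmt.Printf("scan: %v (%d spaces), tokenize+count: %v (%d words, %d unique)\n",
		scanTime, spaces, countTime, total, unique)
}
```

On typical hardware the second stage takes far longer than the first, which is the article's point: once the file is in memory (or coming off a fast SSD), hashing and tokenization dominate.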
| FridgeSeal wrote: | Here's a couple of articles on the performance of IO and | slowness of other stuff that I find particularly interesting: | | https://itnext.io/modern-storage-is-plenty-fast-it-is-the-ap... | | https://gregoryszorc.com/blog/2021/04/06/surprisingly-slow/ | svat wrote: | See also a different comparison (on ~the same problem) at | StackExchange: | https://codegolf.stackexchange.com/questions/188133/bentleys... | | (Copying from my comment the last time this was posted: | https://news.ycombinator.com/item?id=26467684) | | There's also a nice book "Exercises in Programming Style" about | just this problem. | igouy wrote: | See also | | http://www.bagley.org/~doug/shootout/bench/wordfreq/ | compumike wrote: | How about extending this to comparing programming language | efficiency from a developer's standpoint, looking at bytes of | code, versus the execution time? | | I took the source code file size from the repository, and the | runtime from the blog post. | | Then I made an arbitrary overall "PAIN SCORE" (lower is better) | by multiplying code size * runtime. I suggest this is a | worthwhile metric simply because lower is better on both axes, | but of course, in the "real world" there will be different | economic costs to CPU time and developer time depending on the | use case. 
Here's the sorted results, from least "pain" to most:
 |
 |     LANGUAGE    FILENAME       CODE SIZE     RUNTIME   PAIN SCORE (LOWER IS BETTER)
 |     Shell       optimized.sh     75 bytes     1.83 s     137.25
 |     Crystal     simple.cr       240 bytes     1.29 s     309.6
 |     Nim         simple.nim      424 bytes     0.77 s     326.48
 |     Python      simple.py       208 bytes     2.21 s     459.68
 |     Ruby        simple.rb       175 bytes     3.17 s     554.75
 |     Go          optimized.go   1514 bytes     0.40 s     605.6
 |     Python      optimized.py    464 bytes     1.33 s     617.12
 |     Zig         optimized.zig  2688 bytes     0.24 s     645.12
 |     Zig         simple.zig     1394 bytes     0.55 s     766.7
 |     Go          simple.go       688 bytes     1.12 s     770.56
 |     Nim         optimized.nim  1683 bytes     0.49 s     824.67
 |     Shell       simple.sh        60 bytes    14.81 s     888.6
 |     Ruby        optimized.rb    401 bytes     2.47 s     990.47
 |     JavaScript  simple.js       532 bytes     1.88 s    1000.16
 |     C           optimized.c    4360 bytes     0.23 s    1002.80
 |     Rust        optimized.rs   3065 bytes     0.43 s    1317.95
 |     Swift       simple.swift    317 bytes     4.23 s    1340.91
 |     JavaScript  optimized.js   1501 bytes     1.10 s    1651.1
 |     C           simple.c       2735 bytes     0.96 s    2625.6
 |     Rust        simple.rs      2239 bytes     1.38 s    3089.82
 |
 | Sorting only by code size, the most concise implementations are:
 | Shell, Ruby, Python, Crystal. Nobody was aiming to play code
 | golf (i.e. minimize source code size), so these are fairly
 | straightforward, idiomatic, readable implementations.
 |
 | I am definitely a Crystal fan, in fact this afternoon I'm
 | continuing to implement suggestions from my recent Show HN
 | https://news.ycombinator.com/item?id=32081943 comments. :)
 | adsharma wrote:
 | It'd be interesting to write code in a concise, easy to
 | learn/grok language and then transpile to a faster systems
 | language and recompute the PAIN score. I spent a few months last
 | year pursuing that approach (py2many on github).
 |
 | We'd need a stdlib that works across languages for the approach
 | to be successful. nimpylib and similar libraries in other
 | languages are a step in that direction.
 |
 | Alternatively, the python stdlib itself could be rewritten in
 | python and transpiled to other languages.
Perhaps it'd help alternative python implementations in terms of
 | compatibility.
 | jandrese wrote:
 | I would expect the AWK version to have very low pain despite its
 | reputation. This is the sort of problem that AWK is well
 | optimized for.
 | bestinterest wrote:
 | Also a Crystal fan, but from afar. The compile times are a real
 | killer of productivity.
 | igouy wrote:
 | _fwiw_
 | https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
 |
 | https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
 | optionalsquid wrote:
 | It would probably be prudent to strip comments before measuring
 | the code sizes:
 |
 | I did not check all files, but noticed that the simple Rust code
 | consists of about 60% comments (in terms of bytes), mostly due
 | to a verbose blurb. It also spends about 15% of the non-comment
 | bytes on printing nicer error messages, which is an interesting
 | choice for a micro-benchmark. The simple C and FORTH versions
 | similarly have a lot of comments.
 |
 | Meanwhile a lot of the other files have very few or no comments.
 | djhworld wrote:
 | EDIT: I think the use of the term "allocation" might be
 | overloaded here from the original article. After thinking about
 | this a bit more, I think the author is referring to a different
 | way of accessing the map.
If you look in godbolt https://godbolt.org/z/Yf64rodW6 you can
 | see that the pointer version uses
 |
 |     CALL runtime.mapaccess2_fast64(SB)
 |
 | whereas the 'raw' version uses
 |
 |     CALL runtime.mapassign_fast64(SB)
 |
 | When reading the Go solution this bit stood out to me:
 |
 | > To reduce the allocations, we'll use a map[string]*int instead
 | of map[string]int so we only have to allocate once per unique
 | word, instead of for every increment
 |
 | I just tried benchmarking this with a simple setup and I get
 | zero allocations for both approaches, although the "pointer"
 | approach is _slightly_ faster:
 |
 |     package main
 |
 |     import (
 |         "testing"
 |     )
 |
 |     func BenchmarkIncrementMapRawInt(b *testing.B) {
 |         var data = make(map[int]int)
 |         b.ResetTimer()
 |         for i := 0; i < b.N; i++ {
 |             key := i % 1000
 |             data[key]++
 |         }
 |     }
 |
 |     func BenchmarkIncrementMapPointerInt(b *testing.B) {
 |         var data = make(map[int]*int)
 |         b.ResetTimer()
 |         for i := 0; i < b.N; i++ {
 |             key := i % 1000
 |             increment(data, key)
 |         }
 |     }
 |
 |     func increment(counts map[int]*int, value int) {
 |         if p, ok := counts[value]; ok {
 |             *p++
 |             return
 |         }
 |         n := 1
 |         counts[value] = &n
 |     }
 |
 |     $ go test -bench=. -benchmem .
 |     goos: darwin
 |     goarch: arm64
 |     BenchmarkIncrementMapRawInt-8        112288002    10.57 ns/op    0 B/op    0 allocs/op
 |     BenchmarkIncrementMapPointerInt-8    139728302    8.586 ns/op    0 B/op    0 allocs/op
 | dan-robertson wrote:
 | Presumably this would allocate if ints weren't "primitive", e.g.
 | if you had 128-bit ints implemented as a tuple of two ints.
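For string keys, which is what the article's program actually uses, the pattern the quoted passage describes looks roughly like this; a minimal sketch of the idea, not the repo's exact code:

```go
package main

import (
	"fmt"
	"strings"
)

// countWords tallies words with map[string]*int: the counter for
// each unique word is allocated once, on first sight, and later
// increments go through the stored pointer instead of a second
// map assignment.
func countWords(text string) map[string]*int {
	counts := make(map[string]*int)
	for _, w := range strings.Fields(strings.ToLower(text)) {
		if p, ok := counts[w]; ok {
			*p++ // hot path: one map lookup, no assignment
			continue
		}
		n := 1
		counts[w] = &n // cold path: first occurrence of the word
	}
	return counts
}

func main() {
	counts := countWords("The quick the lazy THE")
	fmt.Println(*counts["the"]) // prints 3
}
```

With string keys the win is that the increment path does a single `mapaccess` instead of an access followed by an assignment, at the cost of one small heap allocation per unique word.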
| HarHarVeryFunny wrote:
 | This is a rather meaningless comparison, since the differences
 | are going to be dominated by:
 |
 | 1) The choice of libraries/datatypes used for strings and the
 | word->count map
 |
 | 2) How the source file is split into words - probably a library
 | function again, although in C/C++ one _could_ choose to
 | implement a super-optimized low-level version that would blow
 | the others away
 |
 | IMO a performance comparison between languages is only
 | meaningful if the runtime is not dominated by library functions,
 | or if one admits it's really a standard library performance
 | comparison, not a language comparison.
 | kitd wrote:
 | I understand what you're saying. However, if you look at it not
 | from the point of view of the performance figures _per se_, but
 | rather which language, when written idiomatically, can solve a
 | realistic problem performantly, then I think it makes more
 | sense.
 | ashvardanian wrote:
 | Agreed. It's a common desire to boil benchmarks down to a single
 | number.
 |
 | Some of those languages, however, are far more complex than the
 | others. And like any complex tool, they require a certain
 | methodology. If you are to use heavy machinery like C++, forget
 | iostreams and character-level processing on your hot paths.
 | oconnor663 wrote:
 | Do you have a particular example problem in mind that might
 | demonstrate language performance better? One of the Benchmark
 | Game programs maybe?
 | habibur wrote:
 | You need to measure memory usage as well, take the product of
 | the two (memory and speed), and thus get a very good number on
 | performance for comparison.
 | [deleted]
 ___________________________________________________________________
 (page generated 2022-07-24 23:00 UTC)