[HN Gopher] Lomuto's Comeback ___________________________________________________________________ Lomuto's Comeback Author : VHRanger Score : 190 points Date : 2020-05-17 15:27 UTC (7 hours ago) (HTM) web link (dlang.org) (TXT) w3m dump (dlang.org) | codeflo wrote: | I've read that some vector instruction sets use masking | instructions to avoid branching. You execute both arms of the | conditional and discard the unwanted computation, which can be | cheaper than an X% chance of a branch misprediction. | | Should CPUs include more masking instructions for regular, non- | vectorized code as well? | tom_mellior wrote: | 32-bit ARM assembly language provided conditional execution for | most instructions. This was dropped in the 64-bit instruction | set. (https://en.wikipedia.org/wiki/Predication_(computer_archi | tec..., https://en.wikipedia.org/wiki/ARM_architecture#64-bit) | I imagine the architects had a clear picture of the advantages | and disadvantages, and made a very well-informed decision. I | guess that part of the reason might have been that it's not | clear when compilers should emit predicated code instead of | branches. | | Also, you can get something very similar even for scalar code | as long as your machine has a conditional move operation, which | they all have: /* true path */ a = | ... b = ... true_result = ... /* false | path */ x = ... y = ... false_result = | ... /* final result */ result = (condition ? | true_result : false_result) | | This works if the two "branches" have no side effects (memory | writes, function calls) and cannot raise exceptions. Again, | it's very difficult to estimate for compilers (and humans) if | it will pay off to run both computations rather than branch. | | The trade-offs for vectorization are different from scalar code | since vectorizing a loop by a factor N gives a huge win, and | even if you waste some time doing some redundant computations, | you still have a reasonable chance of being faster than scalar. | ncmncm wrote: | You can get close to the performance of a cmov instruction by | generating a pair of all-1s and all-0s values: a=c, b=c-1, | and a result z=a&x|b&y. | | Gcc will not under any circumstances produce two cmov | instructions in a basic block, so that is your only | alternative without dropping to asm. Clang is happy to | produce two adjacent cmov instructions. Usually your ALUs are | not otherwise so engaged as to make the number of operations | involved costly. | banachtarski wrote: | I'm unconvinced because the instruction set sizes are already | becoming fairly bloated (increasing microcode translation cost | and i-cache pressure) but also because CPUs already do | speculative execution which has a nearly equivalent effect to | the idiom you're referring to. | | *edit: wording | MauranKilom wrote: | To be more precise, _some_ CPUs do the "both sides of the | conditional" execution (EDIT This link is incorrect! Refer to | "Eager execution" in the last link below. /EDIT | https://en.wikipedia.org/wiki/Eager_evaluation) - years ago I | heard that mobile processors do so for power reasons. No idea | if that's still the case. | | The problem in question is due to predictive execution in | which the CPU has to flush the pipeline on misprediction. | This is dominant in desktop PC CPUs. I have no knowledge on | what is prevalent in e.g. server or mobile CPUs. | | Both are (according to | https://en.wikipedia.org/wiki/Speculative_execution) forms of | speculative execution. | banachtarski wrote: | I'm speaking broadly about desktop/console CPUs which I | have low-level familiarity with. I don't know of any CPUs | in this class that don't perform some form of speculative | execution. | | I'm not sure if eager evaluation is related to the topic at | hand and was speaking specifically about speculative | execution. Eager evaluation is just what pretty much every | language except Haskell (or Haskell-like) does. | MauranKilom wrote: | Sorry, that first link was nonsense. I just copied the | wiki link under "Eager execution" in the Speculative | Execution page, but that is of course far from the same | as "eager evaluation" (which is what the link goes to). I | apologized for the confusion. | | The point is that there are at least two ways CPUs can | speculatively execute a conditional: Either predict which | branch is taken and flush the pipeline if the prediction | was wrong, or execute both arms of the conditional and | discard the one that was not actually taken. The top- | level SIMD comment is about the latter, but most | optimization around speculative execution is for the | former. | | Yes, CPUs do speculative execution. The "flush pipeline | on mispredict" kind ("predictive execution"). That is | _not_ the same as the "execute both and discard the | untaken one" kind ("eager execution"). Your first reply | suggests you consider them equivalent when they are not. | I just wanted to clear that up. | banachtarski wrote: | Ah OK I understand now thanks for clearing that up. I | meant that the end effect was similar but not so much | that the mechanism was identical. | zengid wrote: | I love this sort of "Historical Fiction" where interactions and | conversations between The Great People of Science are imagined | and portrayed. A great example of this is Louisa Gilder's | wonderful book | _The_Age_Of_Entanglement:_When_Quantum_Physics_was_Reborn | | https://www.amazon.com/Age-Entanglement-Quantum-Physics-Rebo... | turndown wrote: | If you enjoy books like that, I recommend The Cambridge | Quintet[0]; it's a wonderful book. | | 0. https://www.amazon.com/Cambridge-Quintet-Scientific- | Speculat... | gumby wrote: | Thanks for writing his because I _hated_ it -- I read that par | thinking it would shed some light on the topic of the paper at | large, but it was just noise to me. | | Which is _not_ to imply that I am right and you wrong or vice | versa! Simply to observe that people are different. | | As I liked the article but was annoyed by that beginning, it's | great for me that the first comment I saw was yours. So thanks | for making it. | mhh__ wrote: | Anthony Zee's Quantum Field Theory textbook takes a similar | tone. | mwcampbell wrote: | Oh, I assumed that dialogue between Lomuto and Hoare was real | -- at least the gist of it, if not the exact words. | zengid wrote: | The interaction was probably real, since there is a date and | location, but the dialog (and certainly the narration) seems | to be taking some artistic liberties perhaps? This is also | what Louisa did in her book on the history of quantum | mechanics; real interactions with imagined dialogs. My | apologies if I gave the wrong impression. | ipsofacto wrote: | There's an even better branch-free (super scalar) sorting | algorithm: "In-place Parallel Super Scalar Samplesort (IPS4o)" | which we started using: | | https://github.com/SaschaWitt/ips4o | | https://arxiv.org/abs/1705.02257 | | As an example, to sort 10 million random longs on my computer it | takes std::sort 766 ms (roughly in line with Andrei's numbers) | and ips4o::sort takes 274 ms. | | [edit:formatting] | phkahler wrote: | 2x-3x improvement doesn't seem like much for something that | runs in parallel. | ipsofacto wrote: | The 274 ms is for the sequential version of the algorithm (on | my i7-9750 laptop). | | On my office Xeon E5-2690 machine when using multiple threads | the runtime decreases like this | | 840 ms for std::sort | | 372 ms for IPS4o sequentially | | 201 ms for 2 threads | | 104 ms for 4 threads | | 53 ms for 8 threads | | 33 ms for 16 threads | goldenkey wrote: | You might want to mention if are affiliated with the algorithm | you are promoting. ipsofacto ~ ips4o | ComputerGuru wrote: | Or maybe the throwaway was named after the comment it was | intended to be used to post. | ipsofacto wrote: | No affiliation, just picked the name because it fit... The | company I work at was exploring various sort algorithms to | use and this turned out to be the fastest. | | Another benefit of this algorithm is that it is parallizes | extremely well. | ncmncm wrote: | It is true that more swaps may be cheaper than fewer swaps. | | The operations to count, nowadays, are not swaps or comparisons, | which are very cheap, but instead pipeline stalls, which are | very, very expensive. | | It is easy to better-than-double the speed of vanilla quicksort | with a three-line change in the partition step. | cozzyd wrote: | Writing branch-free code (which I find myself sometimes doing to | take advantage of simd vectorization) is always really painful. I | wonder if someone has made an attempt at better tooling for this. | repsilat wrote: | One thing people have made is array languages. The "primitive" | operations apply to arrays of data, so branches around | individual elements can't be expressed. | | I think it's also generally less expressive, though. Writing | Lomuto's partition as efficiently would likely be impossible, | but writing it branch-free might not be -- you would compose | branch-free primitives like left = | filter(array, array < array[0]) right = filter(array, | array > array[0]) | | where `filter` is branch-free and takes an array of Boolean | values as its second argument. | | Shrugs, aside from some standard set of functions/primitives | and idioms I'm not sure it would help much, though perhaps a | mindset-shift is a more likely solution than something more | technical. | VHRanger wrote: | This sort of vectorised pseudo branching is often used in | SIMD algorithms | cozzyd wrote: | Yes... and the transformations are often non-trivial | (especially when considering the possibility of out-of- | bounds elements, which can't just safely be multiplied by | zero--they might be a NaN, or on the edge of a page). It | would be nice if there was a tool (or compiler option!) | that could convert my branchy code into an "identical" | (assuming floating-point associativity, etc.) branchless | version. | snovv_crash wrote: | The Eigen Array [1] object is fairly close to this. | Combining the 'select' with comparisons, chained | operators etc. Eigen vectorized stuff pretty effectively | too. | | 1. https://eigen.tuxfamily.org/dox/group__TutorialArrayCl | ass.ht... | klyrs wrote: | I tend to use the same method as the author has presented. | | 1. Write your branchy code | | 2. Massage the branches until their code is identical (testing | frequently) | | 3. Trim a useless branch, and eat a piece of chocolate | eqvinox wrote: | So what's the impact of the data dependency chain on "first -= | smaller" to the next iteration? | | (I guess at the very least it prevents vectorization, but other | than that ... shouldn't be too much, but probably still the new | limiting factor?) | eloff wrote: | If you have a branch free algorithm, you can vectorize it. If he | discussed implementing this with SIMD instructions (whether by | human hand or compiler auto vectorization) I missed that part. | But that seems an interesting angle to me. | kccqzy wrote: | Not necessarily. A branch-free algorithm can totally have a | data dependency from the previous iteration of the loop to the | next, making it impossible to vectorize. | eloff wrote: | That's true, although there are sometimes workarounds for | that by modifying the algorithm to iterate over a small | groups where there is still a data-dependency between | iterations, but not within each group, allowing one to use | SIMD. e.g. instead of delta compression over an array of | integers where you subtract each integer from the prior one, | you can subtract each group from the prior one with only a | small loss of compression ratio. | | The question is, is there such a dependency in this | algorithm? I didn't check. | klyrs wrote: | Yes, sorting typically (and this one specifically) has long | dependency chains. The natural reflex is to split the | chains across multiple workers... and that's how you get | merge-sort. But even the last pass has a linear dependence | chain. Not long ago, somebody posted a clever algorithm | based on 4-way swaps, which is a clever way of addressing | the issue. | | Edit: I can't seem to locate that 4-way swapping algorithm, | I swear it was posted to HN in the last year... somebody | halp! | kccqzy wrote: | You might be thinking of this: | https://news.ycombinator.com/item?id=22322967 | klyrs wrote: | Bingo, thanks! | mehrdadn wrote: | Notice that this is on _random longs_. Of _course_ branch | misprediction and memory bandwidth is going to crush you for a | branchy sort... you 'll be wrong like half the time, and the | comparisons and swapping are trivial! Real world data isn't | random, so I'd expect branch predictors to do much better with | recognizing patterns. And your sorting isn't going to be as | simple as sorting integers according to their values all that | often. It's a neat exercise, but absent more realistic | benchmarking to the contrary, I wouldn't assume I'll get a faster | program by just substituting even a typical quicksort with | this... | sesuximo wrote: | Sorting is the case study but not the insight. I think the | cool/surprising part of the article is that doing more work can | be faster... Even more memory work (which per conventional | wisdom is not trivial!) | phkahler wrote: | >> Sorting is the case study but not the insight. I think the | cool/surprising part of the article is that doing more work | can be faster | | Yeah, I figured heap sort should be the standard because it | has a worst case O(n*log(n)) complexity. But it does tend to | do more swaps (a smarter implementation can avoid many | writes), and the memory access is very cache-unfriendly. In | practice it's just not used. | mehrdadn wrote: | My comment wasn't inherently about sorting either. Branch- | free in general is faster in the _worst_ cases, i.e. if the | branch is difficult to predict. Except most real-world | branches are predictable... that 's why we have branch | predictors. So in general you should expect branch-free to be | slower, unless you have reason to assume your branches are | actually close to random. | derefr wrote: | > unless you have reason to assume your branches are | actually close to random. | | I mean, to the degree that you need to sort the data at | all, it contains informational entropy. | | Many "presents as sorted" data structures trade space for | time by keeping track of the internal sortedness of chunks | of the data, between sortation passes. Heck, any insert- | optimized data structure (B+ trees; LevelDB's sorted- | string-tables; N-ary tries generally) will get most of its | performance gains by being able to guarantee, at node split | time, that the children of any particular node are sorted | relative to one-another. | | So, in many cases, by the very fact that you don't already | have metadata attached to your data asserting that it's | sorted, it's highly likely to have no branch- | predictability. | | (Couple that to the fact that load-balancing in OLTP | distributed systems favors writes to evenly-keyspace- | distributed partitioning keys, ala Dynamo, and you'll | frequently find that your inputs to an index-like data | structure can be _guaranteed random_.) | cozzyd wrote: | Branches also can't be vectorized, so going branch free can | lead to a substantial speedup if you can use vector | instructions (you might be doing twice the work, but your | vector instruction may let you do it 4-16 times faster). | ncmncm wrote: | Indeed, AVX512 instructions can sort 16 int32 or well- | behaved float values optimally, with no branches. It is a | good final pass to any other algorithm. | repsilat wrote: | Maybe? | | Lots of people replaced their sorting algorithms with Tim-Sort | because runs of alrealdy-sorted data are empirically common (at | least in some workloads). I've seen it myself, in use-cases | where sort keys change slightly from iteration to iteration in | some larger algorithm, and sort items need to be dynamically | kept in order. It no doubt also happens when grabbing data from | multiple places, each of which is theoretically ordered | arbitrarily but is in practice ordered predictably. | | So yeah, the author picked the use-cases where this algorithm | is going to show in its best light. But that's not bad either, | right? Data is often _not_ well-ordered -- if you 're counting | occurrences of words, and want to get an ordered list (and | never mind that you wouldn't use a comparison-based sort for | that) you wouldn't expect much order. Likewise sorting any | keys/values from a decent hash table. | | One worry that _I_ would have applies specifically to C++ and | D: sometimes values are big, and copies are expensive. But I | guess users can always choose to sort pointers like other | languages do, if the standard library uses a copy-heavy, | branch-light algorithm. | TeMPOraL wrote: | I think one example of extremely unsorted data is _random | data_ in Monte Carlo simulations, in particular if you need | to compute statistical summaries that need the data to be | sorted - e.g. percentiles. So you may end up repeatedly | sorting million-long random arrays many time a second. | | Then again, maybe there are algorithms letting you compute | percentiles without sorting? | kccqzy wrote: | Extracting percentile information doesn't require sorting. | Even if you're too lazy to implement anything custom, a few | invocations of std::nth_element will likely outperform | sorting: | https://en.cppreference.com/w/cpp/algorithm/nth_element | [deleted] | cozzyd wrote: | Yeah, you can histogram the values if you just need an | approximate answer (or if you have few enough discrete | values, you can just have a bin per value). Once you have | an approximate answer, if you need a more precise answer, | you can go through the data again and only consider values | within the right bin (which hopefully will be a lot fewer, | if you've chosen your bins wisely) and then use something | like std::nth_element. | loeg wrote: | > Then again, maybe there are algorithms letting you | compute percentiles without sorting? | | I think you may be able to do so probabilistically, which | is often good enough? Additionally, one observation about | percentiles is that you don't need to sort all of the data; | you can just hold on to the top N results in something like | a Heap. | leni536 wrote: | If you need k percentile values then I think you could do | it in O(n log(k)) with introselect. You most definitely | don't need O(n log(n)). | gautamcgoel wrote: | Reminds me of this great article on writing a fast mergesort: | | http://pvk.ca/Blog/2012/08/13/engineering-a-list-merge-sort/ | Arcuru wrote: | Reminds me of this paper on binary search, which also shows a | speed up using branch-free comparisons. | | "Array Layouts for Comparison-Based Searching" | https://arxiv.org/pdf/1509.05053.pdf | nightcracker wrote: | Hoare partitioning can also be implemented in a branchless | manner, my algorithm pdqsort does this: | https://github.com/orlp/pdqsort | kleiba wrote: | To anyone interested in a deeper treatment of quicksort and | variants thereof, I can recommend Sebastian Wild's PhD thesis on | that topic: https://kluedo.ub.uni- | kl.de/frontdoor/deliver/index/docId/44... | bachmeier wrote: | Off topic: Is it common practice for a dissertation at a German | university to be written in English? | cozzyd wrote: | In physics it certainly is. As far as I know, much all | scientific conferences and journals are in English these | days. Even journals like JETP (https://en.wikipedia.org/wiki/ | Journal_of_Experimental_and_Th...) became joint English and | Russian in the 50's! | DaiPlusPlus wrote: | Yes - to ensure the widest possible audience. English is the | lingua-franca of academia and has been for almost a century | now. | kleiba wrote: | At least in MINT fields, probably less so in the | humanities. Correct me if I'm wrong. | atq2119 wrote: | You're right. German economics is famously insular, for | example. Part of that is cultural, part of it is that | they don't write as much in English. | kkaefer wrote: | Depends on the field, but it is very common in computer | science and generally in natural sciences. Studying computer | science, I don't think I ever even wrote as much as a paper | in German; everything's English. | kleiba wrote: | Hardly anyone is going to read, let alone cite your work if | you were to publish in German. I'm not even aware of any | German-speaking conferences. | Ericson2314 wrote: | I'm not sure I've ever sorted an array or list. Rather I put | things into sets or maps or built in index---I always wanted to | access things later, or have an on-line data structure (which a | sorted flat thing is not). | | This whole field is a waste of time, and anachronism from when | the master copy was paper/pre-computer and computers were just | doing the analytics. | [deleted] | ncmncm wrote: | Putting things in a map or set is almost invariably slower, | often much slower, than a smart sort. If performance doesn't | matter, then go ahead. Most often sort performance doesn't | matter, or anyway most of the time is spent elsewhere. People | do use Python or Bash in production, without shame. | | But some of us work on problems where performance of the sort | does matter. | ComputerGuru wrote: | Your dismissal of "sorted flat thing" as being necessarily not | online is incorrect or unnecessarily strict to your detriment. | | I wrote this comparing the high-level performance of a "set" | (AVL or red-black tree) against just that: | | Faster lookups, insertions, and in-order traversal than a red- | black or AVL tree [0] | | [0]: https://neosmart.net/blog/2019/sorted-list-vs-binary- | search-... | | (Spoiler: it depends on the nature of your data, but sorted | vector can be dramatically faster.) ___________________________________________________________________ (page generated 2020-05-17 23:00 UTC)