[HN Gopher] Lomuto's Comeback
       ___________________________________________________________________
        
       Lomuto's Comeback
        
       Author : VHRanger
       Score  : 190 points
       Date   : 2020-05-17 15:27 UTC (7 hours ago)
        
 (HTM) web link (dlang.org)
 (TXT) w3m dump (dlang.org)
        
       | codeflo wrote:
       | I've read that some vector instruction sets use masking
       | instructions to avoid branching. You execute both arms of the
       | conditional and discard the unwanted computation, which can be
       | cheaper than an X% chance of a branch misprediction.
       | 
       | Should CPUs include more masking instructions for regular, non-
       | vectorized code as well?
        
         | tom_mellior wrote:
         | 32-bit ARM assembly language provided conditional execution for
         | most instructions. This was dropped in the 64-bit instruction
         | set. (https://en.wikipedia.org/wiki/Predication_(computer_archi
         | tec..., https://en.wikipedia.org/wiki/ARM_architecture#64-bit)
         | I imagine the architects had a clear picture of the advantages
         | and disadvantages, and made a very well-informed decision. I
         | guess that part of the reason might have been that it's not
         | clear when compilers should emit predicated code instead of
         | branches.
         | 
         | Also, you can get something very similar even for scalar code
         | as long as your machine has a conditional move operation, which
          | they all have:
          | 
          |     /* true path */
          |     a = ...
          |     b = ...
          |     true_result = ...
          |     /* false path */
          |     x = ...
          |     y = ...
          |     false_result = ...
          |     /* final result */
          |     result = (condition ? true_result : false_result)
         | 
         | This works if the two "branches" have no side effects (memory
         | writes, function calls) and cannot raise exceptions. Again,
         | it's very difficult to estimate for compilers (and humans) if
         | it will pay off to run both computations rather than branch.
         | 
         | The trade-offs for vectorization are different from scalar code
         | since vectorizing a loop by a factor N gives a huge win, and
         | even if you waste some time doing some redundant computations,
         | you still have a reasonable chance of being faster than scalar.
        
           | ncmncm wrote:
            | You can get close to the performance of a cmov instruction
            | by generating a pair of all-1s and all-0s masks from a 0/1
            | condition c: a = -c, b = c - 1, and a result z = a&x | b&y.
           | 
           | Gcc will not under any circumstances produce two cmov
           | instructions in a basic block, so that is your only
           | alternative without dropping to asm. Clang is happy to
           | produce two adjacent cmov instructions. Usually your ALUs are
           | not otherwise so engaged as to make the number of operations
           | involved costly.
        
         | banachtarski wrote:
         | I'm unconvinced because the instruction set sizes are already
         | becoming fairly bloated (increasing microcode translation cost
         | and i-cache pressure) but also because CPUs already do
         | speculative execution which has a nearly equivalent effect to
         | the idiom you're referring to.
         | 
         | *edit: wording
        
           | MauranKilom wrote:
            | To be more precise, _some_ CPUs do the "both sides of the
           | conditional" execution (EDIT This link is incorrect! Refer to
           | "Eager execution" in the last link below. /EDIT
           | https://en.wikipedia.org/wiki/Eager_evaluation) - years ago I
           | heard that mobile processors do so for power reasons. No idea
           | if that's still the case.
           | 
           | The problem in question is due to predictive execution in
           | which the CPU has to flush the pipeline on misprediction.
           | This is dominant in desktop PC CPUs. I have no knowledge on
           | what is prevalent in e.g. server or mobile CPUs.
           | 
           | Both are (according to
           | https://en.wikipedia.org/wiki/Speculative_execution) forms of
           | speculative execution.
        
             | banachtarski wrote:
             | I'm speaking broadly about desktop/console CPUs which I
             | have low-level familiarity with. I don't know of any CPUs
             | in this class that don't perform some form of speculative
             | execution.
             | 
             | I'm not sure if eager evaluation is related to the topic at
             | hand and was speaking specifically about speculative
             | execution. Eager evaluation is just what pretty much every
             | language except Haskell (or Haskell-like) does.
        
               | MauranKilom wrote:
               | Sorry, that first link was nonsense. I just copied the
               | wiki link under "Eager execution" in the Speculative
               | Execution page, but that is of course far from the same
               | as "eager evaluation" (which is what the link goes to). I
                | apologize for the confusion.
               | 
               | The point is that there are at least two ways CPUs can
               | speculatively execute a conditional: Either predict which
               | branch is taken and flush the pipeline if the prediction
               | was wrong, or execute both arms of the conditional and
               | discard the one that was not actually taken. The top-
               | level SIMD comment is about the latter, but most
               | optimization around speculative execution is for the
               | former.
               | 
               | Yes, CPUs do speculative execution. The "flush pipeline
               | on mispredict" kind ("predictive execution"). That is
                | _not_ the same as the "execute both and discard the
               | untaken one" kind ("eager execution"). Your first reply
               | suggests you consider them equivalent when they are not.
               | I just wanted to clear that up.
        
               | banachtarski wrote:
               | Ah OK I understand now thanks for clearing that up. I
               | meant that the end effect was similar but not so much
               | that the mechanism was identical.
        
       | zengid wrote:
       | I love this sort of "Historical Fiction" where interactions and
       | conversations between The Great People of Science are imagined
       | and portrayed. A great example of this is Louisa Gilder's
       | wonderful book
        | _The Age of Entanglement: When Quantum Physics Was Reborn_
       | 
       | https://www.amazon.com/Age-Entanglement-Quantum-Physics-Rebo...
        
         | turndown wrote:
         | If you enjoy books like that, I recommend The Cambridge
         | Quintet[0]; it's a wonderful book.
         | 
         | 0. https://www.amazon.com/Cambridge-Quintet-Scientific-
         | Speculat...
        
         | gumby wrote:
          | Thanks for writing this because I _hated_ it -- I read that
          | part thinking it would shed some light on the topic of the
          | paper at large, but it was just noise to me.
         | 
         | Which is _not_ to imply that I am right and you wrong or vice
         | versa! Simply to observe that people are different.
         | 
         | As I liked the article but was annoyed by that beginning, it's
         | great for me that the first comment I saw was yours. So thanks
         | for making it.
        
         | mhh__ wrote:
         | Anthony Zee's Quantum Field Theory textbook takes a similar
         | tone.
        
         | mwcampbell wrote:
         | Oh, I assumed that dialogue between Lomuto and Hoare was real
         | -- at least the gist of it, if not the exact words.
        
           | zengid wrote:
           | The interaction was probably real, since there is a date and
           | location, but the dialog (and certainly the narration) seems
           | to be taking some artistic liberties perhaps? This is also
           | what Louisa did in her book on the history of quantum
           | mechanics; real interactions with imagined dialogs. My
           | apologies if I gave the wrong impression.
        
       | ipsofacto wrote:
       | There's an even better branch-free (super scalar) sorting
       | algorithm: "In-place Parallel Super Scalar Samplesort (IPS4o)"
       | which we started using:
       | 
       | https://github.com/SaschaWitt/ips4o
       | 
       | https://arxiv.org/abs/1705.02257
       | 
       | As an example, to sort 10 million random longs on my computer it
       | takes std::sort 766 ms (roughly in line with Andrei's numbers)
       | and ips4o::sort takes 274 ms.
       | 
       | [edit:formatting]
        
         | phkahler wrote:
         | 2x-3x improvement doesn't seem like much for something that
         | runs in parallel.
        
           | ipsofacto wrote:
           | The 274 ms is for the sequential version of the algorithm (on
           | my i7-9750 laptop).
           | 
           | On my office Xeon E5-2690 machine when using multiple threads
           | the runtime decreases like this
           | 
           | 840 ms for std::sort
           | 
           | 372 ms for IPS4o sequentially
           | 
           | 201 ms for 2 threads
           | 
           | 104 ms for 4 threads
           | 
           | 53 ms for 8 threads
           | 
           | 33 ms for 16 threads
        
         | goldenkey wrote:
          | You might want to mention if you are affiliated with the
          | algorithm you are promoting. ipsofacto ~ ips4o
        
           | ComputerGuru wrote:
            | Or maybe the throwaway was named after the comment it was
            | created to post.
        
             | ipsofacto wrote:
             | No affiliation, just picked the name because it fit... The
             | company I work at was exploring various sort algorithms to
             | use and this turned out to be the fastest.
             | 
              | Another benefit of this algorithm is that it parallelizes
              | extremely well.
        
       | ncmncm wrote:
       | It is true that more swaps may be cheaper than fewer swaps.
       | 
       | The operations to count, nowadays, are not swaps or comparisons,
       | which are very cheap, but instead pipeline stalls, which are
       | very, very expensive.
       | 
       | It is easy to better-than-double the speed of vanilla quicksort
       | with a three-line change in the partition step.
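For concreteness, here is a sketch of what such a branch-free partition step can look like, reconstructed in C++ in the spirit of the linked article (variable names and details are mine, not necessarily the author's exact code). The usual "if (x < pivot)" branch becomes arithmetic on a 0/-1 mask, so the loop body runs the same instructions every iteration:

```cpp
#include <cassert>
#include <utility>

// Branch-free Lomuto partition sketch. Returns a pointer to the pivot's
// final position; everything left of it is < pivot, everything right >= pivot.
long* lomuto_partition_branchfree(long* first, long* last) {
    assert(first <= last);
    if (last - first < 2) return first;
    --last;
    if (*first > *last) std::swap(*first, *last); // guarantees *last >= pivot
    long* pivot_pos = first;
    long pivot = *first;
    do {
        ++first;               // skip the initial run of smaller elements
    } while (*first < pivot);
    for (long* read = first + 1; read < last; ++read) {
        long x = *read;
        long smaller = -long(x < pivot);       // -1 if x < pivot, else 0
        long delta = smaller & (read - first); // swap distance, or 0
        first[delta] = *first;                 // both stores are no-ops
        read[-delta] = x;                      // when smaller == 0
        first -= smaller;                      // advance only if x < pivot
    }
    --first;
    *pivot_pos = *first;
    *first = pivot;
    return first;
}
```

The three data-dependent lines (computing `smaller`, `delta`, and `first -= smaller`) are the change in question; the stores execute unconditionally.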
        
       | cozzyd wrote:
       | Writing branch-free code (which I find myself sometimes doing to
       | take advantage of simd vectorization) is always really painful. I
       | wonder if someone has made an attempt at better tooling for this.
        
         | repsilat wrote:
         | One thing people have made is array languages. The "primitive"
         | operations apply to arrays of data, so branches around
         | individual elements can't be expressed.
         | 
         | I think it's also generally less expressive, though. Writing
         | Lomuto's partition as efficiently would likely be impossible,
         | but writing it branch-free might not be -- you would compose
          | branch-free primitives like
          | 
          |     left  = filter(array, array < array[0])
          |     right = filter(array, array > array[0])
          | 
          | where `filter` is branch-free and takes an array of Boolean
          | values as its second argument.
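One way such a `filter` primitive can itself be branch-free is the classic mask-driven compaction. A C++ sketch (`filter_masked` is a made-up name; note `out` must have room for n elements, since rejected slots get speculatively overwritten):

```cpp
#include <cstddef>

// Branch-free compaction: keeps in[i] whenever mask[i] is 1.
// The store is unconditional; the write cursor advances only when the
// mask says so, so there is no data-dependent branch to mispredict.
std::size_t filter_masked(const long* in, const unsigned char* mask,
                          std::size_t n, long* out) {
    std::size_t w = 0;
    for (std::size_t i = 0; i < n; ++i) {
        out[w] = in[i]; // always store; overwritten if not kept
        w += mask[i];   // advance iff element is selected
    }
    return w;
}
```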
         | 
          | Shrug -- aside from some standard set of functions/primitives
          | and idioms, I'm not sure it would help much; perhaps a
          | mindset shift is a more likely solution than something more
          | technical.
        
           | VHRanger wrote:
           | This sort of vectorised pseudo branching is often used in
           | SIMD algorithms
        
             | cozzyd wrote:
             | Yes... and the transformations are often non-trivial
             | (especially when considering the possibility of out-of-
             | bounds elements, which can't just safely be multiplied by
             | zero--they might be a NaN, or on the edge of a page). It
             | would be nice if there was a tool (or compiler option!)
             | that could convert my branchy code into an "identical"
             | (assuming floating-point associativity, etc.) branchless
             | version.
        
               | snovv_crash wrote:
               | The Eigen Array [1] object is fairly close to this.
               | Combining the 'select' with comparisons, chained
                | operators etc. Eigen vectorizes stuff pretty effectively
                | too.
               | 
               | 1. https://eigen.tuxfamily.org/dox/group__TutorialArrayCl
               | ass.ht...
        
         | klyrs wrote:
         | I tend to use the same method as the author has presented.
         | 
         | 1. Write your branchy code
         | 
         | 2. Massage the branches until their code is identical (testing
         | frequently)
         | 
         | 3. Trim a useless branch, and eat a piece of chocolate
        
       | eqvinox wrote:
        | So what's the impact of the data dependency chain from "first -=
        | smaller" to the next iteration?
       | 
       | (I guess at the very least it prevents vectorization, but other
       | than that ... shouldn't be too much, but probably still the new
       | limiting factor?)
        
       | eloff wrote:
        | If you have a branch-free algorithm, you can vectorize it. If he
       | discussed implementing this with SIMD instructions (whether by
       | human hand or compiler auto vectorization) I missed that part.
       | But that seems an interesting angle to me.
        
         | kccqzy wrote:
         | Not necessarily. A branch-free algorithm can totally have a
         | data dependency from the previous iteration of the loop to the
         | next, making it impossible to vectorize.
        
           | eloff wrote:
           | That's true, although there are sometimes workarounds for
           | that by modifying the algorithm to iterate over a small
           | groups where there is still a data-dependency between
           | iterations, but not within each group, allowing one to use
           | SIMD. e.g. instead of delta compression over an array of
           | integers where you subtract each integer from the prior one,
           | you can subtract each group from the prior one with only a
           | small loss of compression ratio.
           | 
           | The question is, is there such a dependency in this
           | algorithm? I didn't check.
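The group-wise delta idea described above might look like this in C++ (a sketch with made-up names; G = 4 is an arbitrary group size). Each delta depends only on the value G positions back, so the G subtractions within a group are independent and amenable to SIMD:

```cpp
#include <cstddef>
#include <vector>

// Delta-encode against the element G positions back instead of the
// immediately preceding one; the first G elements pass through verbatim.
std::vector<long> delta_encode_grouped(const std::vector<long>& v,
                                       std::size_t G = 4) {
    std::vector<long> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        out[i] = (i < G) ? v[i] : v[i] - v[i - G];
    return out;
}

// Inverse transform: reconstruct by adding back the value G positions back.
std::vector<long> delta_decode_grouped(const std::vector<long>& d,
                                       std::size_t G = 4) {
    std::vector<long> out(d.size());
    for (std::size_t i = 0; i < d.size(); ++i)
        out[i] = (i < G) ? d[i] : d[i] + out[i - G];
    return out;
}
```

The deltas are slightly larger than with stride 1, hence the small loss of compression ratio mentioned above.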
        
             | klyrs wrote:
             | Yes, sorting typically (and this one specifically) has long
             | dependency chains. The natural reflex is to split the
             | chains across multiple workers... and that's how you get
             | merge-sort. But even the last pass has a linear dependence
              | chain. Not long ago, somebody posted an algorithm based on
              | 4-way swaps, which is a clever way of addressing the
              | issue.
             | 
             | Edit: I can't seem to locate that 4-way swapping algorithm,
             | I swear it was posted to HN in the last year... somebody
             | halp!
        
               | kccqzy wrote:
               | You might be thinking of this:
               | https://news.ycombinator.com/item?id=22322967
        
               | klyrs wrote:
               | Bingo, thanks!
        
       | mehrdadn wrote:
       | Notice that this is on _random longs_. Of _course_ branch
        | misprediction and memory bandwidth are going to crush you for a
        | branchy sort... you'll be wrong like half the time, and the
       | comparisons and swapping are trivial! Real world data isn't
       | random, so I'd expect branch predictors to do much better with
       | recognizing patterns. And your sorting isn't going to be as
       | simple as sorting integers according to their values all that
       | often. It's a neat exercise, but absent more realistic
       | benchmarking to the contrary, I wouldn't assume I'll get a faster
       | program by just substituting even a typical quicksort with
       | this...
        
         | sesuximo wrote:
         | Sorting is the case study but not the insight. I think the
         | cool/surprising part of the article is that doing more work can
         | be faster... Even more memory work (which per conventional
         | wisdom is not trivial!)
        
           | phkahler wrote:
           | >> Sorting is the case study but not the insight. I think the
           | cool/surprising part of the article is that doing more work
           | can be faster
           | 
           | Yeah, I figured heap sort should be the standard because it
           | has a worst case O(n*log(n)) complexity. But it does tend to
           | do more swaps (a smarter implementation can avoid many
           | writes), and the memory access is very cache-unfriendly. In
           | practice it's just not used.
        
           | mehrdadn wrote:
           | My comment wasn't inherently about sorting either. Branch-
           | free in general is faster in the _worst_ cases, i.e. if the
           | branch is difficult to predict. Except most real-world
            | branches are predictable... that's why we have branch
           | predictors. So in general you should expect branch-free to be
           | slower, unless you have reason to assume your branches are
           | actually close to random.
        
             | derefr wrote:
             | > unless you have reason to assume your branches are
             | actually close to random.
             | 
             | I mean, to the degree that you need to sort the data at
             | all, it contains informational entropy.
             | 
             | Many "presents as sorted" data structures trade space for
             | time by keeping track of the internal sortedness of chunks
             | of the data, between sortation passes. Heck, any insert-
             | optimized data structure (B+ trees; LevelDB's sorted-
             | string-tables; N-ary tries generally) will get most of its
             | performance gains by being able to guarantee, at node split
             | time, that the children of any particular node are sorted
             | relative to one-another.
             | 
             | So, in many cases, by the very fact that you don't already
             | have metadata attached to your data asserting that it's
             | sorted, it's highly likely to have no branch-
             | predictability.
             | 
             | (Couple that to the fact that load-balancing in OLTP
             | distributed systems favors writes to evenly-keyspace-
             | distributed partitioning keys, ala Dynamo, and you'll
             | frequently find that your inputs to an index-like data
             | structure can be _guaranteed random_.)
        
             | cozzyd wrote:
             | Branches also can't be vectorized, so going branch free can
             | lead to a substantial speedup if you can use vector
             | instructions (you might be doing twice the work, but your
             | vector instruction may let you do it 4-16 times faster).
        
               | ncmncm wrote:
               | Indeed, AVX512 instructions can sort 16 int32 or well-
               | behaved float values optimally, with no branches. It is a
               | good final pass to any other algorithm.
        
         | repsilat wrote:
         | Maybe?
         | 
         | Lots of people replaced their sorting algorithms with Tim-Sort
          | because runs of already-sorted data are empirically common (at
         | least in some workloads). I've seen it myself, in use-cases
         | where sort keys change slightly from iteration to iteration in
         | some larger algorithm, and sort items need to be dynamically
         | kept in order. It no doubt also happens when grabbing data from
         | multiple places, each of which is theoretically ordered
         | arbitrarily but is in practice ordered predictably.
         | 
         | So yeah, the author picked the use-cases where this algorithm
         | is going to show in its best light. But that's not bad either,
          | right? Data is often _not_ well-ordered -- if you're counting
         | occurrences of words, and want to get an ordered list (and
         | never mind that you wouldn't use a comparison-based sort for
         | that) you wouldn't expect much order. Likewise sorting any
         | keys/values from a decent hash table.
         | 
         | One worry that _I_ would have applies specifically to C++ and
         | D: sometimes values are big, and copies are expensive. But I
         | guess users can always choose to sort pointers like other
         | languages do, if the standard library uses a copy-heavy,
         | branch-light algorithm.
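The "sort pointers instead" workaround can be as simple as sorting an index permutation: each swap then moves one integer rather than a whole object. A C++ sketch (`sorted_order` is a hypothetical helper, not a standard facility):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Return the permutation of indices that would sort v, without moving
// the (potentially large) values themselves.
std::vector<std::size_t> sorted_order(const std::vector<std::string>& v) {
    std::vector<std::size_t> idx(v.size());
    std::iota(idx.begin(), idx.end(), 0); // 0, 1, 2, ...
    std::sort(idx.begin(), idx.end(),
              [&](std::size_t a, std::size_t b) { return v[a] < v[b]; });
    return idx;
}
```

The trade-off is an extra indirection per comparison, which may or may not beat copying, depending on element size.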
        
           | TeMPOraL wrote:
           | I think one example of extremely unsorted data is _random
           | data_ in Monte Carlo simulations, in particular if you need
           | to compute statistical summaries that need the data to be
           | sorted - e.g. percentiles. So you may end up repeatedly
            | sorting million-long random arrays many times a second.
           | 
           | Then again, maybe there are algorithms letting you compute
           | percentiles without sorting?
        
             | kccqzy wrote:
             | Extracting percentile information doesn't require sorting.
             | Even if you're too lazy to implement anything custom, a few
             | invocations of std::nth_element will likely outperform
             | sorting:
             | https://en.cppreference.com/w/cpp/algorithm/nth_element
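For example, a single percentile can be pulled out with one nth_element call, average O(n) with no full sort. A sketch (note it reorders the input in place and assumes a non-empty vector):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// p-th percentile (p in [0, 1]) of a non-empty vector, via partial
// selection rather than a full sort. Reorders v as a side effect.
long percentile(std::vector<long>& v, double p) {
    std::size_t k = static_cast<std::size_t>(p * (v.size() - 1));
    std::nth_element(v.begin(), v.begin() + k, v.end());
    return v[k];
}
```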
        
             | [deleted]
        
             | cozzyd wrote:
             | Yeah, you can histogram the values if you just need an
             | approximate answer (or if you have few enough discrete
             | values, you can just have a bin per value). Once you have
             | an approximate answer, if you need a more precise answer,
             | you can go through the data again and only consider values
             | within the right bin (which hopefully will be a lot fewer,
             | if you've chosen your bins wisely) and then use something
             | like std::nth_element.
        
             | loeg wrote:
             | > Then again, maybe there are algorithms letting you
             | compute percentiles without sorting?
             | 
             | I think you may be able to do so probabilistically, which
             | is often good enough? Additionally, one observation about
             | percentiles is that you don't need to sort all of the data;
             | you can just hold on to the top N results in something like
             | a Heap.
        
             | leni536 wrote:
             | If you need k percentile values then I think you could do
             | it in O(n log(k)) with introselect. You most definitely
             | don't need O(n log(n)).
        
       | gautamcgoel wrote:
       | Reminds me of this great article on writing a fast mergesort:
       | 
       | http://pvk.ca/Blog/2012/08/13/engineering-a-list-merge-sort/
        
       | Arcuru wrote:
       | Reminds me of this paper on binary search, which also shows a
       | speed up using branch-free comparisons.
       | 
       | "Array Layouts for Comparison-Based Searching"
       | https://arxiv.org/pdf/1509.05053.pdf
        
       | nightcracker wrote:
       | Hoare partitioning can also be implemented in a branchless
       | manner, my algorithm pdqsort does this:
       | https://github.com/orlp/pdqsort
        
       | kleiba wrote:
       | To anyone interested in a deeper treatment of quicksort and
       | variants thereof, I can recommend Sebastian Wild's PhD thesis on
       | that topic: https://kluedo.ub.uni-
       | kl.de/frontdoor/deliver/index/docId/44...
        
         | bachmeier wrote:
         | Off topic: Is it common practice for a dissertation at a German
         | university to be written in English?
        
           | cozzyd wrote:
            | In physics it certainly is. As far as I know, nearly all
            | scientific conferences and journals are in English these
           | days. Even journals like JETP (https://en.wikipedia.org/wiki/
           | Journal_of_Experimental_and_Th...) became joint English and
           | Russian in the 50's!
        
           | DaiPlusPlus wrote:
           | Yes - to ensure the widest possible audience. English is the
           | lingua-franca of academia and has been for almost a century
           | now.
        
             | kleiba wrote:
             | At least in MINT fields, probably less so in the
             | humanities. Correct me if I'm wrong.
        
               | atq2119 wrote:
               | You're right. German economics is famously insular, for
               | example. Part of that is cultural, part of it is that
               | they don't write as much in English.
        
           | kkaefer wrote:
           | Depends on the field, but it is very common in computer
           | science and generally in natural sciences. Studying computer
           | science, I don't think I ever even wrote as much as a paper
           | in German; everything's English.
        
           | kleiba wrote:
            | Hardly anyone is going to read, let alone cite, your work if
            | you publish in German. I'm not even aware of any
           | German-speaking conferences.
        
       | Ericson2314 wrote:
        | I'm not sure I've ever sorted an array or list. Rather, I put
        | things into sets or maps, or built an index---I always wanted to
        | access things later, or to have an on-line data structure (which
        | a sorted flat thing is not).
        | 
        | This whole field is a waste of time, an anachronism from when
        | the master copy was paper/pre-computer and computers were just
        | doing the analytics.
        
         | [deleted]
        
         | ncmncm wrote:
         | Putting things in a map or set is almost invariably slower,
         | often much slower, than a smart sort. If performance doesn't
         | matter, then go ahead. Most often sort performance doesn't
         | matter, or anyway most of the time is spent elsewhere. People
         | do use Python or Bash in production, without shame.
         | 
         | But some of us work on problems where performance of the sort
         | does matter.
        
         | ComputerGuru wrote:
          | Your dismissal of a "sorted flat thing" as necessarily not
          | being online is incorrect, or at least unnecessarily strict,
          | to your detriment.
         | 
         | I wrote this comparing the high-level performance of a "set"
         | (AVL or red-black tree) against just that:
         | 
         | Faster lookups, insertions, and in-order traversal than a red-
         | black or AVL tree [0]
         | 
         | [0]: https://neosmart.net/blog/2019/sorted-list-vs-binary-
         | search-...
         | 
         | (Spoiler: it depends on the nature of your data, but sorted
         | vector can be dramatically faster.)
        
       ___________________________________________________________________
       (page generated 2020-05-17 23:00 UTC)