[HN Gopher] Changing std::sort at Google's scale and beyond
       ___________________________________________________________________
        
       Changing std::sort at Google's scale and beyond
        
       Author : ashvardanian
       Score  : 352 points
       Date   : 2022-04-20 15:57 UTC (7 hours ago)
        
 (HTM) web link (danlark.org)
 (TXT) w3m dump (danlark.org)
        
       | jeffbee wrote:
       | Huh. If I found that Reflection::ListFieldsMayFailOnStripped was
       | a frequent caller of std::sort, to an extent that it was actually
       | costing measurable CPU time, first thing I would do is find and
       | fix all callers of _that_.
        
         | gipp wrote:
         | "Fix" how exactly? I'm not quite sure what you're suggesting.
        
           | jeffbee wrote:
           | C++ proto reflection is a convenience. If it actually shows
           | up at scale in production, that smells like a design error.
        
             | moultano wrote:
             | All things show up at scale in production at Google.
        
               | jeffbee wrote:
               | In one sense yes, but in another sense even though Google
               | is large their threshold for something being worth fixing
               | based on efficiency justifications is correspondingly
               | large.
        
               | foota wrote:
               | Frequently though, you're dealing with a death by a
               | thousand paper cuts. Perhaps that call comes from some
               | common library whose implementation requires reflection.
        
               | moultano wrote:
               | It isn't though. A typical developer will never care
               | about 0.1% of unnecessary CPU usage in their application,
               | but at Google, depending on what component it's in, that
               | can easily be a million dollar find.
        
               | usefulcat wrote:
               | If you're measuring in terms of time, electricity, or
               | hardware, then the threshold would tend to be lower for
               | Google, not higher. A 0.1% savings on a million machines
               | is ~1 million times more valuable than a 0.1% savings on
               | a single machine.
               | 
               | OTOH if the measurement is risk of breaking something,
               | then it's not nearly so clear, since larger organizations
               | tend to have more software and hardware, and therefore
               | more opportunities for things to get broken.
        
               | jeffbee wrote:
               | You're saying the opposite of what the other guy said,
               | though. They said that at scale small things are big. I'm
               | saying that at scale .001% is still .001%.
        
               | jcelerier wrote:
               | > In one sense yes, but in another sense even though
               | Google is large their threshold for something being worth
               | fixing based on efficiency justifications is
               | correspondingly large.
               | 
               | not everything is a linear relationship
        
       | pierrebai wrote:
       | In the first example of problematic calls to nth_element, there
       | is another bug: if the container contains nothing, it will call
       | nth_element with begin(), begin() - 1, ...
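        | 
        | As a sketch of the kind of guard that fixes it (my reconstruction
        | of the pattern, not the article's actual code; select_top_k and
        | its signature are hypothetical):
        | 
        |     #include <algorithm>
        |     #include <cstddef>
        |     #include <functional>
        |     #include <vector>
        | 
        |     // Selecting the k largest with nth_element(begin, begin + k - 1, end).
        |     // If the container is empty (or k == 0), begin() + k - 1 is begin() - 1,
        |     // which is undefined behavior, hence the guard.
        |     void select_top_k(std::vector<int>& v, std::size_t k) {
        |         if (v.empty() || k == 0) return;          // guard the degenerate cases
        |         k = std::min(k, v.size());
        |         std::nth_element(v.begin(), v.begin() + (k - 1), v.end(),
        |                          std::greater<int>());    // top k now occupy the front
        |     }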
       | 
       | Also, typo: "compare functions much comply" -> "compare functions
       | must comply"
        
       | timmytokyo wrote:
       | "It took us around 20 years to find something really appropriate,
       | thanks to Andrei Alexandrescu."
       | 
       | The way this is written, it sounds as if Andrei Alexandrescu is
       | responsible for the 20 year delay.
        
       | lauv0x wrote:
        
       | samueloph wrote:
        | I'm starting to question my understanding of algorithmic
        | complexity. The author says things like "complexity being worst
        | case O(n log n)" and seems to talk about big O notation used for
        | average cases as well.
       | 
        | I've learned that big O notation is always and exclusively about
        | the worst case; for average and best cases we use Theta (Θ) and
        | Omega (Ω), respectively.
       | 
       | Do people not follow these definitions strictly? Has the author
       | committed a mistake or am I missing something?
       | 
        | I did experience an interviewer once saying that my complexity
        | analysis was wrong as the worst case would be extremely rare to
        | hit. I did try to explain nicely that big O notation is always
        | about the worst case and that they were thinking about big Theta
        | instead; now I wonder if I was wrong.
        
         | waynecochran wrote:
          | You are mixing two things together. The "average/expected case"
          | and the "worst case" are two different functions, each with
          | its own O, Theta, and Omega.
        
         | Corence wrote:
         | O, Theta, and Omega are independent from best, worst, and
         | average case. You can mathematically establish upper, lower, or
         | exact bounds on any of the average, best, or worst cases.
         | 
         | It is perfectly valid to say "the best case is O(n)" which
         | means the best case scales no worse than linearly. The "no
         | worse" here is not describing the best case, but rather the
         | strictness of your bound. You could say a sort algorithm is
         | O(n!) since, yes, it does scale no worse than n! but it's not
         | particularly helpful information.
         | 
         | Big O notation is used imprecisely frequently (see also people
         | saying a dataset is O(millions) to describe scale).
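          | 
          | For reference, one common way to write the definitions (in
          | LaTeX), which makes it clear that the direction of the bound is
          | independent of which case (best/average/worst) is being bounded:
          | 
          |     f \in O(g)      \iff \exists c > 0,\ n_0 : f(n) \le c \cdot g(n) \ \forall n \ge n_0
          |     f \in \Omega(g) \iff \exists c > 0,\ n_0 : f(n) \ge c \cdot g(n) \ \forall n \ge n_0
          |     f \in \Theta(g) \iff f \in O(g) \ \text{and} \ f \in \Omega(g)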
        
         | SQueeeeeL wrote:
          | You should look into amortized complexity. That's the formal
          | name for the worst-case cost of a sequence of operations,
          | averaged per operation.
        
         | samhw wrote:
         | The other replies are great, but just to put it concisely: big
          | O is 'worst case' in terms of input _size_, while the 'best
         | case big O' is 'best case' in terms of the input _contents_
         | ('shape').
        
         | bsedlm wrote:
         | In my opinion, nothing about big-O and related omega, theta,
         | etc, is at all strict.
         | 
          | Maybe I value "pure symbolic clarity" too much. I do see a very
          | _pragmatic_ usefulness to the complexity notation.
        
         | tempnow987 wrote:
         | I always used big O for worst case (by default). I did think
         | that was common.
         | 
         | Theta and Omega I use differently however.
         | 
         | I sometimes say, best case if X then it's O(n) or whatever.
        
         | larsrc wrote:
         | There are variations, that's what we learned back at Uni.
         | Expected O(f(n)), average O(f(n)), amortised O(f(n)), even
         | expected amortised O(f(n)). "Expected" implied a randomised
         | algorithm.
        
         | vmsp wrote:
         | In school, we used big O notation for all cases. I've never
         | heard about theta or omega notation.
         | 
         | Wikipedia's bubble sort page -- as an example -- also uses it
         | for all cases.
         | 
         | https://en.wikipedia.org/wiki/Bubble_sort
        
           | rockinghigh wrote:
            | You can see the list of notations here:
            | https://en.wikipedia.org/wiki/Big_O_notation#Family_of_Bachm...
        
         | dwohnitmok wrote:
         | Big O, Theta, and Omega notations all ultimately denote sets of
         | functions. So e.g. O(n) is the set of all functions which are
         | eventually overtaken by some linear function.
         | 
         | When we say "X is O(f(n))" what we really mean is "the function
         | that characterizes X is contained within the set of functions
         | denoted by O(f(n))." Anything that can be described by a
         | function hence can be given Big O, Theta, or Omega notation. So
         | yes worst case time complexity could be described in this
         | notation, but so could average time complexity.
         | 
         | But we could also describe things that have nothing to do with
         | time or space complexity with big O notation as well. As long
         | as we've turned it into a function, we can describe it with big
         | O notation.
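          | 
          | A concrete instance of that set membership, written in LaTeX:
          | 
          |     3n + 7 \in O(n) \quad \text{since } 3n + 7 \le 4n \text{ for all } n \ge 7
          | 
          | and "the average running time of algorithm A on inputs of size
          | n" is a function of n that can be placed in such a set in
          | exactly the same way.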
        
         | londons_explore wrote:
         | Computer scientists stole some notation from mathematicians,
         | but then slightly changed and substantially slackened the
         | definition...
        
           | tsimionescu wrote:
            | No, not really. CS and other branches of maths are using the
            | exact same definitions of O, o, Theta, and omega. It is true
            | that Omega is used with a different definition in CS, but it's
            | _stricter_, not more slack.
           | 
           | As explained elsewhere though, they are not notations for
           | algorithmic complexity, they are notations for function
           | categorization. It just happens that a common use for this
           | notation in CS is to categorize the function
           | "best/average/worse/... case complexity of an algorithm", but
           | they are also used for other purposes.
        
             | ufo wrote:
             | In my experience, a lot of the time when programmers say
             | big O, they actually mean big Theta...
        
               | tsimionescu wrote:
               | Yes, I agree with that. However, if f(n)=Theta(n) =>
               | f(n)=O(n), so they are usually not wrong, technically,
               | just unnecessarily broad.
        
               | adrianN wrote:
               | Programmers need not be very good computer scientists.
               | The actual scientists publishing papers on algorithms
               | usually get it right.
        
         | throw149102 wrote:
         | I like to use the metaphor of driving somewhere, and trying to
         | estimate when you're going to get there. Maybe you have a hotel
         | reservation, and you want to make sure you don't get there too
          | early nor too late. Then (with some times filled in for
          | example):
          | 
          |                                            Best case            Average case        Worst case
          |                                            (little/no traffic)  (average traffic)   (accident, snowed in, etc.)
          |    Max time, upper bound (Big O)           30 minutes           45 minutes          1 hour 20 minutes
          |    Both min and max, tight bound (Theta)   25 minutes           35 minutes          1 hour
          |    Minimum time, lower bound (Omega)       15 minutes           25 minutes          50 minutes
         | 
         | So then you can start asking questions like - if I assume I'm
         | in the average case, and I can show up earliest at 20 minutes
         | from now, and show up latest at 50 minutes from now, can I
         | guarantee that I will make it within that window? The answer in
         | this case is yes, because our range is 25-45 minutes, so the
         | earliest I can show up is in 25 minutes and the latest in 45
         | minutes.
         | 
          | This also helps explain why Big O is generally more useful - we
          | can almost always afford to be early to something, but we can
          | almost never afford to be late. So showing up 5 minutes early
          | usually is not a problem.
         | 
         | It also shows why Theta is better than both - in this case, if
         | we had this table we could completely ignore both the Big O row
         | and Omega row, because Theta is a better bound for both min and
         | max time.
         | 
         | Of course, in algorithmic complexity we replace the times with
         | functions in terms of some variables, usually just _n_.
        
           | tsimionescu wrote:
           | Your example is nice, but extremely informal, as these are
            | all constant numbers, while O/Theta/Omega (and their little-o,
            | little-omega counterparts) are about functions.
           | 
           | I think the most important thing that needs to be clarified
           | though is that algorithmic complexity is first and foremost a
            | function, `WorstCaseComplexity(size of input) = max number of
           | steps for any input of that size`. This function is often
           | then classed into a "complexity class" using O/Theta/Omega
           | etc, but that is only done for the sake of easier comparison
           | of complexities between algorithms.
        
         | powersnail wrote:
         | I think you are mistaken here. The concept of
         | worst/best/amortized case is different from O/Theta/Omega. The
         | latter is about the direction with which it is bounding.
         | 
         | Big O/Theta/Omega denotes sets of functions, and each of them
         | can be applied to best/worst/amortized complexity analysis.
        
         | tsimionescu wrote:
          | Worst case, average case, and best case complexity have nothing
          | to do with O/Theta/Omega.
          | 
          | When analyzing an algorithm, you want to compute the number of
          | steps it has to execute in certain cases as a function of its
          | input size(s). That function will be something like 2n^2 + 67n
          | + 189 steps. Since this is too much information, you can then
          | express it instead as O(n^2) (or, more loosely, O(n^100)), or as
          | Omega(n^2) (or, more loosely, Omega(25n)), or as Theta(n^2).
         | 
         | You can then also compute the best case running time of the
         | algorithm as a function of the size of the input, and get 2n^2
         | + 67n. Then you can say that the best case running time is also
         | O(n^2).
         | 
         | Finally, you can (sometimes) define an "average" running time.
         | 
         | Now, what do we mean by "shortest running time"? Is that not
         | always for n=0 or something? No - the best case running time
         | has to do with the running time in a case where the input has a
         | certain shape. For some sorting algorithms for example, if the
         | input is already sorted, they will only go through the list
         | once and do nothing, so they have a best case running time that
         | is O(n). Other sorting algorithms will still perform steps even
         | if the input is already sorted, so even their best case running
         | time may be O(n log(n)).
         | 
          | Now, you can say that, given any possible input, the running
          | time of the algorithm is O(worst case running time) and
          | Omega(best case running time). You definitely CAN'T say in
          | general that it is Theta(average case running time) though, as
          | it will sometimes run in time that is shorter than "average case
          | running time" (for best case scenarios); or sometimes it will
          | run in time that is longer than "average case running time" (for
          | worst case scenarios).
          | 
          | Basically, the worst/best/average case running times are an
          | analysis of how properties other than the size of the input
          | affect its running time.
         | 
          | Big-O notation and its friends are not specific to complexity
          | analysis. They are a general notation for the asymptotic
          | behavior of mathematical functions. Big-O is usually understood
          | as an upper bound on a function, big-Omega is a lower bound, and
          | Theta is a tight bound (any function in Theta(f) is, by
          | definition, eventually within a constant factor above and below
          | f).
         | 
          | Edit: let's take an example. Say we have the following
          | algorithm:
          | 
          |     isAllOdd(array x):
          |       if x is empty:
          |         return true
          |       if x[0] is even:
          |         return false
          |       return isAllOdd(x[1:])
          | 
          | In the best case, for an array of any size whose first element
          | is even, this function finishes after executing 4 instructions:
          | 2 comparisons, an array access and then a return. So its best
          | case complexity is best(n) = 4, which is O(1), Theta(1),
          | Omega(1).
         | 
          | In the worst case, all elements of the array are odd, so we
         | have to go through all n elements to find that out. It will
         | have to execute 6 instructions for each element in the array (2
         | comparisons, 2 array accesses, a call and a return), except for
         | the last element where it will execute only 2 instructions
         | (check size then return). So its worst case complexity function
          | is worst(n) = 6(n-1) + 2, which is O(n), Theta(n), Omega(n).
         | 
          | In the average case, each element is odd or even with equal
          | probability, so every call is effectively a fair coin flip and
          | we expect to hit an even element after about 2 calls of
          | isAllOdd, regardless of n. So our average case complexity,
          | average(n), is a small constant number of steps, which is O(1),
          | Theta(1), Omega(1). (I am ignoring arrays shorter than 2, which
          | is anyway irrelevant for asymptotic behavior).
         | 
         | Now, what is the overall complexity of isAllOdd? We can
         | generalize the average case calculation if we want to get an
         | exact value; but whatever that function, overall(n), is, we
          | know for sure best(n) <= overall(n) <= worst(n); so overall(n)
          | is O(n) and Omega(1). However, we can't define any Theta for it
          | - there is no function which it will differ from by at most a
          | constant factor for any n.
         | 
         | Why do the detailed analysis instead of saying this is O(n) and
         | being done with it? Well, perhaps we know that we are in fact
         | calling this function with a large number of even numbers, so
         | we would actually over-estimate its complexity. If we instead
         | know we are calling it with a large number of odd numbers, we
         | would also find out from the average case analysis that it
          | would be better to define a reverse function, isAllEven, and
         | compute isAllOdd = !isAllEven.
        
         | karpierz wrote:
         | Theta(X) is not "average case" behaviour, it means "both
         | Omega(X) and O(X)".
         | 
         | Ex: merge sort is Theta(n log(n)), but quick sort is not.
         | 
         | The author is likely saying "worst case O(n log(n))" as
         | redundancy to emphasize that this is worst case behaviour.
        
           | tsimionescu wrote:
           | It's not redundant. You can have "average case O(n log(n))"
           | but "worse case O(n^2)", like quicksort does. Depending on
           | the pivot function, that "worse case" can be more or less
           | common.
        
             | karpierz wrote:
             | It is not redundant to say "average case O(n log(n))". It's
             | not formally a correct usage of O(). But colloquially, it
             | means that the average number of operations over all
              | possible inputs of length n is X.
             | 
             | It is redundant to say "worst case O(n^2)". By definition,
             | O(X) means that the maximum of the number of operations
             | over all possible inputs of length n is at most X.
        
               | tsimionescu wrote:
               | > It is not redundant to say "average case O(n log(n))".
               | It's not formally a correct usage of O().
               | 
               | You are wrong. It is perfectly formally correct to say
               | "average case complexity is 2n log(100n) + 78n + 90,
               | which is O(n log (n))".
               | 
               | > By definition, O(X) means that the maximum of the
               | number of operations over all possible inputs of length n
               | is at most X.
               | 
                | No, that is, in fact, a colloquial use of O(X). O(X),
                | formally, is simply the set of all functions that are
                | bounded above by X, up to some constant factor. It has
                | nothing to do with the number of operations in itself.
               | 
               | Instead, we formally define complexity functions for an
               | algorithm, which are the minimum/average/maximum number
               | of operations the algorithm will execute for any input of
               | size N; call these functions Cmin(n), Cavg(n), Cmax(n).
               | Then, we can say that C_(n) is in O(X(n)) (or, as a
               | shorthand, C_(n) = O(X(n))) if there exists a number k
               | such that C_(n) < k * X(n) for any n > n0.
               | 
                | So, we can say that the best case complexity of an
                | algorithm, Cmin(n), is Omega(n^2); this means that even
                | for the best possible input of length n, it will execute
                | at least k * n^2 steps, for some k. Or, we can say that
                | the worst case complexity of some algorithm, Cmax(n), is
                | O(n^2); that is, that even for the worst case input of
                | length n, it will execute less than k * n^2 steps, for
                | some k.
               | 
               | In fact, the concept "number of steps that an algorithm
               | will execute for any input of size n" is not a
               | mathematical function at all (unlike min/max/avg number
               | of steps), since there isn't, in general, a unique such
               | number for any given size.
               | 
               | So, I would argue that saying "complexity O(n^2)" is in
               | fact ill-defined formally, and that, if you want to be
               | formal, you _must_ specify  "worst case complexity
               | O(n^2)".
        
               | stonemetal12 wrote:
                | Would argue it isn't perfectly formally correct, because
                | "average case complexity" isn't well defined. Is the
                | average here a mean, median, or mode? It is typically used
                | as if it meant the 90th percentile, or all but a few
                | degenerate cases.
                | 
                | But that is a bit nit-picky.
        
               | Kranar wrote:
                | You are incorrect about this. Worst case and Big O are
                | independent of one another. O, Omega, Theta, etc. are
                | about describing bounds: O(f) is the set of functions
                | asymptotically bounded above by f, Omega(f) is the set of
                | functions asymptotically bounded below by f, and Theta(f)
                | is the intersection of O(f) and Omega(f).
               | 
               | Typically f is used to describe space or time complexity
               | of an algorithm, but it can be used for any function that
               | maps positive integers to the non-negative real numbers.
               | 
               | You can make any combination of best case/worst
                | case/average case with O, Omega, Theta. In fact, you can
                | even
               | consider other cases as well such as amortized cases,
               | 90th percentile cases, any case you want to investigate
               | that can be defined as a function from positive integers
               | to non-negative real numbers can be classified as having
               | an asymptotic upper bound, lower bound, or tight bound.
        
               | karpierz wrote:
               | Yeah, okay. You are correct here.
               | 
               | "Average-case of algorithm X" can be a function function
               | (average number of operations for all possible inputs of
               | length n). And asymptotic bounds can be applied to
               | arbitrary functions from R -> R.
        
         | dudus wrote:
         | I remember from college there was big-Oh `O(x)` and little-Oh
         | `o(x)`. Is Omega/Theta the same as little-Oh?
        
       | gwern wrote:
       | I'd like to hear more about the reinforcement learning part. The
       | patch literally just says 'Reinforcement Learning'. Not very
       | helpful!
        
       | hbn wrote:
       | Sorry if this seems pedantic in the grand scheme of the topic,
       | but is the indentation in the code samples messed up, or is there
       | some kind of logic to it? I've never seen anything like that, and
       | I find it pretty difficult to read.
        
         | bogwog wrote:
         | Looks normal to me (except it seems like some snippets indent
         | with 2 spaces while others indent with 4).
         | 
         | The line spacing is weird though.
        
         | danlark wrote:
          | Sorry, it definitely was a little inconsistent as I stored all
          | snippets without proper formatting and at the same time tried to
          | align the text for phones, etc.
         | 
         | Failed at everything. Will improve
        
       | evmar wrote:
       | The "use ASLR for random seed" trick in there is pretty cute, but
       | I noticed it will just silently not work in non-ASLR settings,
       | and in particular under Wasm.
        
       | forrestthewoods wrote:
       | An interesting takeaway here is that custom comparators are
       | really hard to write correctly. There are a lot of constraints
       | that aren't obvious.
       | 
       | I suspect there should be a new std::verify_weak_ordering to help
       | people write tests that perform n^3 tests on potential data to
       | ensure a comparator is safe to use.
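        | 
        | A brute-force sketch of what such a checker might look like (the
        | name and interface are hypothetical, not an existing std
        | facility):
        | 
        |     #include <vector>
        | 
        |     // O(n^3) check that `comp` behaves as a strict weak ordering on
        |     // the sampled values in `v`: irreflexive, asymmetric, transitive,
        |     // and with a transitive "neither is less" equivalence.
        |     template <typename T, typename Compare>
        |     bool is_strict_weak_ordering(const std::vector<T>& v, Compare comp) {
        |         auto equiv = [&](const T& a, const T& b) {
        |             return !comp(a, b) && !comp(b, a);
        |         };
        |         for (const T& a : v)
        |             if (comp(a, a)) return false;                    // irreflexivity
        |         for (const T& a : v)
        |             for (const T& b : v)
        |                 if (comp(a, b) && comp(b, a)) return false;  // asymmetry
        |         for (const T& a : v)
        |             for (const T& b : v)
        |                 for (const T& c : v) {
        |                     if (comp(a, b) && comp(b, c) && !comp(a, c))
        |                         return false;                        // transitivity
        |                     if (equiv(a, b) && equiv(b, c) && !equiv(a, c))
        |                         return false;                        // equivalence transitivity
        |                 }
        |         return true;
        |     }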
        
         | jcelerier wrote:
          | MSVC's and GCC's standard library implementations both have a
          | debug mode which has been checking that for almost two decades.
        
       | DrBazza wrote:
        | Quite a long, interesting read. Unless I missed it, I recall that
        | there are optimal solutions for sorting up to 7?? items. I can't
        | find the reference, but unsurprisingly I read it in a dlang forum
        | post by Andrei Alexandrescu. Anyone have a link to that aspect of
        | sorting?
        
         | tialaramex wrote:
         | Under the section heading "Reinforcement learning for small
         | sorts", there are diagrams of the small sorts, for 3, 4 and 5
         | items.
         | 
         | They also explain that down here you're at the mercy of the
         | physical hardware. Emitting one fewer x86 instruction does not
          | necessarily make your program faster; still, if it's not slower,
          | then smaller programs are good too.
        
       | westurner wrote:
       | > LLVM history: _Back then we recognized some really interesting
       | benchmarks and we didn't recall anybody trying to really
       | benchmark sorts on different data patterns for standard
       | libraries._
       | 
       | Timsort https://en.wikipedia.org/wiki/Timsort :
       | 
       | > _In the worst case, Timsort takes O(n log n) comparisons to
       | sort an array of n elements. In the best case, which occurs when
       | the input is already sorted, it runs in linear time, meaning that
       | it is an adaptive sorting algorithm. [3]_
       | 
       | > _It is advantageous over Quicksort for sorting object
       | references or pointers because these require expensive memory
        | indirection to access data and perform comparisons, and
        | Quicksort's cache coherence benefits are greatly reduced._ [...]
       | 
        | > _Timsort has been Python's standard sorting algorithm since
       | version 2.3 [~2002]. It is also used to sort arrays of non-
       | primitive type in Java SE 7,[4] on the Android platform,[5] in
       | GNU Octave,[6] on V8,[7] Swift,[8] and Rust.[9]_
       | 
       | Sorting algorithm > Comparison of algorithms
       | https://en.wikipedia.org/wiki/Sorting_algorithm#Comparison_o...
       | 
       | Schwartzian transform
       | https://en.wikipedia.org/wiki/Schwartzian_transform :
       | 
       | > _In Python 2.4 and above, both the sorted() function and the
       | in-place list.sort() method take a key= parameter that allows the
       | user to provide a "key function" (like foo in the examples
       | above). In Python 3 and above, use of the key function is the
       | only way to specify a custom sort order (the previously supported
       | cmp= parameter that allowed the user to provide a "comparison
       | function" was removed). Before Python 2.4, developers would use
       | the lisp-originated decorate-sort-undecorate (DSU) idiom,[2]
       | usually by wrapping the objects in a (sortkey, object) tuple_
       | 
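        | The same decorate-sort-undecorate idea, sketched in C++ (an
        | illustration only; sort_by_length_dsu is a made-up helper, and the
        | "expensive key" here is just the string length):
        | 
        |     #include <algorithm>
        |     #include <cstddef>
        |     #include <string>
        |     #include <utility>
        |     #include <vector>
        | 
        |     // Precompute the key once per element instead of recomputing it
        |     // inside the comparator on every comparison.
        |     std::vector<std::string> sort_by_length_dsu(std::vector<std::string> items) {
        |         std::vector<std::pair<std::size_t, std::string>> decorated;
        |         decorated.reserve(items.size());
        |         for (auto& s : items)
        |             decorated.emplace_back(s.size(), std::move(s));        // decorate
        |         std::sort(decorated.begin(), decorated.end(),
        |                   [](const auto& a, const auto& b) { return a.first < b.first; });
        |         std::vector<std::string> result;
        |         result.reserve(decorated.size());
        |         for (auto& d : decorated)
        |             result.push_back(std::move(d.second));                 // undecorate
        |         return result;
        |     }
        | 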
       | Big-O Cheatsheet https://www.bigocheatsheet.com/
       | 
       | Quicksort in Python 7 ways (and many other languages) on
       | RosettaCode:
       | https://rosettacode.org/wiki/Sorting_algorithms/Quicksort#Py...
        
         | dralley wrote:
         | IIRC Rust and Swift have a modified timsort implementation to
         | remove the "galloping" strategy.
        
           | tialaramex wrote:
           | Rust has "an adaptive, iterative merge sort inspired by
           | timsort" to implement the stable sort() method if you have an
           | allocator, and a pattern-defeating quicksort to implement
           | unstable_sort() which is provided even if you don't have an
           | allocator (no_std embedded environments).
        
         | danlark wrote:
         | Thanks, I'll remove the note about "recalling"
        
           | westurner wrote:
           | xeus-cling didn't exist back then:
           | https://github.com/jupyter-xeus/xeus-cling#a-c-notebook
           | 
           | https://github.com/fffaraz/awesome-cpp#debug
           | 
           | A standard way to benchmark and chart {sorting algorithms,
           | web framework benchmarks,} would be great.
           | 
           | The TechEmpower "framework overhead" benchmarks might have at
           | least average case sorting in there somewhere: https://www.te
           | chempower.com/benchmarks/#section=data-r20&hw=...
        
             | westurner wrote:
              | awesome-algorithms
              | https://github.com/tayllan/awesome-algorithms#github-librari...
              | 
              | awesome-theoretical-computer-science > Machine Learning
              | Theory, Physics; Grover's; and surely something is faster
              | than Timsort:
              | https://github.com/mostafatouny/awesome-theoretical-computer...
        
               | samhw wrote:
               | Christ, the description on that second repo is so
               | _comically terribly_ spelled that I can't figure out
               | whether it's a joke:
               | 
               | > The interdicplinary of Mathematics and Computer
               | Science, Distinguisehed by its emphasis on mathemtical
               | technique and rigour.
               | 
               | (Perhaps a witty satire on the reputation of mathmos for
               | being godawful at anything literary...)
        
         | chkas wrote:
         | I don't understand why you would need Timsort. If the pivot
         | element is chosen randomly, Quicksort is always (or with
         | probability bordering on certainty) O(n log n).
         | 
         | Update: I confused Timsort with IntroSort. I should have
         | googled before posting ...
        
           | masklinn wrote:
           | > I don't understand why you would need Timsort. If the pivot
           | element is chosen randomly, Quicksort is always (or with
           | probability bordering on certainty) O(n log n).
           | 
           | For starters, it's a stable sort, which excludes quicksort
           | right out the gates.
           | 
           | Second, the primary hypothesis of timsort is that real-world
           | data is non-random, timsort is biased for partially or
           | almost-sorted sequences, which it can sort in as low as O(n).
           | That would be the point westurner was making, timsort's
           | creation is very much predicated on "really benchmark sorts
           | on different data patterns".
           | 
           | Third, for very short sequences (also a common situation)
           | it's an insertion sort, which is extremely efficient due to
           | the low overhead (even more so combined with the adaptivity
           | which insertion sort also has).
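            | 
            | A toy illustration of the "adaptive" part (mine, not timsort's
            | actual run detection, which also handles descending runs):
            | 
            |     #include <cstddef>
            |     #include <vector>
            | 
            |     // Length of the ascending run at the front of the array; if it
            |     // covers the whole input, an adaptive sort has nothing to do.
            |     std::size_t ascending_run_length(const std::vector<int>& v) {
            |         if (v.empty()) return 0;
            |         std::size_t i = 1;
            |         while (i < v.size() && v[i - 1] <= v[i]) ++i;
            |         return i;
            |     }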
        
             | orlp wrote:
             | > For starters, it's a stable sort, which excludes
             | quicksort right out the gates.
             | 
             | Negative, it excludes in-place quicksort but out-of-place
             | quicksort can be stable just fine.
             | 
             | > Third, for very short sequences (also a common situation)
             | it's an insertion sort, which is extremely efficient due to
             | the low overhead (even more so combined with the adaptivity
             | which insertion sort also has).
             | 
             | Every modern quicksort implementation also does this (or a
             | similarly efficient smallsort).
        
               | superjan wrote:
               | Hi, you are ignoring the second point, and for the other
               | two points you are referring to particular extensions of
               | quicksort. Quite possibly you can extend quicksort
               | similar to how timsort extended mergesort, but if that
               | was commonplace nobody would care about timsort.
        
               | orlp wrote:
               | > and for the other two points you are referring to
               | particular extensions of quicksort
               | 
               | Maybe for the third point, but once again, every
               | implementation does this. For the first point, absolutely
               | not. Just because in-place partitioning is often the only
               | variant described, out-of-place partitioning is
               | absolutely just quicksort, not an extension.
               | 
               | > Quite possibly you can extend quicksort similar to how
               | timsort extended mergesort
               | 
               | Please see my top comment in this thread - I have done
               | exactly that. Or rather, I have extended powersort (which
               | is very similar to timsort) with quicksort.
        
         | loeg wrote:
         | I ported Timsort to the Windows kernel back in college. Good
         | times!
        
       | Something1234 wrote:
       | I'm confused, so the comparison function can modify the list? I
       | thought most sorting algorithms assumed the contents of the list
        | were constant. Why would I want my comparator to ever modify the
       | list under sort?
        
         | slavboj wrote:
         | An "accessed at" field is a trivial use case.
        
         | danlark wrote:
         | You don't, it's just not disallowed :)
        
           | infogulch wrote:
           | "because it prevents the user from writing a comparison
           | function that modifies the list while sorting" is not a
           | reason I would expect to prefer rust, and yet here we are.
        
             | mbrubeck wrote:
             | It's possible in Rust too, but only for a subset of element
              | types: those that support shared mutation (also known as
              | "interior mutability"). For example, a type containing an
             | `AtomicU32` or a `Mutex<T>` or a `Cell<T>` could have an
             | `Ord::cmp` implementation that alters its value, and the
             | compiler will not prevent this.
        
               | orlp wrote:
               | Yes, I had a nasty bug in an initial version of glidesort
               | where I (naively) assumed that I could create a temporary
               | copy of an element by doing a memcpy and call the
               | comparison function on this temporary copy a bunch of
               | times before throwing it away, since I assumed the
               | comparison function would not mutate the copy.
               | Unfortunately interior mutability means this would lead
               | to potential double drops, and it was just not sound.
        
         | [deleted]
        
         | jmalicki wrote:
         | If you examine the modification more closely, it is just
         | finding the list that would exhibit worst case performance -
         | once created, that list would always exhibit the worst case as
         | a constant list.
         | 
         | Once you have the resulting list, the indices are sorted, then
         | assigning each num_solid value to that index would result in
         | the list being constant, and then could be sorted to give such
         | worst case behavior.
         | 
         | I wish the article was more clear on this.
        
       | orlp wrote:
       | Hi all, I'm the author of pdqsort that's credited in the post (to
       | be clear: the authors acknowledged pdqsort at the bottom, I am
       | _not_ associated with the posted work directly). I recently held
       | a talk at my institute on efficient in-memory sorting and the
        | ideas in pdqsort, in case you're interested in hearing some of
        | the theory behind it all:
        | https://www.youtube.com/watch?v=jz-PBiWwNjc
       | 
       | Next week I will hold another talk in the Dutch seminar on data
       | systems design (https://dsdsd.da.cwi.nl/) on glidesort, a new
       | _stable_ sorting algorithm I 've been working on. It is a
       | combination of adaptive quicksort (like pdqsort, fully adaptive
       | for many equal elements) and an adaptive mergesort (like timsort,
       | fully adaptive for long pre-sorted runs). It is the first
       | practical implementation of an algorithm I'm aware of that's
       | fully adaptive for both. Like pdqsort it uses modern architecture
       | aware branchless sorting, and it can use an arbitrary buffer
       | size, becoming faster as you give it more memory (although if
       | given a constant buffer size it will degrade to O(n (log n)^2) in
       | theory, in practice for realistic workloads it's just a near-
       | constant factor (c ~<= 3-5) slower).
       | 
        | The source code isn't publish-ready yet; I still have to do some
        | extra correctness vetting and testing, and in particular
        | exception safety is not yet fully implemented. This is
       | important because I wrote it in Rust where we must always give
       | back a valid initialized array, even if a comparison operator
       | caused a panic. But I do have some performance numbers to quote,
       | that won't significantly change.
       | 
       | For sorting 2^24 randomly shuffled distinct u32s using a buffer
       | of 2^23 elements (n/2), glidesort beats Rust's stdlib slice::sort
       | (which is a classic timsort also using a buffer of n/2) by a
       | factor of 3.5 times. When stably sorting the same numbers
       | comparing only their least significant 4 bits, it beats stdlib
       | slice::sort by 12.5 times using 6.5 times fewer comparisons, both
       | numbers on my Apple M1 Macbook. All of this is just using single-
       | threaded code with a generic comparison function. No SIMD, no
       | threads, no type-specific optimizations.
       | 
       | Finally, glidesort with a buffer size of >= n/2 is _faster_ than
       | pdqsort.
        
         | mlochbaum wrote:
         | Any chance you could comment on fluxsort[0], another fast
         | quicksort? It's stable and uses a buffer about the size of the
         | original array, which sounds like it puts it in a similar
         | category as glidesort. Benchmarks against pdqsort at the end of
         | that README; I can verify that it's faster on random data by
         | 30% or so, and the stable partitioning should mean it's at
         | least as adaptive (but the current implementation uses an
         | initial analysis pass followed by adaptive mergesort rather
         | than optimistic insertion sort to deal with nearly-sorted data,
         | which IMO is fragile). There's an in-place effort called
         | crumsort[1] along similar lines, but it's not stable.
         | 
         | I've been doing a lot of work on sorting[2], in particular
         | working to hybridize various approaches better. Very much
         | looking forward to seeing how glidesort works.
         | 
         | [0] https://github.com/scandum/fluxsort
         | 
         | [1] https://github.com/scandum/crumsort
         | 
         | [2]
         | https://mlochbaum.github.io/BQN/implementation/primitive/sor...
        
           | orlp wrote:
           | I have stolen a lot of ideas from scandum and extended his
           | ideas in new ways. He is definitely a mad genius. Glidesort
           | (on my M1 machine at least) matches fluxsort within a couple
           | % for random data, but glidesort is robust in that it will
           | always take advantage of pre-sorted runs and many equal
           | elements (at least if it has buffer memory), no matter where
           | they are in the array.
           | 
           | In particular, I was inspired by three things from scandum:
           | 
           | 1. fluxsort's out-of-place stable partitioning. From this I
           | got reminded that not only is out-of-place stable
           | partitioning a thing, it's highly competitive. I've always
           | had this as an idea in the back of my mind, but never went
           | through with it because I kept getting discouraged by C++'s
           | distinction of moving to uninitialized memory vs. moving into
           | a moved-from value (which is why I implemented glidesort in
           | Rust).
           | 
           | 2. quadsort's "ping pong merge", which reduces unnecessary
           | memcpys by merging both on the way _out_ and on the way _in_
           | the original array. I did have this idea before, but always
            | dismissed it because I thought keeping track of what's where
           | would be a massive pain. Simply waiting until there's 4
           | things to merge eliminates this problem and is just genius.
           | 
           | 3. quadsort's "branchless parity merge", which merges from
           | both ends of the array if the merge is perfectly balanced. I
           | make no claim that I thought of this, it's just genius. I had
           | two key takeaways from this: you can make some very fast
           | small sorting algorithms with merges, and interleaving loops
            | to reduce data dependencies is significantly faster.
           | 
           | So I combined #1 & #3 into what I call bidirectional stable
           | partitioning, where I partition from _both_ sides of the
           | array into an out-of-place buffer through interleaved loops.
           | 
           | I extended the adaptiveness and applicability of #2 heavily
           | by replacing the 'merge' operation in powersort
           | (https://arxiv.org/abs/1805.04154) with a 'virtual merge'
           | operation that delays merges until necessary. This is also
           | what allows me to use quicksort in a bottom-up adaptive
           | mergesort, because I don't eagerly sort small runs! Instead I
           | simply keep unsorted runs around, 'merging' unsorted runs
           | simply by concatenating them - purely in bookkeeping.
           | 
           | I heavily extended 3 for the mergesort part by realizing we
           | don't need perfectly balanced merges, we can just take the
           | `min` of the two runs and start off with a merge from both
           | sides, and then look further. I also did more interleaving by
           | doing a binary search to compute independent parallel merges
           | where sensible, and interleaving those loops.
           | 
           | As a quick preview, here is a visualization of glidesort
           | using a buffer size of n/2, where I have artificially limited
           | the concatenation of unsorted runs to n/8 so that it won't
           | just look only like quicksort, and both the quicksort and
            | mergesort aspects are shown:
            | https://cdn.discordapp.com/attachments/273539705595756544/96...
        
             | mlochbaum wrote:
             | Thanks for the discussion! Can't say I follow everything,
             | but using parity merge for part of an unbalanced merge
             | makes a lot of sense and that alone is worth it.
             | 
             | Stepped through the video a few times at 1/4 speed. The n/8
             | thing is a bit confusing, first because I didn't read it
             | and second because it makes it hard to tell a partition
             | result from the beginning of the next segment. I think I
             | can follow what's going on, but I don't get the purpose of
             | the bidirectional partition. It doesn't use less memory,
             | does it? So is there something to do with fitting in with
             | mergesort better? I'm not familiar with powersort; I'll
             | read up on it.
        
               | orlp wrote:
               | > but I don't get the purpose of the bidirectional
               | partition. It doesn't use less memory, does it? So is
               | there something to do with fitting in with mergesort
               | better
               | 
               | Nope, it does the same amount of comparisons, same number
               | of operations, same memory, etc. What it does do is it
               | allows you to interleave two independent loops, which is
               | also what makes the parity merge fast (I think scandum
               | misidentifies the loop unrolling for this effect for
               | large arrays - you can loop unroll merging large arrays
               | either way - for small constant-size merging it is
               | important however).
               | 
               | A modern CPU has a very long pipeline, and even though we
               | like to pretend all instructions are one cycle with no
               | latency, in reality there are real latencies to
               | instructions and memory accesses, and multiple
               | instructions can be executed in the same cycle. Since
               | each iteration of the partition depends on the previous
               | one (which pointer did we increment? we have to know
               | before we can execute the next store), you can hide these
               | latencies better if you interleave two independent loops.
               | In addition you can use more instruction-level
               | parallelism.
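                | 
                | A toy illustration of that effect (mine, not glidesort
                | code), using plain summation instead of partitioning: one
                | serial dependency chain versus two independent chains the
                | core can overlap:
                | 
                |     #include <cstddef>
                | 
                |     long sum_one_chain(const int* a, std::size_t n) {
                |         long s = 0;
                |         for (std::size_t i = 0; i < n; ++i)
                |             s += a[i];               // each add waits on the previous one
                |         return s;
                |     }
                | 
                |     long sum_two_chains(const int* a, std::size_t n) {
                |         long s0 = 0, s1 = 0;
                |         std::size_t half = n / 2;
                |         for (std::size_t i = 0; i < half; ++i) {
                |             s0 += a[i];              // chain 1, walks forward
                |             s1 += a[n - 1 - i];      // chain 2, walks backward, independent
                |         }
                |         if (n % 2) s0 += a[half];    // middle element for odd n
                |         return s0 + s1;
                |     }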
        
               | mlochbaum wrote:
               | Got it. I'd considered that but something about your
               | explanation threw me off. A subtlety I missed was that
               | you can't just do two forward partitions, because the
               | second one begins exactly halfway through the array--one
               | of the results would be placed there but it's probably
               | not the correct start location for the larger half-
               | partition.
        
               | orlp wrote:
               | Exactly, which is why I call it bidirectional
               | partitioning: one forward, one backward. It's a very
               | strange case where you can use parallelism (in this case
               | instruction-level parallelism), but only get two
               | independent instances without the ability to recurse
               | further.
               | 
               | You can of course make partitioning embarrassingly
               | parallel, look at IPS4o for that. But it is vastly more
               | complicated, and involves overhead shuffling blocks after
               | the partition.
        
         | hengri wrote:
         | Random question, but is this from the Google bare metal team? I
         | remember interviewing with them and the work they did was
         | fascinating.
        
           | orlp wrote:
           | I'm not sure if you're asking me personally or about the
           | original post, but to be clear: I'm not (currently)
           | associated with Google, nor is my work on pdqsort or
           | glidesort. I'm also not associated with the original post,
           | other than that my work on pdqsort was credited at the bottom
           | of the blog post, which is what my opening sentence was
           | referring to.
           | 
           | I edited my comment to reflect this, I can see how it
           | could've been misinterpreted. It was not my intention to
           | deceive.
        
           | danlark wrote:
           | Hi, the author is here
           | 
            | I am working on MapReduce/Data-pipelining/Dataproc efficiency
            | internally and externally. In the cloud you can check out
            | Google Cloud Dataflow.
            | 
            | We are working closely with the general efficiency and bare
            | metal teams, yes :). They definitely do fascinating work. This
            | one was my 20% project together with the Google efficiency
            | team.
        
       | DontGiveTwoFlux wrote:
       | OP describes systems which seed tests with non-determinism to
       | catch users who write tests relying on undefined behavior, such
       | as order of equal elements. Writing tests against these systems
       | is a game changer (once you figure out what's going on and why
       | there are flakes). I'm a huge fan of this subtle change which has
       | spared me at least one bug. It also allows library authors to
       | gain the advantage of not accidentally producing a behavior that
       | induces Hyrum's law. Kudos to the team for making this work at
       | scale.
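        | 
        | A minimal sketch of the general technique (my illustration, not
        | Google's actual mechanism):
        | 
        |     #include <algorithm>
        |     #include <random>
        |     #include <vector>
        | 
        |     // In tests, shuffle before a non-stable sort so any code that
        |     // silently relies on the relative order of *equal* elements
        |     // becomes visibly flaky instead of passing by accident.
        |     template <typename T, typename Compare>
        |     void sort_with_randomized_ties(std::vector<T>& v, Compare comp) {
        |         static std::mt19937 rng{std::random_device{}()};
        |         std::shuffle(v.begin(), v.end(), rng);
        |         std::sort(v.begin(), v.end(), comp);
        |     }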
        
       | hu3 wrote:
       | offtopic: Such a great engineer and yet their blog says: Website
       | Powered by WordPress.com.
       | 
       | I find the pragmatism worthy of praise. I've been meaning to
        | start blogging since forever, but I keep planning and researching
        | blog platforms/tools and never get around to it.
        
         | papito wrote:
         | WP is a never-ending source of security vulnerabilities. 30% of
         | the internet runs on it.
         | 
         | I found Jekyll and never looked back.
        
           | nvarsj wrote:
           | That's why you use managed Wordpress, aka wordpress.com.
        
         | sydthrowaway wrote:
        
           | hatware wrote:
           | Did one person design the SR-71...?
        
             | sydthrowaway wrote:
             | yes
        
           | samhw wrote:
           | In case you genuinely missed the point: he was saying that
           | the person is a great engineer for _not_ wasting their time
           | optimising irrelevant details like their blog. Time
           | allocation is an optimisation problem like any other.
        
         | [deleted]
        
         | haolez wrote:
         | Making use of existing tools and being able to manage their
         | shortcomings is an important engineering skill.
        
         | gzer0 wrote:
         | The paradox of choice. I suffer from this way too often.
         | 
         | https://en.wikipedia.org/wiki/The_Paradox_of_Choice
        
       | a_e_k wrote:
       | > _If you compare by a boolean variable, for example, partition
       | data by existence of something, it's very tempting to write
       | std::sort call._
       | 
       | > _We saw a pattern of sort+resize a lot._
       | 
       | These seem like good opportunities for clang-tidy checks.
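        | 
        | For reference, rough sketches of what the replacements for those
        | two patterns could look like (illustrative types and helper names,
        | not the article's code):
        | 
        |     #include <algorithm>
        |     #include <cstddef>
        |     #include <vector>
        | 
        |     struct Item { bool keep; int weight; };
        | 
        |     // 1. "Sorting" by a boolean is really a partition.
        |     void group_kept_first(std::vector<Item>& v) {
        |         std::partition(v.begin(), v.end(),
        |                        [](const Item& x) { return x.keep; });
        |     }
        | 
        |     // 2. sort + resize to keep the k smallest is really partial_sort
        |     //    (or nth_element + resize if the survivors' order is irrelevant).
        |     void keep_k_lightest(std::vector<Item>& v, std::size_t k) {
        |         k = std::min(k, v.size());
        |         std::partial_sort(v.begin(), v.begin() + k, v.end(),
        |                           [](const Item& a, const Item& b) {
        |                               return a.weight < b.weight;
        |                           });
        |         v.resize(k);
        |     }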
        
       | techwiz137 wrote:
       | Do you need to have learned Computer Science to grasp the topic
       | at hand here?
        
         | [deleted]
        
       | dekhn wrote:
       | Great fun read. I love this stuff; when I first learned C++ many
       | years ago I didn't really know much about complexity and what I
       | have learned over the years is that complexity costs depend a lot
       | on the underlying data distributions.
        
       | seanp2k2 wrote:
       | Thanks, I'm going to write all this out on a whiteboard the next
       | time someone asks me to sort a linked list in a technical
       | interview.
       | 
       | Edit: just wanted to clarify that the point I'm trying to make
       | here is that if you want to ask someone if they've memorized the
       | answer to that question, go ahead and ask it in a technical
       | interview, because the _real_ well-considered answer looks like
       | the linked article, and my honest non-smartass answer is that one
       | should use the library function and not try to implement it
        | themselves until they've done sufficient research into the
       | problem to prove to a reasonable degree that they indeed do need
       | to re-implement sort() and that it's in the best interests of the
       | business to fund such a project.
       | 
       | Would love to read this over the phone to a recruiter and have
       | them tell me that it's not what they have on their sheet.
        
       | 10000truths wrote:
       | I never really understood why a standard library would insist on
       | having a single "sort" function (or at best, a "sort" and "stable
       | sort"). Why not just provide explicit quick sort, merge sort etc.
       | functions and let users choose which they want? Sorting is one of
       | the first things a budding computer scientist learns, and the
       | research on sorting algorithms is well-trodden ground by now, so
       | I don't see how it could be a usability issue. I'd like to be
       | able to choose from these criteria without having to rely on
       | external libraries:
       | 
       | - How does it perform on average?
       | 
       | - How does it perform on pathological inputs?
       | 
       | - Is it stable?
       | 
       | - Does it allocate?
       | 
       | Not to mention constraints that an application can take advantage
       | of:
       | 
       | - What if the data is already mostly sorted? (insertion sort)
       | 
       | - What if writes are very expensive? (cycle sort)
       | 
       | - What if all the data are integers? (radix sort)
       | 
       | And so on. Trying to force a one-size-fits-all solution seems
       | rather silly at this point.
        
         | Hello71 wrote:
         | while i agree with you to some extent, i don't think including
         | a very large set of sorting functions in the standard library
         | is a good idea. especially for dynamic languages, a larger
         | standard library means more disk usage and worse cache
         | utilization. it's also typically relatively easy to use your
         | own sort function: in performance-critical cases, the sort
         | function is usually called directly, not via intermediate
         | libraries (which might need to be patched to call the new sort
         | function); also, sort functions are self-contained, i.e.
         | typically don't require system dependencies.
        
           | samhw wrote:
           | > A larger standard library means more disk usage and worse
           | cache utilization
           | 
           | Does it? How? I suppose one has to store the standard library
           | on the machine, which does notionally use up disk space, but
           | I can't see how it contributes under any reasonable
           | performance-relevant meaning of 'disk usage' (presumably
           | disk I/O during the runtime of the program). And for cache
           | utilization
           | I simply have no clue at all what you might be driving at.
        
         | dijit wrote:
         | There are such things as sensible defaults though.
         | 
         | Timsort being a somewhat famous example, which uses the fact
         | that python datastructures are of known length to figure out
         | the best sorting method for the size.
         | 
         | If you need something more exotic then you likely know more
         | about what you're doing, but most people writing software who
         | need a sort function usually don't need more than timsort.
        
           | morelisp wrote:
           | > uses the fact that python datastructures are of known
           | > length to figure out the best sorting method for the size.
           | 
           | That... does not match with my knowledge of how timsort
           | works. The sorting method is chosen based on the array
           | length, and every sorting algorithm in every language knows
           | that. It's extremely suboptimal for some types because
           | Python's typing is so weak!
           | 
           | Timsort succeeds because it has many good heuristics to
           | optimize common properties of its merge intervals.
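           | 
           | Roughly: it detects natural runs (reversing strictly
           | descending ones), extends short runs with insertion sort up
           | to a "minrun" length, and only then merges runs off a stack
           | under balance invariants. A simplified sketch of just the
           | run detection (my own sketch, not CPython's actual code):
           | 
           | #include <algorithm>
           | #include <cstddef>
           | #include <vector>
           | 
           | // Find the natural run starting at lo: either weakly
           | // ascending or strictly descending (then reverse it).
           | // Returns one past the end of the run.
           | std::size_t count_run(std::vector<int>& v, std::size_t lo) {
           |     std::size_t hi = lo + 1;
           |     if (hi == v.size()) return hi;
           |     if (v[hi] < v[lo]) {
           |         while (hi + 1 < v.size() && v[hi + 1] < v[hi]) ++hi;
           |         std::reverse(v.begin() + lo, v.begin() + hi + 1);
           |     } else {
           |         while (hi + 1 < v.size() && v[hi + 1] >= v[hi]) ++hi;
           |     }
           |     return hi + 1;
           | }
           | 
           | int main() {
           |     std::vector<int> v{5, 4, 3, 1, 2};
           |     // reverses the descending run {5,4,3,1}; returns 4
           |     std::size_t end = count_run(v, 0);
           |     (void)end;
           | }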
        
         | wongarsu wrote:
         | I think only having "sort" and "stable sort" makes sense for
         | the same reason there's only one hashmap or one vector in most
         | standard libraries: 99% of the time you just want a high-
         | quality implementation of one reasonable choice, without
         | getting into the weeds of it.
         | 
         | Most of the time I don't care if it's timsort or quicksort, or
         | if the hashmap uses Cuckoo hashing or buckets made of linked
         | lists. Of course there are the 1% of cases where I notice that
         | my sort or hashmap is a performance bottleneck and that a
         | different algorithm might perform better. But standard
         | libraries aren't good at serving that need because they have to
         | be broadly applicable and conservative; if I already know that
         | I want top performance for specific requirements I'm almost
         | certainly better served by an external library.
        
         | sam0x17 wrote:
         | I think this goes back to the whole idea of non-CS people being
         | the primary users of standard libraries.
         | 
         | I agree though that there should be the ability to specify which
         | one, and every function in the standard library should have big
         | O notation in the function description. Still makes sense to
         | have a general-purpose good one as the default.
        
         | yupper32 wrote:
         | There are defaults for all sorts of functions. find() functions
         | on an array in many languages use some default. Parameters
         | often have defaults when not set. Strings often have a default
         | encoding.
         | 
         | The vast majority of the time, the default works well enough.
         | Why make things more complicated than they need to be?
         | 
         | It also has the added benefit of helping more junior engineers
         | when they read the code. "Cycle sort? What is that doing?"
         | "Insertion sort? I was taught that was inefficient! I should
         | spend time updating this code to a better sorting algorithm! My
         | tech lead would love my initiative."
         | 
         | Or we just use sort(), and we all move on with our lives.
        
         | tsimionescu wrote:
         | You are thinking about the weeds of sorting now, but that is
         | rarely relevant to the problem at hand.
         | 
         | You can (and most people do) look at it from the other angle.
         | Sort(seq) is a function which returns a permutation of seq such
         | that every element is less-than-or-equal-to the next element.
         | Usually, we use this function because we need this property to
         | hold in our following code. A lot of the time, this function is
         | called for very short sequences, where even the most naive
         | implementation would be fast enough for even the highest-scale
         | program. So, in the vast majority of code we only care about
         | the semantics of std::sort; and all of the above have the same
         | semantics.
         | 
         | Indeed, the fact that we often do get a stable sort as well
         | shows that semantics are the most important characteristic, as
         | stable and unstable sorts do have slightly different semantics.
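         | 
         | Concretely, the contract most calling code relies on is just
         | this (a toy illustration in C++, since that's the topic of the
         | article):
         | 
         | #include <algorithm>
         | #include <cassert>
         | #include <string>
         | #include <utility>
         | #include <vector>
         | 
         | int main() {
         |     std::vector<std::pair<int, std::string>> v{
         |         {2, "first 2"}, {1, "one"}, {2, "second 2"}};
         |     auto by_key = [](const auto& x, const auto& y) {
         |         return x.first < y.first;
         |     };
         | 
         |     auto s = v;
         |     std::stable_sort(s.begin(), s.end(), by_key);
         |     // The semantics every caller relies on:
         |     assert(std::is_sorted(s.begin(), s.end(), by_key));
         |     assert(std::is_permutation(s.begin(), s.end(), v.begin()));
         |     // The extra guarantee only a stable sort makes:
         |     // equal keys keep their original relative order.
         |     assert(s[1].second == "first 2");
         |     assert(s[2].second == "second 2");
         | 
         |     auto u = v;
         |     std::sort(u.begin(), u.end(), by_key);
         |     // The first two assertions hold for u as well, but the
         |     // relative order of the two key-2 elements is unspecified.
         | }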
        
       | pikseladam wrote:
       | wow. that was really impressive. scale is a different kind of
       | monster. it is like an end-game boss. Thank you for sharing it. I
       | learned a lot.
        
       | tialaramex wrote:
       | Fun safety problems.
       | 
       | Towards the end of this post, the author mentions a bunch of
       | places where std::sort() blows up if your data or your sort rule
       | doesn't exhibit strict weak ordering. In some cases you get a bad
       | sort, which feels fair enough since you violated the rules the
       | sort algorithm is relying on, but in others it literally has
       | Undefined Behaviour in C++.
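       | 
       | For anyone who hasn't run into it, the classic ways to get there
       | look something like this (my own toy example, not code from the
       | post):
       | 
       | #include <algorithm>
       | #include <cmath>
       | #include <vector>
       | 
       | int main() {
       |     // A "<=" comparator is not a strict weak ordering:
       |     // comp(x, x) must be false, but here it's true.
       |     auto bad = [](int x, int y) { return x <= y; };
       |     std::vector<int> a{3, 1, 2, 2};
       |     // std::sort(a.begin(), a.end(), bad);  // UB
       | 
       |     // NaN under operator<: NaN compares neither less than,
       |     // greater than, nor equivalent to anything, so '<' is
       |     // not a strict weak ordering over this data either.
       |     std::vector<double> b{1.0, std::nan(""), 0.5};
       |     // std::sort(b.begin(), b.end());  // UB / garbage order
       |     (void)bad;
       | }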
       | 
       | Notably Rust's sort() and sort_unstable() do not have this
       | problem, because of course these are safe Rust, and safe Rust
       | doesn't have Undefined Behaviour. But why can this work?
       | 
       | First up, one bullet which gets dodged is that Rust's floating
       | point types (f32 and f64) are not Ord. Rust acknowledges that NaN
       | != NaN and therefore we can't expect to go around sorting a slice
       | of floating point values. In C++ that's on the programmer, in C++
       | the fact you can write a < b means that type counts as ordered,
       | even though it trivially isn't for several built-in types. So if
       | you simply naively try to sort an [f32] Rust says you can't
       | because it isn't Ord, you will need to write a lambda or whatever
       | to Order these f32s and then if some of them are NaN your lambda
       | is responsible for deciding what to do about that.
       | 
       | But even if we insist on getting in harm's way, safe Rust
       | promises to protect us anyway. For example, if I make an array
       | of a thousand misfortunate::OnewayGreater<u8>s, which are only
       | one byte each yet all insist they're Ordered greater than each
       | other and even than themselves, Rust's sort() doesn't cause
       | Undefined Behaviour on this input. It may never succeed, since
       | all the elements to be "sorted" insist they're greater than each
       | other, but it definitely won't segfault, trash unrelated memory
       | or cause chaos.
        
         | nullc wrote:
         | > or cause chaos
         | 
         | Running forever is pretty chaotic.
        
           | ElectricalUnion wrote:
           | Running forever is pretty stable. Halting is chaotic.
        
             | tialaramex wrote:
             | A trivial infinite loop is _Undefined Behaviour_ in C++,
             | but it is a completely reasonable thing to write in Rust
             | and works fine; in fact Rust's underlying loop _is_ an
             | infinite loop unless you explicitly provide some way to
             | break out of it, which you usually will.
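             | 
             | For reference, the C++ side of that relies on the
             | forward-progress rules; a toy example (nothing from
             | the article):
             | 
             | // No I/O, no volatile access, no atomic or
             | // synchronization operation: C++ lets the compiler
             | // assume this loop terminates, so if it never does,
             | // the behaviour is undefined. (Rust's loop {} has
             | // no such rule and is well defined.)
             | int main() {
             |     for (;;) { }
             | }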
        
       ___________________________________________________________________
       (page generated 2022-04-20 23:00 UTC)