[HN Gopher] Changing std::sort at Google's scale and beyond
___________________________________________________________________

Changing std::sort at Google's scale and beyond

Author : ashvardanian
Score  : 352 points
Date   : 2022-04-20 15:57 UTC (7 hours ago)

(HTM) web link (danlark.org)
(TXT) w3m dump (danlark.org)

| jeffbee wrote:
| Huh. If I found that Reflection::ListFieldsMayFailOnStripped was a frequent caller of std::sort, to an extent that it was actually costing measurable CPU time, the first thing I would do is find and fix all callers of _that_.

| gipp wrote:
| "Fix" how exactly? I'm not quite sure what you're suggesting.

| jeffbee wrote:
| C++ proto reflection is a convenience. If it actually shows up at scale in production, that smells like a design error.

| moultano wrote:
| All things show up at scale in production at Google.

| jeffbee wrote:
| In one sense yes, but in another sense even though Google is large their threshold for something being worth fixing based on efficiency justifications is correspondingly large.

| foota wrote:
| Frequently though, you're dealing with death by a thousand paper cuts. Perhaps that call comes from some common library whose implementation requires reflection.

| moultano wrote:
| It isn't though. A typical developer will never care about 0.1% of unnecessary CPU usage in their application, but at Google, depending on what component it's in, that can easily be a million-dollar find.

| usefulcat wrote:
| If you're measuring in terms of time, electricity, or hardware, then the threshold would tend to be lower for Google, not higher. A 0.1% savings on a million machines is ~1 million times more valuable than a 0.1% savings on a single machine.
| OTOH, if the measurement is risk of breaking something, then it's not nearly so clear, since larger organizations tend to have more software and hardware, and therefore more opportunities for things to get broken.

| jeffbee wrote:
| You're saying the opposite of what the other guy said, though. They said that at scale small things are big. I'm saying that at scale .001% is still .001%.

| jcelerier wrote:
| > In one sense yes, but in another sense even though Google is large their threshold for something being worth fixing based on efficiency justifications is correspondingly large.
| not everything is a linear relationship

| pierrebai wrote:
| In the first example of problematic calls to nth_element, there is another bug: if the container contains nothing, it will call nth_element with begin(), begin() - 1, ...
| Also, typo: "compare functions much comply" -> "compare functions must comply"

| timmytokyo wrote:
| "It took us around 20 years to find something really appropriate, thanks to Andrei Alexandrescu."
| The way this is written, it sounds as if Andrei Alexandrescu is responsible for the 20-year delay.

| lauv0x wrote:

| samueloph wrote:
| I'm starting to question my understanding of algorithmic complexity. The author says things like "complexity being worst case O(n log n)" and seems to use big O notation for average cases as well.
| I've learned that big O notation is always and exclusively about the worst case; for average and best cases we use Theta (Θ) and Omega (Ω), respectively.
| Do people not follow these definitions strictly? Has the author made a mistake, or am I missing something?
| I did experience an interviewer once saying that my complexity analysis was wrong because the worst case would be extremely rare to hit. I did try to explain nicely that big O notation is always about the worst case, and that they were thinking of big Theta instead; now I wonder if I was wrong.

| waynecochran wrote:
| You are mixing two things together. The "average/expected case" and the "worst case" are two different functions, each with their own O, Theta, Omega.

| Corence wrote:
| O, Theta, and Omega are independent from best, worst, and average case. You can mathematically establish upper, lower, or exact bounds on any of the average, best, or worst cases.
| It is perfectly valid to say "the best case is O(n)", which means the best case scales no worse than linearly. The "no worse" here is not describing the best case, but rather the strictness of your bound. You could say a sort algorithm is O(n!) since, yes, it does scale no worse than n!, but it's not particularly helpful information.
| Big O notation is frequently used imprecisely (see also people saying a dataset is O(millions) to describe scale).

| SQueeeeeL wrote:
| You should look into amortized complexity. That's the formalized name for worst-case runtime over multiple trials.

| samhw wrote:
| The other replies are great, but just to put it concisely: big O is 'worst case' in terms of input _size_, while the 'best case big O' is 'best case' in terms of the input _contents_ ('shape').

| bsedlm wrote:
| In my opinion, nothing about big O and the related Omega, Theta, etc. is at all strict.
| Maybe I value "pure symbolic clarity" too much. I do see a very _pragmatic_ usefulness to the complexity notation.

| tempnow987 wrote:
| I always used big O for the worst case (by default). I did think that was common.
| Theta and Omega I use differently, however.
| I sometimes say, best case if X then it's O(n) or whatever.

| larsrc wrote:
| There are variations; that's what we learned back at uni. Expected O(f(n)), average O(f(n)), amortised O(f(n)), even expected amortised O(f(n)). "Expected" implied a randomised algorithm.

| vmsp wrote:
| In school, we used big O notation for all cases. I've never heard about Theta or Omega notation.
| Wikipedia's bubble sort page -- as an example -- also uses it for all cases.
| https://en.wikipedia.org/wiki/Bubble_sort

| rockinghigh wrote:
| You can see the list of notations here: https://en.wikipedia.org/wiki/Big_O_notation#Family_of_Bachm...

| dwohnitmok wrote:
| Big O, Theta, and Omega notations all ultimately denote sets of functions. So e.g. O(n) is the set of all functions which are eventually overtaken by some linear function.
| When we say "X is O(f(n))", what we really mean is "the function that characterizes X is contained within the set of functions denoted by O(f(n))." Anything that can be described by a function can hence be given big O, Theta, or Omega notation. So yes, worst-case time complexity can be described in this notation, but so can average time complexity.
| But we could also describe things that have nothing to do with time or space complexity with big O notation as well. As long as we've turned it into a function, we can describe it with big O notation.
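For reference, the formal definitions behind this subthread, as a minimal math block (standard asymptotic notation, not tied to any one comment above):

    f \in O(g)      \iff \exists c > 0, n_0 : f(n) \le c \cdot g(n) \text{ for all } n \ge n_0   (upper bound)
    f \in \Omega(g) \iff \exists c > 0, n_0 : f(n) \ge c \cdot g(n) \text{ for all } n \ge n_0   (lower bound)
    f \in \Theta(g) \iff f \in O(g) \text{ and } f \in \Omega(g)                                 (tight bound)

Each of the best-, average-, and worst-case running times is itself a function of n, and any of the three bounds can be applied to any of them.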
| londons_explore wrote:
| Computer scientists stole some notation from mathematicians, but then slightly changed and substantially slackened the definition...

| tsimionescu wrote:
| No, not really. CS and other branches of maths use the exact same definitions of O, o, Theta, and omega. It is true that O is used with a different definition in CS, but it's _stricter_, not more slack.
| As explained elsewhere though, they are not notations for algorithmic complexity, they are notations for function categorization. It just happens that a common use for this notation in CS is to categorize the function "best/average/worst/... case complexity of an algorithm", but they are also used for other purposes.

| ufo wrote:
| In my experience, a lot of the time when programmers say big O, they actually mean big Theta...

| tsimionescu wrote:
| Yes, I agree with that. However, f(n) = Theta(n) => f(n) = O(n), so they are usually not technically wrong, just unnecessarily broad.

| adrianN wrote:
| Programmers need not be very good computer scientists. The actual scientists publishing papers on algorithms usually get it right.

| throw149102 wrote:
| I like to use the metaphor of driving somewhere and trying to estimate when you're going to get there. Maybe you have a hotel reservation, and you want to make sure you don't get there too early nor too late. Then (with some times filled in for example):

    +----------------------------------------+---------------------+-------------------+-------------------------------+
    |                                        | Best case           | Average case      | Worst case                    |
    |                                        | (little/no traffic) | (average traffic) | (car accident, snow-in, etc.) |
    +----------------------------------------+---------------------+-------------------+-------------------------------+
    | Max time, upper bound (Big O)          | 30 minutes          | 45 minutes        | 1 hour 20 minutes             |
    | Both min and max, tight bound (Theta)  | 25 minutes          | 35 minutes        | 1 hour                        |
    | Minimum time, lower bound (Omega)      | 15 minutes          | 25 minutes        | 50 minutes                    |
    +----------------------------------------+---------------------+-------------------+-------------------------------+

| So then you can start asking questions like: if I assume I'm in the average case, and I can show up earliest at 20 minutes from now and latest at 50 minutes from now, can I guarantee that I will make it within that window? The answer in this case is yes, because our range is 25-45 minutes, so the earliest I can show up is in 25 minutes and the latest in 45 minutes.
| This also helps explain why Big O is generally more useful - we can almost always be early to something, but we can almost never be late to something. So showing up 5 minutes early usually is not a problem.
| It also shows why Theta is better than both - in this case, if we had this table we could completely ignore both the Big O row and the Omega row, because Theta is a better bound for both min and max time.
| Of course, in algorithmic complexity we replace the times with functions in terms of some variables, usually just _n_.

| tsimionescu wrote:
| Your example is nice, but extremely informal, as these are all constant numbers, while O/Theta/Omega (and o, omega, theta) are about functions.
| I think the most important thing that needs to be clarified, though, is that algorithmic complexity is first and foremost a function, `WorstCaseComplexity(size of input) = max number of steps for any input of that size`. This function is often then classed into a "complexity class" using O/Theta/Omega etc., but that is only done for the sake of easier comparison of complexities between algorithms.
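To make the function-vs-class distinction concrete, the standard textbook instance (classic quicksort with a fixed, e.g. first-element, pivot; well-known results quoted here purely for illustration):

    \mathrm{best}(n) = \Theta(n \log n), \qquad \mathrm{average}(n) = \Theta(n \log n), \qquad \mathrm{worst}(n) = \Theta(n^2)

Three different case functions, each with its own tight bound: "quicksort is O(n^2)" (a statement about worst(n)) and "quicksort is average-case Theta(n log n)" (a statement about average(n)) are both correct, independent statements.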
| powersnail wrote:
| I think you are mistaken here. The concept of worst/best/amortized case is different from O/Theta/Omega. The latter is about the direction in which it is bounding.
| Big O/Theta/Omega denote sets of functions, and each of them can be applied to best/worst/amortized complexity analysis.

| tsimionescu wrote:
| Worst case, median case, and best case complexity have nothing to do with O/Theta/Omega.
| When analyzing an algorithm, you want to compute the number of steps it has to execute in certain cases as a function of its input size(s). That function will be something like 2n^2 + 67n + 189 steps. Since this is too much information, you can then express it instead as O(n^2) (or O(n^100)), or Omega(n^2) (or Omega(25n)), or Theta(n^2).
| You can then also compute the best-case running time of the algorithm as a function of the size of the input, and get 2n^2 + 67n. Then you can say that the best-case running time is also O(n^2).
| Finally, you can (sometimes) define an "average" running time.
| Now, what do we mean by "shortest running time"? Is that not always for n=0 or something? No - the best-case running time has to do with the running time in a case where the input has a certain shape. For some sorting algorithms, for example, if the input is already sorted, they will only go through the list once and do nothing, so they have a best-case running time that is O(n). Other sorting algorithms will still perform steps even if the input is already sorted, so even their best-case running time may be O(n log(n)).
| Now, you can say that, given any possible input, the running time of the algorithm is O(worst-case running time) and Omega(best-case running time). You definitely CAN'T say in general that it is Theta(average-case running time), though, as it will sometimes run in time that is shorter than the "average-case running time" (for best-case scenarios), or sometimes it will run in time that is longer than the "average-case running time" (for worst-case scenarios).
| Basically, the worst/best/average case running times are an analysis of how properties other than the size of the input affect its running time.
| Big O notation and its friends have nothing to do with complexity analysis per se. They are a general notation for the asymptotic behavior of mathematical functions. Big O is usually understood as an upper bound on a function, big Omega is a lower bound, and Theta is a tight bound (a function in Theta(f) is by definition always within a constant factor k above or below f).
| Edit: let's take an example. Say we have the following algorithm:

    isAllOdd(array x):
        if x is empty: return true
        if x[0] is even: return false
        return isAllOdd(x[1:])

| In the best case, for any size array with the first element being even, this function finishes after executing 4 instructions: 2 comparisons, an array access and then a return. So its best-case complexity is best(n) = 4, which is O(1), Theta(1), Omega(1).
| In the worst case, all elements of the array are odd, so we have to go through all n elements to find that out. It will have to execute 6 instructions for each element in the array (2 comparisons, 2 array accesses, a call and a return), except for the last element, where it will execute only 2 instructions (check size then return). So its worst-case complexity function is worst(n) = 6(n-1) + 2, which is O(n), Theta(n), Omega(n).
| In the average case, for every call we are flipping a coin, and we have n total flips. Since we can assume the coin is fair, we expect to see an even number after 2 calls of isAllOdd. So, our average case complexity is a constant, average(n) ≈ 6*2 + 2 steps, which is O(1), Theta(1), Omega(1). (I am ignoring cases where n < 2, which are anyway irrelevant for asymptotic behavior.)
| Now, what is the overall complexity of isAllOdd? We can generalize the average case calculation if we want to get an exact value; but whatever that function, overall(n), is, we know for sure best(n) < overall(n) < worst(n); so overall(n) is O(n) and Omega(1). However, we can't define any Theta for it - there is no function from which it will differ by at most a constant factor for every n.
| Why do the detailed analysis instead of saying this is O(n) and being done with it? Well, perhaps we know that we are in fact calling this function with a large number of even numbers, so we would actually over-estimate its complexity. If we instead know we are calling it with a large number of odd numbers, we would also find out from the average case analysis that it would be better to define a reverse function, isAllEven, and compute isAllOdd = !isAllEven.
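The pseudocode above, rendered as compilable C++ (a direct transcription for illustration, with the same best/worst case behavior; the index parameter stands in for the x[1:] slice so that recursion makes no copies):

    #include <cstddef>
    #include <vector>

    // True iff every element of x[i..] is odd.
    // Best case (first element even): constant work.
    // Worst case (all elements odd): one full pass, linear work.
    bool isAllOdd(const std::vector<int>& x, std::size_t i = 0) {
        if (i == x.size()) return true;    // empty remainder: vacuously all odd
        if (x[i] % 2 == 0) return false;   // found an even element: stop early
        return isAllOdd(x, i + 1);         // tail-recurse on the rest
    }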
| karpierz wrote:
| Theta(X) is not "average case" behaviour; it means "both Omega(X) and O(X)".
| Ex: merge sort is Theta(n log(n)), but quicksort is not.
| The author is likely saying "worst case O(n log(n))" as redundancy, to emphasize that this is worst-case behaviour.

| tsimionescu wrote:
| It's not redundant. You can have "average case O(n log(n))" but "worst case O(n^2)", like quicksort does. Depending on the pivot function, that "worst case" can be more or less common.

| karpierz wrote:
| It is not redundant to say "average case O(n log(n))". It's not formally a correct usage of O(). But colloquially, it means that the average number of operations over all possible inputs of length n is X.
| It is redundant to say "worst case O(n^2)". By definition, O(X) means that the maximum of the number of operations over all possible inputs of length n is at most X.

| tsimionescu wrote:
| > It is not redundant to say "average case O(n log(n))". It's not formally a correct usage of O().
| You are wrong. It is perfectly formally correct to say "average case complexity is 2n log(100n) + 78n + 90, which is O(n log(n))".
| > By definition, O(X) means that the maximum of the number of operations over all possible inputs of length n is at most X.
| No, that is, in fact, a colloquial use of O(X). O(X), formally, is simply the set of all functions that are no bigger than X, up to some constant factor. It has nothing to do with the number of operations in itself.
| Instead, we formally define complexity functions for an algorithm, which are the minimum/average/maximum number of operations the algorithm will execute for any input of size n; call these functions Cmin(n), Cavg(n), Cmax(n). Then, we can say that C_(n) is in O(X(n)) (or, as a shorthand, C_(n) = O(X(n))) if there exists a number k such that C_(n) < k * X(n) for any n > n0.
| So, we can say that the best-case complexity of an algorithm, Cmin(n), is Omega(n^2); this means that even for the best possible input of length n, it will execute at least k * n^2 steps, for some k. Or, we can say that the worst-case complexity of some algorithm, Cmax(n), is O(n^2); that is, even for the worst-case input of length n, it will execute at most k * n^2 steps, for some k.
| In fact, the concept "number of steps that an algorithm will execute for any input of size n" is not a mathematical function at all (unlike the min/max/avg number of steps), since there isn't, in general, a unique such number for any given size.
| So, I would argue that saying "complexity O(n^2)" is in fact ill-defined formally, and that, if you want to be formal, you _must_ specify "worst-case complexity O(n^2)".

| stonemetal12 wrote:
| I would argue it isn't perfectly formally correct, because "average case complexity" isn't well defined. Is "average" here mean, median, or mode? It is typically used as if it meant the 90th percentile, or all but a few degenerate cases.
| But that is a bit nit-picky.

| Kranar wrote:
| You are incorrect about this. Worst case and big O are independent of one another. The O, Omega, Theta, etc. are about describing bounds, where O(f) is the set of functions asymptotically bounded above by f, Omega(f) is the set of functions asymptotically bounded below by f, and Theta(f) is the intersection of O(f) and Omega(f).
| Typically f is used to describe the space or time complexity of an algorithm, but it can be used for any function that maps positive integers to the non-negative real numbers.
| You can make any combination of best case/worst case/average case with O, Omega, Theta. In fact, you can even consider other cases as well, such as amortized cases or 90th-percentile cases; any case you want to investigate that can be defined as a function from positive integers to non-negative real numbers can be classified as having an asymptotic upper bound, lower bound, or tight bound.

| karpierz wrote:
| Yeah, okay. You are correct here.
| "Average-case of algorithm X" can be a function (average number of operations for all possible inputs of length n). And asymptotic bounds can be applied to arbitrary functions from R -> R.

| dudus wrote:
| I remember from college there was big-Oh `O(x)` and little-Oh `o(x)`. Is Omega/Theta the same as little-Oh?

| gwern wrote:
| I'd like to hear more about the reinforcement learning part. The patch literally just says 'Reinforcement Learning'. Not very helpful!

| hbn wrote:
| Sorry if this seems pedantic in the grand scheme of the topic, but is the indentation in the code samples messed up, or is there some kind of logic to it? I've never seen anything like that, and I find it pretty difficult to read.

| bogwog wrote:
| Looks normal to me (except it seems like some snippets indent with 2 spaces while others indent with 4).
| The line spacing is weird, though.

| danlark wrote:
| Sorry, it definitely was a little inconsistent, as I stored all snippets without proper formatting, and at the same time I tried to align the text for phones, etc.
| Failed at everything. Will improve.

| evmar wrote:
| The "use ASLR for random seed" trick in there is pretty cute, but I noticed it will just silently not work in non-ASLR settings, and in particular under Wasm.
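The trick evmar refers to boils down to deriving entropy from where the process happens to be loaded. A minimal sketch of the idea (my illustration, not the actual library code):

    #include <cstdint>

    // Under ASLR the address of a function differs from run to run, so it
    // makes a cheap, zero-cost "random" seed. Without ASLR (static
    // binaries, many Wasm hosts) the address is identical every run and
    // the seed is silently constant - exactly the failure mode noted above.
    std::uintptr_t aslr_seed() {
        return reinterpret_cast<std::uintptr_t>(
            reinterpret_cast<void*>(&aslr_seed));
    }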
| forrestthewoods wrote:
| An interesting takeaway here is that custom comparators are really hard to write correctly. There are a lot of constraints that aren't obvious.
| I suspect there should be a new std::verify_weak_ordering to help people write tests that perform n^3 tests on potential data to ensure a comparator is safe to use.
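A brute-force checker in that spirit might look like the following sketch (verify_strict_weak_ordering is a hypothetical name, not an existing std facility). It tests irreflexivity, asymmetry, and transitivity - of both the ordering and its induced equivalence - over all triples, hence the n^3 cost mentioned above:

    #include <cstddef>
    #include <vector>

    // Exhaustively check that `less` is a strict weak ordering on `v`.
    // O(n^3): fine for small test inputs, not for production data.
    template <typename T, typename Cmp>
    bool verify_strict_weak_ordering(const std::vector<T>& v, Cmp less) {
        const std::size_t n = v.size();
        auto equiv = [&](const T& a, const T& b) {
            return !less(a, b) && !less(b, a);
        };
        for (std::size_t i = 0; i < n; ++i) {
            if (less(v[i], v[i])) return false;   // irreflexivity
            for (std::size_t j = 0; j < n; ++j) {
                if (less(v[i], v[j]) && less(v[j], v[i]))
                    return false;                 // asymmetry
                for (std::size_t k = 0; k < n; ++k) {
                    if (less(v[i], v[j]) && less(v[j], v[k]) && !less(v[i], v[k]))
                        return false;             // transitivity of less
                    if (equiv(v[i], v[j]) && equiv(v[j], v[k]) && !equiv(v[i], v[k]))
                        return false;             // transitivity of equivalence
                }
            }
        }
        return true;
    }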
| jcelerier wrote:
| MSVC's and GCC's standard library implementations both have a debug mode which has been checking that for almost two decades.

| DrBazza wrote:
| Quite a long, interesting read. Unless I missed it, I recall that there are optimal solutions for sorting up to 7?? items. I can't find the reference, but unsurprisingly I read it in a dlang forum post by Andrei Alexandrescu. Anyone have a link to that aspect of sorting?

| tialaramex wrote:
| Under the section heading "Reinforcement learning for small sorts", there are diagrams of the small sorts, for 3, 4 and 5 items.
| They also explain that down here you're at the mercy of the physical hardware. Emitting one fewer x86 instruction does not necessarily make your program faster; still, if it's not slower, then smaller programs are good too.

| westurner wrote:
| > LLVM history: _Back then we recognized some really interesting benchmarks and we didn't recall anybody trying to really benchmark sorts on different data patterns for standard libraries._
| Timsort https://en.wikipedia.org/wiki/Timsort :
| > _In the worst case, Timsort takes O(n log n) comparisons to sort an array of n elements. In the best case, which occurs when the input is already sorted, it runs in linear time, meaning that it is an adaptive sorting algorithm. [3]_
| > _It is advantageous over Quicksort for sorting object references or pointers because these require expensive memory indirection to access data and perform comparisons and Quicksort's cache coherence benefits are greatly reduced._ [...]
| > _Timsort has been Python's standard sorting algorithm since version 2.3 [~2002]. It is also used to sort arrays of non-primitive type in Java SE 7,[4] on the Android platform,[5] in GNU Octave,[6] on V8,[7] Swift,[8] and Rust.[9]_
| Sorting algorithm > Comparison of algorithms: https://en.wikipedia.org/wiki/Sorting_algorithm#Comparison_o...
| Schwartzian transform https://en.wikipedia.org/wiki/Schwartzian_transform :
| > _In Python 2.4 and above, both the sorted() function and the in-place list.sort() method take a key= parameter that allows the user to provide a "key function" (like foo in the examples above). In Python 3 and above, use of the key function is the only way to specify a custom sort order (the previously supported cmp= parameter that allowed the user to provide a "comparison function" was removed). Before Python 2.4, developers would use the lisp-originated decorate-sort-undecorate (DSU) idiom,[2] usually by wrapping the objects in a (sortkey, object) tuple._
| Big-O Cheatsheet: https://www.bigocheatsheet.com/
| Quicksort in Python 7 ways (and many other languages) on RosettaCode: https://rosettacode.org/wiki/Sorting_algorithms/Quicksort#Py...

| dralley wrote:
| IIRC Rust and Swift have a modified timsort implementation that removes the "galloping" strategy.

| tialaramex wrote:
| Rust has "an adaptive, iterative merge sort inspired by timsort" to implement the stable sort() method if you have an allocator, and a pattern-defeating quicksort to implement sort_unstable(), which is provided even if you don't have an allocator (no_std embedded environments).
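The same stable/unstable split exists in C++ as std::stable_sort vs. std::sort. A small example of where the difference is observable (Record is my own illustrative type, not from the thread):

    #include <algorithm>
    #include <string>
    #include <vector>

    struct Record {
        std::string name;
        int score;
    };

    int main() {
        std::vector<Record> v = {{"a", 2}, {"b", 1}, {"c", 2}, {"d", 1}};
        // Stable: equal scores keep their original relative order,
        // so the result is b, d, a, c.
        std::stable_sort(v.begin(), v.end(),
                         [](const Record& x, const Record& y) {
                             return x.score < y.score;
                         });
        // std::sort would also order by score, but makes no promise
        // about whether b precedes d, or a precedes c.
    }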
| danlark wrote:
| Thanks, I'll remove the note about "recalling".

| westurner wrote:
| xeus-cling didn't exist back then: https://github.com/jupyter-xeus/xeus-cling#a-c-notebook
| https://github.com/fffaraz/awesome-cpp#debug
| A standard way to benchmark and chart {sorting algorithms, web framework benchmarks,} would be great.
| The TechEmpower "framework overhead" benchmarks might have at least average-case sorting in there somewhere: https://www.techempower.com/benchmarks/#section=data-r20&hw=...

| westurner wrote:
| awesome-algorithms: https://github.com/tayllan/awesome-algorithms#github-librari...
| awesome-theoretical-computer-science > Machine Learning Theory, Physics; Grover's; and surely something is faster than Timsort: https://github.com/mostafatouny/awesome-theoretical-computer...

| samhw wrote:
| Christ, the description on that second repo is so _comically terribly_ spelled that I can't figure out whether it's a joke:
| > The interdicplinary of Mathematics and Computer Science, Distinguisehed by its emphasis on mathemtical technique and rigour.
| (Perhaps a witty satire on the reputation of mathmos for being godawful at anything literary...)

| chkas wrote:
| I don't understand why you would need Timsort. If the pivot element is chosen randomly, Quicksort is always (or with probability bordering on certainty) O(n log n).
| Update: I confused Timsort with Introsort. I should have googled before posting ...

| masklinn wrote:
| > I don't understand why you would need Timsort. If the pivot element is chosen randomly, Quicksort is always (or with probability bordering on certainty) O(n log n).
| For starters, it's a stable sort, which excludes quicksort right out the gates.
| Second, the primary hypothesis of timsort is that real-world data is non-random: timsort is biased towards partially or almost-sorted sequences, which it can sort in as low as O(n). That is the point westurner was making - timsort's creation is very much predicated on "really benchmark[ing] sorts on different data patterns".
| Third, for very short sequences (also a common situation) it's an insertion sort, which is extremely efficient due to its low overhead (even more so combined with the adaptivity which insertion sort also has).

| orlp wrote:
| > For starters, it's a stable sort, which excludes quicksort right out the gates.
| Negative, it excludes in-place quicksort, but out-of-place quicksort can be stable just fine.
| > Third, for very short sequences (also a common situation) it's an insertion sort, which is extremely efficient due to the low overhead (even more so combined with the adaptivity which insertion sort also has).
| Every modern quicksort implementation also does this (or a similarly efficient smallsort).

| superjan wrote:
| Hi, you are ignoring the second point, and for the other two points you are referring to particular extensions of quicksort. Quite possibly you can extend quicksort similar to how timsort extended mergesort, but if that were commonplace, nobody would care about timsort.

| orlp wrote:
| > and for the other two points you are referring to particular extensions of quicksort
| Maybe for the third point, but once again, every implementation does this. For the first point, absolutely not. Just because in-place partitioning is often the only variant described, out-of-place partitioning is absolutely just quicksort, not an extension.
| > Quite possibly you can extend quicksort similar to how timsort extended mergesort
| Please see my top comment in this thread - I have done exactly that. Or rather, I have extended powersort (which is very similar to timsort) with quicksort.

| loeg wrote:
| I ported Timsort to the Windows kernel back in college. Good times!

| Something1234 wrote:
| I'm confused: so the comparison function can modify the list? I thought most sorting algorithms assumed the contents of the list were constant. Why would I want my comparator to ever modify the list under sort?

| slavboj wrote:
| An "accessed at" field is a trivial use case.

| danlark wrote:
| You don't, it's just not disallowed :)

| infogulch wrote:
| "Because it prevents the user from writing a comparison function that modifies the list while sorting" is not a reason I would expect to prefer Rust, and yet here we are.

| mbrubeck wrote:
| It's possible in Rust too, but only for a subset of element types: those that support shared mutation (also known as "interior mutability"). For example, a type containing an `AtomicU32` or a `Mutex<T>` or a `Cell<T>` could have an `Ord::cmp` implementation that alters its value, and the compiler will not prevent this.

| orlp wrote:
| Yes, I had a nasty bug in an initial version of glidesort where I (naively) assumed that I could create a temporary copy of an element by doing a memcpy and call the comparison function on this temporary copy a bunch of times before throwing it away, since I assumed the comparison function would not mutate the copy. Unfortunately, interior mutability means this would lead to potential double drops, and it was just not sound.

| [deleted]

| jmalicki wrote:
| If you examine the modification more closely, it is just finding the list that would exhibit worst-case performance - once created, that list would always exhibit the worst case as a constant list.
| Once you have the resulting list, the indices are sorted; then assigning each num_solid value to that index would result in the list being constant, which could then be sorted to give such worst-case behavior.
| I wish the article were clearer on this.

| orlp wrote:
| Hi all, I'm the author of pdqsort, which is credited in the post (to be clear: the authors acknowledged pdqsort at the bottom; I am _not_ associated with the posted work directly). I recently gave a talk at my institute on efficient in-memory sorting and the ideas in pdqsort, in case you're interested in hearing some of the theory behind it all: https://www.youtube.com/watch?v=jz-PBiWwNjc
| Next week I will give another talk in the Dutch seminar on data systems design (https://dsdsd.da.cwi.nl/) on glidesort, a new _stable_ sorting algorithm I've been working on. It is a combination of an adaptive quicksort (like pdqsort, fully adaptive for many equal elements) and an adaptive mergesort (like timsort, fully adaptive for long pre-sorted runs). It is the first practical implementation of an algorithm I'm aware of that's fully adaptive for both. Like pdqsort it uses modern architecture-aware branchless sorting, and it can use an arbitrary buffer size, becoming faster as you give it more memory (although if given a constant buffer size it will degrade to O(n (log n)^2) in theory, in practice for realistic workloads it's just a near-constant factor (c <= ~3-5) slower).
| The source code isn't publish-ready yet; I still have to do some extra correctness vetting and testing, and in particular exception safety is not yet fully implemented. This is important because I wrote it in Rust, where we must always give back a valid initialized array, even if a comparison operator caused a panic. But I do have some performance numbers to quote that won't significantly change.
| For sorting 2^24 randomly shuffled distinct u32s using a buffer of 2^23 elements (n/2), glidesort beats Rust's stdlib slice::sort (which is a classic timsort, also using a buffer of n/2) by a factor of 3.5 times. When stably sorting the same numbers comparing only their least significant 4 bits, it beats stdlib slice::sort by 12.5 times, using 6.5 times fewer comparisons - both numbers on my Apple M1 Macbook. All of this is just single-threaded code with a generic comparison function. No SIMD, no threads, no type-specific optimizations.
| Finally, glidesort with a buffer size of >= n/2 is _faster_ than pdqsort.

| mlochbaum wrote:
| Any chance you could comment on fluxsort[0], another fast quicksort? It's stable and uses a buffer about the size of the original array, which sounds like it puts it in a similar category as glidesort. There are benchmarks against pdqsort at the end of that README; I can verify that it's faster on random data by 30% or so, and the stable partitioning should mean it's at least as adaptive (but the current implementation uses an initial analysis pass followed by an adaptive mergesort, rather than optimistic insertion sort, to deal with nearly-sorted data, which IMO is fragile). There's an in-place effort called crumsort[1] along similar lines, but it's not stable.
| I've been doing a lot of work on sorting[2], in particular working to hybridize various approaches better. Very much looking forward to seeing how glidesort works.
| [0] https://github.com/scandum/fluxsort
| [1] https://github.com/scandum/crumsort
| [2] https://mlochbaum.github.io/BQN/implementation/primitive/sor...

| orlp wrote:
| I have stolen a lot of ideas from scandum and extended his ideas in new ways. He is definitely a mad genius. Glidesort (on my M1 machine at least) matches fluxsort within a couple percent for random data, but glidesort is robust in that it will always take advantage of pre-sorted runs and many equal elements (at least if it has buffer memory), no matter where they are in the array.
| In particular, I was inspired by three things from scandum:
| 1. fluxsort's out-of-place stable partitioning. From this I was reminded that not only is out-of-place stable partitioning a thing, it's highly competitive. I've always had this as an idea in the back of my mind, but never went through with it because I kept getting discouraged by C++'s distinction between moving to uninitialized memory and moving into a moved-from value (which is why I implemented glidesort in Rust).
| 2. quadsort's "ping-pong merge", which reduces unnecessary memcpys by merging both on the way _out of_ and on the way _into_ the original array. I did have this idea before, but always dismissed it because I thought keeping track of what's where would be a massive pain. Simply waiting until there are 4 things to merge eliminates this problem and is just genius.
| 3. quadsort's "branchless parity merge", which merges from both ends of the array if the merge is perfectly balanced. I make no claim that I thought of this; it's just genius. I had two key takeaways from this: you can make some very fast small sorting algorithms with merges, and interleaving loops to reduce data dependencies is significantly faster.
| So I combined #1 & #3 into what I call bidirectional stable partitioning, where I partition from _both_ sides of the array into an out-of-place buffer through interleaved loops.
| I extended the adaptiveness and applicability of #2 heavily by replacing the 'merge' operation in powersort (https://arxiv.org/abs/1805.04154) with a 'virtual merge' operation that delays merges until necessary. This is also what allows me to use quicksort in a bottom-up adaptive mergesort, because I don't eagerly sort small runs! Instead I simply keep unsorted runs around, 'merging' unsorted runs simply by concatenating them - purely in bookkeeping.
| I heavily extended #3 for the mergesort part by realizing we don't need perfectly balanced merges: we can just take the `min` of the two runs and start off with a merge from both sides, and then look further. I also did more interleaving by doing a binary search to compute independent parallel merges where sensible, and interleaving those loops.
| As a quick preview, here is a visualization of glidesort using a buffer size of n/2, where I have artificially limited the concatenation of unsorted runs to n/8 so that it won't just look only like quicksort, and both the quicksort and mergesort aspects are shown: https://cdn.discordapp.com/attachments/273539705595756544/96...
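For flavor, the core step of a branchless merge looks something like the following (a generic sketch of the well-known technique, not quadsort's or glidesort's actual code). The point is that every iteration executes the same instructions regardless of the comparison result, so there is no hard-to-predict branch:

    #include <cstddef>

    // Merge sorted runs a[0..an) and b[0..bn) into dst, branchlessly.
    // Taking from a on ties keeps the merge stable.
    void branchless_merge(const int* a, std::size_t an,
                          const int* b, std::size_t bn, int* dst) {
        while (an > 0 && bn > 0) {
            bool take_b = *b < *a;        // 0 or 1; no jump needed
            *dst++ = take_b ? *b : *a;    // typically compiles to a cmov
            a += !take_b; an -= !take_b;  // advance exactly one input
            b += take_b;  bn -= take_b;
        }
        for (; an > 0; --an) *dst++ = *a++;  // drain whichever run remains
        for (; bn > 0; --bn) *dst++ = *b++;
    }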
| mlochbaum wrote:
| Thanks for the discussion! Can't say I follow everything, but using a parity merge for part of an unbalanced merge makes a lot of sense, and that alone is worth it.
| Stepped through the video a few times at 1/4 speed. The n/8 thing is a bit confusing, first because I didn't read it and second because it makes it hard to tell a partition result from the beginning of the next segment. I think I can follow what's going on, but I don't get the purpose of the bidirectional partition. It doesn't use less memory, does it? So is there something to do with fitting in with mergesort better? I'm not familiar with powersort; I'll read up on it.

| orlp wrote:
| > but I don't get the purpose of the bidirectional partition. It doesn't use less memory, does it? So is there something to do with fitting in with mergesort better
| Nope, it does the same number of comparisons, same number of operations, same memory, etc. What it does do is allow you to interleave two independent loops, which is also what makes the parity merge fast (I think scandum misidentifies the loop unrolling as the cause of this effect for large arrays - you can loop-unroll merging large arrays either way - for small constant-size merging it is important, however).
| A modern CPU has a very long pipeline, and even though we like to pretend all instructions take one cycle with no latency, in reality there are real latencies to instructions and memory accesses, and multiple instructions can be executed in the same cycle. Since each iteration of the partition depends on the previous one (which pointer did we increment? we have to know before we can execute the next store), you can hide these latencies better if you interleave two independent loops. In addition, you can use more instruction-level parallelism.
| mlochbaum wrote:
| Got it. I'd considered that, but something about your explanation threw me off. A subtlety I missed was that you can't just do two forward partitions, because the second one begins exactly halfway through the array - one of the results would be placed there, but it's probably not the correct start location for the larger half-partition.

| orlp wrote:
| Exactly, which is why I call it bidirectional partitioning: one forward, one backward. It's a very strange case where you can use parallelism (in this case instruction-level parallelism), but only get two independent instances, without the ability to recurse further.
| You can of course make partitioning embarrassingly parallel - look at IPS4o for that. But it is vastly more complicated, and involves overhead shuffling blocks after the partition.

| hengri wrote:
| Random question, but is this from the Google bare metal team? I remember interviewing with them and the work they did was fascinating.

| orlp wrote:
| I'm not sure if you're asking me personally or about the original post, but to be clear: I'm not (currently) associated with Google, nor is my work on pdqsort or glidesort. I'm also not associated with the original post, other than that my work on pdqsort was credited at the bottom of the blog post, which is what my opening sentence was referring to.
| I edited my comment to reflect this; I can see how it could've been misinterpreted. It was not my intention to deceive.

| danlark wrote:
| Hi, the author is here.
| I am working on MapReduce/data-pipelining/Dataproc efficiency internally and externally. In cloud, you can check out Google Cloud Dataflow.
| We are working closely with the general efficiency and bare metal teams, yes :). They definitely do fascinating work. This one was my 20% project together with the Google efficiency team.

| DontGiveTwoFlux wrote:
| OP describes systems which seed tests with non-determinism to catch users who write tests relying on undefined behavior, such as the order of equal elements. Writing tests against these systems is a game changer (once you figure out what's going on and why there are flakes). I'm a huge fan of this subtle change, which has spared me at least one bug. It also allows library authors to avoid accidentally producing a behavior that induces Hyrum's law. Kudos to the team for making this work at scale.

| hu3 wrote:
| offtopic: Such a great engineer, and yet their blog says: Website Powered by WordPress.com.
| I find the pragmatism worthy of praise. I've been meaning to start blogging since forever, but I keep planning and researching blog platforms/tools and never get to the point.

| papito wrote:
| WP is a never-ending source of security vulnerabilities. 30% of the internet runs on it.
| I found Jekyll and never looked back.

| nvarsj wrote:
| That's why you use managed Wordpress, aka wordpress.com.

| sydthrowaway wrote:

| hatware wrote:
| Did one person design the SR-71...?

| sydthrowaway wrote:
| yes

| samhw wrote:
| In case you genuinely missed the point: he was saying that the person is a great engineer for _not_ wasting their time optimising irrelevant details like their blog. Time allocation is an optimisation problem like any other.

| [deleted]

| haolez wrote:
| Making use of existing tools and being able to manage their shortcomings is an important engineering skill.

| gzer0 wrote:
| The paradox of choice. I suffer from this way too often.
| https://en.wikipedia.org/wiki/The_Paradox_of_Choice

| a_e_k wrote:
| > _If you compare by a boolean variable, for example, partition data by existence of something, it's very tempting to write std::sort call._
| > _We saw a pattern of sort+resize a lot._
| These seem like good opportunities for clang-tidy checks.
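The sort+resize pattern, and the cheaper rewrite such a check could suggest (a generic sketch; assumes k <= v.size()):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Keep only the k smallest elements, in order.
    void top_k(std::vector<int>& v, std::size_t k) {
        // Pattern seen at scale: order everything, then throw most of it away.
        //   std::sort(v.begin(), v.end());
        //   v.resize(k);
        // Cheaper: only the first k elements need to be ordered,
        // O(n log k) instead of O(n log n).
        std::partial_sort(v.begin(), v.begin() + k, v.end());
        v.resize(k);
    }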
| techwiz137 wrote:
| Do you need to have learned computer science to grasp the topic at hand here?

| [deleted]

| dekhn wrote:
| Great fun read. I love this stuff; when I first learned C++ many years ago I didn't really know much about complexity, and what I have learned over the years is that complexity costs depend a lot on the underlying data distributions.

| seanp2k2 wrote:
| Thanks, I'm going to write all this out on a whiteboard the next time someone asks me to sort a linked list in a technical interview.
| Edit: just wanted to clarify the point I'm trying to make here: if you want to ask someone whether they've memorized the answer to that question, go ahead and ask it in a technical interview, because the _real_ well-considered answer looks like the linked article. My honest, non-smartass answer is that one should use the library function and not try to implement it themselves until they've done sufficient research into the problem to prove, to a reasonable degree, that they indeed do need to re-implement sort() and that it's in the best interests of the business to fund such a project.
| Would love to read this over the phone to a recruiter and have them tell me that it's not what they have on their sheet.

| 10000truths wrote:
| I never really understood why a standard library would insist on having a single "sort" function (or at best, a "sort" and "stable sort"). Why not just provide explicit quick sort, merge sort etc. functions and let users choose which they want? Sorting is one of the first things a budding computer scientist learns, and the research on sorting algorithms is well-trodden ground by now, so I don't see how it could be a usability issue. I'd like to be able to choose from these criteria without having to rely on external libraries:
| - How does it perform on average?
| - How does it perform on pathological inputs?
| - Is it stable?
| - Does it allocate?
| Not to mention constraints that an application can take advantage of:
| - What if the data is already mostly sorted? (insertion sort)
| - What if writes are very expensive? (cycle sort)
| - What if all the data are integers? (radix sort)
| And so on. Trying to force a one-size-fits-all solution seems rather silly at this point.
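As one concrete instance of those trade-offs, the "already mostly sorted" case: textbook insertion sort runs in O(n + d) time, where d is the number of inversions, so nearly-sorted input costs barely more than one pass. A minimal sketch:

    #include <cstddef>
    #include <vector>

    // Insertion sort: O(n + d), where d = number of inversions in v.
    void insertion_sort(std::vector<int>& v) {
        for (std::size_t i = 1; i < v.size(); ++i) {
            int x = v[i];
            std::size_t j = i;
            while (j > 0 && v[j - 1] > x) {  // shift larger elements right
                v[j] = v[j - 1];
                --j;
            }
            v[j] = x;  // drop x into its slot
        }
    }

This is also why the adaptive sorts discussed elsewhere in the thread (timsort, pdqsort) use insertion sort as their small-array fallback.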
| Hello71 wrote:
| while i agree with you to some extent, i don't think including a very large set of sorting functions in the standard library is a good idea. especially for dynamic languages, a larger standard library means more disk usage and worse cache utilization. it's also typically relatively easy to use your own sort function: in performance-critical cases, the sort function is usually called directly, not via intermediate libraries (which might need to be patched to call the new sort function); also, sort functions are self-contained, i.e. typically don't require system dependencies.

| samhw wrote:
| > a larger standard library means more disk usage and worse cache utilization
| Does it? How? I suppose one has to store the standard library on the machine, which does notionally use up disk space, but I can't see how it contributes by any reasonable performance-relevant meaning of 'disk usage' (presumably i.e. disk I/O during the runtime of the program). And for cache utilization I simply have no clue at all what you might be driving at.

| dijit wrote:
| There are such things as sensible defaults, though.
| Timsort being a somewhat famous example, which uses the fact that Python data structures have a known length to figure out the best sorting method for the size.
| If you need something more exotic then you likely know more about what you're doing, but most people writing software who need a sort function usually don't need more than timsort.

| morelisp wrote:
| > uses the fact that python datastructures are known length to figure out the best sorting method for the size.
| That... does not match my knowledge of how timsort works. The sorting method is chosen based on the array length, and every sorting algorithm in every language knows that. It's extremely suboptimal for some types because Python's typing is so weak!
| Timsort succeeds because it has many good heuristics to optimize common properties of its merge intervals.

| wongarsu wrote:
| I think only having "sort" and "stable sort" makes sense for the same reason there's only one hashmap or one vector in most standard libraries: 99% of the time you just want a high-quality implementation of one reasonable choice, without getting into the weeds of it.
| Most of the time I don't care if it's timsort or quicksort, or if the hashmap uses cuckoo hashing or buckets made of linked lists. Of course there are the 1% of cases where I notice that my sort or hashmap is a performance bottleneck and that a different algorithm might perform better. But standard libraries aren't good at serving that need, because they have to be broadly applicable and conservative; if I already know that I want top performance for specific requirements, I'm almost certainly better served by an external library.

| sam0x17 wrote:
| I think this goes back to the whole idea of non-CS people being the primary users of standard libraries.
| I agree, though: there should be the ability to specify which one, and every function in the standard library should have big O notation in the function description. It still makes sense to have a general-purpose good one as the default.

| yupper32 wrote:
| There are defaults for all sorts of functions. find() functions on an array in many languages use some default. Parameters often have defaults when not set. Strings often have a default encoding.
| The vast majority of the time, the default works well enough. Why make things more complicated than they need to be?
| It also has the added benefit of helping more junior engineers when reading the code. "Cycle sort? What is that doing?" "Insertion sort? I was taught that was inefficient! I should spend time updating this code to a better sorting algorithm! My tech lead would love my initiative."
| Or we just use sort(), and we all move on with our lives.

| tsimionescu wrote:
| You are thinking about the weeds of sorting now, but that is rarely relevant to the problem at hand.
| You can (and most people do) look at it from the other angle. Sort(seq) is a function which returns a permutation of seq such that every element is less-than-or-equal-to the next element. Usually, we use this function because we need this property to hold in our following code. A lot of the time, this function is called for very short sequences, where even the most naive implementation would be fast enough for even the highest-scale program. So, in the vast majority of code we only care about the semantics of std::sort; and all of the above have the same semantics.
| In fact, the fact that we do often get a stable sort as well shows that semantics are the most important characteristic, as stable and unstable sorts do have slightly different semantics.

| pikseladam wrote:
| wow. that was really impressive. scale is a different kind of monster. it's like an end-game boss. Thank you for sharing it. I learned a lot.

| tialaramex wrote:
| Fun safety problems.
| Towards the end of this post, the author mentions a bunch of places where std::sort() blows up if your data or your sort rule doesn't exhibit strict weak ordering. In some cases you get a bad sort, which feels fair enough since you violated the rules the sort algorithm is relying on, but in others it is literally Undefined Behaviour in C++.
| Notably, Rust's sort() and sort_unstable() do not have this problem, because of course these are safe Rust, and safe Rust doesn't have Undefined Behaviour. But why can this work?
| First up, one bullet which gets dodged is that Rust's floating point types (f32 and f64) are not Ord. Rust acknowledges that NaN != NaN, and therefore we can't expect to go around sorting a slice of floating point values. In C++ that's on the programmer; in C++ the fact you can write a < b means that type counts as ordered, even though it trivially isn't for several built-in types. So if you simply naively try to sort a [f32], Rust says you can't, because it isn't Ord; you will need to write a lambda or whatever to order these f32s, and then if some of them are NaN your lambda is responsible for deciding what to do about that.
| But even if we insist on getting in harm's way, safe Rust promises to protect us anyway. For example, if I make an array of a thousand misfortunate::OnewayGreater<u8>s, they're only one byte each, yet they all insist they're ordered greater than each other and even than themselves. Rust's sort() doesn't cause Undefined Behaviour on this input. It may never succeed, since all the elements to be "sorted" insist they're greater than each other, but it definitely won't segfault, trash unrelated memory, or cause chaos.
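The C++ side of that first point, as a sketch: operator< on floats compiles fine, but NaN breaks strict weak ordering (the induced "neither is less" equivalence is not transitive: 1 ~ NaN and NaN ~ 2, yet 1 < 2), so sorting such data with it is undefined behaviour. One defensible custom comparator totals the order by putting NaNs last:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    int main() {
        std::vector<float> v = {3.0f, NAN, 1.0f, NAN, 2.0f};

        // Undefined behaviour if v contains NaN - operator< is not a
        // strict weak ordering in the presence of NaN:
        //   std::sort(v.begin(), v.end());

        // Well-defined: NaNs form one equivalence class, greater than
        // every number.
        std::sort(v.begin(), v.end(), [](float a, float b) {
            if (std::isnan(a)) return false;  // NaN is never "less"
            if (std::isnan(b)) return true;   // any number beats NaN
            return a < b;
        });
        // v is now {1, 2, 3, NaN, NaN}
    }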
| nullc wrote:
| > or cause chaos
| Running forever is pretty chaotic.

| ElectricalUnion wrote:
| Running forever is pretty stable. Halting is chaotic.

| tialaramex wrote:
| A trivial infinite loop is _Undefined Behaviour_ in C++, but it is a completely reasonable thing to write in Rust and works fine; in fact, Rust's underlying loop _is_ an infinite loop unless you explicitly provide some way to break out of it, which you usually will.

___________________________________________________________________
(page generated 2022-04-20 23:00 UTC)