[HN Gopher] Beating TimSort at Merging ___________________________________________________________________ Beating TimSort at Merging Author : andrewstetsenko Score : 187 points Date : 2021-07-13 16:46 UTC (6 hours ago) (HTM) web link (earthly.dev) (TXT) w3m dump (earthly.dev) | danbruc wrote: | The first Python implementation is bad - removing the first | element in each iteration is O(n), the C implementation gets this | right by maintaining one index into each list instead of | modifying the lists. | XelNika wrote: | Yes, it says so in footnote 1 in the article. It's also not the | benchmarked code so it doesn't really matter. | danbruc wrote: | I missed the footnote. My point was more that one could have | compared this with the C implementation as heapq.merge is a | different algorithm and so there is no actual comparison of | the same algorithm in C and Python. | min-cut wrote: | Not only that, but also list(a + b) is producing two different | new lists since (a + b) produces a list and list() constructs a | new list copy. The benchmark would be faster if the OP just did | | def sort_test(): m2 = a + b; m2.sort() | | instead of | | def sort_test(): m2 = list(a + b); m2.sort() | | EDIT: It seems like OP fixed this issue in perf.py, but left it | in test.py | agbell wrote: | You can check it, but I'm pretty sure the difference is | negligible. But yeah, all benchmarks are using the code in | perf and the pop code is just to demonstrate. It is not | benched. | | Edit: dropping the extra list() from the blog code examples. | min-cut wrote: | I see about a 10% performance improvement on my local | machine on the input in test.py when not constructing the | unnecessary second list. | | I don't really buy that it reads nicer in prose since you | can just do (a + b).sort() if you want. Plus, I feel like | it's important for readability to not be unnecessarily | redundant. 
Having list(a + b) code also risks creating | misconceptions about how lists can be constructed and used | in Python. | agbell wrote: | Yeah, good point. I think you are right about it being | incorrect. Updating... | tim-peters wrote: | To get more gonzo, in CPython's list.sort(), the C code that | actually merges two lists (which happen, in context, to be two | contiguous array slices) is in listobject.c's `merge_at()` | function. That brings in the world of "galloping" (exponential | search) optimizations, much faster than one-pair-at-a-time | compares in many real-world cases. | | So that's a whole lot of additional complication, but it's the | heart of what _really_ makes timsort shine in its "jaw dropping" | cases. | | Tim (of "timsort" - which was an inside joke name at the start, | because I never expected anything outside of CPython would use it | ;-) ). | j-pb wrote: | Is there a reason to use galloping over an array vs a seek in a | search tree? | | Better real world factors? | | Galloping, Demaine set intersection, worst case optimal joins, | they all seem to be different aspects of the same underlying | principle. | | So I'm very curious about the peculiarities in the sorting | case. | borski wrote: | Gosh, comments like these from the actual author are a reminder | of why I love HN. :) | agbell wrote: | Wow. Thanks for writing it. I believe this is the part you are | referring to: | https://github.com/python/cpython/blob/main/Objects/listobje... | tim-peters wrote: | Right, `merge_at()`. But that in turn calls small mountains | of other C code. It is, alas, a complicated approach. Folding | it in would be substantially more work than you've already | done. | | The point: for inputs like your:
|
|     a = list(range(-100, 1700))
|     b = list(range(1400, 1800))
|
| it would first use a variant of binary search to find where | b[0] belongs in a. "Almost all of a" is consumed by that via | a relatively tiny number of compares.
It would similarly do a | binary-ish search to find where a[-1] belongs in b. The tail | end of b, from 1700 through 1799, then also never needs to be | looked at again. | | In the lucky case that two input lists don't overlap at all, | and a[-1] <= b[0], that's enough so that a "one-pair-at-a- | time" compare _never_ needs to be done. In context (the two | lists are contiguous slices of an array), no pointers at all | need to be copied in that case either - it deduces that the | combined array slices are already in sorted order in total | time O(log(len(a)) + log(len(b))), and it's done. | agbell wrote: | Very Cool! | | Edit: Ok, I get it now. Don't just pop, binary search into | it, so you can fast forward through. I'm not sure I'll get | around to implementing that, at least not in the immediate | future, but that makes total sense. | jhokanson wrote: | As someone that rarely works with Python lists, as opposed to | numpy arrays, I was pleasantly surprised to see numpy does what I | would expect in providing a mergesort option. I'm surprised | Python doesn't, other than via heapq and only implemented in | Python (according to my reading of the post and a very quick | Google search). | | Oops, just for fun the numpy documentation currently states: "The | datatype determines which of 'mergesort' or 'timsort' is actually | used, even if 'mergesort' is specified. User selection at a finer | scale is not currently available." Awesome ... | | Also, apparently mergesort may also be done using radix sort for | integers "'mergesort' and 'stable' are mapped to radix sort for | integer data types." | kzrdude wrote: | mergesort mandates that it's a stable sort, so I guess they get | away with replacing "like for like" that way. For varying | definitions of like. | masklinn wrote: | Why would Python provide a mergesort option when timsort was | the replacement for a standard mergesort? 
| | `heapq.merge` is not a mergesort, it's a merge of sorted | sequences (aka just the merge part of a mergesort). | TheRealPomax wrote: | Turns out that at least in terms of performance, everyone using | numpy is fine with this. It just needs to run fast enough, not | as fast as possible ;) | tehjoker wrote: | This is not true. I have written many libraries to improve on | the performance of numpy for image processing applications. | Sort backs many operations, such as unique, that result in | painfully slow code. | CogitoCogito wrote: | I too have written a lot of extension code to speed up | numpy code. It's often not even especially difficult since | any code at that level tends to be very algorithmic and | looks essentially the same if written in numpy or C++. | Having control of the memory layout and algorithms can be a | huge win. | | Of course I don't _usually_ do this, but that's just because | most of the code I write doesn't need it. But it's not at | all some crazy sort of thing to do. | toxik wrote: | Tangential but np.bincount is typically the fast version | of np.unique. Not entirely the same thing, but it's worth | knowing about it.
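The galloping trick tim-peters describes upthread (binary-search for where the head of one run belongs in the other, then copy the whole preceding run at once) can be sketched in pure Python using the stdlib bisect module. This is an illustrative simplification, not CPython's actual merge_at() C code, and the function name galloping_merge is made up here:

```python
from bisect import bisect_left, bisect_right

def galloping_merge(a, b):
    """Merge two sorted lists by copying whole runs at once.

    Instead of one compare per element, binary-search where the
    head of one list belongs in the other, then slice-copy the
    entire run that precedes that position.  A simplified sketch
    of the idea behind CPython's galloping merge, not the real
    implementation.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            # Every element of a up to b[j]'s insertion point can
            # be copied without further element-wise compares.
            # bisect_right keeps equal a-elements first (stable).
            k = bisect_right(a, b[j], i)
            out.extend(a[i:k])
            i = k
        else:
            # bisect_left stops before b-elements equal to a[i],
            # preserving stability in the other direction.
            k = bisect_left(b, a[i], j)
            out.extend(b[j:k])
            j = k
    # One list is exhausted; append the other's tail untouched.
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```

On the thread's example inputs (range(-100, 1700) and range(1400, 1800)), the first binary search swallows nearly all of a in one step, matching the behavior tim-peters describes.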
| dang wrote: | Past TimSort threads, for anyone interested: | | _Timsort, the Python sorting algorithm_ - | https://news.ycombinator.com/item?id=21196555 - Oct 2019 (131 | comments) | | _On the Worst-Case Complexity of TimSort_ - | https://news.ycombinator.com/item?id=17883461 - Aug 2018 (74 | comments) | | _Timsort is a sorting algorithm that is efficient for real-world | data_ - https://news.ycombinator.com/item?id=17436591 - July 2018 | (77 comments) | | _Functional verification with mechanical proofs of TimSort | [pdf]_ - https://news.ycombinator.com/item?id=9778243 - June 2015 | (1 comment) | | _Timsort_ - https://news.ycombinator.com/item?id=3214527 - Nov | 2011 (27 comments) | | _Java has switched from Mergesort to TimSort_ - | https://news.ycombinator.com/item?id=752677 - Aug 2009 (13 | comments) | rkalla wrote: | Excellent read. | thomas536 wrote: | Are there similar relatively simple methods or simd tricks for | multi-way merges? | dragontamer wrote: | I dunno about "multiway merge". But your standard 2-way merge | is SIMD-optimized by having a binary-search across the | "diagonals" of the so-called mergepath. | | Leading to O(lg(p * sqrt(2))) time for the longest comparison | diagonal, where "p" is the number of processors. Including | thread divergence, that's O(N/p * lg(p * sqrt(2))) total work | done, or assuming constant-p, O(N) of work per 2-way SIMD merge | (with a large "p" providing an arbitrarily large speedup: but | in practice is probably limited to 1024 for modern GPU | architectures. Since modern GPUs only really "gang up" into | 1024-CUDA thread blocks / workgroups efficiently) | | https://web.cs.ucdavis.edu/~amenta/f15/GPUmp.pdf | | This provides a natural way at allowing ~1024 CUDA threads in a | singular block to sort a list at O(n * log(n)) efficiently. | agbell wrote: | TimSort is cool. 
In its worst case it looks no better than merge | sort, but give it some partially ordered data, or small amounts of | data, where it uses insertion sort, and it really shines. | lyxell wrote: | Interesting read. This post seems to be written by Adam Gordon | Bell who is also the host of the excellent CoRecursive podcast. | Recently featured on HN in The Untold Story of SQLite [0]. | | [0]: https://news.ycombinator.com/item?id=27718701 | agbell wrote: | That's me. Thanks for reading it and listening to the podcast. | The SQLite episode is now nearly the most listened to, thanks | to hacker news. | | On merging lists: I was caught off guard by the recommendation to | use sort to merge sorted lists. Everyone I mentioned it to was | surprised as well so I thought it was worthy of some | investigation. This is my first C extension in Python and I got | help, so someone else might be able to do even better than this. | | It is interesting when our normal shorthands for thinking | about runtime complexity break down. | CogitoCogito wrote: | > It is interesting when our normal shorthands for thinking | about runtime complexity break down. | | Actually I think more interesting would be to write a python | version following the algorithm of your C extension (i.e. | only allow two lists, don't allow generators, don't return | generators, etc). You could probably do that quite quickly | and generate the same graphs. I would expect that python | version to lie somewhere between your C extension and the | heapq.merge() version which is more general. | | edit: If you were to do this, I wouldn't recommend using the | python version in your blog since it pops the lists from the | front. This seems to be O(n) in general: | | https://wiki.python.org/moin/TimeComplexity | | I did a 10 minute glance at the python source code and | couldn't totally verify this (it needs more than 10 | minutes...), but it makes sense for popping from anywhere but | the back to cause a list copy to occur.
Maybe this isn't | technically necessary when popping from the front, but I | couldn't verify it. Either way it's easy to avoid by just not | mutating the input lists anyway so I don't see a good reason | not to go that way. | agbell wrote: | So we could determine what gains I am getting are due to C | and the specific compares and which are due to just being 2 | lists? That would be an interesting test. | | Maybe I'll do a follow-up. Thanks for reading the article. | I suspect you know way more about C extensions in Python | than I do. I saw you have a talk on this topic. | | I think you are totally right that pop is not the way to | go, but using an offset to track the head, like the C | example does. I actually put that in a footnote, because I | was considering testing my python version but the SO answer | mentioned pop being expensive. | | I think the simple version is still great as pseudo-code | for communicating a solution though and hopefully it makes | the C code a bit easier to follow once you've seen the | python version. | CogitoCogito wrote: | Well the version I'm saying is essentially the same as | the pop version you have so I don't think it's too | complicated. You would just do something like this | instead:
|
|     def merge_sorted_lists(l1, l2):
|         sorted_list = []
|         i = 0
|         j = 0
|         while i < len(l1) or j < len(l2):
|             if i == len(l1):
|                 sorted_list.append(l2[j])
|                 j += 1
|                 continue
|             if j == len(l2):
|                 sorted_list.append(l1[i])
|                 i += 1
|                 continue
|             if l1[i] < l2[j]:
|                 sorted_list.append(l1[i])
|                 i += 1
|             else:
|                 sorted_list.append(l2[j])
|                 j += 1
|         return sorted_list
|
| I haven't tested that (you probably should before using | it), but it looks mostly right and is essentially the | same algorithm as yours. | forrestthewoods wrote: | Great post. I almost did a spit take when I read: | | > Timsort overcomes this disadvantage by being written in C | rather than Python. | | Yeah, Python is SLOW. Thankfully the author dug into C. That was | nice to see.
| | Edit: lol I have no idea why this is being downvoted. Y'all | weird. | ww520 wrote: | That's what I suspected when reading the first part of the post | where it states that heapq.merge() is slower than sort(). Hmm, | isn't it comparing apples to oranges since heapq.merge() is | implemented in Python and the system sort() would be in C? | CGamesPlay wrote: | I definitely used this to my advantage in ACM ICPC contests. Most | of the students and professors wrote the problems and benchmarked | against Java. However, on several problems, it boiled down to | "write this specific algorithm for this class of problem, or use | C." I once netted my team an extra problem after we timed out in | Java and I just rewrote the teammate's solution in C++. | peter422 wrote: | Outside of some very specific parsing problems where the Java | standard library was very helpful, writing solutions in C was | the way to go. | | Though while it could allow certain parts of the code to be | less efficient, if you had completely the wrong algorithm it | wasn't going to save you. | CogitoCogito wrote: | It seems like the built-in heapq.merge() should be as fast as the | implementation here. I mean if it's really just merging sorted | lists, why wouldn't it be as fast as this implementation? | | edit: Okay one reason is that heapq.merge is written in python: | | https://github.com/python/cpython/blob/6252670732c68420c2a8b... | | It might also be since the python version has more options (like | arbitrarily many iterables), but I presume it's the fact that the | extension is written in C. | typon wrote: | Because it's written in python | CogitoCogito wrote: | Re-writing it in C wouldn't automatically make much of a | difference. Of course a C implementation wouldn't be slower | (at minimum you could flatten out the byte-code sequence and | use the generated C api calls), but it wouldn't necessarily | be noticeably faster.
Given the features of the python | version, it might very well be the case that writing the same | code in C wouldn't speed things up if the same use-cases were | supported. Though I would be curious to know if it actually | would speed up significantly. | nightpool wrote: | TFA explains this, although they never actually link the | StackOverflow answer that they quote from. | | > CPython's list.sort is implemented in C (avoiding interpreter | overhead), while heapq.merge is mostly implemented in Python, | and optimizes for the "many iterables" case in a way that slows | the "two iterables" case. | | Going to the original StackOverflow answer, we can see the same | user (ShadowRanger) expanding on this in a comment: | https://stackoverflow.com/questions/464342/combining-two-sor... | | > The selling point for heapq.merge is that it doesn't require | either the inputs or the outputs to be list; it can consume | iterators/generators and produces a generator, so huge | inputs/outputs (not stored in RAM at once) can be combined | without swap thrashing. It also handles merging an arbitrary | number of input iterables with lower overhead than might be | expected (it uses a heap to coordinate the merging, so the | overhead scales with the log of the number of iterables, not | linearly, but as noted, that doesn't matter for the "two | iterable" case). | CogitoCogito wrote: | Yeah looking at all the features of the heapq.merge() | version, I kind of doubt implementing it in C would speed it | up much at all. | | edit: Yeah I now see where the author mentioned the design | philosophy of the python version. It was a bit buried, but | it answers my question like you say. I personally find that | to be the most interesting part. | [deleted] | kzrdude wrote: | Could Cython be used for writing the fast module here instead? | benatkin wrote: | Doesn't explain why Earthly.dev is under the BSL.
| quibono wrote: | I don't know why you would expect a post with a title like this | to touch on that, even though being wary of the BSL licence is | fair. ___________________________________________________________________ (page generated 2021-07-13 23:00 UTC)
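The heapq.merge selling point quoted in the thread (it consumes and produces iterators rather than lists) is easy to demonstrate. A small sketch; the generator expressions below are made up for illustration:

```python
import heapq

# heapq.merge returns a generator: it merges sorted *iterables*
# without materializing the inputs or the output in memory.
evens = (i for i in range(0, 10, 2))   # lazy sorted input
odds = (i for i in range(1, 10, 2))    # lazy sorted input
merged = heapq.merge(evens, odds)      # nothing consumed yet
assert list(merged) == list(range(10))

# The sort-based approach from the article needs both inputs as
# concrete lists, but runs in C, and TimSort's run detection
# exploits the fact that the input is two pre-sorted runs:
m = list(range(0, 10, 2)) + list(range(1, 10, 2))
m.sort()
assert m == list(range(10))
```

The first form works even when the inputs are too large for RAM; the second is faster when everything already fits in lists, which is the trade-off the thread is circling around.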