[HN Gopher] Beating TimSort at Merging
       ___________________________________________________________________
        
       Beating TimSort at Merging
        
       Author : andrewstetsenko
       Score  : 187 points
       Date   : 2021-07-13 16:46 UTC (6 hours ago)
        
 (HTM) web link (earthly.dev)
 (TXT) w3m dump (earthly.dev)
        
       | danbruc wrote:
       | The first Python implementation is bad - removing the first
       | element in each iteration is O(n); the C implementation gets this
       | right by maintaining one index into each list instead of
       | modifying the lists.
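       | 
       | A minimal sketch of the difference (not from the article; both
       | function names are made up for illustration):
       | 
       |     def merge_by_popping(l1, l2):
       |         # pop(0) shifts every remaining element, so the whole
       |         # merge degrades to O(n^2)
       |         l1, l2, out = list(l1), list(l2), []
       |         while l1 and l2:
       |             out.append(l1.pop(0) if l1[0] <= l2[0] else l2.pop(0))
       |         return out + l1 + l2
       | 
       |     def merge_by_indexing(l1, l2):
       |         # one index per input list, as the C version does, keeps
       |         # each step O(1)
       |         i, j, out = 0, 0, []
       |         while i < len(l1) and j < len(l2):
       |             if l1[i] <= l2[j]:
       |                 out.append(l1[i])
       |                 i += 1
       |             else:
       |                 out.append(l2[j])
       |                 j += 1
       |         return out + l1[i:] + l2[j:]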
        
         | XelNika wrote:
         | Yes, it says so in footnote 1 in the article. It's also not the
         | benchmarked code so it doesn't really matter.
        
           | danbruc wrote:
           | I missed the footnote. My point was more that one could have
           | compared this with the C implementation as heapq.merge is a
           | different algorithm and so there is no actual comparison of
           | the same algorithm in C and Python.
        
         | min-cut wrote:
         | Not only that, but also list(a + b) is producing two different
         | new lists since (a + b) produces a list and list() constructs a
         | new list copy. The benchmark would be faster if the OP just did
         | 
         | def sort_test(): m2 = a + b; m2.sort()
         | 
         | instead of
         | 
         | def sort_test(): m2 = list(a + b); m2.sort()
         | 
         | EDIT: It seems like OP fixed this issue in perf.py, but left it
         | in test.py
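         | 
         | A quick, hypothetical way to measure the cost of the extra copy
         | locally (the input sizes here are made up):
         | 
         |     import timeit
         | 
         |     a = list(range(1_000_000))
         |     b = list(range(1_000_000))
         | 
         |     def with_extra_copy():
         |         m2 = list(a + b)   # a + b is already a new list;
         |         m2.sort()          # list() copies it a second time
         | 
         |     def without_extra_copy():
         |         m2 = a + b
         |         m2.sort()
         | 
         |     print(timeit.timeit(with_extra_copy, number=20))
         |     print(timeit.timeit(without_extra_copy, number=20))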
        
           | agbell wrote:
           | You can check it, but I'm pretty sure the difference is
           | negligible. But yeah, all benchmarks are using the code in
           | perf and the pop code is just to demonstrate. It is not
           | benched.
           | 
           | Edit: dropping the extra list() from the blog code examples.
        
             | min-cut wrote:
             | I see about a 10% performance improvement on my local
             | machine on the input in test.py when not constructing the
             | unnecessary second list.
             | 
             | I don't really buy that it reads nicer in prose since you
             | can just do (a + b).sort() if you want. Plus, I feel like
             | it's important for readability to not be unnecessarily
             | redundant. Having list(a + b) code also risks creating
             | misconceptions about how lists can be constructed and used
             | in Python.
        
               | agbell wrote:
               | Yeah, good point. I think you are right about it being
               | incorrect. Updating...
        
       | tim-peters wrote:
       | To get more gonzo, in CPython's list.sort(), the C code that
       | actually merges two lists (which happen, in context, to be two
       | contiguous array slices) is in listobject.c's `merge_at()`
       | function. That brings in the world of "galloping" (exponential
       | search) optimizations, much faster than one-pair-at-a-time
       | compares in many real-world cases.
       | 
       | So that's a whole lot of additional complication, but it's the
       | heart of what _really_ makes timsort shine in its "jaw dropping"
       | cases.
       | 
       | Tim (of "timsort" - which was an inside joke name at the start,
       | because I never expected anything outside of CPython would use it
       | ;-) ).
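       | 
       | For anyone who hasn't seen galloping: a rough Python sketch of
       | the exponential-search idea (not CPython's actual routine, which
       | also tunes when galloping mode kicks in):
       | 
       |     from bisect import bisect_left
       | 
       |     def gallop_left(x, a, start=0):
       |         # Leftmost index in sorted a[start:] where x could be
       |         # inserted, probing at offsets 1, 2, 4, 8, ... first.
       |         n = len(a)
       |         offset = 1
       |         while start + offset < n and a[start + offset - 1] < x:
       |             offset *= 2
       |         # The answer is now confined to the last doubling window;
       |         # finish with an ordinary binary search inside it.
       |         return bisect_left(a, x, start + offset // 2,
       |                            min(start + offset, n))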
        
         | j-pb wrote:
         | Is there a reason to use galloping over an array vs a seek in a
         | search tree?
         | 
         | Better real world factors?
         | 
         | Galloping, Demaine set intersection, worst case optimal joins,
         | they all seem to be different aspects of the same underlying
         | principle.
         | 
         | So I'm very curious about the peculiarities in the sorting
         | case.
        
         | borski wrote:
         | Gosh, comments like these from the actual author are a reminder
         | of why I love HN. :)
        
         | agbell wrote:
         | Wow. Thanks for writing it. I believe this is the part you are
         | referring to:
         | https://github.com/python/cpython/blob/main/Objects/listobje...
        
           | tim-peters wrote:
           | Right, `merge_at()`. But that in turn calls small mountains
           | of other C code. It is, alas, a complicated approach. Folding
           | it in would be substantially more work than you've already
           | done.
           | 
           | The point: for inputs like your:
           | 
           | a = list(range(-100, 1700))
           | b = list(range(1400, 1800))
           | 
           | it would first use a variant of binary search to find where
           | b[0] belongs in a. "Almost all of a" is consumed by that via
           | a relatively tiny number of compares. It would similarly do a
           | binary-ish search to find where a[-1] belongs in b. The tail
           | end of b, from 1700 through 1799, then also never needs to be
           | looked at again.
           | 
           | In the lucky case that two input lists don't overlap at all,
           | and a[-1] <= b[0], that's enough so that a "one-pair-at-a-
           | time" compare _never_ needs to be done. In context (the two
           | lists are contiguous slices of an array), no pointers at all
           | need to be copied in that case either - it deduces that the
           | combined array slices are already in sorted order in total
           | time O(log(len(a)) + log(len(b))), and it's done.
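           | 
           | A rough Python rendering of that idea (just a sketch; the
           | real code works in place on array slices and uses galloping
           | rather than heapq.merge for the overlapping middle):
           | 
           |     import heapq
           |     from bisect import bisect_left, bisect_right
           | 
           |     def merge_trimming_ends(a, b):
           |         if not a or not b or a[-1] <= b[0]:
           |             # already in order overall: nothing to compare
           |             return a + b
           |         # a[:i] sits entirely below b[0]; b[j:] sits entirely
           |         # above a[-1]; neither needs per-element compares
           |         i = bisect_left(a, b[0])
           |         j = bisect_right(b, a[-1])
           |         middle = list(heapq.merge(a[i:], b[:j]))
           |         return a[:i] + middle + b[j:]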
        
             | agbell wrote:
             | Very Cool!
             | 
             | Edit: Ok, I get it now. Don't just pop, binary search into
             | it, so you can fast forward through. I'm not sure I'll get
             | around to implementing that, at least not in the immediate
             | future, but that makes total sense.
        
       | jhokanson wrote:
       | As someone who rarely works with Python lists, as opposed to
       | numpy arrays, I was pleasantly surprised to see numpy does what I
       | would expect in providing a mergesort option. I'm surprised
       | Python doesn't, other than via heapq, which is only implemented
       | in Python (according to my reading of the post and a very quick
       | Google search).
       | 
       | Oops, just for fun the numpy documentation currently states: "The
       | datatype determines which of 'mergesort' or 'timsort' is actually
       | used, even if 'mergesort' is specified. User selection at a finer
       | scale is not currently available." Awesome ...
       | 
       | Also, apparently mergesort may be done using radix sort for
       | integers: "'mergesort' and 'stable' are mapped to radix sort for
       | integer data types."
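       | 
       | For reference, the kind= argument being quoted looks like this
       | (the dispatch to radix sort for integer dtypes is an internal
       | detail, per the docs above):
       | 
       |     import numpy as np
       | 
       |     arr = np.array([5, 2, 9, 2, 7], dtype=np.int64)
       | 
       |     # 'stable' (or 'mergesort') requests a stable sort; NumPy is
       |     # free to choose the actual algorithm behind that name.
       |     out = np.sort(arr, kind='stable')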
        
         | kzrdude wrote:
         | mergesort mandates that it's a stable sort, so I guess they get
         | away with replacing "like for like" that way. For varying
         | definitions of like.
        
         | masklinn wrote:
         | Why would Python provide a mergesort option when timsort was
         | the replacement for a standard mergesort?
         | 
         | `heapq.merge` is not a mergesort, it's a merge of sorted
         | sequences (aka just the merge part of a mergesort).
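         | 
         | In other words (a minimal example, not from the article):
         | 
         |     import heapq
         | 
         |     a = [1, 3, 5]
         |     b = [2, 4, 6]
         | 
         |     # heapq.merge assumes its inputs are already sorted and
         |     # lazily yields the merged sequence; it does no sorting.
         |     merged = list(heapq.merge(a, b))   # [1, 2, 3, 4, 5, 6]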
        
         | TheRealPomax wrote:
         | Turns out that at least in terms of performance, everyone using
         | numpy is fine with this. It just needs to run fast enough, not
         | as fast as possible ;)
        
           | tehjoker wrote:
           | This is not true. I have written many libraries to improve on
           | the performance of numpy for image processing applications.
           | Sort backs many operations, such as unique, that result in
           | painfully slow code.
        
             | CogitoCogito wrote:
             | I too have written a lot of extension code to speed up
             | numpy code. It's often not even especially difficult since
             | any code at that level tends to be very algorithmic and
             | looks essentially the same if written in numpy or C++.
             | Having control of the memory layout and algorithms can be a
             | huge win.
             | 
             | Of course I don't _usually_ do this, but that's just
             | because most of the code I write doesn't need it. But it's
             | not at all some crazy sort of thing to do.
        
               | toxik wrote:
               | Tangential but np.bincount is typically the fast version
               | of np.unique. Not entirely the same thing, but it's worth
               | knowing about it.
        
       | dang wrote:
       | Past TimSort threads, for anyone interested:
       | 
       |  _Timsort, the Python sorting algorithm_ -
       | https://news.ycombinator.com/item?id=21196555 - Oct 2019 (131
       | comments)
       | 
       |  _On the Worst-Case Complexity of TimSort_ -
       | https://news.ycombinator.com/item?id=17883461 - Aug 2018 (74
       | comments)
       | 
       |  _Timsort is a sorting algorithm that is efficient for real-world
       | data_ - https://news.ycombinator.com/item?id=17436591 - July 2018
       | (77 comments)
       | 
       |  _Functional verification with mechanical proofs of TimSort
       | [pdf]_ - https://news.ycombinator.com/item?id=9778243 - June 2015
       | (1 comment)
       | 
       |  _Timsort_ - https://news.ycombinator.com/item?id=3214527 - Nov
       | 2011 (27 comments)
       | 
       |  _Java has switched from Mergesort to TimSort_ -
       | https://news.ycombinator.com/item?id=752677 - Aug 2009 (13
       | comments)
        
       | rkalla wrote:
       | Excellent read.
        
       | thomas536 wrote:
       | Are there similar relatively simple methods or simd tricks for
       | multi-way merges?
        
         | dragontamer wrote:
         | I dunno about "multiway merge". But your standard 2-way merge
         | is SIMD-optimized by having a binary-search across the
         | "diagonals" of the so-called mergepath.
         | 
         | Leading to O(lg(p * sqrt(2))) time for the longest comparison
         | diagonal, where "p" is the number of processors. Including
         | thread divergence, that's O(N/p * lg(p * sqrt(2))) total work
         | done, or assuming constant-p, O(N) of work per 2-way SIMD merge
         | (with a large "p" providing an arbitrarily large speedup: but
         | in practice is probably limited to 1024 for modern GPU
         | architectures. Since modern GPUs only really "gang up" into
         | 1024-CUDA thread blocks / workgroups efficiently)
         | 
         | https://web.cs.ucdavis.edu/~amenta/f15/GPUmp.pdf
         | 
         | This provides a natural way of allowing ~1024 CUDA threads in a
         | singular block to sort a list at O(n * log(n)) efficiently.
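         | 
         | A CPU-side Python sketch of that diagonal binary search (the
         | merge-path / co-rank split; on a GPU each of the p threads
         | would run one of these searches for its own diagonal k):
         | 
         |     def merge_path_split(a, b, k):
         |         # Find (i, j) with i + j == k such that a[:i] and b[:j]
         |         # together form the first k elements of the stable merge.
         |         lo = max(0, k - len(b))
         |         hi = min(k, len(a))
         |         while lo < hi:
         |             i = (lo + hi) // 2
         |             j = k - i
         |             if a[i] <= b[j - 1]:
         |                 # a[i] still belongs among the first k elements,
         |                 # so the split must take more of a
         |                 lo = i + 1
         |             else:
         |                 hi = i
         |         return lo, k - lo
         | 
         | Each thread then merges its own pair of slices sequentially,
         | independently of the others.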
        
       | agbell wrote:
       | TimSort is cool. In its worst case it looks no better than merge
       | sort, but give it some partially ordered data, or small amounts
       | of data, where it uses insertion sort, and it really shines.
        
       | lyxell wrote:
       | Interesting read. This post seems to be written by Adam Gordon
       | Bell, who is also the host of the excellent CoRecursive podcast,
       | recently featured on HN with The Untold Story of SQLite [0].
       | 
       | [0]: https://news.ycombinator.com/item?id=27718701
        
         | agbell wrote:
         | That's me. Thanks for reading it and listening to the podcast.
         | The SQLite episode is now nearly the most listened to, thanks
         | to Hacker News.
         | 
         | On Merging lists: I was caught off guard by the recommendation to
         | use sort to merge sorted lists. Everyone I mentioned it to was
         | surprised as well so I thought it was worthy of some
         | investigation. This is my first C extension in Python and I got
         | help so someone else might be able to do even better than this.
         | 
         | It is interesting when our normal shorthands for thinking
         | about runtime complexity break down.
        
           | CogitoCogito wrote:
           | > It is interesting when our normal shorthands for thinking
           | about runtime complexity break down.
           | 
           | Actually I think it would be more interesting to write a
           | Python version following the algorithm of your C extension
           | (i.e.
           | only allow two lists, don't allow generators, don't return
           | generators, etc). You could probably do that quite quickly
           | and generate the same graphs. I would expect that python
           | version to lie somewhere between your C extension and the
           | heapq.merge() version which is more general.
           | 
           | edit: If you were to do this, I wouldn't recommend using the
           | python version in your blog since it pops the lists from the
           | front. This seems to be O(n) in general:
           | 
           | https://wiki.python.org/moin/TimeComplexity
           | 
           | I did a 10 minute glance at the python source code and
           | couldn't totally verify this (it needs more than 10
           | minutes...), but it makes sense for popping from anywhere but
           | the back to cause a list copy to occur. Maybe this isn't
           | technically necessary when popping from the front, but I
           | couldn't verify it. Either way it's easy to avoid by just not
           | mutating the input lists anyway so I don't see a good reason
           | not to go that way.
        
             | agbell wrote:
             | So we could determine which gains I am getting are due to C
             | and the specific compares and which are due to just being 2
             | lists? That would be an interesting test.
             | 
             | Maybe I'll do a follow-up. Thanks for reading the article.
             | I suspect you know way more about C extensions in Python
             | than I do. I saw you have a talk on this topic.
             | 
             | I think you are totally right that pop is not the way to
             | go; using an offset to track the head, like the C example
             | does, is better. I actually put that in a footnote, because
             | I was considering testing my Python version, but the SO
             | answer mentioned pop being expensive.
             | 
             | I think the simple version is still great as pseudo-code
             | for communicating a solution though, and hopefully it makes
             | the C code a bit easier to follow once you've seen the
             | Python version.
        
               | CogitoCogito wrote:
               | Well the version I'm saying is essentially the same as
               | the pop version you have so I don't think it's too
               | complicated. You would just do something like this
               | instead:
               | 
               |     def merge_sorted_lists(l1, l2):
               |         sorted_list = []
               |         i = 0
               |         j = 0
               |         while i < len(l1) or j < len(l2):
               |             if i == len(l1):
               |                 sorted_list.append(l2[j])
               |                 j += 1
               |                 continue
               |             if j == len(l2):
               |                 sorted_list.append(l1[i])
               |                 i += 1
               |                 continue
               |             if l1[i] < l2[j]:
               |                 sorted_list.append(l1[i])
               |                 i += 1
               |             else:
               |                 sorted_list.append(l2[j])
               |                 j += 1
               |         return sorted_list
               | 
               | I haven't tested that (you probably should before using
               | it), but it looks mostly right and is essentially the
               | same algorithm as yours.
        
       | forrestthewoods wrote:
       | Great post. I almost did a spit take when I read:
       | 
       | > Timsort overcomes this disadvantage by being written in C
       | rather than Python.
       | 
       | Yeah, Python is SLOW. Thankfully the author dug into C. That was
       | nice to see.
       | 
       | Edit: lol I have no idea why this is being downvoted. Y'all
       | weird.
        
         | ww520 wrote:
         | That's what I suspected when reading the first part of the post
         | where it states that heapq.merge() is slower than sort(). Hmm,
         | isn't it comparing apples to oranges, since heapq.merge() is
         | implemented in Python and the system sort() would be in C?
        
       | CGamesPlay wrote:
       | I definitely used this to my advantage in ACM ICPC contests. Most
       | of the students and professors wrote the problems and benchmarked
       | against Java. However, on several problems, it boiled down to
       | "write this specific algorithm for this class of problem, or use
       | C." I once netted my team an extra problem after we timed out in
       | Java and I just rewrote the teammate's solution in C++.
        
         | peter422 wrote:
         | Outside of some very specific parsing problems where the Java
         | standard library was very helpful, writing solutions in C was
         | the way to go.
         | 
         | And while C could let you get away with certain parts of the
         | code being less efficient, if you had completely the wrong
         | algorithm it wasn't going to save you.
        
       | CogitoCogito wrote:
       | It seems like the built-in heapq.merge() should be as fast as the
       | implementation here. I mean if it's really just merging sorted
       | lists, why wouldn't it be as fast as this implementation?
       | 
       | edit: Okay one reason is that heapq.merge is written in python:
       | 
       | https://github.com/python/cpython/blob/6252670732c68420c2a8b...
       | 
       | It might also be because the Python version has more options
       | (like arbitrarily many iterables), but I presume it's mainly the
       | fact that the extension is written in C.
        
         | typon wrote:
         | Because it's written in python
        
           | CogitoCogito wrote:
           | Re-writing it in C wouldn't automatically make much of a
           | difference. Of course a C implementation wouldn't be slower
           | (at minimum you could flatten out the byte-code sequence and
           | use the generated C api calls), but it wouldn't necessarily
           | be noticeably faster. Given the features of the python
           | version, it might very well be the case that writing the same
           | code in C wouldn't speed things up if the same use-cases were
           | supported. Though I would be curious to know if it actually
           | would speed up significantly.
        
         | nightpool wrote:
         | TFA explains this, although it never actually links the
         | StackOverflow answer that it quotes from.
         | 
         | > CPython's list.sort is implemented in C (avoiding interpreter
         | overhead), while heapq.merge is mostly implemented in Python,
         | and optimizes for the "many iterables" case in a way that slows
         | the "two iterables" case.
         | 
         | Going to the original StackOverflow answer, we can see the same
         | user (ShadowRanger) expanding on this in a comment:
         | https://stackoverflow.com/questions/464342/combining-two-sor...
         | 
         | > The selling point for heapq.merge is that it doesn't require
         | either the inputs or the outputs to be list; it can consume
         | iterators/generators and produces a generator, so huge
         | inputs/outputs (not stored in RAM at once) can be combined
         | without swap thrashing. It also handles merging an arbitrary
         | number of input iterables with lower overhead than might be
         | expected (it uses a heap to coordinate the merging, so the
         | overhead scales with the log of the number of iterables, not
         | linearly, but as noted, that doesn't matter for the "two
         | iterable" case).
        
           | CogitoCogito wrote:
           | Yeah, looking at all the features of the heapq.merge()
           | version, I kind of doubt implementing it in C would speed it
           | up much at all.
           | 
           | edit: Yeah I now see where the author mentioned the design
           | philosophy of the Python version. It was a bit buried, but
           | it answers my question like you say. I personally find that
           | to be the most interesting part.
        
         | [deleted]
        
       | kzrdude wrote:
       | Could Cython be used for writing the fast module here instead?
        
       | benatkin wrote:
       | Doesn't explain why Earthly.dev is under the BSL.
        
         | quibono wrote:
         | I don't know why you would expect a post with a title like this
         | to touch on that, even though being wary of the BSL licence is
         | fair.
        
       ___________________________________________________________________
       (page generated 2021-07-13 23:00 UTC)