[HN Gopher] Fluxsort: A stable adaptive partitioning comparison ...
       ___________________________________________________________________
        
       Fluxsort: A stable adaptive partitioning comparison sort
        
       Author : signa11
       Score  : 58 points
       Date   : 2021-07-25 12:03 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | mettamage wrote:
       | Perhaps others think differently about this but after all the
       | interview prep I have done, I can appreciate posts like this a
       | lot.
       | 
       | It seems I have gained a new appreciation of a topic! Which is
       | something I am grateful for :)
        
       | gigatexal wrote:
       | Hmm I kinda want to create a Python package for this.
        
       | Blikkentrekker wrote:
       | > _Fluxsort starts out with an analyzer that detects fully sorted
       | arrays and sorts reverse order arrays using n comparisons. It
       | also obtains a measure of presortedness and switches to quadsort
       | if the array is less than 25% random._
       | 
       | This makes me wonder? is it worth it to do a neural a.i. sort by
       | simply training it?
       | 
       | I assume it would be hard to prove that the resulting a.i. is
       | actually infallible and always produces a sorted output, but
       | perhaps it could for practical measures sort faster than most
       | current algorithms.
        
         | somethingAlex wrote:
         | There's no neural network you could train using supervised
         | learning to solve sorting in the general case.
         | 
         | To train a neural network, the training set and test set need
         | to come from a common probability distribution but the inputs
         | of sorting algorithms can be random. There's no relation
         | between what you could possibly train your network on and what
         | you'd ask it to do.
         | 
         | Put another way - the input and output of your model would have
         | zero correlation. There's nothing to train the model to pick up
         | on.
         | 
         | Interesting related work: Graves[0] trained a "neural Turing
         | machine" to encode instructions for a basic sorting algorithm.
         | 
         | [0] https://arxiv.org/abs/1410.5401
        
         | stingraycharles wrote:
         | You mean using a neural network for sorting? I think it will be
         | very expensive, as sorting algorithms usually take low level
         | performance optimizations such as cache locality into
         | consideration, and neural networks are a poor fit for this type
         | of thing.
        
           | HelloNurse wrote:
           | The currently en vogue deep and large neural networks are not
           | only designed to have an affordable but significant
           | evaluation cost (probably not competitive with really cheap
           | techniques, like the partitioning and medians of this
           | Fluxsort) but also to justify this cost with the recognition
           | of fairly subtle patterns: wasted on rough and general worst-
           | case guarantees, positively dangerous with arbitrarily
           | crafted inputs.
        
             | stingraycharles wrote:
             | Do you have an example of a sorting algorithm that uses a
             | neural network that's on par with other sorting algorithms
             | in terms of performance?
             | 
             | It just seems so counter-intuitive to me that neural
             | networks are an efficient way to handle this, but I would
             | be happy if proven wrong.
        
       | mlochbaum wrote:
       | This is BlockQuicksort with stable partitioning, pumped up with
       | selective benchmarks (if I sound irritated about this, it's
       | because I've just called out the author[0] on this exact issue).
       | I understand that worse performance should be expected for a
       | stable sort, but what possible reason is there to leave out
       | benchmarks against the most closely related algorithm?
       | 
       | Worse, a stable version of BlockQuicksort is pointless, which is
       | to say that it is actually impossible for there to be a practical
       | use case for Fluxsort. The branchless partitioning technique is
       | relevant only if comparison doesn't depend on branching, which
       | means the input array needs to consist machine-native types, and
       | in fact Fluxsort limits itself to integers and long double. On
       | these inputs stability isn't an observable property! The result
       | of Fluxsort will be indistinguishable from that of any other
       | quicksort (okay, I guess it will keep negative and positive zeros
       | in the original order for floats, but an integer-based comparison
       | could also accomplish this). So all you get is a slower sort that
       | uses more memory. There _are_ use cases for stable partitioning,
       | such as sort-by or arg-sort, but as written Fluxsort can 't do
       | these.
       | 
       | The statistical method to decide whether to use quicksort or
       | Quadsort--checking the ratio of comparisons between adjacent
       | pairs of elements that are smaller versus greater--is really
       | interesting. It applies to the mergesort versus quicksort problem
       | in general.
       | 
       | The use of the ptx pointer is definitely not novel and I think
       | it's a stretch to even describe it as a brain-twister. It's the
       | obvious way to do branchless stable partitioning.
       | 
       | This is good work! It's well-written and well-explained, and
       | clearly points in the direction of practical applications.
       | However, it's presented in a way that suggests that it's a good
       | drop-in sorting algorithm, when instead it's more of a research
       | effort. Beware!
       | 
       | [0] https://news.ycombinator.com/item?id=27793636
        
         | mlochbaum wrote:
         | Okay, with scandum's prodding I've looked into this more deeply
         | and need to add much more context. Firstly, I was very much
         | wrong and apologies are in order. What I wrote was based on the
         | assumption that Fluxsort trades speed for stability. It
         | doesn't, apparently: the partitioning seems to be faster and
         | _also_ stable. Possibly more so than the benchmarks show. It 's
         | a crude measure, but I ran perf top during benchmarks, and
         | pdqsort spent significantly more time in its partition than
         | Flux did in its own. So the only real drawback is O(n) rather
         | than O(log(n)) memory usage. Memory's cheap, although I guess
         | some users will decide this isn't acceptable.
         | 
         | A fast partition that _happens_ to be stable is great to have.
         | The in-place scheme used by pdqsort has one advantage, which
         | that it naturally puts reverse-sorted data in order. Other than
         | that it tends to mangle everything. Stable partitioning doesn
         | 't do this, and might even be able to draw out order from
         | interleaved sequences that a merge sort (or insertion sort)
         | could later take advantage of.
         | 
         | BlockQuicksort uses a fast index generation method to produce
         | indices for elements that are then swapped. Fluxsort skips all
         | this and moves an element to _both_ possible positions
         | immediately; it _then_ uses the comparison result to increment
         | one partition index allowing the other position to be
         | overwritten. It 's the simpler organization of stable
         | partitioning that allows this to be fast. There's also some
         | cleverness in not immediately moving the second partition to
         | put it after the first, but instead partitioning it from where
         | it sits. I've looked into stable partitioning, and I know
         | branchless algorithms well enough that I could have written
         | this. But I didn't!
        
         | scandum wrote:
         | I get the impression you wrote your response rather hastily.
         | 
         | Here's a bench of fluxsort vs pdqsort on strings:
         | |      Name |    Items | Type |     Best |  Average |  Compares
         | | Samples |     Distribution |         | --------- | -------- |
         | ---- | -------- | -------- | --------- | ------- |
         | ---------------- |         |  fluxsort |   100000 |   64 |
         | 0.011804 | 0.012044 |   2003293 |     100 |    random string |
         | |   pdqsort |   100000 |   64 | 0.012423 | 0.012512 |   1859192
         | |     100 |    random string |
         | 
         | If you can sort strings you can sort tables, where stability
         | matters.
         | 
         | Here's a graph of relative performance on 100K 32 bit integers.
         | 
         | https://media.discordapp.net/attachments/737457171377160203/...
         | 
         | As for the ptx pointer's use, I'm not aware of a prior
         | implementation.
         | 
         | As for selective benchmarking, that's a selective accusation.
        
           | mlochbaum wrote:
           | Gah, I did misunderstand some things. The picture is now
           | worse than I realized! EDIT: and yet also better; see
           | https://news.ycombinator.com/item?id=27951017 . It's hard to
           | get it working right but this _is_ a really fast algorithm.
           | 
           | fluxsort takes a function pointer and _has no way to inline
           | it_. The fluxsort.c is included from fluxsort.h (this is not
           | standard style by the way; template files should have a .h
           | suffix), so the compiler _might_ choose to inline it, and the
           | benchmarks use noinline on the comparison functions to avoid
           | this.
           | 
           | BlockQuicksort makes no sense without inlined comparisons.
           | Yes, pdqsort allows a comparison function in order to be a
           | general-purpose sort, but this isn't the case that it
           | targets.
           | 
           | Are these benchmarks run with inline comparisons? Seems
           | pretty hard to get bench.c to inline anything since test_sort
           | takes a comparison function as input (though I'm working on
           | it).
        
             | scandum wrote:
             | You can uncomment                   //#define cmp(a,b)
             | (*(a) > *(b))
             | 
             | in quadsort.h for primitive inline comparisons.
        
             | acmj wrote:
             | My biggest complaint about all scandum's sorting algorithms
             | is that it doesn't support guaranteed inlined comparisons.
             | Without this feature, the result depends on the compiler
             | behavior, making the results hard to interpret. This point
             | has been raised many times at reddit/hacker news but the
             | dev is strong minded.
        
           | mlochbaum wrote:
           | I think I mistook the README's comment about ptx as being
           | about the branchless partition logic when it's talking about
           | the overall recursive structure (i.e. which arguments are
           | passed to flux_partition calls). That could be new although I
           | kind of doubt it. The way it uses the two buffers is fairly
           | similar to what I've seen called a "ping-pong merge".
        
           | mlochbaum wrote:
           | Got the benchmarks working with a hardcoded 4-byte int
           | pdqsort. pdqsort is of course faster, but this doesn't tell
           | us anything about how good fluxsort really is because it
           | doesn't inline the comparison (I've removed __attribute__
           | ((noinline)) and comparisons++ as bench.c suggests, which
           | improves the speed but doesn't inline it).
           | |      Name |    Items | Type |     Best |  Average |
           | Loops | Samples |     Distribution |         | --------- |
           | -------- | ---- | -------- | -------- | --------- | ------- |
           | ---------------- |         |     qsort |   100000 |   32 |
           | 0.013530 | 0.013901 |         1 |      10 |     random order
           | |         |  fluxsort |   100000 |   32 | 0.004837 | 0.005009
           | |         1 |      10 |     random order |         |
           | pdqsort |   100000 |   32 | 0.003661 | 0.003745 |         1 |
           | 10 |     random order |
           | 
           | Your graph doesn't show how fast pdqsort orders 4-byte
           | integers. It shows how fast it orders them when required to
           | use a comparison function, which is an arbitrary restriction
           | and not what pdqsort is designed for.
        
             | scandum wrote:
             | The graph is with //#define cmp(a,b) ( _(a) > _(b)) in
             | quadsort.h uncommented.
             | 
             | Looking forward to your next spin.
        
               | mlochbaum wrote:
               | You're right, with that setting fluxsort beats pdqsort
               | slightly for either 1e5 or 1e6 random 4-byte ints. I'm
               | really sorry; I shouldn't have assumed I knew what your
               | benchmark looked like. Although I guess now I'm even more
               | confused about why there are no pdqsort benchmarks in the
               | repository? Beating it is really impressive!
               | 
               | I'll be running my own timings, but do you know where the
               | improvement comes from? Is the base case faster than
               | insertion sort, is the partition faster, or both? I
               | wasn't expecting a stable partition to beat an unstable
               | one because it does twice as much data movement, but it
               | wouldn't be the first time an algorithm using more memory
               | beats one using less.
        
               | scandum wrote:
               | Main reasons for not benching against pdqsort:
               | 
               | 1. stability 2. worse performance on long doubles, and I
               | don't know why 3. A variety of hard to explain
               | performance differences. 4. pdqsort does better on
               | generic data, which can be very important.
               | 
               | So it's tricky to present a fair benchmark when two sorts
               | behave very differently.
               | 
               | As to performance advantages of fluxsort:
               | 
               | 1. It has a faster insertion sort. 2. Branchless
               | pseudomedian of 15 gives an advantage. 3. Partial loop
               | unrolling with: while (ptx + 8 < pte) 4. Data movement
               | should be nearly identical, if not better, with the
               | recursive calls through the ptx pointer. In the optimal
               | case the memcpy only triggers when the partition shrinks
               | below 24 elements, and in half of those cases the
               | partition will already be in main memory.
               | 
               | So on random you could expect n / 2 extra data movements
               | on top of ~ n log n moves.
        
       ___________________________________________________________________
       (page generated 2021-07-25 23:01 UTC)