[HN Gopher] Vectorized and performance-portable Quicksort
       ___________________________________________________________________
        
       Vectorized and performance-portable Quicksort
        
       Author : slackerIII
       Score  : 321 points
       Date   : 2022-06-04 16:48 UTC (6 hours ago)
        
 (HTM) web link (opensource.googleblog.com)
 (TXT) w3m dump (opensource.googleblog.com)
        
       | DeathArrow wrote:
       | This might interest someone: "Vectorized Operations extension for
       | PostgreSQL"
       | 
       | https://github.com/postgrespro/vops
        
         | polskibus wrote:
          | This looks awesome but could do with a bit of
          | productionalization - for example, ready-to-use binaries. Do
          | you know if there are any plans to merge this work into core
          | Postgres?
        
       | orlp wrote:
       | SIMD based sorting isn't new, e.g. djbsort:
       | https://sorting.cr.yp.to/. I find it strange the paper doesn't
       | cite it or compare to it at all.
       | 
       | EDIT: to be fair it does seem that Bramas' first published work
       | on SIMD sorting (that I could find) predates djbsort,
       | https://arxiv.org/abs/1704.08579. A comparison would still be
       | nice though.
        
         | TontonNestor wrote:
          | djbsort uses a sorting network, not a quicksort. It is also
          | designed to run in constant time, which is not the case for the
          | approach in the linked blog post. So the constraints are
          | different.
         | 
         | This is not my field, so I can't judge the relevance of this,
         | but the author cites the state of the art. It just seems that
         | djbsort is not the state of the art and therefore does not need
         | to be cited.
         | 
          | I still don't understand the hype around Bernstein; he is quite
          | relevant in cryptography and cryptographic implementations, but
          | not everything he does is gold.
        
           | janwas wrote:
           | :) FWIW we also use a constant-time sorting network as the
            | base case of the Quicksort recursion; that's a widely used
            | optimization.
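            | 
            | To illustrate the idea (a toy example, not vqsort's actual
            | network): a branch-free compare-exchange network for 4
            | elements. The comparisons happen in a fixed order regardless
            | of the input, which is what makes the running time constant;
            | vectorized versions apply the same fixed pattern across SIMD
            | lanes to sort many small groups at once.
            | 
            |   #include <algorithm>
            |   #include <cstdint>
            |   
            |   inline void CompareExchange(int32_t& a, int32_t& b) {
            |     const int32_t lo = std::min(a, b);  // typically branchless
            |     const int32_t hi = std::max(a, b);
            |     a = lo;
            |     b = hi;
            |   }
            |   
            |   // Fixed sequence of compare-exchanges that sorts any 4 keys.
            |   void SortingNetwork4(int32_t v[4]) {
            |     CompareExchange(v[0], v[1]);
            |     CompareExchange(v[2], v[3]);
            |     CompareExchange(v[0], v[2]);
            |     CompareExchange(v[1], v[3]);
            |     CompareExchange(v[1], v[2]);
            |   }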
        
         | janwas wrote:
         | What exactly should be cited? I had a quick look and djbsort
         | seems to be a constant-time algorithm, presumably a sorting
         | network. They report about 2 GB/s for 768 int32 on 3 GHz
         | Skylake; IIRC our sorting network reaches 3-4 GB/s, but for a
         | smaller input size (256 int32 or int64). Hard to compare
         | directly, but djbsort is likely slower, and the sorting network
         | is anyway a small fraction of the total time in a Quicksort.
        
         | explaingarlic wrote:
         | The whole "stick SIMD into it and compare it with the stdlib"
         | thing is a really common learning trope. Feels more like a
         | portfolio filler, albeit probably not a useless one.
        
           | janwas wrote:
           | We also compare with state of the art platform-specific code.
           | The interesting thing here is that they turn out to be slower
           | than our approach using portable intrinsics
           | (github.com/google/highway) :)
        
         | mixedCase wrote:
         | The article acknowledges SIMD based sorting and states what
         | makes this novel.
        
         | nostrademons wrote:
         | It's not new within Google either. I remember poking through
         | Jeff Dean's original MapReduce code (written c. 2003) and
         | finding a SIMD-based QuickSort that was the in-memory portion
         | of the external sort algorithm.
         | 
          | It looks like this algorithm is interesting in the specifics of
          | how it works, e.g. the use of a compress-store instruction for
         | partitioning. The algorithm I mentioned above mostly focused on
         | speeding up comparisons through the use of SIMD instructions,
         | rather than changing the partitioning part of the QuickSort
         | algorithm itself. Not sure how that compares to djbsort, I
         | didn't see technical details of the algorithm on the page.
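          | 
          | For illustration, a rough sketch (made-up names, not the actual
          | vqsort code) of what a compress-store based partition step can
          | look like with AVX-512 intrinsics: load 16 keys, compare
          | against the pivot, and let vcompressstore pack the matching
          | lanes contiguously, with no scalar branches. A real
          | implementation writes the "right" side from the end of the
          | array and handles remainders.
          | 
          |   #include <immintrin.h>  // requires -mavx512f
          |   #include <cstdint>
          |   
          |   // Partitions 16 int32 keys into two output regions; advances
          |   // the output pointers by the number of elements written.
          |   void PartitionBlock16(const int32_t* keys, int32_t pivot,
          |                         int32_t*& left, int32_t*& right) {
          |     const __m512i v = _mm512_loadu_si512(keys);
          |     const __m512i vp = _mm512_set1_epi32(pivot);
          |     // Bit i of the mask is set if keys[i] < pivot.
          |     const __mmask16 lt = _mm512_cmplt_epi32_mask(v, vp);
          |     const __mmask16 ge = static_cast<__mmask16>(~lt);
          |     // Pack the selected lanes contiguously and store them.
          |     _mm512_mask_compressstoreu_epi32(left, lt, v);
          |     _mm512_mask_compressstoreu_epi32(right, ge, v);
          |     left += _mm_popcnt_u32(lt);
          |     right += _mm_popcnt_u32(ge);
          |   }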
        
           | jeffbee wrote:
           | The blog post does seem to be quite explicit about the fact
           | that their contribution is in the field of portability, not
           | novelty.
        
           | janwas wrote:
           | Interesting, I did not know that, thanks for the pointer.
           | There is also an effort underway to update LLVM std::sort to
           | BlockQuicksort, which can use some SIMD instructions
           | similarly to what you describe.
        
           | convolvatron wrote:
           | nothing happened before google
           | 
           | https://dl.acm.org/doi/10.1145/113379.113380
        
             | janwas wrote:
             | hah, our paper actually cites "Radix sort for vector
             | multiprocessors" by the same author, also from 1991 :)
        
       | skybrian wrote:
       | Will browsers start using this code for typed array sorting?
       | Maybe they already do?
        
         | xxs wrote:
          | Sorting integers is actually rare in practice, and in
          | particular it's not very useful for a browser.
        
           | skybrian wrote:
           | The method exists. They could make it faster without breaking
           | anything.
           | 
           | https://developer.mozilla.org/en-
           | US/docs/Web/JavaScript/Refe...
        
       | wly_cdgr wrote:
       | SpaceChem players have known this for years
        
       | unnouinceput wrote:
       | Quote: "Today we're sharing open source code that can sort arrays
       | of numbers about ten times as fast as the C++ std::sort...." and
       | proceeds to explain about SIMD instructions.
       | 
        | Well, to me it sounds like C++ std::sort is simply lacking an
        | optimization which could easily be implemented in the next
        | iteration of the compiler/linker. I had high hopes this would be
        | a real breakthrough algorithm, not the absence of a simple
        | optimization using some instruction set that the compiler/linker
        | didn't implement. If they wanted, CPU vendors could make an even
        | more awesome and faster instruction set, say SIMD2, which would
        | leave this one in the dust.
        
         | janwas wrote:
         | CPU vendors have recently introduced SVE (Arm) and RVV (RISC-V)
         | and we support those in Highway and thus also vqsort.
         | 
         | It's definitely interesting to discuss std::sort benefitting
         | from such optimizations, but "very easy" seems rather
         | optimistic.
        
           | benstrumental wrote:
            | Do you expect these optimizations to make their way into
            | std::sort eventually?
        
       | janwas wrote:
       | Author here, happy to discuss.
        
         | tehsauce wrote:
          | Do you have any idea whether this fast SIMD sort could be
          | applied to speed up sorting on GPUs?
        
           | [deleted]
        
         | TinkersW wrote:
         | I just wanted to say awesome job, I'm going to use it, ignore
         | all the ignorant SIMD haters;0
        
         | bhl wrote:
         | Regarding columnar databases, is there an optimal data
         | structure in between storing data as rows and storing data as
         | columns that would make it fast to sort and filter by either
         | dimension? Some mix of B-trees or interval trees?
        
         | nickysielicki wrote:
         | Highway looks awesome from a quick glance. Definitely going to
         | set some time aside to play with it and read it.
         | 
         | Sort of tangentially related, what sort of tools are you using
         | for disassembly in 2022? Is there anything better than objdump
         | today? What I really want is a version of godbolt that reads
         | compile_commands.json and can show me the godbolt-style color-
         | coded disassembly for an entire binary, any function. I find
          | that when I'm asking these sorts of questions I'm wasting a lot
          | of time fitting my problem into godbolt just to get a rough
          | idea of what's coming out, because it's so much faster (maybe
          | 10x) to read the color-tagged version from godbolt than it is
          | to do the same with objdump.
         | 
         | Edit: along the same lines, what can people do (or rather, what
         | do you do) about situations like this:
         | https://github.com/google/highway/blob/master/hwy/examples/b...
         | where things might get better in the future but for now we have
         | to implement it another way? How do you avoid regressions and
         | how do you track workarounds? I want some sort of database that
         | tries competing implementations on every build and emits a
         | compiler warning when things get better. How are you dealing
         | with this?
        
           | janwas wrote:
           | Thanks! Feel free to raise Github issues if you'd like to
           | ask/discuss anything.
           | 
           | I'm also a huge fan of Godbolt/Compiler Explorer. Highway is
           | integrated there, so you can just copy in your functions.
           | Here's the last throwaway test that I was using:
           | https://gcc.godbolt.org/z/5azbK95j9
           | 
           | > things might get better in the future but for now we have
           | to implement it another way
           | 
            | There are several possible answers. 1) For the issue you
           | pointed to, that we cannot have arrays of N vectors, it's a
           | reasonable workaround to instead allocate an array of N
           | vectors' worth of elements, and trust the compiler to elide
           | unnecessary Load/Store. This often works using clang, and
           | would avoid having to manually unroll here. I do prefer to
           | minimize the amount of compiler magic required for good
           | performance, though, so typically we're unrolling manually as
           | shown in that code.
           | 
           | 2) If there are compiler bugs, typically the workarounds have
           | to stay in because in the open-source world people are still
           | using that compiler years later.
           | 
           | Automatically detecting when things get better is an
           | interesting idea but I am not yet aware of such
           | infrastructure.
        
             | celrod wrote:
              | // Compiler doesn't make independent sum* accumulators, so
              | // unroll manually. We cannot use an array because V might
              | // be a sizeless type. For reasonable code, we unroll 4x,
              | // but 8x might help (2 FMA ports * 4 cycle latency).
             | 
             | That code needs 2 loads per FMA. So a CPU with 2 FMA ports
             | would need at least 4 load ports to be able to feed the 2
             | FMA ports. Given that most CPUs with 2 FMA ports have just
             | 2 load ports, unrolling by 4 should be more or less ideal.
             | 
             | But, ideally, the compiler could make the decision based on
             | the target architecture.
             | 
              | Without enabling associative math, it isn't legal to
              | duplicate floating point accumulators and change the order
              | of the accumulation. Perhaps compiling under
              | `-funsafe-math-optimizations` would help. If you're using
              | GCC, you'll probably need
              | `-fvariable-expansion-in-unroller`, too.
             | 
             | I think highway looks great. I'm sure I'll procrastinate on
             | something important to play with it reasonably soon.
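              | 
              | To make the discussion concrete, a rough sketch of the
              | pattern in question (assuming the Highway API; not the
              | actual example code): a dot product with four manually
              | unrolled, independent accumulators, since V may be a
              | sizeless type and V acc[4] is not allowed.
              | 
              |   #include <cstddef>
              |   #include "hwy/highway.h"
              |   namespace hn = hwy::HWY_NAMESPACE;
              |   
              |   float Dot4x(const float* HWY_RESTRICT a,
              |               const float* HWY_RESTRICT b, size_t num) {
              |     const hn::ScalableTag<float> d;
              |     const size_t N = hn::Lanes(d);
              |     // Independent accumulators hide the FMA latency.
              |     auto s0 = hn::Zero(d), s1 = hn::Zero(d);
              |     auto s2 = hn::Zero(d), s3 = hn::Zero(d);
              |     size_t i = 0;
              |     for (; i + 4 * N <= num; i += 4 * N) {
              |       s0 = hn::MulAdd(hn::Load(d, a + i), hn::Load(d, b + i), s0);
              |       s1 = hn::MulAdd(hn::Load(d, a + i + N), hn::Load(d, b + i + N), s1);
              |       s2 = hn::MulAdd(hn::Load(d, a + i + 2 * N), hn::Load(d, b + i + 2 * N), s2);
              |       s3 = hn::MulAdd(hn::Load(d, a + i + 3 * N), hn::Load(d, b + i + 3 * N), s3);
              |     }
              |     s0 = hn::Add(hn::Add(s0, s1), hn::Add(s2, s3));
              |     // Reduction spelling varies across Highway versions.
              |     float sum = hn::GetLane(hn::SumOfLanes(d, s0));
              |     for (; i < num; ++i) sum += a[i] * b[i];  // scalar tail
              |     return sum;
              |   }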
        
         | superjan wrote:
         | Great work! Does it also support sorting object pointers by a
         | numeric key?
        
         | mchusma wrote:
         | Very cool work! What do you see as the typical process of
         | getting code like this into use in well known codebases? I'm
         | trying to get a sense of when a typical engineer (operating
         | with higher level language or libraries) might get to take
         | advantage of this.
         | 
         | Thanks for open sourcing!
        
           | janwas wrote:
           | Thanks! Highway is included in Chrome and Firefox. Not sure
           | what you mean by process? It's pretty easy for example using
           | git submodules - you tell Git the Highway version you want to
           | depend on, notify your CMake or Bazel build system of the hwy
           | library dependency, and that's it.
           | 
           | One requirement is C++11 or later - some people have asked
           | about supporting C but I believe that's infeasible.
        
         | DennisL123 wrote:
         | Well done, Jan!
        
         | BeeOnRope wrote:
         | Interesting post and paper, thanks!
         | 
          | Sometimes the state of the art is not found in another paper
          | but somewhere else; e.g., there is vxsort by Dan Shechter
          | (damageboy):
         | 
         | https://github.com/damageboy/vxsort-cpp/tree/master
         | 
         | He uses a similar approach and while I'm not sure how it
         | compares to the older Blacher et al version, I expect it to be
         | in the ballpark.
         | 
         | He's written an excellent series of blog posts on the design &
         | implementation:
         | 
         | https://bits.houmus.org/2020-01-28/this-goes-to-eleven-pt1
        
         | johntortugo wrote:
          | I think it would be interesting to have a power consumption
          | comparison as well.
        
           | janwas wrote:
           | Definitely, I have on my TODO to give turbostat a try.
        
         | SemanticStrengh wrote:
        
       | jart wrote:
       | I wonder how fast it is compared to djbsort
       | https://github.com/jart/cosmopolitan/blob/master/libc/nexgen...
       | and longsort
       | https://github.com/jart/cosmopolitan/blob/e011973593407f576d...
        | djbsort is outrageously fast for 32-bit ints with avx2 (which,
        | unlike avx512, is the avx we can reliably use in open source).
       | But there's never been a clear instruction set to use on Intel /
       | AMD for sorting 64-bit ints that's reliably fast. So if this
       | thing can actually sort 64-bit integers 10x faster on avx2 I'd be
       | so thrilled.
        
         | wtallis wrote:
         | > it's the avx we can reliably use in open source
         | 
         | I'm not sure what you mean by that. You can't assume the
         | presence of AVX or AVX2 without explicitly checking for it,
         | because Intel was still disabling those features on new low-end
         | Pentium and Celeron parts at least a recently as _Comet Lake_
         | (2020). Sure, AVX2 support is much more widespread than AVX512
         | support, but that has nothing to do with open-source and it 's
         | a bit strange to describe that in terms of reliability.
        
           | jart wrote:
           | Some of us like to think of ourselves writing open source as
           | serving the public interest. It's hard to do that if you're
           | focusing on an ISA the public doesn't have. I haven't seen
           | any consumer hardware that has AVX512.
        
             | eklitzke wrote:
             | Lots of consumer hardware has AVX512 (I have an 11th gen
             | Intel laptop CPU that has it).
             | 
             | Regardless, Clang and GCC both support function multi-
             | versioning where you supply multiple versions of a function
             | and specify which CPU features each implementation needs,
             | and the best version of the function will be selected at
             | runtime based on the results of cpuid. For example, you can
             | use this to write a function that uses no vector
              | instructions, SSE, AVX2, or AVX512, and all versions will be
             | compiled into the executable and the best version you can
             | actually use will be selected at runtime. This is how glibc
             | selects the optimal version of functions like
             | memset/memcpy/memcmp, as there are vector instructions that
             | significantly speed these functions up.
             | 
             | There's an LWN article about the feature if you're curious
             | how it works: https://lwn.net/Articles/691932/
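              | 
              | A minimal sketch of the target_clones flavor of this
              | (supported by GCC 6+ and recent Clang; names are made up):
              | the compiler emits one clone per listed target plus a
              | resolver that picks the best one via cpuid at load time.
              | 
              |   #include <cstddef>
              |   
              |   __attribute__((target_clones("avx512f", "avx2", "default")))
              |   void AddArrays(float* dst, const float* a, const float* b,
              |                  size_t n) {
              |     // Plain loop; each clone is auto-vectorized for its target.
              |     for (size_t i = 0; i < n; ++i) dst[i] = a[i] + b[i];
              |   }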
        
             | akelly wrote:
             | Intel 10th gen mobile and 11th gen mobile and desktop,
             | excluding Pentium and Celeron, have AVX-512. And all 12th
             | gen have it on the P cores but not the E cores. If the E
             | cores are enabled then AVX-512 is unavailable.
        
               | monocasa wrote:
                | On 12th gen they disabled it on the P cores too, even
                | with the E cores disabled, via a microcode update. A lot
                | of newer systems don't have access to the older
                | microcode, and microcode doesn't typically let you
                | downgrade.
        
               | wtallis wrote:
               | There are workarounds for downgrading microcode, because
               | the CPU itself doesn't actually have non-volatile storage
               | for microcode updates and relies on the motherboard
               | firmware to upload updates on each boot (and motherboard
               | firmware can often be downgraded, possibly after changing
               | a setting to allow that).
               | 
               | Which is probably why Intel has changed to disabling
               | AVX512 using fuses in more recently manufactured Alder
               | Lake CPUs.
        
               | monocasa wrote:
               | My point with "a lot of newer systems" was that there are
               | motherboards now that completely lack a version of their
               | firmware with the microcode that allows avx-512. There's
               | nothing to downgrade to without an exploit to allow
               | making your own firmware images with mixed and matched
               | components.
        
             | robertlagrant wrote:
             | > Some of us like to think of ourselves as
             | 
             | I don't see how this is relevant to anything.
        
             | janwas wrote:
             | I agree AVX-512 is not exactly widespread on client CPUs
             | but as akelly mentions, it does exist (e.g. Icelake).
             | 
             | What we do is dispatch to the best available instruction
             | set at runtime - that costs only an indirect branch, plus
             | somewhat larger binary and longer compile time.
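              | 
              | Conceptually (hypothetical function names, not Highway's
              | actual HWY_DYNAMIC_DISPATCH machinery), the dispatch boils
              | down to choosing a function pointer once and then calling
              | through it:
              | 
              |   #include <cstddef>
              |   #include <cstdint>
              |   
              |   // Per-target implementations, compiled separately.
              |   void SortAVX512(int32_t* keys, size_t n);
              |   void SortAVX2(int32_t* keys, size_t n);
              |   void SortScalar(int32_t* keys, size_t n);
              |   
              |   using SortFn = void (*)(int32_t*, size_t);
              |   
              |   SortFn ChooseSort() {
              |     __builtin_cpu_init();  // GCC/Clang feature detection
              |     if (__builtin_cpu_supports("avx512f")) return &SortAVX512;
              |     if (__builtin_cpu_supports("avx2")) return &SortAVX2;
              |     return &SortScalar;
              |   }
              |   
              |   void Sort(int32_t* keys, size_t n) {
              |     static const SortFn fn = ChooseSort();  // resolved once
              |     fn(keys, n);  // the indirect branch mentioned above
              |   }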
        
             | wtallis wrote:
             | Even if AVX512 was entirely constrained to server hardware
             | (it's not), how would it be contrary to the public interest
             | for open-source software to take advantage of those
             | instructions?
        
         | janwas wrote:
         | Yes, we can sort 64-bit ints. The speedup on AVX2 is roughly
         | 2/3 of the 10x we see on AVX-512. Longsort appears to be an
         | autovectorized sorting network. That's only going to be
         | competitive or even viable for relatively small arrays
         | (thousands). See comments above on djbsort.
         | 
         | Why not use whichever AVX the CPU has? Not a problem when using
         | runtime dispatch :)
        
           | heavenlyblue wrote:
           | What about performance-per-watt?
        
             | celrod wrote:
             | The bigger the vectors, the better the performance per
             | watt.
        
       | olliej wrote:
       | Re-implementation of stuff with SIMD is always amazing to me. I
       | have done stuff with SIMD before: 4 element float vector
       | operations; basic arithmetic on arrays of floats.
       | 
       | Those are things that are super simple, once you get past the
       | scary mystique of SIMD.
       | 
       | It's stuff like this that should be getting all the witch burning
       | :D
        
       | MrYellowP wrote:
       | Man, I was thinking about this problem and came up with some
       | weird solutions.
       | 
        | I actually thought all this stuff had already been solved?
        | 
        | Is there a point in still trying to improve the efficiency of
        | sorting?
       | 
       | I can do that! I'd love such a challenge if there was a point to
       | it!
        
         | joebob42 wrote:
          | Sorting is one of those things that computers spend so much
          | time doing that even a relatively small improvement over the
          | state of the art can have a big impact.
         | 
         | On the one hand this means if you can come up with a better way
         | to do it in some case that's awesome and could be very
         | valuable. On the other hand it means a lot of people have
         | thought pretty hard about it and the next improvement may be
         | hard to come up with unless you pick a tight enough subset of
         | the problem.
         | 
         | Definitely go for it though :)
        
       | jerryjerryjerry wrote:
       | This is great, and can definitely improve sort performance in
        | database and big data projects. I can immediately imagine it
        | being a perfect match for my project, the columnar processing
        | engine TiFlash (https://github.com/pingcap/tiflash).
        
       | BeeOnRope wrote:
       | Are there instructions for building & reproducing the performance
       | results?
        
       | VHRanger wrote:
        | I love work like this and SIMD-JSON, which vectorizes what
        | "should be" sequential problems to find massive performance
        | gains.
        
         | charcircuit wrote:
         | Any nonsequential problem can be divided into sequential
          | subproblems.
        
         | singhrac wrote:
         | Another one if you enjoy this sort of thing like me is Raph
         | Levien's work on the stack monoid; this [1] blog post and this
          | [2] arXiv paper are great places to look.
         | 
         | [1]: https://raphlinus.github.io/gpu/2020/09/05/stack-
         | monoid.html [2]: https://arxiv.org/abs/2205.11659
        
       | alecco wrote:
       | > By comparison, the standard library reaches 58/128/117 MB/s on
       | the same CPU, so we have managed a 9-19x speedup depending on the
       | type of numbers.
       | 
       | That's pretty disingenuous. The standard implementation is known
       | to be terribly slow because of constraints. They should compare
       | against current state of the art.
        
         | slackerIII wrote:
         | The paper goes into a lot more details:
         | https://arxiv.org/pdf/2205.05982.pdf
        
           | MontyCarloHall wrote:
           | I don't see any mention in the paper of thermal clock
           | throttling concerns, which can really neuter performance of
           | tools that sustain use of AVX operations over a period of
           | time. For the quick benchmarks presented in the paper, of
           | course it will be faster. What if I'm continuously hammering
           | my CPU with AVX operations? I expect it to severely
           | downclock.
        
             | kadoban wrote:
             | That might be an interesting benchmark, but assuming good
             | cooling isn't exactly unreasonable either.
        
               | MontyCarloHall wrote:
               | In datacenter blade servers (i.e. on a cloud VM), I've
               | noticed up to 50% downclocking due to thermal throttling
               | when running sustained frequent AVX operations.
               | 
               | I'm sure an exotic watercooled setup will fare much
               | better, but those aren't generally what we run in
               | production.
        
             | hinkley wrote:
             | My laptop has been throttling itself for a while, I
             | recently discovered. I had been trying to benchmark some
             | code changes and have given up and am letting the CI
             | machine run them, because my numbers are all over the place
             | and go down with each run.
        
               | bee_rider wrote:
               | One option would be to go into BIOS and see if there's
               | some way of just locking your CPU to one of the lower
               | clock speeds. This will give lower benchmarking numbers
               | of course, but at least they should be fairly stable. (in
               | Linux, it also often possible to tinker with frequencies
               | while the system is running).
               | 
               | Even on a desktop this sort of thing is sometimes
               | necessary, for example my CPU has different clock speeds
               | depending on how many processors are running, so I have
               | to lock it to the all-core clock if I want to see proper
               | parallel speedups.
               | 
               | This might be annoying for day-to-day usage (although,
               | CPUs really are insanely performant nowadays so maybe it
               | will not be too bad).
        
             | jeffbee wrote:
             | On Ice Lake Xeon the penalty for using the AVX-512 features
             | on a single core is -100MHz. If we pessimistically use the
             | slowest part Intel sells, that is a 5% performance penalty
             | (2% on their fastest parts). The speedup from this work is
             | 40-60% compared to AVX2. So you'd be a fool to take the
             | side of the folk myth. AVX-512 works.
             | 
             | By the way the performance penalty for using AVX-512 on
             | multiple cores when the multiple cores were already active
             | is zero. There is no penalty in most server scenarios.
        
               | MontyCarloHall wrote:
               | >On Ice Lake Xeon the penalty for using the AVX-512
               | features on a single core is -100MHz.
               | 
               | That is a penalty due to licensing [0], not thermal
               | throttling. As I wrote elsewhere, I've seen my clockspeed
               | get cut in half across all cores on a physical die when
               | running AVX-heavy operations for a sustained period of
               | time, due to thermal throttling.
               | 
               | [0] https://travisdowns.github.io/blog/2020/08/19/icl-
               | avx512-fre...
        
               | mcronce wrote:
               | The default AVX offset for Ice Lake is indeed only 100MHz
               | (and it doesn't exist starting with Rocket Lake), but
               | 512b SIMD instructions use a lot of power, and as a
               | result generate a lot of heat - so they certainly can
               | cause thermal throttling or throttling due to power
               | limits
        
               | ahh wrote:
               | It's the transition that kills you. Are you doing this
               | full time?
        
               | jeffbee wrote:
               | My full-time thing is more search-y and takes tens of
               | milliseconds so I'm not really sweating power state
               | transitions that take a few micros.
        
         | Rizz wrote:
         | Which they did and showed in the previous sentence. You cannot
         | possibly have missed that.
        
         | throw-away_42 wrote:
         | Also disingenuous to claim they don't make that comparison when
         | it's literally the sentence before the one you quoted.
        
           | alecco wrote:
           | My bad. Apologies.
        
       | cbetti wrote:
       | Not having played with SIMD much myself, does leveraging these
       | instructions for an intensive operation like a sort push other
       | workloads out of the CPU more aggressively than operating on 32
       | or 64 bits at a time would?
       | 
       | In other words, do you have to be more careful when integrating
       | these wide operators to preserve some resources for other
       | operations?
        
         | jeffbee wrote:
         | At least on Intel, AVX has its own functional units and
         | register file, so I would say it's not a major concern. It's
         | possible that running some AVX instructions could be almost or
         | completely free if you weren't using execution port 5 anyway,
         | to the extent that instruction-level parallelism can be
         | considered "free".
         | 
         | If you're really taking a microscope to performance, the main
         | hazards would be intermittently using AVX for only a few
         | instructions, because that might lead to the CPU stopping for a
         | few microseconds to turn the power on and off on the functional
         | units. If you're using them heavily the overall thermal
         | situation might cause a core or package-wide clock rate
         | degradation, but if you have a use case for sustained AVX-512
         | usage this is likely to be a good tradeoff.
        
         | Arkanosis wrote:
         | Pushing them out of the CPU, I don't know, but some SIMD
         | instruction sets on some CPUs have side effects that can
         | negatively affect the performance of other operations. For
         | example, the use of AVX2 / AVX-512 can cause some Intel CPUs to
         | lower their base frequency, thus reducing the performance of
         | simultaneous operations that are not using SIMD.
        
           | [deleted]
        
           | vlovich123 wrote:
           | Is that still true? I was under the impression that post-
           | Haswell CPUs don't have that particular issue.
        
             | mcronce wrote:
             | Not for recent hardware - Ice Lake eliminated the
             | throttling on 256b ops (they already didn't exist for
             | certain 256b ops and all smaller ops), and reduced the
             | throttling to almost nothing for 512b ops. Rocket Lake
             | eliminated the throttling for 512b ops.
             | 
             | https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Do
             | w...
             | 
             | They do use a lot of power (and as a result, generate a lot
             | of heat), so they can still cause thermal throttling, or
             | throttling due to power limits - but there's no more "AVX
             | offset"
        
         | fulafel wrote:
         | Generally yes to the first question, no to the second.
         | 
          | If you want your code to have low perturbation on other
          | concurrent work done by the system, implementing it in an
          | inefficient way doesn't usually help with that goal, since
          | then your code will just be executing all the time because it
          | takes a long time to finish. And you still won't have good
          | control over the execution resources compared to using normal
          | tools like OS scheduling policies to address the goal.
        
         | VHRanger wrote:
         | They do emit a lot of heat, which might actually throttle the
         | CPU overall.
         | 
         | But to my knowledge they use different registers, and when
         | properly pipelined they don't hog the CPU cache like
         | unoptimized algorithms that constantly round trip to RAM.
        
           | saltcured wrote:
           | Right, though your use of the terms "different registers" and
           | "properly pipelined" is quite nuanced for a non-specialist.
           | It leaves a lot to the imagination or prior experience!
           | 
           | If it is a compute-heavy code that was spilling to a stack
           | and now can fit in the SIMD registers, there could be a speed
           | up with less memory traffic. This is analogous to a program's
           | working set either fitting in RAM or spilling to swap while
           | it runs, but at a different level of the memory hierarchy. In
           | the extreme case, your working set could be a fixed set of
           | variables in expensive loops for some sequential algorithm,
           | and so the compiler register placement ability and number of
           | registers available could be the difference between
           | effectively O(1) or O(n) memory accesses with respect to the
           | number of compute steps.
           | 
           | Of course, you can find such transitions between registers/L1
           | cache, L1/L2 cache, L2/L3 cache (if present), local/remote
           | RAM (for modern NUMA machines), etc. This happens as you
           | transition from a pure CPU-bound case to something with
           | bigger data structures, where the program focus has to move
           | to different parts of the data to make progress. Naively
           | holding other things equal, an acceleration of a CPU-bound
           | code will of course mean you churn through these different
           | data faster, which means more cache spills as you pull in new
           | data and have to displace something else. Going back to the
           | earlier poster's question, this cache spilling can have a
           | detrimental effect on other processes sharing the same cache
           | or NUMA node, just like one extremely bloated program can
           | cause a virtual memory swapping frenzy and hurt every other
           | program on the machine.
           | 
           | One form of "pipelining" optimization is to recognize when
           | your loop is repeatedly fetching a lot of the same input data
           | in multiple iterations and changing that to use
           | variables/buffers to retain state between iterations and
           | avoid redundant access. This often happens in convolution
           | algorithms on arrays of multi-dimensional data, e.g. image
           | processing and neural nets and many cellular-automata style
           | or grid-based simulations. Your algorithm slides a "sampling
           | window" along an array to compute an output value as a linear
           | combination of all the neighboring values within the window
           | using some filtering kernel (a fixed pattern of
           | multiplicative scalars/weights). Naive code keeps loading
           | this whole window once for each iteration of the loop to
           | address one output position. It is effectively stateless
           | except for the loops to visit every output. Pipelined code
           | would manage a window buffer and fetch and replace only the
           | fringe of the buffer while holding and reusing inputs from
           | the previous iterations. While this makes the loops more
           | stateful, it can dramatically reduce the number of memory
           | reads and shift a memory-bound code back into a CPU-bound
           | regime.
           | 
           | Other major optimizations for these N-dimensional
           | convolutions are done by the programmer/designer: some
           | convolution kernels are "decomposable" and the expensive
           | multi-dimensional operation can be strength-reduced to a
           | sequence of 1-dimensional convolutions with appropriately
           | chosen filter kernels to produce the same mathematical
           | result. This optimization has nothing to do with SIMD but
           | simplifies the work to embody the operation. Fewer input
           | values to read (e.g. a 1D line of neighbors instead of 2D
           | box) and the number of arithmetic operations to do all the
           | multiply-and-add steps to produce the output. Imagine a 2D
           | filter that operates on an 8x8 window. The 2D convolution has
           | to sample and combine 64 input neighbors per output, while
           | the decomposed filter does 8 neighbors in each axis and so
           | one quarter as many steps total after one pass for the X axis
           | and one pass for the Y axis.
           | 
            | Naively, decomposition is done as separate 1D passes over the
           | data and so performs more memory writes and allocates an
           | additional intermediate array between the original input and
           | final output. This is often a big win in spite of the extra
           | memory demands. It's a lot more coding, but this could also
           | be "pipelined" to some degree in N-dimensions to avoid
           | pushing the entire intermediate results arrays through main
           | memory. Approaches differ, but you can make different
           | tradeoffs for how many intermediate results to store or
           | buffer versus redundant calculation in a less stateful
           | manner.
           | 
           | Generally, as you make the above kinds of optimizations your
           | code becomes more "block-based", loading larger chunks of
           | data with fewer changes to new "random" memory locations.
           | This is very much like how databases and filesystems
           | optimized their access to disk storage in prior decades to
           | avoid random seeks for individual bytes or blocks and to
           | instead use clever structures to support faster loading of
           | sets of blocks or pages. When this is successful, your
           | program does most of its memory access sequentially and
           | achieves higher throughput that is bound by total memory bus
           | bandwidth rather than random access latency limitations.
           | 
           | The final form of "pipelining" matters once you really hit
           | your stride with these optimized, block-based programs. All
           | this access again can cause cache spilling as you sustain
           | high bandwidth memory access. But if you can provide the
           | right hints to the CPU, or if the CPU is clever enough to
           | detect the pattern, it can shift into a different cache
           | management regime. Knowing that you will only access each
           | block of data once and then move on (because you are holding
           | the reusable state in registers), the CPU can use a different
           | prefetching and spilling policy to both optimize for your
           | next memory access and to avoid churning the whole cache with
           | these values you will only read once and then ignore. This
           | reduces the impact of the code on other concurrent tasks
           | which can still enjoy more reasonable caching behavior in
           | spite of the busy neighbor.
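            | 
            | As a concrete (scalar, illustrative) sketch of the
            | decomposition described above: an 8x8 box filter done as a
            | horizontal pass followed by a vertical pass touches 8+8
            | inputs per output instead of 64 (names simplified, border
            | handling omitted):
            | 
            |   #include <cstddef>
            |   #include <vector>
            |   
            |   std::vector<float> Box8x8(const std::vector<float>& img,
            |                             size_t width, size_t height) {
            |     std::vector<float> tmp(img.size()), out(img.size());
            |     // Horizontal pass: sum of 8 neighbors along x.
            |     for (size_t y = 0; y < height; ++y)
            |       for (size_t x = 0; x + 8 <= width; ++x) {
            |         float s = 0.0f;
            |         for (size_t k = 0; k < 8; ++k) s += img[y * width + x + k];
            |         tmp[y * width + x] = s;
            |       }
            |     // Vertical pass over the intermediate: 8 neighbors along y.
            |     for (size_t y = 0; y + 8 <= height; ++y)
            |       for (size_t x = 0; x < width; ++x) {
            |         float s = 0.0f;
            |         for (size_t k = 0; k < 8; ++k) s += tmp[(y + k) * width + x];
            |         out[y * width + x] = s / 64.0f;  // normalize the 8x8 box
            |       }
            |     return out;
            |   }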
        
           | janwas wrote:
           | The CPU has a certain upper bound on power, TDP. That limit
           | is independent of whether it is reached by executing scalar
           | code on lots of cores, or SIMD on a few. One little-known
           | fact is that out of order CPUs burn lots of energy just on
           | scheduling instructions, ~10x more than the operation itself.
           | SIMD amortizes that per-instruction overhead over multiple
           | data elements, and actually increases the amount of useful
           | work done per joule. We're talking 5-10x increase in energy
           | efficiency.
        
       | gww wrote:
       | I am always amazed at the algorithms re-implemented using SIMD.
       | One of my favorites is the Striped Smith-Waterman approach used
       | for sequence alignment. Does anyone have any good resources on
        | learning to use SIMD? I've found it hard to take the "plunge".
        
         | janwas wrote:
         | :) How about Chapter 12 of Agner's awesome manual
         | (https://agner.org/optimize/optimizing_cpp.pdf),
         | http://const.me/articles/simd/simd.pdf,
         | https://en.algorithmica.org/hpc/simd/ ?
         | 
         | Does anyone have others to share?
        
         | magicnubs wrote:
         | The lectures and assignments for Oregon State's CS475 (Parallel
         | Programming) are all available online. [0] There are lectures
         | [1] and a project [2] about SIMD. I really enjoyed the entire
         | course as a survey of parallel and high-performance computing.
         | The full course covers multi-processing, multi-threading,
         | caching, SIMD, GPUs (CUDA and OpenCL) and MPI. The projects are
         | in C/C++ (along with OpenMP, CUDA and OpenCL). FYI, I think the
         | last two projects use some large research GPU bank that you
         | have to have special access to use, so you'd be out of luck on
         | implementing the projects for those.
         | 
         | [0] https://web.engr.oregonstate.edu/~mjb/cs575/ [1]
         | https://media.oregonstate.edu/media/t/1_7wju0jtq [2]
         | https://web.engr.oregonstate.edu/~mjb/cs575/Projects/proj04....
        
       | thecompilr wrote:
        | I'm the co-author of one of the papers referenced in the blog
        | post (Fast Quicksort Implementation Using AVX Instructions). We
        | did write the AVX512 code back in 2015; we just had nowhere to
        | run it, at least publicly. The paper also very explicitly says
        | that the lookup tables can instead be replaced by the AVX512
        | compress instructions. The code for that paper is available at
        | https://github.com/vkrasnov/avx_qsort
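        | 
        | For context, a rough sketch (made-up names, not the repository's
        | code) of the lookup-table approach on AVX2: the comparison mask
        | indexes a precomputed table of shuffle indices, and vpermd packs
        | the selected lanes to the front - the job a single AVX512
        | compress instruction does on its own.
        | 
        |   #include <immintrin.h>  // requires -mavx2 (and popcnt)
        |   #include <array>
        |   #include <cstdint>
        |   
        |   // For each 8-bit mask, the lane indices of its set bits, packed left.
        |   static std::array<std::array<int32_t, 8>, 256> BuildTable() {
        |     std::array<std::array<int32_t, 8>, 256> t{};
        |     for (int mask = 0; mask < 256; ++mask) {
        |       int out = 0;
        |       for (int lane = 0; lane < 8; ++lane)
        |         if (mask & (1 << lane)) t[mask][out++] = lane;
        |     }
        |     return t;
        |   }
        |   
        |   // Writes the keys below the pivot, packed, to *left; advances left.
        |   // Always stores 8 lanes, so the destination needs 8 ints of slack.
        |   void CompressBelowPivot(const int32_t* keys, int32_t pivot,
        |                           int32_t*& left) {
        |     static const auto table = BuildTable();
        |     const __m256i v = _mm256_loadu_si256(
        |         reinterpret_cast<const __m256i*>(keys));
        |     const __m256i lt = _mm256_cmpgt_epi32(_mm256_set1_epi32(pivot), v);
        |     const int mask = _mm256_movemask_ps(_mm256_castsi256_ps(lt));
        |     const __m256i idx = _mm256_loadu_si256(
        |         reinterpret_cast<const __m256i*>(table[mask].data()));
        |     _mm256_storeu_si256(reinterpret_cast<__m256i*>(left),
        |                         _mm256_permutevar8x32_epi32(v, idx));
        |     left += _mm_popcnt_u32(static_cast<unsigned>(mask));
        |   }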
        
       | slackerIII wrote:
       | What would it take for something like this to make it into
       | Postgres?
        
         | polskibus wrote:
          | See comment: https://news.ycombinator.com/item?id=31624451
        
         | VHRanger wrote:
         | unlikely - postgres stores data row-wise and this assumes
         | sequentially stored columns. They even mention that issue in
         | the blog post.
         | 
          | It would be more likely to show up in something like Apache
          | Arrow, which is designed to be columnar in order to leverage
          | these sorts of tricks.
        
           | Denvercoder9 wrote:
           | The data in indexes is stored columnar by most RDBMSes, as
           | far as I know.
        
             | wenc wrote:
             | Typically a rowstore index is a B-tree data structure (on a
             | column) that points to data pages in a rowstore dataset.
             | 
             | It's not technically columnar as such.
             | 
             | You can sort with a rowstore index today, but I imagine you
             | have to materialize the index in some way to take advantage
             | of vectorization.
        
             | ddorian43 wrote:
             | It actually is not.
        
             | kjeetgill wrote:
             | I mean, I guess you could call a B-Tree over a subset of
             | columns "columnar"... And bit-map indexes are by columns
             | too -- but that's really bending the definition of what
             | most people mean by columnar storage in databases.
        
             | janwas wrote:
             | I don't know about "most", but here is a list:
             | https://en.wikipedia.org/wiki/List_of_column-
             | oriented_DBMSes
        
               | ComputerGuru wrote:
               | No, this is a list of columnar database systems - gp was
               | saying the index, which is often (but not always) stored
               | separately from the main db records.
        
               | janwas wrote:
               | Ah, indeed, my mistake. FWIW I've observed increasing
               | numbers of papers over the past couple of years
               | mentioning entirely columnar databases.
        
           | samwillis wrote:
           | There are various Postgres extensions (such as Citus) that
           | are columnar, so with them it should be possible.
        
       | [deleted]
        
       | aaaaaaaaaaab wrote:
        | This works only with integers, right? Then radix sort is still
        | gonna be faster on sufficiently large inputs...
        
       | carbocation wrote:
       | Looks like djb is giving it a spin:
       | https://twitter.com/hashbreaker/status/1533201734369103872
        
         | sydthrowaway wrote:
         | And he shat all over Google yet again
        
           | aaaaaaaaaaab wrote:
           | But he's right.
        
           | viraptor wrote:
           | He's just investigating how to reproduce the claimed result.
           | Not sure where you got your take from.
        
       ___________________________________________________________________
       (page generated 2022-06-04 23:00 UTC)