[HN Gopher] Vectorized and performance-portable Quicksort
___________________________________________________________________
 
Vectorized and performance-portable Quicksort
 
Author : slackerIII
Score  : 321 points
Date   : 2022-06-04 16:48 UTC (6 hours ago)
 
(HTM) web link (opensource.googleblog.com)
(TXT) w3m dump (opensource.googleblog.com)
 
  | DeathArrow wrote:
  | This might interest someone: "Vectorized Operations extension
  | for PostgreSQL"
  |
  | https://github.com/postgrespro/vops
 
  | polskibus wrote:
  | This looks awesome but could do with a bit of productionizing -
  | for example, ready-to-use binaries. Do you know if there are
  | any plans to merge this work with core Postgres?
 
  | orlp wrote:
  | SIMD-based sorting isn't new, e.g. djbsort:
  | https://sorting.cr.yp.to/. I find it strange that the paper
  | doesn't cite it or compare to it at all.
  |
  | EDIT: to be fair, it does seem that Bramas' first published
  | work on SIMD sorting (that I could find) predates djbsort,
  | https://arxiv.org/abs/1704.08579. A comparison would still be
  | nice though.
 
  | TontonNestor wrote:
  | djbsort uses a sorting network, not a quicksort. It is also
  | designed to run in constant time, which is not the case for the
  | linked blog post. So the constraints are different.
  |
  | This is not my field, so I can't judge the relevance of this,
  | but the author cites the state of the art. It just seems that
  | djbsort is not the state of the art and therefore does not need
  | to be cited.
  |
  | I still don't understand the hype around Bernstein; he is quite
  | relevant in cryptography and cryptographic implementations, but
  | not everything he does is gold.
 
  | janwas wrote:
  | :) FWIW we also use a constant-time sorting network as the base
  | case of the Quicksort recursion; that's a widely used
  | optimization.
 
  | janwas wrote:
  | What exactly should be cited? I had a quick look and djbsort
  | seems to be a constant-time algorithm, presumably a sorting
  | network. They report about 2 GB/s for 768 int32 on a 3 GHz
  | Skylake; IIRC our sorting network reaches 3-4 GB/s, but for a
  | smaller input size (256 int32 or int64). Hard to compare
  | directly, but djbsort is likely slower, and the sorting network
  | is anyway a small fraction of the total time in a Quicksort.
 
  | explaingarlic wrote:
  | The whole "stick SIMD into it and compare it with the stdlib"
  | thing is a really common learning trope. Feels more like a
  | portfolio filler, albeit probably not a useless one.
 
  | janwas wrote:
  | We also compare with state-of-the-art platform-specific code.
  | The interesting thing here is that they turn out to be slower
  | than our approach using portable intrinsics
  | (github.com/google/highway) :)
 
  | mixedCase wrote:
  | The article acknowledges SIMD-based sorting and states what
  | makes this novel.
 
  | nostrademons wrote:
  | It's not new within Google either. I remember poking through
  | Jeff Dean's original MapReduce code (written c. 2003) and
  | finding a SIMD-based QuickSort that was the in-memory portion
  | of the external sort algorithm.
  |
  | It looks like this algorithm is interesting in the specifics of
  | how it works, e.g. the use of a compress-store instruction for
  | partitioning. The algorithm I mentioned above mostly focused on
  | speeding up comparisons through the use of SIMD instructions,
  | rather than changing the partitioning part of the QuickSort
  | algorithm itself. Not sure how that compares to djbsort; I
  | didn't see technical details of the algorithm on the page.
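 
To make the compress-store partitioning nostrademons mentions
concrete, here is a minimal AVX-512 sketch of a single partition
step. The intrinsics are real AVX-512F calls, but the function
itself is illustrative only - it is not taken from the article,
the paper, or vqsort:
 
    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>
 
    // Illustrative partition step: keys < pivot are compress-stored
    // to the front region, the remaining keys to the back region.
    size_t PartitionStep16(const int32_t* in, int32_t pivot,
                           int32_t* lo, int32_t* hi_end) {
      const __m512i v = _mm512_loadu_si512(in);
      const __m512i piv = _mm512_set1_epi32(pivot);
      const __mmask16 lt = _mm512_cmplt_epi32_mask(v, piv);
      const size_t num_lt = _mm_popcnt_u32(lt);
      // One instruction packs the selected lanes and stores them.
      _mm512_mask_compressstoreu_epi32(lo, lt, v);
      _mm512_mask_compressstoreu_epi32(hi_end - (16 - num_lt),
                                       __mmask16(~lt), v);
      return num_lt;  // caller advances lo and retreats hi_end
    }
 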
  | jeffbee wrote:
  | The blog post does seem to be quite explicit about the fact
  | that their contribution is in the field of portability, not
  | novelty.
 
  | janwas wrote:
  | Interesting, I did not know that, thanks for the pointer. There
  | is also an effort underway to update LLVM std::sort to
  | BlockQuicksort, which can use some SIMD instructions similarly
  | to what you describe.
 
  | convolvatron wrote:
  | nothing happened before google
  |
  | https://dl.acm.org/doi/10.1145/113379.113380
 
  | janwas wrote:
  | hah, our paper actually cites "Radix sort for vector
  | multiprocessors" by the same author, also from 1991 :)
 
  | skybrian wrote:
  | Will browsers start using this code for typed array sorting?
  | Maybe they already do?
 
  | xxs wrote:
  | Sorting integers is actually rare in practice, and especially
  | not very useful for a browser.
 
  | skybrian wrote:
  | The method exists. They could make it faster without breaking
  | anything.
  |
  | https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
 
  | wly_cdgr wrote:
  | SpaceChem players have known this for years
 
  | unnouinceput wrote:
  | Quote: "Today we're sharing open source code that can sort
  | arrays of numbers about ten times as fast as the C++
  | std::sort...." and proceeds to explain about SIMD instructions.
  |
  | Well, to me it sounds like C++ std::sort is simply lacking an
  | optimization which could very easily be implemented in the next
  | iteration of the compiler/linker. I had high hopes this would
  | be a real breakthrough algorithm, not the lack of a simple
  | optimization using some instruction set the compiler/linker
  | didn't implement. If they wanted, CPU vendors could make an
  | even more awesome and faster instruction set, let's say SIMD2,
  | which would leave this one in the dust.
 
  | janwas wrote:
  | CPU vendors have recently introduced SVE (Arm) and RVV (RISC-V)
  | and we support those in Highway and thus also vqsort.
  |
  | It's definitely interesting to discuss std::sort benefitting
  | from such optimizations, but "very easy" seems rather
  | optimistic.
 
  | benstrumental wrote:
  | Do you expect these optimizations to make their way into
  | std::sort eventually?
 
  | janwas wrote:
  | Author here, happy to discuss.
 
  | tehsauce wrote:
  | Do you have an idea of whether this fast SIMD sort could be
  | applied to get faster sorting on GPUs?
 
  | [deleted]
 
  | TinkersW wrote:
  | I just wanted to say awesome job, I'm going to use it, ignore
  | all the ignorant SIMD haters ;)
 
  | bhl wrote:
  | Regarding columnar databases, is there an optimal data
  | structure in between storing data as rows and storing data as
  | columns that would make it fast to sort and filter by either
  | dimension? Some mix of B-trees or interval trees?
 
  | nickysielicki wrote:
  | Highway looks awesome from a quick glance. Definitely going to
  | set some time aside to play with it and read it.
  |
  | Sort of tangentially related: what sort of tools are you using
  | for disassembly in 2022? Is there anything better than objdump
  | today? What I really want is a version of godbolt that reads
  | compile_commands.json and can show me the godbolt-style
  | color-coded disassembly for an entire binary, any function. I
  | find that when I'm asking these sorts of questions I'm wasting
  | a lot of time fitting my problem into godbolt just to get a
  | rough idea of what's coming out, because it's so much faster
  | (maybe 10x) to read the color-tagged version from godbolt than
  | it is to do the same with objdump.
  |
  | Edit: along the same lines, what can people do (or rather, what
  | do you do) about situations like this:
  | https://github.com/google/highway/blob/master/hwy/examples/b...
  | where things might get better in the future but for now we have
  | to implement it another way? How do you avoid regressions and
  | how do you track workarounds? I want some sort of database that
  | tries competing implementations on every build and emits a
  | compiler warning when things get better. How are you dealing
  | with this?
 
  | janwas wrote:
  | Thanks! Feel free to raise GitHub issues if you'd like to
  | ask/discuss anything.
  |
  | I'm also a huge fan of Godbolt/Compiler Explorer. Highway is
  | integrated there, so you can just copy in your functions.
  | Here's the last throwaway test that I was using:
  | https://gcc.godbolt.org/z/5azbK95j9
  |
  | > things might get better in the future but for now we have to
  | implement it another way
  |
  | There are several possible answers. 1) For the issue you
  | pointed to, that we cannot have arrays of N vectors, it's a
  | reasonable workaround to instead allocate an array of N
  | vectors' worth of elements, and trust the compiler to elide
  | unnecessary Load/Store. This often works using clang, and would
  | avoid having to manually unroll here. I do prefer to minimize
  | the amount of compiler magic required for good performance,
  | though, so typically we're unrolling manually as shown in that
  | code.
  |
  | 2) If there are compiler bugs, typically the workarounds have
  | to stay in because in the open-source world people are still
  | using that compiler years later.
  |
  | Automatically detecting when things get better is an
  | interesting idea but I am not yet aware of such infrastructure.
 
  | celrod wrote:
  |     // Compiler doesn't make independent sum* accumulators, so
  |     // unroll manually. We cannot use an array because V might
  |     // be a sizeless type. For reasonable code, we unroll 4x,
  |     // but 8x might help (2 FMA ports * 4 cycle latency).
  |
  | That code needs 2 loads per FMA. So a CPU with 2 FMA ports
  | would need at least 4 load ports to be able to feed the 2 FMA
  | ports. Given that most CPUs with 2 FMA ports have just 2 load
  | ports, unrolling by 4 should be more or less ideal.
  |
  | But, ideally, the compiler could make the decision based on the
  | target architecture.
  |
  | Without enabling associative math, it isn't legal to duplicate
  | floating-point accumulators and change the order of the
  | accumulation. Perhaps compiling under
  | `-funsafe-math-optimizations` would help. If you're using GCC,
  | you'll probably need `-fvariable-expansion-in-unroller`, too.
  |
  | I think Highway looks great. I'm sure I'll procrastinate on
  | something important to play with it reasonably soon.
 
  | superjan wrote:
  | Great work! Does it also support sorting object pointers by a
  | numeric key?
 
  | mchusma wrote:
  | Very cool work! What do you see as the typical process for
  | getting code like this into use in well-known codebases? I'm
  | trying to get a sense of when a typical engineer (operating
  | with higher-level languages or libraries) might get to take
  | advantage of this.
  |
  | Thanks for open sourcing!
 
  | janwas wrote:
  | Thanks! Highway is included in Chrome and Firefox. Not sure
  | what you mean by process? It's pretty easy, for example, using
  | git submodules - you tell Git the Highway version you want to
  | depend on, notify your CMake or Bazel build system of the hwy
  | library dependency, and that's it.
  |
  | One requirement is C++11 or later - some people have asked
  | about supporting C but I believe that's infeasible.
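 
A scalar sketch of the multiple-accumulator idea janwas and celrod
discuss above: each accumulator forms an independent dependency
chain, so the FMA units stay busy instead of waiting out the
latency of one serial chain. This is illustrative code, not
Highway's; in the vectorized version each accumulator would be a
full vector rather than a single float.
 
    #include <cstddef>
 
    // Without -ffast-math the compiler may not reassociate the
    // floating-point sum, so it cannot split one accumulator into
    // four - hence the manual unroll into four independent chains.
    float DotUnrolled4(const float* a, const float* b, size_t n) {
      float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
      size_t i = 0;
      for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
      }
      float s = (s0 + s1) + (s2 + s3);
      for (; i < n; ++i) s += a[i] * b[i];  // remainder
      return s;
    }
 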
  | DennisL123 wrote:
  | Well done, Jan!
 
  | BeeOnRope wrote:
  | Interesting post and paper, thanks!
  |
  | Sometimes the state of the art is not found in another paper
  | but somewhere else, e.g., there is vxsort by Dan Shechter
  | (damageboy):
  |
  | https://github.com/damageboy/vxsort-cpp/tree/master
  |
  | He uses a similar approach, and while I'm not sure how it
  | compares to the older Blacher et al. version, I expect it to be
  | in the ballpark.
  |
  | He's written an excellent series of blog posts on the design &
  | implementation:
  |
  | https://bits.houmus.org/2020-01-28/this-goes-to-eleven-pt1
 
  | johntortugo wrote:
  | I think it would be interesting to have a power consumption
  | comparison as well.
 
  | janwas wrote:
  | Definitely, I have it on my TODO list to give turbostat a try.
 
  | SemanticStrengh wrote:
 
  | jart wrote:
  | I wonder how fast it is compared to djbsort
  | https://github.com/jart/cosmopolitan/blob/master/libc/nexgen...
  | and longsort
  | https://github.com/jart/cosmopolitan/blob/e011973593407f576d...
  | djbsort is outrageously fast for 32-bit ints with avx2 (which,
  | unlike avx512, is the avx we can reliably use in open source).
  | But there's never been a clear instruction set to use on Intel
  | / AMD for sorting 64-bit ints that's reliably fast. So if this
  | thing can actually sort 64-bit integers 10x faster on avx2 I'd
  | be so thrilled.
 
  | wtallis wrote:
  | > it's the avx we can reliably use in open source
  |
  | I'm not sure what you mean by that. You can't assume the
  | presence of AVX or AVX2 without explicitly checking for it,
  | because Intel was still disabling those features on new low-end
  | Pentium and Celeron parts at least as recently as _Comet Lake_
  | (2020). Sure, AVX2 support is much more widespread than AVX512
  | support, but that has nothing to do with open source and it's a
  | bit strange to describe it in terms of reliability.
 
  | jart wrote:
  | Some of us like to think of ourselves writing open source as
  | serving the public interest. It's hard to do that if you're
  | focusing on an ISA the public doesn't have. I haven't seen any
  | consumer hardware that has AVX512.
 
  | eklitzke wrote:
  | Lots of consumer hardware has AVX512 (I have an 11th-gen Intel
  | laptop CPU that has it).
  |
  | Regardless, Clang and GCC both support function multi-
  | versioning, where you supply multiple versions of a function
  | and specify which CPU features each implementation needs, and
  | the best version of the function will be selected at runtime
  | based on the results of cpuid. For example, you can use this to
  | write a function that uses no vector instructions, SSE, AVX2,
  | or AVX512, and all versions will be compiled into the
  | executable and the best version you can actually use will be
  | selected at runtime. This is how glibc selects the optimal
  | version of functions like memset/memcpy/memcmp, as there are
  | vector instructions that significantly speed these functions
  | up.
  |
  | There's an LWN article about the feature if you're curious how
  | it works: https://lwn.net/Articles/691932/
 
  | akelly wrote:
  | Intel 10th-gen mobile and 11th-gen mobile and desktop parts,
  | excluding Pentium and Celeron, have AVX-512. And all 12th-gen
  | parts have it on the P cores but not the E cores. If the E
  | cores are enabled, then AVX-512 is unavailable.
 
  | monocasa wrote:
  | On 12th gen they disabled it on the P cores too, even with E
  | cores disabled, via a microcode update. A lot of newer systems
  | don't have access to the older microcode, and microcode doesn't
  | typically let you downgrade.
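 
As a concrete instance of the function multi-versioning eklitzke
describes above, GCC and Clang support a target_clones attribute:
the compiler emits one clone of the function per listed target,
plus a resolver that picks the best clone at load time via cpuid.
The attribute and target names are real; the function is a made-up
example, not code from the thread.
 
    #include <cstddef>
 
    // One source function, several compiled clones; the dynamic
    // linker's ifunc mechanism selects among them using cpuid.
    __attribute__((target_clones("avx512f", "avx2", "sse4.2",
                                 "default")))
    void AddArrays(float* a, const float* b, size_t n) {
      for (size_t i = 0; i < n; ++i) {
        a[i] += b[i];  // auto-vectorized differently per clone
      }
    }
 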
  | wtallis wrote:
  | There are workarounds for downgrading microcode, because the
  | CPU itself doesn't actually have non-volatile storage for
  | microcode updates and relies on the motherboard firmware to
  | upload updates on each boot (and motherboard firmware can often
  | be downgraded, possibly after changing a setting to allow
  | that).
  |
  | Which is probably why Intel has changed to disabling AVX512
  | using fuses in more recently manufactured Alder Lake CPUs.
 
  | monocasa wrote:
  | My point with "a lot of newer systems" was that there are
  | motherboards now that completely lack a version of their
  | firmware with the microcode that allows avx-512. There's
  | nothing to downgrade to without an exploit that allows making
  | your own firmware images with mixed-and-matched components.
 
  | robertlagrant wrote:
  | > Some of us like to think of ourselves as
  |
  | I don't see how this is relevant to anything.
 
  | janwas wrote:
  | I agree AVX-512 is not exactly widespread on client CPUs, but
  | as akelly mentions, it does exist (e.g. Icelake).
  |
  | What we do is dispatch to the best available instruction set at
  | runtime - that costs only an indirect branch, plus a somewhat
  | larger binary and longer compile time.
 
  | wtallis wrote:
  | Even if AVX512 were entirely confined to server hardware (it's
  | not), how would it be contrary to the public interest for
  | open-source software to take advantage of those instructions?
 
  | janwas wrote:
  | Yes, we can sort 64-bit ints. The speedup on AVX2 is roughly
  | 2/3 of the 10x we see on AVX-512. Longsort appears to be an
  | autovectorized sorting network. That's only going to be
  | competitive or even viable for relatively small arrays
  | (thousands of elements). See the comments above on djbsort.
  |
  | Why not use whichever AVX the CPU has? Not a problem when using
  | runtime dispatch :)
 
  | heavenlyblue wrote:
  | What about performance per watt?
 
  | celrod wrote:
  | The bigger the vectors, the better the performance per watt.
 
  | olliej wrote:
  | Re-implementation of stuff with SIMD is always amazing to me. I
  | have done stuff with SIMD before: 4-element float vector
  | operations; basic arithmetic on arrays of floats.
  |
  | Those are things that are super simple, once you get past the
  | scary mystique of SIMD.
  |
  | It's stuff like this that should be getting all the witch
  | burning :D
 
  | MrYellowP wrote:
  | Man, I was thinking about this problem and came up with some
  | weird solutions.
  |
  | I'd actually thought all this stuff had already been solved?
  | Is there a point in still trying to improve the efficiency of
  | sorting?
  |
  | I can do that! I'd love such a challenge if there was a point
  | to it!
 
  | joebob42 wrote:
  | Sorting is one of those things that computers spend so much
  | time doing that even a relatively small improvement over the
  | state of the art can have a big impact.
  |
  | On the one hand, this means that if you can come up with a
  | better way to do it in some case, that's awesome and could be
  | very valuable. On the other hand, it means a lot of people have
  | thought pretty hard about it, and the next improvement may be
  | hard to come up with unless you pick a tight enough subset of
  | the problem.
  |
  | Definitely go for it though :)
 
  | jerryjerryjerry wrote:
  | This is great, and can definitely improve sort performance in
  | database and big data projects. I can immediately imagine it
  | being a perfect match for my project, the columnar processing
  | engine TiFlash (https://github.com/pingcap/tiflash).
 
  | BeeOnRope wrote:
  | Are there instructions for building & reproducing the
  | performance results?
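 
The runtime dispatch janwas describes a few comments up ("costs
only an indirect branch") can be sketched with GCC/Clang's
__builtin_cpu_supports. The builtin is real; the two sort paths
and their names are placeholders, and Highway's actual dispatcher
is more elaborate than this.
 
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
 
    // Hypothetical kernels: stand-ins so the sketch compiles; real
    // versions would be built with different target attributes.
    void SortAvx2(int32_t* k, size_t n)   { std::sort(k, k + n); }
    void SortScalar(int32_t* k, size_t n) { std::sort(k, k + n); }
 
    using SortFn = void (*)(int32_t*, size_t);
 
    SortFn PickSort() {
      return __builtin_cpu_supports("avx2") ? SortAvx2 : SortScalar;
    }
 
    void Sort(int32_t* keys, size_t n) {
      static const SortFn fn = PickSort();  // resolved on first call
      fn(keys, n);  // afterwards: one indirect branch per call
    }
 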
  | VHRanger wrote:
  | I love work like this and SIMD-JSON, which vectorize problems
  | that "should be" sequential in form to find massive performance
  | gains.
 
  | charcircuit wrote:
  | Any nonsequential problem can be divided into sequential
  | subproblems.
 
  | singhrac wrote:
  | Another one, if you enjoy this sort of thing like me, is Raph
  | Levien's work on the stack monoid; this [1] blog post and this
  | [2] arXiv paper are great places to look.
  |
  | [1]: https://raphlinus.github.io/gpu/2020/09/05/stack-monoid.html
  | [2]: https://arxiv.org/abs/2205.11659
 
  | alecco wrote:
  | > By comparison, the standard library reaches 58/128/117 MB/s
  | on the same CPU, so we have managed a 9-19x speedup depending
  | on the type of numbers.
  |
  | That's pretty disingenuous. The standard implementation is
  | known to be terribly slow because of constraints. They should
  | compare against the current state of the art.
 
  | slackerIII wrote:
  | The paper goes into a lot more detail:
  | https://arxiv.org/pdf/2205.05982.pdf
 
  | MontyCarloHall wrote:
  | I don't see any mention in the paper of thermal clock
  | throttling concerns, which can really neuter the performance of
  | tools that sustain use of AVX operations over a period of time.
  | For the quick benchmarks presented in the paper, of course it
  | will be faster. What if I'm continuously hammering my CPU with
  | AVX operations? I expect it to severely downclock.
 
  | kadoban wrote:
  | That might be an interesting benchmark, but assuming good
  | cooling isn't exactly unreasonable either.
 
  | MontyCarloHall wrote:
  | In datacenter blade servers (i.e. on a cloud VM), I've noticed
  | up to 50% downclocking due to thermal throttling when running
  | sustained, frequent AVX operations.
  |
  | I'm sure an exotic watercooled setup would fare much better,
  | but those aren't generally what we run in production.
 
  | hinkley wrote:
  | My laptop has been throttling itself for a while, I recently
  | discovered. I had been trying to benchmark some code changes
  | and have given up and am letting the CI machine run them,
  | because my numbers are all over the place and go down with each
  | run.
 
  | bee_rider wrote:
  | One option would be to go into the BIOS and see if there's some
  | way of just locking your CPU to one of the lower clock speeds.
  | This will give lower benchmarking numbers of course, but at
  | least they should be fairly stable. (In Linux, it is also often
  | possible to tinker with frequencies while the system is
  | running.)
  |
  | Even on a desktop this sort of thing is sometimes necessary;
  | for example, my CPU has different clock speeds depending on how
  | many processors are running, so I have to lock it to the
  | all-core clock if I want to see proper parallel speedups.
  |
  | This might be annoying for day-to-day usage (although CPUs
  | really are insanely performant nowadays, so maybe it won't be
  | too bad).
 
  | jeffbee wrote:
  | On Ice Lake Xeon the penalty for using the AVX-512 features on
  | a single core is -100MHz. If we pessimistically use the slowest
  | part Intel sells, that is a 5% performance penalty (2% on their
  | fastest parts). The speedup from this work is 40-60% compared
  | to AVX2. So you'd be a fool to take the side of the folk myth.
  | AVX-512 works.
  |
  | By the way, the performance penalty for using AVX-512 on
  | multiple cores when the multiple cores were already active is
  | zero. There is no penalty in most server scenarios.
 
  | MontyCarloHall wrote:
  | > On Ice Lake Xeon the penalty for using the AVX-512 features
  | on a single core is -100MHz.
  |
  | That is a penalty due to licensing [0], not thermal throttling.
  | As I wrote elsewhere, I've seen my clockspeed get cut in half
  | across all cores on a physical die when running AVX-heavy
  | operations for a sustained period of time, due to thermal
  | throttling.
  |
  | [0] https://travisdowns.github.io/blog/2020/08/19/icl-avx512-fre...
 
  | mcronce wrote:
  | The default AVX offset for Ice Lake is indeed only 100MHz (and
  | it doesn't exist starting with Rocket Lake), but 512b SIMD
  | instructions use a lot of power, and as a result generate a lot
  | of heat - so they certainly can cause thermal throttling, or
  | throttling due to power limits.
 
  | ahh wrote:
  | It's the transition that kills you. Are you doing this full
  | time?
 
  | jeffbee wrote:
  | My full-time thing is more search-y and takes tens of
  | milliseconds, so I'm not really sweating power-state
  | transitions that take a few micros.
 
  | Rizz wrote:
  | Which they did and showed in the previous sentence. You cannot
  | possibly have missed that.
 
  | throw-away_42 wrote:
  | Also disingenuous to claim they don't make that comparison when
  | it's literally the sentence before the one you quoted.
 
  | alecco wrote:
  | My bad. Apologies.
 
  | cbetti wrote:
  | Not having played with SIMD much myself: does leveraging these
  | instructions for an intensive operation like a sort push other
  | workloads out of the CPU more aggressively than operating on 32
  | or 64 bits at a time would?
  |
  | In other words, do you have to be more careful when integrating
  | these wide operators to preserve some resources for other
  | operations?
 
  | jeffbee wrote:
  | At least on Intel, AVX has its own functional units and
  | register file, so I would say it's not a major concern. It's
  | possible that running some AVX instructions could be almost or
  | completely free if you weren't using execution port 5 anyway,
  | to the extent that instruction-level parallelism can be
  | considered "free".
  |
  | If you're really taking a microscope to performance, the main
  | hazard would be intermittently using AVX for only a few
  | instructions, because that might lead to the CPU stopping for a
  | few microseconds to turn the power on and off on the functional
  | units. If you're using them heavily, the overall thermal
  | situation might cause a core- or package-wide clock rate
  | degradation, but if you have a use case for sustained AVX-512
  | usage this is likely to be a good tradeoff.
 
  | Arkanosis wrote:
  | Pushing them out of the CPU, I don't know, but some SIMD
  | instruction sets on some CPUs have side effects that can
  | negatively affect the performance of other operations. For
  | example, the use of AVX2 / AVX-512 can cause some Intel CPUs to
  | lower their base frequency, thus reducing the performance of
  | simultaneous operations that are not using SIMD.
 
  | [deleted]
 
  | vlovich123 wrote:
  | Is that still true? I was under the impression that post-
  | Haswell CPUs don't have that particular issue.
 
  | mcronce wrote:
  | Not for recent hardware - Ice Lake eliminated the throttling on
  | 256b ops (it already didn't exist for certain 256b ops and all
  | smaller ops), and reduced the throttling to almost nothing for
  | 512b ops. Rocket Lake eliminated the throttling for 512b ops.
  |
  | https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Dow...
  |
  | They do use a lot of power (and as a result generate a lot of
  | heat), so they can still cause thermal throttling, or
  | throttling due to power limits - but there's no more "AVX
  | offset".
 
  | fulafel wrote:
  | Generally yes to the first question, no to the second.
  |
  | If you want your code to have low perturbance on other
  | concurrent work done by the system, implementing it in an
  | inefficient way doesn't usually help with that goal, since then
  | your code will just be executing all the time because it takes
  | a long time to finish. And you still won't have good control of
  | the execution resources compared to using normal tools like OS
  | scheduling policies to address the goal.
 
  | VHRanger wrote:
  | They do emit a lot of heat, which might actually throttle the
  | CPU overall.
  |
  | But to my knowledge they use different registers, and when
  | properly pipelined they don't hog the CPU cache like
  | unoptimized algorithms that constantly round-trip to RAM.
 
  | saltcured wrote:
  | Right, though your use of the terms "different registers" and
  | "properly pipelined" is quite nuanced for a non-specialist. It
  | leaves a lot to the imagination or prior experience!
  |
  | If it is a compute-heavy code that was spilling to a stack and
  | now can fit in the SIMD registers, there could be a speedup
  | with less memory traffic. This is analogous to a program's
  | working set either fitting in RAM or spilling to swap while it
  | runs, but at a different level of the memory hierarchy. In the
  | extreme case, your working set could be a fixed set of
  | variables in expensive loops for some sequential algorithm, and
  | so the compiler's register placement ability and the number of
  | registers available could be the difference between effectively
  | O(1) or O(n) memory accesses with respect to the number of
  | compute steps.
  |
  | Of course, you can find such transitions between registers/L1
  | cache, L1/L2 cache, L2/L3 cache (if present), local/remote RAM
  | (for modern NUMA machines), etc. This happens as you transition
  | from a pure CPU-bound case to something with bigger data
  | structures, where the program focus has to move to different
  | parts of the data to make progress. Naively holding other
  | things equal, an acceleration of a CPU-bound code will of
  | course mean you churn through these different data faster,
  | which means more cache spills as you pull in new data and have
  | to displace something else. Going back to the earlier poster's
  | question, this cache spilling can have a detrimental effect on
  | other processes sharing the same cache or NUMA node, just like
  | one extremely bloated program can cause a virtual memory
  | swapping frenzy and hurt every other program on the machine.
  |
  | One form of "pipelining" optimization is to recognize when your
  | loop is repeatedly fetching a lot of the same input data in
  | multiple iterations, and to change that to use
  | variables/buffers that retain state between iterations and
  | avoid redundant access. This often happens in convolution
  | algorithms on arrays of multi-dimensional data, e.g. image
  | processing and neural nets and many cellular-automata-style or
  | grid-based simulations. Your algorithm slides a "sampling
  | window" along an array to compute an output value as a linear
  | combination of all the neighboring values within the window,
  | using some filtering kernel (a fixed pattern of multiplicative
  | scalars/weights). Naive code keeps loading this whole window
  | once for each iteration of the loop to address one output
  | position. It is effectively stateless except for the loops to
  | visit every output. Pipelined code would manage a window buffer
  | and fetch and replace only the fringe of the buffer while
  | holding and reusing inputs from the previous iterations.
  | While this makes the loops more stateful, it can dramatically
  | reduce the number of memory reads and shift a memory-bound code
  | back into a CPU-bound regime.
  |
  | Other major optimizations for these N-dimensional convolutions
  | are done by the programmer/designer: some convolution kernels
  | are "decomposable", and the expensive multi-dimensional
  | operation can be strength-reduced to a sequence of
  | 1-dimensional convolutions with appropriately chosen filter
  | kernels that produce the same mathematical result. This
  | optimization has nothing to do with SIMD but simplifies the
  | work needed to embody the operation: fewer input values to read
  | (e.g. a 1D line of neighbors instead of a 2D box) and fewer
  | arithmetic operations to do all the multiply-and-add steps that
  | produce the output. Imagine a 2D filter that operates on an 8x8
  | window. The 2D convolution has to sample and combine 64 input
  | neighbors per output, while the decomposed filter does 8
  | neighbors in each axis, and so a quarter as many steps total
  | after one pass for the X axis and one pass for the Y axis.
  |
  | Naively, decomposition is done as separate 1D passes over the
  | data and so performs more memory writes and allocates an
  | additional intermediate array between the original input and
  | final output. This is often a big win in spite of the extra
  | memory demands. It's a lot more coding, but this could also be
  | "pipelined" to some degree in N dimensions to avoid pushing the
  | entire intermediate results arrays through main memory.
  | Approaches differ, but you can make different tradeoffs between
  | how many intermediate results to store or buffer versus
  | redundant calculation in a less stateful manner.
  |
  | Generally, as you make the above kinds of optimizations your
  | code becomes more "block-based", loading larger chunks of data
  | with fewer changes to new "random" memory locations. This is
  | very much like how databases and filesystems optimized their
  | access to disk storage in prior decades to avoid random seeks
  | for individual bytes or blocks and to instead use clever
  | structures to support faster loading of sets of blocks or
  | pages. When this is successful, your program does most of its
  | memory access sequentially and achieves higher throughput that
  | is bound by total memory bus bandwidth rather than random
  | access latency limitations.
  |
  | The final form of "pipelining" matters once you really hit your
  | stride with these optimized, block-based programs. All this
  | access again can cause cache spilling as you sustain high-
  | bandwidth memory access. But if you can provide the right hints
  | to the CPU, or if the CPU is clever enough to detect the
  | pattern, it can shift into a different cache management regime.
  | Knowing that you will only access each block of data once and
  | then move on (because you are holding the reusable state in
  | registers), the CPU can use a different prefetching and
  | spilling policy, both to optimize for your next memory access
  | and to avoid churning the whole cache with values you will only
  | read once and then ignore. This reduces the impact of the code
  | on other concurrent tasks, which can still enjoy more
  | reasonable caching behavior in spite of the busy neighbor.
 
  | janwas wrote:
  | The CPU has a certain upper bound on power, TDP. That limit is
  | independent of whether it is reached by executing scalar code
  | on lots of cores, or SIMD on a few.
  | One little-known fact is that out-of-order CPUs burn lots of
  | energy just on scheduling instructions, ~10x more than the
  | operation itself. SIMD amortizes that per-instruction overhead
  | over multiple data elements, and actually increases the amount
  | of useful work done per joule. We're talking a 5-10x increase
  | in energy efficiency.
 
  | gww wrote:
  | I am always amazed at the algorithms re-implemented using SIMD.
  | One of my favorites is the striped Smith-Waterman approach used
  | for sequence alignment. Does anyone have any good resources on
  | learning to use SIMD? I've found it hard to make the "plunge".
 
  | janwas wrote:
  | :) How about Chapter 12 of Agner's awesome manual
  | (https://agner.org/optimize/optimizing_cpp.pdf),
  | http://const.me/articles/simd/simd.pdf, or
  | https://en.algorithmica.org/hpc/simd/ ?
  |
  | Does anyone have others to share?
 
  | magicnubs wrote:
  | The lectures and assignments for Oregon State's CS475 (Parallel
  | Programming) are all available online. [0] There are lectures
  | [1] and a project [2] about SIMD. I really enjoyed the entire
  | course as a survey of parallel and high-performance computing.
  | The full course covers multiprocessing, multithreading,
  | caching, SIMD, GPUs (CUDA and OpenCL) and MPI. The projects are
  | in C/C++ (along with OpenMP, CUDA and OpenCL). FYI, I think the
  | last two projects use some large research GPU bank that you
  | have to have special access to use, so you'd be out of luck on
  | implementing those projects.
  |
  | [0] https://web.engr.oregonstate.edu/~mjb/cs575/
  | [1] https://media.oregonstate.edu/media/t/1_7wju0jtq
  | [2] https://web.engr.oregonstate.edu/~mjb/cs575/Projects/proj04....
 
  | thecompilr wrote:
  | I'm a co-author of one of the papers referenced in the blog
  | post ("Fast Quicksort Implementation Using AVX Instructions").
  | We did write the AVX512 code back in 2015; we just had nowhere
  | to run it, at least publicly. The paper also very explicitly
  | says that the lookup tables can instead be replaced by the
  | AVX512 compress instructions. The code for that paper is
  | available at https://github.com/vkrasnov/avx_qsort
 
  | slackerIII wrote:
  | What would it take for something like this to make it into
  | Postgres?
 
  | polskibus wrote:
  | See comment: https://news.ycombinator.com/item?id=31624451
 
  | VHRanger wrote:
  | Unlikely - Postgres stores data row-wise, and this assumes
  | sequentially stored columns. They even mention that issue in
  | the blog post.
  |
  | It would be more likely to show up in something like Apache
  | Arrow, which is designed to be columnar to leverage these sorts
  | of tricks.
 
  | Denvercoder9 wrote:
  | The data in indexes is stored columnar by most RDBMSes, as far
  | as I know.
 
  | wenc wrote:
  | Typically a rowstore index is a B-tree data structure (on a
  | column) that points to data pages in a rowstore dataset.
  |
  | It's not technically columnar as such.
  |
  | You can sort with a rowstore index today, but I imagine you'd
  | have to materialize the index in some way to take advantage of
  | vectorization.
 
  | ddorian43 wrote:
  | It actually is not.
 
  | kjeetgill wrote:
  | I mean, I guess you could call a B-tree over a subset of
  | columns "columnar"... And bitmap indexes are by columns too --
  | but that's really bending the definition of what most people
  | mean by columnar storage in databases.
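 
To ground thecompilr's remark above about lookup tables versus the
AVX512 compress instructions: AVX2 has no compress, so a common
workaround is a precomputed permutation table indexed by the
comparison mask. A minimal sketch follows; the intrinsics are real
AVX2 calls, while kCompressTable is a hypothetical 256x8 table of
shuffle indices. AVX-512's vpcompressd collapses all of this into
a single instruction.
 
    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>
 
    // Hypothetical precomputed table: for each 8-bit mask, the lane
    // indices that pack the selected elements to the vector front.
    extern const int32_t kCompressTable[256][8];
 
    // Store the lanes of v that are < pivot, packed to dst.
    size_t CompressStoreAvx2(int32_t* dst, __m256i v, __m256i pivot) {
      const __m256i lt = _mm256_cmpgt_epi32(pivot, v);  // v < pivot
      const int mask = _mm256_movemask_ps(_mm256_castsi256_ps(lt));
      const __m256i idx = _mm256_loadu_si256(
          reinterpret_cast<const __m256i*>(kCompressTable[mask]));
      _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst),
                          _mm256_permutevar8x32_epi32(v, idx));
      return _mm_popcnt_u32(static_cast<unsigned>(mask));
    }
 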
  | janwas wrote:
  | I don't know about "most", but here is a list:
  | https://en.wikipedia.org/wiki/List_of_column-oriented_DBMSes
 
  | ComputerGuru wrote:
  | No, this is a list of columnar database systems - GP was
  | talking about the index, which is often (but not always) stored
  | separately from the main DB records.
 
  | janwas wrote:
  | Ah, indeed, my mistake. FWIW I've observed increasing numbers
  | of papers over the past couple of years mentioning entirely
  | columnar databases.
 
  | samwillis wrote:
  | There are various Postgres extensions (such as Citus) that are
  | columnar, so with them it should be possible.
 
  | [deleted]
 
  | aaaaaaaaaaab wrote:
  | This works only with integers, right? Then radix sort is still
  | going to be faster on sufficiently large inputs...
 
  | carbocation wrote:
  | Looks like djb is giving it a spin:
  | https://twitter.com/hashbreaker/status/1533201734369103872
 
  | sydthrowaway wrote:
  | And he shat all over Google yet again
 
  | aaaaaaaaaaab wrote:
  | But he's right.
 
  | viraptor wrote:
  | He's just investigating how to reproduce the claimed result.
  | Not sure where you got your take from.
___________________________________________________________________
(page generated 2022-06-04 23:00 UTC)