[HN Gopher] Designing a SIMD Algorithm from Scratch ___________________________________________________________________ Designing a SIMD Algorithm from Scratch Author : ingve Score : 303 points Date : 2023-11-28 07:12 UTC (15 hours ago) (HTM) web link (mcyoung.xyz) (TXT) w3m dump (mcyoung.xyz) | gbin wrote: | This is a pretty good trial for Rust Simd! What were the most | surprising quirks of it when you inspected the generated code? | celeritascelery wrote: | Not OP, but one thing that surprised me was that if you are doing | rust Simd in a library, and part of the code is marked | #[inline] but others are not, you might see catastrophic | performance regressions. We saw an issue where the SIMD version | was over 10x slower because we missed marking one function as | inline. Essentially rustc converted it from an intrinsic to a | regular function call. | | https://github.com/rust-lang/rust/issues/107617#issuecomment... | uo21tp5hoyg wrote: | Wow I really like that little mini-map on the right... | joennlae wrote: | +1. Does someone know how to do that? | progbits wrote: | Firefox only: https://news.ycombinator.com/item?id=36757542 | m1el wrote: | The minimap contains a copy of the content, but with | `transform: scale`. The rest is handling `window.onscroll` | and mouse events on the overlay. | Klaster_1 wrote: | Found a canvas-based library for this: | https://larsjung.de/pagemap/. Definitely not what OP uses, | where the minimap is a shrunk copy of the content markup, | with all the drawbacks, such as page search finding the same | item twice. | orlp wrote: | The author should really add at least aria-hidden="true" to | the minimap element. | orf wrote: | Interesting article! Right at the start, the first example of a | non-vectorized popcnt implementation is said to produce | "comically bad code, frankly", but in release mode with the | native target CPU, it does appear to be vectorising the function | fairly ok? 
| | https://godbolt.org/z/WE1Eq65jY | saagarjha wrote: | I mean ideally you'd probably want this to be lowered to a | popcnt instruction. | celrod wrote: | Yes, just because you see SIMD instructions doesn't mean the | code is fast. You need the correct instructions. | | Not relevant to this example, which is `popcount`ing only a | single number, but AVX512 did also introduce SIMD popcount | instructions for getting many inputs at a time. Also true for | other useful bit-funs like leading or trailing zeros. So if | you're using zmm registers, you can do way better than that | godbolt example. | eesmith wrote: | The following should produce equivalent output:
|         pub fn popcnt(mut x: u32) -> u32 {
|             x.count_ones()
|         }
| which gets compiled to:
|         example::popcnt:
|                 popcnt  eax, edi
|                 ret
| For large bit vectors, an AVX2 implementation can outperform | POPCNT. See "Faster Population Counts Using AVX2 Instructions" | at https://academic.oup.com/comjnl/article/61/1/111/3852071 | | 32 bits is not large enough, and the code Rust produces is | indeed comically bad. | orf wrote: | Would the equivalent C++ code run through LLVM generate a | different output? | asb wrote: | LLVM has a LoopIdiomRecognize pass which will, amongst | other patterns, try to detect a loop implementing popcount | [1] and convert it to the llvm.ctpop.* intrinsic [2]. I've | not looked at why it's not matching this loop, but the | sibling comment suggesting to just use `.count_ones` seems | like the right answer for Rust. In C or C++ I'd strongly | recommend using `__builtin_popcount` (provided by both | Clang and GCC). | | [1]: https://github.com/llvm/llvm-project/blob/08a6968127f04a40d7... | [2]: https://llvm.org/docs/LangRef.html#llvm-ctpop-intrinsic | celrod wrote: | `std::popcount` was added to the stl in C++20 | https://en.cppreference.com/w/cpp/numeric/popcount | asb wrote: | Thanks for pointing that out - definitely a better choice | for C++. | the8472 wrote: | autovectorization is hit and miss. 
I recently wrote this which | has to count the bits in a result mask of vector operations and | it turns into popcnt just fine. | | https://godbolt.org/z/zT9Whcnco | dist1ll wrote: | Great article, and it's pretty interesting to see portable simd | in use. I tried reproducing the benchmarks on a Zen 3 system, and | I'm getting identical speedups. On my M1 mbp, the perf gains | start around 1.4x, gradually increasing to 2x @ 110 bytes of | input length. A bit less gains than on x86_64, but I'd say it | fulfils its goal. | | Although looking at the code, this article confirms my experience | that Rust has rather poor ergonomics for simd and pointer-related | work (and performance engineering in general). | pcwalton wrote: | I disagree with the ergonomics being poor. C++ SSE intrinsics | are significantly worse, with ugly underscores and names that | are hard to remember. | dist1ll wrote: | Being more ergonomic than C++ is a low bar anyways, we should | hold ourselves to a higher standard. For starters, field | offset syntax is painful (see Gankra's post on this topic) | and pointer manipulation lacks syntactic sugar. I also | believe unsigned integer promotion and UFCS would improve | ergonomics with high-perf code (although I'm aware this is | not going to be included in Rust). | lifthrasiir wrote: | At that point it is better to have some kind of DSL that | should _not_ be in the main language, because it would | target a much lower level than a typical program. The best | effort I've seen in this scene was Google's Highway [1] | (not to be confused with HighwayHash) and I even once | attempted to recreate it in Rust, but it is still a bit far | from my ideal. | | [1] https://github.com/google/highway | janwas wrote: | Thanks :) We're always curious to hear how it could be | improved? | superjan wrote: | I agree about having a DSL. It may not need to be low | level, you could have an array language like APL/J. 
But | GLSL would work for me too, the important thing is to | lure developers to expose parallelism. | flohofwoe wrote: | Check out the Clang extended vector extension for C, IMHO | that's the perfect way to add portable SIMD to a language. | Doing it through a library will always be clumsy. | lifthrasiir wrote: | It quickly runs into __builtin functions though, which is | not very different from intrinsics with a better and | portable name. Rust's `Simd` type is much more consistent | with corresponding scalar types compared to that. | flohofwoe wrote: | True, I think the builtins were necessary where the | existing C operators don't map well to a specific SIMD | feature (from what I've seen it's mostly "reduce" actions | to reduce SIMD lanes into a single value). Still, a lot | can be done without having to pepper the code with | builtin functions, and the result is much more readable | than purely intrinsic-based code. | | But a "new" language like Rust would have the freedom to | add special language syntax and operators for such things | instead of going the C++ way of "just add it to the | stdlib" (which IMHO is almost always the wrong place... | it should either go into a regular cargo dependency, or | into the language itself). | vt240 wrote: | I loved the Intel Cilk Plus project. I was sad to see it | was abandoned. It always felt like a very natural syntax at | least to me. | secondcoming wrote: | Those intrinsics are specified by Intel, not C++ | bottled_poe wrote: | As if semantics are a significant productivity factor, get | off ya high horse. | cmrdporcupine wrote: | Haven't read through the article yet, but as a Rust eng... I | sort of agree, but ... I mean, pointer & raw memory related | work is _deliberately_ kneecapped in the name of safety, the | language wants you to really think about what you're doing. | | But yep portable SIMD in Rust is still not a good story, | compared to C++. 
And getting down to raw byte regions, pointer | & buffer manipulation, etc requires becoming comfortable with | Pin, MaybeUninit, etc. Both portable simd and allocator_api | have been sitting unstable for years. And the barrier to entry | is a bit higher, and it's more awkward ... mostly on purpose | | But there's nothing stopping one from building one's own | abstractions (or using 3rd party crates etc) to make these | things more ergonomic within one's own program? | dzaima wrote: | Small note - while the compiler wasn't able to optimize that | specific popcount implementation to a single instruction, it can | for some other implementations, though it's of course very picky: | https://godbolt.org/z/T69KxWWW8 | Const-me wrote: | > _mm256_cvtps_epu32 that represent a low-level operation in a | specific instruction set (this is a float to int cast from AVX2) | | That instruction is from AVX-512. There's no AVX2 float to int | cast. There is one in AVX1, but the int is signed and the | instruction is _mm256_cvtps_epi32. | mattsan wrote: | In Lua we have indexes off by one and in SIMD we have keyboard | letters off by one | alex_suzuki wrote: | Awesome writeup. Leaves me with the distinct impression that | "I'll never be this smart". | hnisoss wrote: | Ah, it's just not your field of work. Just like the average person | is not a software engineer or physicist or... you get the point. A few | months of dedicated study and you would be able to do similar | for sure. | VitruviusBen wrote: | Dunning Kruger effect? | cmrdporcupine wrote: | If you ever had occasion to have an employer and/or project | where this was called for, you could probably "be this smart"; | it just comes down to interest and necessity. | | I dip in and out of these waters (performance optimization, | more systems bare metal engineering), but basically on personal | projects. I wish I had jobs where it was more called for, but | it's not what most industry work needs. 
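To ground the popcount sub-thread above: on targets with the popcnt feature enabled, Rust's stable `u32::count_ones` lowers to a single instruction via `llvm.ctpop`, while a hand-rolled loop depends on LLVM's idiom recognition kicking in. A minimal sketch (the name `popcnt_loop` is made up for illustration, not from the article):

```rust
// Naive bit-at-a-time popcount, similar in spirit to the article's
// scalar example; LLVM's LoopIdiomRecognize may or may not rewrite
// this into a ctpop intrinsic.
fn popcnt_loop(mut x: u32) -> u32 {
    let mut n = 0;
    while x != 0 {
        n += x & 1;
        x >>= 1;
    }
    n
}

fn main() {
    for x in [0u32, 1, 0b1011, u32::MAX] {
        // `count_ones` is the intrinsic-backed path; both must agree
        // on every input even when their codegen differs wildly.
        assert_eq!(popcnt_loop(x), x.count_ones());
    }
    println!("ok");
}
```

Whether either version becomes a literal `popcnt` still depends on `-C target-cpu`/`-C target-feature`, as the godbolt links in the thread show.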
| anonzzzies wrote: | Try doing AoC '23 in APL/j/k, BQN or Python/numpy (meaning, not | idiomatic python, but everything with numpy) or cuda etc. It's | fun and it will teach you these types of smarts; a lot of the | post is very natural on how you think about solving things in | those languages. After a while you start seeing problems in that | form. | corysama wrote: | https://fgiesen.wordpress.com/2016/02/05/smart/ | anonymoushn wrote: | How does this compare to fastbase64[0]? Great article, I'm happy | to see this sort of thing online. I wish I could share the | author's optimism about portable SIMD libraries. | | [0]: https://github.com/lemire/fastbase64 | whalesalad wrote: | the minimap on this page is so cool | Aardwolf wrote: | It is crazy sometimes how you try to program something the best | you ever could with classical C++, and then someone comes along | and makes a version using SIMD that is over 10x faster (but less | portable code). | | I really wish compilers were better at auto vectorization. And | that some support were added for annotations in the language to | locally allow reordering some operations, ... | janwas wrote: | This happens regularly in my experience :) Also, when using a | SIMD wrapper library, it can actually be quite portable in | practice. | kolbe wrote: | There are unfortunately a lot of serial guarantees that are | unavoidable even if compilers could optimize perfectly. For | example: 'for(double v: vec) sum+=v' | | Floating point addition is not associative, so summing each | value in order is not the same as summing every 8th element, | then summing the remainder, which is how SIMD handles it. So | even though this is an obvious optimization for compilers, they | will prioritize the serial guarantee over the optimization | unless you tell it to relax that particular guarantee. | | It's a mess and I agree with janwas: use a library (and in | particular: use Google Highway) or something like Intel's ISPC | when your hot path needs this. 
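kolbe's reassociation point can be made concrete: a vectorizing compiler effectively keeps several independent partial sums (one per lane) and reduces them at the end. A scalar Rust sketch of that shape, written here for illustration (8 accumulators chosen to mirror an 8-lane vector; this is not the article's code):

```rust
// Sequential sum: one accumulator, additions happen strictly in
// order. This is the behavior the language guarantees.
fn sum_sequential(xs: &[f64]) -> f64 {
    let mut tot = 0.0;
    for &x in xs {
        tot += x;
    }
    tot
}

// Reassociated sum: 8 independent partial sums (the shape a SIMD
// loop produces), then a horizontal reduction, then the remainder.
// The result may differ from the sequential sum in the last bits,
// which is exactly why the compiler won't do this on its own.
fn sum_lanes(xs: &[f64]) -> f64 {
    let mut lanes = [0.0f64; 8];
    let chunks = xs.chunks_exact(8);
    let tail = chunks.remainder();
    for chunk in chunks {
        for (lane, &x) in lanes.iter_mut().zip(chunk) {
            *lane += x;
        }
    }
    lanes.iter().sum::<f64>() + tail.iter().sum::<f64>()
}

fn main() {
    let xs: Vec<f64> = (1..=100).map(|i| 1.0 / i as f64).collect();
    let (a, b) = (sum_sequential(&xs), sum_lanes(&xs));
    // The two results agree closely but need not be bit-identical.
    assert!((a - b).abs() < 1e-9);
    println!("{a} vs {b}");
}
```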
| Aardwolf wrote: | Hence my suggestion for language support. Add some language | syntax where you can say: "add these doubles, order doesn't | matter", which allows the compiler to use SIMD | kolbe wrote: | -ffast-math if you want to push the onus on the compiler. | | Problem is you need to enforce that requirement on all user | compilations, and I don't know what the equivalent is for MSVC. | In-language would be nice. | Aardwolf wrote: | -ffast-math is global though and can break other | libraries. I mean local, vectorization-compatible but | portable syntax, without actually writing it (let the | compiler do the work) | kolbe wrote: | Totally fair. I'm with you. | eesmith wrote: | According to | https://stackoverflow.com/questions/40699071/can-i-make-my-c... | you annotate it with: | __attribute__((optimize("-ffast-math"))) | | I tried it with:
|         double sum1(double arr[128]) {
|             double tot = 0.0;
|             for (int i=0; i<128; i++) {
|                 tot += arr[i];
|             }
|             return tot;
|         }
|
|         __attribute__((optimize("-ffast-math")))
|         double sum2(double arr[128]) {
|             double tot = 0.0;
|             for (int i=0; i<128; i++) {
|                 tot += arr[i];
|             }
|             return tot;
|         }
| The Compiler Explorer (gcc 13.2, --std=c++20 | -march=native -O3) generates two different bodies for | those:
|         sum1(double*):
|                 lea     rax, [rdi+1024]
|                 vxorpd  xmm0, xmm0, xmm0
|         .L2:
|                 vaddsd  xmm0, xmm0, QWORD PTR [rdi]
|                 add     rdi, 32
|                 vaddsd  xmm0, xmm0, QWORD PTR [rdi-24]
|                 vaddsd  xmm0, xmm0, QWORD PTR [rdi-16]
|                 vaddsd  xmm0, xmm0, QWORD PTR [rdi-8]
|                 cmp     rax, rdi
|                 jne     .L2
|                 ret
|         sum2(double*):
|                 lea     rax, [rdi+1024]
|                 vxorpd  xmm0, xmm0, xmm0
|         .L6:
|                 vaddpd  ymm0, ymm0, YMMWORD PTR [rdi]
|                 add     rdi, 32
|                 cmp     rax, rdi
|                 jne     .L6
|                 vextractf64x2   xmm1, ymm0, 0x1
|                 vaddpd  xmm1, xmm1, xmm0
|                 vunpckhpd       xmm0, xmm1, xmm1
|                 vaddpd  xmm0, xmm0, xmm1
|                 vzeroupper
|                 ret
| It is still compiler-specific and non-portable, but at | least it is not global. 
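The reason sum1 and sum2 in eesmith's comment are allowed to differ is that f64 addition is not associative: changing the grouping changes the rounding. A tiny demonstration, in Rust for consistency with the article:

```rust
fn main() {
    // Classic example: the grouping changes the rounding,
    // so the two expressions produce different f64 values.
    let left = (0.1_f64 + 0.2) + 0.3;
    let right = 0.1 + (0.2 + 0.3_f64);
    assert_ne!(left, right);
    println!("{left:.17} vs {right:.17}");
    // A vectorized (reassociated) sum is therefore a different
    // function from the sequential one, even though both "sum".
    // -ffast-math tells the compiler you accept that difference.
}
```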
| gumby wrote: | > It is crazy sometimes how you try to program something the | best you ever could with classical C++, and then someone comes | along and makes a version using SIMD that is over 10x faster | (but less portable code). | | But that's one of the points of a systems programming language | (of which C++ is one) -- it tries to be as efficient as possible | while staying portable, but makes it easy to do target-specific | programming when required. | | > I really wish compilers were better at auto vectorization | | FORTRAN compilers sure are, since aliasing is not allowed. C++ | is kneecapped by following C's memory model. | secondcoming wrote: | It's not just aliasing. gcc is quite bad at autovectorising | in general, clang is much better. | | Here's a trivial algorithm that gcc borks over because 'bool' | is used instead of 'int', and even after fixing that, the | codegen isn't great: | | https://godbolt.org/z/hf7szo78E | pklausler wrote: | Fortran compilers can assume that (many) dummy arguments to a | particular procedure reference do not alias each other, true. | But there are many other ways in which aliasing can happen in | a conforming Fortran program. | Aardwolf wrote: | But why is C++ refusing to add "restrict" to its spec? C has | restrict in its spec, C++ has not. Of course compilers | inherit restrict from C, but it's not super portable due to | C++ not officially supporting it. But why?! | dragontamer wrote: | You could just use CUDA, which is C++ designed for GPUs, the | ultimate SIMD machine of today. | | Or ROCm (basically CUDA but for AMD). | | I always was a fan of Microsoft's C++AMP though. I thought that | was easiest to get into. Too bad it never stuck though. | cmovq wrote: | Good SIMD code requires careful consideration of how your data | is laid out in memory. 
This makes auto vectorization really | difficult since the compiler can't fix your data for you | outside of very local contexts | intalentive wrote: | SIMD is pretty intuitive if you've used deep learning libraries, | NumPy, array based or even functional languages | westurner wrote: | NumPy roadmap: https://numpy.org/neps/roadmap.html : | | > _Improvements to NumPy's performance are important to many | users. We have focused this effort on Universal SIMD (see NEP | 38 -- Using SIMD optimization instructions for performance) | intrinsics which provide nice improvements across various | hardware platforms via an abstraction layer. The infrastructure | is in place, and we welcome follow-on PRs to add SIMD support | across all relevant NumPy functions_ | | "NEP 38 -- Using SIMD optimization instructions for | performance" (2019) https://numpy.org/neps/nep-0038-SIMD- | optimizations.html#nep3... | | NumPy docs > CPU/SIMD Optimizations: | https://numpy.org/doc/stable/reference/simd/index.html | | std::simd: https://doc.rust-lang.org/std/simd/index.html | | "Show HN: SimSIMD vs SciPy: How AVX-512 and SVE make SIMD nicer | and ML 10x faster" (2023-10) | https://news.ycombinator.com/item?id=37808036 | | "Standard library support for SIMD" (2023-10) | https://discuss.python.org/t/standard-library-support-for-si... ___________________________________________________________________ (page generated 2023-11-28 23:00 UTC)
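On cmovq's data-layout point, the usual illustration is array-of-structs vs struct-of-arrays: with SoA, each field is contiguous in memory and maps directly onto SIMD lanes, which is something no compiler can rearrange for you. A hedged Rust sketch (the type and function names are made up for illustration):

```rust
// Array-of-structs: x, y, z interleaved in memory; a loop over
// just `x` strides through memory and vectorizes poorly.
struct PointAos { x: f32, y: f32, z: f32 }

// Struct-of-arrays: each field is contiguous, so a loop over `x`
// is a dense streaming access that maps directly onto SIMD lanes.
struct PointsSoa { x: Vec<f32>, y: Vec<f32>, z: Vec<f32> }

fn sum_x_aos(pts: &[PointAos]) -> f32 {
    pts.iter().map(|p| p.x).sum()
}

fn sum_x_soa(pts: &PointsSoa) -> f32 {
    pts.x.iter().sum()
}

fn main() {
    let aos: Vec<PointAos> =
        (0..4).map(|i| PointAos { x: i as f32, y: 0.0, z: 0.0 }).collect();
    let soa = PointsSoa {
        x: vec![0.0, 1.0, 2.0, 3.0],
        y: vec![0.0; 4],
        z: vec![0.0; 4],
    };
    // Same answer either way; only the memory traffic differs.
    assert_eq!(sum_x_aos(&aos), sum_x_soa(&soa));
    println!("{}", sum_x_soa(&soa));
}
```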