[HN Gopher] Designing a SIMD Algorithm from Scratch
       ___________________________________________________________________
        
       Designing a SIMD Algorithm from Scratch
        
       Author : ingve
       Score  : 303 points
       Date   : 2023-11-28 07:12 UTC (15 hours ago)
        
 (HTM) web link (mcyoung.xyz)
 (TXT) w3m dump (mcyoung.xyz)
        
       | gbin wrote:
       | This is a pretty good trial for Rust Simd! What were the most
       | surprising quirks of it when you inspected the generated code?
        
         | celeritascelery wrote:
         | Not OP, but one thing that surprised me was if you are doing
         | rust Simd in a library, and part of the code is marked
         | #[inline] but others are not you might see catastrophic
         | performance regressions. We saw an issue where the SIMD version
         | was over 10x slower because we missed marking one function as
         | inline. Essentially rustc converted it from an intrinsic to a
         | regular function call.
         | 
         | https://github.com/rust-lang/rust/issues/107617#issuecomment...
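A minimal sketch of the pattern this comment describes (the function name is illustrative, not taken from the linked issue): in a library crate, thin wrappers around intrinsic-like operations should carry `#[inline]`, because without it (and without LTO) the wrapper cannot be inlined across crate boundaries and degrades into a real function call.

```rust
// Hypothetical library wrapper. Without #[inline], a downstream crate
// calling this gets an out-of-line call instead of a single
// instruction (unless LTO is enabled), which can dominate a hot loop.
#[inline]
pub fn popcnt32(x: u32) -> u32 {
    x.count_ones()
}
```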
        
       | uo21tp5hoyg wrote:
       | Wow I really like that little mini-map on the right...
        
         | joennlae wrote:
         | +1. Does someone know how to do that?
        
           | progbits wrote:
           | Firefox only: https://news.ycombinator.com/item?id=36757542
        
           | m1el wrote:
           | The minimap contains a copy of the content, but with
           | `transform: scale`. The rest is handling `window.onscroll`
           | and mouse events on the overlay.
        
           | Klaster_1 wrote:
           | Found a canvas-based library for this:
           | https://larsjung.de/pagemap/. Definitely not what OP uses,
           | where the minimap is a shrunk copy of the content markup,
           | with all the drawbacks, such as page search finding the same
           | item twice.
        
             | orlp wrote:
             | The author should really add at least aria-hidden="true" to
             | the minimap element.
        
       | orf wrote:
       | Interesting article! Right at the start, the first example of a
       | non-vectorized popcnt implementation is said to produce
       | "comically bad code, frankly", but in release mode with the
       | native target CPU, it does appear to be vectorising the function
       | fairly ok?
       | 
       | https://godbolt.org/z/WE1Eq65jY
        
         | saagarjha wrote:
         | I mean ideally you'd probably want this to be lowered to a
         | popcnt instruction.
        
           | celrod wrote:
           | Yes, just because you see SIMD instructions doesn't mean the
           | code is fast. You need the correct instructions.
           | 
           | Not relevant to this example, which is `popcount`ing only a
           | single number, but AVX512 did also introduce SIMD popcount
           | instructions for getting many inputs at a time. Also true for
           | other useful bit-funs like leading or trailing zeros. So if
           | you're using zmm registers, you can do way better than that
           | godbolt example.
        
         | eesmith wrote:
          | The following should produce equivalent output:
          | 
          |     pub fn popcnt(mut x: u32) -> u32 {
          |         x.count_ones()
          |     }
          | 
          | which gets compiled to:
          | 
          |     example::popcnt:
          |         popcnt  eax, edi
          |         ret
         | 
         | For large bit vectors, an AVX2 implementation can outperform
         | POPCNT. See "Faster Population Counts Using AVX2 Instructions"
         | at https://academic.oup.com/comjnl/article/61/1/111/3852071
         | 
         | 32 bits is not large enough, and the code Rust produces is
         | indeed comically bad.
        
           | orf wrote:
           | Would the equivalent c++ code run through LLVM generate a
           | different output?
        
             | asb wrote:
             | LLVM has a LoopIdiomRecognize pass which will, amongst
             | other patterns, try to detect a loop implementing popcount
             | [1] and convert it to the llvm.ctpop.* intrinsic [2]. I've
             | not looked at why it's not matching this loop, but the
             | sibling comment suggesting to just use `.count_ones` seems
             | like the right answer for Rust. In C or C++ I'd strongly
             | recommend using `__builtin_popcount` (provided by both
             | Clang and GCC).
             | 
             | [1]: https://github.com/llvm/llvm-
             | project/blob/08a6968127f04a40d7... [2]:
             | https://llvm.org/docs/LangRef.html#llvm-ctpop-intrinsic
        
               | celrod wrote:
               | `std::popcount` was added to the stl in C++20
               | https://en.cppreference.com/w/cpp/numeric/popcount
        
               | asb wrote:
               | Thanks for pointing that out - definitely a better choice
               | for C++.
        
         | the8472 wrote:
         | autovectorization is hit and miss. I recently wrote this which
         | has to count the bits in a result mask of vector operations and
         | it turns into popcnt just fine.
         | 
         | https://godbolt.org/z/zT9Whcnco
        
       | dist1ll wrote:
       | Great article, and it's pretty interesting to see portable simd
       | in use. I tried reproducing the benchmarks on a Zen 3 system, and
       | I'm getting identical speedups. On my M1 mbp, the perf gains
       | start around 1.4x, gradually increasing to 2x @ 110 bytes of
       | input length. A bit less gains than on x86_64, but I'd say it
       | fulfils its goal.
       | 
       | Although looking at the code, this article confirms my experience
       | that Rust has rather poor ergonomics for simd and pointer-related
       | work (and performance engineering in general).
        
         | pcwalton wrote:
         | I disagree with the ergonomics being poor. C++ SSE intrinsics
         | are significantly worse, with ugly underscores and names that
         | are hard to remember.
        
           | dist1ll wrote:
           | Being more ergonomic than C++ is a low bar anyways, we should
           | hold ourselves to a higher standard. For starters, field
           | offset syntax is painful (see Gankra's post on this topic)
           | and pointer manipulation lacks syntactic sugar. I also
           | believe unsigned integer promotion and UFCS would improve
           | ergonomics with high-perf code (although I'm aware this is
           | not going to be included in Rust).
        
             | lifthrasiir wrote:
             | At that point it is better to have some kind of DSL that
             | should _not_ be in the main language, because it would
             | target a much lower level than a typical program. The best
              | effort I've seen in this scene was Google's Highway [1]
             | (not to be confused with HighwayHash) and I even once
             | attempted to recreate it in Rust, but it is still a bit far
             | from my ideal.
             | 
             | [1] https://github.com/google/highway
        
               | janwas wrote:
               | Thanks :) We're always curious to hear how it could be
               | improved?
        
               | superjan wrote:
               | I agree about having a DSL. It may not need to be low
               | level, you could have an array language like APL/J. But
               | glsl would work for me too, the important thing is to
               | lure developers to expose parallelism.
        
           | flohofwoe wrote:
           | Check out the Clang extended vector extension for C, IMHO
           | that's the perfect way to add portable SIMD to a language.
           | Doing it through a library will always be clumsy.
        
             | lifthrasiir wrote:
             | It quickly runs into __builtin functions though, which is
             | not very different from intrinsics with a better and
             | portable name. Rust `Simd` type is much more consistent
             | with corresponding scalar types compared to that.
        
               | flohofwoe wrote:
               | True, I think the builtins were necessary where the
               | existing C operators don't map well to a specific SIMD
               | feature (from what I've seen it's mostly "reduce" actions
               | to reduce SIMD lanes into a single value). Still, a lot
               | can be done without having to pepper the code with
               | builtin functions, and the result is much more readable
               | than purely intrinsic-based code.
               | 
               | But a "new" language like Rust would have the freedom to
               | add special language syntax and operators for such things
               | instead of going the C++ way of "just add it to the
               | stdlib" (which IMHO is almost always the wrong place...
               | it should either go into a regular cargo dependency, or
               | into the language itself).
        
             | vt240 wrote:
             | I loved the Intel Cilk Plus project. I was sad to see it
             | was abandoned. It always felt like a very natural syntax at
             | least to me.
        
           | secondcoming wrote:
           | Those intrinsics are specified by Intel, not C++
        
           | bottled_poe wrote:
           | As if semantics are a significant productivity factor, get
           | off ya high horse.
        
         | cmrdporcupine wrote:
         | Haven't read through the article yet, but as a Rust eng... I
         | sort of agree, but ... I mean, pointer & raw memory related
         | work is _deliberately_ kneecapped in the name of safety, the
          | language wants you to really think about what you're doing.
         | 
         | But yep portable SIMD in Rust is still not a good story,
         | compared to C++. And getting down to raw byte regions, pointer
         | & buffer manipulation, etc requires becoming comfortable with
         | Pin, MaybeUninit, etc. Both portable simd and allocator_api
         | have been sitting unstable for years. And the barrier to entry
         | is a bit higher, and it's more awkward ... mostly on purpose
         | 
         | But there's nothing stopping one from building one's own
         | abstractions (or using 3rd party crates etc) to make these
         | things more ergonomic within one's own program?
        
       | dzaima wrote:
       | Small note - while the compiler wasn't able to optimize that
       | specific popcount implementation to a single instruction, it can
       | for some other implementations, though it's of course very picky:
       | https://godbolt.org/z/T69KxWWW8
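For reference, one loop formulation that LLVM's idiom recognition is known to target is Kernighan's trick (clear the lowest set bit each iteration); whether a given rustc/LLVM version actually lowers it to a single `popcnt` varies, so check the generated code rather than assuming it:

```rust
// Kernighan's trick: x &= x - 1 clears the lowest set bit, so the
// loop body runs exactly once per set bit.
fn popcnt(mut x: u32) -> u32 {
    let mut n = 0;
    while x != 0 {
        x &= x - 1;
        n += 1;
    }
    n
}
```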
        
       | Const-me wrote:
       | > _mm256_cvtps_epu32 that represent a low-level operation in a
       | specific instruction set (this is a float to int cast from AVX2)
       | 
        | That instruction is from AVX-512; there is no AVX2 float-to-int
        | cast. There is one in AVX1, but there the int is signed and the
        | instruction is _mm256_cvtps_epi32.
        
         | mattsan wrote:
         | In Lua we have indexes off by one and in SIMD we have keyboard
         | letters off by one
        
       | alex_suzuki wrote:
       | Awesome writeup. Leaves me with the distinct impression that
        | "I'll never be this smart".
        
         | hnisoss wrote:
          | Ah, it's just not your field of work. Just like the average
          | person is not a software engineer or physicist or... you get
          | the point. A few months of dedicated study and you would be
          | able to do something similar for sure.
        
           | VitruviusBen wrote:
           | Dunning Kruger effect?
        
         | cmrdporcupine wrote:
         | If you ever had occasion to have an employer and/or project
         | where this was called for, you could probably "be this smart";
         | it just comes down to interest and necessity.
         | 
         | I dip in and out of these waters (performance optimization,
         | more systems bare metal engineering), but basically on personal
         | projects. I wish I had jobs where it was more called for, but
         | it's not what most industry work needs.
        
         | anonzzzies wrote:
         | Try doing AoC '23 in APL/j/k, BQN or Python/numpy (meaning, not
         | idiomatic python, but everything with numpy) or cuda etc. It's
          | fun and it will teach you these types of smarts; a lot of the
          | post maps very naturally onto how you think about solving
          | things in those languages. After a while you start seeing
          | problems in that form.
        
         | corysama wrote:
         | https://fgiesen.wordpress.com/2016/02/05/smart/
        
       | anonymoushn wrote:
       | How does this compare to fastbase64[0]? Great article, I'm happy
       | to see this sort of thing online. I wish I could share the
       | author's optimism about portable SIMD libraries.
       | 
       | [0]: https://github.com/lemire/fastbase64
        
       | whalesalad wrote:
       | the minimap on this page is so cool
        
       | Aardwolf wrote:
       | It is crazy sometimes how you try to program something the best
       | you ever could with classical C++, and then someone comes along
       | and makes a version using SIMD that is over 10x faster (but less
       | portable code).
       | 
        | I really wish compilers were better at auto vectorization, and
        | that languages supported annotations to locally allow
        | reordering some operations.
        
         | janwas wrote:
         | This happens regularly in my experience :) Also, when using a
         | SIMD wrapper library, it can actually be quite portable in
         | practice.
        
         | kolbe wrote:
         | There are unfortunately a lot of serial guarantees that are
         | unavoidable even if compilers could optimize perfectly. For
         | example: 'for(double v: vec) sum+=v'
         | 
         | Floating point addition is not associative, so summing each
         | value in order is not the same as summing every 8th element,
         | then summing the remainder, which is how SIMD handles it. So
         | even though this is an obvious optimization for compilers, they
         | will prioritize the serial guarantee over the optimization
         | unless you tell it to relax that particular guarantee.
         | 
         | It's a mess and I agree with janwas: use a library (and in
         | particular: use Google Highway) or something like Intel's ISPC
         | when your hot path needs this.
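The non-associativity described above is easy to demonstrate with three ordinary literals; regrouping the same additions changes the rounding and thus the result:

```rust
// Floating-point addition is not associative: (0.1 + 0.2) + 0.3 and
// 0.1 + (0.2 + 0.3) round differently and produce different f64s,
// which is why a compiler cannot silently reorder a serial sum.
fn assoc_demo() -> (f64, f64) {
    let left = (0.1_f64 + 0.2) + 0.3;
    let right = 0.1_f64 + (0.2 + 0.3);
    (left, right)
}
```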
        
           | Aardwolf wrote:
           | Hence my suggestion for language support. Add some language
           | syntax where you can say: "add these doubles, order doesn't
           | matter", which allows the compiler to use SIMD
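Absent such syntax, the usual workaround is to opt into the reassociation by hand: sum into several independent accumulators so the vectorized evaluation order is already the one written in the source (a sketch, not from the article):

```rust
// Eight independent accumulators make the reassociation explicit, so
// the compiler is free to keep them in vector lanes without changing
// the program's semantics.
fn sum_reassoc(xs: &[f64]) -> f64 {
    let mut acc = [0.0f64; 8];
    let mut chunks = xs.chunks_exact(8);
    for chunk in &mut chunks {
        for i in 0..8 {
            acc[i] += chunk[i];
        }
    }
    // Handle the leftover elements (fewer than 8) serially.
    let tail: f64 = chunks.remainder().iter().sum();
    acc.iter().sum::<f64>() + tail
}
```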
        
             | kolbe wrote:
             | -ffast-math if you want to push the onus on the compiler.
             | 
             | Problem is you need to enforce that requirement on all user
             | compilations, and I don't know what for MSVC. In-language
             | would be nice.
        
               | Aardwolf wrote:
                | -ffast-math is global though and can break other
                | libraries. I mean local, vectorization-compatible but
                | portable syntax, without actually writing the SIMD
                | yourself (let the compiler do the work).
        
               | kolbe wrote:
               | Totally fair. I'm with you.
        
               | eesmith wrote:
                | According to
                | https://stackoverflow.com/questions/40699071/can-i-make-
                | my-c... you annotate it with:
                | 
                |     __attribute__((optimize("-ffast-math")))
                | 
                | I tried it with:
                | 
                |     double sum1(double arr[128]) {
                |       double tot = 0.0;
                |       for (int i=0; i<128; i++) {
                |         tot += arr[i];
                |       }
                |       return tot;
                |     }
                | 
                |     __attribute__((optimize("-ffast-math")))
                |     double sum2(double arr[128]) {
                |       double tot = 0.0;
                |       for (int i=0; i<128; i++) {
                |         tot += arr[i];
                |       }
                |       return tot;
                |     }
                | 
                | The Compiler Explorer (gcc 13.2, --std=c++20
                | -march=native -O3) generates two different bodies for
                | those:
                | 
                |     sum1(double*):
                |         lea     rax, [rdi+1024]
                |         vxorpd  xmm0, xmm0, xmm0
                |     .L2:
                |         vaddsd  xmm0, xmm0, QWORD PTR [rdi]
                |         add     rdi, 32
                |         vaddsd  xmm0, xmm0, QWORD PTR [rdi-24]
                |         vaddsd  xmm0, xmm0, QWORD PTR [rdi-16]
                |         vaddsd  xmm0, xmm0, QWORD PTR [rdi-8]
                |         cmp     rax, rdi
                |         jne     .L2
                |         ret
                | 
                |     sum2(double*):
                |         lea     rax, [rdi+1024]
                |         vxorpd  xmm0, xmm0, xmm0
                |     .L6:
                |         vaddpd  ymm0, ymm0, YMMWORD PTR [rdi]
                |         add     rdi, 32
                |         cmp     rax, rdi
                |         jne     .L6
                |         vextractf64x2   xmm1, ymm0, 0x1
                |         vaddpd  xmm1, xmm1, xmm0
                |         vunpckhpd       xmm0, xmm1, xmm1
                |         vaddpd  xmm0, xmm0, xmm1
                |         vzeroupper
                |         ret
               | 
               | It is still compiler-specific and non-portable, but at
               | least it is not global.
        
         | gumby wrote:
         | > It is crazy sometimes how you try to program something the
         | best you ever could with classical C++, and then someone comes
         | along and makes a version using SIMD that is over 10x faster
         | (but less portable code).
         | 
         | But that's one of the points of a systems programming language
         | (of which C++ is one) -- it tries to be portably as efficient
         | as possible but makes it easy to do target-specific programming
         | when required.
         | 
         | > I really wish compilers were better at auto vectorization
         | 
         | FORTRAN compilers sure are, since aliasing is not allowed. C++
         | is kneecapped by following C's memory model.
        
           | secondcoming wrote:
           | It's not just aliasing. gcc is quite bad at autovectorising
           | in general, clang is much better.
           | 
            | Here's a trivial algorithm that gcc borks on because 'bool'
            | is used instead of 'int'; even after fixing that, the
            | codegen isn't great:
           | 
           | https://godbolt.org/z/hf7szo78E
        
           | pklausler wrote:
           | Fortran compilers can assume that (many) dummy arguments to a
           | particular procedure reference do not alias each other, true.
           | But there are many other ways in which aliasing can happen in
           | a conforming Fortran program.
        
           | Aardwolf wrote:
           | But why is C++ refusing to add "restrict" to its spec? C has
           | restrict in its spec, C++ has not. Of course compilers
           | inherit restrict from C, but it's not super portable due to
           | C++ not officially supporting it. But why?!
        
         | dragontamer wrote:
         | You could just use CUDA, which is C++ designed for GPUs, the
         | ultimate SIMD machine of today.
         | 
         | Or ROCm (basically CUDA but for AMD).
         | 
         | I always was a fan of Microsoft's C++AMP though. I thought that
         | was easiest to get into. Too bad it never stuck though.
        
         | cmovq wrote:
         | Good SIMD code requires careful consideration of how your data
         | is laid out in memory. This makes auto vectorization really
         | difficult since the compiler can't fix your data for you
         | outside of very local contexts
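A common illustration of the data-layout point (the type and field names here are made up, not from the article): with a struct-of-arrays layout each field is contiguous, so a simple loop over one field is an easy auto-vectorization target, whereas an array-of-structs interleaves the fields and forces strided access.

```rust
// AoS: the fields of each particle are interleaved in memory, so a
// loop touching only `x` strides past `y` and `z` every iteration.
#[allow(dead_code)]
struct ParticleAos { x: f32, y: f32, z: f32 }

// SoA: each field is one contiguous array, ideal for SIMD loads.
struct ParticlesSoa { x: Vec<f32>, y: Vec<f32>, z: Vec<f32> }

fn scale_x(p: &mut ParticlesSoa, k: f32) {
    // Contiguous f32s with no cross-iteration dependency: a loop the
    // compiler can vectorize without any layout fix-ups.
    for x in &mut p.x {
        *x *= k;
    }
}
```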
        
       | intalentive wrote:
       | SIMD is pretty intuitive if you've used deep learning libraries,
       | NumPy, array based or even functional languages
        
         | westurner wrote:
         | NumPy roadmap: https://numpy.org/neps/roadmap.html :
         | 
         | > _Improvements to NumPy's performance are important to many
         | users. We have focused this effort on Universal SIMD (see NEP
         | 38 -- Using SIMD optimization instructions for performance)
         | intrinsics which provide nice improvements across various
         | hardware platforms via an abstraction layer. The infrastructure
         | is in place, and we welcome follow-on PRs to add SIMD support
         | across all relevant NumPy functions_
         | 
         | "NEP 38 -- Using SIMD optimization instructions for
         | performance" (2019) https://numpy.org/neps/nep-0038-SIMD-
         | optimizations.html#nep3...
         | 
         | NumPy docs > CPU/SIMD Optimizations:
         | https://numpy.org/doc/stable/reference/simd/index.html
         | 
         | std::simd: https://doc.rust-lang.org/std/simd/index.html
         | 
         | "Show HN: SimSIMD vs SciPy: How AVX-512 and SVE make SIMD nicer
         | and ML 10x faster" (2023-10)
         | https://news.ycombinator.com/item?id=37808036
         | 
         | "Standard library support for SIMD" (2023-10)
         | https://discuss.python.org/t/standard-library-support-for-si...
        
       ___________________________________________________________________
       (page generated 2023-11-28 23:00 UTC)