[HN Gopher] SIMD Everywhere Optimization from ARM Neon to RISC-V...
       ___________________________________________________________________
        
       SIMD Everywhere Optimization from ARM Neon to RISC-V Vector
       Extensions
        
       Author : camel-cdr
       Score  : 68 points
       Date   : 2023-09-29 15:54 UTC (7 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | almatabata wrote:
       | Very neat. I hope this will get easier to do in the future once
       | languages start including these SIMD semantics in the language
       | itself, as Rust is trying to do:
       | 
       | https://doc.rust-lang.org/std/simd/struct.Simd.html
       | 
       | Libraries implemented in languages without these semantics will
       | greatly benefit from this.
        
         | geertj wrote:
         | And C++:
         | 
         | https://en.cppreference.com/w/cpp/experimental/simd
         | 
         | This proposal has been around for a while, but it recently got
         | some new momentum and seems to be on track for C++26. GCC ships
         | a version today for those wanting to try it.
        
         | dzaima wrote:
         | A problem is that most such things (Rust's std::simd, C++'s
         | experimental/simd, Zig's SIMD types) have the vector size as a
         | compile-time property, while ARM's SVE and RISC-V's RVV are
         | designed such that it's possible to write portable code that
         | can work for a range of implementation widths. Thus such a
         | fixed-width SIMD library would be forced to target the minimum
         | (128-bit) even if the hardware supports 256-bit, 512-bit, or
         | more. (SVE supports up to 1024-bit, RVV - up to 65536-bit)
         | 
         | There is Highway (https://github.com/google/highway) however,
         | that does support dynamically-sized SIMD.
        
           | almatabata wrote:
           | If you compile the code by specifying the target as native,
           | you could get around that limitation, no?
        
             | cozzyd wrote:
             | Yes, but then if you distribute binaries, you need a
             | different binary for each SIMD width.
        
               | almatabata wrote:
               | Ah, that makes sense. If you have complete control over
               | your hardware it could work, but for open source projects
               | and businesses with a wide customer base it might not.
               | 
               | Compiled languages like Rust, C++ and Zig cannot detect
               | the hardware because they have no runtime, right? Could a
               | language like Go add the SIMD semantics and detect the
               | supported vector size?
        
               | dzaima wrote:
               | The problem isn't detecting the width (that's trivially
               | possible at runtime with a single instruction, though
               | both SVE and RVV have a way to write loops such that you
               | don't even need to).
               | 
               | The problem is that a "Simd<i32, 4>" will always have 4
               | elements, but you'd need a "Simd<i32, whatever the
               | hardware has>" type, which has significant impact on what
               | is possible to do with such a type.
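
A rough scalar model of the loop style mentioned above, where the code never names a concrete vector width. This is only a sketch: `set_vl` is a hypothetical stand-in for what RVV's `vsetvli` does, and `add_arrays` and the `vlmax` parameter are illustrative names, not part of any real API.

```rust
/// Hypothetical stand-in for `vsetvli`: given the number of elements still
/// to process and the hardware's maximum lane count, return how many
/// elements to handle in this iteration.
fn set_vl(remaining: usize, vlmax: usize) -> usize {
    remaining.min(vlmax)
}

/// Element-wise `out[i] = a[i] + b[i]`, written without a fixed width.
/// `vlmax` models the lane count the hardware would report.
fn add_arrays(a: &[i32], b: &[i32], out: &mut [i32], vlmax: usize) {
    assert!(vlmax > 0, "a vector register holds at least one element");
    assert!(a.len() == b.len() && a.len() == out.len());
    let mut i = 0;
    while i < a.len() {
        let vl = set_vl(a.len() - i, vlmax);
        // On real hardware this inner loop would be one vector add over `vl` lanes.
        for j in i..i + vl {
            out[j] = a[j].wrapping_add(b[j]);
        }
        i += vl;
    }
}
```

Calling `add_arrays` with `vlmax = 4` or `vlmax = 32` gives the same result. Only the number of loop iterations changes, which is the property the fixed-width types cannot express.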
        
               | almatabata wrote:
               | Ah, thank you for clarifying. So you would have to create
               | an abstraction layer on top of the current SIMD
               | implementation, for example simd_vector(type, size).
               | That abstraction would have to detect the hardware at
               | runtime and dispatch to it, like the project you shared
               | (https://github.com/google/highway).
               | 
               | So technically it sounds feasible but all of the
               | languages like Zig, C++ and Rust picked a simpler
               | approach. Is it simply a first step to a more abstract
               | approach?
        
               | dzaima wrote:
               | Not really - you don't need to dispatch anything, the
               | idea is that the same code (and thus the same
               | assembly/machine code) can operate on different sizes by
               | itself. E.g. with RVV "vsetvli t0,x0,e32,m1,ta,ma;
               | vadd.vv v0, v1, v2" on a CPU with 128-bit vectors will do
               | 4 element additions, but on a CPU with 1024-bit vectors
               | it'll do 32 additions.
               | 
               | And some things you just can't really "generalize" to
               | scalable vectors. e.g. you can store Simd<i32,4> in a
               | struct or global variables, or initialize with, say,
               | [3,2,1,0], but none of those things are possible with
               | scalable vectors (globals/struct fields need a known
               | size, and initializing with a hard-coded list of elements
               | doesn't make much sense if you don't even know how many
               | elements you'll need).
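
The difference described above can be sketched in stable Rust with a hand-rolled stand-in for a fixed-width type. The `Simd` struct here is an illustrative newtype, not the real `std::simd::Simd`, and `Particle` and `reverse_indices` are hypothetical names.

```rust
use std::mem::size_of;

/// Illustrative fixed-width vector: the lane count `N` is part of the type.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Simd<T, const N: usize>([T; N]);

// The size is known at compile time, so it can be checked at compile time.
const _: () = assert!(size_of::<Simd<i32, 4>>() == 16);

/// Fine as a global, and fine to initialize from a hard-coded list.
static REVERSE_INDICES: Simd<i32, 4> = Simd([3, 2, 1, 0]);

/// Fine as a struct field for the same reason.
struct Particle {
    position: Simd<f32, 4>,
    velocity: Simd<f32, 4>,
}

/// With a hardware-chosen lane count there is no compile-time size, so the
/// closest plain-Rust equivalent is a buffer sized at run time, filled by a
/// rule rather than by a literal list.
fn reverse_indices(lanes: usize) -> Vec<i32> {
    (0..lanes as i32).rev().collect()
}
```

The `Vec` version only approximates a scalable register. It lives in memory, whereas the point of SVE and RVV is that such a value stays in a register whose size the compiler does not know.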
        
               | Conscat wrote:
               | C++ comes with a runtime which, among many other things,
               | allows you to detect the microarchitecture and featureset
               | of the environment you're running on using
               | `__builtin_cpu_init()` which calls a dynamically linked
               | function `__cpu_indicator_init()`. Then using the
               | `cpu_dispatch`, `target`, or `target_clones` attributes
               | you can compile multiple variations of an algorithm in
               | your program and dynamically select the one to execute.
               | This is referred to as a "fat binary", and the feature is
               | "multifunctions" or "multiversioned functions".
               | 
               | Zig intends to support a similar feature but doesn't yet,
               | at least not built into the language (you could certainly
               | express this if you tried hard enough). I don't know
               | about Rust, but I would be very surprised if it can't do
               | this.
               | 
               | edit: I think I replied to the wrong comment >.<
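
On the Rust question: a minimal sketch of the same idea, using the standard library's `is_x86_feature_detected!` macro and the `#[target_feature]` attribute. These are Rust's counterparts to the compiler attributes named above, not the same mechanism. The function names are illustrative, and on non-x86-64 targets the sketch simply falls back to the scalar path.

```rust
/// Portable baseline, always available.
fn sum_scalar(xs: &[i32]) -> i64 {
    xs.iter().map(|&x| i64::from(x)).sum()
}

/// Same source, but the compiler may use AVX2 instructions inside it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[i32]) -> i64 {
    xs.iter().map(|&x| i64::from(x)).sum()
}

/// Picks a variant once per call to the whole loop, not once per element.
fn sum(xs: &[i32]) -> i64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: the runtime check above confirmed AVX2 is present.
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_scalar(xs)
}
```

Both variants live in one binary, and the branch sits outside the hot loop, so its cost is paid once per call rather than per element.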
        
               | vkazanov wrote:
               | You can actually autogenerate all reasonable variants of
               | the code if necessary; there aren't that many
               | architectures these days. SIMD instructions are usually
               | very local, so this shouldn't blow up the binary.
               | 
               | The point is to not have to write repetitive source code
               | many times.
        
               | Pet_Ant wrote:
               | What is the real cost of just compiling in those few
               | methods and then branching? You don't need to ship a
               | separate binary for each target, you can have dead code
               | in it. I mean fat binaries take this idea to the extreme
               | to support multiple architectures.
               | 
               | https://en.wikipedia.org/wiki/Fat_binary
        
               | elabajaba wrote:
               | Not being able to inline and having to branch on every
               | call to a simd function can sometimes make it slower than
               | the basic scalar version.
        
               | Pet_Ant wrote:
               | Just a thought, but would it be possible to hot patch at
               | the time of loading the binary? I realise it might
               | require updates to the binary format, but it might be
               | very well justified.
        
               | dzaima wrote:
               | You should branch at the level where inlining doesn't make
               | sense, which would usually be some function wrapping the
               | big loop, which should be rather free. This is the same
               | situation as on x86-64 if you want to target pre-
               | AVX2/post-AVX2/AVX-512.
        
               | dzaima wrote:
               | That's 5 copies for SVE (had an error in first message -
               | SVE allows up to 2048-bit vectors, not 1024), and 10
               | copies for RVV if you wanted to target all widths (though
               | you'd probably be fine for a decade or two by
               | targeting just 128 & 256-bit, and maybe 512-bit). Plus
               | one more for a scalar fallback.
               | 
               | And yes, it's not a particularly large cost, other
               | than it being an extremely pointless waste of space given
               | that it is possible to have just one variant that covers
               | them all.
               | 
               | Though, it would become significantly more problematic if
               | you wanted to target different extension groups too
               | (which you would quite likely want to some extent) as
               | those'd multiply with all the length targets - SVE vs
               | SVE2 vs more future extensions, and on RVV there's just a
               | lot (Zvfh & Zvfhmin for FP16, Zvbb for extra bitmanip
               | stuff, many more here[1]; and potentially at some point
               | there could be an extension that uses a wider encoding
               | scheme to inline vsetvl fields & allow masking by
               | registers other than v0, which could benefit everything)
               | 
               | [1]: https://github.com/riscv/riscv-
               | crypto/blob/c8ddeb7e64a3444dd...
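
The counts above can be reproduced with a little arithmetic, assuming only power-of-two widths are counted, from 128 bits up to the stated maximum. The function names and the extension-group count used in the usage note are illustrative assumptions, not figures from the thread.

```rust
/// Number of power-of-two vector widths in `min_bits..=max_bits`.
fn width_variants(min_bits: u32, max_bits: u32) -> u32 {
    assert!(min_bits.is_power_of_two() && max_bits.is_power_of_two());
    assert!(min_bits <= max_bits);
    max_bits.trailing_zeros() - min_bits.trailing_zeros() + 1
}

/// Copies needed if every width is combined with every extension group,
/// plus one scalar fallback.
fn total_copies(widths: u32, extension_groups: u32) -> u32 {
    widths * extension_groups + 1
}
```

`width_variants(128, 2048)` gives the 5 SVE copies and `width_variants(128, 65536)` gives the 10 RVV copies. With a hypothetical 3 extension groups, `total_copies(10, 3)` would already be 31, which is the multiplication effect described in the last paragraph.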
        
       | snvzz wrote:
       | RISC-V is rapidly building the strongest ecosystem.
        
       | adgjlsfhk1 wrote:
       | and hug of death
        
       | kierank wrote:
       | The paper suggests FFmpeg uses intrinsics, which is not correct.
       | 
       | There have been many SIMD abstraction layers created in the past
       | but none of them will beat the raw speed of handwritten assembly.
       | Try and implement something like vpternlogd in one of these
       | abstraction layers.
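
For readers unfamiliar with the instruction: a scalar model of what a three-input "ternary logic" operation such as vpternlogd computes for one 32-bit lane. This is a sketch for illustration only. The name `ternlog32` is made up, and the operand order (first operand supplies the high index bit) is assumed to match the usual imm8 convention.

```rust
/// For each bit position, the bits of `a`, `b` and `c` form a 3-bit index
/// into the 8-entry truth table `imm8`, and the selected table bit becomes
/// the result bit.
fn ternlog32(a: u32, b: u32, c: u32, imm8: u8) -> u32 {
    let mut out = 0u32;
    for bit in 0..32u32 {
        let idx = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1);
        out |= ((u32::from(imm8) >> idx) & 1) << bit;
    }
    out
}
```

With this model, `imm8 = 0x96` is a three-way XOR and `imm8 = 0xCA` is a bitwise select (`a` chooses between `b` and `c`). The hardware instruction does this for every lane at once, while a portable layer lacking such an operation would have to spell each table out as several AND/OR/XOR steps.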
        
         | dist1ll wrote:
         | The main abstraction of intrinsics is register allocation,
         | right? Is there anything else that can be gained by handwritten
         | asm?
        
           | camel-cdr wrote:
           | For rvv specifically there are a few things that aren't
           | possible using the intrinsics abstraction.
           | 
           | E.g. in asm you can run the same instruction sequence with
           | different vtype (element width and LMUL).
        
       | Danidada wrote:
       | Neat project!
       | 
       | However, I'm pretty sure OpenCV has their "universal intrinsics"
       | and RISC-V with scalable vector registers is supported in the
       | latest OpenCV version
       | 
       | Universal intrinsics (docs not updated):
       | https://docs.opencv.org/4.x/d6/dd1/tutorial_univ_intrin.html
       | Scalable RVV support: https://github.com/opencv/opencv/pull/22179
        
       | KingLancelot wrote:
       | We need to do better than ISA specific intrinsics.
       | 
       | There should be a simd.h header in the C standard library that
       | contains typedefs for vector types, and various functions to
       | operate on them, as well as operators for them.
       | 
       | Like my _Operator <symbol> <function name>; proposal, which
       | requires no mangling.
        
       | camel-cdr wrote:
       | This doesn't seem to be upstreamed yet.
       | 
       | I hope they have real hardware performance numbers for the rv
       | summit talk.
        
         | biocrusoe wrote:
         | SIMDe maintainer here, I would welcome a PR; yes!
        
       | atdt wrote:
       | Highway (https://github.com/google/highway), Google's SIMD
       | library, lets you write length-agnostic SIMD code. It has
       | excellent support for a wide range of targets, including both
       | RISC-V and Arm vector extensions.
        
       | mgaunard wrote:
       | There are so many SIMD libraries nowadays.
       | 
       | I myself implemented one in the SSE4/Altivec days (later extended
       | to AVX, AVX512 and NEON). There were only a few options then, but
       | now everyone seems to be doing it.
        
       | biocrusoe wrote:
       | Archived copy:
       | https://web.archive.org/web/20230929161438/https://arxiv.org...
       | 
       | Direct link to PDF: https://arxiv.org/pdf/2309.16509.pdf
        
       ___________________________________________________________________
       (page generated 2023-09-29 23:00 UTC)