[HN Gopher] SIMD Everywhere Optimization from ARM Neon to RISC-V...
___________________________________________________________________
SIMD Everywhere Optimization from ARM Neon to RISC-V Vector Extensions

Author : camel-cdr
Score  : 68 points
Date   : 2023-09-29 15:54 UTC (7 hours ago)

(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)

| almatabata wrote:
| Very neat. I hope this will get easier to do in the future once
| languages start including these SIMD semantics in the language
| itself, like Rust tries to do:
|
| https://doc.rust-lang.org/std/simd/struct.Simd.html
|
| Libraries implemented in languages without these semantics will
| greatly benefit from this.
| geertj wrote:
| And C++:
|
| https://en.cppreference.com/w/cpp/experimental/simd
|
| This proposal has been around for a while, but it recently got
| some new momentum and seems to be on track for C++26. GCC ships
| a version today for those wanting to try it.
| dzaima wrote:
| A problem is that most such things (the Rust thing, C++'s
| experimental/simd, Zig's SIMD types) have the vector size as a
| compile-time property, while ARM's SVE and RISC-V's RVV are
| designed such that it's possible to write portable code that
| can work for a range of implementation widths. Thus such a
| fixed-width SIMD library would be forced to target the minimum
| (128-bit) even if the hardware supports 256-bit, 512-bit, or
| more. (SVE supports up to 1024-bit, RVV - up to 65536-bit)
|
| There is Highway (https://github.com/google/highway) however,
| that does support dynamically-sized SIMD.
| almatabata wrote:
| If you compile the code by specifying the target as native
| you could get around that limitation, no?
| cozzyd wrote:
| Yes, but then if distributing binaries, you need a
| different binary for each SIMD width.
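To make the point above concrete, here is a minimal sketch of why a compile-time lane count ties a binary to one width. `Simd<T, N>`, `kLanes`, and `add_fixed_width` are hypothetical names invented for illustration; the struct is a plain-array stand-in, not the Rust `std::simd` or C++ `std::experimental::simd` type.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a fixed-width SIMD type: the lane count N is
// part of the type, so it must be known when the code is compiled.
template <typename T, std::size_t N>
struct Simd {
    std::array<T, N> lanes{};

    static Simd load(const T* p) {
        Simd v;
        for (std::size_t i = 0; i < N; ++i) v.lanes[i] = p[i];
        return v;
    }
    void store(T* p) const {
        for (std::size_t i = 0; i < N; ++i) p[i] = lanes[i];
    }
    friend Simd operator+(const Simd& a, const Simd& b) {
        Simd r;
        for (std::size_t i = 0; i < N; ++i) r.lanes[i] = a.lanes[i] + b.lanes[i];
        return r;
    }
};

// Assumption: 4 x 32-bit lanes, i.e. a 128-bit vector. Changing this value
// means recompiling, which is why each width needs its own build.
constexpr std::size_t kLanes = 4;

void add_fixed_width(const int* a, const int* b, int* out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    std::size_t i = 0;
    // Main loop: always exactly kLanes elements per step, on any hardware.
    for (; i + kLanes <= n; i += kLanes) {
        const auto va = Simd<int, kLanes>::load(a + i);
        const auto vb = Simd<int, kLanes>::load(b + i);
        (va + vb).store(out + i);
    }
    // Scalar tail for the leftover elements.
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

On a machine with wider vectors this loop would still step 4 elements at a time unless it is rebuilt with a larger `kLanes`.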
| almatabata wrote:
| Ah, makes sense. If you have complete control over your
| hardware it could work, but with open source projects and
| businesses with a wide customer base it might not.
|
| Compiled languages like Rust, C++ and Zig cannot detect
| the hardware because they have no runtime, right? Could a
| language like Go add the SIMD semantics and detect the
| supported vector size?
| dzaima wrote:
| The problem isn't detecting the width (that's trivially
| possible at runtime with a single instruction, though
| both SVE and RVV have a way to write loops such that you
| don't even need to).
|
| The problem is that a "Simd<i32, 4>" will always have 4
| elements, but you'd need a "Simd<i32, whatever the
| hardware has>" type, which has significant impact on what
| is possible to do with such a type.
| almatabata wrote:
| Ah, thank you for clarifying. So you would have to create
| an abstraction layer on top of the current SIMD
| implementation, for example simd_vector(type, size).
| That abstraction would have to dynamically detect the
| hardware and dispatch to it, like the project you shared
| (https://github.com/google/highway).
|
| So technically it sounds feasible, but all of the
| languages like Zig, C++ and Rust picked a simpler
| approach. Is it simply a first step to a more abstract
| approach?
| dzaima wrote:
| Not really - you don't need to dispatch anything, the
| idea is that the same code (and thus the same
| assembly/machine code) can operate on different sizes by
| itself. e.g. with RVV "vsetvli t0,x0,e32,m1,ta,ma;
| vadd.vv v0, v1, v2" on a CPU with 128-bit vectors will do
| 4 element additions, but on a CPU with 1024-bit vectors
| it'll do 32 additions.
|
| And some things you just can't really "generalize" to
| scalable vectors. e.g.
| you can store Simd<i32,4> in a
| struct or global variables, or initialize with, say,
| [3,2,1,0], but none of those things are possible with
| scalable vectors (globals/struct fields need a known
| size, and initializing with a hard-coded list of elements
| doesn't make much sense if you don't even know how many
| elements you'll need).
| Conscat wrote:
| C++ comes with a runtime which, among many other things,
| allows you to detect the microarchitecture and feature set
| of the environment you're running on using
| `__builtin_cpu_init()` which calls a dynamically linked
| function `__cpu_indicator_init()`. Then using the
| `cpu_dispatch`, `target`, or `target_clones` attributes
| you can compile multiple variations of an algorithm in
| your program and dynamically select the one to execute.
| This is referred to as a "fat binary", and the feature is
| "multifunctions" or "multiversioned functions".
|
| Zig intends to support a similar feature but doesn't yet,
| at least not built into the language (you could certainly
| express this if you tried hard enough). I don't know
| about Rust, but I would be very surprised if it can't do
| this.
|
| edit: I think I replied to the wrong comment >.<
| vkazanov wrote:
| You can actually autogenerate all reasonable variants of
| the code if necessary, there aren't that many
| architectures these days. SIMD instructions are usually
| very local, this shouldn't blow up the binary.
|
| The point is to not have to write repetitive source code
| many times.
| Pet_Ant wrote:
| What is the real cost to just have those few methods be
| compiled in and then a branch? You don't need to ship a
| separate binary for each target, you can have dead code
| in it. I mean fat binaries take this idea to the extreme
| to support multiple architectures.
|
| https://en.wikipedia.org/wiki/Fat_binary
| elabajaba wrote:
| Not being able to inline and having to branch on every
| call to a SIMD function can sometimes make it slower than
| the basic scalar version.
| Pet_Ant wrote:
| Just a thought, but would it be possible to hot patch at
| the time of loading the binary? I realise it might
| require updates to the binary format, but it might be
| very well justified.
| dzaima wrote:
| You should branch at the level where inlining doesn't make
| sense, which would usually be some function wrapping the
| big loop, which should be rather free. Which is the same
| situation as on x86-64 if you want to target
| pre-AVX2/post-AVX2/AVX-512.
| dzaima wrote:
| That's 5 copies for SVE (had an error in first message -
| SVE allows up to 2048-bit vectors, not 1024), and 10
| copies for RVV if you wanted to target all widths (though
| you'd probably be fine for a decade or two by
| targeting just 128 & 256-bit, and maybe 512-bit). Plus
| one more for a scalar fallback.
|
| And yes, it's not particularly large of a cost, other
| than it being an extremely pointless waste of space given
| that it is possible to have just one variant that covers
| them all.
|
| Though, it would become significantly more problematic if
| you wanted to target different extension groups too
| (which you would quite likely want to some extent) as
| those'd multiply with all the length targets - SVE vs
| SVE2 vs more future extensions, and on RVV there's just a
| lot (Zvfh & Zvfhmin for FP16, Zvbb for extra bitmanip
| stuff, many more here[1]; and potentially at some point
| there could be an extension that uses a wider encoding
| scheme to inline vsetvl fields & allow masking by
| registers other than v0, which could benefit everything)
|
| [1]: https://github.com/riscv/riscv-crypto/blob/c8ddeb7e64a3444dd...
| snvzz wrote:
| RISC-V is rapidly building the strongest ecosystem.
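The dispatch-once-per-big-loop idea discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions: it relies on the GCC/Clang x86 builtins `__builtin_cpu_init()` and `__builtin_cpu_supports()` and the `target` attribute mentioned in the thread, and the function names (`add_scalar`, `add_avx2`, `pick_add`, `add_dispatched`) are hypothetical. On other compilers or architectures the preprocessor guard falls back to the scalar path.

```cpp
#include <cstddef>

// Portable fallback, built with the baseline instruction set.
static void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
// Same loop, but the target attribute allows the compiler to emit AVX2
// instructions for this one function (whether it vectorizes depends on
// optimization settings).
__attribute__((target("avx2")))
static void add_avx2(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}
#endif

using AddFn = void (*)(const float*, const float*, float*, std::size_t);

// Chooses an implementation by querying the CPU at runtime.
static AddFn pick_add() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return add_avx2;
#endif
    return add_scalar;
}

// The selection happens at the function that wraps the whole loop, not
// per element, so the indirect call is paid once per array.
void add_dispatched(const float* a, const float* b, float* out, std::size_t n) {
    static const AddFn impl = pick_add();  // resolved on first call only
    impl(a, b, out, n);
}
```

Each extra width or extension group would add another variant and another check in `pick_add`, which is the multiplication of copies described above.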
| adgjlsfhk1 wrote:
| and hug of death
| kierank wrote:
| The paper suggests FFmpeg uses intrinsics which is not correct.
|
| There have been many SIMD abstraction layers created in the past
| but none of them will beat the raw speed of handwritten assembly.
| Try and implement something like vpternlogd in one of these
| abstraction layers.
| dist1ll wrote:
| The main abstraction of intrinsics is register allocation,
| right? Is there anything else that can be gained by handwritten
| asm?
| camel-cdr wrote:
| For rvv specifically there are a few things that aren't
| possible using the intrinsics abstraction.
|
| E.g. in asm you can run the same instruction sequence with
| different vtype (element width and LMUL).
| Danidada wrote:
| Neat project!
|
| However, I'm pretty sure OpenCV has their "universal intrinsics"
| and RISC-V with scalable vector registers is supported in the
| latest OpenCV version
|
| Universal intrinsics (docs not updated):
| https://docs.opencv.org/4.x/d6/dd1/tutorial_univ_intrin.html
|
| Scalable RVV support:
| https://github.com/opencv/opencv/pull/22179
| KingLancelot wrote:
| We need to do better than ISA specific intrinsics.
|
| There should be a simd.h header in the C standard library that
| contains typedefs for vector types, and various functions to
| operate on them as well as Operators for them.
|
| Like my _Operator <symbol> <function name>; proposal, which
| requires no mangling.
| camel-cdr wrote:
| This doesn't seem to be upstreamed yet.
|
| I hope they have real hardware performance numbers for the rv
| summit talk.
| biocrusoe wrote:
| SIMDe maintainer here, I would welcome a PR; yes!
| atdt wrote:
| Highway (https://github.com/google/highway), Google's SIMD
| library, lets you write length-agnostic SIMD code. It has
| excellent support for a wide range of targets, including both
| RISC-V and Arm vector extensions.
| mgaunard wrote:
| There are so many SIMD libraries nowadays.
|
| I myself implemented one in the SSE4/Altivec days (later extended
| to AVX, AVX512 and NEON). There were only a few options then, but
| now everyone seems to be doing it.
| biocrusoe wrote:
| Archived copy:
| https://web.archive.org/web/20230929161438/https://arxiv.org...
|
| Direct link to PDF: https://arxiv.org/pdf/2309.16509.pdf
___________________________________________________________________
(page generated 2023-09-29 23:00 UTC)