[HN Gopher] Optimizing compilers reload vector constants needlessly
___________________________________________________________________
 
Optimizing compilers reload vector constants needlessly
 
Author : ibobev
Score  : 82 points
Date   : 2022-12-06 16:41 UTC (6 hours ago)
 
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
 
  | jeffbee wrote:
  | Moving the constant to file or anonymous namespace scope solves
  | the issue. It's too bad that intrinsics are not `constexpr`,
  | because I have a powerful urge to hang a `constinit` in front
  | of this line.
    | gumby wrote:
    | Disturbing that this works, as it shouldn't do the reload
    | even if the constant is passed in as a parameter.
    | leni536 wrote:
    | In this particular case the broadcasting instruction can be
    | replaced with builtin operations, allowing constexpr.
    |
    | https://godbolt.org/z/Td6vG9cqG
    |
    | edit: uh, the constant requires some hand adjustment
    |
    | edit2: fixed version https://godbolt.org/z/4Px5Mbsx4, and I
    | just don't get this. gcc really just wants to load that
    | constant twice.
  | foota wrote:
  | Maybe it's trying to avoid using SSE in the case where there's
  | no loop? SSE on some older platforms had a cost just from using
  | it, so that might be what's going on here.
  | stephc_int13 wrote:
  | My experience with optimizing compilers is that generated code
  | is often frustratingly close to optimal (given that the source
  | is well written, and taking into account the constraints of the
  | target arch).
  |
  | It is perfectly reasonable to take a look at the output on
  | Godbolt, tweak it a bit, and call it a day.
  |
  | Maintaining a full assembly-language version of the same code
  | is rarely justifiable.
  |
  | And yet, I understand the itch, especially because there is
  | quite often some low-hanging fruit to grab.
    | evancox100 wrote:
    | This may be true for scalar code, but it seems like the
    | compilers still aren't quite there with vector code.
  | phkahler wrote:
  | The optimization here would be CSE or hoisting, or both? I'm
  | guessing the problem is that those are performed prior to
  | vectorization.
  |
  | In other words, I suspect an invariant calculation inside
  | consecutive loops that is not vectorized would be pulled out of
  | the loops, moved ahead of them, and executed just once.
    | JonChesterfield wrote:
    | At a guess, constant rematerialization failing to cross basic
    | block boundaries. Feels like a plausible thing for a
    | heuristic to miss. E.g. sink the constant into the loop so
    | it's available when optimising that block, then fail to hoist
    | it back out afterwards because constant materialisation is
    | cheap.
  | JoeAltmaier wrote:
  | Intel had an optimizing compiler that was amazing. But unless
  | you were Intel-only, having to switch compilers for that one
  | platform made life harder.
    | berkut wrote:
    | Yeah, I haven't used ICC for 7 years now, but at the time it
    | was much better than clang/gcc at keeping SSE/AVX intrinsic
    | types in registers through function calls (i.e. clang/gcc
    | used to spill out onto the stack and re-load), and at things
    | like the case in the article.
      | cwzwarich wrote:
      | Were you testing on the same platform? The Microsoft ABI
      | has callee-save XMM registers, whereas the Linux/macOS ABI
      | does not. Regardless, it would be nice if more compilers
      | could do interprocedural register allocation in cases where
      | all callers are known.
    | tester756 wrote:
    | I've heard similar opinions, that people could just recompile
    | their software with it and get a significant speed boost.
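 
(To make the constexpr idea from jeffbee's and leni536's comments
above concrete: a minimal sketch using GCC/Clang builtin vector
extensions instead of a set1 intrinsic, so the constant can live at
file scope. The type alias and variable names are invented for
illustration; this is not the code from the article or the godbolt
links.)
 
    #include <immintrin.h>
 
    // GCC/Clang builtin vector type: 8 x 32-bit int = 256 bits.
    typedef int v8si __attribute__((vector_size(32)));
 
    // Brace-initialized builtin vectors can be constexpr, unlike
    // _mm256_set1_epi32(), so the constant can sit in .rodata and
    // be loaded once rather than re-broadcast before each loop.
    constexpr v8si kBroadcast10001 = {10001, 10001, 10001, 10001,
                                      10001, 10001, 10001, 10001};
 
    // Reinterpret where an __m256i is needed (same 256-bit width):
    //   __m256i c = (__m256i)kBroadcast10001;
 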
  | inetknght wrote:
  | I haven't (yet) read the article, but I will. But the
  | headline...
  |
  | > _Optimizing compilers reload vector constants needlessly_
  |
  | ...is absolutely true. I wrote some code that just does bit
  | manipulation (shifting, or, and, xor, popcount) at the byte
  | level. The compiler produced vectorized instructions that gave
  | about a 30% speed-up. But when I looked at it... it was
  | definitely not as good as it could be, and one of the big
  | problems was frequent reloading/broadcasting of constants like
  | 0x0F or 0xCC or similar. Another thing it would do is sometimes
  | drop down to normal (non-SIMD) instructions. This was with both
  | `-O2` and `-O3`, and also with `-march=native`.
  |
  | I ended up learning how to use SIMD intrinsics and hand-wrote
  | it all... and achieved about a 600% speedup. The code reached
  | about 90% of the bandwidth of the bus to RAM, which was what I
  | theorized "should" be the limiting factor: bitwise operations
  | like this are _extremely_ fast, and the slowest part was
  | popcount, which didn't have a native instruction on the
  | hardware I was targeting (AVX2). This was with GCC 6.3 if I
  | recall, about 5 years ago.
    | an1sotropy wrote:
    | Can you recommend any favorite resources for learning how to
    | use SIMD intrinsics?
      | teux wrote:
      | Not OP, but I also work with this.
      |
      | There are some tutorials, but honestly the best thing is to
      | just use them.
      |
      | Write an image processing routine that does something like
      | applying a Gaussian blur to a black-and-white image. The
      | C++ code for this is _everywhere_. You have a fixed kernel
      | (a 2D matrix) and you have to do repeated multiplication
      | and addition on each pixel for each element in the kernel.
      |
      | Write it in C++ or Rust. Then read the Arm SIMD manual,
      | find the instructions that do the math you want, and switch
      | it over to intrinsics. You are doing the same exact
      | operations with the intrinsics as with the raw C++, just 8
      | or 16 of them at a time.
      |
      | Run them side by side for parity and to check speed, tweak
      | the SIMD, etc.
      |
      | Arm has good (well, okay) documentation:
      |
      | https://developer.arm.com/documentation/den0018/a/?lang=en
      |
      | https://arm-software.github.io/acle/neon_intrinsics/advsimd....
      |
      | * Edit: you also have to do this on a supported
      | architecture. Raspberry Pis have a NEON core, at least in
      | the 3s. Not sure about the 4s, but I believe so too!
        | corysama wrote:
        | Adding on:
        |
        | Go to
        | https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
        |
        | Start with SSE, SSE2, SSE3.
        |
        | Write small functions in https://godbolt.org/ . Watch the
        | assembly and the program output.
    | jeffreyrogers wrote:
    | That's basically the problem the article describes, although
    | he's using vector intrinsics too and it still reloads and
    | broadcasts the constant before each loop.
      | pclmulqdq wrote:
      | When I have used intrinsics, the compiler at least has a
      | hope of getting this right, particularly when you use
      | patterns like:
      |
      |     __m256i mask = _mm256_set1_epi8(0x0f);
      |
      | If you just use the intrinsic that sets the register to a
      | constant over and over, it often repeats the instruction.
      |
      | The compilers just aren't that smart about SIMD yet.
        | jeffreyrogers wrote:
        | He sets it once like this before the loops:
        |
        |     __m256i c = _mm256_set1_epi32(10001);
        |
        | And then the disassembly has
        |
        |     mov eax, 10001
        |     vpbroadcastd ymm1, eax
        |
        | before each loop.
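 
(For readers without the article open: a self-contained sketch of the
shape of code under discussion. This is an invented example, not
Lemire's actual benchmark; the function name, the min-then-accumulate
body, and the loop bounds are assumptions. The point is only that `c`
is set once by hand, yet compilers tend to emit the mov/vpbroadcastd
pair ahead of each loop.)
 
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>
 
    // Compile with e.g. g++ -O2 -mavx2. One vector constant, two
    // consecutive loops that both use it.
    uint32_t two_pass(const uint32_t *a, const uint32_t *b, size_t n) {
        __m256i c   = _mm256_set1_epi32(10001);  // set once, by hand
        __m256i acc = _mm256_setzero_si256();
        for (size_t i = 0; i + 8 <= n; i += 8) { // first loop
            __m256i v = _mm256_loadu_si256((const __m256i *)(a + i));
            acc = _mm256_add_epi32(acc, _mm256_min_epu32(v, c));
        }
        for (size_t i = 0; i + 8 <= n; i += 8) { // second loop; the
            // broadcast of c is often re-emitted just before this
            // loop instead of reusing the register from above
            __m256i v = _mm256_loadu_si256((const __m256i *)(b + i));
            acc = _mm256_add_epi32(acc, _mm256_min_epu32(v, c));
        }
        // Horizontal sum of the 8 lanes.
        uint32_t lanes[8];
        _mm256_storeu_si256((__m256i *)lanes, acc);
        uint32_t sum = 0;
        for (int i = 0; i < 8; i++) sum += lanes[i];
        return sum;
    }
 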
          | DannyBee wrote:
          | There are three reasons it reloads constants:
          |
          | 1. It thinks it is cheaper than keeping them in a
          | register (this is known as rematerialization). It will
          | reload a constant if that lets it keep something else
          | in a register and it's cheaper to do so.
          |
          | 2. It thinks something could affect the constant.
          |
          | 3. It thinks it must move the constant through memory
          | to use it, and then it thinks the memory was clobbered.
          |
          | In this case, it definitely knows it is a constant, and
          | it can't prove that both loops always execute, so it
          | places the load on each path where it is only executed
          | once per loop, because it believes that will be
          | cheaper.
          |
          | I can still make at least gcc do weird things if I
          | prove to it that the loop executes once.
          |
          | In that case, what is happening in gcc is that constant
          | propagation is propagating the vector constant forward
          | into both loops. Something later (that has a machine
          | cost model) is expected to commonize it if that is
          | cheaper, but never does.
    | teux wrote:
    | I often hand-write NEON (and other vector-architecture)
    | intrinsics/assembly for my job, optimising image- and
    | signal-processing routines. We have seen many, many
    | three-digit-percentage speedups over bare C/C++ code.
    |
    | I got into the nastiest discussion on reddit, where people
    | were swearing up and down that it was impossible to beat the
    | compiler, and that handwritten assembly was
    | useless/pretentious/dangerous. I was downvoted massively.
    | Sigh.
    |
    | Anyways, that was a year ago. Thanks for another point of
    | validation. It clearly didn't hurt my feelings. :)
    |
    | I never come across people in the wild who actually do this,
    | either; it's such a niche area of expertise.
      | fwsgonzo wrote:
      | It also slightly annoys me what JIT people write in their
      | GitHub READMEs about the incredible theoretical
      | improvements that can happen at runtime, when it's never
      | anywhere close to AOT compilation. And then you can add
      | 2-3x on top of that for hand-written assembly.
      |
      | I do wonder what's going on with projects like BOLT,
      | though. I have seen that it was merged into LLVM, and I
      | have tried to use it, but the improvement was never more
      | than 7%. I feel like it has a lot of potential because it
      | does try to take run-time behavior into account.
      |
      | See: https://github.com/llvm/llvm-project/tree/main/bolt
        | wyldfire wrote:
        | > improvement was never more than 7%.
        |
        | If your use case isn't straining the icache then you
        | won't benefit as much.
        |
        | BTW, 7% is huge; odd that you would describe it as
        | "only".
      | wyldfire wrote:
      | > impossible to beat the compiler
      |
      | Ludicrous! How could they be taken seriously? Which
      | subreddit was this?
        | astrange wrote:
        | Tell them to read the ffmpeg code. All the
        | platform-specific/SIMD stuff is done in asm.
        |
        | This isn't only because it's faster; it's honestly easier
        | to read than intrinsics anyway. What it does lack is
        | debuggability.
          | teux wrote:
          | For debugging you can actually use gdb in assembly TUI
          | mode and step through the instructions! You can even
          | get it hooked up in VS Code and remote-debug an
          | embedded target using the full IDE. Full register
          | view, watch registers for changes, breakpoints, step
          | from instruction to instruction.
          |
          | Pipelining and optimisations can make the intrinsics a
          | bit fucky though; you have to make sure it's -O0 and a
          | proper debug compilation.
          |
          | I have line-by-line debugged raw assembly many times.
          | It's just a pain to set up initially. Honestly it's not
          | very different from C/C++ debugging once it's running.
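 
(A concrete version of the gdb workflow teux describes, as a sketch:
the binary name, source file, and breakpoint symbol below are
hypothetical; only the gdb commands themselves are real.)
 
    $ g++ -g -O0 -o blur blur_neon.cpp   # debug build, no optimisation
    $ gdb ./blur
    (gdb) break blur_row       # stop at the (hypothetical) routine
    (gdb) run
    (gdb) layout asm           # TUI: disassembly of the current frame
    (gdb) layout regs          # TUI: registers shown above the asm
    (gdb) stepi                # advance exactly one instruction
    (gdb) info registers       # dump the general-purpose registers
    (gdb) p/x $x0              # inspect a single register (AArch64)
 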
          | MaxBarraclough wrote:
          | Or any other highly optimised numerical codebase. From
          | a quick glance at OpenBLAS, it looks like they have a
          | _lot_ of microarchitecture-specific assembly code, with
          | dispatching code to pick out the appropriate
          | implementations.
          |
          | https://github.com/xianyi/OpenBLAS/blob/02ea3db8e720b0ffb3e2...
          |
          | https://github.com/xianyi/OpenBLAS/blob/02ea3db8e720b0ffb3e2...
  | [deleted]
    | phkahler wrote:
    | >> This was with both `-O2` and `-O3`, and also with
    | `-march=native`
    |
    | Until very recently, GCC didn't do vectorization at -O2
    | unless you told it to.
      | inetknght wrote:
      | That's true. I definitely omitted a bunch of other flags
      | that were added, including the flags to turn on
      | vectorization.
  | BoardsOfCanada wrote:
  | Seems like the compiler puts the test for the first loop before
  | loading the constant the first time, and therefore needs to
  | load it again before the second loop. So the "tradeoff" is that
  | if neither loop runs, it will load the constant zero times. Of
  | course this isn't what a human would do, but at least there is
  | some sliver of logic to it. (If vpbroadcastd were a 2000-cycle
  | instruction, this pattern might have made sense.)
___________________________________________________________________
(page generated 2022-12-06 23:00 UTC)