[HN Gopher] Removing characters from strings faster with AVX-512
___________________________________________________________________
Removing characters from strings faster with AVX-512

Author : mdb31
Score  : 124 points
Date   : 2022-05-01 13:12 UTC (9 hours ago)

(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)

| gfody wrote:
| there's more whitespace above 0x20
| https://en.m.wikipedia.org/wiki/Whitespace_character#Unicode
| brrrrrm wrote:
| The complication involved with UTF-8 encoded space removal is
| immense and likely quite far out of scope.
| protoman3000 wrote:
| Please correct me if I'm wrong, but wouldn't we normally scale
| these things instead on a GPU?
| curling_grad wrote:
| Maybe because of IO costs?
| raphlinus wrote:
| The short answer is no, but the long answer is that this is a
| very complex tradeoff space. Going forward, we may see more of
| these types of tasks moving to the GPU, but for the moment it
| is generally not a good choice.
|
| The GPU is incredible at raw throughput, and this particular
| problem can actually be implemented fairly straightforwardly
| (it's a stream compaction, which in turn can be expressed in
| terms of prefix sum). However, where the GPU absolutely falls
| down is when you want to interleave CPU and GPU computations.
| To give round numbers, the roundtrip latency is on the order
| of 100us, and even aside from that, the memcpy back and forth
| between host and device memory might actually be slower than
| just solving the problem on the CPU. So you only win when the
| strings are very large - again using round numbers, about a
| megabyte.
|
| Things change if you are able to pipeline a lot of useful
| computation on the GPU. This is an area of active research
| (including my own). Aaron Hsu has been doing groundbreaking
| work implementing an entire compiler on the GPU, and there's
| more recent work[1], implemented in Futhark, that suggests
| that this approach is promising.
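The claim above that character removal "is a stream compaction, which in turn can be expressed in terms of prefix sum" can be made concrete with a scalar sketch. This is illustrative code, not from the thread; on a GPU, each of the three loops would run as a separate parallel pass.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stream compaction phrased as an exclusive prefix sum over keep/drop
// flags. Sequential here purely to show the structure; on a GPU the
// scan and the scatter both parallelize. All names are illustrative.
std::string compact_via_prefix_sum(const std::string& in) {
    const size_t n = in.size();
    std::vector<int> keep(n);
    for (size_t i = 0; i < n; ++i)
        keep[i] = (unsigned char)in[i] > ' ' ? 1 : 0;  // drop spaces/control

    // Exclusive prefix sum: out_pos[i] = number of kept bytes before i.
    std::vector<size_t> out_pos(n);
    size_t total = 0;
    for (size_t i = 0; i < n; ++i) { out_pos[i] = total; total += keep[i]; }

    // Scatter: every kept byte already knows its output slot, so each
    // element can write independently of its neighbors.
    std::string out(total, '\0');
    for (size_t i = 0; i < n; ++i)
        if (keep[i]) out[out_pos[i]] = in[i];
    return out;
}
```

The scatter loop is the key step: the prefix sum assigns each surviving byte its output index up front, which is what makes the operation embarrassingly parallel.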
| |
| I have a paper in the pipeline that includes an extraordinarily
| high performance (~12G elements/s) GPU implementation of the
| parentheses matching problem, which is the heart of parsing. If
| anyone would like to review a draft and provide comments,
| please add a comment to the GitHub issue[2] I'm using to track
| this. It's due very soon and I'm on a tight timeline to get all
| the measurements done, so actionable suggestions on how to
| improve the text would be most welcome.
|
| [1]: https://theses.liacs.nl/pdf/2020-2021-VoetterRobin.pdf
|
| [2]:
| https://github.com/raphlinus/raphlinus.github.io/issues/66#i...
| mwcampbell wrote:
| > To give round numbers, the roundtrip latency is on the
| > order of 100us
|
| I can't help but notice that, at least in my experience on
| Windows, this is the same order of magnitude as for
| inter-process communication on the local machine. Tangent: That
| latency was my nemesis as a Windows screen reader developer;
| the platform accessibility APIs weren't well designed to take
| it into account. Windows 11 finally has a good solution for
| this problem (yes, I helped implement that while I was at
| Microsoft).
| fancyfredbot wrote:
| I wonder if this applies to the same extent for an on-package
| GPU which shares the same physical memory as the CPU. I'd
| expect round trip times in that case to be minimal, and the
| available processing power would probably be competitive with
| AVX512. I have been wondering if this is the reason for
| deprecating AVX512 on consumer processors - these are likely
| to have a GPU available.
| raphlinus wrote:
| Good question! There are two separate issues with putting
| the GPU in the same package as the CPU. One is the memcpy
| bandwidth issue, which is indeed entirely mitigated
| (assuming the app is smart enough to exploit this). But the
| round trip times seem more related to context switches.
| I have an M1 Max here, and just found ~200us for a very
| simple dispatch (just clearing 16k of memory).
|
| I personally believe it may be possible to reduce latency
| using techniques similar to io_uring, but it may not be
| simple. Likely a major reason for the roundtrips is so that
| a trusted process (part of the GPU driver) can validate
| inputs from untrusted user code before they're presented to
| the GPU hardware.
| Andoryuuta wrote:
| Intel is removing AVX-512 support from their newer CPUs
| (Alder Lake+). :/
|
| https://www.igorslab.de/en/intel-deactivated-avx-512-on-alde...
| mhh__ wrote:
| You're forgetting about server CPUs, and we don't know yet
| about Raptor Lake.
| Andoryuuta wrote:
| Ah, yep. You're totally right. I didn't even consider server
| CPUs. Also, I thought I read somewhere that it was for all
| consumer CPUs starting at Alder Lake, but I have no idea
| where, so I could be entirely wrong. :)
| SemanticStrengh wrote:
| And Zen 4 is rumoured to add support for it ^^
| electricshampo1 wrote:
| This is only on the client side; server still has and will
| have AVX512 for the foreseeable future.
| PragmaticPulp wrote:
| Server and workstation chips still have AVX-512. It's only
| unsupported on CPUs with smaller E(fficiency) cores.
|
| AVX-512 was never really supported in newer consumer CPUs with
| heterogeneous architecture. These CPUs have a mix of powerful
| cores and efficiency cores. The AVX-512 instructions were
| never added to the efficiency cores because it would use way
| too much die space and defeat the purpose of efficiency cores.
|
| There was previously a hidden option to disable the efficiency
| cores and enable AVX-512 on the remaining power cores, but the
| number of workloads that would warrant turning off a lot of
| your cores to speed up AVX-512 calculations is virtually
| non-existent in the consumer world (where these cheap CPUs are
| targeted).
| |
| The whole journalism controversy around AVX-512 has been a bit
| of a joke because many of the same journalists tried to
| generate controversy when AVX-512 was first introduced and
| they realized that AVX-512 code would reduce the CPU clock
| speed. There were numerous articles about turning off AVX-512
| on previous generation CPUs to avoid this downclocking and to
| make overclocks more stable.
| pantalaimon wrote:
| Catching the bad instruction fault on the E-cores and only
| scheduling the thread on the P-cores would be something that
| could be added to Linux (there were already third-party
| patches towards that goal) if Intel had not disabled the
| feature entirely.
| saagarjha wrote:
| Presumably the AVX-512 code is something on your hot path,
| so I'm not sure waiting for a signal to reschedule the work
| is something you would want.
| jeffbee wrote:
| But it's not really compatible with the GCC IFUNC scheme
| ... PLT entries will be permanently remapped to the most
| appropriate code on the CPU where the function is first
| called, and never thereafter remapped. So you end up with a
| coin toss whether you get the optimized function or not.
|
| Personally I don't find the e-cores on my Alder Lake CPU to
| be of any value. They're more of a hazard than a benefit.
| janwas wrote:
| Fair point about ifunc, but we're using our own table of
| function pointers, which can be invalidated/re-initialized.
| Someone also mentioned that the OS could catch SIGILL...
| indeed it seems doable to then reset thread affinity to the
| P cores?
| willis936 wrote:
| > The AVX-512 instructions were never added to the efficiency
| > cores because it would use way too much die space and
| > defeat the purpose of efficiency cores.
|
| Isn't the purpose of efficiency cores to be more power
| efficient? It's more power efficient to vectorize
| instructions and minimize pipeline re-ordering.
| mastax wrote:
| Power and area efficient. You can fit 4 E cores in the area
| of 1 P core.
| Adding AVX-512 to the E cores would significantly hamper
| that, though I don't know by how much.
| zozbot234 wrote:
| > The AVX-512 instructions were never added to the efficiency
| > cores because it would use way too much die space and
| > defeat the purpose of efficiency cores.
|
| And this is why scalable vector ISAs like the RISC-V vector
| extensions are superior to fixed-size SIMD. You can support
| both kinds of microarchitecture while running the exact same
| code.
| R0b0t1 wrote:
| That's not a valid reason why I can't use them on the P
| cores. Some motherboards can enable them on the i9-12900k;
| it works fine, but you need to pin to a P core.
| PragmaticPulp wrote:
| The reason is that it was never validated or tested with
| AVX-512, and Intel and motherboard vendors couldn't commit
| to shipping everything with AVX-512 support in future
| steppings/revisions.
|
| If you disable E cores you could enable AVX-512 on certain
| motherboards, but like I said, that's not really a net win
| 99.99% of the time when you're giving up entire cores.
|
| It was also at your own risk, because presumably the
| power/clock speed profiles were never tuned for a feature
| that wasn't actually supported. I can see exactly why they
| turned it off _on newer CPUs only after an announcement_.
| R0b0t1 wrote:
| Still smells like bullshit. Let the customer decide. Who
| cares if it was validated? Why was it even included? Just
| put it behind a yes-I-really-mean-it switch so nobody
| uses it by accident.
| watmough wrote:
| This is really cool.
|
| I just got through doing some work with vectorization.
|
| On the simplest workload I have, splitting a 3 MByte text file
| into lines, writing a pointer to each string to an array, GCC
| will not vectorize the naive loop, though ICC might, I guess.
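For concreteness, the naive line-splitting loop being described is presumably something like the following (my reconstruction, not watmough's actual code). The byte-at-a-time, data-dependent control flow is the part compilers typically fail to auto-vectorize.

```cpp
#include <cstddef>
#include <vector>

// Scan a buffer and record a pointer to the start of each line. An
// AVX-512 version would instead compare 64 bytes against '\n' at once
// and walk the resulting bitmask.
std::vector<const char*> split_lines(const char* buf, size_t len) {
    std::vector<const char*> lines;
    if (len == 0) return lines;
    lines.push_back(buf);                  // first line starts at offset 0
    for (size_t i = 0; i + 1 < len; ++i)   // a trailing '\n' opens no new line
        if (buf[i] == '\n')
            lines.push_back(buf + i + 1);  // next line starts after the break
    return lines;
}
```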
| |
| With simple vectorization to AVX512 (64 unsigned chars in a
| vector), finding all the line breaks goes from 1.3 msec to 0.1
| msec, so a little better than a 10x speedup, still just on the
| one core, which keeps things simple.
|
| I was using Agner Fog's VCL 2, the Apache-licensed C++ Vector
| Class Library. It's super easy.
| bertr4nd wrote:
| I love Daniel's vectorized string processing posts. There's
| always some clever trickery that's hard for a guy like me (who
| mostly uses vector extensions for ML kernels) to get quickly.
|
| I found myself wondering if one could create a domain-specific
| language for specifying string processing tasks, and then
| automate some of the tricks with a compiler (possibly with
| human-specified optimization annotations). Halide did this
| sort of thing for image processing (and ML via TVM to some
| extent) and it was a pretty significant success.
| gslin wrote:
| A problem is that the CPU frequency drops significantly when
| AVX-512 is involved (e.g.
| https://en.wikichip.org/wiki/intel/xeon_gold/6262v), which
| usually cancels out the benefit in the Real World (tm).
| janwas wrote:
| I would love to see an example of reasonable code not seeing
| any benefit. On first-generation SKX, we saw 1.5x speedups vs
| AVX2, and that was IIRC even without taking much advantage of
| AVX3-only instructions.
| SemanticStrengh wrote:
| Please stop spreading this fallacy: while downclocking can
| happen, usually the benefit is still strong and superior to
| 256-bit AVX. Even 256-bit can induce downclocking. AVX-512,
| when properly utilized, simply demolishes non-AVX-512 CPUs.
| vlovich123 wrote:
| On that one task. The challenge is if the avx512 pieces
| aren't a bottleneck in every single concurrent workload you
| run. It's fine if the most important thing you're running on
| them is code optimized for AVX512. Realistically though, is
| that the case for the target market of CPUs capable of
| AVX512, since consumer use cases aren't?
| The predominant workload would be cloud, right? Where you're
| running heterogeneous workloads, right? You'd have to get real
| smart by coalescing AVX512 and non-AVX512 workloads onto
| separate machines and disabling it on the machines that don't
| need it. Very complicated work to do, because you'd have to
| have each workload annotated by hand (memcpy is optimized to
| use AVX512 when available, so the presence of AVX512 in the
| code is insufficient).
|
| The more generous interpretation is that Intel fixed that
| issue a while back, although the CPUs with that problem are
| still in rotation and you have to think about that when
| compiling your code.
| PragmaticPulp wrote:
| This was massively exaggerated by journalists when AVX-512 was
| first announced.
|
| It is true that randomly applied AVX-512 instructions can
| cause a slight clock speed reduction, but the proper way to
| use libraries like this would be within specific hot code
| loops, where the mild clock speed reduction is more than
| offset by the huge parallelism increase.
|
| This doesn't make sense if you're a consumer doing some
| multitasking and a background process is invoking the AVX-512
| penalty in the background, but it usually would make sense in
| a server scenario.
| adgjlsfhk1 wrote:
| The thing I never understood about this is why Intel didn't
| just add latency to the avx512 instructions instead? That
| seems much easier than downclocking the whole cpu.
| janwas wrote:
| I believe they do actually do something like this - until
| power and voltage delivery change, wide instructions are
| throttled independently of frequency changes (which on SKX
| involved a short halt).
| pclmulqdq wrote:
| Intel has been trying to reduce the penalty for AVX-512, and
| barring that, advertise that there is no penalty. Most things
| on Ice Lake run fine with 256-bit vectors, but Skylake and
| earlier really needed 128-bit or narrower if you weren't
| doing serious vector math.
| |
| Forget about 512-bit vectors or FMAs.
| alksjdalkj wrote:
| I think this is less of a problem on newer CPUs:
| https://travisdowns.github.io/blog/2020/08/19/icl-avx512-fre...
| pclmulqdq wrote:
| Those are client CPUs, which have very different behavior
| around power management than server parts. However, AVX
| downclocking has mostly gone away with Ice Lake, and hopefully
| Sapphire Rapids does away with it permanently (except on
| 512-bit vectors).
| mhh__ wrote:
| Unless someone has data for the latest Intel chips (i.e.
| Sapphire Rapids) showing the opposite, I'm inclined to think
| this is a meme from 2016/7 that needs to go the way of the
| dodo.
| Twirrim wrote:
| It was largely wrong then, too. Cloudflare, who really kicked
| off a large amount of the fuss, had "Bronze" class Xeon
| chips, that weren't designed or marketed for what they were
| attempting to use them for. They were only ever intended for
| small business stuff, not large scale high performance
| operations. Their performance downclock for AVX-512 is way,
| way higher on Bronze.
| NavinF wrote:
| Weren't those chips $10k each back then? Hardly anyone got
| gold Xeons.
| Twirrim wrote:
| Not even close. The blog post was 2017.
|
| Actually, I stand corrected: after double checking,
| Cloudflare were using Silver. Entry-level data centre
| chips, instead of small business chips. Still not the
| kind of chips you'd buy for high performance
| infrastructure, and not intended to be used for such.
|
| Xeon Silver 4116s hit the market at $1,002.00. The Golds
| were $1,221.00. The performance differences are quite
| significant. For something that'll be in service for ~3-5
| years, ~$200 is absolutely trivial by way of a per-chip
| increase. It's firmly in the "false economy" territory to
| be skimping on your chip costs. It's a bit more
| understandable in smaller businesses, but you just don't
| do it when you're operating at scale.
| |
| Also remember: at the scales that Cloudflare are purchasing
| at, they won't be paying RRP. They'll be getting tidy
| discounts.
| steve76 wrote:
| brrrrrm wrote:
| What's the generated assembly look like? I suspect clang isn't
| smart enough to store things into registers. The latency of
| VPCOMPRESSB seems quite high (according to the table here at
| least https://uops.info/table.html), so you'll probably want
| to induce a bit more pipelining by manually unrolling into the
| register variant.
|
| I don't have an AVX512 machine with VBMI2, but here's what my
| untested code might look like:
|
|     __m512i spaces = _mm512_set1_epi8(' ');
|     size_t i = 0;
|     for (; i + (64 * 4 - 1) < howmany; i += 64 * 4) {
|         // 4 input regs, 4 output regs; you can actually do up
|         // to 8 because there are 8 mask registers
|         __m512i in0 = _mm512_loadu_si512(bytes + i);
|         __m512i in1 = _mm512_loadu_si512(bytes + i + 64);
|         __m512i in2 = _mm512_loadu_si512(bytes + i + 128);
|         __m512i in3 = _mm512_loadu_si512(bytes + i + 192);
|         __mmask64 mask0 = _mm512_cmpgt_epi8_mask(in0, spaces);
|         __mmask64 mask1 = _mm512_cmpgt_epi8_mask(in1, spaces);
|         __mmask64 mask2 = _mm512_cmpgt_epi8_mask(in2, spaces);
|         __mmask64 mask3 = _mm512_cmpgt_epi8_mask(in3, spaces);
|         auto reg0 = _mm512_maskz_compress_epi8(mask0, in0);
|         auto reg1 = _mm512_maskz_compress_epi8(mask1, in1);
|         auto reg2 = _mm512_maskz_compress_epi8(mask2, in2);
|         auto reg3 = _mm512_maskz_compress_epi8(mask3, in3);
|         _mm512_storeu_si512(bytes + pos, reg0);
|         pos += _popcnt64(mask0);
|         _mm512_storeu_si512(bytes + pos, reg1);
|         pos += _popcnt64(mask1);
|         _mm512_storeu_si512(bytes + pos, reg2);
|         pos += _popcnt64(mask2);
|         _mm512_storeu_si512(bytes + pos, reg3);
|         pos += _popcnt64(mask3);
|     }
|     // old code can go here, since it handles a smaller size well
|
| You can probably do better by chunking up the input and using
| temporary memory (coalesced at the end).
| tedunangst wrote:
| What would be a practical application of this?
| The linked post mentions a trim-like operation, but in
| practice I only want to remove whitespace from the ends, not
| the interior of the string, and finding the ends is basically
| the whole problem. Or maybe I want to compress some JSON, but
| a simple approach won't work because there can be spaces
| inside string values which must be preserved.
| mdb31 wrote:
| Cool performance enhancement, with an accompanying
| implementation in a real-world library
| (https://github.com/lemire/despacer).
|
| Still, what does it signal that vector extensions are required
| to get better string performance on x86? Wouldn't it be better
| if Intel invested their AVX transistor budget into simply
| making existing REPB prefixes a lot faster?
| janwas wrote:
| Why is a large speedup from vectors surprising? Considering
| that the energy required for scheduling/dispatching an
| instruction on OoO cores dwarfs that of the actual operation
| (add/mul etc.), amortizing over multiple elements (=SIMD) is
| an obvious win.
| mdb31 wrote:
| Where do I say that the speedup is surprising?
|
| My question is whether Intel investing in AVX-512 is wise,
| given that:
|
| - Most existing code is not aware of AVX anyway;
| - Developers are especially wary of AVX-512, since they
|   expect it to be discontinued soon.
|
| Consequently, wouldn't Intel be better off by using the
| silicon dedicated to AVX-512 to speed up instruction patterns
| that are actually used?
| mhh__ wrote:
| AVX-512 is not going to be discontinued. Intel's
| reticence/struggling with having it on desktop is
| irritating, but it's here to stay on servers for a long
| time.
|
| Writing code for a specific SIMD instruction set is
| non-trivial, but most code will get some benefit by being
| compiled for the right ISA. You don't get the really fancy
| instructions, because the pattern matching in the compiler
| isn't very intelligent, but quite a lot of stuff is going
| to benefit by magic.
| |
| Even without cutting people without some AVX off, you can
| have a fast/slow path fairly easily.
| janwas wrote:
| My point is that vector instructions are fundamentally
| necessary and thus "what does it signal" evaluates to
| "nothing surprising".
|
| Sure, REP STOSB/MOVSB make for a very compact
| memset/memcpy, but their performance varies depending on
| CPU feature flags, so you're going to want multiple
| codepaths anyway. And vector instructions are vastly more
| flexible than just those two.
|
| Also, I have not met developers who expect AVX-512 to be
| discontinued (the regrettable ADL situation
| notwithstanding; that's not a server CPU). AMD is actually
| adding AVX-512.
| mdb31 wrote:
| > vector instructions are fundamentally necessary
|
| For which percentage of users?
|
| > AMD is actually adding AVX-512
|
| Which is irrelevant to in-market support for that
| instruction set.
| XorNot wrote:
| Why would it be irrelevant? Even the paucity of
| availability isn't really a problem - the big winners
| here are server users in data centers, not desktops or
| laptops. How much string parsing and munging is happening
| ingesting big datasets right now? If running a specially
| optimized function set on part of your fleet reduces
| utilization, that's direct cost savings you realize. And
| if AMD is then widening that support base, you're deeply
| favoring expanding usage while you scale up.
| _rtld_global_ro wrote:
| Given that Intel's AVX extensions could cause silent
| failures on servers (very high workload for prolonged
| time, compared to end-user computers), I'm not sure it
| would be a big win for servers either:
| https://arxiv.org/pdf/2102.11245.pdf.
| jcranmer wrote:
| I'm downvoting you because the assertion you're implying
| --that use of AVX increases soft failure rates more than
| using non-AVX instructions would--is not sustained by the
| source you use as reference.
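The "multiple codepaths" approach mentioned above (and the resettable function-pointer table described earlier in the thread) is usually a small runtime-dispatch shim. A minimal sketch, assuming GCC/Clang on x86 for `__builtin_cpu_supports`; the `despace` function names are illustrative, and the AVX-512 path is stubbed with the scalar loop so the sketch runs anywhere:

```cpp
#include <cstddef>

// Scalar fallback: keep every byte above 0x20, return the new length.
static size_t despace_scalar(char* s, size_t n) {
    size_t pos = 0;
    for (size_t i = 0; i < n; ++i)
        if ((unsigned char)s[i] > ' ') s[pos++] = s[i];
    return pos;
}

// Stand-in for the AVX-512 path; a real build would compile the
// compress-based loop here. Stubbed so this sketch runs on any CPU.
static size_t despace_avx512(char* s, size_t n) {
    return despace_scalar(s, n);
}

using despace_fn = size_t (*)(char*, size_t);

// Pick an implementation from CPU feature flags. Unlike an ifunc, the
// result is an ordinary pointer, so it can be re-resolved at will.
static despace_fn resolve_despace() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    // A real despacer would also require VBMI2 for VPCOMPRESSB.
    if (__builtin_cpu_supports("avx512f"))
        return despace_avx512;
#endif
    return despace_scalar;
}

// Resolved lazily on first call, after the runtime is fully up.
size_t despace(char* s, size_t n) {
    static const despace_fn fn = resolve_despace();
    return fn(s, n);
}
```

Keeping the resolved pointer in ordinary (re-settable) storage, rather than the PLT, is what would let a program re-resolve after, say, migrating a thread between core types.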
| tialaramex wrote:
| Indeed, I'd summarise that source as "At Facebook sometimes
| weird stuff happens. We postulate it's not because of all
| the buggy code written by Software Engineers like us, it
| must be hardware. As well as lots of speculation about
| hypothetical widespread problems that would show we're
| actually not writing buggy software, here's a single
| concrete example where it was hardware".
|
| If anything I'd say that Core 59 is one of those
| exceptions that prove the rule. This is such a rare
| phenomenon that when it does happen you can do the work
| to pin it down and say yup, this CPU is busted - if it
| was really commonplace you'd constantly trip over these
| bugs and get nowhere. There probably isn't really, as
| that paper claims, a "systemic issue across generations",
| except that those generations are all running Facebook's
| buggy code.
| janwas wrote:
| One interesting anecdote is that HPC planning for
| exascale included significant concern about machine
| failures and (silent) data corruption. When running at
| large enough scale, even seemingly small failure rates
| translate into "oh, there goes another one".
| mhh__ wrote:
| Any user who either wants performance _or_ uses a
| language that can depend on a fast library.
| retrac wrote:
| > For which percentage of users?
|
| Anyone using software that benefits from vector
| instructions. That includes a variety of compression,
| search, and image processing algorithms. Your JPEG
| decompression library might be using SSE2 or Neon. All
| high-end processors have included some form of vector
| instruction for like 20+ years now. Even the processor in
| my old eBook reader has the ARM Neon instructions.
| ip26 wrote:
| Is it generally possible to convert rep str sequences to AVX?
| Could the hardware or compiler already be doing this?
|
| AVX is just the SIMD unit.
| I would argue the transistors were spent on SIMD, and the
| hitch is simply finding the best way to send str commands to
| the SIMD hardware.
| 37ef_ced3 wrote:
| AVX-512 is an elegant, powerful, flexible set of masked vector
| instructions that is useful for many purposes. For example,
| low-cost neural net inference (https://NN-512.com). To suggest
| that Intel and AMD should instead make "existing REPB prefixes
| a lot faster" is missing the big picture. The masked
| compression instructions (one of which is used in Lemire's
| article) are endlessly useful, not just for stripping spaces
| out of a string!
| [deleted]
| mhh__ wrote:
| Many people seem to think AVX-512 is just wider AVX, which is
| a shame.
|
| NN-512 is cool. I think the Go code is pretty ugly, but I like
| the concept of the compiler a lot.
| jquery wrote:
| I prefer AMD's approach that allows them to put more cores on
| the die instead of supporting a rarely used instruction set.
| fulafel wrote:
| Zen 4 is rumored to have AVX512. AMD has in the past had
| support for wide SIMD instructions with half internal width
| implementation, so the die area requirements and instruction
| set support are somewhat orthogonal. There are many other
| interesting things in AVX512 besides the wide vectors.
| pclmulqdq wrote:
| AVX-512 finally gets a lot of things right about vector
| manipulation and plugged a lot of the holes in the
| instruction set. Part of me is upset that it came with the
| "512" name - they could have called it "AVX3" or "AVX
| Version 2" (since it's Intel and they love confusing names).
| adrian_b wrote:
| Actually AVX-512 predates AVX and Sandy Bridge.
|
| The original name of AVX-512 was "Larrabee New
| Instructions". Unlike with the other Intel instruction set
| extensions, the team which defined the "Larrabee New
| Instructions" included graphics experts hired from outside
| Intel, which is probably the reason why AVX-512 is a better
| SIMD instruction set than all the others designed by Intel.
| |
| Unfortunately, Sandy Bridge (2011), instead of implementing
| a scaled-down version of the "Larrabee New Instructions",
| implemented the significantly worse AVX instruction set.
|
| A couple of years later, Intel Haswell (2013) added to AVX
| a few of the extra instructions of the "Larrabee New
| Instructions", e.g. fused multiply-add and memory gather
| instructions. The Haswell AVX2 was thus a great improvement
| over the Sandy Bridge AVX, but it remained far from having
| all the features that had already existed in LRBni (made
| public in 2009).
|
| After the Intel Larrabee project flopped, LRBni passed
| through a few name changes, until 2016, when it was renamed
| to AVX-512 after a small change in the binary encoding of
| the instructions.
|
| I also dislike the name "AVX-512", but my reason is
| different. "AVX-512" is made to sound like it is an
| evolution of AVX, while the truth is the other way around:
| AVX was an involution of LRBni, whose purpose was to
| maximize the profits of Intel by minimizing the CPU
| manufacturing costs, taking advantage of the fact that the
| competition was weak, so the buyers had to be content with
| the crippled Intel CPUs with AVX, because nobody offered
| anything better.
|
| The existence of AVX has caused a lot of additional work
| for many programmers, who had to write programs much more
| complex than would have been possible with LRBni, which
| had from the beginning features designed to allow
| simplified programming, e.g. the mask registers that allow
| much simpler prologues and epilogues for loops, and both
| gather loads and scatter stores for accessing the memory.
| pclmulqdq wrote:
| TIL. Thank you for the history lesson on AVX. Comparing
| to SVE and the RISC-V vector instructions, AVX feels so
| clunky, but I guess that was part of the "Intel tax."
| janwas wrote:
| :) I have actually heard it referred to as AVX3; we also
| adopted that name in Highway.
| atq2119 wrote:
| Agreed.
| Though I feel that for the most part, size-agnostic vector
| instructions a la SVE would be the way to go.
___________________________________________________________________
(page generated 2022-05-01 23:00 UTC)