[HN Gopher] Removing characters from strings faster with AVX-512
___________________________________________________________________
Removing characters from strings faster with AVX-512

Author : mdb31
Score  : 124 points
Date   : 2022-05-01 13:12 UTC (9 hours ago)

(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)

| gfody wrote:
| there's more whitespace above 0x20
| https://en.m.wikipedia.org/wiki/Whitespace_character#Unicode
| brrrrrm wrote:
| The complication involved with UTF-8 encoded space removal is
| immense and likely quite far out of scope.
| protoman3000 wrote:
| Please correct me if I'm wrong, but wouldn't we normally scale
| these things instead on a GPU?
| curling_grad wrote:
| Maybe because of IO costs?
| raphlinus wrote:
| The short answer is no, but the long answer is that this is a
| very complex tradeoff space. Going forward, we may see more of
| these types of tasks moving to the GPU, but for the moment it
| is generally not a good choice.
|
| The GPU is incredible at raw throughput, and this particular
| problem can actually be implemented fairly straightforwardly
| (it's a stream compaction, which in turn can be expressed in
| terms of prefix sum). However, where the GPU absolutely falls
| down is when you want to interleave CPU and GPU computations.
| To give round numbers, the roundtrip latency is on the order
| of 100us, and even aside from that, the memcpy back and forth
| between host and device memory might actually be slower than
| just solving the problem on the CPU. So you only win when the
| strings are very large - again using round numbers, about a
| megabyte.
|
| Things change if you are able to pipeline a lot of useful
| computation on the GPU. This is an area of active research
| (including my own). Aaron Hsu has been doing groundbreaking
| work implementing an entire compiler on the GPU, and there's
| more recent work[1], implemented in Futhark, that suggests
| that this approach is promising.
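The claim above that character removal "is a stream compaction, which in turn can be expressed in terms of prefix sum" can be made concrete with a scalar sketch. This is illustrative code, not from the thread; on a GPU, each of the three loops would run as a separate parallel pass.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stream compaction phrased as an exclusive prefix sum over keep/drop
// flags. Sequential here purely to show the structure; on a GPU the
// scan and the scatter both parallelize. All names are illustrative.
std::string compact_via_prefix_sum(const std::string& in) {
    const size_t n = in.size();
    std::vector<int> keep(n);
    for (size_t i = 0; i < n; ++i)
        keep[i] = (unsigned char)in[i] > ' ' ? 1 : 0;  // drop spaces/control

    // Exclusive prefix sum: out_pos[i] = number of kept bytes before i.
    std::vector<size_t> out_pos(n);
    size_t total = 0;
    for (size_t i = 0; i < n; ++i) { out_pos[i] = total; total += keep[i]; }

    // Scatter: every kept byte already knows its output slot, so each
    // element can write independently of its neighbors.
    std::string out(total, '\0');
    for (size_t i = 0; i < n; ++i)
        if (keep[i]) out[out_pos[i]] = in[i];
    return out;
}
```

The scatter loop is the key step: the prefix sum assigns each surviving byte its output index up front, which is what makes the operation embarrassingly parallel.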
| |
| I have a paper in the pipeline that includes an extraordinarily
| high performance (~12G elements/s) GPU implementation of the
| parentheses matching problem, which is the heart of parsing. If
| anyone would like to review a draft and provide comments,
| please add a comment to the GitHub issue[2] I'm using to track
| this. It's due very soon and I'm on a tight timeline to get all
| the measurements done, so actionable suggestions on how to
| improve the text would be most welcome.
|
| [1]: https://theses.liacs.nl/pdf/2020-2021-VoetterRobin.pdf
|
| [2]:
| https://github.com/raphlinus/raphlinus.github.io/issues/66#i...
| mwcampbell wrote:
| > To give round numbers, the roundtrip latency is on the
| > order of 100us
|
| I can't help but notice that, at least in my experience on
| Windows, this is the same order of magnitude as for
| inter-process communication on the local machine. Tangent: That
| latency was my nemesis as a Windows screen reader developer;
| the platform accessibility APIs weren't well designed to take
| it into account. Windows 11 finally has a good solution for
| this problem (yes, I helped implement that while I was at
| Microsoft).
| fancyfredbot wrote:
| I wonder if this applies to the same extent for an on-package
| GPU which shares the same physical memory as the CPU. I'd
| expect round trip times in that case to be minimal, and the
| available processing power would probably be competitive with
| AVX512. I have been wondering if this is the reason for
| deprecating AVX512 on consumer processors - these are likely
| to have a GPU available.
| raphlinus wrote:
| Good question! There are two separate issues with putting
| the GPU in the same package as the CPU. One is the memcpy
| bandwidth issue, which is indeed entirely mitigated
| (assuming the app is smart enough to exploit this). But the
| round trip times seem more related to context switches.
| I have an M1 Max here, and just found ~200us for a very
| simple dispatch (just clearing 16k of memory).
|
| I personally believe it may be possible to reduce latency
| using techniques similar to io_uring, but it may not be
| simple. Likely a major reason for the roundtrips is so that
| a trusted process (part of the GPU driver) can validate
| inputs from untrusted user code before they're presented to
| the GPU hardware.
| Andoryuuta wrote:
| Intel is removing AVX-512 support from their newer CPUs
| (Alder Lake+). :/
|
| https://www.igorslab.de/en/intel-deactivated-avx-512-on-alde...
| mhh__ wrote:
| You're forgetting about server CPUs, and we don't know yet
| about Raptor Lake.
| Andoryuuta wrote:
| Ah, yep. You're totally right. I didn't even consider server
| CPUs. Also, I thought I read somewhere that it was for all
| consumer CPUs starting at Alder Lake, but I have no idea
| where, so I could be entirely wrong. :)
| SemanticStrengh wrote:
| And Zen 4 is rumoured to add support for it ^^
| electricshampo1 wrote:
| This is only on the client side; server still has and will
| have AVX512 for the foreseeable future.
| PragmaticPulp wrote:
| Server and workstation chips still have AVX-512. It's only
| unsupported on CPUs with smaller E(fficiency) cores.
|
| AVX-512 was never really supported in newer consumer CPUs with
| heterogeneous architecture. These CPUs have a mix of powerful
| cores and efficiency cores. The AVX-512 instructions were
| never added to the efficiency cores because it would use way
| too much die space and defeat the purpose of efficiency cores.
|
| There was previously a hidden option to disable the efficiency
| cores and enable AVX-512 on the remaining power cores, but the
| number of workloads that would warrant turning off a lot of
| your cores to speed up AVX-512 calculations is virtually
| non-existent in the consumer world (where these cheap CPUs are
| targeted).
| |
| The whole journalism controversy around AVX-512 has been a bit
| of a joke because many of the same journalists tried to
| generate controversy when AVX-512 was first introduced and
| they realized that AVX-512 code would reduce the CPU clock
| speed. There were numerous articles about turning off AVX-512
| on previous generation CPUs to avoid this downclocking and to
| make overclocks more stable.
| pantalaimon wrote:
| Catching the bad instruction fault on the E-cores and only
| scheduling the thread on the P-cores would be something that
| could be added to Linux (there were already third-party
| patches towards that goal) if Intel had not disabled the
| feature entirely.
| saagarjha wrote:
| Presumably the AVX-512 code is something on your hot path,
| so I'm not sure waiting for a signal to reschedule the work
| is something you would want.
| jeffbee wrote:
| But it's not really compatible with the GCC IFUNC scheme
| ... PLT entries will be permanently remapped to the most
| appropriate code on the CPU where the function is first
| called, and never thereafter remapped. So you end up with a
| coin toss whether you get the optimized function or not.
|
| Personally I don't find the e-cores on my Alder Lake CPU to
| be of any value. They're more of a hazard than a benefit.
| janwas wrote:
| Fair point about ifunc, but we're using our own table of
| function pointers, which can be invalidated/re-initialized.
| Someone also mentioned that the OS could catch SIGILL...
| indeed it seems doable to then reset thread affinity to the
| P cores?
| willis936 wrote:
| > The AVX-512 instructions were never added to the efficiency
| > cores because it would use way too much die space and
| > defeat the purpose of efficiency cores.
|
| Isn't the purpose of efficiency cores to be more power
| efficient? It's more power efficient to vectorize
| instructions and minimize pipeline re-ordering.
| mastax wrote:
| Power and area efficient. You can fit 4 E cores in the area
| of 1 P core.
| Adding AVX-512 to the E cores would significantly hamper
| that, though I don't know by how much.
| zozbot234 wrote:
| > The AVX-512 instructions were never added to the efficiency
| > cores because it would use way too much die space and
| > defeat the purpose of efficiency cores.
|
| And this is why scalable vector ISAs like the RISC-V vector
| extensions are superior to fixed-size SIMD. You can support
| both kinds of microarchitecture while running the exact same
| code.
| R0b0t1 wrote:
| That's not a valid reason why I can't use them on the P
| cores. Some motherboards can enable them on the i9-12900k;
| it works fine, but you need to pin to a P core.
| PragmaticPulp wrote:
| The reason is that it was never validated or tested with
| AVX-512, and Intel and motherboard vendors couldn't commit
| to shipping everything with AVX-512 support in future
| steppings/revisions.
|
| If you disable E cores you could enable AVX-512 on certain
| motherboards, but like I said, that's not really a net win
| 99.99% of the time when you're giving up entire cores.
|
| It was also at your own risk, because presumably the
| power/clock speed profiles were never tuned for a feature
| that wasn't actually supported. I can see exactly why they
| turned it off _on newer CPUs only after an announcement_.
| R0b0t1 wrote:
| Still smells like bullshit. Let the customer decide. Who
| cares if it was validated? Why was it even included? Just
| put it behind a yes-I-really-mean-it switch so nobody
| uses it by accident.
| watmough wrote:
| This is really cool.
|
| I just got through doing some work with vectorization.
|
| On the simplest workload I have, splitting a 3 MByte text file
| into lines, writing a pointer to each string to an array, GCC
| will not vectorize the naive loop, though ICC might, I guess.
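For concreteness, the naive line-splitting loop being described is presumably something like the following (my reconstruction, not watmough's actual code). The byte-at-a-time, data-dependent control flow is the part compilers typically fail to auto-vectorize.

```cpp
#include <cstddef>
#include <vector>

// Scan a buffer and record a pointer to the start of each line. An
// AVX-512 version would instead compare 64 bytes against '\n' at once
// and walk the resulting bitmask.
std::vector<const char*> split_lines(const char* buf, size_t len) {
    std::vector<const char*> lines;
    if (len == 0) return lines;
    lines.push_back(buf);                  // first line starts at offset 0
    for (size_t i = 0; i + 1 < len; ++i)   // a trailing '\n' opens no new line
        if (buf[i] == '\n')
            lines.push_back(buf + i + 1);  // next line starts after the break
    return lines;
}
```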
| |
| With simple vectorization to AVX512 (64 unsigned chars in a
| vector), finding all the line breaks goes from 1.3 msec to 0.1
| msec, so a little better than a 10x speedup, still just on the
| one core, which keeps things simple.
|
| I was using Agner Fog's VCL 2, the Apache-licensed C++ Vector
| Class Library. It's super easy.
| bertr4nd wrote:
| I love Daniel's vectorized string processing posts. There's
| always some clever trickery that's hard for a guy like me (who
| mostly uses vector extensions for ML kernels) to get quickly.
|
| I found myself wondering if one could create a domain-specific
| language for specifying string processing tasks, and then
| automate some of the tricks with a compiler (possibly with
| human-specified optimization annotations). Halide did this
| sort of thing for image processing (and ML via TVM to some
| extent) and it was a pretty significant success.
| gslin wrote:
| A problem is that the CPU frequency drops significantly when
| AVX-512 is involved (e.g.
| https://en.wikichip.org/wiki/intel/xeon_gold/6262v), which
| usually cancels out the benefit in the Real World (tm).
| janwas wrote:
| I would love to see an example of reasonable code not seeing
| any benefit. On first-generation SKX, we saw 1.5x speedups vs
| AVX2, and that was IIRC even without taking much advantage of
| AVX3-only instructions.
| SemanticStrengh wrote:
| Please stop spreading this fallacy: while downclocking can
| happen, usually the benefit is still strong and superior to
| 256-bit AVX. Even 256-bit can induce downclocking. AVX-512,
| when properly utilized, simply demolishes non-AVX-512 CPUs.
| vlovich123 wrote:
| On that one task. The challenge is if the avx512 pieces
| aren't a bottleneck in every single concurrent workload you
| run. It's fine if the most important thing you're running on
| them is code optimized for AVX512. Realistically though, is
| that the case for the target market of CPUs capable of
| AVX512, since consumer use cases aren't?
| The predominant workload would be cloud, right? Where you're
| running heterogeneous workloads, right? You'd have to get real
| smart by coalescing AVX512 and non-AVX512 workloads onto
| separate machines and disabling it on the machines that don't
| need it. Very complicated work to do, because you'd have to
| have each workload annotated by hand (memcpy is optimized to
| use AVX512 when available, so the presence of AVX512 in the
| code is insufficient).
|
| The more generous interpretation is that Intel fixed that
| issue a while back, although the CPUs with that problem are
| still in rotation and you have to think about that when
| compiling your code.
| PragmaticPulp wrote:
| This was massively exaggerated by journalists when AVX-512 was
| first announced.
|
| It is true that randomly applied AVX-512 instructions can
| cause a slight clock speed reduction, but the proper way to
| use libraries like this would be within specific hot code
| loops, where the mild clock speed reduction is more than
| offset by the huge parallelism increase.
|
| This doesn't make sense if you're a consumer doing some
| multitasking and a background process is invoking the AVX-512
| penalty in the background, but it usually would make sense in
| a server scenario.
| adgjlsfhk1 wrote:
| The thing I never understood about this is why Intel didn't
| just add latency to the avx512 instructions instead? That
| seems much easier than downclocking the whole cpu.
| janwas wrote:
| I believe they do actually do something like this - until
| power and voltage delivery change, wide instructions are
| throttled independently of frequency changes (which on SKX
| involved a short halt).
| pclmulqdq wrote:
| Intel has been trying to reduce the penalty for AVX-512, and
| barring that, advertise that there is no penalty. Most things
| on Ice Lake run fine with 256-bit vectors, but Skylake and
| earlier really needed 128-bit or narrower if you weren't
| doing serious vector math.
| |
| Forget about 512-bit vectors or FMAs.
| alksjdalkj wrote:
| I think this is less of a problem on newer CPUs:
| https://travisdowns.github.io/blog/2020/08/19/icl-avx512-fre...
| pclmulqdq wrote:
| Those are client CPUs, which have very different behavior
| around power management than server parts. However, AVX
| downclocking has mostly gone away with Ice Lake, and hopefully
| Sapphire Rapids does away with it permanently (except on
| 512-bit vectors).
| mhh__ wrote:
| Unless someone has data for the latest Intel chips (i.e.
| Sapphire Rapids) showing the opposite, I'm inclined to think
| this is a meme from 2016/7 that needs to go the way of the
| dodo.
| Twirrim wrote:
| It was largely wrong then, too. Cloudflare, who really kicked
| off a large amount of the fuss, had "Bronze" class Xeon
| chips, that weren't designed or marketed for what they were
| attempting to use them for. They were only ever intended for
| small business stuff, not large scale high performance
| operations. Their performance downclock for AVX-512 is way,
| way higher on Bronze.
| NavinF wrote:
| Weren't those chips $10k each back then? Hardly anyone got
| gold Xeons.
| Twirrim wrote:
| Not even close. The blog post was 2017.
|
| Actually, I stand corrected: after double checking,
| Cloudflare were using Silver. Entry-level data centre
| chips, instead of small business chips. Still not the
| kind of chips you'd buy for high performance
| infrastructure, and not intended to be used for such.
|
| Xeon Silver 4116s hit the market at $1,002.00. The Golds
| were $1,221.00. The performance differences are quite
| significant. For something that'll be in service for ~3-5
| years, ~$200 is absolutely trivial by way of a per-chip
| increase. It's firmly in the "false economy" territory to
| be skimping on your chip costs. It's a bit more
| understandable in smaller businesses, but you just don't
| do it when you're operating at scale.
| |
| Also remember: at the scales that Cloudflare are purchasing
| at, they won't be paying RRP. They'll be getting tidy
| discounts.
| steve76 wrote:
| brrrrrm wrote:
| What's the generated assembly look like? I suspect clang isn't
| smart enough to store things into registers. The latency of
| VPCOMPRESSB seems quite high (according to the table here at
| least https://uops.info/table.html), so you'll probably want
| to induce a bit more pipelining by manually unrolling into the
| register variant.
|
| I don't have an AVX512 machine with VBMI2, but here's what my
| untested code might look like:
|
|     __m512i spaces = _mm512_set1_epi8(' ');
|     size_t i = 0;
|     for (; i + (64 * 4 - 1) < howmany; i += 64 * 4) {
|         // 4 input regs, 4 output regs; you can actually do up
|         // to 8 because there are 8 mask registers
|         __m512i in0 = _mm512_loadu_si512(bytes + i);
|         __m512i in1 = _mm512_loadu_si512(bytes + i + 64);
|         __m512i in2 = _mm512_loadu_si512(bytes + i + 128);
|         __m512i in3 = _mm512_loadu_si512(bytes + i + 192);
|         __mmask64 mask0 = _mm512_cmpgt_epi8_mask(in0, spaces);
|         __mmask64 mask1 = _mm512_cmpgt_epi8_mask(in1, spaces);
|         __mmask64 mask2 = _mm512_cmpgt_epi8_mask(in2, spaces);
|         __mmask64 mask3 = _mm512_cmpgt_epi8_mask(in3, spaces);
|         auto reg0 = _mm512_maskz_compress_epi8(mask0, in0);
|         auto reg1 = _mm512_maskz_compress_epi8(mask1, in1);
|         auto reg2 = _mm512_maskz_compress_epi8(mask2, in2);
|         auto reg3 = _mm512_maskz_compress_epi8(mask3, in3);
|         _mm512_storeu_si512(bytes + pos, reg0);
|         pos += _popcnt64(mask0);
|         _mm512_storeu_si512(bytes + pos, reg1);
|         pos += _popcnt64(mask1);
|         _mm512_storeu_si512(bytes + pos, reg2);
|         pos += _popcnt64(mask2);
|         _mm512_storeu_si512(bytes + pos, reg3);
|         pos += _popcnt64(mask3);
|     }
|     // old code can go here, since it handles a smaller size well
|
| You can probably do better by chunking up the input and using
| temporary memory (coalesced at the end).
| tedunangst wrote:
| What would be a practical application of this?
| The linked post mentions a trim-like operation, but in
| practice I only want to remove whitespace from the ends, not
| the interior of the string, and finding the ends is basically
| the whole problem. Or maybe I want to compress some JSON, but
| a simple approach won't work because there can be spaces
| inside string values which must be preserved.
| mdb31 wrote:
| Cool performance enhancement, with an accompanying
| implementation in a real-world library
| (https://github.com/lemire/despacer).
|
| Still, what does it signal that vector extensions are required
| to get better string performance on x86? Wouldn't it be better
| if Intel invested their AVX transistor budget into simply
| making existing REPB prefixes a lot faster?
| janwas wrote:
| Why is a large speedup from vectors surprising? Considering
| that the energy required for scheduling/dispatching an
| instruction on OoO cores dwarfs that of the actual operation
| (add/mul etc.), amortizing over multiple elements (=SIMD) is
| an obvious win.
| mdb31 wrote:
| Where do I say that the speedup is surprising?
|
| My question is whether Intel investing in AVX-512 is wise,
| given that:
|
| - Most existing code is not aware of AVX anyway;
| - Developers are especially wary of AVX-512, since they
|   expect it to be discontinued soon.
|
| Consequently, wouldn't Intel be better off by using the
| silicon dedicated to AVX-512 to speed up instruction patterns
| that are actually used?
| mhh__ wrote:
| AVX-512 is not going to be discontinued. Intel's
| reticence/struggling with having it on desktop is
| irritating, but it's here to stay on servers for a long
| time.
|
| Writing code for a specific SIMD instruction set is
| non-trivial, but most code will get some benefit by being
| compiled for the right ISA. You don't get the really fancy
| instructions, because the pattern matching in the compiler
| isn't very intelligent, but quite a lot of stuff is going
| to benefit by magic.
| |
| Even without cutting people without some AVX off, you can
| have a fast/slow path fairly easily.
| janwas wrote:
| My point is that vector instructions are fundamentally
| necessary and thus "what does it signal" evaluates to
| "nothing surprising".
|
| Sure, REP STOSB/MOVSB make for a very compact
| memset/memcpy, but their performance varies depending on
| CPU feature flags, so you're going to want multiple
| codepaths anyway. And vector instructions are vastly more
| flexible than just those two.
|
| Also, I have not met developers who expect AVX-512 to be
| discontinued (the regrettable ADL situation
| notwithstanding; that's not a server CPU). AMD is actually
| adding AVX-512.
| mdb31 wrote:
| > vector instructions are fundamentally necessary
|
| For which percentage of users?
|
| > AMD is actually adding AVX-512
|
| Which is irrelevant to in-market support for that
| instruction set.
| XorNot wrote:
| Why would it be irrelevant? Even the paucity of
| availability isn't really a problem - the big winners
| here are server users in data centers, not desktops or
| laptops. How much string parsing and munging is happening
| ingesting big datasets right now? If running a specially
| optimized function set on part of your fleet reduces
| utilization, that's direct cost savings you realize. And
| if AMD is then widening that support base, you're deeply
| favoring expanding usage while you scale up.
| _rtld_global_ro wrote:
| Given that Intel's AVX extensions could cause silent
| failures on servers (very high workload for prolonged
| time, compared to end-user computers), I'm not sure it
| would be a big win for servers either:
| https://arxiv.org/pdf/2102.11245.pdf.
| jcranmer wrote:
| I'm downvoting you because the assertion you're implying
| --that use of AVX increases soft failure rates more than
| using non-AVX instructions would--is not sustained by the
| source you use as reference.
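The "multiple codepaths" approach mentioned above (and the resettable function-pointer table described earlier in the thread) is usually a small runtime-dispatch shim. A minimal sketch, assuming GCC/Clang on x86 for `__builtin_cpu_supports`; the `despace` function names are illustrative, and the AVX-512 path is stubbed with the scalar loop so the sketch runs anywhere:

```cpp
#include <cstddef>

// Scalar fallback: keep every byte above 0x20, return the new length.
static size_t despace_scalar(char* s, size_t n) {
    size_t pos = 0;
    for (size_t i = 0; i < n; ++i)
        if ((unsigned char)s[i] > ' ') s[pos++] = s[i];
    return pos;
}

// Stand-in for the AVX-512 path; a real build would compile the
// compress-based loop here. Stubbed so this sketch runs on any CPU.
static size_t despace_avx512(char* s, size_t n) {
    return despace_scalar(s, n);
}

using despace_fn = size_t (*)(char*, size_t);

// Pick an implementation from CPU feature flags. Unlike an ifunc, the
// result is an ordinary pointer, so it can be re-resolved at will.
static despace_fn resolve_despace() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    // A real despacer would also require VBMI2 for VPCOMPRESSB.
    if (__builtin_cpu_supports("avx512f"))
        return despace_avx512;
#endif
    return despace_scalar;
}

// Resolved lazily on first call, after the runtime is fully up.
size_t despace(char* s, size_t n) {
    static const despace_fn fn = resolve_despace();
    return fn(s, n);
}
```

Keeping the resolved pointer in ordinary (re-settable) storage, rather than the PLT, is what would let a program re-resolve after, say, migrating a thread between core types.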
| tialaramex wrote:
| Indeed, I'd summarise that source as "At Facebook sometimes
| weird stuff happens. We postulate it's not because of all
| the buggy code written by Software Engineers like us, it
| must be hardware. As well as lots of speculation about
| hypothetical widespread problems that would show we're
| actually not writing buggy software, here's a single
| concrete example where it was hardware".
|
| If anything I'd say that Core 59 is one of those
| exceptions that prove the rule. This is such a rare
| phenomenon that when it does happen you can do the work
| to pin it down and say yup, this CPU is busted - if it
| was really commonplace you'd constantly trip over these
| bugs and get nowhere. There probably isn't really, as
| that paper claims, a "systemic issue across generations",
| except that those generations are all running Facebook's
| buggy code.
| janwas wrote:
| One interesting anecdote is that HPC planning for
| exascale included significant concern about machine
| failures and (silent) data corruption. When running at
| large enough scale, even seemingly small failure rates
| translate into "oh, there goes another one".
| mhh__ wrote:
| Any user who either wants performance _or_ uses a
| language that can depend on a fast library.
| retrac wrote:
| > For which percentage of users?
|
| Anyone using software that benefits from vector
| instructions. That includes a variety of compression,
| search, and image processing algorithms. Your JPEG
| decompression library might be using SSE2 or Neon. All
| high-end processors have included some form of vector
| instruction for like 20+ years now. Even the processor in
| my old eBook reader has the ARM Neon instructions.
| ip26 wrote:
| Is it generally possible to convert rep str sequences to AVX?
| Could the hardware or compiler already be doing this?
|
| AVX is just the SIMD unit.
| I would argue the transistors were spent on SIMD, and the
| hitch is simply finding the best way to send str commands to
| the SIMD hardware.
| 37ef_ced3 wrote:
| AVX-512 is an elegant, powerful, flexible set of masked vector
| instructions that is useful for many purposes. For example,
| low-cost neural net inference (https://NN-512.com). To suggest
| that Intel and AMD should instead make "existing REPB prefixes
| a lot faster" is missing the big picture. The masked
| compression instructions (one of which is used in Lemire's
| article) are endlessly useful, not just for stripping spaces
| out of a string!
| [deleted]
| mhh__ wrote:
| Many people seem to think AVX-512 is just wider AVX, which is
| a shame.
|
| NN-512 is cool. I think the Go code is pretty ugly, but I like
| the concept of the compiler a lot.
| jquery wrote:
| I prefer AMD's approach that allows them to put more cores on
| the die instead of supporting a rarely used instruction set.
| fulafel wrote:
| Zen 4 is rumored to have AVX512. AMD has in the past had
| support for wide SIMD instructions with half internal width
| implementation, so the die area requirements and instruction
| set support are somewhat orthogonal. There are many other
| interesting things in AVX512 besides the wide vectors.
| pclmulqdq wrote:
| AVX-512 finally gets a lot of things right about vector
| manipulation and plugged a lot of the holes in the
| instruction set. Part of me is upset that it came with the
| "512" name - they could have called it "AVX3" or "AVX
| Version 2" (since it's Intel and they love confusing names).
| adrian_b wrote:
| Actually AVX-512 predates AVX and Sandy Bridge.
|
| The original name of AVX-512 was "Larrabee New
| Instructions". Unlike with the other Intel instruction set
| extensions, the team which defined the "Larrabee New
| Instructions" included graphics experts hired from outside
| Intel, which is probably the reason why AVX-512 is a better
| SIMD instruction set than all the others designed by Intel.
| |
| Unfortunately, Sandy Bridge (2011), instead of implementing
| a scaled-down version of the "Larrabee New Instructions",
| implemented the significantly worse AVX instruction set.
|
| A couple of years later, Intel Haswell (2013) added to AVX
| a few of the extra instructions of the "Larrabee New
| Instructions", e.g. fused multiply-add and memory gather
| instructions. The Haswell AVX2 was thus a great improvement
| over the Sandy Bridge AVX, but it remained far from having
| all the features that had already existed in LRBni (made
| public in 2009).
|
| After the Intel Larrabee project flopped, LRBni passed
| through a few name changes, until 2016, when it was renamed
| to AVX-512 after a small change in the binary encoding of
| the instructions.
|
| I also dislike the name "AVX-512", but my reason is
| different. "AVX-512" is made to sound like it is an
| evolution of AVX, while the truth is the other way around:
| AVX was an involution of LRBni, whose purpose was to
| maximize the profits of Intel by minimizing the CPU
| manufacturing costs, taking advantage of the fact that the
| competition was weak, so the buyers had to be content with
| the crippled Intel CPUs with AVX, because nobody offered
| anything better.
|
| The existence of AVX has caused a lot of additional work
| for many programmers, who had to write programs much more
| complex than would have been possible with LRBni, which
| had from the beginning features designed to allow
| simplified programming, e.g. the mask registers that allow
| much simpler prologues and epilogues for loops, and both
| gather loads and scatter stores for accessing the memory.
| pclmulqdq wrote:
| TIL. Thank you for the history lesson on AVX. Comparing
| to SVE and the RISC-V vector instructions, AVX feels so
| clunky, but I guess that was part of the "Intel tax."
| janwas wrote:
| :) I have actually heard it referred to as AVX3; we also
| adopted that name in Highway.
| atq2119 wrote:
| Agreed.
| Though I feel that for the most part, size-agnostic vector
| instructions a la SVE would be the way to go.
___________________________________________________________________
(page generated 2022-05-01 23:00 UTC)