[HN Gopher] Zen4's AVX512 Teardown ___________________________________________________________________ Zen4's AVX512 Teardown Author : dragontamer Score : 326 points Date : 2022-09-26 14:17 UTC (8 hours ago) (HTM) web link (www.mersenneforum.org) (TXT) w3m dump (www.mersenneforum.org) | asbeb wrote: | bufo wrote: | The BF16 and VNNI instructions are finally going to make AMD | competitive for neural network inference. | smat wrote: | Very interesting read. The author notes that double pumping the | 512 bit instructions to 256 bit execution units appears to be a | good trade-off. | | As far as I understood, ARM's new SIMD instruction set is able to | map to execution units of arbitrary width. So it sounds to me | like ARM is ahead of x86 in flexibility here and might be able to | profit in the future. | | Maybe somebody with more in-depth knowledge could respond whether | my understanding is correct. | adrian_b wrote: | With any traditional ISA with wide registers and instructions, | a.k.a. SIMD instructions, it is possible to implement the | execution units with any width desired, regardless of the | architectural register and instruction width. | | Obviously, it only makes sense for the width of the execution | units to be a divisor of the architectural width, otherwise | they would not be used efficiently. | | Thus it is possible to choose various compromises between the | cost and the performance of the execution units. | | However, if the ISA specifies e.g. 32 512-bit registers, then | even the cheapest implementation must include at least that | many physical registers, even if the execution units may | be much narrower. | | What is new in ARM SVE/SVE2, and what gives the name | "Scalable" to that vector extension, is that the register | width is not fixed by the ISA, but may differ between | implementations. 
| | Thus a cheap smartphone CPU may have 128-bit registers, while | an expensive server CPU for scientific computation applications | might have 1024-bit registers. | | With SVE/SVE2, it is possible to write a program without | knowing the width of the registers on the target | CPU. | | Nevertheless, the scalability feature is not perfect, thus some | programs may still be made faster if a certain register width | is assumed before compilation, which may make them run slower | than possible on a CPU that in fact has wider registers than | assumed. | bee_rider wrote: | ARM's SVE is definitely interesting, but I do wonder if it is | slowly honing in on CRAY-style vector processing. Which is | definitely a cool idea, but a little different from the now- | popular fixed-width SIMD. I don't know that it makes sense to | call one ahead of the other yet -- ARM's documentation is clear | that SVE2 doesn't replace NEON. "Mostly scalar but let's | sprinkle in some SIMD" coding will probably always be with us | (until ML somehow turns all programs into dot products I | guess!) | | RISC-V also has a variable length vector extension. | brigade wrote: | There's not really any reason for modern general-purpose CPUs | to specialize for IPC lower than 1 like what Cray did. CPUs | need wide frontends to execute existing scalar code as fast | as we're used to, and if you're not reusing most of that | width for vectors then the design is just wasting power. | dis-sys wrote: | AMD EPYC Genoa will be a killing machine, with almost 1TBytes/sec | memory bandwidth and this avx512 extension... | | good luck to Intel's Xeon. | xani_ wrote: | Genuinely good luck, it's never good when there is no | competition | sliken wrote: | I'm hearing 12 channels @ 5200 MT/sec or so. Sounds like | 500GB/sec, not 1TB/sec. Oh maybe you meant in a dual socket | config? | sekh60 wrote: | I've updated most of my home lab to AMD EPYC Rome processors. 
| Really can't beat the core counts for private cloud and the | price is amazing compared to Intel. Looking forward to Genoa | myself, though moving past Rome will be a ways away for my lab. | xani_ wrote: | Sounds like a hell of a lab! What are you doing on it, machine | learning? | causi wrote: | Exciting stuff. AVX512 isn't just for specialized work projects. | It's also a huge performance boost for game console emulation. | mmastrac wrote: | When I was doing some work on Dolphin's JIT, AVX | implementations were always in the back of my mind. It's a massive | tradeoff in so many cases but having access to these is | amazing. | stagger87 wrote: | That sounds like a specialized project :) | marginalia_nu wrote: | Any project becomes specialized if you work long enough on | it. | dtech wrote: | Interesting! Any reason why they're specifically good for | emulation? | xani_ wrote: | https://whatcookie.github.io/posts/why-is-avx-512-useful- | for... | causi wrote: | I don't have a deep understanding of the implementation, but | it gets you a 30% performance boost in Playstation 3 | emulation. | | https://www.tomshardware.com/news/ps3-emulator- | avx-512-30-pe... | mastax wrote: | > And it is basically impossible to hit the 230W power limit at | stock without direct die or sub-ambient. | | Almost, but not quite. In GamersNexus' review they recorded | 250.8W measured at the EPS12V cables, while using an Arctic | Cooling Liquid Freezer II 360mm AIO with the fans at 100%. At | 230W/1.5V=153A a good VRM will generate about 17W of heat. That | leaves you a few watts for board power plane and socket resistive | losses (I don't have an estimate for that). | | Not a very practical cooling solution for a day-to-day | workstation, but I do wonder if you could reduce the fan speeds a | bit while still maxing out the power limit. 
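The figures in the comment above can be sanity-checked with a short sketch. The current and heat numbers come straight from the comment; the ~93% VRM efficiency is an assumed value, chosen only because it reproduces the quoted ~17 W loss, not a measured one.

```python
# Sanity check of the power figures discussed above.
# The VRM efficiency (0.93) is an assumption, not a measurement.

def vrm_figures(p_cpu_w: float, vcore_v: float, efficiency: float):
    """Return (core current in A, VRM heat in W) for a given package power."""
    current_a = p_cpu_w / vcore_v      # current delivered at Vcore
    p_input_w = p_cpu_w / efficiency   # power drawn from the EPS12V cables
    vrm_heat_w = p_input_w - p_cpu_w   # dissipated in the VRM stage
    return current_a, vrm_heat_w

current, heat = vrm_figures(230.0, 1.5, 0.93)
print(f"~{current:.0f} A, ~{heat:.0f} W of VRM heat")  # ~153 A, ~17 W
```

The same function with GN's 250.8 W EPS12V reading and a 90% efficiency reproduces the ~225.7 W-at-the-socket figure mentioned further down the thread.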
| philjohn wrote: | Then again, I have 2 420mm Black Ice Nemesis radiators in my | custom loop - even at relatively low speeds it can keep the | 5800X in there and 3080 Ti cool under constant high loads. | pclmulqdq wrote: | My mini-ITX work desktop has no problems with a 5900x and a | Radeon VII pro running Rocm work, using only a tiny heatsink | on the 5900x (and some high-airflow fans, but nothing too | incredibly loud). It doesn't thermal throttle, but tops out | around 80-90 degrees C. | | The 7000-series seems to be a different story: you really | need a big cooler for those chips. | adrian_b wrote: | I have the same CPU + GPU combination, but used on an ATX | MB with a Noctua cooler with a double 120-mm fan. | | While the larger case and cooler make the cooling easier, | the fans are normally inaudible and the CPU stays under 45 | degrees Celsius when not doing heavy work, and the | temperature may rise to a little over 60 degrees | Celsius when 100% busy. | | From what I have seen until now, cooling will no longer be | so easy for the 7000 series, unless you choose to run them | in the Eco mode. | snvzz wrote: | Amusingly, there still seems to be about 70% of the performance | when limiting the power to 65W. | | This means the default power limits are not reasonable, and | only there to win the release day benchmarks. | ignaloidas wrote: | Worth noting that GN measured the power before the VRMs, while | the limit is applied after the VRMs. Assuming a 90% efficiency, | what GN measured would be 225.7W at the socket. Close, but | still not quite. | mastax wrote: | I accounted for VRM efficiency losses in the second sentence, | using data from a real X570 VRM. | fefe23 wrote: | I find the vpmullq part the most stunning. | | This instruction is used in some bignum code, for example if you | are implementing RSA. Yet AMD implemented it three times faster | than Intel. 
| | I'm also fascinated by AMD now making AVX512 worthwhile on | consumer devices (where it would until quite recently | artificially slow down the Intel CPUs that had it), which presumably | will lead to widespread adoption where it matters. Intel's | strategy of turning off AVX512 in the recent consumer devices | because their energy efficiency cores don't have it may turn out | to be a monumental mistake. | ComputerGuru wrote: | No one is going to be able to seriously use and support AVX512 | (or be sufficiently motivated to implement support for it in | their libraries and especially applications) until Intel | finally gets its act together with regards to AVX512 and | decides it actually wants to commit to it being a thing. | | The AVX2 rollout was (comparatively) flawless. The gains AVX512 | brings over AVX2 are, for most people (specialty libs | excluded), not worth dealing with the terrible CPU support. And | Intel just keeps making the situation worse, taking one step | forward and two back. | bayindirh wrote: | The biggest problem is not support for the instruction set in | the silicon, but the performance penalty it brings. | | SIMD hardware is the most power hungry block on Intel CPUs, | and the frequency penalty it brings is never completely | disclosed in the tech docs. Even Intel doesn't share that | information with you (as a serious customer) sometimes. | | In the HPC world, no instruction is too obscure or niche to use. | However, when you use these instructions too frequently, the | heat load it generates can slow you down instead of | accelerating you over the course of your job, so AVX512 is a | pretty mixed case on Intel CPUs. | | Regardless of this penalty, numeric code benefits from wider | SIMD pipelines in most cases. At worst, you see no speedup, | but you're investing for the future. | | On the other hand, we have seen applications which run faster | on previous generation hardware due to over-optimization. 
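The frequency-penalty tradeoff described above can be put into a toy model. All numbers here are illustrative assumptions (lane counts for 32-bit floats, and a hypothetical 10% AVX-512 downclock), not measurements of any real chip:

```python
# Toy model of the SIMD frequency-penalty tradeoff: wider vectors
# process more elements per cycle, but may cost clock speed.
# The clock figures are assumptions for illustration only.

def effective_throughput(elements_per_cycle: float, clock_ghz: float) -> float:
    """Elements processed per nanosecond for a fully SIMD-bound loop."""
    return elements_per_cycle * clock_ghz

avx2   = effective_throughput(8,  4.0)   # 8 floats/cycle at an assumed 4.0 GHz
avx512 = effective_throughput(16, 3.6)   # 16 floats/cycle with an assumed 10% downclock

# Doubling the lanes wins despite the downclock...
assert avx512 > avx2

# ...and the break-even point shows how much downclock would erase the gain:
break_even_ghz = avx2 / 16   # a 50% downclock in this toy scenario
print(f"AVX2: {avx2:.0f}/ns, AVX-512: {avx512:.0f}/ns, break-even at {break_even_ghz:.1f} GHz")
```

The model only holds for loops that are actually SIMD-bound; when a downclock triggered by sporadic AVX-512 use also slows long stretches of surrounding scalar code, the tradeoff can invert.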
| coder543 wrote: | > The biggest problem is not support for the instruction | set in the silicon, but the performance penalty it brings. | | Why is that sentence present tense instead of past tense? | Why does your entire comment make no mention of these | problems being specific to Intel? With the introduction of | Zen 4, your entire comment appears to be based on outdated | information. Zen 4 apparently implements AVX-512 | efficiently, without the problems Intel implementations | experienced. That's what this whole discussion is about, | and that's what Phoronix found as well.[0] | | [0]: https://www.phoronix.com/review/amd-zen4-avx512/6 | Tuna-Fish wrote: | > However, when you use these instructions too frequently, | the heat load it generates can slow you down | | It's not the heat load that slows you down. If you are | using them enough that you produce enough heat that you | have to downclock, it's still a win because the | instructions improved your throughput more than what you | lost in clocks. | | The problem with Intel's initial AVX-512 implementation was | that they didn't clock down because of heat, they clocked | down pre-emptively and substantially whenever the CPU | executed even a single AVX-512 instruction, even if there | was no added heat load, and stayed on the lower clocks for | a long period. This worked fine for any proper SIMD loads, but | was crushing in any situation where there was just a | handful of AVX-512 ops between long stretches, such as | using an AVX-512 optimized version of some library | function. | bayindirh wrote: | > [T]hey clocked down pre-emptively and substantially | whenever the CPU executed even a single AVX-512 | instruction... | | Because you were hitting the power envelope limits in the | CPU in these cases too. You might not see the heat, but | the CPU cannot carry the power required to keep that core | at non-AVX speeds with these power-hungry blocks operated | at full speed. 
| | As I said, to add insult to injury, Intel didn't | share the exact details of its AVX implementations and the | frequency ranges it operates in, either. | | Ah, publicly sharing your findings is/was forbidden too. | adrian_b wrote: | No, as the above poster said, Intel slows down the CPU | before any actual increase in power consumption or | temperature occurs, because they fear that their power | limit and temperature controller will not be able to | react fast enough when the power increase eventually | happens. | | Whatever control mechanism is used in the AMD Zen CPUs is | better than Intel's, so they downclock only when the | power consumption really increases and the clock | frequency recovers when the power consumption decreases, | so there is no penalty when sporadically using some | 512-bit instructions, as there is on the Intel CPUs. | jackmott42 wrote: | Imagine next gen consoles, suppose they stick with AMD. Then | every game studio and game engine studio is going to _love_ | flinging some AVX-512 around. Developers will get more | experience with it, and any game that runs on PC and console is | going to look slow on PC if you have Intel CPUs with bad | support. More libraries and tools will get created that | people will want to use. | | Adoption could accelerate quickly! | kllrnohj wrote: | Next-next gen consoles are probably still a good 5+ years | away. AVX-512 for consumers will either have already become | "a thing" or it'll be dead & buried by then. | jackmott42 wrote: | People said that about it 5 years ago too. Yet here we | are. Nobody is going to just get rid of it, servers are | already using it. | pbsd wrote: | vpmullq is not that useful; in bignum code you also want the | upper part of the product, and there is no corresponding | vpmulhq instruction to get that. 
| | On the other hand, vpmadd52luq and vpmadd52huq do give you | access to the lower and upper parts of a 52x52->104 bit | product, and those instructions perform well in the Intel | chips, 3x faster than vpmullq. | oxxoxoxooo wrote: | > This instruction is used in some bignum code | | Could you be more specific? I think for that to work one would | also need the upper half of 64x64 multiplication and `vpmullq` | provides only the lower half. You could break one 64x64 | multiplication into four 32x32 multiplications (i.e. emulate | the full 64x64 = 128 bits multiplication) but I was under the | impression that this was slow. | adrian_b wrote: | I assume that, as you say, whoever used this instruction was | using it for multiplying 32-bit numbers. | | On AMD Zen 4 and Intel Cannon Lake or newer (when AVX-512 is | supported), the fastest method to multiply big numbers is to | use the IFMA instructions, which reuse the floating-point | multipliers to generate 104-bit products of 52-bit numbers. | boundchecked wrote: | What not many people realize is that recent glibc brought AVX-512 | optimized str* and mem* functions to the ifunc dispatch table; | your C code may have been using fancy mask registers on someone's | Intel laptop! | formerly_proven wrote: | > For all practical purposes, under a suitable all-core load, the | 7950X will be running at 95C Tj.Max all the time. If you throw a | bigger cooler on it, it will boost to higher clock speeds to get | right back to Tj.Max. Because of this, the performance of the | chip is dependent on the cooling. And it is basically impossible | to hit the 230W power limit at stock without direct die or sub- | ambient. | | > If 95C sounds scary, wait till you see the voltages involved. AMD | advertises 5.7 GHz. In reality, a slightly higher value of 5.75 | GHz seems to be the norm - often across half the cores | simultaneously. So it's not just a single core load. The Fmax is | 5.85 GHz, but I have never personally seen it go above 5.75. 
| | 5.75 GHz is reached with 1.5 V Vcore. | | The +50 MHz bump over advertised boost clocks was also present in | Zen 3, likely in response to the poor reception of Zen 2 | behavior, which would usually fail to achieve the advertised | clocks. | loser777 wrote: | I'm genuinely curious about the details of how the 1.5v vCore | measurement was obtained. CPU-Z and software measurements in | general don't have the greatest reputation of being accurate, | especially with just-released generations of CPUs. Conventional | wisdom has been that with newer manufacturing processes, less voltage | is required (and tolerated), and 1.5v vCore sounds truly insane | in 2022 for a "4nm" chip. For reference, I haven't heard of 1.5v | being a safe "24/7" voltage since the days of 90nm-130nm+ CPUs | circa 2005-2006. IIRC casual overclockers in the forums weren't | really comfortable with 1.5v even with 65nm Core 2, and this was | back when it was common to, e.g., safely overclock your 2.4 GHz | Core 2 Quad to 3.4 GHz. | xani_ wrote: | Probably used the same registers as previous generations. | | Would be simple to confirm with some scope probing of CPU power. | magila wrote: | The problem is the CPU itself isn't the one measuring | voltage; it gets that information from the motherboard's VRM | controller. The accuracy of the reported value can vary | depending on the controller, how it's configured by the | motherboard's firmware, and the physical circuit design. | | That being said, with new motherboards generally using fully | digital VRM controllers, the reported value should be pretty | close in most cases. | dragontamer wrote: | Excellent Teardown by "Mysticial" from mersenneforum.org. | | Cliffnotes: | | * Zen4 AVX512 is mostly double-pumped: 256-bit native hardware | that processes two halves of the 512-bit register. | | * No throttling observed | | * 512-bit shuffle pipeline (!!). A powerful exception to the | "double-pumping" found in most other AVX512 instructions. 
| | * AMD seemingly handles the AVX512 mask registers better than | Intel. | | * Gather/Scatter slow on AMD's Zen4 implementation. | | * Intel's 512-bit native load/store unit has clear advantages | over AMD's 256-bit load-store unit when reading/writing to L1 | cache and beyond. | celrod wrote: | Looks like SIMD implementations that use LUTs should favor | small tables that fit in registers and use `vpermi2pd` as | lookups over larger tables + gather. | | With 64 bits, you still get a LUT size of 16 (shuffle indexes | into two 8xdouble vectors), which can be good enough for | functions like log and exp. | daniel-cussen wrote: | Shuffle is the SIMD's killer app. It's apparently an | interesting but expensive circuit, but it's smart to prioritize | it. Absolute best instruction, hands down. So double-pumping, | yes, isn't full speed (meaning single cycle), but it increases | compatibility with AVX512 code. I guess if a program | executed itself as a function of its runtime from CPUID it | might not, and of course there's all kinds of...but for | pedestrian purposes, meaning everything on github, it's a step. | Hey, 40% speedup on Cinebench, that's good. | dragontamer wrote: | > Shuffle is the SIMD's killer app | | A shame that AVX512 only has pshufb (aka: permute), and is | missing the GPU instruction "bpermute", aka backwards | permute. | | pshufb is effectively a "gather" instruction over an AVX | register. Equivalent to GPU permutes. | | bpermute, in GPU land, is a "scatter" instruction over a | vector register. There's no CPU / AVX equivalent of it. But I | keep coming up with good uses of the bpermute instruction | (much like pshufb is crazy flexible, its inverse, the | backwards permute, is also crazy flexible). | | -------- | | Almost any code that's finding itself "gathering" data across | a vector register will inevitably "scatter" the data back at | some point. 
| | Much like how "pext" is the "gather" instruction for 64-bits, | you need pdep to handle the equal-and-opposite case. It's | incredibly silly that AVX / AVX512 has implemented only one | half of this concept (gather / pshufb / aka Permute). | | I wish for the day that Intel/AMD implements (scatter / | backwards-pshufb / aka Backwards-Permute). | | ------- | | Fortunately, I got Vega64 and NVidia graphics cards with both | permute and bpermute instructions for high-speed shuffling of | data. But CPU-space should benefit from this concept too. | daniel-cussen wrote: | OK that's cool, didn't know about bpermute. Made sense | there should be a counterpart. Well when you only have | pshufb, it works OK, yeah there's tons of gaps but if | you're clever and...and if you compromise speed...thanks | for telling me about bpermute! | giyanani wrote: | Why do you say shuffle is "SIMD's killer app"? I've only | dabbled in vector instructions from a learning perspective, | and seen others mention it's important too, but have yet to | understand why. | demindiro wrote: | I use PSHUFB to convert 24-bit RGB to 32-bit RGBX or BGRX. | Without a shuffle instruction it'd be quite a bit harder. | MrBuddyCasino wrote: | This Rust issue [0] was the best short summary of what a | SIMD shuffle is I could find: | | "A "shuffle", in SIMD terms, takes a SIMD vector (or | possibly two vectors) and a pattern of source lane indexes | (usually as an immediate), and then produces a new SIMD | vector where the output is the source lane values in the | pattern given." | | [0] https://github.com/rust-lang/portable-simd/issues/11 | magicalhippo wrote: | It's basically several moves for the price of one. Given | that you operate on multiple values at once, being able to | shuffle or duplicate values comes up all the time. 
| | For example if you're filtering four image lines at a time | using a 1D filter kernel, you'll want to replicate the | filter coefficient to each SIMD element, so that you can | multiply each of the four pixel values with the same | coefficient. Shuffle lets you replicate a single | coefficient value into all the elements of a register in | one instruction. | daniel-cussen wrote: | Which is the point of SIMD. Several moves for the price | of one. | zX41ZdbW wrote: | Here is an overview of the usage of the shuffle instruction | to speed up decompression in ClickHouse: | https://habr.com/ru/company/yandex/blog/457612/ | Veliladon wrote: | Because you can do things using bitmasks and single | instructions instead of brute forcing using multiple | instructions. | | Let's say you have a whole heap of 8-bit numbers you want | to multiply by 2 and you have a set of 256-bit registers | and a nice SIMD multiply command. If you don't have a | shuffle you need to assemble your series of 2s for the | second operand for each lane before you can even start. | This is going to take hundreds of instructions and hundreds | of clocks. Shuffle means you load up lane 0 with the "2" | and then splat the contents of lane 0 across the other 31 | lanes in two instructions and a few clocks using the | shuffle unit. | | N.B. Shuffle isn't just about splatting. There's a whole | heap of different operations it can do that are useful. I | just picked a simple example with an obvious massive | performance increase for illustrative purposes. | Dylan16807 wrote: | I think that example is too simple to show the benefit of | shuffle. It's like explaining the benefit of an adder by | showing how you can move a value with X = Y + 0. | Especially since there's also a (much simpler) piece of | hardware dedicated to ultra-fast splat/broadcast (under | the right conditions). | stabbles wrote: | You're not talking about shuffle, you're talking about | broadcast. 
Shuffle instructions are where you take one or | two vectors, and output a third with elements from any | index of the input. So for example `out = [in[2], in[1]]` | is a shuffle of a vector of length 2. | | It's useful for example if you have, say, RGB color data | stored contiguously in memory as RGBRGBRGBRGB..., and | you want to vectorize operations on R, G and B | separately. You can load a few registers like | [RGBR][GBRG][BRGB], and then shuffle them to | [RRRR][GGGG][BBBB]. In fact it's not entirely trivial how | to shuffle optimally, it takes a few shuffles to get | there. | | More generally, if you have an array of structs, you | often need to go to struct of arrays to do vectorized | operations on the array, before returning to an array of | structs again. | | Another example is fast matrix transpose (in fact you can | think of the RGB example as a 3 by N matrix transpose to N | by 3, where N is the vector width -- AoS -> SoA is a | transpose too, in a sense). Suppose you have a matrix of | size N by N where N is the vector width, you need N lg N | shuffles to transpose the matrix. | janwas wrote: | Indeed a great article, well worth reading in full for anyone | who uses AVX-512. | | Two other things that jumped out at me: VPCONFLICT is 10x as | fast, compressstoreu is >10x slower. Those might be enough to | warrant a Zen4-specific codepath in Highway. | celrod wrote: | The Intel optimization manual has a fun example where they | use vpconflict for vectorizing sparse dot products: | https://github.com/intel/optimization- | manual/blob/main/chap1... | | I benchmarked it on Intel, and it was indeed quite fast/a | good improvement over the scalar version. Will be interesting | to try that on AMD. | sitkack wrote: | I think it is important to note that while double-pumped, using | 512-bit registers puts lower pressure on decode and enables the | pipelines to fill. So use 512-bit if you can. 
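The shuffle semantics discussed in the comments above (splat/broadcast, and de-interleaving RGB data) can be modeled in a few lines of scalar Python. This is a sketch of the semantics only; real shuffle instructions like pshufb do the whole permutation in one instruction:

```python
# Scalar model of a SIMD shuffle: out[i] = src[idx[i]].
# Models the semantics only, not the single-instruction hardware behavior.

def shuffle(src, idx):
    return [src[i] for i in idx]

# Broadcast/splat is just a shuffle with a constant index pattern:
assert shuffle([2, 0, 0, 0], [0, 0, 0, 0]) == [2, 2, 2, 2]

# De-interleaving RGBRGB... into separate R, G and B planes:
pixels = ["R0", "G0", "B0", "R1", "G1", "B1", "R2", "G2", "B2", "R3", "G3", "B3"]
reds   = shuffle(pixels, [0, 3, 6, 9])
greens = shuffle(pixels, [1, 4, 7, 10])
blues  = shuffle(pixels, [2, 5, 8, 11])
assert reds == ["R0", "R1", "R2", "R3"]
print(reds, greens, blues)
```

In real SIMD code the index pattern crosses register boundaries, which is why, as noted above, an optimal AoS-to-SoA conversion takes a few shuffles rather than one per output register.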
| celrod wrote: | Yeah, the claim was that this is why it hit higher clock | speeds. The front end will be hard pressed to hit/maintain 4 | IPC, while 2 IPC is much easier. | adrian_b wrote: | It should also be noted that believing that Zen 4 is "double- | pumped" and the Intel CPUs are not "double-pumped" is | completely misleading. | | On most Intel CPUs with AVX-512 support, there are 2 classes | of 512-bit instructions: instructions executed by combining a | pair of 256-bit units, thus having an equal throughput for | 512-bit instructions and 256-bit instructions, and a second | class of instructions, which are executed by combining a pair | of 256-bit execution units and also by extending to 512 bits | another 256-bit execution unit. | | For the second class of instructions the Intel CPUs have a | throughput of two 512-bit instructions per cycle vs. three | 256-bit instructions per cycle. | | Zen 4, while having the same total throughput as Zen 3 (two | 512-bit instructions per cycle vs. four 256-bit instructions | per cycle in Zen 3), either matches or exceeds the throughput | of the cheaper models of Intel CPUs with AVX-512. Compared to | the Intel CPUs, Zen 4 allows 1 FMA + 1 FADD per cycle, while | on the Intel CPUs only 1 FMA per cycle can be executed. | | The only important advantage of Intel appears in the most | expensive models of the server and workstation CPUs, i.e. in | most Xeon Gold, all Xeon Platinum and all of the Xeon W | models that have AVX-512 support. | | In these more expensive models, there is a second 512-bit FMA | unit, which enables double the FMA throughput compared to Zen | 4. These models with double FMA throughput are also helped by | a double throughput for loads from the L1 cache, which is | matched to the FMA throughput. 
| | So the AVX-512 implementation in Zen 4 is superior to that in | the cheaper CPUs like Tiger Lake, even without taking into | account the few new execution units added in Zen 4, like the | 512-bit shuffle unit. | | Only the Xeon Platinum and the like of the future Sapphire | Rapids will have a definitely greater throughput for the | floating-point operations than Zen 4, but they will also have | a significantly lower all-core clock frequency (due to the | inferior manufacturing process), so the higher throughput per | clock cycle is not certain to overcome the deficit in clock | frequency. | asbeb wrote: | pella wrote: | phoronix: AMD Zen 4 AVX-512 Performance Analysis On The Ryzen 9 | 7950X | | https://www.phoronix.com/review/amd-zen4-avx512 | | _" On average for the tested AVX-512 workloads, making use of | the AVX-512 instructions led to around 59% higher performance | compared to when artificially limiting the Ryzen 9 7950X to AVX2 | / no-AVX512. | | From these results I am rather impressed by the AVX-512 | performance out of the AMD Ryzen 9 7950X. While initially being | disappointed when hearing of their "double pumping" approach | rather than going for a 512-bit data path, these benchmark | results speak for themselves. For software that can effectively | make use of AVX-512 (and compiled so), there is significant | performance uplift to enjoy while no negative impact in terms of | reduced CPU clock speeds / higher power consumption (with oneDNN | being one of the only exceptions seen so far in terms of higher | power draw). | | AVX-512 is looking good on the Ryzen 7000 series and I'll | continue running more benchmarks over the weeks ahead. These | AVX-512 results make me all the more excited for AMD EPYC "Genoa" | where AVX-512 can be a lot more widely-used among HPC/server | workloads. 
"_ | phire wrote: | I wonder how much of that 59% gain comes from the 512-bit | registers/instructions themselves, and how much comes from the | new instructions and modes that come with AVX-512, and can | still be used with the narrower 256-bit and 128-bit registers. | | Would be interesting to modify some of the benchmarks to be | limited to 256-bit AVX-512 and see how they compare. | TinkersW wrote: | Mystical's report indicates much of it does come from wider | instructions, because it can saturate the core more easily. Zen 3 | was front-end bottlenecked, so on Zen 4 running AVX512 it can | more often hit 4x256. The new instructions are useful and | some help perf, but mostly only for pretty specialized stuff. | Masking is nice but I think people really exaggerate the | improvement from it; vblend was only 2 cycles. | paulmd wrote: | Haha, as someone who has been shouting "no, really, AVX-512 is | good, even if it's double-pumped, just wait for it guys" into the | void for years now, glad to see it finally hit the desktop for | real and that the AVX people are already leaning into it. | | Years and years of "nobody needs AVX-512" and "linus says it's | just for benchmarks, he worked at transmeta two decades ago, he | knows better than Lisa Su" hot takes down the tubes ;) ___________________________________________________________________ (page generated 2022-09-26 23:00 UTC)