[HN Gopher] Zen4's AVX512 Teardown
       ___________________________________________________________________
        
       Zen4's AVX512 Teardown
        
       Author : dragontamer
       Score  : 326 points
       Date   : 2022-09-26 14:17 UTC (8 hours ago)
        
 (HTM) web link (www.mersenneforum.org)
 (TXT) w3m dump (www.mersenneforum.org)
        
       | bufo wrote:
       | The BF16 and VNNI instructions are finally going to make AMD
       | competitive for neural network inference.
        
       | smat wrote:
       | Very interesting read. The author notes that double pumping the
       | 512 bit instructions to 256 bit execution units appears to be a
       | good trade-off.
       | 
        | As far as I understand, ARM's new SIMD instruction set is able to
       | map to execution units of arbitrary width. So it sounds to me
       | like ARM is ahead of x86 in flexibility here and might be able to
       | profit in the future.
       | 
       | Maybe somebody with more in-depth knowledge could respond whether
       | my understanding is correct.
        
         | adrian_b wrote:
         | With any traditional ISA with wide registers and instructions,
         | a.k.a. SIMD instructions, it is possible to implement the
         | execution units with any width desired, regardless which is the
         | architectural register and instruction width.
         | 
         | Obviously, it only makes sense for the width of the execution
         | units to be a divisor of the architectural width, otherwise
         | they would not be used efficiently.
         | 
         | Thus it is possible to choose various compromises between the
         | cost and the performance of the execution units.
         | 
         | However, if the ISA specifies e.g. 32 512-bit registers, then
         | even the cheapest implementation must include at least that
         | amount of physical registers, even if the execution units may
         | be much narrower.
         | 
         | What is new in the ARM SVE/SVE2 and which gives the name
         | "Scalable" to that vector extension, is that here the register
         | width is not fixed by the ISA, but it may be different between
         | implementations.
         | 
         | Thus a cheap smartphone CPU may have 128-bit registers, while
         | an expensive server CPU for scientific computation applications
         | might have 1024-bit registers.
         | 
         | With SVE/SVE2, it is possible to write a program without
         | knowing which will be the width of the registers on the target
         | CPU.
         | 
         | Nevertheless, the scalability feature is not perfect, thus some
         | programs may still be made faster if a certain register width
         | is assumed before compilation, which may make them run slower
         | than possible on a CPU that in fact has wider registers than
         | assumed.
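The vector-length-agnostic idea described above can be modeled in a few lines. This is a toy Python sketch, not real SVE: the `vl` parameter stands in for the hardware register width that SVE code discovers at run time instead of hard-coding.

```python
def vla_add(dst, a, b, vl):
    """Add arrays a and b into dst, vl elements per 'vector' iteration.

    vl plays the role of the hardware vector width, which SVE code
    queries at run time instead of assuming at compile time.
    """
    n = len(a)
    i = 0
    while i < n:
        # The final, partial chunk is handled by predication in real
        # SVE; here we simply shrink the slice.
        step = min(vl, n - i)
        dst[i:i+step] = [x + y for x, y in zip(a[i:i+step], b[i:i+step])]
        i += step
    return dst

# The same loop body works unchanged for a "smartphone" width of 2
# lanes or a "server" width of 16 lanes:
a = list(range(10))
b = [10] * 10
print(vla_add([0] * 10, a, b, vl=2))
print(vla_add([0] * 10, a, b, vl=16))
```

The point is that the binary never encodes the width; only the per-iteration step changes between implementations.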
        
         | bee_rider wrote:
         | ARM's SVE is definitely interesting, but I do wonder if it is
          | slowly honing in on Cray-style vector processing. Which is
         | definitely a cool idea, but a little different from the now-
         | popular fixed-width SIMD. I don't know that it makes sense to
         | call one ahead of the other yet -- ARM's documentation is clear
         | that SVE2 doesn't replace NEON. "Mostly scalar but let's
         | sprinkle in some SIMD" coding will probably always be with us
         | (until ML somehow turns all programs into dot products I
         | guess!)
         | 
         | RISC-V also has a variable length vector extension.
        
           | brigade wrote:
           | There's not really any reason for modern general-purpose CPUs
           | to specialize for IPC lower than 1 like what Cray did. CPUs
           | need wide frontends to execute existing scalar code as fast
           | as we're used to, and if you're not reusing most of that
           | width for vectors then the design is just wasting power.
        
       | dis-sys wrote:
       | AMD EPYC Genoa will be a killing machine, with almost 1TBytes/sec
       | memory bandwidth and this avx512 extension...
       | 
       | good luck for intel's xeon.
        
         | xani_ wrote:
         | Genuinely good luck, it's never good when there is no
         | competition
        
         | sliken wrote:
         | I'm hearing 12 channels @ 5200 MT/sec or so. Sounds like
         | 500GB/sec, not 1TB/sec. Oh maybe you meant in a dual socket
         | config?
        
         | sekh60 wrote:
         | I've updated most of my home lab to AMD EPYC Rome processors.
         | Really can't beat the core counts for private cloud and the
         | price is amazing compared to Intel. Looking forward to Genoa
         | myself, though moving past Rome will be a ways away for my lab.
        
           | xani_ wrote:
            | Sounds like a hell of a lab! What are you doing on it,
            | machine learning?
        
       | causi wrote:
       | Exciting stuff. AVX512 isn't just for specialized work projects.
       | It's also a huge performance boost for game console emulation.
        
         | mmastrac wrote:
         | When I was doing some work on Dolphin's JIT, AVX
         | implementations were always back of mind. It's a massive
         | tradeoff in so many cases but having access to these is
         | amazing.
        
         | stagger87 wrote:
         | That sounds like a specialized project :)
        
           | marginalia_nu wrote:
           | Any project becomes specialized if you work long enough on
           | it.
        
         | dtech wrote:
         | Interesting! Any reason why they're specifically good for
         | emulation?
        
           | xani_ wrote:
           | https://whatcookie.github.io/posts/why-is-avx-512-useful-
           | for...
        
           | causi wrote:
           | I don't have a deep understanding of the implementation, but
           | it gets you a 30% performance boost in Playstation 3
           | emulation.
           | 
           | https://www.tomshardware.com/news/ps3-emulator-
           | avx-512-30-pe...
        
       | mastax wrote:
       | > And it is basically impossible to hit the 230W power limit at
       | stock without direct die or sub-ambient.
       | 
       | Almost, but not quite. In GamersNexus' review they recorded
       | 250.8W measured at the EPS12V cables, while using an Arctic
       | Cooling Liquid Freezer II 360mm AIO with the fans at 100%. At
       | 230W/1.5V=153A a good VRM will generate about 17W of heat. That
       | leaves you a few watts for board power plane and socket resistive
       | losses (I don't have an estimate for that).
       | 
       | Not a very practical cooling solution for a day-to-day
       | workstation, but I do wonder if you could reduce the fan speeds a
       | bit while still maxing out the power limit.
        
         | philjohn wrote:
         | Then again, I have 2 420mm Black Ice Nemesis radiators in my
         | custom loop - even at relatively low speeds it can keep the
         | 5800X in there and 3080 Ti cool under constant high loads.
        
           | pclmulqdq wrote:
           | My mini-ITX work desktop has no problems with a 5900x and a
           | Radeon VII pro running Rocm work, using only a tiny heatsink
           | on the 5900x (and some high-airflow fans, but nothing too
           | incredibly loud). It doesn't thermal throttle, but tops out
           | around 80-90 degrees C.
           | 
           | The 7000-series seems to be a different story: you really
           | need a big cooler for those chips.
        
             | adrian_b wrote:
             | I have the same CPU + GPU combination, but used on an ATX
             | MB with a Noctua cooler with a double 120-mm fan.
             | 
              | While the larger case and cooler make the cooling easier,
              | the fans are normally inaudible and the CPU stays under 45
              | degrees Celsius when not doing heavy work, and the
              | temperature may rise to a little over 60 degrees Celsius
              | when 100% busy.
             | 
             | From what I have seen until now, cooling will no longer be
             | so easy for the 7000 series, unless you choose to run them
             | in the Eco mode.
        
         | snvzz wrote:
          | Amusingly, the chip still seems to retain about 70% of its
          | performance when the power is limited to 65W.
         | 
         | This means the default power limits are not reasonable, and
         | only there to win the release day benchmarks.
        
         | ignaloidas wrote:
         | Worth noting that GN measured the power before the VRMs, while
         | the limit is applied after the VRMs. Assuming a 90% efficiency,
         | what GN measured would be 225.7W at the socket. Close, but
         | still not quite.
        
           | mastax wrote:
           | I accounted for VRM efficiency losses in the second sentence,
           | using data from a real X570 VRM.
        
       | fefe23 wrote:
       | I find the vpmullq part the most stunning.
       | 
       | This instruction is used in some bignum code, for example if you
       | are implementing RSA. Yet AMD implemented it three times faster
       | than Intel.
       | 
       | I'm also fascinated by AMD now making AVX512 worthwhile on
       | consumer devices (where they would until quite recently
       | artificially slow down Intel CPUs that had it), which presumably
        | will lead to widespread adoption where it matters. Intel's
        | strategy of turning off AVX512 in its recent consumer devices
       | because their energy efficiency cores don't have it may turn out
       | to be a monumental mistake.
        
         | ComputerGuru wrote:
         | No one is going to be able to seriously use and support AVX512
         | (or be sufficiently motivated to implement support for it in
         | their libraries and especially applications) until Intel
         | finally gets its act together with regards to AVX512 and
         | decides it actually wants to commit to it being a thing.
         | 
         | The AVX2 rollout was (comparatively) flawless. The gains AVX512
          | brings over AVX2 are, for most people (specialty libs
          | excluded), not worth dealing with the terrible CPU support. And
         | Intel just keeps making the situation worse, taking one step
         | forward and two back.
        
           | bayindirh wrote:
           | The biggest problem is not support for the instruction set in
           | the silicon, but the performance penalty it brings.
           | 
           | SIMD hardware is the most power hungry block on Intel CPUs,
           | and the frequency penalty it brings is never completely
           | disclosed in the tech docs. Even Intel doesn't share that
           | information with you (as a serious customer) sometimes.
           | 
           | In HPC world, no instruction is too obscure or niche to use.
           | However, when you use these instructions too frequently, the
           | heat load it generates can slow you down instead of
           | accelerating you over the course of your job, so AVX512 is a
           | pretty mixed case in Intel CPUs.
           | 
           | Regardless of this penalty, numeric code benefits from wider
           | SIMD pipelines in most cases. At worst, you see no speedup,
           | but you're investing for the future.
           | 
           | On the other hand, we have seen applications which run faster
           | on previous generation hardware due to over-optimization.
        
             | coder543 wrote:
             | > The biggest problem is not support for the instruction
             | set in the silicon, but the performance penalty it brings.
             | 
             | Why is that sentence present tense instead of past tense?
             | Why does your entire comment make no mention of these
             | problems being specific to Intel? With the introduction of
             | Zen 4, your entire comment appears to be based on outdated
             | information. Zen 4 apparently implements AVX-512
             | efficiently, without the problems Intel implementations
             | experienced. That's what this whole discussion is about,
             | and that's what Phoronix found as well.[0]
             | 
             | [0]: https://www.phoronix.com/review/amd-zen4-avx512/6
        
             | Tuna-Fish wrote:
             | > However, when you use these instructions too frequently,
             | the heat load it generates can slow you down
             | 
             | It's not the heat load that slows you down. If you are
             | using them enough that you produce enough heat that you
             | have to downclock, it's still a win because the
             | instructions improved your throughput more than what you
             | lost in clocks.
             | 
             | The problem with Intel's initial AVX-512 implementation was
             | that they didn't clock down because of heat, they clocked
             | down pre-emptively and substantially whenever the CPU
             | executed even a single AVX-512 instruction, even if there
             | was no added heat load, and stayed on the lower clocks for
              | a long period. This worked fine for proper SIMD loads, but
             | was crushing in any situation where there was just a
             | handful of AVX-512 ops between long stretches, such as
             | using an AVX-512 optimized version of some library
             | function.
        
               | bayindirh wrote:
               | > [T]hey clocked down pre-emptively and substantially
               | whenever the CPU executed even a single AVX-512
               | instruction...
               | 
               | Because you were hitting the power envelope limits in the
               | CPU in these cases too. You might not see the heat, but
               | the CPU cannot carry the power required to keep that core
               | at non-AVX speeds with these power-hungry blocks operated
               | at full speed.
               | 
                | As I said, to add insult to injury, Intel didn't share
                | the exact details of its AVX implementations or the
                | frequency ranges it operates in, either.
               | 
               | Ah, publicly sharing your findings is/was forbidden too.
        
               | adrian_b wrote:
               | No, as the above poster said, Intel slows down the CPU
               | before any actual increase in power consumption or
                | temperature occurs, because they fear that their power
               | limit and temperature controller will not be able to
               | react fast enough when the power increase eventually
               | happens.
               | 
               | Whatever control mechanism is used in the AMD Zen CPUs is
               | better than Intel's, so they downclock only when the
               | power consumption really increases and the clock
               | frequency recovers when the power consumption decreases,
               | so there is no penalty when using sporadically some
               | 512-bit instructions, like in the Intel CPUs.
        
           | jackmott42 wrote:
           | Imagine next gen consoles, suppose they stick with AMD. Then
           | every game studio and game engine studio is going to _love_
           | flinging some AVX-512 around. Developers will get more
           | experience with it, any game that runs on PC and Console is
           | going to look slow on PC if you have intel cpus with bad
           | support. More libraries and tools will get created that
           | people will want to use.
           | 
            | Adoption could accelerate quickly!
        
             | kllrnohj wrote:
             | Next-next gen consoles are probably still a good 5+ years
             | away. AVX-512 for consumers will either have already become
             | "a thing" or it'll be dead & buried by then.
        
               | jackmott42 wrote:
                | People said that about it 5 years ago too. Yet here we
               | are. Nobody is going to just get rid of it, servers are
               | already using it.
        
         | pbsd wrote:
         | vpmullq is not that useful; in bignum code you also want the
         | upper part of the product, and there is no corresponding
         | vpmulhq instruction to get that.
         | 
         | On the other hand, vpmadd52luq and vpmadd52huq do give you
         | access to the lower and upper parts of a 52x52->104 bit
         | product, and those instructions perform well in the Intel
         | chips, 3x faster than vpmullq.
        
         | oxxoxoxooo wrote:
         | > This instruction is used in some bignum code
         | 
         | Could you be more specific? I think for that to work one would
         | also need the upper half of 64x64 multiplication and `vpmullq`
         | provides only the lower half. You could break one 64x64
         | multiplication into four 32x32 multiplications (i.e. emulate
         | the full 64x64 = 128 bits multiplication) but I was under the
         | impression that this was slow.
        
           | adrian_b wrote:
           | I assume that as you say, whoever used this instruction was
           | using it for multiplying 32-bit numbers.
           | 
           | On AMD Zen 4 and Intel Cannon Lake or newer (when AVX-512 is
           | supported), the fastest method to multiply big numbers is to
           | use the IFMA instructions, which reuse the floating-point
           | multipliers to generate 104-bit products of 52-bit numbers.
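A scalar model of the IFMA semantics may help here. The sketch below (my own Python stand-in, not the real instructions) mimics what vpmadd52luq and vpmadd52huq do per lane: form the 104-bit product of two 52-bit inputs and accumulate its low or high 52 bits into a 64-bit accumulator.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1

def madd52lo(acc, a, b):
    """Model of vpmadd52luq: acc += low 52 bits of (a52 * b52)."""
    return (acc + ((a & MASK52) * (b & MASK52) & MASK52)) & MASK64

def madd52hi(acc, a, b):
    """Model of vpmadd52huq: acc += high 52 bits of (a52 * b52)."""
    return (acc + ((a & MASK52) * (b & MASK52) >> 52)) & MASK64

# Sanity check: low + (high << 52) reconstructs the full product when
# the accumulators start at zero.
a, b = (1 << 51) + 12345, (1 << 50) + 67890
assert madd52lo(0, a, b) + (madd52hi(0, a, b) << 52) == a * b
```

Because both halves of the product are directly available, bignum limbs can be kept at 52 bits and carries deferred, which is what makes IFMA attractive for big-number multiplication.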
        
       | boundchecked wrote:
        | Not many people realize that recent glibc brought AVX-512
        | optimized str* and mem* functions to the ifunc dispatch table;
        | your C code may have been using fancy mask registers on someone's
        | Intel laptop!
        
       | formerly_proven wrote:
       | > For all practical purposes, under a suitable all-core load, the
       | 7950X will be running at 95C Tj.Max all the time. If you throw a
       | bigger cooler on it, it will boost to higher clock speeds to get
       | right back to Tj.Max. Because of this, the performance of the
       | chip is dependent on the cooling. And it is basically impossible
       | to hit the 230W power limit at stock without direct die or sub-
       | ambient.
       | 
       | > If 95C sounds scary, wait to you see the voltages involved. AMD
       | advertises 5.7 GHz. In reality, a slightly higher value of 5.75
       | GHz seems to be the norm - often across half the cores
       | simultaneously. So it's not just a single core load. The Fmax is
       | 5.85 GHz, but I have never personally seen it go above 5.75.
       | 
       | 5.75 GHz is reached with 1.5 V Vcore.
       | 
       | The +50 MHz bump over advertised boost clocks was also present in
       | Zen 3, likely in response to the poor reception of Zen 2
       | behavior, which would usually fail to achieve the advertised
       | clocks.
        
       | loser777 wrote:
        | I'm genuinely curious about the details of how the 1.5v vCore
       | measurement was obtained. CPU-Z and software measurements in
       | general don't have the greatest reputation of being accurate,
       | especially with just-released generations of CPUs. Conventional
       | wisdom has been with newer manufacturing processes, less voltage
       | is required (and tolerated), and 1.5v vCore sounds truly insane
       | in 2022 for a "4nm" chip. For reference, I haven't heard of 1.5v
       | being a safe "24/7" voltage since the days of 90nm-130nm+ CPUs
       | circa 2005-2006. IIRC casual overclockers in the forums weren't
       | really comfortable with 1.5v even with 65nm Core 2, and this was
       | back when it was common to e.g., safely overclock your 2.4 GHz
       | Core 2 Quad to 3.4 GHz.
        
         | xani_ wrote:
          | Probably used the same registers as previous generations.
         | 
         | Would be simple to confirm with some scope probing CPU power.
        
           | magila wrote:
           | The problem is the CPU itself isn't the one measuring
           | voltage, it gets that information from the motherboard's VRM
           | controller. The accuracy of the reported value can vary
           | depending on the controller, how it's configured by the
           | motherboard's firmware, and the physical circuit design.
           | 
           | That being said, with new motherboards generally using fully
           | digital VRM controllers the reported value should be pretty
           | close in most cases.
        
       | dragontamer wrote:
       | Excellent Teardown by "Mysticial" from mersenneforum.org.
       | 
       | Cliffnotes:
       | 
        | * Zen4 AVX512 is mostly double-pumped: native 256-bit hardware
        | processes the two halves of each 512-bit register.
       | 
       | * No throttling observed
       | 
       | * 512-bit shuffle pipeline (!!). A powerful exception to the
       | "double-pumping" found in most other AVX512 instructions.
       | 
       | * AMD seemingly handles the AVX512 mask registers better than
       | Intel.
       | 
       | * Gather/Scatter slow on AMD's Zen4 implementation.
       | 
       | * Intel's 512-bit native load/store unit has clear advantages
       | over AMD's 256-bit load-store unit when reading/writing to L1
       | cache and beyond.
        
         | celrod wrote:
         | Looks like SIMD implementations that use LUTs should favor
          | small tables that fit in registers and use `vpermi2pd` for
          | lookups over larger tables + gather.
         | 
         | With 64 bits, you still get a LUT size of 16 (shuffle indexes
         | into two 8xdouble vectors), which can be good enough for
         | functions like log and exp.
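The two-vector lookup can be modeled in scalar code. The sketch below (a Python stand-in with hypothetical names, not the real intrinsic) captures the vpermi2pd idea: a 16-entry table held in two 8-element "registers", where bit 3 of each lane's index selects which of the two table vectors supplies the element.

```python
def permi2(table_a, table_b, idx):
    """Model of a two-source permute: each index picks from table_a
    (indices 0..7) or table_b (indices 8..15)."""
    out = []
    for i in idx:
        i &= 0xF                     # 4-bit index into the 16-entry LUT
        src = table_b if i & 0x8 else table_a
        out.append(src[i & 0x7])
    return out

table_a = [float(i) for i in range(8)]        # LUT entries 0..7
table_b = [float(i) for i in range(8, 16)]    # LUT entries 8..15
print(permi2(table_a, table_b, [0, 9, 15, 3]))
```

On real hardware this is one in-register instruction per vector of indices, versus a gather's trip through the memory pipeline, which is the trade-off the comment describes.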
        
         | daniel-cussen wrote:
          | Shuffle is SIMD's killer app. It's apparently an
         | interesting but expensive circuit, but it's smart to prioritize
         | it. Absolute best instruction, hands down. So double-pumping
         | yes isn't full speed meaning single cycle, but that increases
         | the compatibility with AVX512 code. I guess if a program
         | executed itself as a function of its runtime from CPUID it
         | might not, and of course there's all kinds of...but for
         | pedestrian purposes, meaning everything on github, it's a step.
         | Hey 40% speedup on Cinebench, that's buen.
        
           | dragontamer wrote:
           | > Shuffle is the SIMD's killer app
           | 
           | A shame that AVX512 only has pshufb (aka: permute), and is
           | missing the GPU-instruction "bpermute", aka backwards
           | permute.
           | 
            | pshufb is effectively a "gather" instruction over an AVX
           | register. Equivalent to GPU permutes.
           | 
           | bpermute, in GPU land, is a "scatter" instruction over a
           | vector register. There's no CPU / AVX equivalent of it. But I
           | keep coming up with good uses of the bpermute instruction
           | (much like pshufb is crazy flexible, its inverse, the
           | backwards permute, is also crazy flexible).
           | 
           | --------
           | 
           | Almost any code that's finding itself "gathering" data across
           | a vector register, will inevitably "scatter" the data back at
           | some point.
           | 
           | Much like how "pext" is the "gather" instruction for 64-bits,
            | you need pdep to handle the equal-and-opposite case. It's
           | incredibly silly that AVX / AVX512 has implemented only one-
           | half of this concept (gather / pshufb / aka Permute).
           | 
           | I wish for the day that Intel/AMD implements (scatter /
           | backwards-pshufb / aka Backwards-Permute).
           | 
           | -------
           | 
           | Fortunately, I got Vega64 and NVidia Graphics Cards with both
           | permute and bpermute instructions for high-speed shuffling of
           | data. But CPU-space should benefit from this concept too.
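The pext/pdep duality mentioned above is easy to see in scalar models. These Python sketches reproduce the bit-level semantics of the two instructions: pext "gathers" the bits selected by a mask into the low bits, pdep "scatters" low bits back out to the masked positions, and one undoes the other.

```python
def pext(value, mask):
    """Gather the bits of value selected by mask into the low bits."""
    out, out_bit, bit = 0, 0, 0
    while mask >> bit:
        if (mask >> bit) & 1:
            out |= ((value >> bit) & 1) << out_bit
            out_bit += 1
        bit += 1
    return out

def pdep(value, mask):
    """Scatter the low bits of value into the positions selected by mask."""
    out, in_bit, bit = 0, 0, 0
    while mask >> bit:
        if (mask >> bit) & 1:
            out |= ((value >> in_bit) & 1) << bit
            in_bit += 1
        bit += 1
    return out

mask = 0b1010_1100
x = 0b1000_0100
# Gather then scatter with the same mask round-trips the masked bits:
assert pdep(pext(x, mask), mask) == x & mask
```

The comment's wish is for the register-lane analogue of this pair: pshufb plays the role of pext across lanes, and a "bpermute" would play the role of pdep.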
        
             | daniel-cussen wrote:
             | OK that's cool, didn't know about bpermute. Made sense
             | there should be a counterpart. Well when you only have
             | pshufb, it works OK, yeah there's tons of gaps but if
             | you're clever and...and if you compromise speed...thanks
             | for telling me about bpermute!
        
           | giyanani wrote:
           | Why do you say shuffle is "SIMD's killer app"? I've only
           | dabbled in vector instructions from a learning perspective,
           | and seen others mention it's important too, but have yet to
           | understand why.
        
             | demindiro wrote:
             | I use PSHUFB to convert 24-bit RGB to 32-bit RGBX or BGRX.
             | Without a shuffle instruction it'd be quite a bit harder.
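The RGB-to-RGBX conversion above can be modeled in scalar code. This Python sketch mimics PSHUFB-style byte shuffling (using -1 as the zeroing sentinel, where the real instruction uses a set high bit): each output byte is picked from the source bytes by an index.

```python
def byte_shuffle(src, indices):
    """Model of a pshufb-style shuffle: out[i] = src[indices[i]],
    or 0 for a sentinel index of -1."""
    return bytes(0 if i < 0 else src[i] for i in indices)

# Four RGB pixels packed into 12 bytes -> four RGBX pixels in 16 bytes.
rgb = bytes([1, 2, 3,  4, 5, 6,  7, 8, 9,  10, 11, 12])
shuffle_mask = [0, 1, 2, -1,  3, 4, 5, -1,  6, 7, 8, -1,  9, 10, 11, -1]
rgbx = byte_shuffle(rgb, shuffle_mask)
print(list(rgbx))
```

On hardware, the whole expansion is one shuffle per register of pixels; without it, each byte would need separate shift/mask/or steps.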
        
             | MrBuddyCasino wrote:
             | This Rust issue [0] was the best short summary of what an
             | SIMD Shuffle is I could find:
             | 
              | "A 'shuffle', in SIMD terms, takes a SIMD vector (or
              | possibly two vectors) and a pattern of source lane indexes
              | (usually as an immediate), and then produces a new SIMD
              | vector where the output is the source lane values in the
              | pattern given."
             | 
             | [0] https://github.com/rust-lang/portable-simd/issues/11
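That definition is one line of Python, if it helps to see it as code (a scalar model, not an intrinsic):

```python
def shuffle(src, pattern):
    """Each output lane is the source lane named by the pattern."""
    return [src[i] for i in pattern]

print(shuffle(['a', 'b', 'c', 'd'], [3, 3, 0, 1]))
```

Note that lanes can be repeated or dropped, which is what makes the operation more general than a simple reorder.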
        
             | magicalhippo wrote:
             | It's basically several moves for the price of one. Given
             | that you operate on multiple values at once, being able to
             | shuffle or duplicate values comes up all the time.
             | 
             | For example if you're filtering four image lines at a time
             | using a 1D filter kernel, you'll want to replicate the
             | filter coefficient to each SIMD element, so that you can
             | multiply each of the four pixel values with the same
             | coefficient. Shuffle lets you replicate a single
             | coefficient value into all the elements of a register in
             | one instruction.
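The coefficient-replication pattern described above looks like this as a scalar model (Python stand-ins with hypothetical names; on real hardware the splat is a single shuffle/broadcast and the multiply-add is one SIMD instruction):

```python
def splat(value, lanes):
    """Broadcast one scalar into every lane (one shuffle on hardware)."""
    return [value] * lanes

def fma_lanes(acc, a, b):
    """Lane-wise fused multiply-add: acc + a * b per lane."""
    return [x + y * z for x, y, z in zip(acc, a, b)]

# One pixel from each of 4 image lines, times one filter coefficient:
pixels = [10.0, 20.0, 30.0, 40.0]
coeff = splat(0.25, 4)      # same kernel tap replicated to all lanes
print(fma_lanes([0.0] * 4, pixels, coeff))
```

The filter loop then repeats this per kernel tap, accumulating into the same lanes.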
        
               | daniel-cussen wrote:
               | Which is the point of SIMD. Several moves for the price
               | of one.
        
             | zX41ZdbW wrote:
             | Here is an overview of the usage of the shuffle instruction
             | to speed up decompression in ClickHouse:
             | https://habr.com/ru/company/yandex/blog/457612/
        
             | Veliladon wrote:
             | Because you can do things using bitmasks and single
             | instructions instead of brute forcing using multiple
             | instructions.
             | 
             | Let's say you have a whole heap of 8-bit numbers you want
             | to multiply by 2 and you have a set of 256-bit registers
             | and a nice SIMD multiply command. If you don't have a
             | shuffle you need to assemble your series of 2s for the
             | second operand for each lane before you can even start.
             | This is going to take hundreds of instructions and hundreds
             | of clocks. Shuffle means you load up lane 0 with the "2"
             | and then splat the contents of lane 0 across the other 31
             | lanes in two instructions and a few clocks using the
             | shuffle unit.
             | 
             | N.B. Shuffle isn't just about splatting. There's a whole
             | heap of different operations it can do that are useful. I
             | just picked a simple example with an obvious massive
             | performance increase for illustrative purposes.
        
               | Dylan16807 wrote:
               | I think that example is too simple to show the benefit of
               | shuffle. It's like explaining the benefit of an adder by
               | showing how you can move a value with X = Y + 0.
               | Especially since there's also a (much simpler) piece of
               | hardware dedicated to ultra-fast splat/broadcast (under
               | the right conditions).
        
               | stabbles wrote:
               | You're not talking about shuffle, you're talking about
                | broadcast. A shuffle instruction is where you take one or
               | two vectors, and output a third with elements from any
               | index of the input. So for example `out = [in[2], in[1]]`
               | is a shuffle of a vector of length 2.
               | 
                | It's useful, for example, if you have RGB color data
                | stored contiguously in memory as RGBRGBRGBRGB..., and
               | you want to vectorize operations on R, B and G
               | separately. You can load a few registers like
               | [RGBR][GBRG][BRGB], and then shuffle them to
               | [RRRR][BBBB][GGGG]. In fact it's not entirely trivial how
               | to shuffle optimally, it takes a few shuffles to get
               | there.
               | 
               | More generally, if you have an array of structs, you
               | often need to go to struct of arrays to do vectorized
               | operations on the array, before returning to an array of
               | struct again.
               | 
               | Another example is fast matrix transpose (in fact you can
                | think of the RGB example as a 3 by N matrix transpose to N
               | by 3, where N is the vector width -- AoS -> SoA is a
               | transpose too, in a sense). Suppose you have a matrix of
               | size N by N where N is the vector width, you need N lg N
               | shuffles to transpose the matrix.
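The RGB deinterleave described above can be written out as explicit two-source shuffles. This is a scalar Python model over 4-lane "registers" (the index patterns are one possible choice, not taken from any particular implementation): indices 0..3 pick from the first source, 4..7 from the second.

```python
def shuffle2(a, b, pattern):
    """Two-source shuffle: indices 0..3 pick from a, 4..7 pick from b."""
    src = a + b
    return [src[i] for i in pattern]

# Three registers loaded straight from interleaved RGBRGB... memory:
r0 = ['R0', 'G0', 'B0', 'R1']   # [RGBR]
r1 = ['G1', 'B1', 'R2', 'G2']   # [GBRG]
r2 = ['B2', 'R3', 'G3', 'B3']   # [BRGB]

# Two shuffles per output register gather each channel's four values:
reds   = shuffle2(shuffle2(r0, r1, [0, 3, 6, 6]), r2, [0, 1, 2, 5])
greens = shuffle2(shuffle2(r0, r1, [1, 4, 7, 7]), r2, [0, 1, 2, 6])
blues  = shuffle2(shuffle2(r0, r2, [2, 4, 4, 7]), r1, [0, 5, 2, 3])
print(reds, greens, blues)
```

As the comment says, getting from [RGBR][GBRG][BRGB] to [RRRR][GGGG][BBBB] takes a few shuffles, and finding the minimal sequence is not entirely trivial.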
        
         | janwas wrote:
         | Indeed a great article, well worth reading in full for anyone
         | who uses AVX-512.
         | 
         | Two other things that jumped out at me: VPCONFLICT is 10x as
         | fast, compressstoreu is >10x slower. Those might be enough to
         | warrant a Zen4-specific codepath in Highway.
        
           | celrod wrote:
           | The Intel optimization manual has a fun example where they
           | use vpconflict for vectorizing sparse dot products:
           | https://github.com/intel/optimization-
           | manual/blob/main/chap1...
           | 
           | I benchmarked it on Intel, and it was indeed quite fast/a
           | good improvement over the scalar version. Will be interesting
           | to try that on AMD.
        
         | sitkack wrote:
         | I think it is important to note that while double-pumped, using
         | 512-bit registers puts lower pressure on decode and enables the
         | pipelines to fill. So use 512-bit if you can.
        
           | celrod wrote:
           | Yeah, the claim was that this is why it hit higher clock
           | speeds. The front end will be hard pressed to hit/maintian 4
           | IPC, while 2 IPC is much easier.
        
           | adrian_b wrote:
           | It should also be noted that believing that Zen 4 is "double-
           | pumped" and the Intel CPUs are not "double-pumped" is
           | completely misleading.
           | 
            | On most Intel CPUs with AVX-512 support, there are 2
            | classes of 512-bit instructions: those executed by
            | combining a pair of 256-bit units, thus having equal
            | throughput for 512-bit and 256-bit instructions, and those
            | executed both by combining a pair of 256-bit units and by
            | extending another 256-bit unit to 512 bits.
            | 
            | For the second class, the Intel CPUs have a throughput of
            | two 512-bit instructions per cycle vs. three 256-bit
            | instructions per cycle.
           | 
            | Zen 4 keeps the same total throughput as Zen 3 (two
            | 512-bit instructions per cycle where Zen 3 did four
            | 256-bit instructions per cycle), and with that it either
            | matches or exceeds the throughput of the cheaper Intel
            | CPUs with AVX-512. Moreover, Zen 4 allows 1 FMA + 1 FADD
            | per cycle, while those Intel CPUs can execute only 1 FMA
            | per cycle.
           | 
           | The only important advantage of Intel appears in the most
           | expensive models of the server and workstation CPUs, i.e. in
           | most Xeon Gold, all Xeon Platinum and all of the Xeon W
           | models that have AVX-512 support.
           | 
           | In these more expensive models, there is a second 512-bit FMA
           | unit, which enables a double FMA throughput compared to Zen
           | 4. These models with double FMA throughput are also helped by
           | a double throughput for the loads from the L1 cache, which is
           | matched to the FMA throughput.
           | 
           | So the AVX-512 implementation in Zen 4 is superior to that in
           | the cheaper CPUs like Tiger Lake, even without taking into
           | account the few new execution units added in Zen 4, like the
           | 512-bit shuffle unit.
           | 
           | Only the Xeon Platinum and the like of the future Sapphire
           | Rapids will have a definitely greater throughput for the
           | floating-point operations than Zen 4, but they will also have
            | a significantly lower all-core clock frequency (due to the
           | inferior manufacturing process), so the higher throughput per
           | clock cycle is not certain to overcome the deficit in clock
           | frequency.
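To put the issue widths stated above into FLOPs-per-cycle terms, a back-of-the-envelope sketch (FP32, counting an FMA as two FLOPs and an FADD as one; it deliberately ignores the clock-speed differences the comment mentions):

```python
# 512-bit vector = 16 fp32 lanes
LANES = 512 // 32

# Peak fp32 FLOPs/cycle implied by the stated per-cycle issue widths:
zen4       = LANES * 2 + LANES * 1  # 1x512b FMA + 1x512b FADD
intel_1fma = LANES * 2              # cheaper Intel (e.g. Tiger Lake): 1x512b FMA
intel_2fma = 2 * LANES * 2          # dual-FMA Xeon models: 2x512b FMA

print(zen4, intel_1fma, intel_2fma)  # -> 48 32 64
```

So per clock, Zen 4's 48 sits between the single-FMA Intel parts at 32 and the dual-FMA Xeons at 64, which is why the clock-frequency deficit of Sapphire Rapids matters for the final comparison.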
        
       | pella wrote:
       | phoronix:AMD Zen 4 AVX-512 Performance Analysis On The Ryzen 9
       | 7950X
       | 
       | https://www.phoronix.com/review/amd-zen4-avx512
       | 
       |  _" On average for the tested AVX-512 workloads, making use of
       | the AVX-512 instructions led to around 59% higher performance
       | compared to when artificially limiting the Ryzen 9 7950X to AVX2
       | / no-AVX512.
       | 
       | From these results I am rather impressed by the AVX-512
       | performance out of the AMD Ryzen 9 7950X. While initially being
       | disappointed when hearing of their "double pumping" approach
       | rather than going for a 512-bit data path, these benchmark
       | results speak for themselves. For software that can effectively
       | make use of AVX-512 (and compiled so), there is significant
       | performance uplift to enjoy while no negative impact in terms of
       | reduced CPU clock speeds / higher power consumption (with oneDNN
       | being one of the only exceptions seen so far in terms of higher
       | power draw).
       | 
       | AVX-512 is looking good on the Ryzen 7000 series and I'll
       | continue running more benchmarks over the weeks ahead. These
       | AVX-512 results make me all the more excited for AMD EPYC "Genoa"
       | where AVX-512 can be a lot more widely-used among HPC/server
       | workloads. "_
        
         | phire wrote:
         | I wonder how much of that 59% gain comes from the 512bit
         | registers/instructions themselves, and how much comes from the
         | new instructions and modes that come with AVX-512, and can
         | still be used with the narrower 256bit and 128bit registers.
         | 
         | Would be interesting to modify some of the benchmarks to be
         | limited to 256bit AVX-512 and see how they compare.
        
           | TinkersW wrote:
            | Mystical's report indicates much of it does come from
            | wider instructions, because they saturate the core more
            | easily. Zen 3 was front-end bottlenecked, so Zen 4 running
            | AVX-512 can more often hit 4x256. The new instructions are
            | useful and
           | some help perf, but mostly only for pretty specialized stuff.
           | Masking is nice but I think people really exaggerate the
           | improvement from it, vblend was only 2 cycles.
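To make the masking point concrete, a toy Python model of the two idioms (lane values and writemask are made up): on AVX2 a predicated add is a full-width add plus a separate vblendvps, while AVX-512 merge-masking folds the blend into the add itself, saving the extra blend op:

```python
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
mask = [True, False, True, False]  # writemask k

# AVX2 style: full-width add, then a separate blend to merge (2 ops)
full = [x + y for x, y in zip(a, b)]
blended = [f if m else x for f, m, x in zip(full, mask, a)]

# AVX-512 style: one masked add; inactive lanes keep the destination
# value (merge-masking, as in _mm512_mask_add_ps(a, k, a, b))
masked = [x + y if m else x for x, y, m in zip(a, b, mask)]

assert blended == masked == [11.0, 2.0, 33.0, 4.0]
```

The results are identical; the difference is purely instruction count and latency, which is why the gain over a 2-cycle vblend is modest rather than dramatic.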
        
       | paulmd wrote:
       | Haha, as someone who has been shouting "no, really, AVX-512 is
       | good, even if it's double-pumped, just wait for it guys" into the
       | void for years now, glad to see it finally hit the desktop for
       | real and that the AVX people are already leaning into it.
       | 
       | Years and years of "nobody needs AVX-512" and "linus says it's
       | just for benchmarks, he worked at transmeta two decades ago, he
       | knows better than Lisa Su" hot takes down the tubes ;)
        
       ___________________________________________________________________
       (page generated 2022-09-26 23:00 UTC)