[HN Gopher] Converting integers to decimal strings faster with A...
       ___________________________________________________________________
        
       Converting integers to decimal strings faster with AVX-512
        
       Author : ibobev
       Score  : 70 points
       Date   : 2022-03-29 16:58 UTC (6 hours ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
       | Aardwolf wrote:
        | And meanwhile, after 9 years of it existing, Intel still
        | isn't reliably adding AVX-512 to flagship desktop/gaming
        | CPUs: first because of 5+ years of reusing the same old
        | architecture and process, and now because they disabled it
        | even though it was there, for reasons I don't know. Why was
        | it not a problem for Intel to add MMX, several SSE
        | iterations, AVX, and AVX2, but now 9 years of nothing, even
        | though parallel computing, which is annoying to program for
        | and otherwise depends on GPUs, is only gaining in
        | importance?
        
         | Findecanor wrote:
         | AVX-512 still isn't _one_ instruction set either. It is several
         | that are overlapping, with different feature sets.
         | 
         | I have seen several accounts of them being inefficient (drawing
         | much power, large chip area, dragging down the clock of the
         | entire chip). They have also been criticised for making context
         | switches longer.
         | 
         | There are some good ideas in there though. What I think Intel
         | should do is to redesign it as a single modern instruction set
         | for 128 and 256-bit vectors (or for scalable vector lengths
         | such as where ARM and RISC-V are going). Then express older
         | SSE-AVX2 instructions in the new instructions' microcode, with
          | the goal of replacing SSE-AVX2 in the long term. Otherwise,
          | without a modern, feature-complete instruction set, I think
          | Intel will fall behind.
        
           | aseipp wrote:
           | You can already use the AVX-512 instructions with vectors
           | smaller than 512 bits; that, in fact, is one of the
           | extensions to the base microarchitecture, and is present in
           | every desktop/server CPU that has it available (but not the
           | original Xeon Phis, IIRC.) This is called the "Vector Length"
           | extension.
           | 
           | Edit: I noted you said "redesign" which is an important
           | point. I don't disagree in principle, I just point this out
           | to be clear you can use the new ISA with smaller vectors
            | already, to anyone who doesn't know this. No need to
            | throw the baby out with the bathwater. Maybe their
            | "redesign" could just be ratifying this as a separate
            | ISA feature...
           | 
           | > drawing much power, large chip area, dragging down the
           | clock of the entire chip
           | 
           | The chip area thing has mostly been overblown IMO, the vector
           | unit itself is rather large but would be anyway (you can't
           | get away from this if you want fat vectors), and the register
           | file is shared anyway so it's "only" 2x bigger relative to a
           | 256b one, but still pales in comparison to the L2 or L3. The
           | power considerations are significantly better at least as of
           | Ice Lake/Tiger Lake, having pretty minimal impact on clock
            | speeds/power licensing, but it's worth noting these have
           | only 1 FMA unit in consumer SKUs and are traditionally low
           | core count designs. Seeing AVX-512 performance/power profiles
           | on Sapphire Rapids or Ice Lake Xeon (dual FMAs, for one)
           | would be more interesting but again it wouldn't match desktop
           | SKUs in any case.
           | 
           | It's also worth noting AVX2 has similar problems e.g. on
           | Haswell machines where it first appeared, and would similarly
           | cause downclocking due to thermal constraints; but at the
           | time the blogosphere wasn't as advanced and efficient at
           | regurgitating wives' tales with zero context or wider
           | understanding, so you don't see it talked about as much.
           | 
           | In any case I suspect Intel is probably thinking about moving
           | in the direction of variable-width vector sizes. But they
           | probably think about a lot of things. It's unclear how this
           | will all pan out, though I do admit AVX-512 overall counts as
           | a bit of a "fumbled bag" for them, so to speak. I just want
           | the instructions! No bigger vectors needed.
        
             | moonchild wrote:
             | > "only" 2x bigger relative to a 256b one
             | 
             | 4x bigger; avx512 has twice as many registers.
        
               | gpderetta wrote:
               | The number of microarchitectural registers is only
               | loosely related to the number of architectural registers.
        
         | needusername wrote:
         | > now because I don't know why they disabled it when it was
         | there
         | 
          | E-cores (Gracemont) don't support AVX-512; only P-cores
          | (Golden Cove) do.
        
           | Aardwolf wrote:
           | Sure, but they deliberately released a BIOS version that
           | prevented people from enabling it when disabling E-cores
        
         | matja wrote:
         | _which_ AVX-512? :) -
         | https://twitter.com/InstLatX64/status/1114141314441011200/ph...
        
         | PragmaticPulp wrote:
          | > Why was it not a problem for Intel to add MMX, several
          | SSE iterations, AVX, and AVX2, but now 9 years of nothing,
          | even though parallel computing, which is annoying to
          | program for and otherwise depends on GPUs, is only gaining
          | in importance?
         | 
         | AVX512 is significantly more intensive than previous generation
         | SIMD instruction sets.
         | 
         | It takes up a lot of die space and it uses a massive amount of
         | power.
         | 
         | It can use so much power and is so complex that the clock speed
         | of the CPU is reduced slightly when AVX512 instructions are
         | running. This led to an exaggerated outrage from
         | enthusiast/gaming users who didn't want their clock speed
         | reduced temporarily just because some program somewhere on
         | their machine executed some AVX512 instructions. To this day
         | you can still find articles about how to disable AVX512 for
         | this reason.
         | 
         | AVX512 is also incompatible with having a heterogeneous
         | architecture with smaller efficiency cores. The AVX512 part of
         | the die is massive and power-hungry, so it's not really an
         | option for making efficiency cores.
         | 
         | I think Intel is making the right choice to move away from
         | AVX512 for consumer use cases. Very few, if any, consumer-
         | target applications would even benefit from AVX512
         | optimizations. Very few companies would actually want to do it,
         | given that the optimizations wouldn't run on AMD CPUs or many
         | Intel CPUs.
         | 
         | It's best left as a specific optimization technique for very
         | specific use cases where you know you're controlling the server
         | hardware.
        
           | Aardwolf wrote:
           | MMX was 64-bit wide, SSE 128-bit wide and AVX 256-bit wide,
           | and they succeeded each other rapidly (when compared to the
           | situation now). AVX512 is another doubling, so what's the
           | hold up with this one compared to the previous two doublings?
           | 
           | Maybe it's time for AMD to do something about this, like when
           | they took the initiative to create x86-64
        
             | arcticbull wrote:
              | As the third doubling (64 to 128 to 256 to 512 bits),
              | that's an 8x increase over MMX's base width.
             | 
             | 512-bit registers, ALUs, data paths, etc, are all really
             | really physically big.
        
             | monocasa wrote:
             | My sense is that it's because it happened about when
             | Intel's litho progress began stalling out. It was designed
             | for a world the uarch folks were expecting with smaller
             | transistors than they ended up getting.
        
             | mcronce wrote:
             | Zen 4 is reported to have AVX-512 support. I'm not sure if
             | that'll be included on Ryzen or not, though; only Epyc is
             | "confirmed" as far as I know.
        
           | adrian_b wrote:
            | Die space and power consumption have very little to do
            | with AVX-512 itself.
           | 
           | AVX-512 is a better instruction set and many tasks can be
           | done with fewer instructions than when using AVX or SSE,
           | resulting in a lower energy consumption, even at the same
           | data width.
           | 
           | The increase in size and power consumption is due almost
           | entirely to the fact that AVX-512 has both twice the number
           | of registers and double-width registers in comparison with
           | AVX. Moreover the current implementations have a
           | corresponding widening of the execution units and datapaths.
           | 
            | If SSE or AVX had been widened for higher performance,
            | they would have had the same increases in size and power,
            | but they would have remained less efficient instruction
            | sets.
           | 
            | Even in the worst AVX-512 implementation, in Skylake
            | Server, doing any computation in AVX-512 mode greatly
            | reduces the energy consumption.
           | 
           | The problem with AVX-512 in Skylake Server and derived CPUs,
           | e.g. Cascade Lake, is that those Intel CPUs have worse
           | methods of limiting the power consumption than the
            | contemporaneous AMD Zen. Whatever method Intel used, it
            | reacted too slowly to consumption peaks. Because of that,
            | the Intel CPUs had to reduce the clock frequency in
            | advance whenever they anticipated that a large power draw
            | might follow, e.g. upon seeing a sequence of AVX-512
            | instructions that suggested more were coming.
           | 
           | While this does not matter for programs that do long
           | computations with AVX-512, when the clock frequency really
           | needs to go down, it handicaps the programs that execute only
           | a few AVX-512 instructions, but enough to trigger the
           | decrease in clock frequency, which slows down the non-AVX-512
           | instructions that follow.
           | 
           | This was a serious problem for all Intel CPUs derived from
           | Skylake, where you must take care to not use AVX-512
           | instructions unless you intend to use many of them.
           | 
           | However it was not really a problem of AVX-512 but of Intel's
           | methods for power and die temperature control. Those can be
           | improved and Intel did improve them in later CPUs.
           | 
            | AVX-512 is not the only thing that caused such
            | undesirable behavior. Even on much older Intel CPUs, the
            | same kind of problem appears when you want maximum
            | single-thread performance but some random background
            | process starts on a previously idle core. Even if that
            | background process consumes negligible power, the CPU
            | anticipates that it might start to consume a lot and
            | drastically reduces the maximum turbo frequency compared
            | with the single-active-core case, slowing down the
            | program you actually care about.
           | 
            | This is exactly the same kind of problem, and it is
            | visible especially on Windows, which has a huge number of
            | enabled system services that may start executing
            | unexpectedly, even when you believe the computer should
            | be idle. Nevertheless, people got used to this behavior,
            | especially because there was little they could do about
            | it, so it was much less discussed than the AVX-512
            | slowdown.
        
           | zdw wrote:
           | The power/clockspeed hit for AVX512 is less on the most
           | recent Intel CPUs, per previous discussion:
           | https://news.ycombinator.com/item?id=24215022
        
         | cl0ckt0wer wrote:
         | It looks like they're shifting to efficiency as a target. In
         | their latest 12th gen desktop CPUs, you could re-enable avx512
         | if you disabled the "efficiency" cores. Then they released a
         | BIOS update with that feature removed. Has there been an
         | instruction set that failed? Itanium maybe?
        
         | ceeplusplus wrote:
          | It costs a lot of area for something that doesn't move the
          | needle for the majority of desktop applications (browsers,
         | games, office apps). The AVX512 units on Skylake-SP for example
         | are a substantial portion of the area of each core. At some
         | point you have to consider how many cores you could fit in the
         | area used for vector extensions and make a trade-off.
        
           | aseipp wrote:
           | This argument IMO is rather overblown. A quick die shot of
           | Skylake-X indicates to me that even if you completely nerfed
           | every AVX-512 unit appropriately (including the register
           | files) on say, an 8-core or 10-core processor, you're going
           | to get what? 1 or 2 more cores at maximum?[1] People make it
           | sound like 70% of every Skylake-X die just sits there and you
           | could have 50x more cores, when it's not that simple. You can
           | delete tons of features from a modern microprocessor to save
           | space but that isn't always the best way forward.
           | 
            | In any case, I think this whole argument really misses
            | another important point: the ISA and the vector width are
            | separate concerns here. If Intel would just give us
            | AVX-512 with 256b or 128b vectors, it would lead to a
            | very big increase in use IMO, without massive register
            | files and vector data paths complicating things. The
            | instruction set improvements are good enough to justify
            | this
           | IMO. (Alder Lake makes it a more complex case since they'd
           | need to do a lot of work to unify Gracemont/Golden Cove
           | before they could even think about that, but I still think it
            | would be great.) They could even just halve the width of
            | the vector load/store units relative to the vector size
            | like AMD did with Zen 2 (e.g. 256b vecs are split into
            | 2x128b ops.) I'll still take it.
           | 
            | [1] https://twitter.com/GPUsAreMagic/status/1256866465577394181/...
        
             | gpderetta wrote:
             | Exactly. Even just one AVX512 ALU instead of the typical
             | two would be an option (which in fact Intel has used on
             | some *lake variants).
        
           | Aardwolf wrote:
            | > It costs a lot of area for something that doesn't move
            | the needle for the majority of desktop applications
           | 
           | But that could be a bit of a chicken and egg situation,
           | right?
        
         | jeffbee wrote:
         | I don't know where you came up with 9 years. The very first CPU
         | that had these features came out in May 2018. "Tiger Lake" has
         | these features and it is/was Intel's mainstream laptop CPU for
          | the past year or so. Alder Lake, their current generation,
         | lacks these features but I think it's understandable because
         | they had to add AVX, AVX2, BMI, BMI2, FMA, VAES, and a bunch of
         | other junk to the Tremont design in order to make a big/little
         | CPU that works seamlessly. Whether you think they should
         | instead have made a heterogeneous design that is harder to use
         | is another question.
        
           | gnabgib wrote:
           | Intel proposed AVX512 in 2013[0], with first appearance on
           | Xeon Phi x200 announced in 2013, launched in 2016[1], and
            | then on Skylake-X, released in 2017[2].
           | 
            | [0]: https://en.wikipedia.org/wiki/AVX-512
            | [1]: https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing
            | [2]: https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Hi...
        
             | adrian_b wrote:
             | Actually AVX-512 is considerably older than AVX.
             | 
             | Initially AVX-512 was known as "Larrabee New Instructions".
             | 
              | This instruction set, which included essential features
              | missing from both earlier and later Intel ISAs, e.g.
              | mask registers and scatter-gather instructions, was
              | developed a few years before 2009, by a team into which
              | many people had been brought from outside Intel.
             | 
             | The "Larrabee New Instructions" have been disclosed
             | publicly in 2009, then the first hardware implementation
             | available outside Intel, "Knights Ferry" was released in
             | May 2010. Due to poor performance against GPUs, it was
             | available only in development systems.
             | 
             | A year later, in 2011, Sandy Bridge was launched, the first
             | Intel product with AVX. Even if AVX had significant
             | improvements over SSE, it was seriously crippled in
             | comparison with the older AVX-512 a.k.a. "Larrabee New
             | Instructions".
             | 
              | It would have been much better for Intel's customers if
              | Sandy Bridge had implemented a 256-bit version of
              | AVX-512 instead of implementing AVX. However, Intel has
             | always attempted to implement as few improvements as
             | possible in each CPU generation, in order to minimize their
             | production costs and maximize their profits. This worked
             | very well for them as long as they did not have serious
             | competition.
             | 
             | The next implementation of AVX-512 (using the name "Intel
             | Many Integrated Cores Instructions"), was in Knights
             | Corner, the first Xeon Phi, launched in Q4 2012. This
             | version made some changes in the encoding of the
             | instructions and it also removed some instructions intended
             | for GPU applications.
             | 
              | The next implementation, which again changed the
              | instruction encoding to the one used today and which
              | adopted the name AVX-512, was in Knights Landing,
              | launched in Q2 2016.
             | 
             | With the launch of Skylake Server, in Q3 2017, AVX-512
             | appeared for the first time in mainstream Intel CPUs, but
             | after removing some sets of instructions previously
             | available on Xeon Phi.
             | 
             | AVX-512 is a much more pleasant ISA than AVX, e.g. by using
             | the mask registers it is much easier to program loops when
             | the length and alignment of data is arbitrary.
             | Unfortunately the support for it is unpredictable, so it is
             | usually not worthwhile to optimize for it.
             | 
             | Hopefully the rumors that Zen 4 supports AVX-512 are true,
             | so its launch might be the real start of widespread use of
             | AVX-512.
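
The mask-register point above can be sketched in plain Python (an
illustrative model, not real intrinsics; VLEN, masked_add, and
add_arrays are invented names): with AVX-512 opmasks, the tail of a
loop over an arbitrary-length array is just another masked iteration
instead of a separate scalar cleanup loop.

```python
VLEN = 8  # model one register as 8 lanes of 64-bit values

def masked_add(dst, src, mask):
    # Model of a masked vector add: lanes whose mask bit is 0 are
    # left untouched, which is exactly what an opmask does in hardware.
    return [d + s if (mask >> i) & 1 else d
            for i, (d, s) in enumerate(zip(dst, src))]

def add_arrays(a, b):
    out = []
    for off in range(0, len(a), VLEN):
        n = min(VLEN, len(a) - off)   # valid lanes in this chunk
        mask = (1 << n) - 1           # e.g. n=3 -> 0b00000111
        pad_a = (a[off:off + n] + [0] * VLEN)[:VLEN]  # pad to VLEN
        pad_b = (b[off:off + n] + [0] * VLEN)[:VLEN]
        out += masked_add(pad_a, pad_b, mask)[:n]
    return out
```

Every iteration runs the same masked operation; only the mask value
changes on the final, partial chunk, which is what removes the
length/alignment special cases from the loop body.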
        
             | wtallis wrote:
             | Also consumer Skylake started shipping in 2015 with some
             | non-functional silicon area reserved for AVX-512 register
             | files.
        
               | jeffbee wrote:
               | Skylake did not have IFMA or VBMI. The first
               | microarchitecture with both of those was Cannon Lake, Q2
               | 2018, which practically did not exist in the market, and
               | the first mainstream CPU with both of these was Ice Lake,
               | Q3 2019.
        
       | thebiss wrote:
       | If you're curious what gets generated, Godbolt maps the C code to
       | ASM, and provides explanations of the instructions on hover. Link
       | below processes the source using CLANG, providing a cleaner
       | result than GCC.
       | 
       | [0] https://godbolt.org/z/78KodqxaP
        
       | nitwit005 wrote:
       | > The code is a bit technical, but remarkably, it does not
       | require a table.
       | 
       | Those constants being used look a lot like a table.
        
         | stncls wrote:
         | Especially given the fact that the table in the baseline code
         | is only 200 bytes. That's less than four AVX-512 registers!
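
The 200 bytes mentioned here are the classic two-digits-at-a-time
lookup: 100 precomputed two-character pairs. A rough Python model of
such a baseline (illustrative only; the article's actual baseline is
C, and to_decimal_15 is an invented name):

```python
# 100 two-character strings: 2 * 100 = 200 bytes of table in C.
TWO_DIGITS = [f"{i:02d}" for i in range(100)]

def to_decimal_15(x):
    """Format x < 10**15 as a fixed 15-digit string, emitting two
    digits per table hit, halving the divisions vs. digit-at-a-time."""
    out = []
    for _ in range(7):              # 7 pairs = the 14 low digits
        out.append(TWO_DIGITS[x % 100])
        x //= 100
    out.append(str(x % 10))         # the 15th, most significant digit
    return "".join(reversed(out))
```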
        
         | adrian_b wrote:
         | I assume that he referred to the fact that the code does not
         | need an array in memory, with constant values that must be
         | accessed with indexed loads. Using such a table in memory can
         | be relatively slow, as proven by the benchmarks for this
         | method.
         | 
         | There are constants, but they are used directly as arguments
         | for the AVX-512 intrinsics. They must still be loaded into the
         | AVX-512 registers from memory, but they are loaded from
         | locations already known at compile-time and the loads can be
         | scheduled optimally by the compiler, because they are not
         | dependent on any other instructions.
         | 
         | For a table stored in an array in memory, the loads can be done
         | only after the indices are computed and the loads may need to
         | be done in a random order, not in the sequential order of the
         | memory locations. When the latencies of the loads cannot be
            | hidden, they can slow any algorithm considerably.
        
           | nitwit005 wrote:
           | The code itself sits in memory, which means the "table" is
           | still in memory.
        
           | orlp wrote:
           | If your program is sufficiently large and diverse the
           | constants embedded in the instructions are no better than a
           | table load/lookup. The jump to the conversion routine is
           | unpredictable, and the routine will not be in the instruction
           | cache, causing the CPU to stall while waiting for the
           | instructions to load from RAM.
           | 
           | This will never show up in a microbenchmark where the
           | function's instructions are always hot. In fact, a lot of
           | microbenchmarking software "warms up" the code by calling it
           | a bunch of times before starting to time it, to maximize the
           | chances of them being able to ignore this reality.
        
             | dzaima wrote:
             | The AVX-512 constants aren't embedded in the instructions,
             | but the address to load being static means the CPU can
             | start to cache them when the decoder gets to them, possibly
             | even before it knows that the input to the function is.
             | 
             | In contrast, for the scalar code, the CPU must complete the
             | divisions (which'll become multiplications, but even those
             | still have relatively high latency) before it can even
             | begin to look for what to cache.
        
         | dzaima wrote:
         | The list of arguments to _mm512_setr_epi64 are just reciprocals
         | of powers of 10, multiplied by 2^52. The scalar code uses
         | division by powers of 10, which'd compile down to similar
         | multiplication constants; you just don't see them because the
         | compiler does that for the scalar code, but you have to do so
         | manually for AVX-512.
         | 
         | And permb_const is a list of indices for the result characters
         | within the vector register - the algorithm works on 64-bit
         | integers, but the result must be a list of bytes, so it picks
         | out every 8th one.
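
Those constants can be demonstrated with a scalar Python model of the
fixed-point trick (a sketch, not the article's C code; digits8 is an
invented name): multiplying by roughly 2**52 / 10**i and shifting
right by 52 computes floor(x / 10**i) exactly for x < 10**8.

```python
# Fixed-point reciprocals of powers of 10, scaled by 2**52 and
# rounded up: the same kind of constants passed to _mm512_setr_epi64.
RECIP = [2**52 // 10**i + 1 for i in range(8)]

def digits8(x):
    """The 8 decimal digits of x (x < 10**8), most significant first.
    The rounding error is below x / 2**52 < 2.3e-8, smaller than the
    1e-7 minimum gap between a multiple of 10**-i and the next
    integer, so the shifted product floors correctly."""
    assert 0 <= x < 10**8
    return [((x * RECIP[i]) >> 52) % 10 for i in range(7, -1, -1)]
```

The vector version computes all eight positions at once, one per
64-bit lane, and the permb step then keeps one byte per lane (every
8th byte) to turn the lane results into a contiguous string.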
        
       | jeffbee wrote:
       | I'd be curious about use cases where this does or does not make
       | sense. On the one hand, you saved a few nanoseconds on the
       | integer-to-string encoder. But on the other hand you're committed
       | to storing and transferring up to 15 leading zeros, and there
       | must be some cost on the decoder to consuming the leading zeros.
       | So this clearly makes sense on a write-once-read-never
       | application but there must also be a point at which the full
       | lifecycle cost crosses over and this approach is worse.
        
         | aqrit wrote:
         | related to IEEE 754 double-precision floating-point round-
         | trips?
        
         | dzaima wrote:
         | The article compares padded length 15 output for both cases;
         | removing leading zeroes would have cost for both AVX-512 and
         | regular code.
        
           | jeffbee wrote:
           | Yeah, that's the "this approach" to which I refer though.
           | This is a micro-optimization of an approach that I'm not sure
           | has many beneficial applications.
        
             | dzaima wrote:
             | Ah. The article, as I see it, is primarily about AVX-512 vs
             | scalar code, not the 15 digits though. The fixed-length
             | restriction is purely to simplify the challenge.
             | 
             | To remove leading zeroes, you'd need to use one of
             | bsf/lzcnt/pcmpistri, and a masked store, which has some
             | cost, but still stays branchless and will probably easily
             | be compensated by the smaller cache/storage/network usage &
             | shorter decoding.
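
That trim can be modeled in Python (illustrative; a compare-against-
'0' bitmask plus lowest-set-bit stands in for the vector compare and
bsf/tzcnt, and the slice stands in for the masked store;
trim_leading_zeros is an invented name):

```python
def trim_leading_zeros(buf15):
    """buf15: a fixed-width 15-digit string. Returns it with leading
    zeros removed, keeping a single '0' for the value zero."""
    # Bit i is set when position i holds a nonzero digit, like a
    # vector compare writing into a mask register.
    mask = sum((c != "0") << i for i, c in enumerate(buf15))
    # Lowest set bit = first significant digit (bsf/tzcnt analogue);
    # fall back to the last position when the value is zero.
    first = (mask & -mask).bit_length() - 1 if mask else len(buf15) - 1
    return buf15[first:]
```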
        
       ___________________________________________________________________
       (page generated 2022-03-29 23:00 UTC)