[HN Gopher] Converting integers to decimal strings faster with A... ___________________________________________________________________ Converting integers to decimal strings faster with AVX-512 Author : ibobev Score : 70 points Date : 2022-03-29 16:58 UTC (6 hours ago) (HTM) web link (lemire.me) (TXT) w3m dump (lemire.me) | Aardwolf wrote: | And meanwhile, after 9 years of AVX-512 existing, Intel is still | not reliably adding it to flagship desktop/gaming CPUs: first | because of 5+ years of using the same old architecture and | process, and now because they disabled it when it was there, and | I don't know why. Why was it not a problem for Intel to add MMX, | several SSE iterations, AVX, AVX2, but now 9 years of nothing, | even though parallel computing, which is annoying to program for | and to depend on GPUs for, is only gaining in importance? | Findecanor wrote: | AVX-512 still isn't _one_ instruction set either. It is several | that are overlapping, with different feature sets. | | I have seen several accounts of them being inefficient (drawing | much power, taking large chip area, dragging down the clock of | the entire chip). They have also been criticised for making | context switches longer. | | There are some good ideas in there though. What I think Intel | should do is redesign it as a single modern instruction set | for 128 and 256-bit vectors (or for scalable vector lengths | such as where ARM and RISC-V are going), then express older | SSE-AVX2 instructions in the new instructions' microcode, with | the goal of replacing SSE-AVX2 in the long term. Otherwise, if | Intel doesn't have an instruction set that is modern and | feature-complete, I think Intel would be falling behind. | aseipp wrote: | You can already use the AVX-512 instructions with vectors | smaller than 512 bits; that, in fact, is one of the | extensions to the base microarchitecture, and is present in | every desktop/server CPU that has it available (but not the | original Xeon Phis, IIRC.)
This is called the "Vector Length" | extension. | | Edit: I noticed you said "redesign", which is an important | point. I don't disagree in principle; I just point this out | to be clear that you can use the new ISA with smaller vectors | already, for anyone who doesn't know this. No need to throw | out the baby with the bathwater. Maybe their "redesign" could | just be ratifying this as a separate ISA feature... | | > drawing much power, large chip area, dragging down the | clock of the entire chip | | The chip area thing has mostly been overblown IMO. The vector | unit itself is rather large but would be anyway (you can't | get away from this if you want fat vectors), and the register | file is shared anyway, so it's "only" 2x bigger relative to a | 256b one, but still pales in comparison to the L2 or L3. The | power considerations are significantly better at least as of | Ice Lake/Tiger Lake, having pretty minimal impact on clock | speeds/power licensing, but it's worth noting these have | only 1 FMA unit in consumer SKUs and are traditionally low | core count designs. Seeing AVX-512 performance/power profiles | on Sapphire Rapids or Ice Lake Xeon (dual FMAs, for one) | would be more interesting, but again it wouldn't match desktop | SKUs in any case. | | It's also worth noting AVX2 had similar problems, e.g. on | Haswell machines where it first appeared, and would similarly | cause downclocking due to thermal constraints; but at the | time the blogosphere wasn't as advanced and efficient at | regurgitating old wives' tales with zero context or wider | understanding, so you don't see it talked about as much. | | In any case I suspect Intel is probably thinking about moving | in the direction of variable-width vector sizes. But they | probably think about a lot of things. It's unclear how this | will all pan out, though I do admit AVX-512 overall counts as | a bit of a "fumbled bag" for them, so to speak. I just want | the instructions! No bigger vectors needed.
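One reason commenters in this thread want the AVX-512 instruction set even at 128/256-bit widths is per-lane masking, which lets a loop over an arbitrary-length array avoid a separate scalar tail. Below is a rough scalar emulation of the idea in plain C (`add_one_masked` is a hypothetical illustrative function, not code from the article; real AVX-512 would express the mask as a `__mmask8`/`__mmask16` and the body as a single masked operation such as `_mm512_mask_add_epi32`):

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar emulation of an AVX-512-style masked loop over 8-lane blocks.
   The mask (one bit per lane) is computed from the remaining element
   count, so the final partial block is handled by the same code path
   as full blocks -- no scalar tail loop needed. Inactive lanes are
   never touched, mirroring how masked loads/stores suppress faults on
   out-of-bounds lanes. */
static void add_one_masked(int32_t *a, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        size_t rem = n - i;
        uint8_t mask = rem >= 8 ? 0xFF : (uint8_t)((1u << rem) - 1);
        for (int lane = 0; lane < 8; lane++)
            if (mask & (1u << lane))
                a[i + lane] += 1;   /* only active lanes are updated */
    }
}
```

With SSE/AVX2 the tail would instead need either a scalar cleanup loop or careful overlapped loads, which is the ergonomic gap the masked form closes.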
| moonchild wrote: | > "only" 2x bigger relative to a 256b one | | 4x bigger; avx512 has twice as many registers. | gpderetta wrote: | The number of microarchitectural registers is only | loosely related to the number of architectural registers. | needusername wrote: | > now because I don't know why they disabled it when it was | there | | E-cores (Gracemont) don't support AVX-512; only P-cores (Golden | Cove) support AVX-512. | Aardwolf wrote: | Sure, but they deliberately released a BIOS version that | prevented people from enabling it when disabling E-cores. | matja wrote: | _which_ AVX-512? :) - | https://twitter.com/InstLatX64/status/1114141314441011200/ph... | PragmaticPulp wrote: | > Why was it not a problem for Intel to add MMX, several SSE | iterations, AVX, AVX2, but now 9 years of nothing, even though | parallel computing, which is annoying to program for and to | depend on GPUs for, is only gaining in importance? | | AVX512 is significantly more intensive than previous-generation | SIMD instruction sets. | | It takes up a lot of die space and it uses a massive amount of | power. | | It can use so much power and is so complex that the clock speed | of the CPU is reduced slightly when AVX512 instructions are | running. This led to exaggerated outrage from | enthusiast/gaming users who didn't want their clock speed | reduced temporarily just because some program somewhere on | their machine executed some AVX512 instructions. To this day | you can still find articles about how to disable AVX512 for | this reason. | | AVX512 is also incompatible with having a heterogeneous | architecture with smaller efficiency cores. The AVX512 part of | the die is massive and power-hungry, so it's not really an | option for efficiency cores. | | I think Intel is making the right choice to move away from | AVX512 for consumer use cases. Very few, if any, consumer- | targeted applications would even benefit from AVX512 | optimizations.
Very few companies would actually want to do it, | given that the optimizations wouldn't run on AMD CPUs or many | Intel CPUs. | | It's best left as a specific optimization technique for very | specific use cases where you know you're controlling the server | hardware. | Aardwolf wrote: | MMX was 64 bits wide, SSE 128 bits wide and AVX 256 bits wide, | and they succeeded each other rapidly (compared to the | situation now). AVX512 is another doubling, so what's the | holdup with this one compared to the previous two doublings? | | Maybe it's time for AMD to do something about this, like when | they took the initiative to create x86-64. | arcticbull wrote: | As the 4th doubling, that's a 16X increase over the base. | | 512-bit registers, ALUs, data paths, etc., are all really, | really physically big. | monocasa wrote: | My sense is that it's because it happened about when | Intel's litho progress began stalling out. It was designed | for a world the uarch folks were expecting, with smaller | transistors than they ended up getting. | mcronce wrote: | Zen 4 is reported to have AVX-512 support. I'm not sure if | that'll be included on Ryzen or not, though; only Epyc is | "confirmed" as far as I know. | adrian_b wrote: | Die area and power consumption have very little to do with | AVX-512 itself. | | AVX-512 is a better instruction set, and many tasks can be | done with fewer instructions than when using AVX or SSE, | resulting in lower energy consumption, even at the same | data width. | | The increase in size and power consumption is due almost | entirely to the fact that AVX-512 has both twice the number | of registers and double-width registers in comparison with | AVX. Moreover, the current implementations have a | corresponding widening of the execution units and datapaths.
| | If SSE or AVX had been widened for higher performance, | they would have had the same increases in size and power, but | they would have remained less efficient instruction sets. | | Even in the worst AVX-512 implementation, in Skylake Server, | doing any computation in AVX-512 mode reduces energy | consumption considerably. | | The problem with AVX-512 in Skylake Server and derived CPUs, | e.g. Cascade Lake, is that those Intel CPUs have worse | methods of limiting power consumption than the | contemporaneous AMD Zen. Whatever method was used by Intel, | it reacted too slowly during consumption peaks. Because of | that, the Intel CPUs had to reduce the clock frequency in | advance whenever they feared that too large a power | consumption could happen in the future, e.g. when they see a | sequence of AVX-512 instructions and fear that more will | follow. | | While this does not matter for programs that do long | computations with AVX-512, where the clock frequency really | needs to go down, it handicaps programs that execute only | a few AVX-512 instructions, but enough to trigger the | decrease in clock frequency, which slows down the non-AVX-512 | instructions that follow. | | This was a serious problem for all Intel CPUs derived from | Skylake, where you must take care not to use AVX-512 | instructions unless you intend to use many of them. | | However, it was not really a problem of AVX-512 but of Intel's | methods for power and die temperature control. Those can be | improved, and Intel did improve them in later CPUs. | | AVX-512 is not the only thing that caused such undesirable | behaviors. Even in much older Intel CPUs, the same kind of | problem appears when you want maximum | single-thread performance, but some random background process | starts on another previously idle core.
Even if that | background process consumes negligible power, the CPU is | afraid that it might start to consume a lot, so it drastically | reduces the maximum turbo frequency compared with the | case when a single core was active, causing the program that | interests you to slow down. | | This is exactly the same kind of problem, and it is visible | especially on Windows, which has a huge number of enabled | system services that may start to execute unexpectedly, even | when you believe that the computer should be idle. | Nevertheless, people got used to this behavior, | especially because there was little they could do about it, | so it was much less discussed than the AVX-512 slowdown. | zdw wrote: | The power/clockspeed hit for AVX512 is less on the most | recent Intel CPUs, per previous discussion: | https://news.ycombinator.com/item?id=24215022 | cl0ckt0wer wrote: | It looks like they're shifting to efficiency as a target. In | their latest 12th-gen desktop CPUs, you could re-enable avx512 | if you disabled the "efficiency" cores. Then they released a | BIOS update with that feature removed. Has there been an | instruction set that failed? Itanium maybe? | ceeplusplus wrote: | It costs a lot of area for something that doesn't move the | needle for the majority of desktop applications (browsers, | games, office apps). The AVX512 units on Skylake-SP, for | example, are a substantial portion of the area of each core. | At some point you have to consider how many cores you could | fit in the area used for vector extensions and make a | trade-off. | aseipp wrote: | This argument IMO is rather overblown. A quick die shot of | Skylake-X indicates to me that even if you completely nerfed | every AVX-512 unit appropriately (including the register | files) on, say, an 8-core or 10-core processor, you're going | to get what?
1 or 2 more cores at maximum?[1] People make it | sound like 70% of every Skylake-X die just sits there and you | could have 50x more cores, when it's not that simple. You can | delete tons of features from a modern microprocessor to save | space, but that isn't always the best way forward. | | In any case, I think this whole argument really misses | another important thing, which is that the ISA and the vector | width are separate things here. If Intel would just give | us AVX-512 with 256b or 128b vectors, it would lead to a very | big increase in use IMO, without massive | register files and vector data paths complicating things. The | instruction set improvements are good enough to justify this | IMO. (Alder Lake makes it a more complex case since they'd | need to do a lot of work to unify Gracemont/Golden Cove | before they could even think about that, but I still think it | would be great.) They could even just halve the width of the | vector load/store units relative to the vector size like AMD | did with Zen 2 (e.g. 256b vecs are split into 2x128b ops.) | I'll still take it. | | [1] https://twitter.com/GPUsAreMagic/status/1256866465577394181/... | gpderetta wrote: | Exactly. Even just one AVX512 ALU instead of the typical | two would be an option (which in fact Intel has used on | some *lake variants). | Aardwolf wrote: | > It costs a lot of area for something that doesn't move | the needle for the majority of desktop applications | | But that could be a bit of a chicken-and-egg situation, | right? | jeffbee wrote: | I don't know where you came up with 9 years. The very first CPU | that had these features came out in May 2018. "Tiger Lake" has | these features and it is/was Intel's mainstream laptop CPU for | the past year or so.
Alder Lake, their current generation, | lacks these features, but I think it's understandable because | they had to add AVX, AVX2, BMI, BMI2, FMA, VAES, and a bunch of | other junk to the Tremont design in order to make a big/little | CPU that works seamlessly. Whether you think they should | instead have made a heterogeneous design that is harder to use | is another question. | gnabgib wrote: | Intel proposed AVX512 in 2013[0], with its first appearance on | the Xeon Phi x200, announced in 2013 and launched in 2016[1], | and then on Skylake-X, released in 2017[2]. | | [0]: https://en.wikipedia.org/wiki/AVX-512 [1]: | https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing [2]: | https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Hi... | adrian_b wrote: | Actually AVX-512 is considerably older than AVX. | | Initially AVX-512 was known as the "Larrabee New Instructions". | | This instruction set, which included essential features | that have been missing in both earlier and later Intel | ISAs, e.g. mask registers and scatter-gather instructions, | was developed a few years before 2009, by a team into which | many people had been brought from outside Intel. | | The "Larrabee New Instructions" were disclosed | publicly in 2009; then the first hardware implementation | available outside Intel, "Knights Ferry", was released in | May 2010. Due to poor performance against GPUs, it was | available only in development systems. | | A year later, in 2011, Sandy Bridge was launched, the first | Intel product with AVX. Even if AVX had significant | improvements over SSE, it was seriously crippled in | comparison with the older AVX-512, a.k.a. the "Larrabee New | Instructions". | | It would have been much better for Intel customers if | Sandy Bridge had implemented a 256-bit version of | AVX-512 instead of implementing AVX.
However, Intel has | always attempted to implement as few improvements as | possible in each CPU generation, in order to minimize their | production costs and maximize their profits. This worked | very well for them as long as they did not have serious | competition. | | The next implementation of AVX-512 (under the name "Intel | Many Integrated Cores Instructions") was in Knights | Corner, the first Xeon Phi, launched in Q4 2012. This | version made some changes to the encoding of the | instructions and also removed some instructions intended | for GPU applications. | | The next implementation of AVX-512, which again changed the | encoding of the instructions to the one used today, and | which changed the name to AVX-512, was in Knights Landing, | launched in Q2 2016. | | With the launch of Skylake Server, in Q3 2017, AVX-512 | appeared for the first time in mainstream Intel CPUs, though | with some sets of instructions previously available on Xeon | Phi removed. | | AVX-512 is a much more pleasant ISA than AVX; e.g., by using | the mask registers it is much easier to program loops when | the length and alignment of data are arbitrary. | Unfortunately, support for it is unpredictable, so it is | usually not worthwhile to optimize for it. | | Hopefully the rumors that Zen 4 supports AVX-512 are true, | so its launch might be the real start of widespread use of | AVX-512. | wtallis wrote: | Also, consumer Skylake started shipping in 2015 with some | non-functional silicon area reserved for AVX-512 register | files. | jeffbee wrote: | Skylake did not have IFMA or VBMI. The first | microarchitecture with both of those was Cannon Lake, Q2 | 2018, which practically did not exist in the market, and | the first mainstream CPU with both of these was Ice Lake, | Q3 2019. | thebiss wrote: | If you're curious what gets generated, Godbolt maps the C code to | ASM, and provides explanations of the instructions on hover.
The link | below processes the source using Clang, producing a cleaner | result than GCC. | | [0] https://godbolt.org/z/78KodqxaP | nitwit005 wrote: | > The code is a bit technical, but remarkably, it does not | require a table. | | Those constants being used look a lot like a table. | stncls wrote: | Especially given that the table in the baseline code | is only 200 bytes. That's less than four AVX-512 registers! | adrian_b wrote: | I assume that he referred to the fact that the code does not | need an array in memory, with constant values that must be | accessed with indexed loads. Using such a table in memory can | be relatively slow, as shown by the benchmarks for this | method. | | There are constants, but they are used directly as arguments | for the AVX-512 intrinsics. They must still be loaded into the | AVX-512 registers from memory, but they are loaded from | locations already known at compile time, and the loads can be | scheduled optimally by the compiler, because they do not | depend on any other instructions. | | For a table stored in an array in memory, the loads can be done | only after the indices are computed, and the loads may need to | be done in a random order, not in the sequential order of the | memory locations. When the latencies of the loads cannot be | hidden, they can slow any algorithm considerably. | nitwit005 wrote: | The code itself sits in memory, which means the "table" is | still in memory. | orlp wrote: | If your program is sufficiently large and diverse, the | constants embedded in the instructions are no better than a | table load/lookup. The jump to the conversion routine is | unpredictable, and the routine will not be in the instruction | cache, causing the CPU to stall while waiting for the | instructions to load from RAM. | | This will never show up in a microbenchmark where the | function's instructions are always hot.
In fact, a lot of | microbenchmarking software "warms up" the code by calling it | a bunch of times before starting to time it, to maximize the | chances of being able to ignore this reality. | dzaima wrote: | The AVX-512 constants aren't embedded in the instructions, | but the address to load being static means the CPU can | start to cache them as soon as the decoder gets to them, | possibly even before it knows what the input to the function | is. | | In contrast, for the scalar code, the CPU must complete the | divisions (which'll become multiplications, but even those | still have relatively high latency) before it can even | begin to look for what to cache. | dzaima wrote: | The list of arguments to _mm512_setr_epi64 is just reciprocals | of powers of 10, multiplied by 2^52. The scalar code uses | division by powers of 10, which'd compile down to similar | multiplication constants; you just don't see them because the | compiler does that for the scalar code, but you have to do it | manually for AVX-512. | | And permb_const is a list of indices for the result characters | within the vector register - the algorithm works on 64-bit | integers, but the result must be a list of bytes, so it picks | out every 8th one. | jeffbee wrote: | I'd be curious about use cases where this does or does not make | sense. On the one hand, you saved a few nanoseconds on the | integer-to-string encoder. But on the other hand, you're | committed to storing and transferring up to 15 leading zeros, | and there must be some cost on the decoder for consuming the | leading zeros. So this clearly makes sense in a | write-once-read-never application, but there must also be a | point at which the full lifecycle cost crosses over and this | approach is worse. | aqrit wrote: | related to IEEE 754 double-precision floating-point round- | trips? | dzaima wrote: | The article compares padded length-15 output for both cases; | removing leading zeroes would have cost for both AVX-512 and | regular code.
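The point above about division by powers of 10 compiling down to multiplication constants can be seen in scalar code: compilers replace `n / 10` with a multiply-high by a scaled reciprocal, the same trick the article's AVX-512 constants (reciprocals of powers of 10 scaled by 2^52) apply per vector lane. A minimal sketch of that strength reduction (`div10` is an illustrative name; the magic constant 0xCCCCCCCCCCCCCCCD is ceil(2^67/10), the constant GCC/Clang emit for unsigned 64-bit division by 10; `__int128` is a GCC/Clang extension):

```c
#include <stdint.h>

/* Unsigned division by 10 via multiply-high -- the transformation
   compilers apply to `n / 10`, and the scalar analogue of the scaled
   reciprocals used in the AVX-512 version. Computes (n * m) >> 67,
   where m = ceil(2^67 / 10), which equals n / 10 for all 64-bit n. */
static uint64_t div10(uint64_t n) {
    const uint64_t m = 0xCCCCCCCCCCCCCCCDULL;        /* ceil(2^67 / 10) */
    uint64_t hi = (uint64_t)(((unsigned __int128)n * m) >> 64);
    return hi >> 3;                                  /* total shift: 67 */
}
```

The division never executes at run time; a multiply-high plus a shift replaces it, which is why the "no table" scalar baseline still ends up full of constants much like the vector code's.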
| jeffbee wrote: | Yeah, that's the "this approach" to which I refer, though. | This is a micro-optimization of an approach that I'm not sure | has many beneficial applications. | dzaima wrote: | Ah. The article, as I see it, is primarily about AVX-512 vs | scalar code, not the 15 digits, though. The fixed-length | restriction is purely to simplify the challenge. | | To remove leading zeroes, you'd need to use one of | bsf/lzcnt/pcmpistri, and a masked store, which has some | cost, but it still stays branchless and will probably easily | be compensated for by the smaller cache/storage/network usage | & shorter decoding. | bumblebritches5 wrote: ___________________________________________________________________ (page generated 2022-03-29 23:00 UTC)