[HN Gopher] Parsing JSON faster with Intel AVX-512
___________________________________________________________________
Parsing JSON faster with Intel AVX-512
Author : ashvardanian
Score : 97 points
Date : 2022-05-25 21:29 UTC (1 day ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| worewood wrote:
| Using specialized instructions does not always turn into
| performance improvements. Processors are pretty smart these days
| and the generated u-ops may be the same.
| skavi wrote:
| Hopefully we'll see AVX-512 in Intel's little cores soon.
| Centaur's last CPU architecture proves that it is possible to
| implement the extension without a huge amount of area [0]. Once
| that happens, I expect we'll finally consistently see AVX-512 on
| new Intel processors. The masks really are a huge improvement to
| the design.
|
| AMD should be implementing AVX-512 on their own cores soon as
| well. Once Armv9 (with SVE2) becomes dominant, we'll pretty much
| be in a golden age of SIMD.
|
| [0]: https://chipsandcheese.com/2022/04/30/examining-centaur-
| chas...
| torginus wrote:
| I'm kinda torn on AVX-512 (and SIMD in general). On one hand,
| AVX-512 finally introduced a sane programming model with mask
| registers for branching code, which makes the lives of
| compilers much easier.
|
| On the other hand, the tooling for turning high-level languages
| into SIMD code is not there yet; ISPC refuses to support ARM,
| and is still kind of a novelty tool.
|
| Additionally, 512-bit wide vectors are just too big - the
| resulting vector units take up too much die space even on _big_
| cores, and the power consumption causes issues, causing said
| dies to downclock. Probably it won't be viable on small cores.
| dr_zoidberg wrote:
| > Additionally, 512-bit wide vectors are just too big - the
| resulting vector units take up too much die space even on big
| cores, and the power consumption causes issues, causing said
| dies to downclock.
|
| This is no longer true, citing [0]:
|
| > At least, it means we need to adjust our mental model of
| the frequency related cost of AVX-512 instructions. Rather
| than the prior-generation verdict of "AVX-512 generally
| causes significant downclocking", on these Ice Lake and
| Rocket Lake client chips we can say that AVX-512 causes
| insignificant (usually, none at all) license-based
| downclocking and I expect this to be true on other ICL and
| RKL client chips as well.
|
| And we still have to see AMD's implementation of AVX-512 on
| Zen 4 to know what behavior and limits it may have (if any).
|
| [0] https://travisdowns.github.io/blog/2020/08/19/icl-
| avx512-fre...
| jeffbee wrote:
| Considering that the execution units, register file, etc. that
| support AVX-512 are themselves nearly as large as the entire
| Gracemont core ... don't hold your breath.
| brigade wrote:
| You don't need anything larger than the 128-bit ALUs or the
| 207x128-bit register file Gracemont already has to implement
| AVX-512. It doesn't make sense on its own with that backend,
| but for ISA compatibility with a big core it does.
| Dylan16807 wrote:
| Can the shuffling instructions be reasonably efficient with
| a small ALU?
| brigade wrote:
| Depends on what you consider reasonable. Worst case is
| 512-bit vpermi2*, which could be implemented with 16x
| 128-bit vpermi2-like uops, if the needed masking was
| implicit.
|
| Which to me is reasonable for ISA compatibility. (Also
| considering that having to deal with ISA incompatibility
| across active cores is _not_ reasonable at all.)
| jeffbee wrote:
| I'm not sure that users would accept that. You could have a
| situation where an ifunc is resolved on a fast core with a
| slightly superior AVX-512 definition, but then the thread
| migrates to an efficiency core and the AVX-512 definition
| is dramatically slower than what could have been achieved
| with AVX2 (e.g. if a microcoded permute was 16x slower).
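[For readers unfamiliar with the mask registers mentioned above: AVX-512 "merge-masking" means each result lane is either the freshly computed value or the destination's old value, selected by one mask bit. A minimal scalar sketch of that semantics; the function name and the 4-lane usage are illustrative, not any library's API:]

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of AVX-512 merge-masking: lane i gets a[i] + b[i] when
// mask bit i is set, and keeps its previous contents otherwise.
// An instruction like _mm512_mask_add_epi32 applies this to 16 lanes
// at once, with the mask held in a dedicated k-register.
void masked_add(int32_t* dst, const int32_t* a, const int32_t* b,
                uint16_t mask, size_t lanes) {
    for (size_t i = 0; i < lanes; ++i) {
        if ((mask >> i) & 1) {
            dst[i] = a[i] + b[i];
        }
        // else: dst[i] is left untouched (the "merge" in merge-masking)
    }
}
```

[Because untaken lanes simply keep their old values, branchy scalar loops can be expressed as straight-line masked vector code, which is what makes this model friendlier for compilers than AVX2's blend-based approach.]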
| brigade wrote:
| Most reasonable would be a hypothetical AVX-256 that was
| AVX-512VL minus ZMM registers. Intel chose against that.
|
| So the only reasonable options for a big little system
| are to not have little cores, or for nothing to support
| AVX-512, or for the little cores to support AVX-512 as
| best they can. Then thread director can weight AVX-512
| usage even heavier than it already weights AVX2.
| dragontamer wrote:
| > we'll pretty much be in a golden age of SIMD.
|
| We already are in the golden age of SIMD. NVidia and AMD GPUs
| are easier and easier to program through standard interfaces.
|
| Intel / AMD are pushing SIMD on a CPU, which is useful for
| sure, but is always going to be smaller in scope than a
| dedicated SIMD processor like the A100, 3060, AMD Vega, AMD
| 6800 XT and the like.
|
| SIMD-on-a-CPU is useful because you can perform SIMD over the
| L1 cache as communication (rather than traversing L1 -> L2 ->
| L3 -> DDR4 / PCIe -> GPU VRAM -> GPU Registers -> SIMD, and
| back). But if you have a large-scale operation that can work
| SIMD, the GPU traversal absolutely works and is commonly done.
| skavi wrote:
| Good point. Should have clarified I was referring to CPU
| SIMD.
| dragontamer wrote:
| AVX2 is not as good as AVX512. But AVX2 still has vgather
| instructions, pshufb, and a few other useful tricks.
|
| AVX512 and ARM SVE2 bring the CPU up to parity with maybe
| 2010s-era GPUs or so (full gather/scatter, more permutation
| instructions, etc. etc.). But GPUs continued to evolve.
| Butterfly shuffles are the generic any-to-any network
| building block, and are exposed in PTX (NVidia assembly) as
| shfl.bfly, and in AMD DPP (data-parallel primitives).
|
| Having a richer set of lane-to-lane shuffling (especially
| ready-to-use butterfly networks) would be best. It really
| is surprising how many problems require those rich sets of
| data-movement instructions, or otherwise benefit from them.
|
| NEON and SVE had hard-coded data-movement for specific
| applications. The general-purpose instruction (pshufb) is
| kinda like permute/shfl from AMD/NVidia. A backwards-permute
| IIRC doesn't exist yet on the CPU side.
|
| And butterfly networks are the general-purpose solution,
| capable of implementing any arbitrary data-movement in just
| log(width) steps. (pshufb / permute instructions would be
| the full-sized butterfly network, but some cases might be
| "easier" and faster to execute with only a limited number
| of butterfly swaps, such as what inevitably comes up in
| sorting)
|
| --------
|
| Still, all of these operations can be implemented in AVX2
| (albeit slower / less efficiently). So it's not like the
| "language" of AVX2 / AVX is incomplete... it's just missing
| a few general-purpose instructions that could lead to
| better performance.
| PragmaticPulp wrote:
| > Could we do better? Assuredly. There are many AVX-512 that we
| are not using yet. We do not use ternary Boolean operations
| (vpternlog). We are not using the new powerful shuffle functions
| (e.g., vpermt2b). We have an example of coevolution: better
| hardware requires new software which, in turn, makes the hardware
| shine.
|
| > Of course, to get these new benefits, you need recent Intel
| processors with adequate AVX-512 support
|
| AVX-512 support can be confusing because it's often referred to
| as a single instruction set.
|
| AVX-512 is actually a large family of instructions that have
| different availability depending on the CPU. It's not enough to
| say that a CPU has AVX-512 because it's not a binary question.
| You have to know _which_ AVX-512 instructions are supported on a
| particular CPU.
|
| Wikipedia has a partial chart of AVX-512 support by CPU:
| https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
|
| Note that some instructions that are available in one generation
| of CPU can actually be unavailable (superseded, usually) in
| the next generation of the CPU.
If you go deep enough into
| AVX-512 optimization, you essentially end up targeting a specific
| CPU for the code. This is not a big deal if you're deploying
| software to 10,000 carefully controlled cloud servers with known
| specifications, but it makes general use and especially consumer
| use much harder.
| robocat wrote:
| To add, they are using[2] the relatively recent VBMI2
| instructions of AVX-512. This article[1] talks about the
| advantages of VBMI on Ice Lake, released 2021.
|
| [1] https://www.singlestore.com/blog/a-programmers-perspective/
| comments https://news.ycombinator.com/item?id=28179111
|
| [2] https://news.ycombinator.com/item?id=31522464
| mikepurvis wrote:
| Are there good libraries for doing runtime feature detection?
| E.g., include three versions of hot function X in the binary,
| and have it seamlessly insert the correct function pointer at
| startup? Or have the function contain multiple bodies and just
| JMP to the correct block of code?
|
| I know you can do this yourself, but last time I looked it was
| a heavily manual process -- you had to basically define a plugin
| interface and dynamically load your selected implementation
| from a separate shared object. What are the barriers to having
| compilers able to be hinted into transparently generating
| multiple versions of key functions?
| bremac wrote:
| I'm unsure about library support, but gcc and clang support
| function multi-versioning (FMV), which resolves the function
| based on CPUID the first time the function is called.
|
| This LWN article has some additional information:
| https://lwn.net/Articles/691932/
| mikepurvis wrote:
| TIL! I guess it makes sense that popular numeric libraries
| like BLAS, Eigen, and so on would take advantage of this,
| but I wonder how widely used it is overall.
| loeg wrote:
| GCC has offered Function Multiversioning for about a decade
| now (GCC ~4.8 or 4.9).
GCC 6's resolver apparently uses CPUID
| to resolve the ifunc once at program start:
| https://lwn.net/Articles/691932/ .
|
| Clang added it in 7.0.0: https://releases.llvm.org/7.0.0/tool
| s/clang/docs/AttributeRe...
|
| A nice presentation on it:
| https://llvm.org/devmtg/2014-10/Slides/Christopher-
| Function%...
| indygreg2 wrote:
| While this IFUNC feature does exist and it is useful, when
| I performed binary analysis on every package in Ubuntu in
| January, I found that only ~11 distinct packages have
| IFUNCs. It certainly looks like this ELF feature is not
| really used much [in open source] outside of GNU
| toolchain-level software!
|
| https://gregoryszorc.com/blog/2022/01/09/bulk-analyze-
| linux-...
| TkTech wrote:
| I've wanted to use them many times in the past, but the
| limited support on other compilers (looking at you MSVC)
| always made it a non-starter. If I have to support some
| other method of feature detection anyway, there's no
| point.
| colejohnson66 wrote:
| Check out Agner Fog's vectorclass library:
| https://github.com/vectorclass/version2
| beached_whale wrote:
| If only Intel wasn't dropping support for it on a lot of CPUs
| jrimbault wrote:
| I'm using this comment as a jumping point.
|
| What's the cost/opportunity of optimizing for a specific
| platform/instruction set? At what point is it worth doing,
| and when isn't it worth doing? AVX-512 strikes me as
| something... "ephemeral".
| skavi wrote:
| Writing code for SIMD can get you absolutely massive
| performance improvements. Whether it's worth the added
| complexity depends on the situation. If your data is already
| arranged in a cache-friendly way (SoA), it shouldn't be
| incredibly difficult to use SIMD intrinsics to optimize. I'd
| first take a look at what the compiler is already generating
| for you to see if manual intervention is worth it.
| throwaway92394 wrote:
| Well I mean this article is demoing a 28% improvement (if I
| did my math right) for json parsing.
|
| Sure, AVX-512 is only applicable to specific workloads, and
| even for many of those workloads the cost/opportunity of
| optimizing for AVX-512 might not be worth it. But there
| clearly ARE use cases that would benefit, and it might be
| worth it for more consumer applications to optimize for
| AVX-512 - but only if it can be used.
|
| The way I see it, the benefit of optimizing for
| AVX-512 is far higher if it becomes normal for consumer CPUs
| to have it. A 28% improvement is pretty decent, but it's only
| worth implementing if enough people can utilize it.
| beached_whale wrote:
| For many maybe not, but when writing the foundations of
| software it is good to start fast. There are libraries that
| abstract various SIMD architectures now too. Simdjson has
| their own, and there are ones like KUMI.
| jcranmer wrote:
| Intel is not dropping support for it on a lot of CPUs.
|
| The only thing they've done is disable it in the hybrid Alder
| Lake cores, presumably because the E-cores couldn't support it
| (while the P-cores could), and they didn't want to deal with
| the headaches of ISA extensions being supported only on some
| cores in the system.
| Aardwolf wrote:
| > Intel is not dropping support for it on a lot of CPUs.
|
| There are 0 current-generation consumer CPUs from either Intel
| or AMD that have it.
|
| > The only thing they've done is disable it in the hybrid
| Alder Lake cores
|
| Which happen to be _all_ the current generation Intel CPUs
| beached_whale wrote:
| Ah, headlines foiled me. I read it as disabled in Alder Lake
| altogether.
| temac wrote:
| It is disabled in all consumer Alder Lake (and I don't
| remember if there will be Xeons of that gen with P-cores only
| -- IIRC Intel stopped the AVX-512 validation late on those
| cores, but it was still before it was formally finished, so
| probably not).
At one point it worked with some
| BIOSes on P-core-only chips, or if you disabled the E-cores on
| hybrid ones, but with up-to-date Intel microcode it does not
| work anymore.
| coder543 wrote:
| > The only thing they've done is disable it in the hybrid
| Alder Lake cores
|
| That is incorrect. You can buy Alder Lake CPUs that only have
| one type of core (the i3 series only has P-cores, for
| example), and those do not support AVX-512 either. They're
| not "hybrid" in any way.
|
| Some of their motherboard partners initially allowed you to
| access AVX-512, but Intel has put a stop to this and the
| feature is disabled on _all_ Alder Lake CPU SKUs, period.
|
| Newer Alder Lake chips have AVX-512 fused off entirely:
| https://www.tomshardware.com/news/intel-nukes-alder-lake-
| avx...
|
| > Intel is not dropping support for it on a lot of CPUs.
|
| That seems like a pretty questionable statement. Intel might
| keep AVX-512 around for Xeon, but it seems extremely dead on
| the consumer market. If Intel decides to bring it back for
| the next generation, that would be strange and very poor
| planning.
| gpderetta wrote:
| It seems likely that the reason is that some Intel
| customers are willing to pay a significant premium for the
| feature and Intel doesn't want it to be available for cheap.
| nomel wrote:
| Well, if it means more cores, it's almost certainly worth it,
| in the grand scheme of things.
| bfrog wrote:
| At what point is JSON not the right option? Surely when trying to
| do this sort of thing?
|
| At what point is it saner to use something like flatbuffers or
| capnproto style message encoding instead?
| smabie wrote:
| Often you do not get the choice of whether you want to be
| parsing json or not.
| vardump wrote:
| Sometimes you just don't have a choice when you need to
| interface with a third-party data feed or software.
|
| Isn't it better to have all options open?
| avg_dev wrote:
| Good thought.
If you are coding in C++ maybe you can use some
| sort of binary serialization thing. Even in other languages, if
| json parsing is a bottleneck, it can possibly be optimized away
| through use of a binary wire format. That said, vector
| operations available to programmers are always a welcome thing,
| I'd say. And who knows how much production json parsing this
| library really does; it could be a ton.
|
| I'm torn. I've worked at shops where we aim over time to reduce
| response time while serving business logic and using
| statistical models that get iterated on. Even there I haven't
| seen a blatant need for non-JSON rpc. But I know my experience
| doesn't mirror everyone's. And I like seeing and learning about
| instruction sets. I'm currently taking a course in parallel
| computing and I just used AVX2 for the first time in a toy
| program to subtract one vector from another in a single
| instruction, which, while not particularly useful, is a window
| into more interesting things and is still SIMD.
|
| I think on the whole making json parsing faster for a large
| enough fraction of processors is probably a huge win for the
| environment. But who is parsing json in C++?
| ollien wrote:
| > But who is parsing json in C++?
|
| Well, Facebook for one! Folly has lots of utilities for this
| (see folly::dynamic[1]). We make extensive use of this at my
| (non-Facebook) job.
|
| [1] https://github.com/facebook/folly/blob/master/folly/docs/
| Dyn...
| timerol wrote:
| > Of course, to get these new benefits, you need recent Intel
| processors with adequate AVX-512 support and, evidently, you also
| need relatively recent C++ processors. Some of the recent
| laptop-class Intel processors do not support AVX-512 but you
| should be fine if you rely on AWS and have big Intel nodes.
|
| What is meant by "relatively recent C++ processors"? Is that
| supposed to be "compilers"?
| [deleted]
| Narishma wrote:
| It's supposed to be Intel, not C++.
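[The function multi-versioning discussed earlier in the thread can also be hand-rolled. Below is a minimal sketch of CPUID-based dispatch, assuming GCC or Clang on x86-64; the function names are illustrative, and the "fast" path is a placeholder that a real build would compile with a target attribute and actual intrinsics:]

```cpp
#include <cstddef>
#include <cstdint>

// Baseline implementation: works everywhere.
static int64_t sum_scalar(const int32_t* v, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

// Stand-in for a vectorized version. In a real build this would carry
// __attribute__((target("avx2"))) and use intrinsics; here it just
// forwards so the sketch runs on any machine.
static int64_t sum_avx2(const int32_t* v, size_t n) {
    return sum_scalar(v, n);
}

using sum_fn = int64_t (*)(const int32_t*, size_t);

// Resolve once, like an ifunc: pick the best implementation for the
// CPU we actually booted on, then always call through the pointer.
static sum_fn resolve_sum() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_scalar;
}

static const sum_fn sum_impl = resolve_sum();
```

[GCC's `target_clones` attribute and ELF IFUNCs automate exactly this pattern, moving the resolver call into startup so call sites pay only an indirect call.]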
| NegativeLatency wrote: | "new" is relative since they've been out for almost 10 years: | https://www.intel.com/content/www/us/en/developer/articles/t... | jeffbee wrote: | This code uses VBMI2, which just came out quite recently. ___________________________________________________________________ (page generated 2022-05-26 23:00 UTC)
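[For context on the VBMI2 remark above: one of the VBMI2 instructions the article leans on is VPCOMPRESSB, which packs the bytes selected by a mask contiguously, a useful primitive for discarding characters during parsing without branches. A scalar sketch of that semantics; the function name is mine, not an intrinsic:]

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of AVX-512 VBMI2's VPCOMPRESSB: bytes whose mask bit is
// set are packed, in order, to the front of dst. The real instruction
// does this for a whole 64-byte register in one operation.
size_t compress_bytes(uint8_t* dst, const uint8_t* src,
                      uint64_t mask, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; ++i) {
        if ((mask >> i) & 1) dst[out++] = src[i];
    }
    return out;  // number of bytes kept
}
```

[A parser can compute the mask with a vector compare (e.g. "is this byte whitespace?") and then compress away the unwanted bytes in one step, which is part of what the AVX-512 simdjson kernels exploit.]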