[HN Gopher] Parsing JSON faster with Intel AVX-512
       ___________________________________________________________________
        
       Parsing JSON faster with Intel AVX-512
        
       Author : ashvardanian
       Score  : 97 points
        Date   : 2022-05-25 21:29 UTC (1 day ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
        | worewood wrote:
        | Using specialized instructions doesn't always turn into
        | performance improvements. Processors are pretty smart these days
        | and the generated u-ops may be the same.
        
       | skavi wrote:
       | Hopefully we'll see AVX-512 in Intel's little cores soon.
       | Centaur's last CPU architecture proves that it is possible to
       | implement the extension without a huge amount of area [0]. Once
       | that happens, I expect we'll finally consistently see AVX-512 on
       | new Intel processors. The masks really are a huge improvement to
       | the design.
       | 
       | AMD should be implementing AVX-512 on their own cores soon as
       | well. Once Armv9 (with SVE2) becomes dominant, we'll pretty much
       | be in a golden age of SIMD.
       | 
       | [0]: https://chipsandcheese.com/2022/04/30/examining-centaur-
       | chas...
        
         | torginus wrote:
         | I'm kinda torn on AVX-512 (and SIMD in general). On one hand,
         | AVX-512 finally introduced a sane programming model with mask
         | registers for branching code, which makes the lives of
         | compilers much easier.
         | 
          | On the other hand, the tooling for turning high-level
          | languages into SIMD code is not there yet: ISPC refuses to
          | support ARM and is still something of a novelty tool.
          | 
          | Additionally, 512-bit wide vectors are just too big - the
          | resulting vector units take up too much die space even on _big_
          | cores, and the power consumption causes said dies to
          | downclock. It probably won't be viable on small cores.
        
           | dr_zoidberg wrote:
           | > Additionally, 512-bit wide vectors are just too big - the
           | resulting vector units take up too much die space even on big
           | cores, and the power consumption causes issues causing said
           | dies to downclock.
           | 
           | This is no longer true, citing [0]:
           | 
           | > At least, it means we need to adjust our mental model of
           | the frequency related cost of AVX-512 instructions. Rather
           | than the prior-generation verdict of "AVX-512 generally
           | causes significant downclocking", on these Ice Lake and
           | Rocket Lake client chips we can say that AVX-512 causes
           | insignificant (usually, none at all) license-based
           | downclocking and I expect this to be true on other ICL and
           | RKL client chips as well.
           | 
            | And we still have to see AMD's implementation of AVX-512 on
            | Zen 4 to know what behavior and limits it may have (if any).
           | 
           | [0] https://travisdowns.github.io/blog/2020/08/19/icl-
           | avx512-fre...
        
         | jeffbee wrote:
         | Considering that the execution units, register file, etc that
         | support AVX-512 are themselves nearly as large as the entire
         | Gracemont core ... don't hold your breath.
        
           | brigade wrote:
           | You don't need larger than the 128-bit ALUs or the
           | 207x128-bit register file Gracemont already has to implement
           | AVX-512. It doesn't make sense on its own with that backend,
           | but for ISA compatibility with a big core it does.
        
             | Dylan16807 wrote:
             | Can the shuffling instructions be reasonably efficient with
             | a small ALU?
        
               | brigade wrote:
               | Depends on what you consider reasonable. Worst case is
               | 512-bit vpermi2*, which could be implemented with 16x
               | 128-bit vpermi2-like uops, if the needed masking was
               | implicit.
               | 
               | Which to me is reasonable for ISA compatibility. (Also
               | considering that having to deal with ISA incompatibility
               | across active cores is _not_ reasonable at all.)
        
             | jeffbee wrote:
             | I'm not sure that users would accept that. You could have a
             | situation where an ifunc is resolved on a fast core with a
             | slightly superior AVX-512 definition, but then the thread
             | migrates to an efficiency core and the AVX-512 definition
             | is dramatically slower than what could have been achieved
             | with AVX2 (e.g. if a microcoded permute was 16x slower).
        
               | brigade wrote:
               | Most reasonable would be a hypothetical AVX-256 that was
               | AVX-512VL minus ZMM registers. Intel chose against that.
               | 
                | So the only reasonable options for a big/little system
               | are to not have little cores, or for nothing to support
               | AVX-512, or for the little cores to support AVX-512 as
               | best they can. Then thread director can weight AVX-512
               | usage even heavier than it already weights AVX2.
        
         | dragontamer wrote:
         | > we'll pretty much be in a golden age of SIMD.
         | 
         | We already are in the golden age of SIMD. NVidia and AMD GPUs
         | are easier and easier to program through standard interfaces.
         | 
         | Intel / AMD are pushing SIMD on a CPU, which is useful for
         | sure, but always is going to be smaller in scope than a
         | dedicated SIMD-processor like A100, 3060, AMD Vega, AMD 6800 xt
         | and the like.
         | 
         | SIMD-on-a-CPU is useful because you can perform SIMD over the
         | L1 cache as communication (rather than traversing L1 -> L2 ->
         | L3 -> DDR4 / PCIe -> GPU VRAM -> GPU Registers -> SIMD, and
         | back). But if you have a large-scale operation that can work
         | SIMD, the GPU-traversal absolutely works and is commonly done.
        
           | skavi wrote:
           | Good point. Should have clarified I was referring to CPU
           | SIMD.
        
             | dragontamer wrote:
             | AVX2 is not as good as AVX512. But AVX2 still has vgather
             | instructions, pshufb, and a few other useful tricks.
             | 
             | AVX512 and ARM SVE2 bring the CPU up to parity with maybe
             | 2010s-era GPUs or so (full gather/scatter, more permutation
             | instructions, etc. etc.). But GPUs continued to evolve.
              | Butterfly shuffles are the generic any-to-any network
              | building block, and are exposed in PTX (NVidia assembly)
              | as shfl.bfly and in AMD's DPP (data-parallel primitives).
             | 
             | Having a richer set of lane-to-lane shuffling (especially
             | ready-to-use butterfly networks) would be best. It really
             | is surprising how many problems require those rich-sets of
             | data-movement instructions, or otherwise benefit from them.
             | 
             | NEON and SVE had hard-coded data-movement for specific
             | applications. The general-purpose instruction (pshufb) is
             | kinda like permute/shfl from AMD/NVidia. A backwards-
             | permute IIRC doesn't exist yet on CPU-side.
             | 
             | And butterfly networks are the general-purpose solution,
             | capable of implementing any arbitrary data-movement in just
             | log(width) steps. (pshufb / permute instructions would be
             | the full-sized butterfly network, but some cases might be
             | "easier" and faster to execute with only a limited number
             | of butterfly swaps, such as what inevitably comes up in
             | sorting)
             | 
             | --------
             | 
              | Still, all of these operations can be implemented in AVX2
              | (albeit slower / less efficiently). So it's not like the
              | "language" of AVX2 / AVX is incomplete... it's just missing
              | a few general-purpose instructions that could lead to
              | better performance.
        
       | PragmaticPulp wrote:
        | > Could we do better? Assuredly. There are many AVX-512
        | instructions that we are not using yet. We do not use ternary
        | Boolean operations
       | (vpternlog). We are not using the new powerful shuffle functions
       | (e.g., vpermt2b). We have an example of coevolution: better
       | hardware requires new software which, in turn, makes the hardware
       | shine.
       | 
       | > Of course, to get these new benefits, you need recent Intel
       | processors with adequate AVX-512 support
       | 
       | AVX-512 support can be confusing because it's often referred to
       | as a single instruction set.
       | 
       | AVX-512 is actually a large family of instructions that have
       | different availability depending on the CPU. It's not enough to
       | say that a CPU has AVX-512 because it's not a binary question.
       | You have to know _which_ AVX-512 instructions are supported on a
       | particular CPU.
       | 
       | Wikipedia has a partial chart of AVX-512 support by CPU:
       | https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
       | 
        | Note that some instructions that are available in one generation
        | of CPU can actually be unavailable (superseded, usually) in
        | the next generation of the CPU. If you go deep enough into
       | AVX-512 optimization, you essentially end up targeting a specific
       | CPU for the code. This is not a big deal if you're deploying
       | software to 10,000 carefully controlled cloud servers with known
       | specifications, but it makes general use and especially consumer
       | use much harder.
        
         | robocat wrote:
          | To add, they are using[2] the relatively recent VBMI2
          | instructions of AVX-512. This article[1] talks about the
          | advantages of VBMI on Ice Lake, released in 2021.
         | 
         | [1] https://www.singlestore.com/blog/a-programmers-perspective/
         | comments https://news.ycombinator.com/item?id=28179111
         | 
         | [2] https://news.ycombinator.com/item?id=31522464
        
         | mikepurvis wrote:
         | Are there good libraries for doing runtime feature detection?
         | Eg, include three versions of hot function X in the binary, and
         | have it seamlessly insert the correct function pointer at
         | startup? Or have the function contain multiple bodies and just
         | JMP to the correct block of code?
         | 
          | I know you can do this yourself, but last time I looked it was
          | a heavily manual process: you had to basically define a plugin
          | interface and dynamically load your selected implementation
         | from a separate shared object. What are the barriers to having
         | compilers able to be hinted into transparently generating
         | multiple versions of key functions?
        
           | bremac wrote:
           | I'm unsure about library support, but gcc and clang support
           | function multi-versioning (FMV), which resolves the function
           | based on CPUID the first time the function is called.
           | 
           | This LWN article has some additional information:
           | https://lwn.net/Articles/691932/
        
             | mikepurvis wrote:
             | TIL! I guess it makes sense that popular numeric libraries
             | like BLAS, Eigen, and so-on would take advantage of this,
             | but I wonder how widely used it is overall.
        
           | loeg wrote:
           | GCC has offered Function Multiversioning for about a decade
           | now (GCC ~4.8 or 4.9). GCC 6's resolver apparently uses CPUID
           | to resolve the ifunc once at program start:
           | https://lwn.net/Articles/691932/ .
           | 
           | Clang added it in 7.0.0: https://releases.llvm.org/7.0.0/tool
           | s/clang/docs/AttributeRe...
           | 
           | A nice presentation on it:
           | https://llvm.org/devmtg/2014-10/Slides/Christopher-
           | Function%...
        
             | indygreg2 wrote:
             | While this IFUNC feature does exist and it is useful, when
             | I performed binary analysis on every package in Ubuntu in
             | January, I found that only ~11 distinct packages have
             | IFUNCs. It certainly looks like this ELF feature is not
             | really used much [in open source] outside of GNU toolchain-
             | level software!
             | 
             | https://gregoryszorc.com/blog/2022/01/09/bulk-analyze-
             | linux-...
        
               | TkTech wrote:
               | I've wanted to use them many times in the past, but the
               | limited support on other compilers (looking at you MSVC)
               | always made it a non-starter. If I have to support some
               | other method of feature detection anyways, there's no
               | point.
        
           | colejohnson66 wrote:
           | Check out Agner Fog's vectorclass library:
           | https://github.com/vectorclass/version2
        
       | beached_whale wrote:
        | If only Intel wasn't dropping support for it on a lot of CPUs.
        
         | jrimbault wrote:
          | I'm using this comment as a jumping-off point.
          | 
          | What's the cost/opportunity of optimizing for a specific
          | platform/instruction set? At what point is it worth doing,
          | and when isn't it? AVX-512 strikes me as something...
          | "ephemeral".
        
           | skavi wrote:
            | Writing code for SIMD can get you absolutely massive
            | performance improvements. Whether it's worth the added
            | complexity depends on the situation. If your data is already
            | arranged in a cache-friendly way (SoA), it shouldn't be
            | incredibly difficult to use SIMD intrinsics to optimize. I'd
            | first take a look at what the compiler is already generating
            | for you to see if manual intervention is worth it.
        
           | throwaway92394 wrote:
            | Well, I mean, this article is demoing a 28% improvement (if
            | I did my math right) for JSON parsing.
            | 
            | Sure, AVX-512 is only applicable to specific workloads, and
            | even for many of those workloads the cost/opportunity of
            | optimizing for AVX-512 might not be worth it. But there
            | clearly ARE use cases that would benefit, and it might be
            | worth it for more consumer applications to optimize for
            | AVX-512 - but only if it can be used.
           | 
           | The way I see it is that the benefit of optimizing for
           | AVX-512 is far higher if it becomes normal for consumer CPUs
           | to have it. A 28% improvement is pretty decent, but it's only
           | worth implementing if enough people can utilize it.
        
           | beached_whale wrote:
            | For many, maybe not, but when writing the foundations of
            | software it is good to start fast. There are libraries that
            | abstract various SIMD architectures now too. Simdjson has
            | its own, and there are ones like KUMI.
        
         | jcranmer wrote:
         | Intel is not dropping support for it on a lot of CPUs.
         | 
         | The only thing they've done is disable it in the hybrid Alder
         | Lake cores, presumably because the E-cores couldn't support it
         | (while the P-cores could), and they didn't want to deal with
         | the headaches of ISA extensions being supported only on some
         | cores in the system.
        
           | Aardwolf wrote:
           | > Intel is not dropping support for it on a lot of CPUs.
           | 
            | There are zero current-generation consumer CPUs from either
            | Intel or AMD that have it.
           | 
           | > The only thing they've done is disable it in the hybrid
           | Alder Lake cores
           | 
           | Which happen to be _all_ the current generation Intel CPUs
        
           | beached_whale wrote:
            | Ah, headlines foiled me. I read it as disabled in Alder Lake
            | altogether.
        
             | temac wrote:
              | It is disabled in all consumer Alder Lake chips (and I
              | don't remember if there will be Xeons of that generation
              | with P-cores only -- IIRC Intel stopped the AVX-512
              | validation late on those cores, but before it was formally
              | finished, so probably not). At one point it worked with
              | some BIOSes on P-core-only chips, or if you disabled the
              | E-cores on hybrid ones, but with up-to-date Intel
              | microcode it does not work anymore.
        
           | coder543 wrote:
           | > The only thing they've done is disable it in the hybrid
           | Alder Lake cores
           | 
           | That is incorrect. You can buy Alder Lake CPUs that only have
           | one type of core (the i3 series only has P-cores, for
           | example), and those do not support AVX-512 either. They're
           | not "hybrid" in any way.
           | 
           | Some of their motherboard partners initially allowed you to
           | access AVX-512, but Intel has put a stop to this and the
           | feature is disabled on _all_ Alder Lake CPU SKUs, period.
           | 
           | Newer Alder Lake chips have AVX-512 fused off entirely:
           | https://www.tomshardware.com/news/intel-nukes-alder-lake-
           | avx...
           | 
           | > Intel is not dropping support for it on a lot of CPUs.
           | 
           | That seems like a pretty questionable statement. Intel might
           | keep AVX-512 around for Xeon, but it seems extremely dead on
           | the consumer market. If Intel decides to bring it back for
           | the next generation, that would be strange and very poor
           | planning.
        
             | gpderetta wrote:
              | It seems likely that the reason is that some Intel
              | customers are willing to pay a significant premium for the
              | feature, and Intel doesn't want it to be available for
              | cheap.
         | nomel wrote:
         | Well, if it means more cores, it's almost certainly worth it,
         | in the grand scheme of things.
        
       | bfrog wrote:
       | At what point is JSON not the right option? Surely when trying to
       | do this sort of thing?
       | 
       | At what point is it saner to use something like flatbuffers or
       | capnproto style message encoding instead.
        
         | smabie wrote:
          | Often you do not get the choice of whether you want to be
          | parsing JSON or not.
        
         | vardump wrote:
         | Sometimes you just don't have a choice when you need to
         | interface with a third party data feed or software.
         | 
         | Isn't it better to have all options open?
        
         | avg_dev wrote:
          | Good thought. If you are coding in C++ maybe you can use some
          | sort of binary serialization. Even in other languages, if
          | JSON parsing is a bottleneck it can possibly be optimized away
          | through use of a binary wire format. That said, vector
          | operations available to programmers are always a welcome
          | thing, I'd say. And who knows how much production JSON parsing
          | this library really does; it could be a ton.
          | 
          | I'm torn. I've worked at shops where we aim over time to reduce
          | response time while serving business logic and using
          | statistical models that get iterated on. Even there I haven't
          | seen a blatant need for non-JSON RPC. But I know my experience
          | doesn't mirror everyone's. And I like seeing and learning about
          | instruction sets. I'm currently taking a course in parallel
          | computing, and I just used AVX2 for the first time in a toy
          | program to subtract one vector from another in a single
          | instruction, which, while not particularly useful, is a window
          | into more interesting things and is still SIMD.
          | 
          | I think on the whole making JSON parsing faster for a large
          | enough fraction of processors is probably a huge win for the
          | environment. But who is parsing JSON in C++?
        
           | ollien wrote:
           | > But who is parsing json in C++?
           | 
           | Well, Facebook for one! Folly has lots of utilities for this
           | (see folly::dynamic[1]). We make extensive use of this at my
           | (non-Facebook) job.
           | 
           | [1] https://github.com/facebook/folly/blob/master/folly/docs/
           | Dyn...
        
       | timerol wrote:
       | > Of course, to get these new benefits, you need recent Intel
       | processors with adequate AVX-512 support and, evidently, you also
       | need relatively recent C++ processors. Some of the recent laptop-
       | class Intel processors do not support AVX-512 but you should be
       | fine if you rely on AWS and have big Intel nodes.
       | 
       | What is meant by "relatively recent C++ processors"? Is that
       | supposed to be "compilers"?
        
         | [deleted]
        
         | Narishma wrote:
         | It's supposed to be Intel, not C++.
        
       | NegativeLatency wrote:
       | "new" is relative since they've been out for almost 10 years:
       | https://www.intel.com/content/www/us/en/developer/articles/t...
        
         | jeffbee wrote:
          | This code uses VBMI2, which came out quite recently.
        
       ___________________________________________________________________
       (page generated 2022-05-26 23:00 UTC)