[HN Gopher] Faster CRC32 on the Apple M1
       ___________________________________________________________________
        
       Faster CRC32 on the Apple M1
        
       Author : panic
       Score  : 251 points
       Date   : 2022-05-22 15:36 UTC (7 hours ago)
        
 (HTM) web link (dougallj.wordpress.com)
 (TXT) w3m dump (dougallj.wordpress.com)
        
       | terrelln wrote:
       | Could you combine both techniques to run both the SIMD version on
       | some chunks and the crc32 instruction on other chunks, in
       | parallel? Of course this would only work if they execute on
       | different ports.
        
         | dougall wrote:
         | Hmm, yeah, this might work out... Two SIMD uops process 16
         | bytes, so each SIMD uop is doing eight bytes of work - the same
         | as CRC32X, but with more frontend pressure (and preferable
         | because they can run on any of the four SIMD ports, not just
         | the one distinct CRC32X port).
         | 
         | It gets a bit messy, and we can't expect a ton from this
          | approach - the same loop with only the loads runs at just
          | ~86GB/s, but it'd be worth a shot.
        
         | AlotOfReading wrote:
         | Yes, and ZLib provides an excellent example of how to do this
         | (it's called braiding in the source code). There's an
         | additional initialization and combination cost though. It
         | doesn't really make sense for short messages.
        
           | terrelln wrote:
           | Thanks for the pointer, will have to take a look!
        
         | sgtnoodle wrote:
         | It seems like CRC inherently depends on results from earlier
         | calculations, so it would be hard to parallelize like that. You
         | could potentially do multiple independent CRC calculations in
         | parallel, but then you're getting into more niche use cases.
        
           | aaaaaaaaaaab wrote:
           | Wrong. CRC is just polynomial division, which is simple to do
           | in a divide and conquer fashion. It's pretty easy to derive
           | CRC(A concat B) from CRC(A) and CRC(B). It needs a
           | multiplication and a XOR.
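The combine step described above can be sketched in Python. This is a port of zlib's crc32_combine() in its older GF(2)-matrix formulation (newer zlib uses a carry-less multiply, matching the "multiplication and a XOR" description); it is an illustration of the math, not a fast implementation:

```python
import zlib

GF2_DIM = 32
POLY = 0xEDB88320  # reflected CRC-32 polynomial, as used by zlib

def gf2_matrix_times(mat, vec):
    # multiply a GF(2) 32x32 matrix (list of row ints) by a 32-bit vector
    total = 0
    i = 0
    while vec:
        if vec & 1:
            total ^= mat[i]
        vec >>= 1
        i += 1
    return total

def gf2_matrix_square(square, mat):
    # square a GF(2) matrix: square = mat * mat
    for n in range(GF2_DIM):
        square[n] = gf2_matrix_times(mat, mat[n])

def crc32_combine(crc1, crc2, len2):
    # derive CRC(A + B) from CRC(A), CRC(B), and len(B):
    # feed len2 zero bytes through crc1's register, then XOR in crc2
    even = [0] * GF2_DIM
    odd = [0] * GF2_DIM
    odd[0] = POLY                 # operator for one zero bit
    row = 1
    for n in range(1, GF2_DIM):
        odd[n] = row
        row <<= 1
    gf2_matrix_square(even, odd)  # two zero bits
    gf2_matrix_square(odd, even)  # four zero bits
    while len2:
        gf2_matrix_square(even, odd)   # keep squaring the operator
        if len2 & 1:
            crc1 = gf2_matrix_times(even, crc1)
        len2 >>= 1
        if not len2:
            break
        gf2_matrix_square(odd, even)
        if len2 & 1:
            crc1 = gf2_matrix_times(odd, crc1)
        len2 >>= 1
    return crc1 ^ crc2

a, b = b"hello ", b"world"
assert crc32_combine(zlib.crc32(a), zlib.crc32(b), len(b)) == zlib.crc32(a + b)
```

The squaring trick makes the zero-byte extension O(log len2) matrix operations, which is why CRCs of independently processed chunks can be stitched together cheaply.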
        
       | 13of40 wrote:
       | Weird, I talked to the guy who invented the crypto scheme for ZIP
       | and he said he invented the CRC algorithm for it as well. I
       | wonder if there's more backstory there.
       | 
       | Edit: He invented the crypto not the CRC, which Phil Katz was
       | already using.
        
         | AlotOfReading wrote:
         | That's a very strange claim. As far as I know, the same CRC has
         | been used for ZIP since it was invented by the PKWARE guys in
         | the 80s. Moreover, no one deeply understood the properties and
         | tradeoffs of various CRCs the way we do now, so everyone used
         | largely identical algorithms with only trivial variations. Phil
         | Katz did the same and reused the same polynomial that was in
         | ethernet and dozens of other standards, which in turn had
         | originated from this 1975 report:
         | https://apps.dtic.mil/sti/pdfs/ADA013939.pdf
         | 
         | He wasn't even the first to put a CRC in an archive format, as
         | the predecessor format ARC had a CRC-16 doing the same thing.
        
           | carbonbee wrote:
           | I think what OP meant to write: the zip encryption algorithm
           | is a custom stream cypher that uses crc32 as the main
           | building block.
           | 
           | (It's a very bad cypher, vulnerable to known plaintext and
           | other attacks, don't use it for anything except light
           | scrambling).
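For reference, the "crc32 as the main building block" part looks roughly like this -- a sketch of the traditional PKZIP cipher as described in PKWARE's APPNOTE.TXT (the key constants and update rules come from that spec; only round-trip tested here, and as noted above it should not be used for real secrecy):

```python
import zlib

MASK = 0xFFFFFFFF

def crc32_update(reg, byte):
    # raw CRC-32 register update (no pre/post conditioning),
    # built on zlib.crc32 by undoing its init/xorout conditioning
    return zlib.crc32(bytes([byte]), reg ^ MASK) ^ MASK

class ZipCrypto:
    # sketch of ZipCrypto: CRC-32 drives two of the three 32-bit keys
    def __init__(self, password):
        self.k0, self.k1, self.k2 = 0x12345678, 0x23456789, 0x34567890
        for c in password:
            self._update_keys(c)

    def _update_keys(self, c):
        self.k0 = crc32_update(self.k0, c)
        self.k1 = (self.k1 + (self.k0 & 0xFF)) & MASK
        self.k1 = (self.k1 * 134775813 + 1) & MASK
        self.k2 = crc32_update(self.k2, self.k1 >> 24)

    def _stream_byte(self):
        t = (self.k2 | 2) & 0xFFFF
        return ((t * (t ^ 1)) >> 8) & 0xFF

    def encrypt(self, data):
        out = bytearray()
        for c in data:
            out.append(c ^ self._stream_byte())
            self._update_keys(c)   # keys advance on the *plaintext* byte
        return bytes(out)

    def decrypt(self, data):
        out = bytearray()
        for c in data:
            p = c ^ self._stream_byte()
            out.append(p)
            self._update_keys(p)
        return bytes(out)

msg = b"attack at dawn"
ct = ZipCrypto(b"hunter2").encrypt(msg)
assert ZipCrypto(b"hunter2").decrypt(ct) == msg
```

Because the keystream state is driven by such simple (and linear, in the CRC parts) updates, known plaintext lets an attacker recover the internal keys, which is the classic Biham-Kocher attack.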
        
             | mappu wrote:
             | Nowadays most zip programs will default to using AES (in
             | the way WinZip invented) instead of ZipCrypto.
        
           | 13of40 wrote:
           | Ah, OK, I looked in my email and it was the guy who invented
           | the encryption scheme, but he said "Yes, I invented it
           | [referring to the encryption]. It wasn't based on anything
           | else, except that it used the same CRC he [Phil] was already
           | using in zip." (As a historical note, he also said the crypto
           | scheme was intended to be exportable, which at that time
           | meant "intentionally weak".)
        
       | IncRnd wrote:
       | It's always important to know which crc you are using.
       | 
       | Looking at
       | https://developer.arm.com/documentation/ddi0596/2020-12/Base...,
       | based upon the polynomial constant, the CRC32 class of
       | instructions appear to calculate a CCITT 32 reversed polynomial.
       | Are there any ARM developers who can help me out here? Does this
       | apply in the same way to the M1?
        
         | dougall wrote:
         | The post glosses over it a bit - the CRC32X instruction always
         | uses the common polynomial 0x04C11DB7 (matching zlib, commonly
         | just called CRC-32), and there's a second instruction, CRC32CX,
         | which is the same but uses the polynomial 0x1EDC6F41, known as
         | CRC-32C (Castagnoli). The constants in the post are also for
          | 0x04C11DB7, but the linked Intel article explains how they can
          | be calculated for arbitrary polynomials, so the faster method
          | is also generic, which is nice.
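Generating those constants boils down to computing x^n mod P(x) over GF(2). A minimal non-reflected sketch (`0x104C11DB7` is P(x) with the implicit x^32 term written out; the constants in the post are the bit-reflected form of values like these):

```python
def xn_mod_p(n, poly=0x104C11DB7):
    # compute x^n mod P(x) in GF(2)[x] for a degree-32 polynomial:
    # multiply by x one step at a time, reducing whenever bit 32 is set
    r = 1
    for _ in range(n):
        r <<= 1
        if r & (1 << 32):
            r ^= poly
    return r

# x^32 mod P(x) is just the low 32 bits of P(x)
assert xn_mod_p(32) == 0x04C11DB7
```

A real constant generator would use the distances folded per iteration (e.g. x^(512+32), x^(128+32), ...) for `n`, per the Intel paper.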
        
       | DeathArrow wrote:
       | I wonder how fast can someone get it to run on Intel's 12
       | generation core CPU.
       | 
       | It seems a good idea to start a Code Golf competition.
        
         | fwsgonzo wrote:
         | https://github.com/komrad36/CRC
         | 
         | I measured it on my computer to run at 32GB/s, i7 4770k.
        
         | jeffbee wrote:
         | I don't think they will hit 75GB/s because Intel doesn't make
         | that kind of memory throughput available to a single core.
        
         | dragontamer wrote:
         | https://www.intel.com/content/dam/www/public/us/en/documents...
        
         | Twirrim wrote:
         | Likely as much noise as signal, but anyway:
         | 
         | I'm using an 8th(?) generation Intel, i7-8665U.
         | 
         | https://github.com/htot/crc32c has some interesting
         | implementations of CRC32 algorithms of different speeds, the
         | highest I see is (function, aligned, bytes, MiB/s) :
          | crc32cIntelC   true    16   3907.613
          | crc32cIntelC   true    64  15096.758
          | crc32cIntelC   true   128  24692.803
          | crc32cIntelC   true   192  22732.392
          | crc32cIntelC   true   256  16233.397
          | crc32cIntelC   true   288  16748.952
          | crc32cIntelC   true   512  19862.039
          | crc32cIntelC   true  1024  22373.350
          | crc32cIntelC   true  1032  22482.031
          | crc32cIntelC   true  4096  24690.531
          | crc32cIntelC   true  8192  24992.827
         | 
         | So pushing 25GiB/s on a 3ish year old CPU.
        
           | wtallis wrote:
           | FYI, CRC32C is not the same checksum as CRC32. CRC32C is used
           | in iSCSI, btrfs, ext4. CRC32 is used in Ethernet, SATA, gzip.
           | Intel's SSE4.2 provides an instruction implementing CRC32C
           | but not CRC32. ARM defines instructions for both.
           | 
           | This article seems to miss that distinction but appears to be
           | testing CRC32, so it's not quite correct to compare against
           | something using Intel's CRC32C instruction.
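The distinction is purely the polynomial; a bit-at-a-time sketch makes it concrete (0xEDB88320 and 0x82F63B78 are the reflected CRC-32 and CRC-32C polynomials, and the values asserted are the standard "check" values for the input "123456789"):

```python
import zlib

def crc32_generic(data, poly):
    # reflected, bit-at-a-time CRC with the usual init/xorout of 0xFFFFFFFF;
    # only the polynomial distinguishes CRC-32 from CRC-32C
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (poly if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

msg = b"123456789"
assert crc32_generic(msg, 0xEDB88320) == zlib.crc32(msg) == 0xCBF43926  # CRC-32
assert crc32_generic(msg, 0x82F63B78) == 0xE3069283                     # CRC-32C
```

So a CRC32C benchmark and a CRC32 benchmark exercise the same structure of computation, but the resulting checksums are incompatible.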
        
           | neurostimulant wrote:
           | I'm using i7-4790 (7 year old cpu) and the numbers are
           | slightly better. Maybe because it's a desktop.
            | crc32cIntelC   true    16   4025.334
            | crc32cIntelC   true    64  15749.095
            | crc32cIntelC   true   128  26608.064
            | crc32cIntelC   true   192  25828.486
            | crc32cIntelC   true   256  17448.436
            | crc32cIntelC   true   288  18336.381
            | crc32cIntelC   true   512  22635.590
            | crc32cIntelC   true  1024  24654.248
            | crc32cIntelC   true  1032  24180.107
            | crc32cIntelC   true  4096  28251.903
            | crc32cIntelC   true  8192  28768.134
        
           | jeffbee wrote:
            | Core i7-12700K:
            | crc32cIntelC   true    64  26210.561
            | crc32cIntelC   true   128  35870.309
            | crc32cIntelC   true   192  36850.224
            | crc32cIntelC   true   256  30343.690
            | crc32cIntelC   true   288  30671.327
            | crc32cIntelC   true   512  32443.251
            | crc32cIntelC   true  1024  34654.719
            | crc32cIntelC   true  1032  34265.440
            | crc32cIntelC   true  4096  38111.089
            | crc32cIntelC   true  8192  38634.925
        
       | DantesKite wrote:
       | I really like that the author gave some context at the top. So
       | many times I struggle to read or realize the importance of a
       | concept because there just isn't enough context for me to follow
       | along.
       | 
       | And certainly not all blogs have to, but it's nice when it is.
        
       | parentheses wrote:
       | It is very interesting that since the release of the M1 chip, CPU
       | performance on Apple silicon has really come under a microscope.
       | It leads me to ask:
       | 
       | Was Apple silicon always this best-in-class and we weren't
       | looking this closely as a community?
        
         | mhh__ wrote:
         | You couldn't buy one without a screen, or run arbitrary code on
         | them.
        
         | minhazm wrote:
         | Apple's chips in their iPhones & iPads have been outperforming
         | the competition (Qualcomm & Samsung) for a long time now in
         | both power efficiency and performance. Apple has usually been
         | around 2 yrs ahead of the competition. The Qualcomm Snapdragon
         | 888 chip with 8 cores has a Geekbench multi-core score of
          | 3592[1]. The six-core Apple A15 Bionic scored 4673. The Apple
         | chip has ~30% better multi-core performance with 25% fewer
         | cores than the Qualcomm chip. In single-core performance the
         | difference is even larger, with the Apple chips performing ~47%
         | faster. You can get the same A15 Bionic in both the $429 iPhone
         | SE and the $1100+ iPhone 13 Pro Max.
         | 
         | It hasn't really been a huge deal though because people don't
         | develop directly on an iPhone, so it doesn't affect their every
          | day productivity all that much. Also, phones reached the point
          | of "fast enough" a few years ago; it's hard to tell the
         | difference between an iPhone 13 Pro and an 11 Pro unless you
         | use them side by side. But with the release of the M1 chip,
         | people are getting the performance & energy efficiency gains in
         | their every day workflows.
         | 
         | [1]. https://browser.geekbench.com/android-benchmarks
         | 
         | [2]. https://browser.geekbench.com/ios-benchmarks
        
         | skavi wrote:
         | Yes, and anyone who had been reading AnandTech's excellent
         | mobile reviews (by Andrei Frumusanu) knew what M1 would bring.
        
         | Aissen wrote:
         | Everyone was looking, and it was well known that it was best
         | in-class; two random examples:
         | https://www.anandtech.com/show/7335/the-iphone-5s-review/4
         | 
         | https://twitter.com/codinghorror/status/912047023871860737
        
       | ntoskrnl wrote:
       | So ARM64 has dedicated instructions for CRC32, but implementing
       | it by hand using SIMD is still faster. Score another point for
       | RISC.
        
         | pclmulqdq wrote:
         | It is not faster to use SIMD by hand - it is faster to use the
         | vector unit alongside the integer unit, using both paths at the
         | same time.
        
         | dragontamer wrote:
         | SIMD is a very powerful parallelization technique, with
         | marvelous gains whenever I see it used. It seems like a
         | fundamentally more efficient form of compute, but is very
         | difficult to design algorithms for.
         | 
         | I'd argue against "SIMD" as being "RISC", since you need all
         | sorts of complicated instructions (ex: gather/scatter) to
         | really support the methodology well in practice.
        
           | tremon wrote:
           | But scatter/gather is a primitive operation for SIMD, so if
           | you want a RISC-based version of it, that's exactly what you
           | would provide. Having dedicated instructions for specific
           | operations (whether for crc/aes/nnp or whatever) feels like a
           | CISC-based approach, so I think I agree with the GP.
           | 
           | RISC vs CISC is about the simplicity of the instruction set,
           | not about whether it's easy to use.
        
             | mhh__ wrote:
             | These days I'd argue risc vs cisc is more about regularity
              | and directness than the size of the ISA per se.
             | 
             | I'd argue AArch64 isn't particularly RISC by the standards
             | of the past but it sets the bar and tone for RISC today.
        
               | dragontamer wrote:
               | And which SIMD instruction set should we be talking
               | about? NEON-instructions or with the SVE instruction set?
               | 
               | And if we're talking about multiple instruction-sets
               | designed for the same purpose, is this thing really RISC
               | anymore? Or do you really mean "just not x86" when you
               | say RISC ??
        
               | mhh__ wrote:
               | That depends on how precisely you define the purpose.
               | NEON and SVE seem to be aimed at different intensities of
               | work.
        
             | dragontamer wrote:
             | > But scatter/gather is a primitive operation for SIMD
             | 
             | Not in NEON, and therefore not in M1. AVX512 and SVE add
             | scatter/gather instructions.
             | 
             | Intel/AMD's AVX has vgather instructions, but is missing
             | vscatter until AVX512.
             | 
             | > Having dedicated instructions for specific operations
             | (whether for crc/aes/nnp or whatever) feels like a CISC-
             | based approach, so I think I agree with the GP.
             | 
             | Not only are there AES instructions on ARM, but there's
             | also SHA-instructions. The mix-columns step of AES more or
             | less demands dedicated hardware if you want high-speed
             | today, so everybody implements that as a hardware specific
             | instruction.
        
         | tlb wrote:
            | Similarly, I wish that on x86, REP MOVSB was the fastest way to
         | copy memory. Because it only takes a few bytes in the icache.
         | But fast memcpys end up being hundreds of bytes, to work with
         | larger words while handling start and end alignment.
        
           | userbinator wrote:
           | It still is in general situations (i.e. not the
           | microbenchmarks where the ridiculously bloated unrolled
           | "optimised" implementations may have a very slight edge.) I
           | believe the Linux kernel uses it for this reason.
        
             | jabl wrote:
             | The kernel is a bit of a special case since very likely a
             | syscall starts off with a cold I$, and also there's a lot
             | of extra overhead if you insist on using SIMD registers.
             | 
             | In general I agree with you though, optimizing memcpy
             | implementations only against microbenchmarks is dumb.
        
           | saagarjha wrote:
           | With ERMS it's definitely not going to be slow, so it's a
           | good choice when you're in a constrained environment (high
           | instruction cache pressure, can't use vector instructions).
        
         | userbinator wrote:
         | That's very shortsighted thinking. The dedicated instruction
         | could be optimised by the hardware in a future revision to
         | become much faster.
        
         | MichaelZuo wrote:
         | It's very impressive someone messing around for a few hours
         | could get the m1 chip to more than 2x the performance. Easy
         | gains like that really shouldn't be possible, assuming Apple's
         | silicon team are competent. Maybe there's some hidden gotcha
         | here?
        
           | dzaima wrote:
           | CRC32X works on 8 bytes at a time and has a throughput of one
           | invocation per cycle, whereas the SIMD operates on blocks in
           | parallel (the chromium code does 64 bytes an iteration, with
           | a lot of instruction-level parallelism too). Theoretically M1
           | could have thrown more silicon at it to allow more than one
           | CRC32X invocation per cycle, but that's not very useful if
           | you can achieve the same with SIMD anyway.
        
           | Sirened wrote:
           | This is way more common than you'd think, and it's not by
           | accident. Engineering teams optimize the paths that are
           | heavily used to get the biggest improvement across the
           | platform as a whole. CRC32X is certainly not as heavily used
           | as NEON and so if you're forced to decide between spending
           | area on being able to fuse extra instructions for NEON and
           | slightly improving throughput for CRC32X, the obvious choice
           | is NEON. You see this way more obviously on Intel's x86-64
           | cores where many of the highly used instructions are fast
           | path decoded but some of the weirder CISC instructions that
           | nobody really uses are offloaded to very slow microcode.
        
             | bee_rider wrote:
                | I wonder -- could CRC32X be something that is,
                | specifically, not as interesting for Apple? They are mostly
             | optimizing for desktop workloads. I wonder if worrying
             | about checksuming, especially maximizing the throughput of
             | checksum operations, is more of a server thing. (Like we
             | have to checksum when we download things on desktop, but
             | that's a one-off, and I guess things get checksummed in the
             | filesystem, but even the nice NVME drives are pretty slow
             | from the CPUs point of view).
        
               | astrange wrote:
                | Mach-O uses code signing with ad-hoc signatures as a form of
               | checksumming, and there's also TCP and whatever drives
               | do. So it's the converse, it's so common the dedicated
               | hardware does it instead of the CPU. And it's not all the
               | same algorithm.
        
           | AlotOfReading wrote:
           | Intel's algorithm is very clever and would take a lot of
           | space to implement in hardware. The implementation underlying
           | the CRC32** instructions is probably some set of shift
           | registers. That's a pretty good space/speed tradeoff to make.
           | 
           | My largely uninformed guess is that they added the
           | instructions to get fast CRCs for the filesystem 'for free'.
           | There aren't many other cases where software CRC can be a
           | bottleneck that also use these polynomials.
        
           | interestica wrote:
           | Saving it for M2 to have something to show off?
        
           | nicoburns wrote:
            | I feel like CRC32 may be simple enough (and close enough to
            | the kinds of operations, like adding and bit-shifting, that
            | general-purpose CPUs are good at anyway) that perhaps it
            | doesn't benefit as much from dedicated silicon as other
            | algorithms would.
        
           | [deleted]
        
           | naniwaduni wrote:
           | > Easy gains like that really shouldn't be possible,
           | 
           | Easy gains are _everywhere_. The  "gotcha", if you can call
           | it that, is that optimizing particular operations comes with
           | space tradeoffs that are more expensive when you do them in
           | hardware.
        
         | d_tr wrote:
         | I am not taking any hard stance on the usefulness of the
         | specialized instruction, but M1 is a very wide and powerful
         | core, so this won't be true everywhere.
         | 
         | The single instruction might also be more power-efficient and
         | keep other resources free for other stuff.
        
           | brigade wrote:
           | Specifically, the M1's NEON ALUs can handle 64 bytes per
           | cycle on the big cores. Most Cortex can only do 32 bytes per
           | cycle (Cortex X1/X2 is the exception), and the lower end
           | designs are only 16B/cycle. Plus the pmull+eor fusion on M1
           | increases effective throughput by 33%, and I don't know if
           | that's implemented on any Cortex.
           | 
           | Without those, the NEON version would fall to the same
           | throughput as one CRC32X per cycle, or half the throughput on
           | cores with 2x64bit ALUs.
        
           | athrowaway3z wrote:
           | Also not taking hard stances, but both cases are suspect.
           | Power efficiency because being 3 times faster means you're
           | done 3 times earlier. Keeping other resources free because I
           | suspect a CRC calculation is generally followed by an `if eq`
            | statement. (Even with out-of-order or speculative execution,
            | this creates a bottleneck that is nice to remove 3x faster.)
        
             | stingraycharles wrote:
             | If you're writing optimized code, hardly ever would you
             | evaluate one CRC check at a time. You would process them in
             | chunks, as the OP stated, but would just let a compiler do
             | the auto-vectorization.
             | 
             | This is even more true in the case of CRC, where there's
             | clearly almost always one branch that wins: this is perfect
             | for branch prediction, which would mean the whole "if eq"
             | condition is preemptively skipped.
        
               | saagarjha wrote:
               | The compiler probably isn't going to be able to
               | autovectorize a CRC unless you help it out.
        
             | dottrap wrote:
             | I think power efficiency has a lot more variables now so it
             | is not easy to know if consumption is linear with time.
             | CPUs now dynamically throttle themselves, plus now Apple
             | has advertised that its M1 cores are divided up between
              | high-performance and high-efficiency cores, let alone how
              | the underlying chip itself may consume power
             | differently for implementing different instructions.
             | 
             | So for a hypothetical example, it could be that using
             | general purpose SIMD triggers the system to throttle up the
             | CPUs and/or move to the high performance CPUs, whereas the
             | dedicated CRC instructions might exist on the high-
             | efficiency cores and not trigger any throttling.
             | 
             | I've forgotten all my computer architecture theory, but if
             | I look back at Ohm's law and look at power, the equation is
             | P = I^2 * R. Handwaving from my forgotten theory a bit
             | here, ramping up the CPUs increases current, and we see
              | that it is a squared factor. So while cutting the time by,
              | say, a factor of 3 does mean you are done 3 times faster
              | (a linear component), you still have to contend with a
              | squared component in current, which may have increased.
             | 
             | I have no clue if the M1 actually does any of this, but
             | merely stating that it is not obvious what is happening in
             | terms of power efficiency. We've seen other examples of
              | this. For example, I've read that Intel's AVX family of
              | instructions generally increases the power consumption and
             | frequency of when utilized, but non-obviously, it often
             | runs at a lower frequency when in 256 or 512 wide forms
             | compared to the lesser widths (which then requires more
             | work on the developer to figure out what is the optimal
             | performance path as wider isn't necessarily faster). And as
             | another example, when Apple shipped 2 video cards in their
             | Macbooks, some general purpose Mac desktop application
             | developers who cared about battery life were tip-toeing
             | around different high level Apple APIs (e.g. Cocoa, Core
             | Animation, etc.) because some APIs under the hood
             | automatically triggered the high performance GPU to switch
             | on (and eat power), while these general purpose desktop
             | applications didn't want or need the extra performance (at
             | the cost of eating the user's battery).
        
               | saagarjha wrote:
               | > whereas the dedicated CRC instructions might exist on
               | the high-efficiency cores
               | 
               | M1 has a heterogeneous ISA, FWIW.
        
               | astrange wrote:
               | Homogenous? The P and E cores have all the same
               | instructions. You won't get suddenly moved from one to
               | the other or hit emulations.
        
               | saagarjha wrote:
               | Oops, yes, that's what I meant. Thanks for catching that!
        
         | rowanG077 wrote:
         | It's not really a fair comparison. There is only one CRC32 unit
          | which means it can't make use of superscalar execution (at
          | least if I understand the article correctly). If it had more
          | CRC32 units, that would be the most efficient.
        
         | adrian_b wrote:
         | No, the faster implementation uses another dedicated
         | instruction, which happens to be more general than CRC32, i.e.
         | the multiplication of polynomials having binary coefficients.
         | 
         | So this has little to do with RISC, except the general
         | principle that the instructions that are used more frequently
         | should be implemented to be faster, a principle that has been
         | used by the M1 designers and by any other competent CPU
         | designers.
         | 
         | In this case, ARM has added the polynomial multiplication
         | instruction a few years after Intel, with the same main purpose
         | of accelerating the authenticated encryption with AES. There is
         | little doubt that ARM was inspired by the Intel Westmere new
         | instructions (announced by Intel a few years before the
         | Westmere launch in 2010).
         | 
         | The dedicated CRC32 instruction could have been made much
         | faster, but the designers of the M1 core did not believe that
         | this is worthwhile, because that instruction is not used often.
         | 
         | The polynomial multiplication is used by many more
         | applications, because it can implement CRC computations based
          | on any polynomial, not only the one specified for CRC32, and it
         | can also be used in a great number of other algorithms that are
         | based on the properties of the fields whose elements are
         | polynomials with binary coefficients.
         | 
         | So it made sense to have a better implementation for the
         | polynomial multiplication, which allows greater speeds in many
         | algorithms, including the CRC computation.
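The operation in question -- what ARM's PMULL and Intel's PCLMULQDQ compute -- is just multiplication of binary polynomials, i.e. integer long multiplication with XOR in place of addition. A toy model:

```python
def clmul(a, b):
    # carry-less multiply: product of two polynomials over GF(2),
    # the primitive behind PMULL (ARM) and PCLMULQDQ (x86)
    r = 0
    while b:
        if b & 1:
            r ^= a   # XOR instead of add: no carries propagate
        a <<= 1
        b >>= 1
    return r

# (x + 1) * (x + 1) = x^2 + 1 over GF(2), whereas integer 3 * 3 = 9
assert clmul(0b11, 0b11) == 0b101
```

CRC folding works by carry-less multiplying message chunks with precomputed powers of x modulo the CRC polynomial, then XOR-accumulating the products, which is why one generic instruction covers any polynomial.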
        
           | saagarjha wrote:
           | Amusingly the Rosetta runtime uses crc32x
        
           | IshKebab wrote:
           | That sounds like a point for RISC to me?
           | 
           | Ok maybe it is just a point against really complex
           | instructions. There's clearly an optimum middle ground.
        
             | DonHopkins wrote:
             | Mary Payne designed the VAX floating point POLY instruction
             | at DEC.
             | 
             | But the microcode and hardware floating point
             | implementations did it slightly differently. Then the
             | MicroVAX dropped it, then picked it up again but wrong,
             | then fixed it, then lost it again.
             | 
             | http://simh.trailing-edge.com/docs/vax_poly.pdf
             | 
             | https://documentation.help/VAX11/op_POLY.htm
             | 
             | https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_o
             | p...
             | 
             | >Multiply-accumulate operation
             | 
             | >The Digital Equipment Corporation (DEC) VAX's POLY
             | instruction is used for evaluating polynomials with
             | Horner's rule using a succession of multiply and add steps.
             | Instruction descriptions do not specify whether the
             | multiply and add are performed using a single FMA step.
             | This instruction has been a part of the VAX instruction set
             | since its original 11/780 implementation in 1977.
             | 
             | https://news.ycombinator.com/item?id=20558618
             | 
             | >VAX was a crazy town ISA too, with stuff like single
              | instruction polynomial evaluation.
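Horner's rule itself is tiny -- roughly the multiply-add chain POLY performed per step in microcode (a sketch of the evaluation scheme only, not the exact VAX semantics, which also specified the rounding of each step):

```python
def horner(coeffs, x):
    # evaluate c[0] + c[1]*x + ... + c[n]*x^n as a chain of
    # multiply-add steps, starting from the highest coefficient
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

assert horner([1, 2, 3], 2.0) == 17.0  # 1 + 2*2 + 3*4
```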
        
             | astrange wrote:
             | Complex instructions are often good ideas. They're best at
             | combining lots of bitshifting (hardware is good at that and
             | it can factor things out) but even for memory ops it can be
             | good (the HW can optimize them by knowing things like cache
             | line sizes).
             | 
             | They get a bad rap because the only really complex ISA left
             | is x86 and it just had especially bad ideas about which
             | operations to use its shortest codes on. Nobody uses BOUND
             | to the point some CPUs don't even include it.
             | 
             | One point against them in SIMD is there definitely is an
             | instruction explosion there, but I haven't seen a
             | convincing better idea, and I think the RISC-V people's
             | vector proposal is bad and shows they have serious knowing
             | what they're talking about issues.
        
               | adgjlsfhk1 wrote:
               | what's wrong with the riscv approach?
        
               | astrange wrote:
               | They want to go back to the 70s (not meant as an insult)
               | and use Cray-style vector instructions instead of SIMD.
               | Vector instructions are kind of like loops in one
               | instruction; you set a vector length register and then
               | all the vector instructions change to run on that much
               | data.
               | 
               | That's ok for big vectors you'd find in like scientific
               | computations, but I believe it's bad for anything I've
               | ever written in SIMD. Games and multimedia usually use
               | short vectors (say 4 ints) or mixed lengths (2 and 4 at
               | once) and are pretty comfortable with how x86/ARM/PPC
               | work.
               | 
               | Not saying it couldn't be better, but the RISCV designers
               | wrote an article about how their approach would be better
               | basically entirely because they thought SIMD adds too
               | many new instructions and isn't aesthetically pretty
               | enough. Which doesn't matter.
               | 
               | Also I remember them calling SIMD something weird and
               | politically incorrect in the article but can't remember
               | what it was...
        
               | adgjlsfhk1 wrote:
               | To me, the much bigger problem with SIMD is that it is
               | really hard to program for since vector sizes keep
               | getting bigger, and it really doesn't look good at all
               | for big-little designs which seem to be the future. Most
               | compilers still aren't good at generating AVX-512 code,
               | and very few applications use it because it requires
               | optimization for a very small portion of the market. With
               | a variable length vector, everyone gets good code.
               | 
               | Also, I'm not clear why you think riscv style vector
               | instructions will perform worse on 2-4 length vectors.
        
               | devit wrote:
               | The RISC-V design is the sensible one since it does not
               | hardcode the vector register size; the other designs are
               | idiotic since a new instruction set needs to be created
               | every time the vector register size is increased
               | (MMX->SSE2->AVX2->AVX-512) and you can't use shorter
               | vector registers on lower-power CPUs without reducing
               | application compatibility.
               | 
               | And it doesn't "loop" in one instruction, the vsetvl
               | instruction will set the vector length to the _minimum_
               | of the vector register size and the amount of data left
               | to process.
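The vsetvl behavior described above is easy to model as a strip-mined loop; a conceptual Python sketch, where `vlmax` stands in for whatever vector register length the hardware happens to provide:

```python
def vector_add(a, b, vlmax=8):
    # model of a RISC-V vector loop: each pass asks the hardware for
    # vl = min(remaining, VLMAX), so the same code runs unchanged on
    # machines with different vector register sizes
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        vl = min(n - i, vlmax)   # what "vsetvl" returns
        out[i:i + vl] = [x + y for x, y in zip(a[i:i + vl], b[i:i + vl])]
        i += vl
    return out

assert vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], vlmax=2) == [11, 22, 33, 44, 55]
```

The tail iteration naturally handles the leftover elements, which is the compatibility argument: no separate scalar epilogue, and no recompile when VLMAX grows.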
        
               | atq2119 wrote:
               | Isn't there a sort of obvious "best of both worlds" by
               | having a vector instruction ISA with a vector length
               | register, plus the promise that a length >= 4 is always
               | supported and good support for intra-group-of-4 shuffles?
               | 
               | Then you can do whatever you did for games and multimedia
               | in the past, except that you can process N
               | samples/pixels/vectors/whatever at once, where N = vector
               | length / 4, and your code can automatically make use of
               | chips that allow longer vectors without requiring a
               | recompile.
               | 
               | Mind you, I don't know if that's the direction that the
               | RISC-V people are taking. But it seems like a pretty
               | obvious thing to do.
        
       ___________________________________________________________________
       (page generated 2022-05-22 23:00 UTC)