[HN Gopher] Faster CRC32 on the Apple M1 ___________________________________________________________________ Faster CRC32 on the Apple M1 Author : panic Score : 251 points Date : 2022-05-22 15:36 UTC (7 hours ago) (HTM) web link (dougallj.wordpress.com) (TXT) w3m dump (dougallj.wordpress.com) | terrelln wrote: | Could you combine both techniques to run both the SIMD version on | some chunks and the crc32 instruction on other chunks, in | parallel? Of course this would only work if they execute on | different ports. | dougall wrote: | Hmm, yeah, this might work out... Two SIMD uops process 16 | bytes, so each SIMD uop is doing eight bytes of work - the same | as CRC32X, but with more frontend pressure (which is preferable | because they can run on any of the four SIMD ports, not just | the one distinct CRC32X port). | | It gets a bit messy, and we can't expect a ton from this | approach - the same loop with only the loads runs at | ~86GB/s, but it'd be worth a shot. | AlotOfReading wrote: | Yes, and zlib provides an excellent example of how to do this | (it's called braiding in the source code). There's an | additional initialization and combination cost though. It | doesn't really make sense for short messages. | terrelln wrote: | Thanks for the pointer, will have to take a look! | sgtnoodle wrote: | It seems like CRC inherently depends on results from earlier | calculations, so it would be hard to parallelize like that. You | could potentially do multiple independent CRC calculations in | parallel, but then you're getting into more niche use cases. | aaaaaaaaaaab wrote: | Wrong. CRC is just polynomial division, which is simple to do | in a divide-and-conquer fashion. It's pretty easy to derive | CRC(A concat B) from CRC(A) and CRC(B). It needs a | multiplication and an XOR. | 13of40 wrote: | Weird, I talked to the guy who invented the crypto scheme for ZIP | and he said he invented the CRC algorithm for it as well. I | wonder if there's more backstory there.
| | Edit: He invented the crypto, not the CRC, which Phil Katz was | already using. | AlotOfReading wrote: | That's a very strange claim. As far as I know, the same CRC has | been used for ZIP since it was invented by the PKWARE guys in | the 80s. Moreover, no one deeply understood the properties and | tradeoffs of various CRCs the way we do now, so everyone used | largely identical algorithms with only trivial variations. Phil | Katz did the same and reused the same polynomial that was in | Ethernet and dozens of other standards, which in turn had | originated from this 1975 report: | https://apps.dtic.mil/sti/pdfs/ADA013939.pdf | | He wasn't even the first to put a CRC in an archive format, as | the predecessor format ARC had a CRC-16 doing the same thing. | carbonbee wrote: | I think what OP meant to write: the zip encryption algorithm | is a custom stream cypher that uses crc32 as the main | building block. | | (It's a very bad cypher, vulnerable to known plaintext and | other attacks, don't use it for anything except light | scrambling). | mappu wrote: | Nowadays most zip programs will default to using AES (in | the way WinZip invented) instead of ZipCrypto. | 13of40 wrote: | Ah, OK, I looked in my email and it was the guy who invented | the encryption scheme, but he said "Yes, I invented it | [referring to the encryption]. It wasn't based on anything | else, except that it used the same CRC he [Phil] was already | using in zip." (As a historical note, he also said the crypto | scheme was intended to be exportable, which at that time | meant "intentionally weak".) | IncRnd wrote: | It's always important to know which CRC you are using. | | Looking at | https://developer.arm.com/documentation/ddi0596/2020-12/Base..., | based upon the polynomial constant, the CRC32 class of | instructions appears to calculate a CCITT 32 reversed polynomial. | Are there any ARM developers who can help me out here? Does this | apply in the same way to the M1?
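The combine step described above (deriving CRC(A concat B) from CRC(A) and CRC(B)) can be sketched in Python against zlib's CRC-32. This is an O(len(B)) illustration built on the zero-padding identity; the helper name `crc32_combine` is ours, and zlib's real combine routine does the same job in O(log n) via matrix squaring:

```python
import zlib

def crc32_combine(crc_a: int, crc_b: int, len_b: int) -> int:
    """Combine CRC-32s of byte strings A and B into the CRC-32 of A+B.

    Relies on CRC linearity over GF(2): running len_b zero bytes
    through the register starting from crc_a accounts for B's length,
    and the contributions of A and B then superpose with XOR.
    """
    zeros = b"\x00" * len_b
    # The extra zlib.crc32(zeros) term cancels zlib's pre/post
    # inversion conditioning (XOR with 0xFFFFFFFF on entry and exit).
    return zlib.crc32(zeros, crc_a) ^ zlib.crc32(zeros) ^ crc_b

a, b = b"hello ", b"world"
combined = crc32_combine(zlib.crc32(a), zlib.crc32(b), len(b))
assert combined == zlib.crc32(a + b)
```

This combine step is also what makes the braided/parallel-chunk approaches discussed above work: run independent CRCs over separate chunks, then merge the results.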
| dougall wrote: | The post glosses over it a bit - the CRC32X instruction always | uses the common polynomial 0x04C11DB7 (matching zlib, commonly | just called CRC-32), and there's a second instruction, CRC32CX, | which is the same but uses the polynomial 0x1EDC6F41, known as | CRC-32C (Castagnoli). The constants in the post are also for | 0x04C11DB7, but the linked Intel article explains how they | can be calculated for arbitrary polynomials, so the faster | method is also generic, which is nice. | DeathArrow wrote: | I wonder how fast someone can get it to run on Intel's 12th- | generation Core CPU. | | It seems a good idea to start a Code Golf competition. | fwsgonzo wrote: | https://github.com/komrad36/CRC | | I measured it on my computer to run at 32GB/s, i7 4770k. | jeffbee wrote: | I don't think they will hit 75GB/s because Intel doesn't make | that kind of memory throughput available to a single core. | dragontamer wrote: | https://www.intel.com/content/dam/www/public/us/en/documents... | Twirrim wrote: | Likely as much noise as signal, but anyway: | | I'm using an 8th(?) generation Intel, i7-8665U. | | https://github.com/htot/crc32c has some interesting | implementations of CRC32 algorithms of different speeds; the | highest I see is (function, aligned, bytes, MiB/s):
 |            crc32cIntelC  true    16   3907.613
 |            crc32cIntelC  true    64  15096.758
 |            crc32cIntelC  true   128  24692.803
 |            crc32cIntelC  true   192  22732.392
 |            crc32cIntelC  true   256  16233.397
 |            crc32cIntelC  true   288  16748.952
 |            crc32cIntelC  true   512  19862.039
 |            crc32cIntelC  true  1024  22373.350
 |            crc32cIntelC  true  1032  22482.031
 |            crc32cIntelC  true  4096  24690.531
 |            crc32cIntelC  true  8192  24992.827
 | | So pushing 25GiB/s on a 3ish-year-old CPU. | wtallis wrote: | FYI, CRC32C is not the same checksum as CRC32. CRC32C is used | in iSCSI, btrfs, ext4. CRC32 is used in Ethernet, SATA, gzip. | Intel's SSE4.2 provides an instruction implementing CRC32C | but not CRC32. ARM defines instructions for both.
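The CRC32 vs CRC32C distinction wtallis draws comes down to the polynomial alone. A hedged bit-at-a-time reference implementation over the reflected forms of both polynomials (orders of magnitude slower than either hardware instruction; for illustration only):

```python
import zlib

def crc32_bitwise(data: bytes, poly_reflected: int) -> int:
    """Bit-at-a-time CRC over a reflected polynomial, with zlib-style
    conditioning (initial and final XOR with 0xFFFFFFFF)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (poly_reflected if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

CRC32_POLY = 0xEDB88320   # 0x04C11DB7 reflected (zlib, Ethernet, gzip)
CRC32C_POLY = 0x82F63B78  # 0x1EDC6F41 reflected (Castagnoli: iSCSI, ext4)

msg = b"123456789"
# 0xCBF43926 and 0xE3069283 are the standard check values for this input.
assert crc32_bitwise(msg, CRC32_POLY) == zlib.crc32(msg) == 0xCBF43926
assert crc32_bitwise(msg, CRC32C_POLY) == 0xE3069283
```

Only the constant changes between the two checksums, which is why the pmull-based folding method generalizes once the folding constants are rederived for the other polynomial.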
| | This article seems to miss that distinction but appears to be | testing CRC32, so it's not quite correct to compare against | something using Intel's CRC32C instruction. | neurostimulant wrote: | I'm using an i7-4790 (7-year-old CPU) and the numbers are | slightly better. Maybe because it's a desktop.
 |            crc32cIntelC  true    16   4025.334
 |            crc32cIntelC  true    64  15749.095
 |            crc32cIntelC  true   128  26608.064
 |            crc32cIntelC  true   192  25828.486
 |            crc32cIntelC  true   256  17448.436
 |            crc32cIntelC  true   288  18336.381
 |            crc32cIntelC  true   512  22635.590
 |            crc32cIntelC  true  1024  24654.248
 |            crc32cIntelC  true  1032  24180.107
 |            crc32cIntelC  true  4096  28251.903
 |            crc32cIntelC  true  8192  28768.134
 | jeffbee wrote: | Core i7-12700K:
 |            crc32cIntelC  true    64  26210.561
 |            crc32cIntelC  true   128  35870.309
 |            crc32cIntelC  true   192  36850.224
 |            crc32cIntelC  true   256  30343.690
 |            crc32cIntelC  true   288  30671.327
 |            crc32cIntelC  true   512  32443.251
 |            crc32cIntelC  true  1024  34654.719
 |            crc32cIntelC  true  1032  34265.440
 |            crc32cIntelC  true  4096  38111.089
 |            crc32cIntelC  true  8192  38634.925
 | DantesKite wrote: | I really like that the author gave some context at the top. So | many times I struggle to read or realize the importance of a | concept because there just isn't enough context for me to follow | along. | | And certainly not all blogs have to, but it's nice when they do. | parentheses wrote: | It is very interesting that since the release of the M1 chip, CPU | performance on Apple silicon has really come under a microscope. | It leads me to ask: | | Was Apple silicon always this best-in-class and we weren't | looking this closely as a community? | mhh__ wrote: | You couldn't buy one without a screen, or run arbitrary code on | them. | minhazm wrote: | Apple's chips in their iPhones & iPads have been outperforming | the competition (Qualcomm & Samsung) for a long time now in | both power efficiency and performance. Apple has usually been | around 2 yrs ahead of the competition.
The Qualcomm Snapdragon | 888 chip with 8 cores has a Geekbench multi-core score of | 3592[1]. The six-core Apple A15 Bionic scored 4673. The Apple | chip has ~30% better multi-core performance with 25% fewer | cores than the Qualcomm chip. In single-core performance the | difference is even larger, with the Apple chips performing ~47% | faster. You can get the same A15 Bionic in both the $429 iPhone | SE and the $1100+ iPhone 13 Pro Max. | | It hasn't really been a huge deal though because people don't | develop directly on an iPhone, so it doesn't affect their | everyday productivity all that much. Also, phones reached the | point of "fast enough" a few years ago; it's hard to tell the | difference between an iPhone 13 Pro and an 11 Pro unless you | use them side by side. But with the release of the M1 chip, | people are getting the performance & energy efficiency gains in | their everyday workflows. | | [1]. https://browser.geekbench.com/android-benchmarks | | [2]. https://browser.geekbench.com/ios-benchmarks | skavi wrote: | Yes, and anyone who had been reading AnandTech's excellent | mobile reviews (by Andrei Frumusanu) knew what M1 would bring. | Aissen wrote: | Everyone was looking, and it was well known that it was best- | in-class; two random examples: | https://www.anandtech.com/show/7335/the-iphone-5s-review/4 | | https://twitter.com/codinghorror/status/912047023871860737 | ntoskrnl wrote: | So ARM64 has dedicated instructions for CRC32, but implementing | it by hand using SIMD is still faster. Score another point for | RISC. | pclmulqdq wrote: | It is not faster to use SIMD by hand - it is faster to use the | vector unit alongside the integer unit, using both paths at the | same time. | dragontamer wrote: | SIMD is a very powerful parallelization technique, with | marvelous gains whenever I see it used. It seems like a | fundamentally more efficient form of compute, but is very | difficult to design algorithms for.
| | I'd argue against "SIMD" as being "RISC", since you need all | sorts of complicated instructions (ex: gather/scatter) to | really support the methodology well in practice. | tremon wrote: | But scatter/gather is a primitive operation for SIMD, so if | you want a RISC-based version of it, that's exactly what you | would provide. Having dedicated instructions for specific | operations (whether for crc/aes/nnp or whatever) feels like a | CISC-based approach, so I think I agree with the GP. | | RISC vs CISC is about the simplicity of the instruction set, | not about whether it's easy to use. | mhh__ wrote: | These days I'd argue RISC vs CISC is more about regularity | and directness than the size of the ISA per se. | | I'd argue AArch64 isn't particularly RISC by the standards | of the past but it sets the bar and tone for RISC today. | dragontamer wrote: | And which SIMD instruction set should we be talking | about? NEON instructions or the SVE instruction set? | | And if we're talking about multiple instruction sets | designed for the same purpose, is this thing really RISC | anymore? Or do you really mean "just not x86" when you | say RISC? | mhh__ wrote: | That depends on how precisely you define the purpose. | NEON and SVE seem to be aimed at different intensities of | work. | dragontamer wrote: | > But scatter/gather is a primitive operation for SIMD | | Not in NEON, and therefore not in M1. AVX512 and SVE add | scatter/gather instructions. | | Intel/AMD's AVX has vgather instructions, but is missing | vscatter until AVX512. | | > Having dedicated instructions for specific operations | (whether for crc/aes/nnp or whatever) feels like a | CISC-based approach, so I think I agree with the GP. | | Not only are there AES instructions on ARM, but there's | also SHA instructions. The mix-columns step of AES more or | less demands dedicated hardware if you want high speed | today, so everybody implements that as a hardware-specific | instruction.
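For concreteness on the gather/scatter debate above: gather loads each SIMD lane from an arbitrary index, and scatter stores each lane back to one. A scalar Python model of the per-lane semantics (the function names are ours, not any ISA's):

```python
def gather(memory, indices):
    # One SIMD gather: lane i loads memory[indices[i]].
    return [memory[i] for i in indices]

def scatter(memory, indices, values):
    # One SIMD scatter: lane i stores values[i] to memory[indices[i]].
    for i, v in zip(indices, values):
        memory[i] = v

mem = [10, 20, 30, 40, 50, 60, 70, 80]
lanes = gather(mem, [7, 0, 3, 5])   # non-contiguous loads in one op
assert lanes == [80, 10, 40, 60]
scatter(mem, [1, 2], [99, 98])      # non-contiguous stores in one op
assert mem[:3] == [10, 99, 98]
```

The hardware versions do all lanes in parallel (with masking and fault handling), which is exactly the complexity that fuels the RISC-vs-CISC argument here.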
| tlb wrote: | Similarly, I wish that on x86, REP MOVSB was the fastest way to | copy memory, because it only takes a few bytes in the icache. | But fast memcpys end up being hundreds of bytes, to work with | larger words while handling start and end alignment. | userbinator wrote: | It still is in general situations (i.e. not the | microbenchmarks where the ridiculously bloated unrolled | "optimised" implementations may have a very slight edge). I | believe the Linux kernel uses it for this reason. | jabl wrote: | The kernel is a bit of a special case since very likely a | syscall starts off with a cold I$, and also there's a lot | of extra overhead if you insist on using SIMD registers. | | In general I agree with you though, optimizing memcpy | implementations only against microbenchmarks is dumb. | saagarjha wrote: | With ERMS it's definitely not going to be slow, so it's a | good choice when you're in a constrained environment (high | instruction cache pressure, can't use vector instructions). | userbinator wrote: | That's very shortsighted thinking. The dedicated instruction | could be optimised by the hardware in a future revision to | become much faster. | MichaelZuo wrote: | It's very impressive that someone messing around for a few hours | could get the M1 chip to more than 2x the performance. Easy | gains like that really shouldn't be possible, assuming Apple's | silicon team is competent. Maybe there's some hidden gotcha | here? | dzaima wrote: | CRC32X works on 8 bytes at a time and has a throughput of one | invocation per cycle, whereas the SIMD operates on blocks in | parallel (the Chromium code does 64 bytes an iteration, with | a lot of instruction-level parallelism too). Theoretically M1 | could have thrown more silicon at it to allow more than one | CRC32X invocation per cycle, but that's not very useful if | you can achieve the same with SIMD anyway. | Sirened wrote: | This is way more common than you'd think, and it's not by | accident.
Engineering teams optimize the paths that are | heavily used to get the biggest improvement across the | platform as a whole. CRC32X is certainly not as heavily used | as NEON, and so if you're forced to decide between spending | area on being able to fuse extra instructions for NEON and | slightly improving throughput for CRC32X, the obvious choice | is NEON. You see this way more obviously on Intel's x86-64 | cores, where many of the highly used instructions are fast- | path decoded but some of the weirder CISC instructions that | nobody really uses are offloaded to very slow microcode. | bee_rider wrote: | I wonder -- could CRC32X be something that would, | specifically, not be as interesting for Apple? They are mostly | optimizing for desktop workloads. I wonder if worrying | about checksumming, especially maximizing the throughput of | checksum operations, is more of a server thing. (Like we | have to checksum when we download things on desktop, but | that's a one-off, and I guess things get checksummed in the | filesystem, but even the nice NVMe drives are pretty slow | from the CPU's point of view). | astrange wrote: | Mach-O uses code signing with ad-hoc signatures as a form of | checksumming, and there's also TCP and whatever drives | do. So it's the converse: it's so common the dedicated | hardware does it instead of the CPU. And it's not all the | same algorithm. | AlotOfReading wrote: | Intel's algorithm is very clever and would take a lot of | space to implement in hardware. The implementation underlying | the CRC32** instructions is probably some set of shift | registers. That's a pretty good space/speed tradeoff to make. | | My largely uninformed guess is that they added the | instructions to get fast CRCs for the filesystem 'for free'. | There aren't many other cases where software CRC can be a | bottleneck that also use these polynomials. | interestica wrote: | Saving it for M2 to have something to show off?
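dzaima's throughput point above can be checked with back-of-the-envelope arithmetic. The ~3.2 GHz clock is an assumption here; the per-cycle figures (one 8-byte CRC32X per cycle, four SIMD uops per cycle at 8 bytes of work each) come from this thread:

```python
GHZ = 3.2e9  # approximate M1 P-core clock - an assumption, not a spec

# CRC32X: one 8-byte invocation per cycle on its single port.
crc32x_gbps = 8 * GHZ / 1e9
assert round(crc32x_gbps, 1) == 25.6

# SIMD folding: four 128-bit pmull-class uops per cycle, each
# covering 8 bytes of input (two uops per 16-byte block).
simd_gbps = 4 * 8 * GHZ / 1e9
assert round(simd_gbps, 1) == 102.4
```

So the SIMD path has roughly a 4x higher ceiling than the dedicated instruction; the ~75GB/s figure mentioned in the thread sits below that upper bound, as real loops also spend issue slots on loads and XORs.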
| nicoburns wrote: | I feel like CRC32 may be simple enough (and close enough to | the kind of operations, like adding and bit-shifting, that | general-purpose CPUs are good at anyway) that perhaps it | doesn't benefit as much from dedicated silicon as other | algorithms would. | [deleted] | naniwaduni wrote: | > Easy gains like that really shouldn't be possible, | | Easy gains are _everywhere_. The "gotcha", if you can call | it that, is that optimizing particular operations comes with | space tradeoffs that are more expensive when you do them in | hardware. | d_tr wrote: | I am not taking any hard stance on the usefulness of the | specialized instruction, but M1 is a very wide and powerful | core, so this won't be true everywhere. | | The single instruction might also be more power-efficient and | keep other resources free for other stuff. | brigade wrote: | Specifically, the M1's NEON ALUs can handle 64 bytes per | cycle on the big cores. Most Cortex can only do 32 bytes per | cycle (Cortex X1/X2 is the exception), and the lower-end | designs are only 16B/cycle. Plus the pmull+eor fusion on M1 | increases effective throughput by 33%, and I don't know if | that's implemented on any Cortex. | | Without those, the NEON version would fall to the same | throughput as one CRC32X per cycle, or half the throughput on | cores with 2x 64-bit ALUs. | athrowaway3z wrote: | Also not taking hard stances, but both cases are suspect. | Power efficiency, because being 3 times faster means you're | done 3 times earlier. Keeping other resources free, because I | suspect a CRC calculation is generally followed by an `if eq` | statement. (Even with out-of-order or speculative execution | this creates a bottleneck that is nice to remove 3x faster.) | stingraycharles wrote: | If you're writing optimized code, hardly ever would you | evaluate one CRC check at a time. You would process them in | chunks, as the OP stated, but would just let a compiler do | the auto-vectorization.
| | This is even more true in the case of CRC, where there's | clearly almost always one branch that wins: this is perfect | for branch prediction, which would mean the whole "if eq" | condition is preemptively skipped. | saagarjha wrote: | The compiler probably isn't going to be able to | autovectorize a CRC unless you help it out. | dottrap wrote: | I think power efficiency has a lot more variables now, so it | is not easy to know if consumption is linear with time. | CPUs now dynamically throttle themselves, plus now Apple | has advertised that its M1 cores are divided up between | high-performance and high-efficiency cores, let | alone how the underlying chip itself may consume power | differently for implementing different instructions. | | So for a hypothetical example, it could be that using | general-purpose SIMD triggers the system to throttle up the | CPUs and/or move to the high-performance CPUs, whereas the | dedicated CRC instructions might exist on the high- | efficiency cores and not trigger any throttling. | | I've forgotten all my computer architecture theory, but if | I look back at Ohm's law and look at power, the equation is | P = I^2 * R. Handwaving from my forgotten theory a bit | here, ramping up the CPUs increases current, and we see | that it is a squared factor. So cutting the time by, say, | a factor of 3 does mean you are done 3 times faster (which | is a linear component), but you still have to contend with | a squared component in current which may have been | increased. | | I have no clue if the M1 actually does any of this, but am | merely stating that it is not obvious what is happening in | terms of power efficiency. We've seen other examples of | this.
For example, I've read that Intel's AVX family | of instructions generally increases the power consumption | and frequency when utilized, but, non-obviously, it often | runs at a lower frequency in the 256- or 512-bit-wide forms | compared to the lesser widths (which then requires more | work from the developer to figure out what the optimal | performance path is, as wider isn't necessarily faster). And as | another example, when Apple shipped 2 video cards in their | MacBooks, some general-purpose Mac desktop application | developers who cared about battery life were tip-toeing | around different high-level Apple APIs (e.g. Cocoa, Core | Animation, etc.) because some APIs under the hood | automatically triggered the high-performance GPU to switch | on (and eat power), while these general-purpose desktop | applications didn't want or need the extra performance (at | the cost of eating the user's battery). | saagarjha wrote: | > whereas the dedicated CRC instructions might exist on | the high-efficiency cores | | M1 has a heterogeneous ISA, FWIW. | astrange wrote: | Homogeneous? The P and E cores have all the same | instructions. You won't get suddenly moved from one to | the other or hit emulations. | saagarjha wrote: | Oops, yes, that's what I meant. Thanks for catching that! | rowanG077 wrote: | It's not really a fair comparison. There is only one CRC32 unit, | which means it can't make use of superscalar execution (at least | if I understand the article correctly). If it had more CRC32 | units, that would be the most efficient. | adrian_b wrote: | No, the faster implementation uses another dedicated | instruction, which happens to be more general than CRC32, i.e. | the multiplication of polynomials having binary coefficients.
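"Multiplication of polynomials having binary coefficients" is carry-less multiplication - the operation behind ARM's PMULL and Intel's PCLMULQDQ. A bit-level sketch of the operation (an illustrative helper, not the full hardware semantics):

```python
def clmul(a: int, b: int) -> int:
    """Carry-less multiply: ordinary long multiplication, except the
    partial products are combined with XOR instead of addition, i.e.
    multiplication in GF(2)[x] with bit i as the coefficient of x^i."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

# 0b101 * 0b011 carry-lessly: 0b101 XOR (0b101 << 1) = 0b1111
assert clmul(0b101, 0b011) == 0b1111
# Unlike integer multiply there are no carries: 3 * 3 = 9, but
# clmul(3, 3) = 0b11 XOR 0b110 = 0b101 = 5
assert clmul(3, 3) == 5
```

CRC folding works by carry-less-multiplying message chunks by precomputed powers of x modulo the CRC polynomial, which is why one such instruction can serve any polynomial.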
| | So this has little to do with RISC, except the general | principle that the instructions that are used more frequently | should be implemented to be faster, a principle that has been | used by the M1 designers and by any other competent CPU | designers. | | In this case, ARM added the polynomial multiplication | instruction a few years after Intel, with the same main purpose | of accelerating authenticated encryption with AES. There is | little doubt that ARM was inspired by the Intel Westmere new | instructions (announced by Intel a few years before the | Westmere launch in 2010). | | The dedicated CRC32 instruction could have been made much | faster, but the designers of the M1 core did not believe this | was worthwhile, because that instruction is not used often. | | The polynomial multiplication is used by many more | applications, because it can implement CRC computations based | on any polynomial, not only the one specified for CRC32, and it | can also be used in a great number of other algorithms that are | based on the properties of the fields whose elements are | polynomials with binary coefficients. | | So it made sense to have a better implementation for the | polynomial multiplication, which allows greater speeds in many | algorithms, including the CRC computation. | saagarjha wrote: | Amusingly, the Rosetta runtime uses crc32x | IshKebab wrote: | That sounds like a point for RISC to me? | | OK, maybe it is just a point against really complex | instructions. There's clearly an optimum middle ground. | DonHopkins wrote: | Mary Payne designed the VAX floating-point POLY instruction | at DEC. | | But the microcode and hardware floating-point | implementations did it slightly differently. Then the | MicroVAX dropped it, then picked it up again but wrong, | then fixed it, then lost it again.
| | http://simh.trailing-edge.com/docs/vax_poly.pdf | | https://documentation.help/VAX11/op_POLY.htm | | https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_op... | | >Multiply-accumulate operation | | >The Digital Equipment Corporation (DEC) VAX's POLY | instruction is used for evaluating polynomials with | Horner's rule using a succession of multiply and add steps. | Instruction descriptions do not specify whether the | multiply and add are performed using a single FMA step. | This instruction has been a part of the VAX instruction set | since its original 11/780 implementation in 1977. | | https://news.ycombinator.com/item?id=20558618 | | >VAX was a crazy town ISA too, with stuff like single | instruction polynomial evaluation. | astrange wrote: | Complex instructions are often good ideas. They're best at | combining lots of bitshifting (hardware is good at that and | it can factor things out) but even for memory ops it can be | good (the HW can optimize them by knowing things like cache | line sizes). | | They get a bad rap because the only really complex ISA left | is x86, and it just had especially bad ideas about which | operations to use its shortest codes on. Nobody uses BOUND, | to the point some CPUs don't even include it. | | One point against them in SIMD is there definitely is an | instruction explosion there, but I haven't seen a | convincing better idea, and I think the RISC-V people's | vector proposal is bad and shows they have serious knowing- | what-they're-talking-about issues. | adgjlsfhk1 wrote: | What's wrong with the RISC-V approach? | astrange wrote: | They want to go back to the 70s (not meant as an insult) | and use Cray-style vector instructions instead of SIMD. | Vector instructions are kind of like loops in one | instruction; you set a vector length register and then | all the vector instructions change to run on that much | data.
| | That's OK for big vectors you'd find in like scientific | computations, but I believe it's bad for anything I've | ever written in SIMD. Games and multimedia usually use | short vectors (say 4 ints) or mixed lengths (2 and 4 at | once) and are pretty comfortable with how x86/ARM/PPC | work. | | Not saying it couldn't be better, but the RISC-V designers | wrote an article about how their approach would be better | basically entirely because they thought SIMD adds too | many new instructions and isn't aesthetically pretty | enough. Which doesn't matter. | | Also I remember them calling SIMD something weird and | politically incorrect in the article but can't remember | what it was... | adgjlsfhk1 wrote: | To me, the much bigger problem with SIMD is that it is | really hard to program for since vector sizes keep | getting bigger, and it really doesn't look good at all | for big-little designs, which seem to be the future. Most | compilers still aren't good at generating AVX-512 code, | and very few applications use it because it requires | optimization for a very small portion of the market. With | a variable-length vector, everyone gets good code. | | Also, I'm not clear why you think RISC-V-style vector | instructions will perform worse on vectors of length 2-4. | devit wrote: | The RISC-V design is the sensible one since it does not | hardcode the vector register size; the other designs are | idiotic since a new instruction set needs to be created | every time the vector register size is increased | (MMX->SSE2->AVX2->AVX-512) and you can't use shorter | vector registers on lower-power CPUs without reducing | application compatibility. | | And it doesn't "loop" in one instruction, the vsetvl | instruction will set the vector length to the _minimum_ | of the vector register size and the amount of data left | to process.
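The vsetvl behavior just described - take the minimum of the hardware vector length and the data remaining - is the classic strip-mining pattern. A scalar Python model (VLEN here is a stand-in for the hardware register width, not a real parameter of any ISA):

```python
def vector_add(a, b, vlen=4):
    """Strip-mined elementwise add: each trip through the loop
    processes vl = min(vlen, remaining) elements, the way RVV's
    vsetvl sizes each iteration."""
    out = []
    i = 0
    while i < len(a):
        vl = min(vlen, len(a) - i)  # what a vsetvl-style query returns
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out

# 6 elements with vlen=4 take trips of 4 and 2; the same code would
# run unchanged on hardware with a wider vector register.
assert vector_add([1, 2, 3, 4, 5, 6], [10] * 6) == [11, 12, 13, 14, 15, 16]
```

Note there is no separate tail-handling loop: the final short trip is just another iteration, which is the vector-length-agnostic property devit is pointing at.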
| atq2119 wrote: | Isn't there a sort of obvious "best of both worlds" by | having a vector instruction ISA with a vector length | register, plus the promise that a length >= 4 is always | supported and good support for intra-group-of-4 shuffles? | | Then you can do whatever you did for games and multimedia | in the past, except that you can process N | samples/pixels/vectors/whatever at once, where N = vector | length / 4, and your code can automatically make use of | chips that allow longer vectors without requiring a | recompile. | | Mind you, I don't know if that's the direction that the | RISC-V people are taking. But it seems like a pretty | obvious thing to do. ___________________________________________________________________ (page generated 2022-05-22 23:00 UTC)