[HN Gopher] Do not taunt happy fun branch predictor ___________________________________________________________________ Do not taunt happy fun branch predictor Author : mkeeter Score : 169 points Date : 2023-01-25 16:41 UTC (6 hours ago) (HTM) web link (www.mattkeeter.com) (TXT) w3m dump (www.mattkeeter.com) | efitz wrote: | I think that the days when hand-tuned assembly language | outperforms compiler-generated code are largely behind us (let | loose the contrary anecdotes). | | Compilers and microprocessors are way more complex than they were | back in the 80s or 90s, and compiler engineers know way more | about how instructions are actually executed than the vast | majority of programmers. | jcalvinowens wrote: | I think about it like this: | | If I ask you to write assembler faster than what the compiler | emits for one very specific CPU model, you'll pretty much | always be able to do that (given enough time). | | But as CPUs become more complex, and compilers get better, | achieving that win requires more and more "overfitting" to | implementation details of the specific hardware, in ways that | are not helpful or even counterproductive on older or newer | models of the same CPU family. | | It's still worth it sometimes, of course. It's just more work | and a lower value proposition on average. | ravi-delia wrote: | The only exception (pointed out in this post) is leveraging | SIMD instructions, which aren't nearly as well exploited by | compilers. I doubt, of course, that this will last all that | long, but for now there are too many subtle differences that _might_ matter and totally different instruction sets between | CPUs and even generations. | makapuf wrote: | ... and yet in this article, the author beats compiled C code | by 10x with hand-tuned assembly. (By using SIMD and | unrolling, which the compiler did not.
Granted linear compiler | code is faster than hand-made linear assembly) | zokier wrote: | The author beats the compiler because of floating-point | constraints; with -ffast-math the compiler vectorizes the code. | I don't have arm64 hardware to test the result, but it's probably again | pretty fast: https://gcc.godbolt.org/z/xvjY8P4cM | rowanG077 wrote: | I used to think that was true. Then I had a crypto course in | uni which required us to write and optimize 3 different hashing | and encryption algorithms. | | I was stunned by the first one, which I first did in C and | then moved to ASM. The C code was pretty straightforward. But | it was trivial to beat in ASM by something like 120%. | | That course taught me how bad compilers (or rather GCC) are at | high-level register optimization and using the barrel shifter. | Basically all the performance I squeezed out of ASM was because | the compiler just wasn't figuring basic stuff out. Mind you I | was (am) not an ASM expert. That piece of code was the first | thing I ever wrote in ASM. And yet I was able to easily beat a | decades-old world-class compiler. | wolf550e wrote: | compiler autovectorization is poor, people often outperform it | when writing SIMD code. intrinsics are so low level they might | as well be assembly. | JonChesterfield wrote: | I've had really good results from the two in LLVM (one works | on loops, one within basic blocks). Optimal loop body when | the pointers had alignment metadata attached, though at the | time it failed to unroll the tail. | | Using intrinsics with the control flow in C or C++ works | really well - you get the right instruction selection from | the intrinsics, easy to reason about control flow and the | compiler deals with register allocation (and possibly | instruction scheduling) which are a pain to do by hand.
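The -ffast-math point above can be made concrete with a small C sketch (the function names here are mine, not from the article or the thread). A strict IEEE reduction carries a sequential dependency through the accumulator that the compiler is not allowed to reassociate, so it typically vectorizes the naive loop only under -ffast-math / -fassociative-math, while manually splitting the accumulator expresses the same parallelism in plain standards-conforming C:

```c
#include <stddef.h>

/* Naive reduction: a strict sequential dependency on `sum`. Compilers
 * generally only vectorize this under -ffast-math / -fassociative-math,
 * which permit reassociating floating-point addition. */
float sum_naive(const float *data, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Manually reassociated version: four independent accumulators expose
 * the parallelism directly, so it can be vectorized or at least
 * pipelined even under strict IEEE semantics. The two functions may
 * return slightly different results for the same input, which is
 * exactly why the compiler won't make this transformation on its own. */
float sum_unrolled(const float *data, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)  /* scalar tail */
        sum += data[i];
    return sum;
}
```

Comparing the two at -O2 with and without -ffast-math on godbolt typically shows vector adds only in the fast-math build of sum_naive.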
| iamsmooney wrote: | Title is a reference to this old SNL skit: | https://www.youtube.com/watch?v=GmqeZl8OI2M | sophacles wrote: | It's definitely a classic - if you haven't seen it I'd | recommend it even if it wasn't related to the article :D | bioint7812 wrote: | More translation: | | > for reasons | | https://www.urbandictionary.com/define.php?term=for%20reason... | jerf wrote: | This is another good example of how our CPUs are in many ways | specialized C processors. C is a structured programming language | that uses functions, so our processors like functions. If you | jump out of that paradigm, even if the assembly instructions | nominally seem to allow it, you'll run more slowly. Even when it | seems like what you're offering is a shortcut to the CPU. | | This is neither praise nor criticism of the current CPU paradigm; | it's just something you need to understand if you want the best | performance out of our machines. | | A different paradigm, like a concatenative-paradigm-based | program, might naively be more inclined to compile into code that | looks more like what the author tried, jumping between | implementations of the stack operators without it actually being | "functions". One can imagine processors that would be "happier" | with that, and would be bothered by things that look like | function returns more. But that's not the CPUs we have. | masklinn wrote: | Author was not jumping out of the paradigm here, they were | deliberately misusing constructs specialised for the paradigm | (br/ret). That's like saying the toolbox is a specialised | screw processor because you're trying to drive nails using a | screwdriver and it does not go well. | | And C is hardly the first or only procedural language. | ablob wrote: | This has more to do with the ISA than C I'd assume. C was built | as an "easy assembler". Furthermore, the computer doesn't | really care about structure (in the structured programming | sense).
| | In this case the implementation of the ISA stores information | on each 'ret' depending on some 'bl' that came before. One can | imagine that a different optimization technique which actually | leads to a speedup exists. Without a branch predictor, the | program that Matt wrote might've been faster. | | Imo, this has nothing to do with paradigms, but with how control | flow interacts with the system. This interaction is different | between different implementations of an architecture. | | Code written for cache locality and paradigms that work well | with it, for example, only became a "winner" after caches were | widely implemented. Before that, the cost of sequential array | access and random array access was identical. With caches, a | linear search for an element can be faster than a binary | search, even though it requires a lot more memory accesses. | Thus, the optimal implementation for finding an element in an | array is now dependent on size as well. (I.e. after an ordered | array reaches a certain size, binary search should become | faster on average). | int_19h wrote: | The point is that the ISA is assuming that BL and RET opcodes | come in pairs - which is an assumption that does reflect what | structured code is typically compiled down to. Going by the | semantics of the opcodes themselves, there's no reason why | RET should be treated differently from a simple indirect jump | here. | masklinn wrote: | > there's no reason why RET should be treated differently | from a simple indirect jump here. | | There's _its very existence_. If you're looking for a | general purpose indirect jump, use that. | tsimionescu wrote: | As far as I understand, the ISA explicitly intends for BL | and RET to come in pairs. The only point of RET is to jump | to the address last stored by BL. If you don't need this | behavior, there's no reason to use RET - as the article | itself shows, BR x30 does the exact same job, and doesn't | come with the extra assumptions.
| classichasclass wrote: | I agree, because this semantic difference between br x30 and | ret doesn't exist in many other RISCs which also mostly run | C-like languages. Power just has blr, and MIPS jr $ra, for | example. This feels more like a well-intentioned footgun in | the ISA. | shadowofneptune wrote: | The potential population for that footgun is compiler | writers, who should safely fall into 'know what they are | doing' territory. | flohofwoe wrote: | It's just a natural side effect of software and hardware | forming a symbiosis and that they cannot really be separated. | New software is (or should be!) built to run well on existing | hardware, and new hardware is built to run existing software | well. Besides, subroutine call instructions were a thing long | before C became mainstream and manual assembly coding ruled | supreme. | [deleted] | brigade wrote: | If you want to encode a series of indirect jumps between | concatenated functions, you can do that already and the return | address stack won't get involved; you'll simply get the normal | branch predictors. | | But generalized branch prediction is expensive (multiple | entries in the BTB hashed by the previous program flow, and | mispredicts if the next function in the chain changes); the | point of an RAS is that it's a _very_ cheap way to keep some | 100% predictable branches from using those general resources. | Concatenation pretty much requires branches without unique | characteristics, so no fast path shortcuts and everything is a | bit slower. | pranith wrote: | Great investigative work! The stack structure you refer to here | is called the Return Address Stack (RAS). | ShroudedNight wrote: | I feel like this treatment is incomplete without having tested | the scenario where the unmatched ret is replaced with a br lr. | | EDIT: Reading the documentation after the fact, it appears that | that was what br x30 was - naively I had interpreted the hex as a | fixed offset to a label. 
| snerbles wrote: | While I was in undergrad I toyed around with abusing the branch | predictor on a few different machines, compiling something like | the following with optimizations off - it performs an identical | computation regardless of branch outcome:
 |
 |     void branchLoop(unsigned int condition, unsigned int &sum) {
 |         // put something suitably large here
 |         unsigned int loopCount = 0x0fffffff;
 |         unsigned int i;
 |         // compile with -O0 or this gets optimized away
 |         for (i = 0; i < loopCount; i++)
 |             if ((i & condition) == 0)
 |                 sum++;
 |             else
 |                 sum++;
 |     }
 |
 | The Core Duo on my Thinkpad T60 had some very distinct slowdowns | on certain bit patterns, which were not repeatable on the handful | of other CPUs I had access to at the time. I haven't tried this | with more modern CPUs, however. | gpderetta wrote: | Predictors are getting better and better at recognizing long | patterns (sometimes at the cost of not being optimal with short | patterns). | Malic wrote: | Ow. My head hurts. | | And this is why optimizing compilers are some of the most complex | programs there are. (or so I have been taught) | shadowgovt wrote: | It's also why the modern rule of thumb is "don't optimize by | writing your own assembly." | | The rule is a boiled-down version of the larger notion "Don't | optimize by writing your own assembly, _because even with | domain knowledge of the problem you're trying to solve, you're | probably not more clever than the engineer-decades that went | into building your compiler toolchain and processor | architecture, unless you're an expert in both fields, in which | case good luck and shoulder that maintenance burden._" | | The rule of thumb drops a lot of detail on the ground but is a | good first approximation. | astrobe_ wrote: | ... Which is kind of worrying; is it really a good thing that | processors are so complex that you need "decades" to use them | fully?
Bottom line, you end up with chaotic (in the | "sensitive to the slightest change" sense) performance behavior. | | OTOH this reminds me of another saying, "don't roll your own | crypto". But all those "don't"s are a bit frustrating. | olliej wrote: | But you don't need decades of experience: we have compilers | and optimizers to do that. | bioint7812 wrote: | It's interesting that the impedance the author was | experiencing was one of the CPU incorrectly "speculating" | about the intent. We as readers are left to speculate | about the problem being solved by the author. | | Based on the content of his recent articles, we could | assume he is continuing his development of a JIT for a | Rust port of his graphics engine. Given that assumption, | I would argue that the compiler writers are lagging here | --specifically the lack of dynamic compilation. For example, | is there a JIT compiler for the Rust language? I was | thinking about reproducing his experiment in SBCL, which | does have dynamic compilation--although it wouldn't be a | real "apples to apples" comparison because my work | machine is an x86_64. | olliej wrote: | JITs have very different constraints (but also many | advantages) vs AOT compilers, so I don't think in the | general case a language compiler can "support" JITs | directly. llvm/clang for instance have a jit mode... for | compiling C/C++, not some other language. | | Also obviously the compiler writers (AOT or JIT) need to | know about and understand very minute details of how a | CPU is behaving. I was responding to a person saying it | was worrying that devs need that kind of knowledge and | experience by saying that that isn't true because the | tools exist (I feel that there's some analogy to what | knowledge you need for a car: changing oil, servicing | engine, replacing engine, designing an engine...).
Once | you are writing a JIT you're in the "making the tools" | group, so you now need to know more than the vast | majority of devs (not just more than the "average" dev); | that's inescapable. | miloignis wrote: | I've seen people get frustrated by the "don't"s before, but | I think that's generally taking the first-degree | approximation too literally. Feel free to hand-write | assembly or roll your own crypto, but _don't_ depend on it | for anything serious unless you are an expert or have it | reviewed by one. Doing so for learning and fun is fine, if | that's clearly called out such that no one accidentally | depends on it for something serious. There's only one way | to become good at something, and that's good practice! | | In a professional setting there's a responsibility to the | end user which generally precludes doing these dangerous | things - that is, one should feel free to take up | woodworking as a hobby but shouldn't offer to build | someone's house unless you're a licensed professional. | [deleted] | pjdesno wrote: | Maybe better expressed as "don't write your own assembly | unless you know why it might be better than the compiler". | | Trying to beat the compiler at optimizing normal code is kind | of like trying to beat a calculator with paper and pencil - | computers are just better at that sort of thing than people | are. | | One use case is where you want to use bizarre CPU functions | (popcount, encryption, load CR3, etc.) that no one's taught | the compiler how to generate, although for some of them you | might be better off using compiler intrinsics. | | Another is when you're dealing with things _underneath_ the | language abstraction, like the Boost co-routines mentioned in | a link a few comments above. Of course, if the language folks | decide to add the capability you want (e.g. C++20 | coroutines), you're back to being better off using the | compiler. | | Finally there are pedagogical reasons, e.g.
sometimes I show | my classes the world's simplest "Hello World", using a few | lines of assembler to invoke the write and exit syscalls. | JonChesterfield wrote: | The boost coroutines mentioned are not the same thing as | the C++20 ones. Boost captures the stack, that's what the | register shuffling is for. C++ is a compiler transform that | moves some state onto the heap, but probably not the whole | stack, and builds a switch dispatch style thing to | transform the control flow. This is why C++ comes with co_* | annotations and coroutines don't. | avgcorrection wrote: | I will never understand the trend of "using quotes around | things". Which is a shorthand version for "using quotes | around things that I wanted to point to and say, hey, this is | something that "goes over here", you know, inside these | quotes, to make sure that you understand exactly what I'm | delimiting, since _using commas, and semicolons, and colons | wouldn't fit for some reason. Oh look the thing that I'm | quoting now consumes 80% of this paragraph. But this is way | better than just saying "the modern rule of thumb is to not | optimize by writing your own assembly." Because then it isn't | 100% clear that the rule of thumb is delimited by (exactly) | "[do] not optimize by writing your own assembly." Ya see?_ " | RodgerTheGreat wrote: | I would just phrase it as "if you optimize by writing your | own assembly, don't expect your program or its performance to | be portable." | bob1029 wrote: | The state of modern compilers is pretty staggering to me. I've | seen some code folding in RyuJIT that makes me feel inferior as | a developer. | | You've got a few compilers (Java, .NET, et al.) which are | capable of re-compiling hot path code during live execution and | then seamlessly transitioning to those paths. This | recompilation can be based upon the statistics of the live | process, so it's almost like a sort of adaptive AI.
Which paths | are hot in production does not need to be known at compile time | with these approaches. | JonChesterfield wrote: | There's a sense in which they're complicated. It's a sequence | of graph transforms which mostly deal with non-polynomial time | problems using heuristics, where mistakes in the transforms can | manifest quite a long way away from the error. There's a | significant risk that they're implemented in languages unique | to that compiler toolchain, as compiler devs are quite prone to | solving problems by writing compilers. | | There's also a sense in which they're really simple. The input | format and output format are (usually) well defined. The user | interface is largely printing things to stderr and giving up, | possibly in a retry loop when there's an editor involved. The | program dependency graph is often quite small so the bug you're | looking at is probably in the source code you checked out. | Security is not the dominant concern you have elsewhere. | gpderetta wrote: | Yes, never push and ret. Here is something I wrote (/me checks | calendar) more than 15 years ago about optimizing coroutine | control flow: | https://www.crystalclearsoftware.com/soc/coroutine/coroutine... | water-your-self wrote: | If an HN reader wanted to play around with similar digging, what | would be the essential tools to be aware of and where best could | he start? | | Assuming prior knowledge of assembly/C but without much | experience decompiling or testing speed. | shoo wrote: | Learn how to use a decent profiler. 
if you're running linux, | that's probably perf: | | https://man7.org/linux/man-pages/man1/perf.1.html | | https://www.brendangregg.com/perf.html | | Here's a fun article from the cloudflare blog that gives an | example of using perf to diagnose performance of a small | utility: https://blog.cloudflare.com/when-bloom-filters-dont-bloom/ | | Matt Godbolt's compiler explorer is also worth checking out: | https://godbolt.org/ | dcow wrote: | Your compiler can spit out assembly, you just need to know how | to read it. Sounds like the author was also using Xcode | Instruments | https://help.apple.com/instruments/mac/current/#/dev7b09c84f... | to check CPU counters. And they were using criterion | https://crates.io/crates/criterion to microbenchmark. | | My guess would be that the author is porting some C code to | Rust and making sure not to regress performance along the way | (and probably hoping to increase it). Likely their | program was written in Rust and the section they were trying to | optimize called some old C code. Sounds like they rewrote the | section in Rust since Rust <-> C FFI calls break out of the | happy realm the Rust compiler likes and end up causing a | performance hit themselves. You can write inline assembly in | Rust using the macro https://doc.rust-lang.org/reference/inline-assembly.html. | CalChris wrote: | It took me a while to understand the mismatched bl/ret pairs | because I'm used to reading matched bl/ret pairs. This confusion | has to be similar to what the silicon is 'thinking': _Human, I | see matched bl/ret pairs all the time and I'm good at them. Why | are you giving me mismatched pairs? I'll do the right thing but | I'm not so good at them._ | | Still, this seems like function inlining. But why not just inline | and use a regular branch loop? Is _foo()_ also being called from | elsewhere? Is space at a premium?
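On the inlining question: a hypothetical C sketch of what inlining does to the call in the loop (foo here is a stand-in of my own, not the article's actual kernel). Once the compiler inlines the callee, there is no bl/ret pair left for the return-address predictor to track at all:

```c
#include <stddef.h>

/* A tiny subroutine called from a loop: each iteration executes a
 * matched call/return pair (bl foo ... ret on AArch64). */
static float foo(float x) { return x + 1.0f; }

float loop_call(const float *data, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += foo(data[i]);
    return acc;
}

/* What the compiler typically produces at -O2 after inlining foo:
 * the same computation with no call or return in the loop body, so
 * the return-address predictor never comes into play. */
float loop_inlined(const float *data, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += data[i] + 1.0f;
    return acc;
}
```

In practice a static function this small is inlined automatically at any optimization level above -O0, which is one answer to "why not just inline": the compiler already does, unless the function is large, external, or called through a pointer.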
| masklinn wrote: | > Upon seeing this program, it's a common reaction to ask "why | is foo a subroutine at all?" | | > The answer is "because this is a didactic example, not code | that's trying to go as fast as possible". | ogogmad wrote: | Minor erratum: Floating point addition actually _is_ commutative; | it's in fact non-associative. | lmm wrote: | It's not commutative; -0.0 + 0.0 = -0.0 but 0.0 + -0.0 = 0.0. | dekhn wrote: | with some significant exceptions, such as NaNs. | recursive wrote: | Can you think of some `x` where `x + NaN` is not identical to | `NaN + x`? I can't. | dekhn wrote: | you mean, like 1 + NaN = NaN and NaN + 1 = NaN, but NaN != | NaN? (I'm not a numerical expert, just repeating what | others have told me) | CountSessine wrote: | Yes. A NaN in IEEE754 has all 1's in the exponent, and | then the high bit of the mantissa determines whether it's | quiet or signalling, but the rest of the mantissa is/can | be a "payload". | bruce343434 wrote: | Sign bit determines signalling | robocat wrote: | Please double check your facts before disagreeing with | somebody so abruptly. | | Sign bit is NOT the signalling/quiet bit. Bit 51 (edit or | bit 50 - damn ***** IEEE for not publishing important | standards for free public access) is according to the | first result I looked at: | https://craftinginterpreters.com/optimization.html | | Edit 2 from IEEE 754 (2008 version): | 6.2.1 NaN encodings in binary formats This | subclause further specifies the encodings of NaNs as bit | strings when they are the results of operations. When | encoded, all NaNs have a sign bit and a pattern of bits | necessary to identify the encoding as a NaN and which | determines its kind (sNaN vs. qNaN). The remaining bits, | which are in the trailing significand field, encode the | payload, which might be diagnostic information (see | above). All binary NaN bit strings have all the | bits of the biased exponent field E set to 1 (see 3.4).
A | quiet NaN bit string should be encoded with the first bit | (d1) of the trailing significand field T being 1. A | signaling NaN bit string should be encoded with the first | bit of the trailing significand field being 0. If the | first bit of the trailing significand field is 0, some | other bit of the trailing significand field must be | nonzero to distinguish the NaN from infinity. In the | preferred encoding just described, a signaling NaN shall | be quieted by setting d1 to 1, leaving the remaining bits | of T unchanged. 6.3 The sign bit: When | either an input or result is NaN, this standard does not | interpret the sign of a NaN. Note, however, that | operations on bit strings--copy, negate, abs, copySign-- | specify the sign bit of a NaN result, sometimes based | upon the sign bit of a NaN operand. The logical predicate | totalOrder is also affected by the sign bit of a NaN | operand. For all other operations, this standard does not | specify the sign bit of a NaN result, even when there is | only one input NaN, or when the NaN is produced from an | invalid operation. When neither the inputs nor | result are NaN, the sign of a product or quotient is the | exclusive OR of the operands' signs; the sign of a sum, | or of a difference x-y regarded as a sum x+(-y), differs | from at most one of the addends' signs; and the sign of | the result of conversions, the quantize operation, the | roundToIntegral operations, and the | roundToIntegralExact (see 5.3.1) is the sign of the first | or only operand. These rules shall apply even when | operands or results are zero or infinite. When the | sum of two operands with opposite signs (or the | difference of two operands with like signs) is exactly | zero, the sign of that sum (or difference) shall be +0 in | all rounding-direction attributes except | roundTowardNegative; under that attribute, the sign of an | exact zero sum (or difference) shall be -0.
However, x + | x = x - (-x) retains the same sign as x even when x is | zero. When (a × b) + c is exactly zero, the sign of | fusedMultiplyAdd(a, b, c) shall be determined by the | rules above for a sum of operands. When the exact result | of (a × b) + c is non-zero yet the result of | fusedMultiplyAdd is zero because of rounding, the zero | result takes the sign of the exact result. Except | that squareRoot(-0) shall be -0, every numeric squareRoot | result shall have a positive sign. | | I.e. you are definitely wrong. The sign bit can be + or - | for NaN (presumably a side-effect of the encoding for | +/-Infinity). And then that leads to a bunch of arse | (section 6.3) because the spec needs to decide what | happens to the sign bit in a bunch of different | situations. PS: fucking infinity. Infinity should have | been NaN. Infinity ≠ Infinity, except in the | egghead-land IEEE (side note: egghead is a compliment | IMHO). Mind you, easy to see mistakes in retrospect, but | corner cases are shit in programming. I do like NaN, | although reading comments here, and the IEEE spec, forces | me to learn how little I know about NaN encodings. Oh, and | any NaN should equal any other NaN. Mathematically | obviously not, but logically yes and IEEE is for | programming. NaN is already defined as a nonsense, so at | least keep the nonsense consistent. Changing _if =_ to | _if ≠_ should not introduce subtle logic bugs. | | Ranting edit #755: and while we are at it, -0 is an | abomination in the eyes of the Great Architect in the | Matrix - it should never have been allowed - perhaps -0 | should have been NaN with signalling bits in the exponent | (even though that would prevent some language virtual | machine optimisations where 53 bits of NaN get used to | pack other information, but the win would be compelling | because reducing bugs due to special cases is huge IMHO). | How many developers understand IEEE corner cases: fuck | all in my long experience.
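The encoding details quoted above are easy to check directly. A small C sketch (the helper names are mine): the only layout properties the standard guarantees for a NaN are an all-ones exponent field and a nonzero significand; the position of the quiet bit is the usual convention, not a mandate, and the sign bit is not involved:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits as an integer; memcpy avoids aliasing UB. */
static uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* IEEE 754 binary32 layout: 1 sign bit, 8 exponent bits, 23 trailing
 * significand bits. A NaN has all exponent bits set and a nonzero
 * significand; the sign bit does NOT distinguish quiet from signaling.
 * The recommended (but not mandatory) convention puts the quiet flag
 * in the top significand bit, d1 (bit 22). */
static int is_nan_bits(uint32_t u) {
    return ((u >> 23) & 0xFF) == 0xFF   /* biased exponent field E all ones */
        && (u & 0x007FFFFF) != 0;       /* nonzero T distinguishes NaN from Inf */
}
```

For example, `is_nan_bits(float_bits(nanf("")))` holds, and on mainstream x86-64 and AArch64 toolchains `(float_bits(nanf("")) >> 22) & 1` is 1, i.e. the quiet bit sits in the significand, not the sign.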
| bruce343434 wrote: | The source I had in my head as I was replying was | https://posithub.org/docs/Posits4.pdf pages 31/32 which | implies that the sign bit is responsible for | signallingness. Neither of these are a primary source however. | | A random stackoverflow answer without sources seems to | confirm your pov. Now I don't know what to believe. How | is the sign bit used in NaNs? Would they really waste | that bit? | stephencanon wrote: | Hi, former IEEE 754 committee member here: languages and | architectures are allowed to choose how they encode | signalingness, but they cannot use the sign bit to do it. | The most common choice is to use the high-order bit of | the significand field, but most choices you might make are | represented by some architecture. | robocat wrote: | Awesome! So the important part of section 6.2.1 (2008) is | that the "should" is not a "must"? "A quiet | NaN bit string should be encoded with the first bit (d1) | of the trailing significand field T being 1. A signaling | NaN bit string should be encoded with the first bit of | the trailing significand field being 0." | | Or is it that some implementations before 2008 used other | bits? | | PS: I might be complaining, but I do sincerely thank you | for your hard work. I certainly would not want to be on a | standards committee, so I sincerely appreciate the | efforts of the committed who fight so hard to make | things better. Your comment just goes to show how many | corner cases there are to the corner cases!! | MaulingMonkey wrote: | If x is another NaN with a different payload, the result is | likely a NaN with a payload corresponding to the left side | of the addition. | raphlinus wrote: | That is correct, here is a playground link which shows | that result: https://play.rust-lang.org/?version=stable&mode=debug&editio... | stephencanon wrote: | This varies tremendously across architectures and uarches | and compilers.
The result will be a quiet NaN, and you | can't say anything beyond that. | jcranmer wrote: | It depends on your model of NaN. If you think there is only | one NaN value, then the two values return the same result. | If there are multiple distinct NaN values (i.e., you | distinguish between NaNs with different payloads), then the | resulting payload of an arithmetic operation is... not | well-specified. | | Most hardware architectures (x86, ARM fall into this | category) pick the rule that the payload is the Nth operand | (usually first, sometimes second) when multiple inputs are | NaN. I believe there's some hardware where the lesser of | the payloads (converted to an integer) is picked instead. | RISC-V dispenses with NaN payload propagation entirely. | There is theoretically the ability for hardware to generate | a new unique NaN with the hardware address of the | instruction into the payload, but I don't believe any | hardware actually does it. | | Most programming languages do not generally model NaN with | greater fidelity than "there is a NaN value" or maybe | "there is a distinction between qNaN and sNaN." | [deleted] | marcosdumay wrote: | Yes, it's commutative. But it is associative. | | Make a number with 3 bits of mantissa, and go add a thousand | repetitions of 1. You can get hundreds of different results | depending on the order you add them up. | wizzwizz4 wrote: | > _But it is associative._ | | You're describing a non-associative operation. | ogogmad wrote: | Exactly. See the definition: | https://mathworld.wolfram.com/Associative.html | [deleted] | mkeeter wrote: | Thanks, fixed! | titzer wrote: | > More specifically, the branch predictor probably keeps an | internal stack of function return addresses, which is pushed to | whenever a bl is executed. 
When the branch predictor sees a ret | coming down the pipeline, it assumes that you're returning to the | address associated with the most recent bl (and begins | prefetching / speculative execution / whatever), then pops that | top address from its internal stack. | | There's no need for "probably" here. The micro-architectural | mechanism is known as a return stack buffer[1] and is generally | separate from the branch predictor unit, though the processor may | make use of indirect branch prediction entries for returns as | well. | | [1] It is, indeed, a tiny little stack of return addresses and | indeed, the article hit performance issues by misaligning it. The | (Intel chips') RSB is behind the Retbleed vulnerabilities. | londons_explore wrote: | Observation: Almost any code, when micro-optimized, can gain | about 10x performance. | | So, if we had the time and energy, we could probably make all of | computing at least 10x faster. | | But we don't have the time or energy to dedicate that much effort | to every line of code... But perhaps AI does? | fluoridation wrote: | Not true by a long shot, unfortunately. Getting even a 100% | performance gain out of manual optimization is unusual. Usually | the only way to get significant gains is by either switching | algorithms or by relaxing problem requirements. | david2ndaccount wrote: | He didn't gain 10x from a micro optimization, he gained that | much by converting it to use SIMD which is a macro | optimization. You usually have to structure your program in | such a way to be SIMD friendly. In this case it was already | simd-friendly (adding a large number of floats). | Jensson wrote: | > In this case it was already simd-friendly (adding a large | number of floats). | | So it was micro optimization afterall! | lmm wrote: | It was a micro optimization because it was a toy example. | Doing floating-point arithmetic (and not already using BLAS | etc.) is very niche in real code. 
| Jensson wrote: | It is still micro optimization to ensure that those | instructions are used when the compiler is too dumb to | use them. You can say that they are niche, but that is | different from it being micro optimization. | shoo wrote: | > Almost any code, when micro-optimized, can gain about 10x | performance. | | well, it depends on where you start from. the speedups can | become arbitrarily impressive sounding when the starting point | is arbitrarily inefficient. | | e.g. if you're starting with a python script that wasn't | written with performance in mind and has ended up being | compute-bound in pure python number crunching code, if you | rewrite the thing in C while thinking a little about | appropriate data structures and memory allocation/access -- | i.e. replacing a festival of python dict lookups and very | frequent memory allocation for tiny intermediate results with | indexing into preallocated arrays or so on -- it's quite common | to see speedups of 500x - 1000x. this is before | micro-optimisation, before introducing SIMD or multi-threading or | setting the compiler to build for your exact CPU or so on. | Pet_Ant wrote: | Whenever you read/hear/think "AI" replace it with "a | probabilistic process". If you can validate the output then | it's really a beam search ( | https://en.wikipedia.org/wiki/Beam_search ) of the solution | space, and not really "AI". | | If you can't, then it's really a crap shoot as to whether the | output is at all valid. If we want to use more opaque | processes, I think we need more transparent outputs. If a | neural net can produce a machine checkable proof or supply it | with the optimisation that's great, but otherwise it's just hot | air. | celeritascelery wrote: | I don't think that is generally true. He only got a large | speedup because he used SIMD, which has nothing to do with | micro optimization.
I would say a better takeaway is that | micro optimization is really hard and you will often make | things worse if you don't know what you are doing. Even if you | do, you are only going to get a few percentage points. | thethirdone wrote: | My experience micro optimizing things is that even without | SIMD, most software can get at least a 5x improvement in | performance. With SIMD, you can often get 50x improvements. | | The reason why people think "Even if you do, you are only | going to get a few percentage points." is because it | generally takes 5-50x the developer time to optimize such | code. If it takes half a day to write naive code to do | something like validate utf8, it probably takes ~25 workdays | to make a fast SIMD version. If you instead spend an extra | half a day, there is a good chance you get a 10-50% speedup | using normal code. | mumumu wrote: | This is true of a few streaming applications (such as | parsing). | | And most of the speedup is because of tricks to avoid doing | branches. There is a great blog post from one of the authors | of simdjson discussing this. | | I'm on mobile; there is a link to the blog post on the | simdjson github repository. | bee_rider wrote: | It can also be weird and non-obvious. For example, depending | on the instruction mix and hardware it might not be worth | getting dinged by AVX-512 clocks. And if you are, say, | writing the UTF validation code as a library (more | favorable to put effort into a library!) you might not know | where the code is being used, so you might not even know | the instruction mix... | FullyFunctional wrote: | Well, _of course_, the Return Address Stack (RAS) predictor | maintains its own call stack and you need to understand how it | works. However, there's a subtler way to break it: recurse too | deeply. The RAS only has a fixed, small, and _implementation | dependent_ length.
If you use deep recursion with non-trivial | control flow (in particular multiple call sites), then the RAS | will start missing once you return from beyond that limit. | | Another consequence of the RAS is that coroutine switching is | more expensive than it might appear at first. RISC-V has | encoding hints to mark call(jal)/returns that are actually | coroutine switches, but the full cost can't be avoided. | gpderetta wrote: | you can mitigate the cost by not 'call'-ing into your coroutine | switch function but inlining the code into the surrounding | coroutine. As a bonus you get a bit better branch prediction on | your yield because distinct yields will share less state. | | Of course there is always going to be a penalty for stackful | coroutines that yield deep into a callstack. | sylware wrote: | I write assembly mainly _not_ because it is faster, but because I | don't depend on an absurdly complex and massive compiler. | packetlost wrote: | So... how does that work? What sort of work do you do that you | have the time to write raw ASM and still be productive? I'm | asking in earnest, because I'm curious what sort of workflows | still allow for writing ASM directly outside of very specific | cases (such as initializing embedded MCUs, for example) | jefftk wrote: | _> The SIMD code does come with one asterisk, though: because | floating-point addition is not associative, and it performs the | summation in a different order, it may not get the same result as | straight-line code. In retrospect, this is likely why the | compiler doesn't generate SIMD instructions to compute the sum!_ | | What if you set -funsafe-math-optimizations, which allows | "optimizations that allow arbitrary reassociations and | transformations with no accuracy guarantees"?
| makapuf wrote: | You could just turn gcc to 11 and use -Ofast | zokier wrote: | Both clang and gcc vectorize if you ask them nicely: | https://gcc.godbolt.org/z/xvjY8P4cM | [deleted] ___________________________________________________________________ (page generated 2023-01-25 23:00 UTC)