[HN Gopher] Do not taunt happy fun branch predictor
       ___________________________________________________________________
        
       Do not taunt happy fun branch predictor
        
       Author : mkeeter
       Score  : 169 points
       Date   : 2023-01-25 16:41 UTC (6 hours ago)
        
 (HTM) web link (www.mattkeeter.com)
 (TXT) w3m dump (www.mattkeeter.com)
        
       | efitz wrote:
        | I think that the days when hand-tuned assembly language
        | outperforms compiler-generated code are largely behind us (let
        | loose the contrary anecdotes).
       | 
       | Compilers and microprocessors are way more complex than they were
       | back in the 80s or 90s, and compiler engineers know way more
       | about how instructions are actually executed than the vast
       | majority of programmers.
        
         | jcalvinowens wrote:
         | I think about it like this:
         | 
         | If I ask you to write assembler faster than what the compiler
         | emits for one very specific CPU model, you'll pretty much
         | always be able to do that (given enough time).
         | 
         | But as CPUs become more complex, and compilers get better,
         | achieving that win requires more and more "overfitting" to
         | implementation details of the specific hardware, in ways that
         | are not helpful or even counterproductive on older or newer
         | models of the same CPU family.
         | 
         | It's still worth it sometimes, of course. It's just more work
         | and a lower value proposition on average.
        
         | ravi-delia wrote:
         | The only exception (pointed out in this post) is leveraging
         | SIMD instructions, which aren't nearly as well exploited by
         | compilers. I doubt, of course, that this will last all that
         | long, but for now there are too many subtle differences that
         | _might_ matter and totally different instruction sets between
          | CPUs and even generations.
        
         | makapuf wrote:
          | ... and yet in this article, the author beats compiled C code
          | by 10x with hand-tuned assembly. (By using SIMD and
          | unrolling, which the compiler did not. Granted, the
          | compiler's linear code is faster than hand-made linear
          | assembly.)
        
           | zokier wrote:
            | The author beats the compiler because of floating-point
            | constraints; with -ffast-math the compiler vectorizes the
            | code. I don't have arm64 hardware to test the result, but
            | it's probably again pretty fast:
            | https://gcc.godbolt.org/z/xvjY8P4cM
        
         | rowanG077 wrote:
         | I used to think that was true. Then I had a crypto course in
         | uni which required us to write and optimize 3 different hashing
         | and encryption algorithms.
         | 
          | I was stunned by the first one, which I first did in C and
          | then moved to ASM. The C code was pretty straightforward. But
          | it was trivial to beat in ASM by something like 120%.
          | 
          | That course taught me how bad compilers (or rather GCC) are at
          | high-level register optimization and using the barrel shifter.
          | Basically all the performance I squeezed out of ASM was because
          | the compiler just wasn't figuring basic stuff out. Mind you I
          | was (and am) not an ASM expert. That piece of code was the
          | first thing I ever wrote in ASM. And yet I was able to easily
          | beat a decades-old world class compiler.
        
         | wolf550e wrote:
          | Compiler autovectorization is poor; people often outperform
          | it when writing SIMD code. Intrinsics are so low level they
          | might as well be assembly.
        
           | JonChesterfield wrote:
           | I've had really good results from the two in LLVM (one works
           | on loops, one within basic blocks). Optimal loop body when
           | the pointers had alignment metadata attached, though at the
           | time it failed to unroll the tail.
           | 
           | Using intrinsics with the control flow in C or C++ works
           | really well - you get the right instruction selection from
           | the intrinsics, easy to reason about control flow and the
           | compiler deals with register allocation (and possibly
           | instruction scheduling) which are a pain to do by hand.
        
       | iamsmooney wrote:
       | Title is a reference to this old SNL skit:
       | https://www.youtube.com/watch?v=GmqeZl8OI2M
        
         | sophacles wrote:
         | It's definitely a classic - if you haven't seen it I'd
         | recommend it even if it wasn't related to the article :D
        
         | bioint7812 wrote:
         | More translation:
         | 
         | > for reasons
         | 
         | https://www.urbandictionary.com/define.php?term=for%20reason...
        
       | jerf wrote:
       | This is another good example of how our CPUs are in many ways
       | specialized C processors. C is a structured programming language
       | that uses functions, so our processors like functions. If you
       | jump out of that paradigm, even if the assembly instructions
       | nominally seem to allow it, you'll run more slowly. Even when it
       | seems like what you're offering is a shortcut to the CPU.
       | 
       | This is neither praise nor criticism of the current CPU paradigm;
       | it's just something you need to understand if you want the best
       | performance out of our machines.
       | 
       | A different paradigm, like a concatenative-paradigm-based
       | program, might naively be more inclined to compile into code that
       | looks more like what the author tried, jumping between
       | implementations of the stack operators without it actually being
       | "functions". One can imagine processors that would be "happier"
       | with that, and would be bothered by things that look like
       | function returns more. But that's not the CPUs we have.
        
         | masklinn wrote:
         | Author was not jumping out of the paradigm here, they were
         | deliberately misusing constructs specialised for the paradigm
          | (br/ret). That's like saying the toolbox is a specialised
          | screw processor because you're trying to drive nails with a
          | screwdriver and it does not go well.
          | 
          | And C is hardly the first or only procedural language.
        
         | ablob wrote:
         | This has more to do with the ISA than C I'd assume. C was built
         | as an "easy assembler". Furthermore, the computer doesn't
         | really care about structure (in the structured programming
         | sense).
         | 
         | In this case the implementation of the ISA stores information
         | on each 'ret' depending on some 'bl' that came before. One can
         | imagine that a different optimization technique which actually
         | leads to a speedup exists. Without a branch-predictor, the
         | program that Matt wrote might've been faster.
         | 
         | Imo, this has nothing to do with paradigms, but how control-
         | flow interacts with the system. This interaction is different
         | between different implementations of an architecture.
         | 
          | Code written for cache locality, and paradigms that work well
          | with it, for example, only became a "winner" after caches were
          | widely implemented. Before that, the cost of sequential array
          | access and random array access was identical. With caches, a
          | linear search for an element can be faster than a binary
          | search, even though it requires a lot more memory accesses.
          | Thus, the optimal implementation for finding an element in an
          | array is now dependent on size as well. (I.e. after an ordered
          | array reaches a certain size, binary search should become
          | faster on average).
        
           | int_19h wrote:
           | The point is that the ISA is assuming that BL and RET opcodes
           | come in pairs - which is an assumption that does reflect what
           | structured code is typically compiled down to. Going by the
           | semantics of the opcodes themselves, there's no reason why
           | RET should be treated differently from a simple indirect jump
           | here.
        
             | masklinn wrote:
             | > there's no reason why RET should be treated differently
             | from a simple indirect jump here.
             | 
             | There's _its very existence_. If you're looking for a
             | general purpose indirect jump, use that.
        
             | tsimionescu wrote:
             | As far as I understand, the ISA explicitly intends for BL
             | and RET to come in pairs. The only point of RET is to jump
             | to the address last stored by BL. If you don't need this
             | behavior, there's no reason to use RET - as the article
             | itself shows, B x30 does the exact same job, and doesn't
             | come with the extra assumptions.
        
           | classichasclass wrote:
           | I agree, because this semantic difference between br x30 and
           | ret doesn't exist in many other RISCs which also mostly run
           | C-like languages. Power just has blr, and MIPS jr $ra, for
           | example. This feels more like a well-intentioned footgun in
           | the ISA.
        
             | shadowofneptune wrote:
             | The potential population for that footgun is compiler
             | writers, who should safely fall into 'know what they are
             | doing' territory.
        
         | flohofwoe wrote:
          | It's just a natural side effect of software and hardware
          | forming a symbiosis; they cannot really be separated. New
          | software is (or should be!) built to run well on existing
          | hardware, and new hardware is built to run existing software
          | well. Besides, subroutine call instructions were a thing long
          | before C became mainstream, back when manual assembly coding
          | ruled supreme.
        
         | [deleted]
        
         | brigade wrote:
         | If you want to encode a series of indirect jumps between
         | concatenated functions, you can do that already and the return
         | address stack won't get involved; you'll simply get the normal
         | branch predictors.
         | 
         | But generalized branch prediction is expensive (multiple
         | entries in the BTB hashed by the previous program flow, and
         | mispredicts if the next function in the chain changes); the
         | point of an RAS is that it's a _very_ cheap way to keep some
         | 100% predictable branches from using those general resources.
         | Concatenation pretty much requires branches without unique
         | characteristics, so no fast path shortcuts and everything is a
         | bit slower.
        
       | pranith wrote:
       | Great investigative work! The stack structure you refer to here
       | is called the Return Address Stack (RAS).
        
       | ShroudedNight wrote:
       | I feel like this treatment is incomplete without having tested
       | the scenario where the unmatched ret is replaced with a br lr.
       | 
       | EDIT: Reading the documentation after the fact, it appears that
       | that was what br x30 was - naively I had interpreted the hex as a
       | fixed offset to a label.
        
       | snerbles wrote:
       | While I was in undergrad I toyed around with abusing the branch
       | predictor on a few different machines, compiling something like
       | the following with optimizations off - it performs an identical
        | computation regardless of branch outcome:
        | 
        |     void branchLoop(unsigned int condition, unsigned int &sum)
        |     {
        |         // put something suitably large here
        |         unsigned int loopCount = 0x0fffffff;
        |         unsigned int i;
        |         // compile with -O0 or this gets optimized away
        |         for (i = 0; i < loopCount; i++)
        |             if ((i & condition) == 0)
        |                 sum++;
        |             else
        |                 sum++;
        |     }
       | 
       | The Core Duo on my Thinkpad T60 had some very distinct slowdowns
       | on certain bit patterns, which were not repeatable on the handful
       | of other CPUs I had access to at the time. I haven't tried this
       | with more modern CPUs, however.
        
         | gpderetta wrote:
          | Predictors are getting better and better at recognizing long
          | patterns (sometimes at the cost of not being optimal with
          | short patterns).
        
       | Malic wrote:
       | Ow. My head hurts.
       | 
       | And this is why optimizing compilers are some of the most complex
       | programs there are. (or so I have been taught)
        
         | shadowgovt wrote:
         | It's also why the modern rule of thumb is "don't optimize by
         | writing your own assembly."
         | 
          | The rule is a boiled-down version of the larger notion "Don't
          | optimize by writing your own assembly, _because even with
          | domain knowledge of the problem you're trying to solve, you're
          | probably not more clever than the engineer-decades that went
          | into building your compiler toolchain and processor
          | architecture, unless you're an expert in both fields, in which
          | case good luck and shoulder that maintenance burden._"
         | 
         | The rule of thumb drops a lot of detail on the ground but is a
         | good first approximation.
        
           | astrobe_ wrote:
            | ... Which is kind of worrying; is it really a good thing
            | that processors are so complex that you need "decades" to
            | use them fully? Bottom line, you end up with chaotic (in
            | the "sensitive to the slightest change" sense) performance
            | behavior.
           | 
           | OTOH this reminds of another saying, "don't roll your own
           | crypto". But all those "don't" are a bit frustrating.
        
             | olliej wrote:
             | But you don't need decades of experience: we have compilers
             | and optimizers to do that.
        
               | bioint7812 wrote:
                | It's interesting that the impedance the author was
                | experiencing was one of the CPU incorrectly "speculating"
                | about the intent. We as readers are left to speculate
                | about the problem being solved by the author.
               | 
               | Based on the content of his recent articles, we could
               | assume he is continuing his development of a JIT for a
               | Rust port of his graphics engine. Given that assumption,
               | I would argue that the compiler writers are lagging here
               | --specifically lack of dynamic compilation. For example,
               | is there a JIT compiler for the Rust language? I was
               | thinking about reproducing his experiment in SBCL, which
               | does have dynamic compilation--although it wouldn't be a
               | real "apples" to "apples" comparison because my work
               | machine is a x86_64.
        
               | olliej wrote:
               | JITs have very different constraints (but also many
               | advantages) vs AOT compilers, so I don't think in the
               | general case a language compiler can "support" JITs
               | directly. llvm/clang for instance have a jit mode... for
               | compiling C/C++, not some other language.
               | 
               | Also obviously the compiler writers (AOT or JIT) need to
               | know about and understand very minute details of how a
               | cpu is behaving. I was responding to a person saying it
               | was worrying that devs need that kind of knowledge and
               | experience by saying that that isn't true because the
               | tools exist (I feel that there's some analogy to what
               | knowledge you need for a car: changing oil, servicing
               | engine, replacing engine, designing an engine...). Once
               | you are writing a JIT you're in the "making the tools"
               | group, so you now need to know more than the vast
               | majority of devs (not just more than the "average" dev)
               | that's inescapable.
        
             | miloignis wrote:
             | I've seen people get frustrated by the "don't"s before, but
             | I think that's generally taking the first degree
             | approximation too literally. Feel free to hand-write
              | assembly or roll your own crypto, but _don't_ depend on it
             | for anything serious unless you are an expert or have it
             | reviewed by one. Doing so for learning and fun is fine, if
             | that's clearly called out such that no one accidentally
             | depends on it for something serious. There's only one way
             | to become good at something, and that's good practice!
             | 
             | In a professional setting there's a responsibility to the
             | end user which generally precludes doing these dangerous
             | things - that is, one should feel free to take up
             | woodworking as a hobby but shouldn't offer to build
             | someone's house unless you're a licensed professional.
        
             | [deleted]
        
           | pjdesno wrote:
           | Maybe better expressed as "don't write your own assembly
           | unless you know why it might be better than the compiler".
           | 
           | Trying to beat the compiler at optimizing normal code is kind
           | of like trying to beat a calculator with paper and pencil -
           | computers are just better at that sort of thing than people
           | are.
           | 
           | One use case is where you want to use bizarre CPU functions
           | (popcount, encryption, load CR3, etc.) that no one's taught
           | the compiler how to generate, although for some of them you
           | might be better off using compiler intrinsics.
           | 
           | Another is when you're dealing with things _underneath_ the
           | language abstraction, like the Boost co-routines mentioned in
           | a link a few comments above. Of course, if the language folks
           | decide to add the capability you want (e.g. C++20
            | coroutines), you're back to being better off using the
           | compiler.
           | 
           | Finally there are pedagogical reasons, e.g. sometimes I show
           | my classes the world's simplest "Hello World", using a few
           | lines of assembler to invoke the write and exit syscalls.
        
             | JonChesterfield wrote:
             | The boost coroutines mentioned are not the same thing as
             | the C++20 ones. Boost captures the stack, that's what the
             | register shuffling is for. C++ is a compiler transform that
             | moves some state onto the heap, but probably not the whole
             | stack, and builds a switch dispatch style thing to
              | transform the control flow. This is why C++20 coroutines
              | come with co_* annotations and Boost's don't.
        
           | avgcorrection wrote:
           | I will never understand the trend of "using quotes around
           | things". Which is a shorthand version for "using quotes
           | around things that I wanted to point to and say, hey, this is
           | something that "goes over here", you know, inside these
           | quotes, to make sure that you understand exactly what I'm
           | delimiting, since _using commas, and semicolons, and colons
           | wouldn't fit for some reason. Oh look the thing that I'm
           | quoting now consumes 80% of this paragraph. But this is way
           | better than just saying "the modern rule of thumb is to not
           | optimize by writing your own assembly." Because then it isn't
           | 100% clear that the rule of thumb is delimited by (exactly)
           | "[do] not optimize by writing your own assembly." Ya see?_ "
        
           | RodgerTheGreat wrote:
           | I would just phrase it as "if you optimize by writing your
           | own assembly, don't expect your program or its performance to
           | be portable."
        
         | bob1029 wrote:
         | The state of modern compilers is pretty staggering to me. I've
         | seen some code folding in RyuJIT that makes me feel inferior as
         | a developer.
         | 
          | You've got a few compilers (Java, .NET, et al.) which are
         | capable of re-compiling hot path code during live execution and
         | then seamlessly transitioning to those paths. This
         | recompilation can be based upon the statistics of the live
         | process, so it's almost like a sort of adaptive AI. Which paths
         | are hot in production does not need to be known at compile time
         | with these approaches.
        
         | JonChesterfield wrote:
         | There's a sense in which they're complicated. It's a sequence
         | of graph transforms which mostly deal with non-polynomial time
         | problems using heuristics, where mistakes in the transforms can
         | manifest quite a long way away from the error. There's a
         | significant risk that they're implemented in languages unique
         | to that compiler toolchain, as compiler devs are quite prone to
         | solving problems by writing compilers.
         | 
         | There's also a sense in which they're really simple. The input
         | format and output format are (usually) well defined. The user
         | interface is largely printing things to stderr and giving up,
         | possibly in a retry loop when there's an editor involved. The
         | program dependency graph is often quite small so the bug you're
         | looking at is probably in the source code you checked out.
         | Security is not the dominant concern you have elsewhere.
        
       | gpderetta wrote:
       | Yes, never push and ret. Here is something I wrote (/me checks
       | calendar) more than 15 years ago about optimizing coroutine
       | control flow:
       | https://www.crystalclearsoftware.com/soc/coroutine/coroutine...
        
       | water-your-self wrote:
       | If an HN reader wanted to play around with similar digging, what
       | would be the essential tools to be aware of and where best could
       | he start?
       | 
       | Assuming prior knowledge of assembly/C but without much
       | experience decompiling or testing speed.
        
         | shoo wrote:
          | Learn how to use a decent profiler. If you're running Linux,
          | that's probably perf:
         | 
         | https://man7.org/linux/man-pages/man1/perf.1.html
         | 
         | https://www.brendangregg.com/perf.html
         | 
         | Here's a fun article from the cloudflare blog that gives an
         | example of using of perf to diagnose performance of a small
         | utility: https://blog.cloudflare.com/when-bloom-filters-dont-
         | bloom/
         | 
         | Matt Godbolt's compiler explorer is also worth checking out:
         | https://godbolt.org/
        
         | dcow wrote:
         | Your compiler can spit out assembly, you just need to know how
         | to read it. Sounds like the author was also using Xcode
         | Instruments
         | https://help.apple.com/instruments/mac/current/#/dev7b09c84f...
         | to check cpu counters. And they were using criterion
         | https://crates.io/crates/criterion to microbenchmark.
         | 
         | My guess would be that the author is porting some C code to
         | Rust and making sure not to regress performance along the way
         | (probably hopefully trying to increase it). Likely their
         | program was written in Rust and the section they were trying to
          | optimize called some old C code. Sounds like they rewrote the
          | section in Rust, since Rust <-> C FFI calls break out of the
          | happy realm the Rust compiler likes and end up causing a
          | performance hit themselves. You can write inline assembly in
          | Rust using the asm! macro: https://doc.rust-
          | lang.org/reference/inline-assembly.html.
        
       | CalChris wrote:
       | It took me a while to understand the mismatched bl/ret pairs
       | because I'm used to reading matched bl/ret pairs. This confusion
       | has to be similar to what the silicon is 'thinking': _Human, I
        | see matched bl/ret pairs all the time and I'm good at them. Why
       | are you giving me mismatched pairs? I'll do the right thing but
       | I'm not so good at them._
       | 
       | Still, this seems like function inlining. But why not just inline
       | and use a regular branch loop? Is _foo()_ also being called from
       | elsewhere? Is space at a premium?
        
         | masklinn wrote:
         | > Upon seeing this program, it's a common reaction to ask "why
         | is foo a subroutine at all?"
         | 
         | > The answer is "because this is a didactic example, not code
         | that's trying to go as fast as possible".
        
       | ogogmad wrote:
        | Minor erratum: Floating point addition actually _is_ commutative;
        | it's in fact non-associative.
        
         | lmm wrote:
         | It's not commutative; -0.0 + 0.0 = -0.0 but 0.0 + -0.0 = 0.0.
        
         | dekhn wrote:
         | with some significant exceptions, such as NaNs.
        
           | recursive wrote:
           | Can you think of some `x` where `x + NaN` is not identical to
           | `NaN + x`? I can't.
        
             | dekhn wrote:
             | you mean, like 1 + NaN = NaN and NaN + 1 = NaN, but NaN !=
             | NaN? (I'm not a numerical expert, just repeating what
             | others have told me)
        
               | CountSessine wrote:
                | Yes. A NaN in IEEE 754 has all 1's in the exponent, and
                | then the high bit of the mantissa determines whether it's
                | quiet or signalling, but the rest of the mantissa is/can
                | be a "payload".
        
               | bruce343434 wrote:
               | Sign bit determines signalling
        
               | robocat wrote:
               | Please double check your facts before disagreeing with
               | somebody so abruptly.
               | 
               | Sign bit is NOT the signalling/quiet bit. Bit 51 (edit or
               | bit 50 - damn ***** IEEE for not publishing important
               | standards for free public access) is according to the
               | first result I looked at:
               | https://craftinginterpreters.com/optimization.html
               | 
                | Edit 2 from IEEE 754 (2008 version):
                | 
                |     6.2.1 NaN encodings in binary formats
                | 
                |     This subclause further specifies the encodings of
                |     NaNs as bit strings when they are the results of
                |     operations. When encoded, all NaNs have a sign bit
                |     and a pattern of bits necessary to identify the
                |     encoding as a NaN and which determines its kind
                |     (sNaN vs. qNaN). The remaining bits, which are in
                |     the trailing significand field, encode the payload,
                |     which might be diagnostic information (see above).
                | 
                |     All binary NaN bit strings have all the bits of the
                |     biased exponent field E set to 1 (see 3.4). A quiet
                |     NaN bit string should be encoded with the first bit
                |     (d1) of the trailing significand field T being 1. A
                |     signaling NaN bit string should be encoded with the
                |     first bit of the trailing significand field being
                |     0. If the first bit of the trailing significand
                |     field is 0, some other bit of the trailing
                |     significand field must be non-zero to distinguish
                |     the NaN from infinity. In the preferred encoding
                |     just described, a signaling NaN shall be quieted by
                |     setting d1 to 1, leaving the remaining bits of T
                |     unchanged.
                | 
                |     6.3 The sign bit
                | 
                |     When either an input or result is NaN, this
                |     standard does not interpret the sign of a NaN.
                |     Note, however, that operations on bit strings--
                |     copy, negate, abs, copySign--specify the sign bit
                |     of a NaN result, sometimes based upon the sign bit
                |     of a NaN operand. The logical predicate totalOrder
                |     is also affected by the sign bit of a NaN operand.
                |     For all other operations, this standard does not
                |     specify the sign bit of a NaN result, even when
                |     there is only one input NaN, or when the NaN is
                |     produced from an invalid operation.
                | 
                |     When neither the inputs nor result are NaN, the
                |     sign of a product or quotient is the exclusive OR
                |     of the operands' signs; the sign of a sum, or of a
                |     difference x-y regarded as a sum x+(-y), differs
                |     from at most one of the addends' signs; and the
                |     sign of the result of conversions, the quantize
                |     operation, the roundToIntegral operations, and the
                |     roundToIntegralExact (see 5.3.1) is the sign of the
                |     first or only operand. These rules shall apply even
                |     when operands or results are zero or infinite.
                | 
                |     When the sum of two operands with opposite signs
                |     (or the difference of two operands with like signs)
                |     is exactly zero, the sign of that sum (or
                |     difference) shall be +0 in all rounding-direction
                |     attributes except roundTowardNegative; under that
                |     attribute, the sign of an exact zero sum (or
                |     difference) shall be -0. However, x + x = x - (-x)
                |     retains the same sign as x even when x is zero.
                | 
                |     When (a x b) + c is exactly zero, the sign of
                |     fusedMultiplyAdd(a, b, c) shall be determined by
                |     the rules above for a sum of operands. When the
                |     exact result of (a x b) + c is non-zero yet the
                |     result of fusedMultiplyAdd is zero because of
                |     rounding, the zero result takes the sign of the
                |     exact result. Except that squareRoot(-0) shall be
                |     -0, every numeric squareRoot result shall have a
                |     positive sign.
               | 
               | I.e. you are definitely wrong. The sign bit can be + or -
               | for NaN (presumably a side-effect of the encoding for
               | +/-Infinity ). And then that leads to a bunch of arse
               | (section 6.3) because the spec needs to decide what
               | happens to the sign bit in a bunch of different
               | situations. PS: fucking infinity. Infinity should have
               | been NaN. Infinity [?] Infinity, except in in the
               | egghead-land IEEE (side note: egghead is a compliment
               | IMHO). Mind you, easy to see mistakes in retrospect, but
               | corner cases are shit in programming. I do like NaN,
               | although reading comments here, and the IEEE spec, forces
                | me to learn how little I know about NaN encodings. Oh,
                | and
               | any NaN should equal any other NaN. Mathematically
               | obviously not, but logically yes and IEEE is for
               | programming. NaN is already defined as a nonsense, so at
               | least keep the nonsense consistent. Changing _if =_ to
                | _if ≠_ should not introduce subtle logic bugs.
               | 
               | Ranting edit #755: and while we are at it, -0 is an
               | abomination in the eyes of the Great Architect in the
               | Matrix - it should never have been allowed - perhaps -0
               | should have been NaN with signalling bits in the exponent
               | (even though that would prevent some language virtual
               | machine optimisations where 53 bits of NaN get used to
               | pack other information, but the win would be compelling
                | because reducing bugs due to special cases is huge
                | IMHO).
               | How many developers understand IEEE corner cases: fuck
               | all in my long experience.
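
The signed-zero rules quoted above (and the NaN-equality complaint) are easy to poke at interactively; a small Python sketch, assuming CPython on ordinary IEEE 754 hardware:

```python
import math

neg = -0.0
# -0 and +0 compare equal, but the sign bit is observable via copysign.
assert neg == 0.0
assert math.copysign(1.0, neg) == -1.0
# An exactly-zero sum gets sign +0 under the default rounding direction,
# per the section 6.3 text quoted above.
assert math.copysign(1.0, neg + 0.0) == 1.0
# NaN never compares equal to anything, including itself.
nan = float("nan")
assert nan != nan
```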
        
               | bruce343434 wrote:
               | The source I had in my head as I was replying was
               | https://posithub.org/docs/Posits4.pdf pages 31/32 which
                | implies that the sign bit is responsible for
                | signallingness. Neither of these is a primary source,
                | however.
               | 
               | A random stackoverflow answer without sources seems to
               | confirm your pov. Now I don't know what to believe. How
               | is the sign bit used in NaNs? Would they really waste
               | that bit?
        
               | stephencanon wrote:
               | Hi, former IEEE 754 committee member here: languages and
               | architectures are allowed to choose how they encode
               | signalingness, but they cannot use the sign bit to do it.
                | The most common choice is to use the high-order bit of
                | the significand field, but most choices you might make
                | are represented by some architecture.
        
               | robocat wrote:
                | Awesome! So the important part of section 6.2.1 (2008)
                | is that the "should" is not a "must"?
                | 
                |     "A quiet NaN bit string should be encoded with the
                |     first bit (d1) of the trailing significand field T
                |     being 1. A signaling NaN bit string should be
                |     encoded with the first bit of the trailing
                |     significand field being 0."
               | 
               | Or is it that some implementations before 2008 used other
               | bits?
               | 
               | PS: I might be complaining, but I do sincerely thank you
               | for your hard work. I certainly would not want to be on a
               | standards committee, so I sincerely appreciate the
                | efforts of the committee members who fight so hard to
                | make
               | things better. Your comment just goes to show how many
               | corner cases there are to the corner cases!!
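
As an illustrative sketch of that encoding split, assuming CPython on x86 or ARM (where the default NaN is quiet, with the first trailing-significand bit set, following the 2008 "should"; other architectures may differ, as noted above):

```python
import struct

def f64_bits(x: float) -> int:
    """Raw IEEE 754 binary64 bit pattern of x."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

qnan = f64_bits(float("nan"))
# The sign bit (bit 63) of the default NaN is clear on these platforms...
assert qnan >> 63 == 0
# ...and the quiet bit (bit 51, the first bit of the trailing
# significand) is set.
assert (qnan >> 51) & 1 == 1
# Negating a NaN flips only the sign bit; the result stays quiet, so
# quietness cannot live in the sign bit.
assert f64_bits(-float("nan")) >> 63 == 1
assert (f64_bits(-float("nan")) >> 51) & 1 == 1
```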
        
             | MaulingMonkey wrote:
             | If x is another NaN with a different payload, the result is
             | likely a NaN with a payload corresponding to the left side
             | of the addition.
        
               | raphlinus wrote:
                | That is correct; here is a playground link which shows
               | that result: https://play.rust-
               | lang.org/?version=stable&mode=debug&editio...
        
               | stephencanon wrote:
               | This varies tremendously across architectures and uarches
               | and compilers. The result will be a quiet NaN, and you
               | can't say anything beyond that.
        
             | jcranmer wrote:
             | It depends on your model of NaN. If you think there is only
             | one NaN value, then the two values return the same result.
             | If there are multiple distinct NaN values (i.e., you
             | distinguish between NaNs with different payloads), then the
             | resulting payload of an arithmetic operation is... not
             | well-specified.
             | 
              | Most hardware architectures (x86 and ARM fall into this
             | category) pick the rule that the payload is the Nth operand
             | (usually first, sometimes second) when multiple inputs are
             | NaN. I believe there's some hardware where the lesser of
             | the payloads (converted to an integer) is picked instead.
             | RISC-V dispenses with NaN payload propagation entirely.
             | There is theoretically the ability for hardware to generate
             | a new unique NaN with the hardware address of the
              | instruction in the payload, but I don't believe any
             | hardware actually does it.
             | 
             | Most programming languages do not generally model NaN with
             | greater fidelity than "there is a NaN value" or maybe
             | "there is a distinction between qNaN and sNaN."
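
A sketch of crafting NaN payloads and watching what an operation does to them; the helper names here are my own, and the x86/ARM quiet-bit convention is assumed:

```python
import math
import struct

PAYLOAD_MASK = (1 << 51) - 1  # payload bits below the quiet bit

def qnan_with_payload(p: int) -> float:
    """Build a quiet NaN carrying payload p (x86/ARM encoding)."""
    bits = 0x7FF8000000000000 | (p & PAYLOAD_MASK)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def payload(x: float) -> int:
    """Extract the payload bits from a double's bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0] & PAYLOAD_MASK

a, b = qnan_with_payload(42), qnan_with_payload(7)
assert math.isnan(a) and payload(a) == 42
# Which payload survives a + b is architecture-dependent: x86 and ARM
# typically propagate an operand's payload, RISC-V returns the canonical
# NaN. Don't rely on any particular answer.
print(payload(a + b))
```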
        
         | [deleted]
        
         | marcosdumay wrote:
         | Yes, it's commutative. But it is associative.
         | 
         | Make a number with 3 bits of mantissa, and go add a thousand
         | repetitions of 1. You can get hundreds of different results
         | depending on the order you add them up.
        
           | wizzwizz4 wrote:
           | > _But it is associative._
           | 
           | You're describing a non-associative operation.
        
             | ogogmad wrote:
             | Exactly. See the definition:
             | https://mathworld.wolfram.com/Associative.html
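
A minimal sketch of that non-associativity with ordinary doubles: at 1e16 the spacing between doubles is 2, so adding 1.0 at a time ties and rounds back down, while pre-accumulating the 1.0s lets them survive.

```python
big, one = 1e16, 1.0
left = (big + one) + one   # each 1.0 is a round-to-even tie: stays 1e16
right = big + (one + one)  # 1e16 + 2.0 is exactly representable
assert left != right       # association order changed the result
```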
        
           | [deleted]
        
         | mkeeter wrote:
         | Thanks, fixed!
        
       | titzer wrote:
       | > More specifically, the branch predictor probably keeps an
       | internal stack of function return addresses, which is pushed to
       | whenever a bl is executed. When the branch predictor sees a ret
       | coming down the pipeline, it assumes that you're returning to the
       | address associated with the most recent bl (and begins
       | prefetching / speculative execution / whatever), then pops that
       | top address from its internal stack.
       | 
       | There's no need for "probably" here. The micro-architectural
       | mechanism is known as a return stack buffer[1] and is generally
       | separate from the branch predictor unit, though the processor may
       | make use of indirect branch prediction entries for returns as
       | well.
       | 
       | [1] It is, indeed, a tiny little stack of return addresses and
       | indeed, the article hit performance issues by misaligning it. The
       | (Intel chips') RSB is behind the Retbleed vulnerabilities.
        
       | londons_explore wrote:
       | Observation: Almost any code, when micro-optimized, can gain
       | about 10x performance.
       | 
       | So, if we had the time and energy, we could probably make all of
       | computing at least 10x faster.
       | 
       | But we don't have the time or energy to dedicate that much effort
       | to every line of code... But perhaps AI does?
        
         | fluoridation wrote:
         | Not true by a long shot, unfortunately. Getting even a 100%
         | performance gain out of manual optimization is unusual. Usually
         | the only way to get significant gains is by either switching
         | algorithms or by relaxing problem requirements.
        
         | david2ndaccount wrote:
         | He didn't gain 10x from a micro optimization, he gained that
         | much by converting it to use SIMD which is a macro
         | optimization. You usually have to structure your program in
         | such a way to be SIMD friendly. In this case it was already
         | simd-friendly (adding a large number of floats).
        
           | Jensson wrote:
           | > In this case it was already simd-friendly (adding a large
           | number of floats).
           | 
            | So it was micro optimization after all!
        
             | lmm wrote:
             | It was a micro optimization because it was a toy example.
             | Doing floating-point arithmetic (and not already using BLAS
             | etc.) is very niche in real code.
        
               | Jensson wrote:
               | It is still micro optimization to ensure that those
               | instructions are used when the compiler is too dumb to
                | use them. You can say that they are niche, but that is
                | a separate question from whether it is micro
                | optimization.
        
         | shoo wrote:
         | > Almost any code, when micro-optimized, can gain about 10x
         | performance.
         | 
         | well, it depends on where you start from. the speedups can
         | become arbitrarily impressive sounding when the starting point
         | is arbitrarily inefficient.
         | 
         | e.g. if you're starting with a python script that wasn't
         | written with performance in mind and has ended up being
         | compute-bound in pure python number crunching code, if you
         | rewrite the thing in C while thinking a little about
         | appropriate data structures and memory allocation/access --
         | i.e. replacing a festival of python dict lookups and very
         | frequent memory allocation for tiny intermediate results with
         | indexing into preallocated arrays or so on -- it's quite common
         | to see speedups of 500x - 1000x. this is before micro-
         | optimisation, before introducing SIMD or multi-threading or
         | setting the compiler to build for your exact CPU or so on.
        
         | Pet_Ant wrote:
         | Whenever you read/hear/think "AI" replace it with "a
         | probabilistic process". If you can validate the output then
         | it's really a beam search (
         | https://en.wikipedia.org/wiki/Beam_search ) of the solution
         | space, and not really "AI".
         | 
         | If you can't, then it's really a crap shoot as to whether the
         | output is at all valid. If we want to use more opaque
         | processes, I think we need more transparent outputs. If a
          | neural net can produce a machine-checkable proof to supply
          | along with the optimisation, that's great, but otherwise
          | it's just hot air.
        
         | celeritascelery wrote:
         | I don't think that is generally true. He only got a large
         | speedup because he used SIMD, which has nothing to do with
         | micro optimization. I would say a better take away is that
         | micro optimization is really hard and you will often make
         | things worse if you don't know what you are doing. Even if you
         | do, you are only going to get a few percentage points.
        
           | thethirdone wrote:
           | My experience micro optimizing things is that even without
           | SIMD, most software can get at least a 5x in performance.
           | With SIMD, you can often get 50x improvements.
           | 
              | The reason why people think "Even if you do, you are only
           | going to get a few percentage points." is because it
           | generally takes 5-50x the developer time to optimize such
           | code. If it takes half a day to write naive code to do
           | something like validate utf8, it probably takes ~25 workdays
           | to make a fast SIMD version. If you instead spend an extra
           | half a day, there is a good chance you get a 10-50% speedup
           | using normal code.
        
             | mumumu wrote:
              | This is true of a few streaming applications (such as
              | parsing).
              | 
              | And most of the speedup is because of tricks to avoid
              | branches. There is a great blog post from one of the
              | authors of simdjson discussing this.
              | 
              | I'm on mobile, but there is a link to the blog post in
              | the simdjson GitHub repository.
        
             | bee_rider wrote:
              | It can also be weird and non-obvious. For example,
              | depending
             | on the instruction mix and hardware it might not be worth
             | getting dinged by AVX-512 clocks. And if you are, say,
             | writing the UTF validation code as a library (more
             | favorable to put effort into a library!) you might not know
             | where the code is being used, so you might not even know
             | the instruction mix...
        
       | FullyFunctional wrote:
       | Well, _of course_ , the Return Address Stack (RAS) predictor
       | maintains its own call stack and you need to understand how it
       | works. However, there's a subtler way to break it: recurse too
       | deeply. The RAS only has a fixed, small, and _implementation
       | dependent_ length. If you use deep recursion with non-trivial
       | control flow (in particular multiple call sites), then the RAS
        | will start missing once you return from beyond that limit.
       | 
       | Another consequence of the RAS is that co-routines switching is
       | more expensive than they might appear at first. RISC-V has
       | encoding hints to mark call(jal)/returns that are actually co-
       | routine switching but the full cost can't be avoided.
        
         | gpderetta wrote:
         | you can mitigate the cost by not 'call'-ing into your coroutine
         | switch function but inlining the code into the surrounding
         | coroutine. As a bonus you get a bit better branch prediction on
         | your yield because distinct yields will share less state.
         | 
          | Of course there is always going to be a penalty for stackful
         | coroutines that yield deep into a callstack.
        
       | sylware wrote:
       | I write assembly mainly _not_ because it is faster, but because I
        | don't depend on an absurdly complex and massive compiler.
        
         | packetlost wrote:
         | So... how does that work? What sort of work do you do that you
         | have the time to write raw ASM and still be productive? I'm
         | asking in earnest, because I'm curious what sort of workflows
         | still allow for writing ASM directly outside of very specific
         | cases (such as initializing embedded MCUs, for example)
        
       | jefftk wrote:
       | _> The SIMD code does come with one asterisk, though: because
       | floating-point addition is not associative, and it performs the
       | summation in a different order, it may not get the same result as
       | straight-line code. In retrospect, this is likely why the
       | compiler doesn 't generate SIMD instructions to compute the sum!_
       | 
       | What if you set -funsafe-math-optimizations, which allows
       | "optimizations that allow arbitrary reassociations and
       | transformations with no accuracy guarantees"?
        
         | makapuf wrote:
         | You could just turn gcc to 11 and use -Ofast
        
           | zokier wrote:
           | Both clang and gcc vectorize if you ask them nicely:
           | https://gcc.godbolt.org/z/xvjY8P4cM
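
The reassociation those flags permit is easy to demonstrate without a compiler; a Python sketch comparing a strict left-to-right sum against a pairwise (SIMD-style) reassociation, where `pairwise` is my own toy helper, not a library function:

```python
# Eight doubles whose left-to-right sum differs from a reassociated sum:
# each 1.0 added to 1e16 one at a time is a round-to-even tie and is
# lost, but pre-accumulated 1.0s are large enough to survive.
xs = [1e16] + [1.0] * 7

seq = 0.0
for v in xs:          # strict source order, what -O2 must preserve
    seq += v

def pairwise(v):
    """Recursive pairwise sum, reassociating like a vectorized reduction."""
    if len(v) == 1:
        return v[0]
    mid = len(v) // 2
    return pairwise(v[:mid]) + pairwise(v[mid:])

assert seq != pairwise(xs)  # same inputs, different association, different sum
```

This is exactly why the compiler refuses to vectorize the reduction by default: doing so would silently change the program's result.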
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-01-25 23:00 UTC)