[HN Gopher] CPU branch prediction evolution over years (2017)
       ___________________________________________________________________
        
       CPU branch prediction evolution over years (2017)
        
       Author : Keyframe
       Score  : 48 points
       Date   : 2022-12-31 20:46 UTC (2 hours ago)
        
 (HTM) web link (danluu.com)
 (TXT) w3m dump (danluu.com)
        
       | BoardsOfCanada wrote:
       | I could have sworn that PPro used 2-bit saturating counters, but
       | I checked and: "The P6 rejects the commonly used Smith algorithm,
       | which maintains four states using two bits, in favor of the more
       | recent Yeh method[1]. This adaptive algorithm uses four bits of
       | branch history and can recognize and predict repeatable sequences
       | of branches, for example, taken-taken-not taken."
       | 
       | One other thing I wondered about is if there have been
       | improvements to the recovery of mispredicted branches that have
       | not yet been retired. It seems like the huge OOO-windows of the
       | latest Apple and Intel CPUs would see diminishing returns
       | otherwise. Anyone knows?
        
         | brigade wrote:
         | Yeah, these days it's likely for only the frontend to stall
         | during misprediction recovery. Already decoded uops from before
         | the branch still make their way through the backend as normal.
         | If you have enough of those (long dependency chains or the
         | branch was reordered early enough), it's possible that the
         | misprediction has effectively no penalty.
         | 
         | As a corollary, this means that when possible you want the
         | dependency chain for branches to be as independent and short as
         | possible, so the CPU can reorder and resolve branches as early
         | as possible.
        
       | 082349872349872 wrote:
       | Is the moral of CPU branch prediction that generating dynamically
       | branch-free code is preferable to generating statically branch-
       | free?
        
         | mhh__ wrote:
         | Avoiding either > Easily predicted branch > conditional move >
         | difficult to predict branch
        
           | masklinn wrote:
           | > Avoiding either > Easily predicted branch
           | 
           | Branchless code can easily out-cost a well-predicted branch
           | (and its content if any).
        
             | mhh__ wrote:
             | Of course, perhaps this should be reduced to "keep an eye
             | out for silly branches"
        
             | taeric wrote:
             | This is where I was trying to use sorting networks as a fun
             | example. Without independent compare/swap units, they will
             | go slower than sequential sorting. Despite not necessarily
             | "branching."
        
         | marcosdumay wrote:
         | Well, reducing the share number of branches in your code has
         | exactly the same impact as improving the predictor's
         | performance. So, no I don't think you can get this conclusion
         | from the article.
         | 
         | The article is only biased into improving the CPU and ignoring
         | any opportunity of improving your program. What is
         | understandable, because it's about CPU architectures.
         | 
         | In practice, you will want to improve whatever part you have
         | the freedom to change. You usually can not choose.
        
           | mungoman2 wrote:
           | The branches still take (instruction) cache. So it _is_
           | beneficial to reduce the number of branches, even if they
           | could have been predicted accurately.
        
             | taeric wrote:
             | That said, I have seen people tricked into thinking
             | branches are only if statements. If using polymorphism to
             | reduce if statements, you'll still want to reduce the
             | variations of the types sent through the code. Well,
             | strictly you just want the variations to be predictable, I
             | suppose?
        
         | jleahy wrote:
         | That's certainly true, a correctly predicted branch costs
         | little more than a nop or unconditional jmp.
         | 
         | Code doesn't have to be branch free, just branch in a
         | predictable way.
         | 
         | Statically branch free code is great when you have things that
         | are genuinely unpredictable, think branches driven by data. You
         | might reach for min/max as an example of that, but in most
         | cases these tend to be highly predictable. Control flow in a
         | decompressor is probably a better example (eg. literal vs
         | copy).
        
           | mhh__ wrote:
           | A correctly predicted branch, in a sense can cost less than a
           | nop in that it does actual work _and_ requires no additional
           | effort from the backend.
        
         | jasonwatkinspdx wrote:
         | The way Linus put it in an old newsgroup posting from his
         | Transmeta days was something like: "the job of a cpu is to
         | generate the minimum number of cache misses as early as
         | possible."
        
         | taeric wrote:
         | I would think it is less on the "don't have branches" but more
         | "try to have a happy/common path." This is oddly at odds with
         | how humans work as operators. For human operators, you
         | typically want each interaction to be as unpredictable as
         | possible, to keep their attention. Falling into rote action is
         | a sure way to get a mistake out of a person. (Or to get them to
         | quit.)
         | 
         | However, for computers, it is good to view them as a machine.
         | As most machines, they are best if they are doing the exact
         | same thing over and over. And over.
         | 
         | As such, if you can contrive for the machine to always take the
         | same branch the vast majority of the time, you will get it to
         | work faster. Whether this is done with a dynamic arrangement of
         | the data, or a static arrangement of the code depends largely
         | on the specifics. Consider sorting networks as a fun
         | exploration of the idea. They can be ridiculously fast, as they
         | always do the same amount of work. (Provided you have enough
         | execution units.)
        
           | marcosdumay wrote:
           | The more predictable a work process becomes, the quicker
           | people can do it too.
           | 
           | Our issue is with correctness, that is something we not even
           | think about in computers. So, it's not really surprising that
           | this completely different issue has different constraints
           | than speed.
        
             | taeric wrote:
             | I suspect it is a curve?
        
       | dang wrote:
       | Related:
       | 
       |  _How and why CPUs do "branch prediction" (2017)_ -
       | https://news.ycombinator.com/item?id=20324092 - July 2019 (33
       | comments)
       | 
       |  _A history of branch prediction_ -
       | https://news.ycombinator.com/item?id=15078574 - Aug 2017 (65
       | comments)
        
       ___________________________________________________________________
       (page generated 2022-12-31 23:00 UTC)