[HN Gopher] CPU branch prediction evolution over years (2017)
___________________________________________________________________
 
  CPU branch prediction evolution over years (2017)
 
  Author : Keyframe
  Score  : 48 points
  Date   : 2022-12-31 20:46 UTC (2 hours ago)
 
  (HTM) web link (danluu.com)
  (TXT) w3m dump (danluu.com)
 
  | BoardsOfCanada wrote:
  | I could have sworn that the PPro used 2-bit saturating counters,
  | but I checked and: "The P6 rejects the commonly used Smith
  | algorithm, which maintains four states using two bits, in favor
  | of the more recent Yeh method[1]. This adaptive algorithm uses
  | four bits of branch history and can recognize and predict
  | repeatable sequences of branches, for example, taken-taken-not
  | taken."
  |
  | One other thing I wondered about is whether there have been
  | improvements in recovering from mispredicted branches that have
  | not yet been retired. It seems like the huge out-of-order
  | windows of the latest Apple and Intel CPUs would see diminishing
  | returns otherwise. Does anyone know?
  | brigade wrote:
  | Yeah, these days it's likely that only the frontend stalls
  | during misprediction recovery. Already-decoded uops from before
  | the branch still make their way through the backend as normal.
  | If you have enough of those (long dependency chains, or the
  | branch was reordered early enough), it's possible that the
  | misprediction has effectively no penalty.
  |
  | As a corollary, this means that when possible you want the
  | dependency chain for branches to be as independent and short as
  | possible, so the CPU can reorder and resolve branches as early
  | as possible.
  | 082349872349872 wrote:
  | Is the moral of CPU branch prediction that generating
  | dynamically branch-free code is preferable to generating
  | statically branch-free code?
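The Smith-counter vs. two-level-adaptive contrast quoted above can be sketched in a toy simulation. This is purely illustrative: the class names, table sizes, and initial states are my own choices, not a model of the actual P6 hardware. It shows why a plain 2-bit counter mispredicts every third branch of the repeating taken-taken-not-taken pattern, while a Yeh/Patt-style predictor that keys counters on recent history locks onto it.

```python
class TwoBitCounter:
    """Smith-style 2-bit saturating counter: states 0..3,
    predict taken when the counter is in the upper half."""
    def __init__(self):
        self.state = 2  # start weakly taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


class TwoLevelAdaptive:
    """Toy two-level adaptive predictor: a global register of the
    last `history_bits` outcomes selects one 2-bit counter."""
    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = 0
        self.table = [TwoBitCounter() for _ in range(1 << history_bits)]

    def predict(self):
        return self.table[self.history].predict()

    def update(self, taken):
        self.table[self.history].update(taken)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def accuracy(predictor, outcomes):
    """Fraction of branches predicted correctly over a trace."""
    hits = 0
    for taken in outcomes:
        hits += predictor.predict() == taken
        predictor.update(taken)
    return hits / len(outcomes)


# The pattern from the quote: taken, taken, not taken, repeating.
pattern = [True, True, False] * 300
```

On this trace the plain counter settles into mispredicting the "not taken" branch every period (2/3 accuracy), while the history-indexed predictor is near-perfect after a short warm-up, since each 4-bit history value is followed by a deterministic outcome.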
  | mhh__ wrote:
  | Avoiding either > Easily predicted branch > conditional move >
  | difficult-to-predict branch
  | masklinn wrote:
  | > Avoiding either > Easily predicted branch
  |
  | Branchless code can easily out-cost a well-predicted branch
  | (and its contents, if any).
  | mhh__ wrote:
  | Of course; perhaps this should be reduced to "keep an eye out
  | for silly branches."
  | taeric wrote:
  | This is where I was trying to use sorting networks as a fun
  | example. Without independent compare/swap units, they will go
  | slower than sequential sorting, despite not necessarily
  | "branching."
  | marcosdumay wrote:
  | Well, reducing the sheer number of branches in your code has
  | exactly the same impact as improving the predictor's
  | performance. So, no, I don't think you can draw this conclusion
  | from the article.
  |
  | The article is only biased toward improving the CPU, ignoring
  | any opportunity to improve your program. Which is
  | understandable, because it's about CPU architectures.
  |
  | In practice, you will want to improve whatever part you have
  | the freedom to change. You usually cannot choose.
  | mungoman2 wrote:
  | The branches still take up (instruction) cache space. So it
  | _is_ beneficial to reduce the number of branches, even if they
  | could have been predicted accurately.
  | taeric wrote:
  | That said, I have seen people tricked into thinking branches
  | are only if statements. If you use polymorphism to reduce if
  | statements, you'll still want to reduce the variations of the
  | types sent through the code. Well, strictly you just want the
  | variations to be predictable, I suppose?
  | jleahy wrote:
  | That's certainly true; a correctly predicted branch costs
  | little more than a nop or an unconditional jmp.
  |
  | Code doesn't have to be branch-free, just branch in a
  | predictable way.
  |
  | Statically branch-free code is great when you have things that
  | are genuinely unpredictable; think branches driven by data.
  | You might reach for min/max as an example of that, but in most
  | cases these tend to be highly predictable. Control flow in a
  | decompressor is probably a better example (e.g. literal vs.
  | copy).
  | mhh__ wrote:
  | A correctly predicted branch can, in a sense, cost less than a
  | nop, in that it does actual work _and_ requires no additional
  | effort from the backend.
  | jasonwatkinspdx wrote:
  | The way Linus put it in an old newsgroup posting from his
  | Transmeta days was something like: "the job of a CPU is to
  | generate the minimum number of cache misses as early as
  | possible."
  | taeric wrote:
  | I would think it is less "don't have branches" and more "try to
  | have a happy/common path." This is oddly at odds with how
  | humans work as operators. For human operators, you typically
  | want each interaction to be as unpredictable as possible, to
  | keep their attention. Falling into rote action is a sure way to
  | get a mistake out of a person. (Or to get them to quit.)
  |
  | However, for computers, it is good to view them as machines.
  | Like most machines, they are best if they are doing the exact
  | same thing over and over. And over.
  |
  | As such, if you can contrive for the machine to take the same
  | branch the vast majority of the time, you will get it to work
  | faster. Whether this is done with a dynamic arrangement of the
  | data or a static arrangement of the code depends largely on the
  | specifics. Consider sorting networks as a fun exploration of
  | the idea. They can be ridiculously fast, as they always do the
  | same amount of work. (Provided you have enough execution
  | units.)
  | marcosdumay wrote:
  | The more predictable a work process becomes, the quicker people
  | can do it, too.
  |
  | Our issue is with correctness, which is something we don't even
  | think about with computers. So it's not really surprising that
  | this completely different concern has different constraints
  | than speed.
  | taeric wrote:
  | I suspect it is a curve?
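The sorting-network idea mentioned above can be illustrated with a small sketch (names and the Python rendering are mine): a fixed comparator schedule performs identical work for every input, so there are no data-dependent branches for the predictor to miss. In compiled code the min/max pairs typically lower to conditional moves or SIMD min/max instructions; this version only mirrors the structure.

```python
def compare_swap(a, i, j):
    # Always compute both min and max -- no "already ordered?" branch.
    a[i], a[j] = min(a[i], a[j]), max(a[i], a[j])

# A standard 5-comparator sorting network for 4 inputs.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

def sort4(a):
    """Sort a 4-element list in place with a fixed comparator schedule."""
    for i, j in NETWORK_4:
        compare_swap(a, i, j)
    return a
```

Every input permutation costs exactly five compare-and-swap steps, which is why a network's speed depends on having enough independent execution units rather than on branch prediction.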
  | dang wrote:
  | Related:
  |
  | _How and why CPUs do "branch prediction" (2017)_ -
  | https://news.ycombinator.com/item?id=20324092 - July 2019
  | (33 comments)
  |
  | _A history of branch prediction_ -
  | https://news.ycombinator.com/item?id=15078574 - Aug 2017
  | (65 comments)
___________________________________________________________________
(page generated 2022-12-31 23:00 UTC)