[HN Gopher] ARM Cortex-A72 fetch and branch processing
___________________________________________________________________
 
ARM Cortex-A72 fetch and branch processing
 
Author : zdw
Score  : 45 points
Date   : 2020-12-13 15:50 UTC (1 day ago)
 
(HTM) web link (sandsoftwaresound.net)
(TXT) w3m dump (sandsoftwaresound.net)
 
| lxgr wrote:
| > This all may seem inefficient and crazy and it is. Program
| representation and execution need a major re-think along the
| lines of the once-investigated dataflow architecture. Why do and
| then un-do?
| 
| In a world where most code is JIT-compiled on the target
| architecture and/or periodically re-compiled with the latest
| optimizing compiler once a new CPU comes out - sure.
| 
| But is that realistic? Maybe that would work for Android and iOS
| (with on-device and app-store-level compilation, respectively),
| but with Docker we seem to be moving in the opposite direction
| again, and runtime hardware-based optimization makes a ton of
| sense there.
| GregarianChild wrote:
| > Why do and then un-do?
| 
| The reason is that we've _not_ yet found a better way of making
| fast general-purpose processors.
| 
| The OOO (= out-of-order) approach to processor micro-
| architecture, with high-quality prediction (e.g. branch, value),
| makes sense if you have to mask a lot of memory-access latency,
| which comes from data-dependent (= unpredictable) memory access.
| General-purpose workloads have a lot of that. (If memory-access
| patterns are more predictable, you'd probably run your workload
| on a GPU or TPU or DSP, or some other accelerator.) Compilers,
| whether ahead-of-time or JIT, do _not_ have enough information
| to schedule instructions in a way that masks memory latency the
| way an OOO scheduler inside a processor can. Stalling the
| pipeline because you are waiting for data to arrive from memory
| is disastrous for performance and is to be avoided at all cost.
| Intel's Itanium was based on the premise that it is possible to
| schedule statically well enough, but that has been considered a
| failure for general workloads. Moreover, if you let a compiler
| do the scheduling, you need ISA extensions that allow you to
| communicate scheduling information to the processor, which is
| not cost-free (for example, you may spend precious encoding bits
| on scheduling order that you then cannot use for other things).
| 
| I suspect that the very last word has not been spoken in this
| discussion, but everything obvious for making dataflow
| competitive (and quite a lot more) has been tried, and failed.
| Veedrac wrote:
| IMO a lot of the problem is simply that powerful players (e.g.
| Intel) had a vested interest in tying things to the platform,
| and the megaliths we've built on top all assume this
| construction, so even if there are clear ways in which things
| could be done better (there are), you'd have to reinvent the
| world to do them. Fundamental innovation is really difficult,
| and OoO CPUs avoid the need for it.
| 1996 wrote:
| > you'd have to reinvent the world to do them
| 
| It reminds me of how MIPS branch delay slots are wasted on
| NOPs:
| 
| https://electronics.stackexchange.com/questions/28444/mips-
| p...
| 
| "The trick in writing efficient code is to put in an
| instruction that will be useful as part of the loop if the
| branch is taken, but does no harm if the branch is not taken.
| 
| The MIPS designers were counting on compiler writers to write
| clever enough code generators to handle this efficiently.
| However, many do not (including Microchip's C32 compiler,
| based on GCC), and just put NOPs after every branch, wasting
| both code space and cycles."
| gpderetta wrote:
| GPUs have been extremely successful, and so have DSPs, each in
| their niche. They just don't work well on branchy,
| pointer-chasing code. Intel itself tried to move beyond OoO
| and utterly failed.
| 
| There is no conspiracy here.
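The delay-slot point above can be sketched in MIPS assembly (an
illustrative countdown loop; the register and label names are made
up, not taken from the linked answer):

```asm
# Naive code generation: the branch delay slot holds a NOP.
loop:   addiu $t0, $t0, -1        # decrement counter
        bne   $t0, $zero, loop    # branch if counter != 0
        nop                       # delay slot wasted: extra word, extra cycle

# Delay slot filled: rotate the loop so the decrement sits in the
# slot. It executes whether or not the branch is taken; the single
# extra decrement after the final iteration is harmless as long as
# $t0 is dead afterwards.
loop2:  bne   $t0, $zero, loop2   # branch if counter != 0
        addiu $t0, $t0, -1        # delay slot does useful work
```

The second form saves one instruction word and one cycle per
iteration - exactly the saving the quoted answer says compilers
that emit a NOP after every branch leave on the table.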
| lukehutch wrote:
| What's the previously-investigated dataflow processor
| architecture that the author refers to in the bolded text?
| phire wrote:
| https://en.wikipedia.org/wiki/Dataflow_architecture
| monocasa wrote:
| I believe they're just commenting on how Tomasulo-style OoO
| cores are internally similar to dataflow architectures in
| general, despite presenting a von Neumann veneer.
| 
| https://en.m.wikipedia.org/wiki/Dataflow_architecture
| GregarianChild wrote:
| _Speculation killed dataflow._ (Attributed to Arvind, 2005)
| 
| There have been many attempts to make dataflow architectures
| work for general-purpose computation workloads. They all failed
| to beat the performance of comparable conventional processors.
| [1] lists the following disadvantages of dataflow:
| 
| - Debugging is difficult (no precise state)
| 
| - Interrupt/exception handling is difficult (what are
| precise-state semantics?)
| 
| - Implementing dynamic data structures is difficult in pure
| dataflow models
| 
| - Too much parallelism? (parallelism control needed)
| 
| - High bookkeeping overhead (tag matching, data storage)
| 
| - The instruction cycle is inefficient
| 
| [1]
| https://course.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?med...
___________________________________________________________________
(page generated 2020-12-14 23:00 UTC)