hngopher.com

       [HN Gopher] ARM Cortex-A72 fetch and branch processing
       ___________________________________________________________________
        
       Author : zdw
       Score  : 45 points
       Date   : 2020-12-13 15:50 UTC (1 days ago)
        
 (HTM) web link (sandsoftwaresound.net)
 (TXT) w3m dump (sandsoftwaresound.net)
        
       | lxgr wrote:
       | > This all may seem inefficient and crazy and it is. Program
       | representation and execution need a major re-think along the
       | lines of the once-investigated dataflow architecture. Why do and
       | then un-do?
       | 
       | In a world where most code is JIT compiled on the target
       | architecture and/or periodically re-compiled using the latest,
       | optimized compiler once a new CPU comes out - sure.
       | 
       | But is that realistic? Maybe that would work for Android and iOS
       | (with on-device and app store level compilation, respectively),
       | but with Docker, we seem to be moving in the opposite direction
       | again, and runtime hardware-based optimization makes a ton of
       | sense there.
        
         | GregarianChild wrote:
         | > Why do and then un-do?
         | 
         | The reason is that we've _not_ yet found a better way of making
         | fast general purpose processors.
         | 
         | The OOO (= out-of-order) approach with high-quality prediction
         | (e.g. branch, value) to processor micro-architecture makes
         | sense if you have to mask a lot of memory-access latency, which
         | comes from data-dependent (= unpredictable) memory access.
         | General purpose workloads have a lot of that. (If memory access
         | patterns are more predictable, you'd probably run your workload
         | on a GPU or TPU or DSP, or some other accelerator.) Compilers,
         | whether ahead-of-time or JIT, have _not_ got enough
         | information to schedule instructions in a way that can mask
         | memory latency the way an OOO scheduler inside a processor can.
         | Stalling the pipeline because you are waiting for data to
         | arrive from memory is disastrous for performance and to be
         | avoided at all cost. Intel's Itanium was based on the premise
         | that it is possible to schedule statically well enough, but
         | that has been considered a failure for general workloads.
         | Moreover, if you let a compiler schedule, you need to have ISA
         | extensions that allow you to communicate scheduling information
         | to the processor, which is not cost free (for example you may
         | spend precious bits to encode scheduling order that you then
         | cannot use for other things).
         | 
         | I suspect that the very last word has not been spoken in this
         | discussion, but everything obvious to make dataflow competitive
         | (and quite a lot more) has been tried, and failed.
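
A toy model of the stall-versus-overlap effect described in the comment above, as a minimal sketch. Everything here is an assumption made for illustration: the instruction names, the 100-cycle miss latency, and the one-issue-per-cycle rule. A real compiler could hoist the second load in this tiny program; the comment's point is that, in general, it does not know ahead of time which loads will miss, whereas the hardware scheduler sees it at run time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    deps: tuple    # names of instructions whose results this one needs
    latency: int   # cycles from issue until the result is available


def run(program, in_order):
    """Issue at most one instruction per cycle.

    Returns the cycle at which the last result becomes available.
    """
    ready_at = {}            # instruction name -> cycle its result is ready
    pending = list(program)  # not yet issued, in program order
    cycle = 0
    while pending:
        # An in-order core may only look at the oldest pending instruction;
        # an out-of-order core may pick any instruction whose operands are ready.
        candidates = pending[:1] if in_order else list(pending)
        for ins in candidates:
            if all(d in ready_at and ready_at[d] <= cycle for d in ins.deps):
                ready_at[ins.name] = cycle + ins.latency
                pending.remove(ins)
                break
        cycle += 1
    return max(ready_at.values())


# Two independent loads that both miss the cache (hypothetical 100 cycles each),
# each followed by one instruction that consumes the loaded value.
PROGRAM = [
    Instr("load_a", (), 100),
    Instr("use_a", ("load_a",), 1),
    Instr("load_b", (), 100),
    Instr("use_b", ("load_b",), 1),
]

in_order_cycles = run(PROGRAM, in_order=True)       # the two misses are serialized
out_of_order_cycles = run(PROGRAM, in_order=False)  # the two misses overlap
```

In this model the in-order run pays for both misses back to back, while the out-of-order run starts the second load while the first is still outstanding, so the total is roughly one miss latency instead of two.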
        
           | Veedrac wrote:
           | IMO a lot of the problem is simply that powerful players (e.g.
           | Intel) had a vested interest in tying things to the platform,
           | and the megaliths we've built on top all assume this
           | construction, so even if there are clear ways in which things
           | can be done better (there are), you'd have to reinvent the
           | world to do them. Fundamental innovation is really difficult,
           | and OoO CPUs avoid the need for it.
        
             | 1996 wrote:
             | > you'd have to reinvent the world to do them
             | 
             | It reminds me of how MIPS branching is wasted on nops:
             | 
             | https://electronics.stackexchange.com/questions/28444/mips-p...
             | 
             | "The trick in writing efficient code is to put in an
             | instruction that will be useful as part of the loop that is
             | executed when the branch is taken, but does no harm if the
             | branch is not taken.
             | 
             | The MIPS designers were counting on compiler writers to
             | write clever enough code generators to handle this
             | efficiently. However, many do not (including Microchip's C32
             | compiler, based on GCC), and just put NOPs after every
             | branch, wasting both code space and cycles"
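
The delay-slot trade-off in the quote can be sketched with a tiny interpreter. This is an illustrative model only: the three-operation instruction set, the register names, and the `run` helper are hypothetical, and the only MIPS behavior modeled is that the instruction right after a branch executes whether or not the branch is taken. The filled variant moves an instruction from before the branch into the slot, which is the simplest safe way to fill it.

```python
def run(program, regs):
    """Execute a tiny MIPS-like program with one branch delay slot.

    Returns (final registers, number of instructions executed).
    """
    regs = dict(regs)
    pc = 0
    executed = 0
    pending_target = None  # target of a taken branch, applied after the delay slot
    while pc < len(program):
        op, *args = program[pc]
        executed += 1
        # If the previous instruction was a taken branch, this instruction sits
        # in its delay slot: it still executes, and only then do we jump.
        target_after_this = pending_target
        pending_target = None
        if op == "add":
            dst, a, b = args
            regs[dst] = regs[a] + regs[b]
        elif op == "addi":
            dst, src, imm = args
            regs[dst] = regs[src] + imm
        elif op == "bne":  # branch to target if regs[a] != regs[b]
            a, b, target = args
            if regs[a] != regs[b]:
                pending_target = target
        elif op == "nop":
            pass
        else:
            raise ValueError(f"unknown op: {op}")
        pc = target_after_this if target_after_this is not None else pc + 1
    return regs, executed


START = {"i": 5, "step": 3, "sum": 0, "zero": 0}

# What a naive code generator emits: a NOP in every delay slot.
NOP_SLOT = [
    ("add", "sum", "sum", "step"),
    ("addi", "i", "i", -1),
    ("bne", "i", "zero", 0),
    ("nop",),
]

# Same loop with the add moved into the delay slot. The add does not affect
# the branch condition, so it is safe whether or not the branch is taken.
FILLED_SLOT = [
    ("addi", "i", "i", -1),
    ("bne", "i", "zero", 0),
    ("add", "sum", "sum", "step"),
]

nop_regs, nop_count = run(NOP_SLOT, START)
filled_regs, filled_count = run(FILLED_SLOT, START)
```

Both variants compute the same `sum`, but the NOP version executes four instructions per iteration and the filled version three.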
        
             | gpderetta wrote:
             | GPUs have been extremely successful, and so have DSPs, each
             | in its niche. They just don't work well on branchy,
             | pointer-chasing code. Intel itself tried to move beyond OoO
             | and utterly failed.
             | 
             | There is no conspiracy here.
        
       | lukehutch wrote:
       | What's the previously-investigated dataflow processor
       | architecture that the author refers to in the bolded text?
        
         | phire wrote:
         | https://en.wikipedia.org/wiki/Dataflow_architecture
        
         | monocasa wrote:
         | I believe they're just commenting on how Tomasulo-style OoO
         | cores are internally similar to dataflow architectures in
         | general, despite presenting a von Neumann veneer.
         | 
         | https://en.m.wikipedia.org/wiki/Dataflow_architecture
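
For readers unfamiliar with the dataflow execution model, here is a minimal sketch of its firing rule: a node runs as soon as all of its operands have arrived, with no program counter deciding the order. The `run_dataflow` helper and the example graph are hypothetical and written only for illustration; they do not model Tomasulo's algorithm or any real machine's tag-matching hardware.

```python
import operator


def run_dataflow(nodes, inputs):
    """Evaluate a dataflow graph.

    nodes:  name -> (function, tuple of operand names)
    inputs: name -> initial value
    Returns (all values, list of waves of nodes that fired together).
    """
    values = dict(inputs)
    waiting = dict(nodes)
    firing_order = []
    while waiting:
        # Firing rule: a node is ready once every operand has a value.
        ready = [
            name
            for name, (_, operands) in waiting.items()
            if all(o in values for o in operands)
        ]
        if not ready:
            raise ValueError("deadlock: some operands never arrive")
        # Every node in one wave is independent and could fire in parallel.
        for name in ready:
            fn, operands = waiting.pop(name)
            values[name] = fn(*(values[o] for o in operands))
        firing_order.append(sorted(ready))
    return values, firing_order


# (a + b) * (c - d), deliberately listed with the final node first to show
# that textual order does not matter, only operand availability.
GRAPH = {
    "prod": (operator.mul, ("sum", "diff")),
    "sum": (operator.add, ("a", "b")),
    "diff": (operator.sub, ("c", "d")),
}

values, firing_order = run_dataflow(GRAPH, {"a": 1, "b": 2, "c": 7, "d": 3})
```

Here `sum` and `diff` fire in the first wave and `prod` in the second. An OoO core's scheduler makes a similar readiness decision internally, while still presenting sequential program order to software.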
        
         | GregarianChild wrote:
         | _Speculation killed dataflow._ (Attributed to Arvind 2005)
         | 
         | There have been many attempts at making data-flow architectures
         | work for general purpose computation workloads. They all
         | failed to beat the performance of comparable conventional
         | processors. [1] lists the following disadvantages of data-flow:
         | 
         | - Debugging difficult (no precise state)
         | 
         | - Interrupt/exception handling is difficult (what is precise
         | state semantics?)
         | 
         | - Implementing dynamic data structures difficult in pure data
         | flow models
         | 
         | - Too much parallelism? (Parallelism control needed)
         | 
         | - High bookkeeping overhead (tag matching, data storage)
         | 
         | - Instruction cycle is inefficient
         | 
         | [1]
         | https://course.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?med...
        
       ___________________________________________________________________
       (page generated 2020-12-14 23:00 UTC)