[HN Gopher] Arm Announces Cortex-R82: First 64-Bit Real Time Pro...
       ___________________________________________________________________
        
       Arm Announces Cortex-R82: First 64-Bit Real Time Processor
        
       Author : rbanffy
       Score  : 150 points
       Date   : 2020-09-07 18:31 UTC (4 hours ago)
        
 (HTM) web link (www.anandtech.com)
 (TXT) w3m dump (www.anandtech.com)
        
       | ksec wrote:
        | The best part is actually that Peter Greenhalgh, VP of Tech at
        | ARM, wrote in the comment section:
       | 
       | >A real-time processor doesn't intrinsically process any
       | differently from a normal applications processor. What
       | differentiates it is that it bounds latencies and behaves
       | deterministically. For example, rather than an interrupt latency
        | on an applications CPU taking anywhere from 50-1,000 cycles (or
        | more), the interrupt latency can be bounded to under 40 cycles.
       | 
       | >Tightly Coupled Memories allow certain routines and data to be
       | stored within the CPU so there's never any chance that a cache
       | eviction has taken place which forces a fetch from DDR (or
       | flash). In a phone or laptop, you don't have routines that
       | absolutely must be accessed in 5-cycles and can't wait 250-cycles
       | for DDR. If you're controlling an HDD you can't have the read
       | head crashing off the spinning disk! Or, in automotive, the spark
        | plug firing at the wrong time.
       | 
       | >Latency and determinism can be very important!
       | 
       | >Not a R8 derivative.
       | 
        | I can't wait to see an SSD controller with this tech.
        
         | Zenst wrote:
          | >I can't wait to see an SSD controller with this tech
         | 
          | I doubt this will improve SSD controllers, even though it may
          | offer more granular control over the CPU cache. For example,
          | if there were an SSD controller this was drop-in compatible
          | with, I doubt there would be any measurable gains: NAND
          | latency dominates, and if you're a little late, the NAND
          | isn't going anywhere.
         | 
          | Spinning rust is different: if you miss the rotation, you
          | have to wait a full rotation. However, given rotation speeds,
          | the processing has kept up for ages, and the days of
          | interleaving sectors to match controller performance died out
          | decades ago, so I'm not sure that example would see gains
          | either.
         | 
          | Remember that in these cases the code is essentially fixed
          | and the only dynamic part is streaming the data in or out and
          | processing it (encoding/decoding) - all fixed. So if the
          | core/cache is dedicated to one task, the whole latency aspect
          | is controlled and known. However, if you want a CPU to do
          | many things, then this level of cache control and interrupt
          | handling becomes more important, and that's when you need an
          | RTOS.
         | 
          | Though I'm sure many a DB would love the ability to run code
          | directly upon the drive controller - and with RTX30 opening
          | up direct drive interfacing over the bus and bypassing the
          | CPU[EDIT], this may well have some advantages.
         | 
          | [EDIT ADD] Easier access to large RAM caches seems to be the
          | real gain. For spinning rust HDDs, that becomes a mixed
          | blessing, as you would ideally want some capacitors to make
          | sure that memory buffer is written out in case of a power
          | failure - so more memory buffer means larger capacitors to
          | ensure enough power to flush that buffer to the spinning
          | rust. It will be interesting to see how that pans out.
        
         | Gracana wrote:
         | FWIW, the 64-bit part is new, but TCM and fast/bounded
         | interrupt response are features available in past and present
         | Cortex-R and Cortex-M devices. If you want to play with this
         | stuff now and don't need a 64-bit device, there are plenty of
         | options.
        
         | Klinky wrote:
         | What benefit do you believe this is going to bring SSDs? They
         | are full of DRAM buffers and utilize multiple simultaneous NAND
         | channels to mask latency inherent to the tech. It seems a
         | realtime processor would bring little to the table.
         | 
          | Edit: Well, looking into it a bit, a lot of SSDs already use
          | realtime ARM processors in their controllers. The main
          | benefits here seem to be the ability to address larger
          | buffers going forward, and access to more cores.
        
           | wtallis wrote:
           | Not having to work around the limitations of 32-bit
           | addressing is a convenience, but the industry has produced 4+
           | TB SSDs with 4+ GB of DRAM for quite a while now without this
           | core.
           | 
           | The real demand for a core like this in SSD controllers comes
           | from doing more computationally expensive work on the SSD
           | controller than just providing a block storage device
           | abstraction. Stuff like embedding a full key-value database
           | on the SSD or providing more general-purpose compute
           | capability is a hot topic for enterprise storage. There's
           | already at least one company shipping an SSD controller for
           | computational storage with a mix of realtime cores and
           | Cortex-A53 cores. The Cortex-R82 means such a chip could be
           | more homogeneous with just one type of ARM core.
        
         | teej wrote:
         | > data to be stored within the CPU
         | 
          | I don't know anything about CPU design. Am I wrong to read
          | this line and be worried about memory access vulnerabilities
          | like those found in Intel CPUs?
        
           | formerly_proven wrote:
           | Tightly coupled memory is basically a cache, except it's
           | explicitly mapped at an address. In MCUs it's faster and also
           | more "secure" in a sense, because, just like a cache, you
           | can't DMA into TCM, so peripherals can't write garbage or
            | shellcode into TCM. The upside of TCM over a cache is that
            | it doesn't need the fast lookup logic or the tag storage
            | that a cache does (so it's smaller and lower power than a
            | cache), while the downside - it being explicit - is less of
            | a problem and sometimes an advantage in embedded systems.
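            | 
            | A minimal sketch of how that explicitness looks in C
            | (assuming a GNU toolchain and a linker script that maps
            | hypothetical .tcm_code / .tcm_data sections onto the TCM
            | address range):
            | 
            |     #include <stdint.h>
            | 
            |     /* Data kept in TCM: fixed small access latency,
            |        never evicted, never refetched from DDR. */
            |     __attribute__((section(".tcm_data")))
            |     static volatile uint32_t motor_setpoint;
            | 
            |     /* Code kept in TCM: deterministic fetch latency,
            |        so its worst-case timing can be bounded tightly. */
            |     __attribute__((section(".tcm_code")))
            |     void control_step(uint32_t sensor_value)
            |     {
            |         /* toy proportional controller */
            |         motor_setpoint = sensor_value / 2u;
            |     }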
        
             | teej wrote:
             | I understand a lot better now, thanks!
        
             | rbanffy wrote:
             | Local memory can also be handy in general-purpose computing
             | - there's always some data that belongs to a given core
             | that doesn't need to be consistent across shared memory.
        
         | pkaye wrote:
          | TCM was available for some older generations of ARM
          | processors. Is there something new about this implementation?
        
         | formerly_proven wrote:
          | > I can't wait to see an SSD controller with this tech.
         | 
          | Not sure I understand your angle, because as I understand
          | things the lags and stalls experienced on (usually older)
          | SSDs stem from the controller either having to re-read a page
          | many times over to reconstruct it, or having to perform
          | housekeeping RFN because some buffer is full. On top of that,
          | because SSDs are virtually addressed, with the "page tables"
          | (if you will) stored on the SSD as well, you run into issues
          | similar to what kernels face under memory pressure, i.e.
          | thrashing.
        
       | ChuckMcM wrote:
        | I'm pretty excited by this announcement. Not for storage but for
        | software defined radios. If Xilinx upgrades their RFSoC (which
        | is a Zynq UltraScale variant with built-in ADCs and DACs) from
        | the current A53 core to this core, it will allow much more
        | sophisticated baseband processing in software that is currently
        | done with FPGA gates in the fabric. And while reconfigurable
        | FPGAs are nice, software can change modes much more quickly than
        | reconfiguring an FPGA can.
        
         | dragontamer wrote:
         | Question: are GPUs a consideration in the SDR world?
         | 
          | I'm no expert on SDR, but it seems to me that SDRs are
          | exactly the high-bandwidth, parallel workloads that GPUs
          | excel at. (GPUs can perform Fourier transforms very
          | efficiently.)
         | 
         | The only possible qualm would be the latency of a GPU. But I
         | don't imagine that SDR workloads are very latency sensitive? Or
         | am I mistaken there?
         | 
            | GPU kernels are just software, but the parallel nature of
            | GPUs means it's better if most of the GPU is sharing the
            | same code (very small L1 code cache per thread). So it's
            | far more flexible than an FPGA but somewhat less flexible
            | than a CPU (where different cores and threads can be
            | executing very different code with large and unique L1
            | caches).
        
           | mhh__ wrote:
           | I've seen it done on the PC side, but (at least on my budget)
           | implementing a GPU, and an FPGA, and the actual SDR I haven't
           | seen. It's definitely a good idea but whereas I could
           | definitely do a decent spec arm board with enough time and
           | prototypes to get it right, I don't even know where to start
           | when it comes to even getting a GPU to boot.
        
           | 01100011 wrote:
           | NVIDIA and Ericsson are already working on leveraging GPUs
           | for 5G:
           | 
           | https://www.ericsson.com/en/blog/2020/4/hardware-
           | acceleratio...
           | 
           | I don't know if this involves baseband processing or just
           | protocols and tighter coupling of applications to the network
           | to reduce overall latency.
           | 
           | This effort is certainly driving a reduction in jitter which
           | is currently an issue with many realtime GPU applications.
        
           | ChuckMcM wrote:
           | Yes! There is a very active community of using Cuda to do
           | signal processing[1].
           | 
            | Latency is an interesting thing; it's always part of the
            | SDR pipeline since you have filter delays and processing
            | delays. Most digital streams are uni-directional, so you
            | get the bits out, just shifted by 'x' ns. Since the 'x' is
            | deterministic you can plan for it.
           | 
           | [1] https://github.com/rapidsai/cusignal
        
             | CamperBob2 wrote:
             | That's pretty cool. I don't know anyone who does realtime
             | signal processing work in Python, though, least of all
             | myself. Are there C bindings for all of that stuff?
        
               | ChuckMcM wrote:
                | Nearly all of the "heavy lifting", as it were, is done
                | by C or C++ libraries. Python is just the 'plumbing'
                | level. It is not dissimilar to using MATLAB, where the
                | drivers are all optimized code but the connection
                | between them is MATLAB.
                | 
                | This design pattern is common to nearly all SDR
                | frameworks (GNU Radio, Redhawk, Pothos, etc.). The
                | "interconnect" between processing elements is typically
                | shared memory, and that is why it lends itself to GPU
                | work as well.
               | 
               | That said, since my head is most comfortable thinking in
               | C, I tend to write stuff in C rather than Python :-).
        
               | ksaj wrote:
               | I'm not sure if this is precisely what you are looking
               | for, but it's been on my radar for a while for a
               | potential upcoming project. It has limitations (64-bit
               | Windows only, for one) which is why I haven't acted on it
               | yet. But it is part of my pre-project research just the
               | same.
               | 
               | https://github.com/taroz/GNSS-SDRLIB
        
         | bfrog wrote:
          | They probably won't though, because I don't believe the R
          | core has an MMU, which Linux needs.
        
           | idiot900 wrote:
           | From the article: "Another big change to the
           | microarchitecture is the inclusion of an MMU"
        
             | bfrog wrote:
              | This is pretty massive! If FPGA vendors provide this,
              | what a great addition.
        
         | noipv4 wrote:
          | I would be keen to see if GNSS-SDR can run on a multicore
          | version of this CPU. We are developing fully software GPS /
          | Galileo receivers using the Analog Devices AD9364, and are
          | able to run them on a quad-core Kaby Lake laptop.
        
       | gumby wrote:
       | > "real-time" processors which are used in high-performance real-
       | time applications.
       | 
       | C'mon anandtech, you can do better than this: "real time" means
       | deterministic, not necessarily faster, and in fact often means
       | _slower_.
       | 
        | As another poster pointed out, ARM's VP of Tech posted a
        | comment explaining what real-time means. I don't know why
        | people jump from that to the idea that it would be faster.
        
         | doctoboggan wrote:
         | "high-performance" could just as easily be interpreted as
         | faster response time as compared to faster clock speed. So in
         | that way they are "faster".
        
           | gumby wrote:
           | But real-time systems don't guarantee faster. I was a real-
           | time engineer for years.
        
             | swebs wrote:
              | Lower latency, not higher throughput.
        
               | gumby wrote:
               | Not necessarily even lower latency, just predictable.
        
         | wtallis wrote:
         | I think the point you're missing is that "low performance"
         | realtime applications exist and can often get by with more
         | commodity hardware and careful software, eg. running a RTOS on
         | x86 hardware. But that's usually not a viable option when your
         | latency requirements are measured in clock cycles or
         | nanoseconds rather than microseconds. For that, you need more
         | specialized hardware that can do realtime _and_ high
         | performance.
        
       | TheMagicHorsey wrote:
       | What would be the advantage of running an RTOS on Cortex-R82 vs
       | any other ARM processor? Doesn't an RTOS give you the hard
       | realtime capability through software?
       | 
        | I feel I must be missing something critical in my
        | understanding. I thought an RTOS in software was sufficient to
        | get realtime processing.
        
         | tashbarg wrote:
          | Your software can only guarantee what the hardware
          | provides/guarantees. If your CPU can take a variable number
          | of cycles to start processing an interrupt, the RTOS can only
          | guarantee the upper bound. A real-time CPU is all about being
          | faster and, more importantly, more deterministic, therefore
          | enabling the RTOS to guarantee faster reaction times.
          | 
          | If an interrupt can usually be processed in 50 cycles but very
          | rarely will take 500, the RTOS can't schedule for the 50
          | cycles. If the real-time CPU guarantees 200 cycles, the RTOS
          | can actually schedule with that. Depending on the application,
          | that can make a huge difference.
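          | 
          | As a toy illustration of that budgeting (all numbers are made
          | up for the example, not from any datasheet):
          | 
          |     #include <stdio.h>
          | 
          |     int main(void) {
          |         const double cpu_hz    = 400e6; /* 400 MHz core  */
          |         const int irq_bound    = 200;   /* guaranteed max
          |                                            latency, cycles */
          |         const int handler_wcet = 800;   /* handler WCET,
          |                                            in cycles      */
          | 
          |         /* The worst case is what the RTOS schedules with,
          |            regardless of how fast the typical case is.    */
          |         double worst_us =
          |             (irq_bound + handler_wcet) / cpu_hz * 1e6;
          |         printf("worst-case response: %.2f us\n", worst_us);
          |         return 0;
          |     }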
        
         | tyingq wrote:
         | See this comment: https://news.ycombinator.com/item?id=24401929
        
       | bfrog wrote:
        | This could be an amazing chip. I imagine it's a replacement for
        | the R5?
        
         | doctoboggan wrote:
         | The article talks about it like it's a replacement for the R8.
        
           | klysm wrote:
           | It's also called the R82 so I think that makes sense.
        
       | Animats wrote:
       | So this is basically a CPU with an upper bound on the horrible
       | cases. That's useful for real time. A standard test for real time
       | is to run a hard real time OS like QNX, and have a simple program
       | which receives an interrupt from an input pin and restarts a high
       | priority task that's waiting for the interrupt. The high priority
       | task turns on an output pin. You hook the input and output pins
       | to an oscilloscope. You want to see all the output spikes about
       | the same distance from the input spikes. You don't want to see
       | output outliers way out there, late. If you see that, it's not a
       | hard real time system. You can do this with a standalone program
       | to test the CPU by itself, but you really need to test it when
       | the CPU is also doing other things.
       | 
       | Sources of CPU-level trouble include 1) rarely used CPU
       | instructions that run slow microcode, 2) cases where the pipeline
       | needs a total flush and that's slow, 3) the board manufacturer
       | doing something in system management mode and not telling you.
       | Drivers locking out interrupts too long is the usual Linux
       | problem. It's assumed that all real time code is locked in
       | memory; you do not page real time processes.
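        | 
        | A minimal sketch of that pin-to-pin test, using QNX Neutrino's
        | interrupt API (the IRQ number and GPIO register address are
        | hypothetical placeholders for a real board):
        | 
        |     #include <stdint.h>
        |     #include <sys/mman.h>
        |     #include <sys/neutrino.h>
        | 
        |     #define INPUT_IRQ    42          /* input-pin interrupt */
        |     #define GPIO_OUT_REG 0x40020000  /* output pin register */
        | 
        |     int main(void) {
        |         struct sigevent ev;
        |         volatile uint32_t *out;
        | 
        |         ThreadCtl(_NTO_TCTL_IO, 0);      /* I/O privileges */
        |         out = mmap_device_memory(NULL, sizeof(*out),
        |             PROT_READ | PROT_WRITE | PROT_NOCACHE, 0,
        |             GPIO_OUT_REG);
        | 
        |         SIGEV_INTR_INIT(&ev);
        |         InterruptAttachEvent(INPUT_IRQ, &ev,
        |                              _NTO_INTR_FLAGS_TRK_MSK);
        | 
        |         for (;;) {
        |             InterruptWait(0, NULL);  /* block on the IRQ */
        |             *out = 1;                /* raise output pin */
        |             *out = 0;
        |             InterruptUnmask(INPUT_IRQ, -1);
        |         }
        |     }
        | 
        | Put both pins on the scope; late outliers in the input-to-
        | output delay mean it isn't hard real time.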
        
         | neltnerb wrote:
         | Maybe you can clear this up for me.
         | 
         | To me, realtime means "responds faster than the fastest it
         | needs to in order to control a feedback loop" which is super
         | subjective since different system dynamics are so different.
         | 
          | When I tried to use an RTOS, it appeared that better than 1ms
          | resolution was not remotely possible. Just hardcoding the
          | interrupts resulted in 1us resolution easily.
          | 
          | I grant that the RTOS provides a scheduler, but when this
          | chip says that it is intended for "realtime", does it mean
          | that it's microsecond-fast even when running an RTOS, that
          | it's capable of running fast with handwritten interrupts and
          | code (which to me seems to be most things), or just that it
          | has a spec for an upper bound on op timing?
         | 
         | I guess the last is important for correctness proofs but the
         | prior two are how I interpret it as a confused person reading
         | marketing materials put out by conflicting sources.
        
           | matthewmacleod wrote:
           | I don't get the impression that in this case "realtime" means
           | anything surprising - that is, it means "responds within a
           | well-known and deterministic time." As you point out, that
           | time is usually slower than a non-realtime system would
           | offer.
           | 
           | Specifically with the R-family ARM chips, there is support
           | for things like tightly-coupled memory
           | (https://developer.arm.com/documentation/ddi0338/g/level-
           | one-...) to provide latency guarantees for real-time code,
           | deterministic interrupts, that kind of thing.
        
             | neltnerb wrote:
              | I see, thanks. I can see an advantage to being able to
              | say that your average performance isn't as good but your
              | variation is small, so you can prove you'll hit your
              | metric. I guess I usually over-engineer by 10x, but maybe
              | a better coder would do a better job of running up
              | against the limits.
        
           | kccqzy wrote:
           | Actually realtime doesn't mean "responds faster than the
           | fastest it needs to" but rather it means _always_ responds
           | faster than some predetermined threshold. A lot of the easier
           | ways to make things fast (like caches, or even branch
           | prediction) are inherently probabilistic; if the thing is in
           | the cache you get wonderfully fast performance; if not
           | performance falls off a cliff. A realtime system avoids that,
           | making the average case much slower but the slowest case much
           | faster. It basically flattens the latency distribution.
        
           | michaelt wrote:
           | _> When I tried to use RTOS it appeared to be that better
           | than 1ms resolution was not remotely possible._
           | 
           | I've only worked with one RTOS, and a lightweight one at
           | that.
           | 
           | The scheduler would run every time an interrupt handler
           | finished running.
           | 
            | One such interrupt was the 'systick' - you can choose if
            | you want it to run every 1ms, or 100us, or a different rate
            | (depending on how fast your CPU is).
           | 
           | The systick was how you timed anything you didn't want to use
           | a dedicated hardware timer for. If you want to wait 10ms and
           | your systick happened every 1ms, you waited until the 10th
           | systick.
           | 
           | If you want to time something with finer resolution than 1ms,
           | you either made your systick faster, or used a hardware timer
           | with the interrupt triggering the scheduler.
           | 
           |  _> Just hardcoding the interrupts resulted in 1us resolution
           | easily._
           | 
           | If your CPU runs at 180MHz, 1us is enough time for 180 clock
           | cycles. Plenty of time to run one reasonably carefully coded
           | interrupt handler, not much time to run 10 interrupt
           | handlers.
           | 
           | As you can imagine, running your systick at 1us with that
           | clock rate would be challenging!
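            | 
            | A bare-metal sketch of that pattern in C (assuming a
            | Cortex-M style SysTick already configured for a 1ms
            | period):
            | 
            |     #include <stdint.h>
            | 
            |     static volatile uint32_t ticks;  /* +1 every 1ms */
            | 
            |     void SysTick_Handler(void) {
            |         ticks++;   /* an RTOS would also invoke its
            |                       scheduler from here */
            |     }
            | 
            |     /* Waiting 10ms means waiting for the 10th systick;
            |        resolution is the tick period. */
            |     void delay_ms(uint32_t ms) {
            |         uint32_t start = ticks;
            |         while ((ticks - start) < ms) {
            |             /* spin (or yield to the scheduler) */
            |         }
            |     }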
        
           | SAI_Peregrinus wrote:
           | "realtime" means "latency is bounded to some documented
           | value". "Hard realtime" means it never exceeds that value,
           | and has the same latency every time. "Soft realtime" means it
           | never exceeds that value but the actual latency varies.
           | 
           | That value doesn't have to be anything in particular. Could
           | be 1ms. Could be 100ms. Could be a year. It just has to exist
           | and be documented, and never, ever be exceeded by whatever
           | operations are covered by its guarantee.
        
           | [deleted]
        
       | tyingq wrote:
       | _" Another big change to the microarchitecture is the inclusion
       | of an MMU, which allows the Cortex-82 to actually serve as a
       | general-purpose CPU for a rich operating system such as Linux."_
       | 
       | That's interesting, but they seem to be removing most of the
       | differences between the A and R series.
        
         | recklesstodd wrote:
          | Another use for the MMU would be in a virtualization context.
         | The Hypervisor can configure static address translation tables
         | for each RTOS VM.
        
         | duskwuff wrote:
          | Not really. The performance-oriented A series can include --
          | and will continue to include -- features which improve
          | performance at the expense of consistency, such as branch
          | prediction and speculative execution. The R series values
         | consistent behavior over performance, so it won't have those
         | features.
        
           | the_duke wrote:
           | R82 is in-order, but does provide branch prediction. [1]
           | 
           | [1] https://developer.arm.com/ip-
           | products/processors/cortex-r/co...
        
       | brundolf wrote:
       | For those like me who didn't know what a "Real Time Processor"
       | was: https://en.m.wikipedia.org/wiki/Real-time_computing
       | 
       | (somebody correct me if this is wrong)
        
         | supernova87a wrote:
         | I'm also interested --
         | 
         | What's the hardware difference required to support real time?
          | Is it some dedicated compute to support the queueing /
          | prioritization of jobs? And some additional ability for the
          | "master" must-always-work part to interrupt or reset
          | "optional" processes?
        
           | akiselev wrote:
           | Deterministic latency. That usually means no cache,
           | speculative execution, or anything else that can't guarantee
           | it will complete in a reasonable time.
           | 
            | Specifically, in many architectures interrupt handling code
            | must be able to yield _very_ quickly in order to continue
            | receiving interrupts (or use reentrant interrupts, which
            | are a whole other mess), and even a cache lookup, which
            | might miss and have to wait hundreds or thousands of cycles
            | for DDR RAM, makes the system unable to guarantee that it
            | will do what it needs to do in time, like respond to some
            | safety shutoff switch.
           | 
           | I think one of the things ARM did here is make a way for
           | interrupt handling code to stay in cache permanently along
           | with some other determinism guarantees.
        
             | supernova87a wrote:
              | Thanks! I guess in designing the accompanying software,
              | then, the people writing it probably spend as much (or
              | more) time writing what happens when the expected/desired
              | behavior fails as when it succeeds?
        
             | brundolf wrote:
             | It sounds analogous to the question of deterministic memory
             | usage (vs GC), but at a hardware level and for runtime
             | instead of memory
        
           | dragontamer wrote:
           | Realtime is both software and hardware.
           | 
            | A realtime OS won't do much preemption; it would probably
            | rather stick to cooperative scheduling.
           | 
           | On the hardware side, explicit cache controls are big in
           | realtime chips. Your programmers know exactly which data is
           | in cache and what isn't, and therefore can accurately plan
           | how long tasks take.
           | 
           | The MMU traditionally wasn't realtime. I wonder how they
           | managed to get realtime controls with virtual memory. (They
           | must have some kind of guarantee on TLB lookups or something)
        
       | cordite wrote:
       | Is this just a native way to prevent interrupts on specific
       | cores?
        
         | klysm wrote:
          | I believe it's more about determinism in general than just
          | interrupt timing.
        
       | fizixer wrote:
        | If I remember correctly, having a capable RTOS is much more
        | important, and RTOSes can run on regular processors too [0].
       | 
       | I wonder what's so special about this processor that makes it
       | better than dozens of other hardware platforms (both
       | microprocessors and microcontrollers) on which embedded RTOSes
       | are running and doing just fine.
       | 
       | Also if you have a 64-bit processor, but no 64-bit RTOS, you
       | don't have much.
       | 
       | [0] https://en.wikipedia.org/wiki/Comparison_of_real-
       | time_operat...
        
       ___________________________________________________________________
       (page generated 2020-09-07 23:00 UTC)