[HN Gopher] Arm Announces Cortex-R82: First 64-Bit Real Time Pro... ___________________________________________________________________ Arm Announces Cortex-R82: First 64-Bit Real Time Processor Author : rbanffy Score : 150 points Date : 2020-09-07 18:31 UTC (4 hours ago) (HTM) web link (www.anandtech.com) (TXT) w3m dump (www.anandtech.com) | ksec wrote: | The best part is that Peter Greenhalgh, VP of Tech at Arm, | actually wrote in the comment section. | | >A real-time processor doesn't intrinsically process any | differently from a normal applications processor. What | differentiates it is that it bounds latencies and behaves | deterministically. For example, rather than an interrupt latency | on an applications CPU taking anywhere from 50-1,000 cycles (or | more), the interrupt latency can be bounded to under 40 cycles. | | >Tightly Coupled Memories allow certain routines and data to be | stored within the CPU so there's never any chance that a cache | eviction has taken place which forces a fetch from DDR (or | flash). In a phone or laptop, you don't have routines that | absolutely must be accessed in 5 cycles and can't wait 250 cycles | for DDR. If you're controlling an HDD you can't have the read | head crashing off the spinning disk! Or, in automotive, the spark | plug firing at the wrong time. | | >Latency and determinism can be very important! | | >Not an R8 derivative. | | I can't wait to see an SSD controller with this tech. | Zenst wrote: | >I can't wait to see an SSD controller with this tech | | I doubt this will improve SSD controllers, even though it may offer | more granular control over the CPU cache. For example - even if | there were an SSD controller that this was drop-in compatible | with, I doubt there would be any measurable gains, due to the NAND | latency and the fact that if you're a little late, the NAND isn't going | anywhere. | | Spinning rust, now if you miss the rotation, you have to wait a | full rotation.
However, given rotation speeds, the processing | has been fine, and the days of interleaving sectors to match | controller performance died out decades ago, so I'm not | sure that example would see gains. | | Remember that in these cases the code is kinda fixed and the | only dynamic part is streaming the data in or out and processing | that data (encoding/decoding) - all fixed. So if the | core/cache is fixed upon one task, the whole latency aspect is | controlled and known. However, if you want a CPU to do many | things, then this level of cache control and interrupt handling | would be more important, so then you need an RTOS. | | Though I'm sure many a DB would love the ability to run code | directly upon the drive controller - and with RTX30 opening up | direct drive interfacing over the bus and circumventing the | CPU[EDIT] - then again, this may well have some advantages. | | [EDIT ADD] Easier access to large RAM caches seems to be the | real gain, so for spinning-rust HDs that becomes a mixed | blessing, as you would ideally want some capacitors to make sure | that memory buffer is written out in case of a power failure - so | more memory buffer, larger capacitors to make sure there's enough power | to flush that buffer to the spinning rust. So it'll be interesting to see | how that pans out. | Gracana wrote: | FWIW, the 64-bit part is new, but TCM and fast/bounded | interrupt response are features available in past and present | Cortex-R and Cortex-M devices. If you want to play with this | stuff now and don't need a 64-bit device, there are plenty of | options. | Klinky wrote: | What benefit do you believe this is going to bring SSDs? They | are full of DRAM buffers and utilize multiple simultaneous NAND | channels to mask latency inherent to the tech. It seems a | realtime processor would bring little to the table. | | Edit: Well, looking into it a bit, a lot of SSDs already use | realtime ARM processors in their controllers.
The main benefit | here seems to be the ability to address larger buffers | going forward, plus access to more cores. | wtallis wrote: | Not having to work around the limitations of 32-bit | addressing is a convenience, but the industry has produced 4+ | TB SSDs with 4+ GB of DRAM for quite a while now without this | core. | | The real demand for a core like this in SSD controllers comes | from doing more computationally expensive work on the SSD | controller than just providing a block storage device | abstraction. Stuff like embedding a full key-value database | on the SSD or providing more general-purpose compute | capability is a hot topic for enterprise storage. There's | already at least one company shipping an SSD controller for | computational storage with a mix of realtime cores and | Cortex-A53 cores. The Cortex-R82 means such a chip could be | more homogeneous, with just one type of ARM core. | teej wrote: | > data to be stored within the CPU | | I don't know anything about CPU design. Am I wrong to read this | line and be worried about memory access vulnerabilities like those | found with Intel CPUs? | formerly_proven wrote: | Tightly coupled memory is basically a cache, except it's | explicitly mapped at an address. In MCUs it's faster and also | more "secure" in a sense, because, just like a cache, you | can't DMA into TCM, so peripherals can't write garbage or | shellcode into TCM. The upside of TCM over a cache is that it | doesn't need the fast logic for cache lookups or the | storage for tags (so it's smaller and lower power than a cache), | while the downside - it being explicit - is less of a problem | and sometimes an advantage in embedded systems. | teej wrote: | I understand a lot better now, thanks! | rbanffy wrote: | Local memory can also be handy in general-purpose computing | - there's always some data that belongs to a given core | that doesn't need to be consistent across shared memory.
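The worst-case gap between TCM and a DDR-backed cache that the comments above describe can be sketched as a toy cost model. All cycle counts below are illustrative values taken from the figures quoted earlier in the thread (5-cycle TCM access, ~250-cycle DDR fetch), not the specifications of any real part:

```python
# Hypothetical worst-case fetch-latency model for TCM vs. a DDR-backed
# cache. Cycle counts are the illustrative numbers quoted in the thread,
# not measurements of any real Cortex-R implementation.

TCM_LATENCY = 5        # cycles: tightly coupled memory, fixed by design
CACHE_HIT_LATENCY = 4  # cycles: assumed L1 hit cost
DDR_LATENCY = 250      # cycles: assumed cost of a miss serviced from DDR

def worst_case_fetch(in_tcm: bool) -> int:
    """Worst-case cycles to fetch a routine's code or data."""
    if in_tcm:
        # TCM is explicitly mapped: eviction is impossible, so the
        # worst case equals the typical case.
        return TCM_LATENCY
    # A cached routine may have been evicted, so the worst case is a
    # full miss that goes out to DDR.
    return CACHE_HIT_LATENCY + DDR_LATENCY

print(worst_case_fetch(in_tcm=True))   # 5
print(worst_case_fetch(in_tcm=False))  # 254
```

The point is not the absolute numbers but that the TCM bound is a constant, which is what lets a real-time routine promise a deadline at all.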
| pkaye wrote: | TCM was available on some older generations of ARM processors. | Is there something new about this implementation? | formerly_proven wrote: | > I can't wait to see an SSD controller with this tech. | | Not sure I understand your angle, because as I understand | things the lags and stalls experienced on (usually older) SSDs | stem from the controller either having to re-read a page many | times over to reconstruct it, or from the controller having to | perform housekeeping RFN because some buffer is full. To that | end, because of the virtually addressed nature of SSDs, with the | "page tables" (if you will) being stored on the SSD as well, you | run into similar issues as kernels do under memory pressure, | causing thrashing. | ChuckMcM wrote: | I'm pretty excited by this announcement. Not for storage but for | software defined radios. If Xilinx upgrades their RFSoC (which is | a Zynq UltraScale variant with built-in ADCs and DACs) to this | core from the current A53 core, it will allow much more | sophisticated baseband processing in software that is currently | done with FPGA gates in the fabric. And while reconfigurable | FPGAs are nice, software can change modes much more quickly than | reconfiguring an FPGA can. | dragontamer wrote: | Question: are GPUs a consideration in the SDR world? | | I'm no expert on SDR, but it seems to me that SDRs are the kind of high- | bandwidth, parallel workloads where GPUs excel. (GPUs can | perform Fourier transforms very efficiently.) | | The only possible qualm would be the latency of a GPU. But I | don't imagine that SDR workloads are very latency sensitive? Or | am I mistaken there? | | GPU kernels are just software, but the parallel nature of GPUs | means that it's better if most of the GPU is sharing the same | code (very small L1 code cache per thread).
So it's far more | flexible than an FPGA but somewhat less flexible than a CPU | (where different cores and threads can be executing very | different code with large and unique L1 caches) | mhh__ wrote: | I've seen it done on the PC side, but (at least on my budget) | combining a GPU, an FPGA, and the actual SDR I haven't | seen. It's definitely a good idea, but whereas I could | definitely do a decent-spec ARM board with enough time and | prototypes to get it right, I don't even know where to start | when it comes to even getting a GPU to boot. | 01100011 wrote: | NVIDIA and Ericsson are already working on leveraging GPUs | for 5G: | | https://www.ericsson.com/en/blog/2020/4/hardware- | acceleratio... | | I don't know if this involves baseband processing or just | protocols and tighter coupling of applications to the network | to reduce overall latency. | | This effort is certainly driving a reduction in jitter, which | is currently an issue with many realtime GPU applications. | ChuckMcM wrote: | Yes! There is a very active community using CUDA to do | signal processing[1]. | | Latency is an interesting thing; it's always part of the SDR | pipeline since you have filter delays and processing delays. | Most digital streams are uni-directional, so you get the bits | out, just shifted by 'x' ns. Since the 'x' is deterministic | you can plan for it. | | [1] https://github.com/rapidsai/cusignal | CamperBob2 wrote: | That's pretty cool. I don't know anyone who does realtime | signal processing work in Python, though, least of all | myself. Are there C bindings for all of that stuff? | ChuckMcM wrote: | Nearly all of the "heavy lifting", as it were, is done by C or C++ | libraries. Python is just the 'plumbing' level. It is not | dissimilar to using MATLAB, where the drivers are all | optimized code but the connection between them is MATLAB. | | This design pattern is common to nearly all SDR | frameworks (GNU Radio, Redhawk, Pothos, etc.)
The | "interconnect" between processing elements is typically | shared memory, and that is why it lends itself to GPU | work as well. | | That said, since my head is most comfortable thinking in | C, I tend to write stuff in C rather than Python :-). | ksaj wrote: | I'm not sure if this is precisely what you are looking | for, but it's been on my radar for a while for a | potential upcoming project. It has limitations (64-bit | Windows only, for one), which is why I haven't acted on it | yet. But it is part of my pre-project research just the | same. | | https://github.com/taroz/GNSS-SDRLIB | bfrog wrote: | They probably won't, though, because I don't believe the R core | has the MMU needed for Linux | idiot900 wrote: | From the article: "Another big change to the | microarchitecture is the inclusion of an MMU" | bfrog wrote: | This is pretty massive! If FPGA vendors provide this, what | a great addition. | noipv4 wrote: | I would be keen to see if GNSS-SDR can run on a multicore | version of this CPU. We are developing fully software GPS / | Galileo receivers using the Analog Devices AD9364, and are able | to run them on a quad-core Kaby Lake laptop. | gumby wrote: | > "real-time" processors which are used in high-performance real- | time applications. | | C'mon Anandtech, you can do better than this: "real time" means | deterministic, not necessarily faster, and in fact often means | _slower_. | | As another poster pointed out, ARM's VP of Tech posted a comment | explaining what real-time means. I don't know why people jump | from that to some idea that it would be faster. | doctoboggan wrote: | "high-performance" could just as easily be interpreted as | faster response time as opposed to faster clock speed. So in | that way they are "faster". | gumby wrote: | But real-time systems don't guarantee faster. I was a real- | time engineer for years. | swebs wrote: | Lower latency, not higher throughput. | gumby wrote: | Not necessarily even lower latency, just predictable.
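The "predictable, not necessarily faster" distinction in the exchange above can be illustrated with a toy simulation (all numbers invented): a hypothetical non-realtime core that is usually quicker but occasionally stalls, versus a realtime core with a fixed, slower bound.

```python
import random

random.seed(0)  # deterministic run for reproducibility

def non_rt_latency() -> int:
    """Usually fast, but rarely takes a long stall (e.g. a cache miss).
    50 and 250 cycles are invented illustrative values."""
    return 250 if random.random() < 0.01 else 50

def rt_latency() -> int:
    """Slower than the typical non-RT case, but bounded by design."""
    return 120

non_rt = [non_rt_latency() for _ in range(10_000)]
rt = [rt_latency() for _ in range(10_000)]

# The non-realtime core wins on average...
assert sum(non_rt) / len(non_rt) < sum(rt) / len(rt)
# ...but only the realtime core can promise a worst-case bound.
assert max(rt) <= 120
assert max(non_rt) > 120
```

This is the flattened latency distribution kccqzy describes further down: the realtime core trades a worse average case for a far better (and provable) worst case.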
| wtallis wrote: | I think the point you're missing is that "low-performance" | realtime applications exist and can often get by with more | commodity hardware and careful software, e.g. running an RTOS on | x86 hardware. But that's usually not a viable option when your | latency requirements are measured in clock cycles or | nanoseconds rather than microseconds. For that, you need more | specialized hardware that can do realtime _and_ high | performance. | TheMagicHorsey wrote: | What would be the advantage of running an RTOS on the Cortex-R82 vs | any other ARM processor? Doesn't an RTOS give you the hard | realtime capability through software? | | I feel I must be missing something critical in my understanding. | I thought an RTOS in software was sufficient to get realtime | processing. | tashbarg wrote: | Your software can only guarantee what the hardware | provides/guarantees. If your CPU can take a variable number of | cycles to start processing an interrupt, the RTOS can only | guarantee the upper bound. A real-time CPU is all about being | faster and, more importantly, more deterministic, thereby | enabling the RTOS to guarantee faster reaction times. | | If an interrupt can usually be processed in 50 cycles but very | rarely will take 500, the RTOS can't schedule for the 50 | cycles. If the real-time CPU guarantees 200 cycles, the RTOS | can actually schedule with that. Depending on the application, | that can make a huge difference. | tyingq wrote: | See this comment: https://news.ycombinator.com/item?id=24401929 | bfrog wrote: | This could be an amazing chip. I imagine it's a replacement for | the R5? | doctoboggan wrote: | The article talks about it like it's a replacement for the R8. | klysm wrote: | It's also called the R82, so I think that makes sense. | Animats wrote: | So this is basically a CPU with an upper bound on the horrible | cases. That's useful for real time.
A standard test for real time | is to run a hard real-time OS like QNX, and have a simple program | which receives an interrupt from an input pin and restarts a high- | priority task that's waiting for the interrupt. The high-priority | task turns on an output pin. You hook the input and output pins | to an oscilloscope. You want to see all the output spikes about | the same distance from the input spikes. You don't want to see | output outliers way out there, late. If you see that, it's not a | hard real-time system. You can do this with a standalone program | to test the CPU by itself, but you really need to test it when | the CPU is also doing other things. | | Sources of CPU-level trouble include 1) rarely used CPU | instructions that run slow microcode, 2) cases where the pipeline | needs a total flush and that's slow, 3) the board manufacturer | doing something in system management mode and not telling you. | Drivers locking out interrupts too long is the usual Linux | problem. It's assumed that all real-time code is locked in | memory; you do not page real-time processes. | neltnerb wrote: | Maybe you can clear this up for me. | | To me, realtime means "responds faster than the fastest it | needs to in order to control a feedback loop", which is super | subjective since different system dynamics are so different. | | When I tried to use an RTOS, it appeared that better than 1ms | resolution was not remotely possible. Just hardcoding the | interrupts resulted in 1us resolution easily. | | I grant that the RTOS provides a scheduler, but when this chip | says that it is intended for "realtime", does it mean that it's | microsecond-fast even when running an RTOS, that it's capable of | running fast with handwritten interrupts and code (which to me | seems to be most things), or just that it has a spec for an | upper bound on op timing?
| | I guess the last is important for correctness proofs, but the | prior two are how I interpret it as a confused person reading | marketing materials put out by conflicting sources. | matthewmacleod wrote: | I don't get the impression that in this case "realtime" means | anything surprising - that is, it means "responds within a | well-known and deterministic time." As you point out, that | time is usually slower than a non-realtime system would | offer. | | Specifically with the R-family ARM chips, there is support | for things like tightly-coupled memory | (https://developer.arm.com/documentation/ddi0338/g/level- | one-...) to provide latency guarantees for real-time code, | deterministic interrupts, that kind of thing. | neltnerb wrote: | I see, thanks. I can see an advantage to being able to say | that your average performance isn't as good but your | variation is small, so you can prove you'll hit your metric. I | guess I usually over-engineer by 10x, but maybe a better | coder would do a better job of running up against the | limits. | kccqzy wrote: | Actually, realtime doesn't mean "responds faster than the | fastest it needs to" but rather that it _always_ responds | faster than some predetermined threshold. A lot of the easier | ways to make things fast (like caches, or even branch | prediction) are inherently probabilistic; if the thing is in | the cache you get wonderfully fast performance; if not, | performance falls off a cliff. A realtime system avoids that, | making the average case much slower but the slowest case much | faster. It basically flattens the latency distribution. | michaelt wrote: | _> When I tried to use an RTOS it appeared that better | than 1ms resolution was not remotely possible._ | | I've only worked with one RTOS, and a lightweight one at | that. | | The scheduler would run every time an interrupt handler | finished running.
| | One such interrupt was the 'systick' - you can choose whether | you want it to run every 1ms, or 100us, or at a different rate | (depending on how fast your CPU is). | | The systick was how you timed anything you didn't want to use | a dedicated hardware timer for. If you wanted to wait 10ms and | your systick happened every 1ms, you waited until the 10th | systick. | | If you wanted to time something with finer resolution than 1ms, | you either made your systick faster, or used a hardware timer | with the interrupt triggering the scheduler. | | _> Just hardcoding the interrupts resulted in 1us resolution | easily._ | | If your CPU runs at 180MHz, 1us is enough time for 180 clock | cycles. Plenty of time to run one reasonably carefully coded | interrupt handler, not much time to run 10 interrupt | handlers. | | As you can imagine, running your systick at 1us with that | clock rate would be challenging! | SAI_Peregrinus wrote: | "Realtime" means "latency is bounded to some documented | value". "Hard realtime" means that bound is never, ever | exceeded - a missed deadline is a system failure. "Soft | realtime" means occasional misses are tolerated, but they | degrade the quality of the result. | | That value doesn't have to be anything in particular. Could | be 1ms. Could be 100ms. Could be a year. It just has to exist | and be documented, and never, ever be exceeded by whatever | operations are covered by its guarantee. | [deleted] | tyingq wrote: | _" Another big change to the microarchitecture is the inclusion | of an MMU, which allows the Cortex-R82 to actually serve as a | general-purpose CPU for a rich operating system such as Linux."_ | | That's interesting, but they seem to be removing most of the | differences between the A and R series. | recklesstodd wrote: | Another use for the MMU would be in a virtualization context. | The hypervisor can configure static address translation tables | for each RTOS VM. | duskwuff wrote: | Not really.
The performance-oriented A series can include -- | and will continue to include -- features which improve | performance at the expense of consistency, such as branch | prediction and speculative execution. The R series values | consistent behavior over performance, so it won't have those | features. | the_duke wrote: | The R82 is in-order, but does provide branch prediction. [1] | | [1] https://developer.arm.com/ip- | products/processors/cortex-r/co... | brundolf wrote: | For those like me who didn't know what a "Real Time Processor" | was: https://en.m.wikipedia.org/wiki/Real-time_computing | | (somebody correct me if this is wrong) | supernova87a wrote: | I'm also interested -- | | What's the hardware difference required to support real time? | Is it some dedicated compute to support the queueing / | prioritization of jobs? And some additional ability to have the | "master" must-always-work part be able to interrupt or reset | "optional" processes? | akiselev wrote: | Deterministic latency. That usually means no cache, no | speculative execution, and nothing else that can't guarantee | it will complete in a bounded time. | | Specifically, in many architectures interrupt handling code | must be able to yield _very_ quickly in order to continue | receiving interrupts (or use reentrant interrupts, which are a | whole other mess), and even a cache lookup, which might miss | and have to wait on DDR RAM for hundreds or thousands of | cycles, makes the system unable to guarantee that it will do | what it needs to do in time, like respond to some safety | shutoff switch. | | I think one of the things ARM did here is provide a way for | interrupt handling code to stay in cache permanently, along | with some other determinism guarantees. | supernova87a wrote: | Thanks!
I guess in the designing of the accompanying | software, then, the people writing it probably spend as much | (or more) time writing what happens when the | expected/desired behavior fails as when it succeeds? | brundolf wrote: | It sounds analogous to the question of deterministic memory | usage (vs GC), but at a hardware level and for runtime | instead of memory | dragontamer wrote: | Realtime is both software and hardware. | | A realtime OS won't do much preemption; it would probably rather | stick to cooperative scheduling. | | On the hardware side, explicit cache controls are big in | realtime chips. Your programmers know exactly which data is | in cache and which isn't, and can therefore accurately plan | how long tasks take. | | The MMU traditionally wasn't realtime. I wonder how they | managed to get realtime controls with virtual memory. (They | must have some kind of guarantee on TLB lookups or something) | cordite wrote: | Is this just a native way to prevent interrupts on specific | cores? | klysm wrote: | I believe it's about determinism more broadly, not just the | interrupt timing. | fizixer wrote: | If I remember correctly, having a capable RTOS is much more | important, and they can be used on regular processors too [0]. | | I wonder what's so special about this processor that makes it | better than the dozens of other hardware platforms (both | microprocessors and microcontrollers) on which embedded RTOSes | are running and doing just fine. | | Also, if you have a 64-bit processor but no 64-bit RTOS, you | don't have much. | | [0] https://en.wikipedia.org/wiki/Comparison_of_real- | time_operat... ___________________________________________________________________ (page generated 2020-09-07 23:00 UTC)