[HN Gopher] Measuring CPU core-to-core latency
___________________________________________________________________
 
Measuring CPU core-to-core latency
 
Author : nviennot
Score  : 130 points
Date   : 2022-09-18 17:15 UTC (5 hours ago)
 
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
 
| jtorsella wrote:
| If anyone is interested, here are the results on my M1 Pro
| running Asahi Linux:
|
| Min: 48.3 ns, Max: 175.0 ns, Mean: 133.0 ns
|
| I'll try to copy the exact results once I have a browser on
| Asahi, but the general pattern is that most pairs are >150 ns
| and a few (0-1; 2-3,4,5; 3-4,5; 4-5; 6-7,8,9; 7-8,9; 8-9) are
| faster, at about 50 ns.
| jesse__ wrote:
| This is absolutely the coolest thing I've seen in a while.
| fideloper wrote:
| Because I'm ignorant: what are the practical takeaways from
| this?
|
| When is a CPU core sending a message to another core?
| SkipperCat wrote:
| In HFT, we typically pin processes to run on a single isolated
| core (on a multicore machine). That allows the process to
| avoid a lot of kernel and other interrupts which could keep it
| from operating in a low-latency manner.
|
| If we have two of these processes, each on separate cores, and
| they occasionally need to talk to each other, then knowing the
| best choice of process/core placement can keep the system
| operating in the lowest-latency setup.
|
| So, an app like this could be very helpful for determining
| where to place pinned processes onto specific cores.
|
| There are also some common rules of thumb, such as: don't put
| pinned processes that need to communicate on cores that are
| separated by the QPI; that just adds latency. Make sure, if
| you're communicating with a NIC, to find out which socket has
| the shortest path on the PCI bus to that NIC, and other fun
| stuff. I never even thought about NUMA until I started to work
| with folks in HFT. It really makes you dig into the internals
| of the hardware to squeeze the most out of it.
| suprjami wrote:
| I'm surprised how much crossing NUMA nodes can affect
| performance. We've seen NICs halve their throughput with
| (intentionally) wrong setups.
|
| I think of NUMA nodes as multiple computers which just happen
| to share a common operating system.
| ls65536 wrote:
| In general this makes sense, but I think you need to be
| careful in some cases, because the lowest latency between two
| logical "cores" is likely to be between those which are SMT
| siblings on the same physical core (assuming you have an SMT-
| enabled system). These logical "cores" share much of the same
| physical core's resources (such as the low-latency L1/L2 and
| micro-op caches), so depending on the particular workload,
| pinning two threads to these two logical "cores" could very
| well result in worse performance overall.
| slabity wrote:
| SMT is usually disabled in these situations to prevent it
| from being a concern.
| nextaccountic wrote:
| Doesn't this leave some performance on the table? Each core
| has more ports than a single thread could reasonably use,
| exactly because two threads can run on a single core.
| slabity wrote:
| In terms of _throughput_, technically yes, you are leaving
| performance on the table. However, in HFT the throughput is
| greatly limited by IO anyway, so you don't get much benefit
| with it enabled.
|
| What you want is to minimize _latency_, which means you don't
| want to be waiting for _anything_ before you start processing
| whatever information you need. To do this, you need to ensure
| that the correct things are cached where they need to be, and
| SMT means that you have multiple threads fighting each other
| for that precious cache space.
|
| In non-FPGA systems I've worked with, I've seen dozens of
| microseconds of latency added with SMT enabled vs disabled.
| bitcharmer wrote:
| No one in the HFT space runs with SMT enabled.
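A minimal sketch of the pinning SkipperCat and slabity describe,
assuming Linux and the Rust libc crate; the core number is a
hypothetical placeholder that a real deployment would pick from the
machine's topology:

    // Pin the calling thread to one core so the scheduler never
    // migrates it. Assumes Linux and libc = "0.2" in Cargo.toml.
    fn pin_to_core(core: usize) -> std::io::Result<()> {
        unsafe {
            let mut set: libc::cpu_set_t = std::mem::zeroed();
            libc::CPU_ZERO(&mut set);
            libc::CPU_SET(core, &mut set);
            // pid 0 means "the calling thread"
            let rc = libc::sched_setaffinity(
                0,
                std::mem::size_of::<libc::cpu_set_t>(),
                &set,
            );
            if rc != 0 {
                return Err(std::io::Error::last_os_error());
            }
        }
        Ok(())
    }

    fn main() {
        pin_to_core(2).expect("sched_setaffinity failed"); // core 2 is a hypothetical choice
        // ... latency-critical work runs here, free of migrations ...
    }

Keeping the kernel itself off that core (isolcpus, nohz_full) is a
separate, boot-time step.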
| eternalban wrote:
| Hm. Use Ice Lake, with an aggregator process sitting on core
| 11, and have all the others run completely on input alone and
| then report to core 11. (Core 11 in that heatmap appears to be
| the only CPU with a sweetheart core having low latency to all
| other cores.) I wonder how hard it is to write a re-writer to
| map an executable to match CPU architecture characteristics.
| Something like graph transformations to create clusters (of
| memory addresses) that are then mapped to a core.
| electricshampo1 wrote:
| Answering only the latter question:
|
| A Primer on Memory Consistency and Cache Coherence, Second
| Edition
|
| https://www.morganclaypool.com/doi/10.2200/S00962ED2V01Y2019...
|
| (free online book) would help.
| crazytalk wrote:
| It's mentioned in the readme - this is measuring the latency
| of cache coherence. Depending on the architecture, some sets
| of cores will be organized with a shared L2/L3 cache. In order
| to acquire exclusive access to a cache line (a memory range of
| 64-128ish bytes), caches belonging to other sets of cores need
| to be waited on to release their own exclusive access, or to
| be informed that they need to invalidate their copies. This is
| observable as a small number of cycles of additional memory
| access latency that is heavily dependent on hardware cache
| design, which is what is being measured.
|
| Cross-cache communication may simply happen by reading or
| writing memory touched by another thread that most recently
| ran on another core.
|
| Check out https://en.wikipedia.org/wiki/MOESI_protocol for
| starters, although I think modern CPUs implement protocols
| more advanced than this (I think MOESI is decades old at this
| point).
| haroldrijn wrote:
| AMD processors also use a hierarchical coherence directory,
| where the global coherence directory on the IO die enforces
| coherence across chiplets and a local coherence directory on
| each chiplet enforces coherence on-die:
| http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/www/le...
| aseipp wrote:
| The example code uses an atomic store instruction to write
| values from threads to a memory location, and then an atomic
| read to read them. The system guarantees that a read of a
| previously written location returns what was most recently
| written, i.e. "you always read the thing you just wrote" (on
| x86, this guarantee is called "Total Store Ordering"). Reads
| and writes to a memory location are translated to messages on
| a memory bus, which is connected to a memory controller that
| the CPUs use to talk to the memory they have available. The
| memory controller is responsible for ensuring every CPU sees a
| consistent view of memory according to the platform's memory
| ordering rules, and with respect to the incoming read/write
| requests from the various CPUs. (There are also caches between
| the DRAM and the CPUs, but they are just another layer in the
| hierarchy and aren't so material to the high-level view; you
| can keep adding layers, and indeed some systems even have L1,
| L2, L3, and L4 caches!)
|
| A CPU will normally translate atomic instructions like "store
| this 32-bit value to this address" into special messages on
| the memory bus. Atomic operations, it turns out, are already
| normally implemented in the message protocol between cores and
| the memory fabric, so you just translate the atomic
| instructions into atomic messages "for free" and let the
| controller sort it out. But the rules for how instructions
| flow across the memory bus are complicated, because the
| topology of modern CPUs is complicated. They are divided into
| NUMA domains, have various caches that are shared or not
| shared between 1-, 2-, or 4-way clusters, et cetera. They must
| still obey the memory consistency rules defined by the
| platform, across all the caches and interconnects between
| them. As a result, there isn't necessarily a uniform amount of
| time for a particular write to location X from one core to
| become visible to another core that reads X; you have to
| measure it to see how the system responds, which might include
| expensive operations like flushing the cache. Two cores that
| are very far apart will simply take more time to see a
| message, since the bus path between them is longer -- the
| latency for a core-to-core write to become consistently
| visible will be higher.
|
| So when you're designing high-performance algorithms and
| systems, you want to keep the CPU topology and memory
| hierarchy in mind. That's the most important takeaway. From
| that standpoint, these heatmaps are simply useful ways of
| characterizing the baseline performance of _some_ basic
| operations between CPUs, so you might get an idea of how
| topology affects memory latency.
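A rough reconstruction of the measurement aseipp outlines (a sketch
under my own assumptions, not the repo's actual code): two threads
bounce a counter through one shared, cache-line-aligned atomic, and
half the round-trip time approximates the one-way core-to-core
latency. Pinning each thread to a chosen core, as sketched earlier,
is what turns this single number into a per-pair heatmap.

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
    use std::thread;
    use std::time::Instant;

    const ROUND_TRIPS: u64 = 100_000;

    // Assume 64-byte cache lines so the flag doesn't share a line
    // with unrelated data (typical on x86, not queried from the CPU).
    #[repr(align(64))]
    struct Padded(AtomicU64);

    fn main() {
        let flag = Arc::new(Padded(AtomicU64::new(0)));
        let other = Arc::clone(&flag);

        // "Pong" side: wait for each odd value, answer with the
        // next even one.
        let pong = thread::spawn(move || {
            for i in 0..ROUND_TRIPS {
                let want = 2 * i + 1;
                while other.0.load(Ordering::Acquire) != want {
                    std::hint::spin_loop();
                }
                other.0.store(want + 1, Ordering::Release);
            }
        });

        // "Ping" side: publish an odd value, spin until acknowledged.
        let start = Instant::now();
        for i in 0..ROUND_TRIPS {
            let sent = 2 * i + 1;
            flag.0.store(sent, Ordering::Release);
            while flag.0.load(Ordering::Acquire) != sent + 1 {
                std::hint::spin_loop();
            }
        }
        let elapsed = start.elapsed();
        pong.join().unwrap();

        println!("~{} ns one-way", elapsed.as_nanos() as u64 / ROUND_TRIPS / 2);
    }

Without pinning, the scheduler is free to migrate both threads
mid-run, which blurs exactly the per-core differences the heatmaps
show.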
| apaolillo wrote:
| We published a paper where we captured the same kind of
| insights (deep NUMA hierarchies, including cache levels, NUMA
| nodes, and packages) and used them to tailor spinlocks to the
| underlying machine:
| https://dl.acm.org/doi/10.1145/3477132.3483557
| zeristor wrote:
| I realise these were run on AWS instances, but could this be
| run locally on Apple Silicon?
|
| Erm, I guess I should try.
| wyldfire wrote:
| This is a cool project.
|
| It looks kinda like the color scales are normalized to just-
| this-CPU's latency? It would be neater if the scale
| represented the same values across CPUs. Or rather, it would
| be neat if there were an additional view of this data that
| made it easier to compare among them.
|
| I think the differences are really interesting to consider.
| What if the scheduler could consider these designs when
| weighing how to schedule each task? Either statically or
| somehow empirically? I think I've seen sysfs info that
| describes the cache hierarchies, so maybe some of this info is
| available already. That nest [1] scheduler was recently shared
| on HN; I suppose it may be taking advantage of some of these
| properties.
|
| [1] https://dl.acm.org/doi/abs/10.1145/3492321.3519585
| dan-robertson wrote:
| It would be interesting to have a more detailed understanding
| of why these are the latencies, e.g. this repo has 'clusters',
| but there is surely some architectural reason for those
| clusters. Is it just physical distance on the chip, or is
| there some other design constraint?
|
| I find it pretty interesting where the interface that CPU
| makers present (e.g. a bunch of equal cores) breaks down.
| xani_ wrote:
| Just look at the processor architecture diagram.
|
| But TL;DR: modern big processors are not one big piece of
| silicon but basically "SMP in a box", a bunch of smaller
| chiplets interconnected with each other. That helps with yield
| (a "bad" chiplet costs you just 8 cores, not a whole
| 16/24/48/64-core chip). Those chiplets also usually come with
| their own memory controllers.
|
| And so you basically have NUMA on a single processor, with all
| of the optimization challenges that brings.
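wyldfire's sysfs remark is easy to try: on Linux, the cache-sharing
clusters xani_ and dan-robertson are discussing are visible under
/sys without measuring anything. A small sketch; the paths are
standard Linux sysfs, but treating index3 as the last-level cache
is an assumption (some machines top out at index2):

    use std::fs;

    fn main() {
        // Each cpuN/cache/indexM directory describes one cache level;
        // shared_cpu_list names the CPUs that share that cache.
        let mut cpu = 0;
        loop {
            let path = format!(
                "/sys/devices/system/cpu/cpu{}/cache/index3/shared_cpu_list",
                cpu
            );
            match fs::read_to_string(&path) {
                Ok(list) => println!("cpu{}: LLC shared with {}", cpu, list.trim()),
                Err(_) => break, // no such CPU (or no index3 cache): stop
            }
            cpu += 1;
        }
    }

CPUs that share a last-level cache here generally correspond to the
fast blocks in the heatmaps.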
| bitcharmer wrote:
| Most of this cross-core overhead diversity is gone on Skylake
| and newer chips, because Intel moved from a ring topology to a
| mesh design for their L3 caches.
| ip26 wrote:
| Some of it is simple distance. Some of it is architectural
| choices _because_ of the distance. A sharing domain that spans
| a large distance performs poorly because of the latency.
| Therefore domains are kept modest, but the consequence is that
| crossing domains carries an extra penalty.
| rigtorp wrote:
| I have something similar but in C++:
| https://github.com/rigtorp/c2clat
| virgulino wrote:
| I went to your homepage. Your gif "Programming in C++" made me
| really laugh, thanks for that! 8-)
|
| https://rigtorp.se/
| sgtnoodle wrote:
| I've been doing some latency measurements like this, but
| between two processes using Unix domain sockets. I'm measuring
| more on the order of 50us on average, when using FIFO RT
| scheduling. I suspect the kernel is either letting processes
| linger for a little bit, or perhaps the "idle" threads tend to
| call into the kernel and let it do some non-preemptible
| bookkeeping.
|
| If I crank up the amount of traffic going through the sockets,
| the average latency drops, presumably because the processes
| can batch together multiple packets rather than having to
| block on each one.
| Const-me wrote:
| I only use AF_UNIX sockets when I need to pass open file
| handles between processes. I generally prefer message queues:
| https://linux.die.net/man/7/mq_overview
|
| I haven't measured myself, but other people have, and they
| found the latency of message queues is substantially lower:
| https://github.com/goldsborough/ipc-bench
| [deleted]
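A minimal version of the round trip sgtnoodle describes, shrunk to
a socketpair inside one process for brevity; a real test would use
two pinned processes and, per Const-me, could swap in a POSIX
message queue for comparison:

    use std::io::{Read, Write};
    use std::os::unix::net::UnixStream;
    use std::thread;
    use std::time::Instant;

    const ROUND_TRIPS: u32 = 10_000;

    fn main() -> std::io::Result<()> {
        let (mut a, mut b) = UnixStream::pair()?;

        // Echo side: read one byte, write it straight back.
        let echo = thread::spawn(move || {
            let mut buf = [0u8; 1];
            for _ in 0..ROUND_TRIPS {
                b.read_exact(&mut buf).unwrap();
                b.write_all(&buf).unwrap();
            }
        });

        let mut buf = [0u8; 1];
        let start = Instant::now();
        for _ in 0..ROUND_TRIPS {
            a.write_all(&[0x55])?;
            a.read_exact(&mut buf)?; // blocks until the echo returns
        }
        let elapsed = start.elapsed();
        echo.join().unwrap();

        // One-way latency is roughly half the round trip.
        println!("~{} ns one-way", elapsed.as_nanos() / ROUND_TRIPS as u128 / 2);
        Ok(())
    }

Unlike the spinning atomic example above, both sides block in the
kernel on every message, so scheduler wakeup latency dominates;
that is consistent with the tens of microseconds sgtnoodle
measures.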
| jeffbee wrote:
| Fails to build from source with Rust 1.59, so I tried the C++
| c2clat from elsewhere in the thread. Quite interesting on
| Alder Lake, because the quartet of Atom cores has uniform
| latency (they share an L2 cache and other resources) while the
| core-to-core latency on the Core side of the CPU varies. Note
| how these are logically numbered: 0,1 are the SMT threads of
| the first core, and so forth through 14,15; 16-19 are Atom
| cores with one thread each.
|
| CPU   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
|   0   0  12  60  44  60  44  60  43  50  47  56  48  58  49  60  50  79  79  78  79
|   1  12   0  45  45  44  44  60  43  51  49  55  47  57  49  56  51  76  76  76  76
|   2  60  45   0  13  42  43  53  43  48  37  52  41  53  42  53  42  72  72  72  72
|   3  44  45  13   0  42  43  53  42  47  37  51  40  53  41  53  42  72  72  72  72
|   4  60  44  42  42   0  13  56  43  49  52  54  41  56  42  42  41  75  75  74  75
|   5  44  44  43  43  13   0  56  43  51  54  55  41  56  42  56  42  77  77  77  77
|   6  60  60  53  53  56  56   0  13  49  54  56  41  57  42  57  42  78  78  78  78
|   7  43  43  43  42  43  43  13   0  46  47  54  41  41  41  55  41  72  71  71  71
|   8  50  51  48  47  49  51  49  46   0  12  51  51  54  56  55  56  75  75  75  75
|   9  47  49  37  37  52  54  54  47  12   0  49  53  54  56  55  54  74  69  67  68
|  10  56  55  52  51  54  55  56  54  51  49   0  13  53  58  56  59  75  75  76  75
|  11  48  47  41  40  41  41  41  41  51  53  13   0  51  52  55  59  75  75  75  75
|  12  58  57  53  53  56  56  57  41  54  54  53  51   0  13  55  60  77  77  77  77
|  13  49  49  42  41  42  42  42  41  56  56  58  52  13   0  55  54  77  77  77  77
|  14  60  56  53  53  42  56  57  55  55  55  56  55  55  55   0  12  74  70  78  78
|  15  50  51  42  42  41  42  42  41  56  54  59  59  60  54  12   0  75  74  74  77
|  16  79  76  72  72  75  77  78  72  75  74  75  75  77  77  74  75   0  55  55  55
|  17  79  76  72  72  75  77  78  71  75  69  75  75  77  77  70  74  55   0  55  55
|  18  78  76  72  72  74  77  78  71  75  67  76  75  77  77  78  74  55  55   0  55
|  19  79  76  72  72  75  77  78  71  75  68  75  75  77  77  78  77  55  55  55   0
___________________________________________________________________
(page generated 2022-09-18 23:00 UTC)