[HN Gopher] Measuring CPU core-to-core latency
       ___________________________________________________________________
        
       Measuring CPU core-to-core latency
        
       Author : nviennot
       Score  : 130 points
       Date   : 2022-09-18 17:15 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jtorsella wrote:
       | If anyone is interested, here are the results on my M1 Pro
       | running Asahi Linux:
       | 
        | Min: 48.3ns Max: 175.0ns Mean: 133.0ns
       | 
       | I'll try to copy the exact results once I have a browser on
        | Asahi, but the general pattern is that most pairs are >150ns
        | and a few (0-1; 2-3,4,5; 3-4,5; 4-5; 6-7,8,9; 7-8,9; 8-9) are
        | faster, at about 50ns.
        
       | jesse__ wrote:
       | This is absolutely the coolest thing I've seen in a while.
        
       | fideloper wrote:
        | Because I'm ignorant: What are the practical takeaways from
       | this?
       | 
        | When does a CPU core send a message to another core?
        
         | SkipperCat wrote:
         | In HFT, we typically pin processes to run on a single isolated
         | core (on a multicore machine). That allows the process to avoid
          | a lot of kernel and other interrupts that would otherwise keep
          | it from operating in a low-latency manner.
         | 
         | If we have two of these processes, each on separate cores, and
         | they occasionally need to talk to each other, then knowing the
         | best choice of process/core location can keep the system
         | operating in the lowest latency setup.
         | 
         | So, an app like this could be very helpful for determining
         | where to place pinned processes onto specific cores.
         | 
          | There are also some common rules of thumb, such as: don't put
          | pinned processes that need to communicate on cores that are
          | separated by the QPI link, since that just adds latency. And
          | if you're communicating with a NIC, find out which socket has
          | the shortest path on the PCI bus to that NIC, among other fun
          | stuff. I never even thought about NUMA until I started to work
         | with folks in HFT. It really makes you dig into the internals
         | of the hardware to squeeze the most out of it.
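          | 
          | A minimal sketch of the pinning part, assuming Linux and
          | pthreads (core number 2 here is just an example, and error
          | handling is kept short):
          | 
          |     // pin_to_core.cpp -- pin the calling thread to one core
          |     // build: g++ pin_to_core.cpp -pthread
          |     #include <pthread.h>
          |     #include <sched.h>
          |     #include <cstdio>
          |     
          |     static bool pin_to_core(int core) {
          |         cpu_set_t set;
          |         CPU_ZERO(&set);
          |         CPU_SET(core, &set);
          |         // affects only the calling thread, not the whole process
          |         return pthread_setaffinity_np(pthread_self(),
          |                                       sizeof(set), &set) == 0;
          |     }
          |     
          |     int main() {
          |         if (!pin_to_core(2))
          |             std::fprintf(stderr, "failed to pin to core 2\n");
          |         // latency-critical work now stays on core 2
          |     }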
        
           | suprjami wrote:
           | I'm surprised how much crossing NUMA nodes can affect
            | performance. We've seen NICs halve their throughput with
           | (intentionally) wrong setups.
           | 
           | I think of NUMA nodes as multiple computers which just happen
           | to share a common operating system.
        
           | ls65536 wrote:
           | In general this makes sense, but I think you need to be
           | careful in some cases where the lowest latency between two
           | logical "cores" is likely to be between those which are SMT
           | siblings on the same physical core (assuming you have an SMT-
           | enabled system). These logical "cores" will be sharing much
           | of the same physical core's resources (such as the low-
           | latency L1/L2 and micro-op caches), so depending on the
           | particular workload, pinning two threads to these two logical
           | "cores" could very well result in worse performance overall.
        
             | slabity wrote:
             | SMT is usually disabled in these situations to prevent it
             | from being a concern.
        
               | nextaccountic wrote:
               | Doesn't this leave some performance on the table? Each
               | core has more ports than a single thread could reasonably
               | use, exactly because two threads can run on a single core
        
               | slabity wrote:
                | In terms of _throughput_, technically yes, you are
               | leaving performance on the table. However, in HFT the
               | throughput is greatly limited by IO anyways, so you don't
               | get much benefit with it enabled.
               | 
                | What you want is to minimize _latency_, which means you
               | don't want to be waiting for _anything_ before you start
               | processing whatever information you need. To do this, you
               | need to ensure that the correct things are cached where
               | they need to be, and SMT means that you have multiple
               | threads fighting each other for that precious cache
               | space.
               | 
               | In non-FPGA systems I've worked with, I've seen dozens of
               | microseconds of latency added with SMT enabled vs
               | disabled.
        
             | bitcharmer wrote:
              | No one in the HFT space runs with SMT enabled.
        
         | eternalban wrote:
         | Hm. Use icelake, with an aggregator process sitting in core 11
         | and have all the others run completely on input alone and then
         | report to core 11. (Core 11 from that heatmap appears to be the
         | only cpu with a sweetheart core having low latency to all other
          | cores.) I wonder how hard it is to write a re-writer to map an
         | executable to match cpu architecture characteristics. Something
         | like graph transformations to create clusters (of memory
         | addresses) that are then mapped to a core.
        
         | electricshampo1 wrote:
         | Answering only the latter question:
         | 
         | A Primer on Memory Consistency and Cache Coherence, Second
         | Edition
         | 
         | https://www.morganclaypool.com/doi/10.2200/S00962ED2V01Y2019...
         | 
         | (free online book) would help
        
         | crazytalk wrote:
         | It's mentioned in the readme - this is measuring the latency of
         | cache coherence. Depending on architecture, some sets of cores
         | will be organized with shared L2/L3 cache. In order to acquire
         | exclusive access to a cache line (memory range of 64-128ish
         | bytes), caches belonging to other sets of cores need to be
         | waited on to release their own exclusive access, or to be
          | informed they need to invalidate their copies. This is
          | observable as a small number of cycles of additional memory
          | access latency that is heavily dependent on hardware cache
          | design, which is what is being measured here.
         | 
         | Cross-cache communication may simply happen by reading or
         | writing to memory touched by another thread that most recently
         | ran on another core
         | 
         | Check out https://en.wikipedia.org/wiki/MOESI_protocol for
         | starters, although I think modern CPUs implement protocols more
         | advanced than this (I think MOESI is decades old at this point)
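          | 
          | To make such a measurement reflect the coherence traffic for a
          | single line, tools like this typically give each shared flag
          | its own cache line; a small sketch of the idea, assuming
          | 64-byte lines (typical on x86 and many ARM parts):
          | 
          |     #include <atomic>
          |     #include <cstdint>
          |     
          |     constexpr std::size_t kCacheLine = 64;  // assumed line size
          |     
          |     // Alignment keeps each flag on its own line, so two cores
          |     // bouncing different flags don't also fight over the same
          |     // line by accident ("false sharing").
          |     struct alignas(kCacheLine) PaddedFlag {
          |         std::atomic<std::uint32_t> value{0};
          |     };
          |     
          |     PaddedFlag ping;  // line mostly owned by core A
          |     PaddedFlag pong;  // a different line, mostly owned by core B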
        
           | haroldrijn wrote:
           | AMD processors also use a hierarchical coherence directory,
           | where the global coherence directory on the IO die enforces
           | coherence across chiplets and a local coherence directory on
            | each chiplet enforces coherence on-die:
            | http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/www/le...
        
         | aseipp wrote:
         | The example code uses an atomic store instruction in order to
         | write values from threads to a memory location, and then an
         | atomic read to read them. The system guarantees that a read of
         | a previously written location is consistent with a subsequent
         | write, i.e. "you always read the thing you just wrote" (on x86,
         | this guarantee is called "Total Store Ordering.") Reads and
         | writes to a memory location are translated to messages on a
         | memory bus, and that is connected to a memory controller, which
         | the CPUs use to talk to the memory they have available. The
         | memory controller is responsible for ensuring every CPU sees a
         | consistent view of memory according to the respective platform
         | memory ordering rules, and with respect to the incoming
         | read/write requests from various CPUs. (There are also caches
         | between the DRAM and CPU here but they are just another layer
         | in the hierarchy and aren't so material to the high-level view,
         | because you can keep adding layers, and indeed some systems
         | even have L1, L2, L3, and L4 caches!)
         | 
         | A CPU will normally translate atomic instructions like "store
         | this 32-bit value to this address" into special messages on the
         | memory bus. Atomic operations it turns out are already normally
         | implemented in the message protocol between cores and memory
         | fabric, so you just translate the atomic instructions into
         | atomic messages "for free" and let the controller sort it out.
          | But the rules of how instructions flow across the memory bus are
         | complicated because the topology of modern CPUs is complicated.
         | They are divided, partitioned into NUMA domains, have various
         | caches that are shared or not shared between 1-2-or-4-way
         | clusters, et cetera. They must still obey the memory
         | consistency rules defined by the platform, and all the caches
         | and interconnects between them. As a result, there isn't
         | necessarily a uniform measurement of time for any particular
         | write to location X from a core to be visible to another core
         | when it reads X; you have to measure it to see how the system
         | responds, which might include expensive operations like
          | flushing the cache. It turns out two cores that are very far
          | apart will just take more time to see a message, since the bus
          | path will likely be longer -- the latency for a core-to-core
          | write to become consistently visible will be higher.
         | 
         | So when you're designing high performance algorithms and
         | systems, you want to keep the CPU topology and memory hierarchy
         | in mind. That's the most important takeaway. From that
         | standpoint, these heatmaps are simply useful ways of
         | characterizing the baseline performance of _some_ basic
         | operations between CPUs, so you might get an idea of how
         | topology affects memory latency.
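          | 
          | A hedged sketch of that kind of measurement, in the spirit of
          | the tool rather than its actual code: two threads bounce a
          | value through one atomic, and the round trip is dominated by
          | the cache line moving between whichever cores the threads run
          | on. Pinning (as discussed above) and proper statistics are
          | left out.
          | 
          |     #include <atomic>
          |     #include <chrono>
          |     #include <cstdio>
          |     #include <thread>
          |     
          |     std::atomic<int> flag{0};
          |     constexpr int kIters = 100000;
          |     
          |     void pinger() {
          |         for (int i = 0; i < kIters; ++i) {
          |             while (flag.load(std::memory_order_acquire) != 0) {}
          |             flag.store(1, std::memory_order_release);  // hand off
          |         }
          |     }
          |     
          |     void ponger() {
          |         for (int i = 0; i < kIters; ++i) {
          |             while (flag.load(std::memory_order_acquire) != 1) {}
          |             flag.store(0, std::memory_order_release);  // hand back
          |         }
          |     }
          |     
          |     int main() {
          |         auto t0 = std::chrono::steady_clock::now();
          |         std::thread a(pinger), b(ponger);
          |         a.join(); b.join();
          |         auto ns = std::chrono::duration_cast<
          |             std::chrono::nanoseconds>(
          |             std::chrono::steady_clock::now() - t0).count();
          |         // each iteration is a full round trip, i.e. two hops
          |         std::printf("%.1f ns per one-way hop\n",
          |                     ns / (2.0 * kIters));
          |     }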
        
       | apaolillo wrote:
       | We published a paper where we captured the same kind of insights
        | (deep NUMA hierarchies including cache levels, NUMA nodes, and
       | packages) and used them to tailor spinlocks to the underlying
       | machine: https://dl.acm.org/doi/10.1145/3477132.3483557
        
       | zeristor wrote:
       | I realise these were run on AWS instances, but could this be run
       | locally on Apple Silicon?
       | 
       | Erm, I guess I should try.
        
       | wyldfire wrote:
       | This is a cool project.
       | 
       | It looks kinda like the color scales are normalized to just-this-
       | CPU's latency? It would be neater if the scale represented the
       | same values among CPUs. Or rather, it would be neat if there were
       | an additional view for this data that could make it easier to
       | compare among them.
       | 
       | I think the differences are really interesting to consider. What
       | if the scheduler could consider these designs when weighing how
       | to schedule each task? Either statically or somehow empirically?
       | I think I've seen sysfs info that describes the cache
       | hierarchies, so maybe some of this info is available already.
        | That Nest [1] scheduler was recently shared on HN; I suppose it
       | may be taking advantage of some of these properties.
       | 
       | [1] https://dl.acm.org/doi/abs/10.1145/3492321.3519585
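        | 
        | The sysfs cache-hierarchy info mentioned above is there on
        | Linux; a small sketch of reading it, assuming the usual
        | /sys/devices/system/cpu layout:
        | 
        |     // print level, type, size and sharing set of cpu0's caches
        |     #include <fstream>
        |     #include <iostream>
        |     #include <string>
        |     
        |     static std::string read_line(const std::string& path) {
        |         std::ifstream f(path);
        |         std::string s;
        |         std::getline(f, s);
        |         return s;
        |     }
        |     
        |     int main() {
        |         const std::string base =
        |             "/sys/devices/system/cpu/cpu0/cache/index";
        |         for (int i = 0; ; ++i) {
        |             std::string dir = base + std::to_string(i) + "/";
        |             if (!std::ifstream(dir + "level")) break;
        |             std::cout << "L" << read_line(dir + "level") << " "
        |                       << read_line(dir + "type") << " "
        |                       << read_line(dir + "size")
        |                       << ", shared with CPUs "
        |                       << read_line(dir + "shared_cpu_list") << "\n";
        |         }
        |     }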
        
       | dan-robertson wrote:
       | It would be interesting to have a more detailed understanding of
       | why these are the latencies, e.g. this repo has 'clusters' but
       | there is surely some architectural reason for these clusters. Is
       | it just physical distance on the chip or is there some other
       | design constraint?
       | 
        | I find it pretty interesting where the interface that CPU makers
        | present (e.g. a bunch of equal cores) breaks down.
        
         | xani_ wrote:
         | Just look at the processor architecture diagram.
         | 
          | But TL;DR: modern big processors are not one big piece of
          | silicon but basically "SMP in a box", a bunch of smaller
          | chiplets interconnected with each other. That helps with yield
          | (a "bad" chiplet costs you just 8 cores, not the whole
          | 16/24/48/64-core chip). Those chiplets also usually come with
          | their own memory controllers.
          | 
          | And so you basically have NUMA on a single processor, with all
          | of the optimization challenges that come with it.
        
         | bitcharmer wrote:
          | Most of this cross-core overhead diversity is gone on Skylake
          | and newer chips because Intel moved from a ring topology to a
          | mesh design for their L3 caches.
        
         | ip26 wrote:
         | Some of it is simple distance. Some of it is architectural
         | choices _because_ of the distance. A sharing domain that spans
         | a large distance performs poorly because of the latency.
         | Therefore domains are kept modest, but the consequence is
         | crossing domains has an extra penalty.
        
       | rigtorp wrote:
       | I have something similar but in C++:
       | https://github.com/rigtorp/c2clat
        
         | virgulino wrote:
         | I went to your homepage. Your gif "Programming in C++" made me
         | really laugh, thanks for that! 8-)
         | 
         | https://rigtorp.se/
        
       | sgtnoodle wrote:
       | I've been doing some latency measurements like this, but between
       | two processes using unix domain sockets. I'm measuring more on
        | the order of 50us on average, when using FIFO RT scheduling. I
       | suspect the kernel is either letting processes linger for a
       | little bit, or perhaps the "idle" threads tend to call into the
        | kernel and let it do some non-preemptible bookkeeping.
       | 
       | If I crank up the amount of traffic going through the sockets,
       | the average latency drops, presumably due to the processes being
       | able to batch together multiple packets rather than having to
       | block on each one.
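        | 
        | A hedged sketch of that kind of two-process measurement: a
        | connected AF_UNIX pair from socketpair(), one byte bounced back
        | and forth, with scheduling policy, pinning and error handling
        | left out.
        | 
        |     #include <chrono>
        |     #include <cstdio>
        |     #include <sys/socket.h>
        |     #include <unistd.h>
        |     
        |     int main() {
        |         int fds[2];
        |         socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
        |         constexpr int kIters = 10000;
        |         char byte = 0;
        |     
        |         if (fork() == 0) {            // child: echo every byte
        |             for (int i = 0; i < kIters; ++i) {
        |                 read(fds[1], &byte, 1);
        |                 write(fds[1], &byte, 1);
        |             }
        |             _exit(0);
        |         }
        |     
        |         auto t0 = std::chrono::steady_clock::now();
        |         for (int i = 0; i < kIters; ++i) {  // parent: ping, wait
        |             write(fds[0], &byte, 1);
        |             read(fds[0], &byte, 1);
        |         }
        |         auto us = std::chrono::duration_cast<
        |             std::chrono::microseconds>(
        |             std::chrono::steady_clock::now() - t0).count();
        |         std::printf("%.1f us average round trip\n",
        |                     us / double(kIters));
        |     }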
        
         | Const-me wrote:
         | I only use AF_UNIX sockets when I need to pass open file
         | handles between processes. I generally prefer message queues:
         | https://linux.die.net/man/7/mq_overview
         | 
         | I haven't measured myself, but other people did, and they found
         | the latency of message queues is substantially lower:
         | https://github.com/goldsborough/ipc-bench
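          | 
          | For anyone unfamiliar with that API, a minimal sketch of a
          | POSIX message queue round trip in one process (the queue name
          | /mq_demo is just an illustration; link with -lrt on Linux):
          | 
          |     #include <cstdio>
          |     #include <fcntl.h>
          |     #include <mqueue.h>
          |     
          |     int main() {
          |         mq_attr attr{};
          |         attr.mq_maxmsg  = 10;   // queue depth
          |         attr.mq_msgsize = 64;   // bytes per message
          |     
          |         mqd_t q = mq_open("/mq_demo", O_CREAT | O_RDWR, 0600,
          |                           &attr);
          |         if (q == (mqd_t)-1) { std::perror("mq_open"); return 1; }
          |     
          |         const char msg[] = "ping";
          |         mq_send(q, msg, sizeof(msg), 0);       // priority 0
          |     
          |         char buf[64];                          // >= mq_msgsize
          |         unsigned prio = 0;
          |         long n = mq_receive(q, buf, sizeof(buf), &prio);
          |         std::printf("received %ld bytes: %s\n", n, buf);
          |     
          |         mq_close(q);
          |         mq_unlink("/mq_demo");
          |     }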
        
       | [deleted]
        
       | jeffbee wrote:
        | Fails to build from source with Rust 1.59, so I tried the C++
       | `c2clat` from elsewhere in the thread. Quite interesting on Alder
       | Lake, because the quartet of Atom cores has uniform latency (they
       | share an L2 cache and other resources) while the core-to-core
        | latency of the Core side of the CPU varies. Note that the way
        | these are logically numbered is: 0,1 are SMT threads of the
        | first core, and so forth through 14,15; 16-19 are Atom cores
        | with 1 thread each.
        | 
        |   CPU    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
        |     0    0   12   60   44   60   44   60   43   50   47   56   48   58   49   60   50   79   79   78   79
        |     1   12    0   45   45   44   44   60   43   51   49   55   47   57   49   56   51   76   76   76   76
        |     2   60   45    0   13   42   43   53   43   48   37   52   41   53   42   53   42   72   72   72   72
        |     3   44   45   13    0   42   43   53   42   47   37   51   40   53   41   53   42   72   72   72   72
        |     4   60   44   42   42    0   13   56   43   49   52   54   41   56   42   42   41   75   75   74   75
        |     5   44   44   43   43   13    0   56   43   51   54   55   41   56   42   56   42   77   77   77   77
        |     6   60   60   53   53   56   56    0   13   49   54   56   41   57   42   57   42   78   78   78   78
        |     7   43   43   43   42   43   43   13    0   46   47   54   41   41   41   55   41   72   71   71   71
        |     8   50   51   48   47   49   51   49   46    0   12   51   51   54   56   55   56   75   75   75   75
        |     9   47   49   37   37   52   54   54   47   12    0   49   53   54   56   55   54   74   69   67   68
        |    10   56   55   52   51   54   55   56   54   51   49    0   13   53   58   56   59   75   75   76   75
        |    11   48   47   41   40   41   41   41   41   51   53   13    0   51   52   55   59   75   75   75   75
        |    12   58   57   53   53   56   56   57   41   54   54   53   51    0   13   55   60   77   77   77   77
        |    13   49   49   42   41   42   42   42   41   56   56   58   52   13    0   55   54   77   77   77   77
        |    14   60   56   53   53   42   56   57   55   55   55   56   55   55   55    0   12   74   70   78   78
        |    15   50   51   42   42   41   42   42   41   56   54   59   59   60   54   12    0   75   74   74   77
        |    16   79   76   72   72   75   77   78   72   75   74   75   75   77   77   74   75    0   55   55   55
        |    17   79   76   72   72   75   77   78   71   75   69   75   75   77   77   70   74   55    0   55   55
        |    18   78   76   72   72   74   77   78   71   75   67   76   75   77   77   78   74   55   55    0   55
        |    19   79   76   72   72   75   77   78   71   75   68   75   75   77   77   78   77   55   55   55    0
        
       ___________________________________________________________________
       (page generated 2022-09-18 23:00 UTC)