[HN Gopher] A single line of code made a 24-core server slower t... ___________________________________________________________________ A single line of code made a 24-core server slower than a laptop Author : Ygg2 Score : 456 points Date : 2021-12-31 13:40 UTC (9 hours ago) (HTM) web link (pkolaczk.github.io) (TXT) w3m dump (pkolaczk.github.io) | jv22222 wrote: | Yes! Finally, HN removed the rule to strip thew word "How" from | the title. | hinkley wrote: | A few hours later... | ncmncm wrote: | The notion of "lock-free" is fundamentally misleading. | | It trips up everybody who hopes "lock-free" will be a magic | bullet freeing them from resource contention bottlenecks. | | When you have explicit locks, aka mutexes (condition variables, | semaphores, what-have-you), the interaction between your threads | is visible on the surface. Replacing that with "lock-free" | interaction, you have essentially the same set of operations as | when taking a mutex. On the up side, overhead cost may be lower. | On the down side, mistakes show up as subtly wrong results | instead of deadlocks. And, you get less control, so when you have | a problem, it is at best harder to see why, and harder to fix. | | Because the lock-free operations involve similar hardware bus | interactions under the covers, they cannot fix contention | problems. You have no choice but to fix contention problems at a | higher, architectural design level, by reducing actual | contention. Having solved the problems there, the extra cost of | explicit mutex operations often does not matter, and the extra | control you get may be worth any extra cost. | | What is lock-free good for, then? Lock-free components reduce | overhead when there is no contention. Given any actual | contention, performance is abandoned in favor of correctness. So, | if you have mutexes and performance is good, moving to lock-free | operations _might_ make performance a little better. If | performance is bad, mutexes and lock-free operations will be | about equally as bad. | Kranar wrote: | Don't see anything more misleading about lock free than about | anything else involving parallelism. Lock freedom has to do | with guarantees about scheduling, you can be lock free and | still have contention but a lock free algorithm is guaranteed | to make progress within a bounded amount of time. A blocking | algorithm is not guaranteed to make any kind of progress within | a bounded amount of time. | | If you're working on a real time system, and I don't mean the | colloquial meaning of that word, but rather a system that must | guarantee that a side effect is produced no later than a given | period of time, then you must use a lock free algorithm, there | is no alternative. Avionics, DSP, high frequency trading, these | are all domains where lock free algorithms are necessary. | Making a fast lock free algorithm is great and generally the | goal of any kind of parallelism is performance, but fast is not | an unambiguous term, it can mean low latency or high bandwidth | and it's very unlikely to write an algorithm that is both. | | Lock free algorithms are great when you need low latency and | guaranteed throughput. If you don't need that then it's | possible to trade those properties for much higher bandwidth | and use blocking algorithms. | ncmncm wrote: | If you are counting on lock-free operations to guarantee you | will meet your real-time scheduling requirements, you have | chosen to fail, right out of the gate. Guaranteeing met | deadlines takes engineering. 
Using lock-free operations is | not a substitute for engineering. There is no such | substitute. | | Also: All this has nothing to do with parallelism. | Parallelism is about things happening at the same time. | Locking or not locking, and contention, are about | concurrency, which is about synchronization and | serialization, a different topic. | Kranar wrote: | If you are using a mutex to guarantee real-time scheduling | requirements, you have failed. | | Furthermore, if you think a discussion about benchmarking an | algorithm on a 24-core SMP system involving atomic reads, | writes and shared caches has nothing to do with parallelism, | then let's just agree to disagree with one another and I | wish you all the best in your programming endeavors. | | People can decide for themselves what credibility they wish | to lend to your argument given your perspective on this | matter. | duped wrote: | Unfortunately I think all this comment does is muddy the waters | even more. All "lock free" means is that one process(+) may not | cause any other process to block during execution. | | "Contention" is not very meaningful to a lock-free algorithm, | since lock-free algorithms typically do not permit shared data | access semantics. Rather they describe how data is _sent | between_ processes, which is why a lock-free solution looks | very different from something behind mutexes. | | But to the heart of your point, I do wish people spent more | time understanding that lock-free vs lock-ful is a distinction | in semantics and doesn't imply anything about performance. | There are subtle trade-offs. | | (+) I'm using "process" with a lot of hand waving; assume it | means whatever your atom of parallelism is. | ncmncm wrote: | You are confusing two fundamentally different topics. | Anywhere there is no possibility of contention for access to | a resource, "lock-free" is trivially meaningless. | twoodfin wrote: | In situations like this, it's valuable to know about all the | event types perf can sample beyond CPU cycles, including cache | misses at various levels and branch mis-predictions. These can go | right into flamegraphs as well. | Thiez wrote: | Is this the kind of situation where assigning a core affinity to | the various threads would have made a difference (to make sure | they all run on the same physical processor), or is the mere | presence of another processor enough to trigger the behavior | shown in the article? | loeg wrote: | No, that would not make a significant difference here. The | problem is that the same line has to bounce around all of the | threads that share it; pinning doesn't reduce that very much on | this time-scale. Thread migrations will be very infrequent | relative to atomic operations, especially when the number of | runnable threads is lower than the number of physical cores. | dagmx wrote: | Core affinity would only help if he didn't have more threads | than clusters. Otherwise he'd either hit the wall his laptop | did, or cross the cluster and hit the cache issues. | [deleted] | hermitdev wrote: | I don't think core affinity would have helped in this | situation. They had one instance that was being shared across | all cores, necessitating the reference count to be sync'd | between sockets at the L3 cache level. They may have seen some | benefit with, say, 2 instances, one per CPU socket, and | dedicated an instance per socket to avoid L3 cache | interactions.
But, in this case, it is far simpler, cleaner and | better performing to not share the instance and avoid the | reference counting entirely. | [deleted] | w_t_payne wrote: | Shared-nothing dataflow models are an interesting way to engineer | (some types of) parallelism. | bob1029 wrote: | > The fact that our reference counters are atomic is an | additional problem that makes things even more complex for the | processor. Although using atomic instructions is often referred | to as "lockless programming", this is slightly misleading - in | fact, atomic operations require some locking to happen at the | hardware level. This locking is very fine-grained and cheap as | long as there is no congestion, but as usual with locking, you | may expect very poor performance if many things try to fight for | the same lock at the same time. And it is of course much worse if | those "things" are whole CPUs and not just single cores that are | close to each other. | | Contention is the devil and you can _never_ hide from it if you | bring it into your life. The fastest "contention avoidance hack" | would be the CAS operation as mentioned above, which still | performs approximately 2 orders of magnitude slower than a single | threaded operation would in the most adverse scenarios. | | If you aren't sure of which problems actually "fit" on a modern | x86 thread these days, rest assured that many developers in | fintech have been able to squeeze entire financial exchanges onto | one: | | https://lmax-exchange.github.io/disruptor/disruptor.html | | For most scenarios, you really don't need all those damn cores. | They are almost certainly getting in the way of you solving your | problems. | hinkley wrote: | I believe Rust is pulling on the right thread here but I have | no idea how this will look in fifteen years. At some point we | will need to be able to tell that two threads don't contend and | thus can be "farther away" from each other, but these threads | are constantly communicating and need to use the same complex | and/or memory banks. | dagmx wrote: | I think it could be something built upon lifetime analysis. I | think a sufficiently complex compiler could use the data | available already today to infer thread affinity and other | rules. | hinkley wrote: | Basically what I was thinking. In the near term it doesn't | even have to be that precise to have a benefit (Rust builds | on escape analysis that helped with distributed garbage | collection, even when it wasn't that exhaustive). Even | ranking threads by suspected degree of interaction is a lot | of data. | | These definitely don't talk. Offload. These definitely | talk. Colocate. These probably talk. Bin pack. | cjfd wrote: | "For most scenarios, you really don't need all those damn | cores. They are almost certainly getting in the way of you | solving your problems." This rings true. I still think the most | promising multi-threading strategy is to, by default, keep | everything on one core and if you find some expensive and more | or less independent workload to put that in another thread. | This is the old-fashioned way though. All of the cool kids will | push for thread pools where every workload will jump from core | to core and back again.... This potentially can require a lot | of locking, though, which also is not good for performance. 
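A minimal Rust sketch of the trade-off discussed above, for readers who want to see the shape of it. This is not the article's code: the Engine type, the thread count and the iteration counts are made up. The first variant clones and drops a shared Arc inside the hot loop, so every core performs atomic read-modify-writes on the same reference-count cache line; the second gives each worker its own deep clone up front, which is roughly the shape of the one-line fix the article describes.

    use std::sync::Arc;
    use std::thread;

    #[derive(Clone)]
    struct Engine {
        data: Vec<u64>, // stand-in for the read-only script/program state
    }

    fn main() {
        let engine = Arc::new(Engine { data: vec![1; 64] });

        // Shared variant: one Arc, cloned and dropped in every iteration of
        // every worker. Each clone/drop is an atomic add on the same counter,
        // so the cache line holding it ping-pongs between all the cores.
        let handles: Vec<_> = (0..24)
            .map(|_| {
                let e = Arc::clone(&engine);
                thread::spawn(move || {
                    let mut acc = 0u64;
                    for _ in 0..1_000_000 {
                        let per_call = Arc::clone(&e); // contended atomic increment
                        acc += per_call.data[0];
                    } // per_call dropped each iteration: contended atomic decrement
                    acc
                })
            })
            .collect();
        let _ = handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>();

        // Unshared variant: deep-clone the data once per worker. Costs memory,
        // but the hot loop then touches only thread-private cache lines.
        let handles: Vec<_> = (0..24)
            .map(|_| {
                let e = (*engine).clone();
                thread::spawn(move || {
                    let mut acc = 0u64;
                    for _ in 0..1_000_000 {
                        acc += e.data[0];
                    }
                    acc
                })
            })
            .collect();
        let _ = handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>();
    }

On a many-core machine the first variant is the one that scales backwards; the second is what "stop sharing" looks like in practice.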
| zmmmmm wrote: | It's an interesting reflection then on the Actor paradigm | where you are encouraged to create large numbers of fine | grained tasks and toss them into a giant thread pool blender | because light weight threads are "free" now. It would be | quite interesting to see the consequences of thread affinity | loss in that setting. | cecilpl2 wrote: | Games are a prime example of systems which must take advantage | of all possible hardware resources. | bob1029 wrote: | The fact that we believe this is probably part of the problem | with game development today. | dagmx wrote: | Games similarly struggle with multi core/ccx designed chips. | | Battlefield 4 was famously terrible on Ryzen when AMD first | released it. | littlecranky67 wrote: | To be fair, Ryzen mostly underperformed when it was | released which was not AMDs fault. When for years engineers | used Intel as their benchmarking and testing machines and | were tweaking all sorts of parameters to measure against | Intel-only, you inevitable end up with an optimized | implementation that resembles the hardware details of the | Chip in some kind. Any new architecture would perform | badly, except if it is a perfect clone of the original | testing hardware. | dagmx wrote: | I wasn't implying it was AMDs fault, but the idea that | the games were optimized for Intel doesn't hold true IMHO | because they were also built for consoles that were AMD. | | The majority of the performance was down to not | accounting for CCX scheduling | PaulDavisThe1st wrote: | Another interesting example is audio synthesis. It is | interesting because there are at least two models for audio | synthesis: single sample processing (as used in, for example, | VCV Rack) and block structured processing (as used by every | plugin API and thus every DAW-that-runs-plugins). But it gets | even more interesting because the degree of potential | parallelism is entirely controlled by the signal routing | pathways chosen by the user. If the software is processing N | independent signal pathways that do not meet until a final | output stage, you have a high (well, "N") level of potential | parallelism, and if N can expand to a high value then you will | always be able to benefit from more cores. | | However, if the signal pathway is more convoluted (no pun | intended for my audio nerd crew), then not only does the degree | of parallelism decrease but the issue of single-sample vs. | block structured can become much more important. In the single | sample case, the cost of computing a single sample has | (relatively) enormous fixed overhead if the code has any | serious level of modularity, but is still very cheap compared | to inter-thread and inter-core synchronization. Spreading the | computation across cores will frequently, but not always, slow | things down quite dramatically. By contrast, block structured | processing has a much lower per-sample cost, because the | function call tree is invoked only once for an entire block of | samples, and so the relative cost of synchronization decreases. | | This sounds like a slam dunk for block structured processing. | | Problem is, there are things you cannot do correctly with block | structured processing, and so there is always a place for | single sample processing. However, the potential for | parallelization and the level at which it should take place | differs significantly between the two processing models, which | opens to door to some highly questionable design. 
| | The short version of this is that audio synthesis in the | abstract can always expand to use any number of cores, | particularly for cpu-intensive synthesis like physical | modelling, but that in the presence of single-sample | processing, the cost of synchronization may completely negate | the benefits of parallelism. | spacechild1 wrote: | Great write up! | | What I like about Pd is that you can freely reblock and | resample any subpatch. Want some section with single-sample- | feedback? Just put a [block~ 1]. You can also increase the | blocksize. Usually, this is done for upsampling and FFT | processing. Finally, reblocking can be nested, meaning that | you can reblock to 1024 samples and inside have another | subpatch running at 1 sample blocksize. | | SuperCollider, on the other hand, has a fixed global | blocksize and samplerate, which I think is one of its biggest | limitations. (Needless to say, there are many things it does | better than Pd!) | | --- | | In the last few days I have been experimenting with adding | multi-threading support to Pd | (https://github.com/Spacechild1/pure-data/tree/multi- | threadin...). With the usual blocksize of 64 sample, you can | definitely observe the scheduling overhead in the CPU meter. | If you have a few (heavy-weight) subpatches running in | parallel, the overhead is neglible. But for [clone] with a | high number of (light-weight) copies, the overhead becomes | rather noticable. In my quick tests, reblocking to 256 | samples already reduces the overhead significantly, at the | cost of increased latency, of course. | | --- | | Also, in my plugin host for Pd/Supercollider | (https://git.iem.at/pd/vstplugin/) I have a multi-threading | and bridging/sandboxing option. If the plugin itself is | rather lightweight and the blocksize is small, the scheduling | overhead becomes quite noticable. In Pd you can just put | [vstplugin~] in a subpatch + [block~]. For the SuperCollider | version I have added a "reblock" argument to process the | plugin at a higher blocksize, at the cost of increased | latency. | michaelrmmiller wrote: | To add onto this very good explanation from Paul, in | practice, audio processing graphs always rapidly decay into | serial processing after starting out massively parallel. Most | of the time, we are writing to a single output (i.e. speakers | via your audio interface or a file). On top of that, some of | the heaviest weight DSP happens on this final output stage | (mastering chains with multiband compressors, linear phase | equalizers, limiters etc.) So every 5.3ms (256 sample buffer | @ 48kHz sample rate) we start massively parallel processing | all the leaf nodes (audio tracks with sound files, virtual | instruments and synths) and end up bottlenecking as the tree | collapses into a line. Then we are stuck doing some of the | most CPU intensive work on a single core since we can't | process the limiter DSP plug-in until the EQ finishes its | work, for example. | | We need to meet our real-time deadline or risk dropping | buffers and making nasty pops and clicks. That mastering | stage can pretty easily be the limiting (hah) step that | causes us to miss the deadline, even if we processed hundreds | of tracks in parallel moments before in less time. | | The plug-in APIs (AudioUnits, VSTs, AAX) which are | responsible for all the DSP and virtual instruments are also | designed to process synchronously. 
Some plug-ins implement | their own threading under the hood, but this can often get in | the way of the host application's real-time processing. On | top of that, because the API isn't designed to be | asynchronous, the host's processing thread is tied up waiting | for the completed result from the plug-in before it can move | on. | | Add to that: many DSP algorithms are time-dependent. You can't | chop up the sample buffer into N different parts and process | them independently. The result for sample i+1 depends on | processing sample i first. | cletus wrote: | The mantra I live by is that multi-threaded programming is so | potentially complex and so easy to screw up (eg erasing all | your supposed performance gains, as was the case in this article, | or introducing subtle bugs or (worse) security flaws) that you | should do absolutely everything you can to avoid it. | | IME C++ programmers seem to put an unreasonable amount of faith | in RAII to "solve" these problems, eg:
|           std::mutex m;
|           ...
|           {
|             std::lock_guard<std::mutex> lg(m);
|             // do stuff with a mutex lock
|           } // lock_guard destructor releases mutex
| | If you're just updating a (private) map or something, this is | fine. But as soon as this gets complex (eg by calling a function | where you lose control over what thread is executing because some | deep code creates a new thread) then this all turns to crap. | | So whenever anyone tells you "just start a thread" without any | hesitation, it's reasonably safe to assume they have no idea what | they're talking about. | | This is one reason I love the async code model used in a number | of languages. I have the most experience with Hack [1] but it's | not unique to that. What I tend to think of as "product" code | (which is a real Facebook-ism) has no need to ever create a | thread or deal with any concurrency issues. | | Now if you're programming an HTTP server or an RDBMS or other | infrastructure then sure, you have a valid need. But if you're | just serving HTTP requests, just avoid it entirely if you can, | because any performance gain is probably imaginary but the | complexity increase isn't. | | [1]: https://docs.hhvm.com/hack/asynchronous-operations/introduct... | otabdeveloper4 wrote: | Don't call functions (that aren't part of the C++ standard | library) inside the lock guard. It's not a hard rule and you'll | get 99% of the way there. | Thiez wrote: | Isn't that mantra a bit too pessimistic? After fixing the | performance issue, a speedup that was almost linear in the | number of cores was achieved using multithreading. | | In addition, async code can bring many of the same challenges | as multithreading. For example, in C# tasks may run on the | current thread, but they may also run on a thread pool, and | this is mostly transparent to the programmer. When you don't | immediately await all asynchronous operations, your tasks may | end up running at the same time. | cletus wrote: | Like anything, you weigh up the upside against the downside. | | The downside is more complex code, code that's harder to | reason about, the possibility of concurrency bugs, slower | development times, the possibility of security flaws, and the | chance that you are just shifting your performance bottleneck (eg | creating a hot mutex). | | The upside is lower latency, more throughput and higher QPS. | | My argument is quite simply YAGNI. A lot of these benefits | are for many people purely imaginary. They're the very | definition of premature optimization.
Code that is less | complex, faster to develop in, and likely "good enough" is | probably going to help you way more than that. | | If you get to the scale where heavy multithreading actually | matters, that's a good problem to have. My argument is that | many people aren't there, and I would hazard a guess that a | contributing factor to not getting there is worrying about | this level of performance before they need to. | thewarrior wrote: | This model is good for I/O-heavy code that is typical for most | web services. However, if your load is a mix of I/O and compute, | something like Go should do a lot better. | cletus wrote: | Not sure why you're being downvoted because you raise a valid | point: Go has a concurrency model aimed at avoiding many of | these problems. This is of course the channel abstraction. | | Go has other issues. My first piece of advice to any new Go | programmer is to make every single one of your channels | unbuffered. This way it operates a lot like the async model I | mentioned and it tends to be easy to reason about. | | But people fall into a trap of thinking adding a buffer will | improve concurrency and throughput. This might be true, but | most of the time it's a premature optimization. Buffered | channels also often lead to obscure bugs and a system that's | simply harder to reason about. | | It's really better to think of Go as a tool for organizing | concurrent code, not necessarily a direct solution to the | problems involved. But organization matters. | thewarrior wrote: | All concurrency tools are foot guns but Go seems to be the | safest foot gun :) | FpUser wrote: | >"But if you're just serving HTTP requests, just avoid it | entirely if you can because any performance gain is probably | imaginary but the complexity increase isn't" | | I serve HTTP requests offering JSON-based RPC in my C++ servers | that do all kinds of calculations (often CPU-consuming) in | multiple threads. Works just fine and scales well. Yes, I have | to be careful about how I work with shared data, but it is | centralized, encapsulated and done in a single place I call the | request router / controller. I reuse it among projects. | Nothing too exciting about it and no reason to be scared to | death ;) | amichal wrote: | Would this tool have scaled as well as 24 separate OS | processes? | | I remember hitting this same issue many years ago. Some primitive | in the library (C++) didn't on the surface appear to involve | shared state and a cache write lock, but of course it did if you | thought about it. Since I switched over to more application-level | coding I don't work at that low a level very much and tend to scale | with processes now (typically not needing that much parallelism | anyway), so I don't often have to think these things through | anymore. | 10000truths wrote: | The problem outlined in the article is not process vs | thread; the problem is multiple parallel execution | contexts reading and writing the same cache line. This issue | would have manifested in the same way if your separate | processes touched the same offset in a shared memory mapping. | | Of course, designing with multiple processes in mind helps to | design a shared-nothing architecture, as the costs of shared | state between processes are much more explicit. But _any_ kind | of state that needs to be shared will run into the same | bottleneck, because all forms of shared IPC fundamentally have | to use some form of locking somewhere in the stack to serialize | incoming inputs.
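The "same cache line" point above is easy to demonstrate. Below is a small, self-contained false-sharing benchmark — a sketch rather than anything from the article: it assumes 64-byte cache lines, arbitrary thread and iteration counts, and Rust 1.63+ for std::thread::scope. Each thread only ever writes its own counter, yet the packed layout is usually several times slower than the padded one on a multi-core machine, because neighbouring counters land on the same line.

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::thread;
    use std::time::Instant;

    // Each padded counter is forced onto its own 64-byte cache line.
    #[repr(align(64))]
    struct Padded(AtomicU64);

    const THREADS: usize = 8;
    const ITERS: u64 = 5_000_000;

    fn main() {
        // Packed: eight 8-byte atomics sit on one or two cache lines, so the
        // "independent" per-thread increments still invalidate each other.
        let packed: Vec<AtomicU64> = (0..THREADS).map(|_| AtomicU64::new(0)).collect();
        let t = Instant::now();
        thread::scope(|s| {
            for i in 0..THREADS {
                let packed = &packed;
                s.spawn(move || {
                    for _ in 0..ITERS {
                        packed[i].fetch_add(1, Ordering::Relaxed);
                    }
                });
            }
        });
        println!("packed: {:?}", t.elapsed());

        // Padded: the same work, but each thread's counter has a line to itself.
        let padded: Vec<Padded> = (0..THREADS).map(|_| Padded(AtomicU64::new(0))).collect();
        let t = Instant::now();
        thread::scope(|s| {
            for i in 0..THREADS {
                let padded = &padded;
                s.spawn(move || {
                    for _ in 0..ITERS {
                        padded[i].0.fetch_add(1, Ordering::Relaxed);
                    }
                });
            }
        });
        println!("padded: {:?}", t.elapsed());
    }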
| lmeyerov wrote: | I think the intuition on processes vs threads was right: | | 1. The Arc use is a case (I believe! Am not a Rust'er!) of | true sharing of references. The cacheline will be contention | across cores. Doing processes would have eliminated true | sharing. Tracking logical core affinity for references via a | modal type system could have flagged this at compile time, | but I suspect that's not a thing in Rust land. | | 2. Even if it was false sharing at the cacheline level, | process abstractions in shared memory systems are generally | aligned to avoid false sharing at cacheline granularity (and | misaligned ops at smaller granularities). Of course, in the | extreme, a 10X hero programmer may have tried to do amazing | levels of bit packing and syscall hackery and thus break all | those niceties, but that's the cleanup that everyone else in | a team has to make the hero a 10Xer. | loeg wrote: | Probably yes, unless the author went to great lengths to | contend on shared memory segments. | wabain wrote: | This is an excellent writeup, but I'm curious about how dependent | the microbenchmark is on the scripting engine doing no real work. | My off-the-cuff guess is that the unshared version might still | perform better with other workloads (it would have higher | throughput and avoid variance caused by fluctuating degrees of | cache contention) but it might be that the observable overhead | from hammering a single cache line would disappear if each thread | had more varied work to do. And of course if the scripting engine | benefits substantially from caching internal state across | invocations (things like incrementally optimized JIT code) then | having 24 instances can incur substantial extra work. | dathinab wrote: | Has anyone any idea where the `` format strings come from?? | | As far as I know rust doesn't have them. | | And as far as I know if rust gets format strings it most likely | get them in form of a `f` prefix (like raw strings have an `r` | prefix and binary strings have a `b` prefix). | | Is it just a thing to make the code more readable/shorter on the | web? | metabagel wrote: | Markdown uses backticks around code. | | https://docs.github.com/en/github/writing-on-github/getting-... | cdirkx wrote: | The code shown in the article isn't Rust, but Rune [0], which | has similar syntax, but is a dynamic scripting language. It has | template literals [1]. | | [0] https://rune-rs.github.io/ | | [1] https://rune-rs.github.io/book/template_literals.html | dathinab wrote: | Oh, that makes sense. | michalhosna wrote: | If you are talking about "Here is an example of a complete | workload that measures performance of selecting rows by random | keys", then you are looking at Rune code rather than Rust code. | | See https://rune-rs.github.io/book/template_literals.html | johnisgood wrote: | Can anyone TL;DR it? A single line, in Rust, can make a 24-core | server slower than a laptop? What line is that? | jkelleyrtp wrote: | In a multithreaded context, shared data was shared via atomic | reference counting. The atomic operation acted as a single | bottleneck for every operation. The solution was to derive deep | cloning for the shared data (a 1 line fix) to reduce atomic | contention. | PhileinSophia wrote: | vlovich123 wrote: | Is there any tooling that can demonstrate cache coherency | hotspots like cache line bouncing or inter-core communication? | mhh__ wrote: | All of the "serious" profilers support this. 
Intel have a bunch | of tooling for diagnosing threading issues, AMD also have some | less polished but still functional metrics you can use. | vlovich123 wrote: | Do they track it down to the actual problem point in the | code? Or do you just see that maybe there's a problem and | then experimenting with changes to figure out what needs to | be fixed? | mhh__ wrote: | Short answer: yes. Much better than perf does. | maxwell86 wrote: | Yes. Linux perf has a c2c hit modify "mode"/"option" that will | report precisely this issue. | loeg wrote: | What is c2c? Thanks. | tkhattra wrote: | c2c stands for cache-2-cache. | | perf-c2c allows you to measure cacheline contention - see | https://man7.org/linux/man-pages/man1/perf-c2c.1.html and | https://joemario.github.io/blog/2016/09/01/c2c-blog/. | eptcyka wrote: | Cache to cents? Kernels 2 cents on how your code is | utilizing the cache? I have 0 clue too. | masklinn wrote: | Cache to Cache, | | It tracks a bunch of cache-related counters, and HITM | (hit modify) tracks loads of data modified by either an | other core of the same node ("local hitm") or by an other | node entirely ("remote hitm"). | | Here, the contention would have shown up as extreme | amounts of both, as each core would almost certainly try | atomically updating a value which had just been updated | by an other core, with higher than even chances the value | had been updated by a core of the other socket (= node). | gjulianm wrote: | I've used the Intel VTune Profiler, I think it's free. It's a | pretty good tool for profiling code and gives you quite a lot | of information on how you're using the core pipelines, but you | need an Intel processor to use it. I think it would have | quickly flagged this kind of problem and told you that the | processor were stalled waiting for cache. | mhh__ wrote: | Vtune is free now. it's by the far the best profiler I've | ever come across, AMD's equivalent is a bit rough and ready | by comparison. | game_the0ry wrote: | Relevant video - Scott Meyers: Cpu Caches and Why You Care | | https://www.youtube.com/watch?v=WDIkqP4JbkE | | Remember, boys and girls, small mistakes can have big | consequences. | | Keep those data in immutable data structures when mutli- | threading. Actor model also helps. | teen wrote: | That code? Thread.sleep(5000) | errcorrectcode wrote: | My laptop (i7-8650U) has slightly faster single-core performance | than my 96-thread dual 7402 EPYC Milan home server. I'm fine with | that. I'm not fine with undue inefficiencies, i.e., excessive | cache misses, branch mis-predictions, and unparallelization due | to sloppy or unfortunate code. | kentonv wrote: | An aside: | | > L3 cache is shared by all cores of a CPU. | | I recently learned this is no longer true! AMD EPYC processors | (and maybe other recent AMDs?) divide cores into groups called | "core complexes" (CCX), each of which has a separate L3 cache. My | particular processor is 32-core where each set of 4 cores is a | CCX. I discovered this when trying to figure out why a benchmark | was performing wildly differently from one run to the next with a | bimodal distribution -- it turned out to depend on whether Linux | has scheduled the two processes in my benchmark to run on the | same CCX vs. different CCXs. | | https://en.wikipedia.org/wiki/Epyc shows the "core config" of | each model, which is (number of CCX) x (cores per CCX). 
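If you want a toy workload to point those tools at, the sketch below (not from the article; the thread and iteration counts are arbitrary) is about the simplest case of true sharing: two threads doing atomic increments on one counter, so the line holding it bounces between cores on every update. Running it under `perf c2c record` and then `perf c2c report` shows the contended line with heavy HITM counts, and pinning the process with `taskset` to two cores in the same versus different CCX/socket (which CPU numbers share a CCX is machine-specific; lscpu or the sysfs cache topology will tell you) reproduces the kind of gap described above.

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::thread;

    // Two threads hammering one shared counter: the cache line holding it is
    // bounced between cores on every modification ("true sharing"), which is
    // exactly the pattern perf c2c reports as HITM.
    static COUNTER: AtomicU64 = AtomicU64::new(0);

    fn main() {
        let workers: Vec<_> = (0..2)
            .map(|_| {
                thread::spawn(|| {
                    for _ in 0..50_000_000u64 {
                        COUNTER.fetch_add(1, Ordering::Relaxed);
                    }
                })
            })
            .collect();
        for w in workers {
            w.join().unwrap();
        }
        println!("{}", COUNTER.load(Ordering::Relaxed));
    }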
| [deleted] | Ataraxic wrote: | Just as information for some later readers, but this is | applicable not just to Epyc but also the consumer version | Threadripper. My understanding is that this is an example of a | Non-Uniform Memory Access (NUMA) which was also used to link | multiple cpus in different sockets together for a long time | now, but now they've been integrated into a cpu that fits in | one socket. | | This actually has an importance if you are running a VM on such | a system since you'll run into things like the actual RAM (not | l3 cache) is often directly linked to a particular NUMA node. | For example accessing memory in the first ram stick vs the | second will give different latencies as it goes from ccx1 => | ccx2 => stick2 versus ccx1 => stick1. This is applicable to I | think 2XXX versions and earlier for threadripper. My | understanding is that they solved this in later versions using | the infinity fabric (IO die) so now all ccx's go through the IO | die. | | I ran into all of this trying to run an ubuntu machine that ran | windows using KVM while passing through my nvidia graphics | card. | reilly3000 wrote: | This sheds some light on why I had so much trouble running | GPU passthrough with Unraid (KVM) on my 1950X w/ Nvida. Those | were dark days. | rob_c wrote: | Frankly that was probably 200% unraid ... | masklinn wrote: | > Just as information for some later readers, but this is | applicable not just to Epyc but also the consumer version | Threadripper. | | It's applicable to any Zen design with more than one CCX, | which is... any Zen 3 CPU of more than 8 cores (in Zen 2 it | was 4). | | The wiki has the explanation under the "Core config" entry of | the Zen CPUs, but for the 5000s it's all the 59xx (12 and 16 | cores). | | Zen 3 APU are all single-CCX, though there are Zen 2 parts in | the 5000 range which are multi-CCX (because why not confuse | people): the 6-cores 5500U is a 2x3 and the 8-core 5700U is a | 2x4. | | The rest is either low-core enough to be single-CCX Zen 2 | (5300U) or topping out at a single 8-cores CCX (everything | else). | dagmx wrote: | This also applies to Ryzen as well. The first gen Ryzen was | plagued by software and schedulers that couldn't handle the | ccx affinity | 123pie123 wrote: | this is important when sizing and tuning VMs/ hypervisor | masklinn wrote: | > I recently learned this is no longer true! AMD EPYC | processors (and maybe other recent AMDs?) divide cores into | groups called "core complexes" (CCX), each of which has a | separate L3 cache. | | It's still kinda true, just that "a CPU" in that context is a | CCX. Cross CCX communications has shown up a fair bit in | reviews and benches, and really in all chips at that scale | (e.g. Ampere's Altras and Intel's Xeons): | https://www.anandtech.com/show/16529/amd-epyc-milan-review/4 | and one of the "improvements" in Zen 3 is the CCX are much | larger (there's one CCX per CCD, rather than 2) so there's less | crosstalk. | | And it could already be untrue previously e.g. the Pentium D | were rushed out by sticking two P4 dies on the same PCB, I | think they even had to go through the northbridge to | communicate, so they were dual socket in all but physical | conformation (hence being absolute turds). I don't think they | had L3 at all though, so that wasn't really a factor, but | still... | tentacleuno wrote: | > And it could already be untrue previously e.g. 
the Pentium | D were rushed out by sticking two P4 dies on the same PCB, I | think they even had to go through the northbridge to | communicate | | Part of me hopes you're wrong here. That is absolutely | absurd. | masklinn wrote: | Yeah you'd hope so wouldn't you? | | But Intel got caught completely unaware by the switch to | multi-core, just as it had by the 64b switch. | | The eventual Core 2 was not ready yet (Intel even had to | bridge to it with the intermediate Core, which really had | more to do with the Pentium M than with the Core 2 though | it did feature a proper dual-core design... so much so that | the Core solo was actually a binned Duo with one core | disabled). | | So anyway Intel was caught with its pants around its ankle | for the second time and they couldn't let that happen. And | they actually beat AMD to market, having turned out a | working dual-core design in the time between AMD's | announcement of the dual-core opteron (and strong hints of | x2) and the actual release, about 8 months. | | To manage that Intel could not rearchitecture their chip | (and probably didn't want to as it'd become clear Netburst | was a dead-end), so they stapled two Prescott cores | together, FSB included, and connected both to the | northbridge. | | It probably took more time to validate that solution for | the server market, which is why where AMD released the dual | core Opterons in April and Athlons in May, it took until | October for the first dual-core Xeon to be available. | danudey wrote: | Here's the wikipedia article: | | https://en.wikipedia.org/wiki/Pentium_D | | "The Pentium D brand refers to two series of desktop dual- | core 64-bit x86-64 microprocessors with the NetBurst | microarchitecture, which is the dual-core variant of | Pentium 4 "Prescott" manufactured by Intel. Each CPU | comprised two dies, each containing a single core, residing | next to each other on a multi-chip module package." | IronWolve wrote: | Yup, and why overclocking on AMD tests each of your | ccx/cores, marks gold cores that can reach 5ghz+, and slower | cores that dont overclock. Also why AMD tests each ccx, and | moves worse ones to lower end products, and higher ones to | the 5950x. | | Then you can OC tweak each core and try to milk out | performance, PBO tends to be pretty good on detecting which | cores with slight performance boost. With a few bios options, | all that adds up without effort. | | And then finally get a better (advanced) scheduler in | windows/linux that can move workloads around per core | depending on workload. Windows released multiple fixes for | the scheduler on AMD starting with win10 1903. | | I find scheduler mods and modest bios tweaks to increase | performance without much effort, very detectable performance. | | Linux I use xanmod (and pf-kernel before that) | | Windows I use project lasso and AMD PBO. | | Slow/fast cores and scheduler tweaks is how win11/arm | mac/android, makes things appear faster too. | | Amusing, scheduler/core tweaks been around for linux for a | decade+, making the desktop super smooth, but its now | mainstream in win11/arm osx. | Andys wrote: | does xanmod work as a drop-in replacement in Ubuntu with an | NVIDIA card? | sroussey wrote: | There are core complexes in Apple's M1 variations as well. | | And it's not just L3 that can shared by various chips. | | Each complex also has its own memory bandwidth, so running on | two core in two complexes will get around twice the memory | bandwidth of two cores in the same complex. 
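Relevant to the CCX discussion above: on Linux you do not need a separate tool to see which cores share an L3, because the kernel exports the cache topology under /sys. The sketch below (an illustration, not from the article; recent Rust assumed for let-else) walks that topology and prints, for every logical CPU, which CPUs share each of its caches. On a multi-CCX part the L3 entries show several disjoint groups rather than one group covering all cores.

    use std::fs;

    // Walk the Linux sysfs cache topology and print which logical CPUs share
    // each cache on this machine.
    fn main() -> std::io::Result<()> {
        for entry in fs::read_dir("/sys/devices/system/cpu")? {
            let cpu_path = entry?.path();
            let name = cpu_path.file_name().unwrap().to_string_lossy().into_owned();
            if !name.starts_with("cpu") || name[3..].parse::<u32>().is_err() {
                continue; // skip cpufreq, cpuidle, etc.
            }
            let Ok(indexes) = fs::read_dir(cpu_path.join("cache")) else { continue };
            for idx in indexes.flatten() {
                if !idx.file_name().to_string_lossy().starts_with("index") {
                    continue;
                }
                let p = idx.path();
                let level = fs::read_to_string(p.join("level")).unwrap_or_default();
                let kind = fs::read_to_string(p.join("type")).unwrap_or_default();
                let shared = fs::read_to_string(p.join("shared_cpu_list")).unwrap_or_default();
                println!("{name}: L{} {} shared by CPUs {}", level.trim(), kind.trim(), shared.trim());
            }
        }
        Ok(())
    }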
| lend000 wrote: | I've always wondered how much this affects shared cloud | instances (for example, a single c5.small instance). Can your | neighbor, who shares your L3 cache, be using memory so much | more aggressively than you that it causes you to evict your | L2/L1 cache? Or is cache coherency maintained if the core that | evicted your L3 cache is known to be living in a different | memory space? | aspaceman wrote: | Coherency will be maintained (since the protocols support | that case). But yes, a separate process would evict that | memory. From the processor's point of view, they're just | addresses tagged with data. Caching behavior doesn't depend on | virtual memory addresses because I believe that info is | stripped away at that point. | | So if someone is thrashing the cache on the same core you're on, | you will notice it if the processes aren't being shared | effectively. | | The contents of the cache aren't stored as part of a paused | process or context switch. But I'd appreciate a correction | here if I'm wrong. | | For an example, consider two processes A and B running on a | set of cores. If A makes many more memory accesses than B, A | can effectively starve B of "cached" memory accesses because | A accesses memory more frequently. | | If B were run alone, then its working set would fit in | cache, effectively making the algorithm operate from cache | instead of RAM. | | BUT. You really have to be hitting the caches hard. This doesn't | happen too often in casual applications. I only encountered | this on GPUs (where each core has a separate L1 but a shared | L2). Even then it's only a problem if every core is hitting | different cache lines. | aspaceman wrote: | Found some detail - apparently the CPU manuals are the | place to check: https://cs.stackexchange.com/a/1090 | pdpi wrote: | Out of curiosity -- was the "bad" mode a case of two separate | workloads competing for limited cache within one CCX, or was it | really bad cache behaviour across CCXs causing problems because | the two processes shared memory? | kentonv wrote: | The two processes were a client and a server; the client just | sends HTTP requests to the server over a unix socket, but | otherwise they don't share resources. TBH I'm surprised there | was so much impact just from that. There might be something | more to the story -- I haven't investigated much yet. But | basically, when I used `taskset` to limit the benchmark to two | cores in different CCXs, it ran at about half the speed of when | limited to two cores in the same CCX. I guess another | possibility is that the kernel is allowing both processes to | swap between cores regularly, which would much better explain | the disastrous performance penalty. But, in that case I don't | understand the bimodal distribution I see when running the | benchmark with no core affinity... | hinkley wrote: | This kind of [stuff] drives me nuts every time I'm trying to | read the performance telemetry tea leaves at work. | | Do we have requests that take longer, or did Linux do something | dumb with thread affinity? | dahfizz wrote: | Pro tip: if you care about performance, and especially if you | care about reliable performance measurements, you should pin | and accelerate your threads. Using isolated CPUs (isolcpus) is the | next step. | | In other words: if your application is at the mercy of Linux | making bad decisions about what threads run where, that is a | performance bug in your app. | spockz wrote: | How would you do this when running in Kubernetes?
Afaik it | just guarantees that you get scheduled on some cpu for the | "right" amount of time. Not on which cpu that is. | | I only know that there is one scheduler that gives you | dedicated cores if you set the request and limit both to | equal multiples of 1000. | spiddy wrote: | On post-docker pre-kubernetes time I used `--cpuset-cpus` | on docker args to dedicate specific cpus to redis | instances, using CoreOS and fleet for cluster | orchestration. | ncmncm wrote: | Another reason to put off moving to kubernetes. | masklinn wrote: | > Do we have requests that take longer, or did Linux do | something dumb with thread affinity? | | Yes. | | Cross-socket communications do take longer, but a properly | configured NUMA-aware OS should probably have segregated | threads of the same process to the same socket, so the | performance should have increased linearly from 1 to 12 | threads, then fallen off a cliff as the cross-socket effect | started blowing up performances. | klysm wrote: | Do you use numa nodes to tell the kernel about this or | something else? | wmf wrote: | You can enable node-per-CCX mode in UEFI to do that. | Otherwise you have to use tools like hwloc to discover the | cache topology and pin workloads accordingly. | rob_c wrote: | Underrated and useful reply. Thank you! | buildbot wrote: | L3 being shared is only common on smaller cores really, on any | large Intel design the L3 isn't all in one place, it's broken | up across the core. | wmf wrote: | That's still a single logical L3 (e.g. a single core could | use the entire L3 capacity). | cheschire wrote: | There are only two hard things in Computer Science: cache | invalidation and naming things. - Phil Karlton | airstrike wrote: | There are actually only two hard problems in computer science: | | 0) Cache invalidation | | 1) Naming things | | 5) Asynchronous callbacks | | 2) Off-by-one errors | | 3) Scope creep | | 6) Bounds checking | c0balt wrote: | You forgot text encodings, endianess, ... | hinkley wrote: | Is 4 uninitialized variables? | Scarblac wrote: | There is only one hard problem in computer science: _people_. | [deleted] | nicoburns wrote: | Most languages will do bounds checking for you. I reckon that | one's mostly solved at this point. | chris_st wrote: | But... but... but... we HAVE to use C, it's the only | language that's fast enough! | | /s | andi999 wrote: | Too slow. | hnaccount_rng wrote: | You missed the off-by-one errors | cheschire wrote: | So you mean I was off by one? | Pet_Ant wrote: | Better version: | | There are only two hard things in Computer Science: cache | invalidation, naming things, and off-by-one errors. | MR4D wrote: | Better better version: | | There are only two hard things in Computer Science: cache | invalidation, and...what did you ask me again? | jahnu wrote: | There are only two hard things in Computer Science: cache | invalidation, naming things, and off-by-oly two hard things | in Computer Scivetem dy gjera te veshtira ne Shkencen | Kompjute | hermitdev wrote: | Now, just toss in a byte order mark, and I think we've a | winner! | jacquesm wrote: | Hehe. Ok. That is very funny :) | gavinray wrote: | I don't get it -- am I slow? | [deleted] | jacquesm wrote: | Buffer overflow :) | vaidhy wrote: | This used to happen in C for me. Basically, you are | trying to print a string which has overflowed the | allocated length and you get junk printed on the string. | C-Strings are null terminated and the system will keep | printing till it encounters a null. 
So, bounds checking | and buffer overflows are also hard problems. | colejohnson66 wrote: | Gotta watch out for those nulls :) | silon42 wrote: | Nulls are easy compared to invalid non-null pointers. | ludamad wrote: | People say null in Java was a mistake, but I've never had | any issues in my null-only dialect of Java /s | kibwen wrote: | There are only two hard problems in Computer Science: cache | invalidation, naming things, off-by-one errors, and | remembering to update the documentation. | nikkinana wrote: | [deleted] | physicsgraph wrote: | This article is a well-written deep dive on parallel performance | debugging that ends up identifying the difference being due to | how caching is handled on multi-CPU vs single-CPU systems. Thanks | to the author for taking time to write up their troubleshooting | process description. | pengaru wrote: | TL;DR: serializing your parallel work dispatch on an atomic | global counter is bad for scaling across many cores. | | Or, per-cpu counters exist for a reason. | dan-robertson wrote: | It seems to me that one solution would be to be able to shard an | Arc into separate counts for each thread, so that the | synchronisation wouldn't be required (?) but I don't know how it | would be done without making most Arcs unnecessarily expensive. | | Perhaps another problem is that libraries can't be (or just | aren't) generic about the kind of smart-pointers that are used | but the kind of smart-pointers used can matter a lot (e.g. | libraries often have to use Arc for some of their clients even | though Rc would often suffice). | | But maybe I'm just totally wrong about this? | Kranar wrote: | It's a great idea and in fact a variation of it is being | explored by Python to eliminate the use of the GIL without | introducing performance regressions. | | The variation is instead of one count per thread, have two | reference counts where one is atomic and the other is not | atomic. The heuristic is that at any given time, the majority | of writes are performed within a single thread and so that | thread has exclusive read/write access to the non-atomic | reference count, all other threads share the atomic reference | count. There is some work needed to coordinate the atomic and | non-atomic reference count so as to test when an object can be | deleted, but the overhead of this work is less than the cost of | always using an atomic. | | http://iacoma.cs.uiuc.edu/iacoma-papers/pact18.pdf | dan-robertson wrote: | In the case of the original article, the reference count was | being updated by all the threads so having an Arc-with-Rc- | for-one-thread wouldn't actually change things, I think. But | maybe you just mean for 'cheaper smart-pointers in general'? | | The python thing is interesting. I thought there were other | reasons for the GIL than ref counts but I guess everything | must be addressed. | Kranar wrote: | Yep in this case it might not have helped, but all this to | say that some exploration of having multiple reference | counts is a great idea. The balance becomes the overhead of | coordinating multiple reference counts versus the gains | from reduced atomic operations. | Pxtl wrote: | That actually makes me appreciate that the single-threadedness of | Python might a good thing since Python uses refcounting. | reilly3000 wrote: | FWIW this is the kind of content I originally came to HN for and | what keeps me coming back. | chrisaycock wrote: | The culprit was a shared _atomic reference counter_ (Arc). 
The | cache invalidation across cores eliminated any kind of scaling | via multiple threads. | | The solution is to stop sharing. The author had to make changes | specific to this codebase for each thread to have its own copy of | the offending dataset. | denysvitali wrote: | Awesome TL;DR! I read the full article but this summarizes it | well. The article is anyways worth a full read | divs1210 wrote: | Immutable collections save the day! | Thiez wrote: | Immutable collections wouldn't really help here since Rust | does not have a garbage collector, so they would still be | wrapped in an `Arc<T>`, or use one internally. Unless you're | giving each thread their own copy of the immutable | collection, in which case the immutability doesn't really add | much. | ericbarrett wrote: | You're right if they're wrapped in an Arc<T>. But it's a | lot easier to confidently clone an immutable data structure | to threads/async functions because you know they won't be | mutated by other cores. If you're cloning mutable data, | you'll have to collect each "shard" at the end of whatever | runtime loop your program has, and merge them back into the | authoritative model. (Or drop the changes, but then why | make the data mutable?) Gets hairy. | amelius wrote: | Or use a language with a good concurrent garbage collector. | dagmx wrote: | What language wouldn't hit this issue? The second you have an | atomic write across multiple numa clusters you'll hit this. | | Edit: it appears they're talking specifically about the ref | counting whereas I was considering the entirety of a shared | context. That clarification helps me understand where their | statement was coming from | amelius wrote: | The problem was caused by a memory management primitive. | This wouldn't be an issue in a runtime with a GC designed | for concurrent loads. | lowbloodsugar wrote: | Wood for trees. The problem was caused by an ill-thought- | out design. We can do similarly performance degrading | things in GC languages, just the details will be | different. However, at some extreme, which, to be fair, | most systems won't hit, GC languages perform vastly worse | than non GC languages. In one service I own, Java's G1GC | uses 20x more CPU than Rust for an application specific | benchmark. Most of this time is spent in the concurrent | phase so Shenandoah and GenShen aren't going to make a | dent (and we can't afford the RAM for Shendandoah). 20x | CPU and 2x wall clock. The question we are looking at is, | "Should we just continue to spend 20x $ on operating | costs for the Java version to avoid writing it in Rust?" | dagmx wrote: | How would the GC avoid the atomic lock and cache | invalidation across numa boundaries? What language has a | sufficiently lock less rw capable GC? | | Edit: to clarify I was thinking about a mutable context | share as shown in the code example, not solely about the | ref counting. | yencabulator wrote: | If the only purpose of the Arc was to be a reference | count, a language with a (non-refcount) GC wouldn't need | that atomic in the first place. | dagmx wrote: | That's fair. It read to me, from his post, that Rune was | using it for context sharing as well between threads | (since that's the source of the highest level Arc in his | code example) . If it's only for ref counts then it makes | sense that a concurrent GC could avoid the issue | Const-me wrote: | > How would the GC avoid the atomic lock and cache | invalidation across numa boundaries? | | By not using reference counting. 
State of the art GCs | don't count references. They usually doing mark and | sweep, implementing multiple generations, and/or doing a | few other things. | | Most of that overhead only happens while collecting. | Merely referencing an object from another thread doesn't | modify any shared cache lines. | | > What language has a sufficiently lock less rw capable | GC? | | Java, C#, F#, Golang. | dagmx wrote: | Yeah I think my confusion was that I was thinking about a | fully mutable context shared across threads based on the | code example in the post. | | But if it's just for the ref counting part of the Arc | then I can see how a GC would solve it by not needing the | RC | Const-me wrote: | > I was thinking about a fully mutable context shared | across threads | | A quote from the article: "No locks, no mutexes, no | syscalls, no shared mutable data here. There are some | read-only structures context and unit shared behind an | Arc, but read-only sharing shouldn't be a problem." As | you see, the data shared across threads was immutable. | | However, the library they have picked was designed around | Rust's ref.counting Arc<> smart pointers. Apparently for | some other use cases, not needed by the OP, that library | needs to modify these objects. | | > I can see how a GC would solve it by not needing the RC | | Interestingly enough, C++ would also solve that. The | language does not stop programmers from changing things | from multiple threads concurrently. For this reason, very | few libraries have their public APIs designed around | std::shared_ptr<> (C++ equivalent of Rust's Arc<>). | Instead, what usually happens, library authors write in | the documentation things like "the object you pass to | this API must be thread safe" and "it's your | responsibility to make sure the pointer you pass stays | alive for as long as you using the API", and call it a | day. | amelius wrote: | GoLang. It was designed for highly concurrent loads. For | an overview of history and technology involved, read | e.g.: | | https://blog.twitch.tv/en/2016/07/05/gos-march-to-low- | latenc... | dagmx wrote: | Edit: another comment explains that you're likely talking | about just the ref counting aspect rather than the entire | context sharing used by the rune code shown, in which | case, yes I see why a concurrent GC would avoid the issue | in that scenario. | | ---------- | | I'm familiar with Go's GC. Your linked post doesn't | explain how it would avoid the hit mentioned by the cache | invalidation across multiple clusters. | | It'll either try and put multiple go routines on a single | cluster (as listed in the link) or it'll need to copy the | necessary stack per thread. Which is effectively what the | original article ends up doing. | | But if you encounter anything that needs to run | concurrently across threads while using a single r/w | object, you'll hit the same cliff surely? | YogurtFiend wrote: | While this isn't true with Go's GC, a GC that's stop-the- | world can avoid these cache coherency issues altogether. | If you pause every thread before marking and sweeping, | then you won't run into problems--only the GC is running. | While this may sound silly, stopping the world will | oftentimes lead to higher throughput than concurrent | collectors, at the cost of higher latency (which is why | it's not suitable for Go, which is often used for | building servers where latency is a priority). 
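For a Rust-flavoured view of the point being made here — a tracing GC never touches a count when a reference is copied, while reference counting pays for an atomic on every clone and drop — the sketch below (not from the article; the loop count is arbitrary and std::hint::black_box needs Rust 1.66+) times uncontended Rc versus Arc clone/drop on a single thread. Even without any contention the atomic version is noticeably slower, and as the article shows, contention makes it dramatically worse.

    use std::rc::Rc;
    use std::sync::Arc;
    use std::time::Instant;

    // Single-threaded comparison of the per-operation cost of a plain
    // reference count (Rc) versus an atomic one (Arc). This ignores
    // contention entirely, but it illustrates the bookkeeping a tracing GC
    // never pays when a reference is copied.
    fn main() {
        const N: u64 = 100_000_000;

        let rc = Rc::new(0u64);
        let t = Instant::now();
        for _ in 0..N {
            let clone = Rc::clone(&rc);   // plain, non-atomic increment
            std::hint::black_box(&clone); // keep the clone from being optimized out
        }                                 // clone dropped each iteration: plain decrement
        println!("Rc : {:?}", t.elapsed());

        let arc = Arc::new(0u64);
        let t = Instant::now();
        for _ in 0..N {
            let clone = Arc::clone(&arc); // lock-prefixed atomic increment
            std::hint::black_box(&clone);
        }                                 // clone dropped each iteration: atomic decrement
        println!("Arc: {:?}", t.elapsed());
    }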
| jhgb wrote: | For one, I'd be surprised if Azul's C4 (Zing) didn't have | these issues already solved (assuming anyone has solved | that, _they_ probably did). | YogurtFiend wrote: | I don't know why you're being downvoted. Memory reclamation | is hard in the face of parallelism. It's much easier to | design lock-free algorithms in languages with a garbage | collector, because memory reclamation issues are taken care | of for you (and oftentimes in a batched manner, which won't | affect the hot path of your code). There _are_ techniques to | avoid using a garbage collector: atomic reference counting, | epoch-based reclamation, hazard pointers, and more. But they | can be quite difficult to implement and difficult to use. | quotemstr wrote: | Right. Plus GCs can compact memory, which we can't do with | manual memory management. I don't understand why people | fetishize manual memory management --- do you want to spend | your short time on Earth twiddling references? | titzer wrote: | After 28 years of programming in C and 25 years | programming in C++ (on and off), I am sick to death of | managing memory. Programmers can't do it right or well. | Let the super-intelligent idiot savant with billions of | cycles of computation per second and trillions of bits of | storage do the thinking for me. | | Personally, I'm glad the world is moving away from unsafe | manual memory management in C++, be that with better RAII | and managed pointers, Rust's type system, etc. But those | things are still breakable and, IMHO, ultimately a big | waste of people's time. If Rust's lifetime annotations | could be inferred by a compiler, then by all means. But | if they can't, just GC already. All that extra thinking | pushes people to think about the wrong things, and designs | end up ossified, difficult to refactor because of the | extra overhead of radically changing ownership. Forget it | already! Let computers figure out the whole memory | management problem. | | Want your program to go fast? Don't allocate a bunch of | garbage. Redesign it to reuse data structures carefully. | Don't burden yourself forever twiddling with managing the | large and small alike, obsessing over every dang byte. | mcguire wrote: | Rust's Arc _is_ an atomic reference counter. But, as the | article describes, those aren't free. | duped wrote: | One could argue that using a GC for reclamation isn't lock-free | if it's not a concurrent GC. | hayley-patton wrote: | There are very few GCs that are actually lock-free. While | uncooperative mutators cannot block other mutators in an | on-the-fly collector, for example, the collector is | blocked until all mutators cooperate. Nonetheless I'd | guess that a sufficiently concurrent collector is good | enough. ___________________________________________________________________ (page generated 2021-12-31 23:00 UTC)