[HN Gopher] A single line of code made a 24-core server slower t...
       ___________________________________________________________________
        
       A single line of code made a 24-core server slower than a laptop
        
       Author : Ygg2
       Score  : 456 points
       Date   : 2021-12-31 13:40 UTC (9 hours ago)
        
 (HTM) web link (pkolaczk.github.io)
 (TXT) w3m dump (pkolaczk.github.io)
        
       | jv22222 wrote:
        | Yes! Finally, HN removed the rule to strip the word "How" from
        | the title.
        
         | hinkley wrote:
         | A few hours later...
        
       | ncmncm wrote:
       | The notion of "lock-free" is fundamentally misleading.
       | 
       | It trips up everybody who hopes "lock-free" will be a magic
       | bullet freeing them from resource contention bottlenecks.
       | 
       | When you have explicit locks, aka mutexes (condition variables,
       | semaphores, what-have-you), the interaction between your threads
       | is visible on the surface. Replacing that with "lock-free"
       | interaction, you have essentially the same set of operations as
       | when taking a mutex. On the up side, overhead cost may be lower.
       | On the down side, mistakes show up as subtly wrong results
       | instead of deadlocks. And, you get less control, so when you have
       | a problem, it is at best harder to see why, and harder to fix.
       | 
       | Because the lock-free operations involve similar hardware bus
       | interactions under the covers, they cannot fix contention
       | problems. You have no choice but to fix contention problems at a
       | higher, architectural design level, by reducing actual
       | contention. Having solved the problems there, the extra cost of
       | explicit mutex operations often does not matter, and the extra
       | control you get may be worth any extra cost.
       | 
       | What is lock-free good for, then? Lock-free components reduce
       | overhead when there is no contention. Given any actual
       | contention, performance is abandoned in favor of correctness. So,
       | if you have mutexes and performance is good, moving to lock-free
       | operations _might_ make performance a little better. If
       | performance is bad, mutexes and lock-free operations will be
        | about equally bad.
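        | 
        | To make that concrete, here is a minimal sketch (mine, not from
        | the article; thread and iteration counts are arbitrary) where
        | every thread hammers one atomic counter. Swap the fetch_add for
        | a Mutex<u64> and the scaling curve looks much the same, because
        | both serialize on the same cache line:
        | 
        |     use std::sync::atomic::{AtomicU64, Ordering};
        |     use std::sync::Arc;
        |     use std::thread;
        |     use std::time::Instant;
        | 
        |     fn main() {
        |         for n in [1, 4, 24] {
        |             let counter = Arc::new(AtomicU64::new(0));
        |             let start = Instant::now();
        |             let handles: Vec<_> = (0..n)
        |                 .map(|_| {
        |                     let c = Arc::clone(&counter);
        |                     thread::spawn(move || {
        |                         for _ in 0..1_000_000 {
        |                             // same line, every core
        |                             c.fetch_add(1, Ordering::Relaxed);
        |                         }
        |                     })
        |                 })
        |                 .collect();
        |             for h in handles {
        |                 h.join().unwrap();
        |             }
        |             println!("{} threads: {:?}", n, start.elapsed());
        |         }
        |     }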
        
         | Kranar wrote:
          | Don't see anything more misleading about lock free than about
          | anything else involving parallelism. Lock freedom has to do
          | with guarantees about scheduling: you can be lock free and
          | still have contention, but a lock free algorithm is guaranteed
          | to make progress within a bounded amount of time. A blocking
          | algorithm is not guaranteed to make any kind of progress within
          | a bounded amount of time.
         | 
         | If you're working on a real time system, and I don't mean the
         | colloquial meaning of that word, but rather a system that must
         | guarantee that a side effect is produced no later than a given
         | period of time, then you must use a lock free algorithm, there
         | is no alternative. Avionics, DSP, high frequency trading, these
         | are all domains where lock free algorithms are necessary.
          | Making a fast lock free algorithm is great, and generally the
          | goal of any kind of parallelism is performance, but fast is an
          | ambiguous term: it can mean low latency or high bandwidth, and
          | it's very unlikely you can write an algorithm that is both.
         | 
         | Lock free algorithms are great when you need low latency and
         | guaranteed throughput. If you don't need that then it's
         | possible to trade those properties for much higher bandwidth
         | and use blocking algorithms.
        
           | ncmncm wrote:
           | If you are counting on lock-free operations to guarantee you
           | will meet your real-time scheduling requirements, you have
           | chosen to fail, right out of the gate. Guaranteeing met
           | deadlines takes engineering. Using lock-free operations is
           | not a substitute for engineering. There is no such
           | substitute.
           | 
           | Also: All this has nothing to do with parallelism.
           | Parallelism is about things happening at the same time.
           | Locking or not locking, and contention, are about
           | concurrency, which is about synchronization and
           | serialization, a different topic.
        
             | Kranar wrote:
             | If you are using a mutex to guarantee real-time scheduling
             | requirements, you have failed.
             | 
             | Furthermore if you think a discussion about benchmarking an
             | algorithm on a 24-core SMP system involving atomic reads,
             | writes and shared caches has nothing to do with parallelism
             | then let's just agree to disagree with one another and I
             | wish you all the best in your programming endeavors.
             | 
             | People can decide for themselves what credibility they wish
             | to lend to your argument given your perspective on this
             | matter.
        
         | duped wrote:
          | Unfortunately I think all this comment does is muddy the
          | waters even more. All "lock free" means is that one process(+)
          | may not cause any other process to block during execution.
         | 
         | "Contention" is not very meaningful to a lock-free algorithm,
         | since lock-free algorithms typically do not permit shared data
         | access semantics. Rather they describe how data is _sent
         | between_ processes, which is why a lock-free solution looks
         | very different than something behind mutexes.
         | 
         | But to the heart of your point, I do wish people spent more
         | time understanding that lock-free vs lock-ful is a distinction
         | in semantics and doesn't imply anything about performance.
          | There are subtle trade-offs.
         | 
         | (+) I'm using "process" with a lot of hand waving, assume it
         | means whatever your atom of parallelism is
        
           | ncmncm wrote:
           | You are confusing two fundamentally different topics.
           | Anywhere there is no possibility of contention for access to
           | a resource, "lock-free" is trivially meaningless.
        
       | twoodfin wrote:
       | In situations like this, it's valuable to know about all the
       | event types perf can sample beyond CPU cycles, including cache
       | misses at various levels and branch mis-predictions. These can go
       | right into flamegraphs as well.
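        | 
        | For instance (a sketch; event names vary by CPU and kernel, so
        | check `perf list`, and the flamegraph scripts are from Brendan
        | Gregg's FlameGraph repo):
        | 
        |     perf stat -e cycles,cache-misses,LLC-load-misses,\
        |       branch-misses ./app
        |     perf record -e cache-misses -g -- ./app
        |     perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg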
        
       | Thiez wrote:
       | Is this the kind of situation where assigning a core affinity to
       | the various threads would have made a difference (to make sure
       | they all run on the same physical processor), or is the mere
       | presence of another processor enough to trigger the behavior
       | shown in the article?
        
         | loeg wrote:
         | No, that would not make a significant difference here. The
         | problem is that the same line has to bounce around all of the
         | threads that share it; pinning doesn't reduce that very much on
         | this time-scale. Thread migrations will be very infrequent
         | relative to atomic operations, especially when the number of
          | runnable threads is lower than the number of physical cores.
        
         | dagmx wrote:
         | Core affinity would only help if he didn't have more threads
         | than clusters. Otherwise he'd either hit the wall his laptop
          | did, or cross the cluster and hit the cache issues.
        
         | [deleted]
        
         | hermitdev wrote:
         | I don't think core affinity would have helped in this
         | situation. They had one instance that was being shared across
         | all cores, necessitating the reference count to be sync'd
          | between sockets at the L3 cache level. They may have seen some
          | benefit if they had, say, 2 instances, one per CPU socket, and
         | dedicated an instance per socket to avoid L3 cache
         | interactions. But, in this case, it is far simpler, cleaner and
         | better performing to not share the instance and avoid the
         | reference counting entirely.
        
       | [deleted]
        
       | w_t_payne wrote:
       | Shared-nothing dataflow models are an interesting way to engineer
       | (some types of) parallelism.
        
       | bob1029 wrote:
       | > The fact that our reference counters are atomic is an
       | additional problem that makes things even more complex for the
       | processor. Although using atomic instructions is often referred
       | to as "lockless programming", this is slightly misleading - in
       | fact, atomic operations require some locking to happen at the
       | hardware level. This locking is very fine-grained and cheap as
       | long as there is no congestion, but as usual with locking, you
       | may expect very poor performance if many things try to fight for
       | the same lock at the same time. And it is of course much worse if
       | those "things" are whole CPUs and not just single cores that are
       | close to each other.
       | 
       | Contention is the devil and you can _never_ hide from it if you
       | bring it into your life. The fastest  "contention avoidance hack"
       | would be the CAS operation as mentioned above, which still
       | performs approximately 2 orders of magnitude slower than a single
       | threaded operation would in the most adverse scenarios.
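        | 
        | (For reference, a CAS increment looks roughly like this in
        | Rust - a sketch, not code from the article. Under heavy
        | contention most attempts fail and the line keeps bouncing:)
        | 
        |     use std::sync::atomic::{AtomicU64, Ordering};
        | 
        |     fn cas_increment(counter: &AtomicU64) {
        |         let mut cur = counter.load(Ordering::Relaxed);
        |         // Retry until no other core modified the value between
        |         // our load and our compare_exchange.
        |         while let Err(actual) = counter.compare_exchange_weak(
        |             cur,
        |             cur + 1,
        |             Ordering::AcqRel,
        |             Ordering::Relaxed,
        |         ) {
        |             cur = actual;
        |         }
        |     }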
       | 
       | If you aren't sure of which problems actually "fit" on a modern
       | x86 thread these days, rest assured that many developers in
       | fintech have been able to squeeze entire financial exchanges onto
       | one:
       | 
       | https://lmax-exchange.github.io/disruptor/disruptor.html
       | 
       | For most scenarios, you really don't need all those damn cores.
       | They are almost certainly getting in the way of you solving your
       | problems.
        
         | hinkley wrote:
         | I believe Rust is pulling on the right thread here but I have
         | no idea how this will look in fifteen years. At some point we
         | will need to be able to tell that two threads don't contend and
         | thus can be "farther away" from each other, but these threads
         | are constantly communicating and need to use the same complex
         | and/or memory banks.
        
           | dagmx wrote:
           | I think it could be something built upon lifetime analysis. I
           | think a sufficiently complex compiler could use the data
           | available already today to infer thread affinity and other
           | rules.
        
             | hinkley wrote:
             | Basically what I was thinking. In the near term it doesn't
             | even have to be that precise to have a benefit (Rust builds
             | on escape analysis that helped with distributed garbage
             | collection, even when it wasn't that exhaustive). Even
             | ranking threads by suspected degree of interaction is a lot
             | of data.
             | 
             | These definitely don't talk. Offload. These definitely
             | talk. Colocate. These probably talk. Bin pack.
        
         | cjfd wrote:
         | "For most scenarios, you really don't need all those damn
         | cores. They are almost certainly getting in the way of you
         | solving your problems." This rings true. I still think the most
         | promising multi-threading strategy is to, by default, keep
         | everything on one core and if you find some expensive and more
         | or less independent workload to put that in another thread.
         | This is the old-fashioned way though. All of the cool kids will
         | push for thread pools where every workload will jump from core
         | to core and back again.... This potentially can require a lot
         | of locking, though, which also is not good for performance.
        
           | zmmmmm wrote:
           | It's an interesting reflection then on the Actor paradigm
           | where you are encouraged to create large numbers of fine
           | grained tasks and toss them into a giant thread pool blender
           | because light weight threads are "free" now. It would be
           | quite interesting to see the consequences of thread affinity
           | loss in that setting.
        
         | cecilpl2 wrote:
         | Games are a prime example of systems which must take advantage
         | of all possible hardware resources.
        
           | bob1029 wrote:
           | The fact that we believe this is probably part of the problem
           | with game development today.
        
           | dagmx wrote:
           | Games similarly struggle with multi core/ccx designed chips.
           | 
           | Battlefield 4 was famously terrible on Ryzen when AMD first
           | released it.
        
             | littlecranky67 wrote:
             | To be fair, Ryzen mostly underperformed when it was
              | released, which was not AMD's fault. When for years
              | engineers used Intel as their benchmarking and testing
              | machines and were tweaking all sorts of parameters to
              | measure against Intel only, you inevitably end up with an
              | optimized implementation that mirrors the hardware
              | details of the chip in some way. Any new architecture
              | would perform
             | badly, except if it is a perfect clone of the original
             | testing hardware.
        
               | dagmx wrote:
               | I wasn't implying it was AMDs fault, but the idea that
               | the games were optimized for Intel doesn't hold true IMHO
               | because they were also built for consoles that were AMD.
               | 
                | The majority of the performance loss was down to not
                | accounting for CCX scheduling.
        
         | PaulDavisThe1st wrote:
         | Another interesting example is audio synthesis. It is
         | interesting because there are at least two models for audio
         | synthesis: single sample processing (as used in, for example,
         | VCV Rack) and block structured processing (as used by every
         | plugin API and thus every DAW-that-runs-plugins). But it gets
         | even more interesting because the degree of potential
         | parallelism is entirely controlled by the signal routing
         | pathways chosen by the user. If the software is processing N
         | independent signal pathways that do not meet until a final
         | output stage, you have a high (well, "N") level of potential
         | parallelism, and if N can expand to a high value then you will
         | always be able to benefit from more cores.
         | 
         | However, if the signal pathway is more convoluted (no pun
         | intended for my audio nerd crew), then not only does the degree
         | of parallelism decrease but the issue of single-sample vs.
         | block structured can become much more important. In the single
         | sample case, the cost of computing a single sample has
         | (relatively) enormous fixed overhead if the code has any
         | serious level of modularity, but is still very cheap compared
         | to inter-thread and inter-core synchronization. Spreading the
         | computation across cores will frequently, but not always, slow
         | things down quite dramatically. By contrast, block structured
         | processing has a much lower per-sample cost, because the
         | function call tree is invoked only once for an entire block of
         | samples, and so the relative cost of synchronization decreases.
         | 
         | This sounds like a slam dunk for block structured processing.
         | 
         | Problem is, there are things you cannot do correctly with block
         | structured processing, and so there is always a place for
         | single sample processing. However, the potential for
          | parallelization and the level at which it should take place
          | differ significantly between the two processing models, which
          | opens the door to some highly questionable design.
         | 
         | The short version of this is that audio synthesis in the
         | abstract can always expand to use any number of cores,
         | particularly for cpu-intensive synthesis like physical
         | modelling, but that in the presence of single-sample
         | processing, the cost of synchronization may completely negate
         | the benefits of parallelism.
        
           | spacechild1 wrote:
           | Great write up!
           | 
           | What I like about Pd is that you can freely reblock and
           | resample any subpatch. Want some section with single-sample-
           | feedback? Just put a [block~ 1]. You can also increase the
           | blocksize. Usually, this is done for upsampling and FFT
           | processing. Finally, reblocking can be nested, meaning that
           | you can reblock to 1024 samples and inside have another
           | subpatch running at 1 sample blocksize.
           | 
           | SuperCollider, on the other hand, has a fixed global
           | blocksize and samplerate, which I think is one of its biggest
           | limitations. (Needless to say, there are many things it does
           | better than Pd!)
           | 
           | ---
           | 
           | In the last few days I have been experimenting with adding
           | multi-threading support to Pd
           | (https://github.com/Spacechild1/pure-data/tree/multi-
            | threadin...). With the usual blocksize of 64 samples, you can
            | definitely observe the scheduling overhead in the CPU meter.
            | If you have a few (heavy-weight) subpatches running in
            | parallel, the overhead is negligible. But for [clone] with a
            | high number of (light-weight) copies, the overhead becomes
            | rather noticeable. In my quick tests, reblocking to 256
           | samples already reduces the overhead significantly, at the
           | cost of increased latency, of course.
           | 
           | ---
           | 
           | Also, in my plugin host for Pd/Supercollider
           | (https://git.iem.at/pd/vstplugin/) I have a multi-threading
           | and bridging/sandboxing option. If the plugin itself is
           | rather lightweight and the blocksize is small, the scheduling
            | overhead becomes quite noticeable. In Pd you can just put
           | [vstplugin~] in a subpatch + [block~]. For the SuperCollider
           | version I have added a "reblock" argument to process the
           | plugin at a higher blocksize, at the cost of increased
           | latency.
        
           | michaelrmmiller wrote:
           | To add onto this very good explanation from Paul, in
           | practice, audio processing graphs always rapidly decay into
           | serial processing after starting out massively parallel. Most
           | of the time, we are writing to a single output (i.e. speakers
           | via your audio interface or a file). On top of that, some of
           | the heaviest weight DSP happens on this final output stage
           | (mastering chains with multiband compressors, linear phase
           | equalizers, limiters etc.) So every 5.3ms (256 sample buffer
           | @ 48kHz sample rate) we start massively parallel processing
           | all the leaf nodes (audio tracks with sound files, virtual
           | instruments and synths) and end up bottlenecking as the tree
           | collapses into a line. Then we are stuck doing some of the
           | most CPU intensive work on a single core since we can't
           | process the limiter DSP plug-in until the EQ finishes its
           | work, for example.
           | 
           | We need to meet our real-time deadline or risk dropping
           | buffers and making nasty pops and clicks. That mastering
           | stage can pretty easily be the limiting (hah) step that
           | causes us to miss the deadline, even if we processed hundreds
           | of tracks in parallel moments before in less time.
           | 
           | The plug-in APIs (AudioUnits, VSTs, AAX) which are
           | responsible for all the DSP and virtual instruments are also
           | designed to process synchronously. Some plug-ins implement
           | their own threading under the hood but this can often get in
           | the way of the host application's real-time processing. On
           | top of that, because the API isn't designed to be
           | asynchronous, the host's processing thread is tied up waiting
           | for the completed result from the plug-in before it can move
           | on.
           | 
           | Add on that many DSP algorithms are time-dependent. You can't
           | chop up the sample buffer into N different parts and process
           | them independently. The result for sample i+1 depends on
           | processing sample i first.
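            | 
            | A one-pole lowpass is the classic example of that serial
            | dependence (a sketch; the coefficient is made up):
            | 
            |     // y[i] = y[i-1] + a * (x[i] - y[i-1]): each output
            |     // sample depends on the previous one, so the buffer
            |     // cannot be split across threads.
            |     fn one_pole_lowpass(x: &[f32], y: &mut [f32], a: f32) {
            |         let mut state = 0.0f32;
            |         for (out, &inp) in y.iter_mut().zip(x) {
            |             state += a * (inp - state);
            |             *out = state;
            |         }
            |     }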
        
       | cletus wrote:
        | The mantra I live by is that multi-threaded programming is so
        | potentially complex and so easy to screw up (eg by erasing all
        | your supposed performance gains, as was the case in this
        | article, by introducing subtle bugs, or (worse) by creating
        | security flaws) that you should do absolutely everything you
        | can to avoid it.
       | 
        | IME C++ programmers seem to put an unreasonable amount of faith
        | in RAII to "solve" these problems, eg:
        | 
        |     std::mutex m;
        |     ...
        |     {
        |       std::lock_guard<std::mutex> lg(m);
        |       // do stuff with the mutex locked
        |     } // lock_guard destructor releases the mutex
       | 
       | If you're just updating a (private) map or something, this is
        | fine. But as soon as this gets complex (eg you call a function
        | and lose control over what thread is executing because some
        | deep code creates a new thread) then this all turns to crap.
       | 
       | So whenever anyone tells you "just start a thread" without any
       | hesitation, it's reasonably safe to assume they have no idea what
       | they're talking about.
       | 
       | This is one reason I love the async code model used in a number
       | of languages. I have the most experience with Hack [1] but it's
       | not unique to that. What I tend to think of as "product" code
        | (which is a real Facebook-ism) has no need to ever create a
        | thread or deal with any concurrency issues.
       | 
        | Now if you're programming an HTTP server or an RDBMS or other
       | infrastructure then sure, you have a valid need. But if you're
       | just serving HTTP requests, just avoid it entirely if you can
       | because any performance gain is probably imaginary but the
       | complexity increase isn't.
       | 
       | [1]: https://docs.hhvm.com/hack/asynchronous-
       | operations/introduct...
        
         | otabdeveloper4 wrote:
         | Don't call functions (that aren't part of the C++ standard
          | library) inside the lock guard. It's not a hard rule to
          | follow, and it'll get you 99% of the way there.
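          | 
          | The hazard being avoided, sketched in Rust for brevity (std's
          | Mutex isn't re-entrant, so calling back into code that takes
          | the same lock deadlocks):
          | 
          |     use std::sync::Mutex;
          | 
          |     static M: Mutex<i32> = Mutex::new(0);
          | 
          |     fn helper() {
          |         // Deadlock: takes a lock the caller already holds.
          |         let _g = M.lock().unwrap();
          |     }
          | 
          |     fn main() {
          |         let _g = M.lock().unwrap();
          |         helper(); // never returns
          |     }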
        
         | Thiez wrote:
         | Isn't that mantra a bit too pessimistic? After fixing the
         | performance issue, a speedup that was almost linear in the
         | number of cores was achieved using multithreading.
         | 
         | In addition, async code can bring much of the same challenges
         | as multithreading. For example, in C# tasks may run on the
         | current thread, but they may also run on a thread-pool, and
         | this is mostly transparent to the programmer. When you don't
         | immediately await all asynchronous operations your tasks may
         | end up running at the same time.
        
           | cletus wrote:
           | Like anything you weigh up the upside against the downside.
           | 
           | The downside is more complex code, code that's harder to
           | reason about, the possibility of concurrency bugs, slower
           | development times, the possibility of security flaws and you
           | may just be shifting your performance bottleneck (eg creating
           | a hot mutex).
           | 
           | The upside is lower latency, more throughput and higher QPS.
           | 
           | My argument is quite simply YAGNI. A lot of these benefits
           | are for many people purely imaginary. They're the very
           | definition of premature optimization. Code that is less
           | complex, faster to develop in and likely "good enough" is
           | probably going to help you way more than that.
           | 
           | If you get to the scale where heavy multithreading actually
           | matters, that's a good problem to have. My argument is that
           | many people aren't there and I would hazard a guess that a
           | contributing factor to not getting there is worrying about
           | this level of performance before they need to.
        
         | thewarrior wrote:
         | This model is good for I/O heavy code that is typical for most
         | web services. However if your load is a mix of I/O and compute
         | something like Go should do a lot better.
        
           | cletus wrote:
           | Not sure why you're being downvoted because you raise a valid
           | point: Go has a concurrency model aimed at avoiding many of
           | these problems. This is of course the channel abstraction.
           | 
           | Go has other issues. My first piece of advice to any new Go
           | programmer is make every single one of your channels
           | unbuffered. This way it operates a lot like the async model I
           | mentioned and it tends to be easy to reason about.
           | 
           | But people fall into a trap of thinking adding a buffer will
           | improve concurrency and throughput. This might be true but
           | most of the time it's a premature optimization. Buffered
           | channels also often lead to obscure bugs and a system that's
           | simply harder to reason about.
           | 
            | It's really better to think of Go as a tool for organizing
            | concurrent code, not necessarily a direct solution to the
            | problems involved. But organization matters.
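            | 
            | (The same rendezvous behavior in Rust, for comparison: a
            | sync_channel with capacity 0 blocks each send until a
            | receiver is ready, like an unbuffered Go channel.)
            | 
            |     use std::sync::mpsc;
            |     use std::thread;
            | 
            |     fn main() {
            |         // Capacity 0: no hidden queue to reason about.
            |         let (tx, rx) = mpsc::sync_channel::<u64>(0);
            |         let producer = thread::spawn(move || {
            |             for i in 0..5 {
            |                 tx.send(i).unwrap(); // rendezvous
            |             }
            |         });
            |         for v in rx {
            |             println!("got {}", v);
            |         }
            |         producer.join().unwrap();
            |     }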
        
             | thewarrior wrote:
             | All concurrency tools are foot guns but go seems to be the
             | safest foot gun :)
        
         | FpUser wrote:
         | >"But if you're just serving HTTP requests, just avoid it
         | entirely if you can because any performance gain is probably
         | imaginary but the complexity increase isn't"
         | 
         | I serve HTTP requests offering JSON based RPC in my C++ servers
         | that do all kinds of calculations (often CPU consuming) in
         | multiple threads. Works just fine and scales well. Yes I have
          | to be careful about how I work with shared data, but it is
          | centralized, encapsulated, and done in a single place I call
          | the request router / controller. I reuse it among projects.
          | Nothing too exciting about it and no reason to be scared to
          | death ;)
        
       | amichal wrote:
        | Would this tool have scaled as well as 24 separate OS
       | processes?
       | 
        | I remember hitting this same issue many years ago. Some
        | primitive in the library (C++) didn't, on the surface, appear
        | to involve shared state and a cache write lock, but of course
        | it did if you thought about it. Since I switched over to more
        | application-level coding I don't work at that low a level very
        | much and tend to scale with processes now (typically not
        | needing that much parallelism anyway), so I don't often have
        | to think these things through anymore.
        
         | 10000truths wrote:
         | The problem outlined in the article is not with process vs
         | thread, the problem is with multiple parallel execution
         | contexts reading and writing the same cache line. This issue
         | would have manifested in the same way if your separate
         | processes touched the same offset in a shared memory mapping.
         | 
         | Of course, designing with multiple processes in mind helps to
         | design a shared-nothing architecture, as the costs of shared
         | state between processes are much more explicit. But _any_ kind
         | of state that needs to be shared will run into the same
         | bottleneck, because all forms of shared IPC fundamentally have
         | to use some form of locking somewhere in the stack to serialize
         | incoming inputs.
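          | 
          | False sharing itself is easy to reproduce (a sketch; the
          | 64-byte line size is an assumption about the target CPU):
          | 
          |     use std::sync::atomic::{AtomicU64, Ordering};
          |     use std::thread;
          |     use std::time::{Duration, Instant};
          | 
          |     #[repr(align(64))]
          |     struct Padded(AtomicU64);
          | 
          |     // Adjacent, so almost certainly on one cache line:
          |     static SHARED: [AtomicU64; 2] =
          |         [AtomicU64::new(0), AtomicU64::new(0)];
          |     // Each aligned to its own 64-byte line:
          |     static PADDED: [Padded; 2] =
          |         [Padded(AtomicU64::new(0)), Padded(AtomicU64::new(0))];
          | 
          |     fn bump(slot: &AtomicU64) {
          |         for _ in 0..10_000_000 {
          |             slot.fetch_add(1, Ordering::Relaxed);
          |         }
          |     }
          | 
          |     fn race(a: &'static AtomicU64,
          |             b: &'static AtomicU64) -> Duration {
          |         let start = Instant::now();
          |         let h1 = thread::spawn(move || bump(a));
          |         let h2 = thread::spawn(move || bump(b));
          |         h1.join().unwrap();
          |         h2.join().unwrap();
          |         start.elapsed()
          |     }
          | 
          |     fn main() {
          |         // No datum is logically shared in either case; only
          |         // the first pair ping-pongs a line between cores.
          |         println!("same line: {:?}",
          |                  race(&SHARED[0], &SHARED[1]));
          |         println!("own lines: {:?}",
          |                  race(&PADDED[0].0, &PADDED[1].0));
          |     }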
        
           | lmeyerov wrote:
           | I think the intuition on processes vs threads was right:
           | 
           | 1. The Arc use is a case (I believe! Am not a Rust'er!) of
            | true sharing of references. The cacheline will be contended
           | across cores. Doing processes would have eliminated true
           | sharing. Tracking logical core affinity for references via a
           | modal type system could have flagged this at compile time,
           | but I suspect that's not a thing in Rust land.
           | 
           | 2. Even if it was false sharing at the cacheline level,
           | process abstractions in shared memory systems are generally
           | aligned to avoid false sharing at cacheline granularity (and
           | misaligned ops at smaller granularities). Of course, in the
           | extreme, a 10X hero programmer may have tried to do amazing
           | levels of bit packing and syscall hackery and thus break all
            | those niceties, but that's the cleanup everyone else on the
            | team has to do to make the hero a 10Xer.
        
         | loeg wrote:
         | Probably yes, unless the author went to great lengths to
         | contend on shared memory segments.
        
       | wabain wrote:
       | This is an excellent writeup, but I'm curious about how dependent
       | the microbenchmark is on the scripting engine doing no real work.
       | My off-the-cuff guess is that the unshared version might still
       | perform better with other workloads (it would have higher
       | throughput and avoid variance caused by fluctuating degrees of
       | cache contention) but it might be that the observable overhead
       | from hammering a single cache line would disappear if each thread
       | had more varied work to do. And of course if the scripting engine
       | benefits substantially from caching internal state across
       | invocations (things like incrementally optimized JIT code) then
       | having 24 instances can incur substantial extra work.
        
       | dathinab wrote:
       | Has anyone any idea where the `` format strings come from??
       | 
       | As far as I know rust doesn't have them.
       | 
        | And as far as I know, if rust gets format strings it would most
        | likely get them in the form of an `f` prefix (like raw strings
        | have an `r` prefix and binary strings have a `b` prefix).
       | 
       | Is it just a thing to make the code more readable/shorter on the
       | web?
        
         | metabagel wrote:
         | Markdown uses backticks around code.
         | 
         | https://docs.github.com/en/github/writing-on-github/getting-...
        
         | cdirkx wrote:
         | The code shown in the article isn't Rust, but Rune [0], which
         | has similar syntax, but is a dynamic scripting language. It has
         | template literals [1].
         | 
         | [0] https://rune-rs.github.io/
         | 
         | [1] https://rune-rs.github.io/book/template_literals.html
        
           | dathinab wrote:
           | Oh, that makes sense.
        
         | michalhosna wrote:
         | If you are talking about "Here is an example of a complete
         | workload that measures performance of selecting rows by random
         | keys", then you are looking at Rune code rather than Rust code.
         | 
         | See https://rune-rs.github.io/book/template_literals.html
        
       | johnisgood wrote:
       | Can anyone TL;DR it? A single line, in Rust, can make a 24-core
       | server slower than a laptop? What line is that?
        
         | jkelleyrtp wrote:
         | In a multithreaded context, shared data was shared via atomic
         | reference counting. The atomic operation acted as a single
         | bottleneck for every operation. The solution was to derive deep
         | cloning for the shared data (a 1 line fix) to reduce atomic
         | contention.
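          | 
          | The shape of the fix, roughly (a sketch, not the article's
          | actual code):
          | 
          |     use std::thread;
          | 
          |     fn main() {
          |         let data = vec![1u64; 1 << 20];
          |         let handles: Vec<_> = (0..24)
          |             .map(|_| {
          |                 // Deep-clone per thread instead of cloning an
          |                 // Arc: each thread owns its copy, so there is
          |                 // no shared refcount line to fight over.
          |                 let local = data.clone();
          |                 thread::spawn(move || local.iter().sum::<u64>())
          |             })
          |             .collect();
          |         let total: u64 = handles
          |             .into_iter()
          |             .map(|h| h.join().unwrap())
          |             .sum();
          |         println!("{}", total);
          |     }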
        
       | PhileinSophia wrote:
        
       | vlovich123 wrote:
       | Is there any tooling that can demonstrate cache coherency
       | hotspots like cache line bouncing or inter-core communication?
        
         | mhh__ wrote:
         | All of the "serious" profilers support this. Intel have a bunch
         | of tooling for diagnosing threading issues, AMD also have some
         | less polished but still functional metrics you can use.
        
           | vlovich123 wrote:
           | Do they track it down to the actual problem point in the
           | code? Or do you just see that maybe there's a problem and
            | then experiment with changes to figure out what needs to
           | be fixed?
        
             | mhh__ wrote:
             | Short answer: yes. Much better than perf does.
        
         | maxwell86 wrote:
          | Yes. Linux perf has a c2c "mode"/"option" whose hit-modify
          | (HITM) stats will report precisely this issue.
        
           | loeg wrote:
           | What is c2c? Thanks.
        
             | tkhattra wrote:
             | c2c stands for cache-2-cache.
             | 
             | perf-c2c allows you to measure cacheline contention - see
             | https://man7.org/linux/man-pages/man1/perf-c2c.1.html and
             | https://joemario.github.io/blog/2016/09/01/c2c-blog/.
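              | 
              | Typical usage (a sketch; flags vary by kernel version):
              | 
              |     perf c2c record -- ./app
              |     perf c2c report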
        
             | eptcyka wrote:
             | Cache to cents? Kernels 2 cents on how your code is
             | utilizing the cache? I have 0 clue too.
        
               | masklinn wrote:
               | Cache to Cache,
               | 
                | It tracks a bunch of cache-related counters, and HITM
                | (hit modify) tracks loads of data modified by either
                | another core of the same node ("local hitm") or by
                | another node entirely ("remote hitm").
                | 
                | Here, the contention would have shown up as extreme
                | amounts of both, as each core would almost certainly try
                | atomically updating a value which had just been updated
                | by another core, with higher than even chances the value
                | had been updated by a core on the other socket (= node).
        
         | gjulianm wrote:
         | I've used the Intel VTune Profiler, I think it's free. It's a
         | pretty good tool for profiling code and gives you quite a lot
         | of information on how you're using the core pipelines, but you
         | need an Intel processor to use it. I think it would have
         | quickly flagged this kind of problem and told you that the
          | processors were stalled waiting for cache.
        
           | mhh__ wrote:
            | VTune is free now. It's by far the best profiler I've
            | ever come across; AMD's equivalent is a bit rough and ready
            | by comparison.
        
       | game_the0ry wrote:
       | Relevant video - Scott Meyers: Cpu Caches and Why You Care
       | 
       | https://www.youtube.com/watch?v=WDIkqP4JbkE
       | 
       | Remember, boys and girls, small mistakes can have big
       | consequences.
       | 
        | Keep your data in immutable data structures when multi-
        | threading. The actor model also helps.
        
       | teen wrote:
       | That code? Thread.sleep(5000)
        
       | errcorrectcode wrote:
       | My laptop (i7-8650U) has slightly faster single-core performance
       | than my 96-thread dual 7402 EPYC Milan home server. I'm fine with
       | that. I'm not fine with undue inefficiencies, i.e., excessive
       | cache misses, branch mis-predictions, and unparallelization due
       | to sloppy or unfortunate code.
        
       | kentonv wrote:
       | An aside:
       | 
       | > L3 cache is shared by all cores of a CPU.
       | 
       | I recently learned this is no longer true! AMD EPYC processors
       | (and maybe other recent AMDs?) divide cores into groups called
       | "core complexes" (CCX), each of which has a separate L3 cache. My
       | particular processor is 32-core where each set of 4 cores is a
       | CCX. I discovered this when trying to figure out why a benchmark
       | was performing wildly differently from one run to the next with a
       | bimodal distribution -- it turned out to depend on whether Linux
        | had scheduled the two processes in my benchmark to run on the
       | same CCX vs. different CCXs.
       | 
       | https://en.wikipedia.org/wiki/Epyc shows the "core config" of
       | each model, which is (number of CCX) x (cores per CCX).
        
         | [deleted]
        
         | Ataraxic wrote:
         | Just as information for some later readers, but this is
         | applicable not just to Epyc but also the consumer version
          | Threadripper. My understanding is that this is an example of
          | Non-Uniform Memory Access (NUMA), which has also been used to
          | link multiple CPUs in different sockets together for a long
          | time, but now they've been integrated into a CPU that fits in
          | one socket.
         | 
          | This actually matters if you are running a VM on such
          | a system, since you'll run into things like the actual RAM
          | (not L3 cache) being directly linked to a particular NUMA
          | node. For example, accessing memory in the first RAM stick vs
          | the second will give different latencies, as it goes from
          | ccx1 => ccx2 => stick2 versus ccx1 => stick1. I think this
          | applies to the 2XXX versions and earlier for Threadripper. My
         | understanding is that they solved this in later versions using
         | the infinity fabric (IO die) so now all ccx's go through the IO
         | die.
         | 
         | I ran into all of this trying to run an ubuntu machine that ran
         | windows using KVM while passing through my nvidia graphics
         | card.
        
           | reilly3000 wrote:
           | This sheds some light on why I had so much trouble running
            | GPU passthrough with Unraid (KVM) on my 1950X w/ Nvidia. Those
           | were dark days.
        
             | rob_c wrote:
             | Frankly that was probably 200% unraid ...
        
           | masklinn wrote:
           | > Just as information for some later readers, but this is
           | applicable not just to Epyc but also the consumer version
           | Threadripper.
           | 
           | It's applicable to any Zen design with more than one CCX,
           | which is... any Zen 3 CPU of more than 8 cores (in Zen 2 it
           | was 4).
           | 
           | The wiki has the explanation under the "Core config" entry of
           | the Zen CPUs, but for the 5000s it's all the 59xx (12 and 16
           | cores).
           | 
            | Zen 3 APUs are all single-CCX, though there are Zen 2 parts in
           | the 5000 range which are multi-CCX (because why not confuse
           | people): the 6-cores 5500U is a 2x3 and the 8-core 5700U is a
           | 2x4.
           | 
           | The rest is either low-core enough to be single-CCX Zen 2
            | (5300U) or topping out at a single 8-core CCX (everything
           | else).
        
           | dagmx wrote:
            | This applies to Ryzen as well. The first-gen Ryzen was
           | plagued by software and schedulers that couldn't handle the
           | ccx affinity
        
         | 123pie123 wrote:
          | this is important when sizing and tuning VMs / hypervisors
        
         | masklinn wrote:
         | > I recently learned this is no longer true! AMD EPYC
         | processors (and maybe other recent AMDs?) divide cores into
         | groups called "core complexes" (CCX), each of which has a
         | separate L3 cache.
         | 
         | It's still kinda true, just that "a CPU" in that context is a
          | CCX. Cross-CCX communication has shown up a fair bit in
         | reviews and benches, and really in all chips at that scale
         | (e.g. Ampere's Altras and Intel's Xeons):
         | https://www.anandtech.com/show/16529/amd-epyc-milan-review/4
         | and one of the "improvements" in Zen 3 is the CCX are much
         | larger (there's one CCX per CCD, rather than 2) so there's less
         | crosstalk.
         | 
          | And it could already be untrue previously, e.g. the Pentium D
          | was rushed out by sticking two P4 dies on the same PCB; I
         | think they even had to go through the northbridge to
         | communicate, so they were dual socket in all but physical
         | conformation (hence being absolute turds). I don't think they
         | had L3 at all though, so that wasn't really a factor, but
         | still...
        
           | tentacleuno wrote:
            | > And it could already be untrue previously, e.g. the
            | Pentium D was rushed out by sticking two P4 dies on the same
            | PCB; I think they even had to go through the northbridge to
            | communicate
           | 
           | Part of me hopes you're wrong here. That is absolutely
           | absurd.
        
             | masklinn wrote:
             | Yeah you'd hope so wouldn't you?
             | 
             | But Intel got caught completely unaware by the switch to
             | multi-core, just as it had by the 64b switch.
             | 
             | The eventual Core 2 was not ready yet (Intel even had to
             | bridge to it with the intermediate Core, which really had
             | more to do with the Pentium M than with the Core 2 though
             | it did feature a proper dual-core design... so much so that
             | the Core solo was actually a binned Duo with one core
             | disabled).
             | 
              | So anyway Intel was caught with its pants around its ankles
             | for the second time and they couldn't let that happen. And
             | they actually beat AMD to market, having turned out a
             | working dual-core design in the time between AMD's
             | announcement of the dual-core opteron (and strong hints of
             | x2) and the actual release, about 8 months.
             | 
              | To manage that, Intel could not re-architect their chip
              | (and probably didn't want to, as it'd become clear NetBurst
             | was a dead-end), so they stapled two Prescott cores
             | together, FSB included, and connected both to the
             | northbridge.
             | 
             | It probably took more time to validate that solution for
              | the server market, which is why, while AMD released the
              | dual-core Opterons in April and Athlons in May, it took until
             | October for the first dual-core Xeon to be available.
        
             | danudey wrote:
             | Here's the wikipedia article:
             | 
             | https://en.wikipedia.org/wiki/Pentium_D
             | 
             | "The Pentium D brand refers to two series of desktop dual-
             | core 64-bit x86-64 microprocessors with the NetBurst
             | microarchitecture, which is the dual-core variant of
             | Pentium 4 "Prescott" manufactured by Intel. Each CPU
             | comprised two dies, each containing a single core, residing
             | next to each other on a multi-chip module package."
        
           | IronWolve wrote:
            | Yup, and that's why overclocking on AMD tests each of your
            | CCXs/cores, marking golden cores that can reach 5GHz+ and
            | slower cores that don't overclock. Also why AMD tests each
            | CCX, and moves worse ones to lower-end products and better
            | ones to the 5950x.
            | 
            | Then you can OC-tweak each core and try to milk out
            | performance; PBO tends to be pretty good at detecting which
            | cores can take a slight performance boost. With a few BIOS
            | options, all that adds up without much effort.
           | 
            | And then finally get a better (advanced) scheduler in
            | Windows/Linux that can move workloads around per core
            | depending on workload. Windows released multiple fixes for
            | the scheduler on AMD starting with Win10 1903.
            | 
            | I find scheduler mods and modest BIOS tweaks increase
            | performance without much effort - a very detectable gain.
            | 
            | On Linux I use xanmod (and pf-kernel before that).
            | 
            | On Windows I use Process Lasso and AMD PBO.
            | 
            | Slow/fast cores and scheduler tweaks are how Win11, ARM
            | Macs, and Android make things appear faster too.
            | 
            | Amusingly, scheduler/core tweaks have been around on Linux
            | for a decade+, making the desktop super smooth, but they're
            | now mainstream in Win11 and ARM OS X.
        
             | Andys wrote:
             | does xanmod work as a drop-in replacement in Ubuntu with an
             | NVIDIA card?
        
           | sroussey wrote:
           | There are core complexes in Apple's M1 variations as well.
           | 
          | And it's not just L3 that can be shared by various chips.
           | 
           | Each complex also has its own memory bandwidth, so running on
          | two cores in two complexes will get around twice the memory
           | bandwidth of two cores in the same complex.
        
         | lend000 wrote:
         | I've always wondered how much this affects shared cloud
         | instances (for example, a single c5.small instance). Can your
         | neighbor, who shares your L3 cache, be using memory so much
         | more aggressively than you that it causes you to evict your
         | L2/L1 cache? Or is cache coherency maintained if the core that
         | evicted your L3 cache is known to be living in a different
         | memory space?
        
           | aspaceman wrote:
           | Coherency will be maintained (since the protocols support
           | that case). But yes, a separate process would evict that
           | memory. From the processors point of view, they're just
            | addresses tagged with data. Caching behavior doesn't depend on
           | virtual memory addresses because I believe that info is
           | stripped away at that point.
           | 
           | So if someone is thrashing cache on the same core you're on,
           | you will notice it if the processes aren't being shared
           | effectively.
           | 
           | The contents of the cache aren't stored as part of a paused
           | process or context switch. But I'd appreciate a correction
           | here if I'm wrong.
           | 
           | For an example, consider two processes A and B running on a
           | set of cores. If A makes many more memory accesses than B, A
           | can effectively starve B of "cached" memory accesses because
           | A accesses memory more frequently.
           | 
            | If B were run alone, then its working set would fit in
           | cache. Effectively making the algorithm operate from cache
           | instead of RAM.
           | 
           | BUT. You really have to be hitting the caches hard. Doesn't
           | happen too often in casual applications. I only encountered
            | this on GPUs (where each core has separate L1 but a shared
            | L2). Even then it's only a problem if every core is hitting
           | different cache lines.
        
             | aspaceman wrote:
             | Found some detail - apparently the CPU manuals are the
             | place to check https://cs.stackexchange.com/a/1090
        
         | pdpi wrote:
         | Out of curiosity -- was the "bad" mode a case of two separate
         | workloads competing for limited cache within one CCX, or was it
         | really bad cache behaviour across CCXs causing problems because
         | the two processes shared memory?
        
           | kentonv wrote:
           | The two processes were a client and a server, the client just
           | sends HTTP requests to the server over a unix socket, but
           | otherwise they don't share resources. TBH I'm surprised there
           | was so much impact just from that. There might be something
           | more to the story -- I haven't investigated much yet. But
           | basically when I used `taskset` to limit the benchmark to two
            | cores in different CCXs, it ran about half as fast as when
            | limited to two cores in the same CCX. I guess another
           | possibility is that the kernel is allowing both processes to
           | swap between cores regularly, which would much better explain
           | the disastrous performance penalty. But, in that case I don't
           | understand the bimodal distribution I see when running the
           | benchmark with no core affinity...
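            | 
            | (Illustrative commands, not my exact invocation - which
            | core IDs share a CCX is machine-specific; `lscpu -e` shows
            | the cache mapping:)
            | 
            |     taskset -c 0,1 ./benchmark   # two cores, same CCX
            |     taskset -c 0,4 ./benchmark   # two cores, different CCXs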
        
         | hinkley wrote:
         | This kind of [stuff] drives me nuts every time I'm trying to
         | read the performance telemetry tea leaves at work.
         | 
         | Do we have requests that take longer, or did Linux do something
         | dumb with thread affinity?
        
           | dahfizz wrote:
           | Pro tip: if you care about performance, and especially if you
           | care about reliable performance measurements, you should pin
            | and accelerate your threads. Using isolcpus is the next
            | step.
           | 
           | In other words: if your application is at the mercy of Linux
           | making bad decisions about what threads run where, that is a
           | performance bug in your app.
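            | 
            | e.g. in Rust, a sketch using the third-party core_affinity
            | crate (pair with isolcpus for stable numbers):
            | 
            |     use std::thread;
            | 
            |     fn main() {
            |         // One worker per core, pinned so the scheduler
            |         // cannot migrate it mid-measurement.
            |         let cores = core_affinity::get_core_ids().unwrap();
            |         let handles: Vec<_> = cores
            |             .into_iter()
            |             .map(|core| {
            |                 thread::spawn(move || {
            |                     core_affinity::set_for_current(core);
            |                     // ... worker loop ...
            |                 })
            |             })
            |             .collect();
            |         for h in handles {
            |             h.join().unwrap();
            |         }
            |     }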
        
             | spockz wrote:
             | How would you do this when running in kubernetes? Afaik it
             | just guarantees that you get scheduled on some cpu for the
             | "right" amount of time. Not on which cpu that is.
             | 
             | I only know that there is one scheduler that gives you
             | dedicated cores if you set the request and limit both to
             | equal multiples of 1000.
        
               | spiddy wrote:
               | On post-docker pre-kubernetes time I used `--cpuset-cpus`
               | on docker args to dedicate specific cpus to redis
               | instances, using CoreOS and fleet for cluster
               | orchestration.
        
               | ncmncm wrote:
               | Another reason to put off moving to kubernetes.
        
           | masklinn wrote:
           | > Do we have requests that take longer, or did Linux do
           | something dumb with thread affinity?
           | 
           | Yes.
           | 
           | Cross-socket communications do take longer, but a properly
           | configured NUMA-aware OS should probably have segregated
           | threads of the same process to the same socket, so the
           | performance should have increased linearly from 1 to 12
           | threads, then fallen off a cliff as the cross-socket effect
            | started blowing up performance.
        
         | klysm wrote:
         | Do you use numa nodes to tell the kernel about this or
         | something else?
        
           | wmf wrote:
           | You can enable node-per-CCX mode in UEFI to do that.
           | Otherwise you have to use tools like hwloc to discover the
           | cache topology and pin workloads accordingly.
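              | 
              | e.g. (lstopo ships with hwloc; lscpu is util-linux):
              | 
              |     lstopo-no-graphics    # package/CCX/L3 topology tree
              |     lscpu -e              # per-CPU L1d:L1i:L2:L3 mapping
              |     taskset -c 0-3 ./app  # then pin to one L3's cores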
        
             | rob_c wrote:
             | Underrated and useful reply. Thank you!
        
         | buildbot wrote:
          | L3 being shared is only common on smaller core counts really;
          | on any large Intel design the L3 isn't all in one place, it's
          | broken up across the cores.
        
           | wmf wrote:
           | That's still a single logical L3 (e.g. a single core could
           | use the entire L3 capacity).
        
       | cheschire wrote:
       | There are only two hard things in Computer Science: cache
       | invalidation and naming things. - Phil Karlton
        
         | airstrike wrote:
         | There are actually only two hard problems in computer science:
         | 
         | 0) Cache invalidation
         | 
         | 1) Naming things
         | 
         | 5) Asynchronous callbacks
         | 
         | 2) Off-by-one errors
         | 
         | 3) Scope creep
         | 
         | 6) Bounds checking
        
           | c0balt wrote:
            | You forgot text encodings, endianness, ...
        
           | hinkley wrote:
           | Is 4 uninitialized variables?
        
           | Scarblac wrote:
           | There is only one hard problem in computer science: _people_.
        
           | [deleted]
        
           | nicoburns wrote:
           | Most languages will do bounds checking for you. I reckon that
           | one's mostly solved at this point.
        
             | chris_st wrote:
             | But... but... but... we HAVE to use C, it's the only
             | language that's fast enough!
             | 
             | /s
        
             | andi999 wrote:
             | Too slow.
        
         | hnaccount_rng wrote:
         | You missed the off-by-one errors
        
           | cheschire wrote:
           | So you mean I was off by one?
        
         | Pet_Ant wrote:
         | Better version:
         | 
         | There are only two hard things in Computer Science: cache
         | invalidation, naming things, and off-by-one errors.
        
           | MR4D wrote:
           | Better better version:
           | 
           | There are only two hard things in Computer Science: cache
           | invalidation, and...what did you ask me again?
        
           | jahnu wrote:
           | There are only two hard things in Computer Science: cache
           | invalidation, naming things, and off-by-oly two hard things
           | in Computer Scivetem dy gjera te veshtira ne Shkencen
           | Kompjute
        
             | hermitdev wrote:
             | Now, just toss in a byte order mark, and I think we've a
             | winner!
        
             | jacquesm wrote:
             | Hehe. Ok. That is very funny :)
        
               | gavinray wrote:
               | I don't get it -- am I slow?
        
               | [deleted]
        
               | jacquesm wrote:
               | Buffer overflow :)
        
               | vaidhy wrote:
               | This used to happen in C for me. Basically, you are
               | trying to print a string which has overflowed the
                | allocated length and you get junk printed after the string.
               | C-Strings are null terminated and the system will keep
               | printing till it encounters a null. So, bounds checking
               | and buffer overflows are also hard problems.
        
             | colejohnson66 wrote:
             | Gotta watch out for those nulls :)
        
               | silon42 wrote:
               | Nulls are easy compared to invalid non-null pointers.
        
               | ludamad wrote:
               | People say null in Java was a mistake, but I've never had
               | any issues in my null-only dialect of Java /s
        
           | kibwen wrote:
           | There are only two hard problems in Computer Science: cache
           | invalidation, naming things, off-by-one errors, and
           | remembering to update the documentation.
        
       | [deleted]
        
       | physicsgraph wrote:
       | This article is a well-written deep dive on parallel performance
       | debugging that ends up tracing the slowdown to how caching is
       | handled on multi-CPU vs single-CPU systems. Thanks to the author
       | for taking the time to write up their troubleshooting process.
        
       | pengaru wrote:
       | TL;DR: serializing your parallel work dispatch on an atomic
       | global counter is bad for scaling across many cores.
       | 
       | Or, per-cpu counters exist for a reason.
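       | 
       | Roughly (a toy sketch of the per-thread-counter idea -- mine,
       | not from the article): bump a thread-local count on the hot
       | path and fold it into one shared atomic only occasionally.
       | 
       |     use std::cell::Cell;
       |     use std::sync::atomic::{AtomicU64, Ordering};
       |     use std::thread;
       |     
       |     // One shared total, touched rarely.
       |     static TOTAL: AtomicU64 = AtomicU64::new(0);
       |     
       |     thread_local! {
       |         // Per-thread counter: increments stay in core-local
       |         // cache, so there is no cache-line bouncing.
       |         static LOCAL: Cell<u64> = Cell::new(0);
       |     }
       |     
       |     fn count_event() {
       |         LOCAL.with(|c| c.set(c.get() + 1));
       |     }
       |     
       |     fn flush() {
       |         // Fold the local count into the shared total once,
       |         // instead of contending on every event.
       |         LOCAL.with(|c| {
       |             TOTAL.fetch_add(c.replace(0), Ordering::Relaxed);
       |         });
       |     }
       |     
       |     fn main() {
       |         let handles: Vec<_> = (0..4)
       |             .map(|_| thread::spawn(|| {
       |                 for _ in 0..1_000_000 { count_event(); }
       |                 flush();
       |             }))
       |             .collect();
       |         for h in handles { h.join().unwrap(); }
       |         assert_eq!(TOTAL.load(Ordering::Relaxed), 4_000_000);
       |     }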
        
       | dan-robertson wrote:
       | It seems to me that one solution would be to be able to shard an
       | Arc into separate counts for each thread, so that the
       | synchronisation wouldn't be required (?) but I don't know how it
       | would be done without making most Arcs unnecessarily expensive.
       | 
       | Perhaps another problem is that libraries can't be (or just
       | aren't) generic about the kind of smart-pointers that are used
       | but the kind of smart-pointers used can matter a lot (e.g.
       | libraries often have to use Arc for some of their clients even
       | though Rc would often suffice).
       | 
       | But maybe I'm just totally wrong about this?
        
         | Kranar wrote:
         | It's a great idea and in fact a variation of it is being
         | explored by Python to eliminate the use of the GIL without
         | introducing performance regressions.
         | 
         | The variation is instead of one count per thread, have two
         | reference counts where one is atomic and the other is not
         | atomic. The heuristic is that at any given time, the majority
         | of writes are performed within a single thread and so that
         | thread has exclusive read/write access to the non-atomic
         | reference count, all other threads share the atomic reference
         | count. There is some work needed to coordinate the atomic and
         | non-atomic reference count so as to test when an object can be
         | deleted, but the overhead of this work is less than the cost of
         | always using an atomic.
         | 
         | http://iacoma.cs.uiuc.edu/iacoma-papers/pact18.pdf
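         | 
         | A toy sketch of that two-counter bookkeeping (names and
         | details are mine, not the paper's; the real algorithm merges
         | the counts far more carefully):
         | 
         |     use std::cell::Cell;
         |     use std::sync::atomic::{AtomicIsize, Ordering};
         |     use std::thread::{self, ThreadId};
         |     
         |     // The owning thread bumps a plain Cell (no bus
         |     // traffic); all other threads share one atomic.
         |     struct BiasedCount {
         |         owner: ThreadId,
         |         biased: Cell<isize>,  // touched only by `owner`
         |         shared: AtomicIsize,  // touched by everyone else
         |     }
         |     
         |     // SAFETY: a real implementation must guarantee that
         |     // `biased` is only ever accessed from `owner`; this
         |     // sketch just illustrates the bookkeeping.
         |     unsafe impl Sync for BiasedCount {}
         |     
         |     impl BiasedCount {
         |         fn new() -> Self {
         |             BiasedCount {
         |                 owner: thread::current().id(),
         |                 biased: Cell::new(1),
         |                 shared: AtomicIsize::new(0),
         |             }
         |         }
         |     
         |         fn incref(&self) {
         |             if thread::current().id() == self.owner {
         |                 self.biased.set(self.biased.get() + 1);
         |             } else {
         |                 self.shared.fetch_add(1, Ordering::Relaxed);
         |             }
         |         }
         |     
         |         // True when the count may have reached zero.
         |         fn decref(&self) -> bool {
         |             if thread::current().id() == self.owner {
         |                 self.biased.set(self.biased.get() - 1);
         |                 self.biased.get() == 0
         |                     && self.shared.load(Ordering::Acquire) == 0
         |             } else {
         |                 // Real scheme: hand off to the owner for
         |                 // the final merge instead of deciding here.
         |                 self.shared.fetch_sub(1, Ordering::AcqRel) == 1
         |             }
         |         }
         |     }
         |     
         |     fn main() {
         |         let rc = BiasedCount::new();
         |         rc.incref();           // owner path: non-atomic
         |         assert!(!rc.decref());
         |         assert!(rc.decref());  // combined count hits zero
         |     }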
        
           | dan-robertson wrote:
           | In the case of the original article, the reference count was
           | being updated by all the threads so having an Arc-with-Rc-
           | for-one-thread wouldn't actually change things, I think. But
           | maybe you just mean for 'cheaper smart-pointers in general'?
           | 
           | The python thing is interesting. I thought there were other
           | reasons for the GIL than ref counts but I guess everything
           | must be addressed.
        
             | Kranar wrote:
              | Yep, in this case it might not have helped, but all this
              | is to say that some exploration of having multiple
              | reference counts is a great idea. The balance becomes the overhead of
             | coordinating multiple reference counts versus the gains
             | from reduced atomic operations.
        
       | Pxtl wrote:
       | That actually makes me appreciate that the single-threadedness
       | of Python might be a good thing, since Python uses refcounting.
        
       | reilly3000 wrote:
       | FWIW this is the kind of content I originally came to HN for and
       | what keeps me coming back.
        
       | chrisaycock wrote:
       | The culprit was a shared _atomic reference counter_ (Arc). The
       | cache invalidation across cores eliminated any kind of scaling
       | via multiple threads.
       | 
       | The solution is to stop sharing. The author had to make changes
       | specific to this codebase for each thread to have its own copy of
       | the offending dataset.
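       | 
       | The shape of the fix, as a toy contrast (not the author's
       | actual code):
       | 
       |     use std::sync::Arc;
       |     use std::thread;
       |     
       |     // Hot loop cloning a shared Arc each iteration: every
       |     // clone/drop is an atomic RMW on one refcount cache
       |     // line, which all cores fight over.
       |     fn hot_shared(data: &Arc<Vec<u64>>, iters: usize) -> u64 {
       |         let mut acc = 0;
       |         for _ in 0..iters {
       |             let h = Arc::clone(data); // inc + later dec
       |             acc += h[0];
       |         }
       |         acc
       |     }
       |     
       |     // Same work on a per-thread copy: plain reads, so the
       |     // cache line stays cleanly shared across cores.
       |     fn hot_owned(data: &[u64], iters: usize) -> u64 {
       |         let mut acc = 0;
       |         for _ in 0..iters {
       |             acc += data[0];
       |         }
       |         acc
       |     }
       |     
       |     fn main() {
       |         let shared = Arc::new(vec![1u64; 1024]);
       |         let handles: Vec<_> = (0..8)
       |             .map(|_| {
       |                 let shared = Arc::clone(&shared);
       |                 let owned = shared.as_ref().clone(); // one copy
       |                 thread::spawn(move || {
       |                     (hot_shared(&shared, 1_000_000),
       |                      hot_owned(&owned, 1_000_000))
       |                 })
       |             })
       |             .collect();
       |         for h in handles {
       |             let (a, b) = h.join().unwrap();
       |             assert_eq!(a, b);
       |         }
       |     }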
        
         | denysvitali wrote:
         | Awesome TL;DR! I read the full article but this summarizes it
         | well. The article is worth a full read anyway.
        
         | divs1210 wrote:
         | Immutable collections save the day!
        
           | Thiez wrote:
           | Immutable collections wouldn't really help here since Rust
           | does not have a garbage collector, so they would still be
           | wrapped in an `Arc<T>`, or use one internally. Unless you're
           | giving each thread their own copy of the immutable
           | collection, in which case the immutability doesn't really add
           | much.
        
             | ericbarrett wrote:
             | You're right if they're wrapped in an Arc<T>. But it's a
             | lot easier to confidently clone an immutable data structure
             | to threads/async functions because you know they won't be
             | mutated by other cores. If you're cloning mutable data,
             | you'll have to collect each "shard" at the end of whatever
             | runtime loop your program has, and merge them back into the
             | authoritative model. (Or drop the changes, but then why
             | make the data mutable?) Gets hairy.
        
         | amelius wrote:
         | Or use a language with a good concurrent garbage collector.
        
           | dagmx wrote:
            | What language wouldn't hit this issue? The second you have an
            | atomic write across multiple NUMA clusters you'll hit this.
           | 
           | Edit: it appears they're talking specifically about the ref
           | counting whereas I was considering the entirety of a shared
           | context. That clarification helps me understand where their
           | statement was coming from
        
             | amelius wrote:
             | The problem was caused by a memory management primitive.
             | This wouldn't be an issue in a runtime with a GC designed
             | for concurrent loads.
        
               | lowbloodsugar wrote:
               | Wood for trees. The problem was caused by an ill-thought-
               | out design. We can do similarly performance degrading
               | things in GC languages, just the details will be
               | different. However, at some extreme, which, to be fair,
               | most systems won't hit, GC languages perform vastly worse
               | than non GC languages. In one service I own, Java's G1GC
               | uses 20x more CPU than Rust for an application specific
               | benchmark. Most of this time is spent in the concurrent
               | phase so Shenandoah and GenShen aren't going to make a
                | dent (and we can't afford the RAM for Shenandoah). 20x
               | CPU and 2x wall clock. The question we are looking at is,
               | "Should we just continue to spend 20x $ on operating
               | costs for the Java version to avoid writing it in Rust?"
        
               | dagmx wrote:
                | How would the GC avoid the atomic lock and cache
                | invalidation across NUMA boundaries? What language has
                | a sufficiently lockless, rw-capable GC?
               | 
               | Edit: to clarify I was thinking about a mutable context
               | share as shown in the code example, not solely about the
               | ref counting.
        
               | yencabulator wrote:
               | If the only purpose of the Arc was to be a reference
               | count, a language with a (non-refcount) GC wouldn't need
               | that atomic in the first place.
        
               | dagmx wrote:
               | That's fair. It read to me, from his post, that Rune was
               | using it for context sharing as well between threads
               | (since that's the source of the highest level Arc in his
                | code example). If it's only for ref counts then it makes
               | sense that a concurrent GC could avoid the issue
        
               | Const-me wrote:
               | > How would the GC avoid the atomic lock and cache
                | invalidation across NUMA boundaries?
               | 
                | By not using reference counting. State-of-the-art GCs
                | don't count references. They usually do mark and
                | sweep, implement multiple generations, and/or do a
                | few other things.
               | 
               | Most of that overhead only happens while collecting.
               | Merely referencing an object from another thread doesn't
               | modify any shared cache lines.
               | 
                | > What language has a sufficiently lockless,
                | rw-capable GC?
               | 
               | Java, C#, F#, Golang.
        
               | dagmx wrote:
               | Yeah I think my confusion was that I was thinking about a
               | fully mutable context shared across threads based on the
               | code example in the post.
               | 
               | But if it's just for the ref counting part of the Arc
               | then I can see how a GC would solve it by not needing the
               | RC
        
               | Const-me wrote:
               | > I was thinking about a fully mutable context shared
               | across threads
               | 
               | A quote from the article: "No locks, no mutexes, no
               | syscalls, no shared mutable data here. There are some
               | read-only structures context and unit shared behind an
               | Arc, but read-only sharing shouldn't be a problem." As
               | you see, the data shared across threads was immutable.
               | 
                | However, the library they picked was designed around
                | Rust's ref-counting Arc<> smart pointers. Apparently
                | for some other use cases, not needed by the OP, that
                | library needs to modify these objects.
               | 
               | > I can see how a GC would solve it by not needing the RC
               | 
               | Interestingly enough, C++ would also solve that. The
               | language does not stop programmers from changing things
               | from multiple threads concurrently. For this reason, very
               | few libraries have their public APIs designed around
               | std::shared_ptr<> (C++ equivalent of Rust's Arc<>).
               | Instead, what usually happens, library authors write in
               | the documentation things like "the object you pass to
               | this API must be thread safe" and "it's your
               | responsibility to make sure the pointer you pass stays
                | alive for as long as you're using the API", and call it a
               | day.
        
               | amelius wrote:
               | GoLang. It was designed for highly concurrent loads. For
                | an overview of the history and technology involved, read
               | e.g.:
               | 
               | https://blog.twitch.tv/en/2016/07/05/gos-march-to-low-
               | latenc...
        
               | dagmx wrote:
               | Edit: another comment explains that you're likely talking
               | about just the ref counting aspect rather than the entire
               | context sharing used by the rune code shown, in which
               | case, yes I see why a concurrent GC would avoid the issue
               | in that scenario.
               | 
               | ----------
               | 
                | I'm familiar with Go's GC. Your linked post doesn't
                | explain how it would avoid the hit from cache
                | invalidation across multiple clusters.
               | 
                | It'll either try to put multiple goroutines on a
                | single cluster (as listed in the link) or it'll need
                | to copy the necessary stack per thread, which is
                | effectively what the
               | original article ends up doing.
               | 
               | But if you encounter anything that needs to run
               | concurrently across threads while using a single r/w
               | object, you'll hit the same cliff surely?
        
               | YogurtFiend wrote:
               | While this isn't true with Go's GC, a GC that's stop-the-
               | world can avoid these cache coherency issues altogether.
               | If you pause every thread before marking and sweeping,
               | then you won't run into problems--only the GC is running.
               | While this may sound silly, stopping the world will
               | oftentimes lead to higher throughput than concurrent
               | collectors, at the cost of higher latency (which is why
               | it's not suitable for Go, which is often used for
               | building servers where latency is a priority).
        
               | jhgb wrote:
               | For one, I'd be surprised if Azul's C4 (Zing) didn't have
               | these issues already solved (assuming anyone has solved
               | that, _they_ probably did).
        
           | YogurtFiend wrote:
           | I don't know why you're being downvoted. Memory reclamation
           | is hard in the face of parallelism. It's much easier to
           | design lock-free algorithms in languages with a garbage
           | collector, because memory reclamation issues are taken care
            | of for you (and oftentimes in a batched manner, which won't
            | affect the hot path of your code). There _are_ techniques to
            | avoid using a garbage collector: atomic reference counting,
            | epoch-based reclamation, hazard pointers, and more. But they
           | can be quite difficult to implement and difficult to use.
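            | 
            | For a taste, a minimal epoch-based reclamation sketch
            | using the crossbeam-epoch crate (illustrative only; the
            | value left in the slot at exit is simply leaked):
            | 
            |     use crossbeam_epoch::{self as epoch, Atomic, Owned};
            |     use std::sync::atomic::Ordering;
            |     
            |     fn main() {
            |         let slot = Atomic::new(41);
            |     
            |         // Reader: pin the epoch, then load and use.
            |         let guard = epoch::pin();
            |         let cur = slot.load(Ordering::Acquire, &guard);
            |         // SAFETY: loaded under `guard`, so the value
            |         // cannot be freed while the guard is alive.
            |         println!("saw {}", unsafe { cur.deref() });
            |         drop(guard);
            |     
            |         // Writer: swap in a replacement and retire the
            |         // old allocation; it is freed only after every
            |         // epoch that could still observe it has ended.
            |         let guard = epoch::pin();
            |         let old = slot.swap(Owned::new(42),
            |                             Ordering::AcqRel, &guard);
            |         unsafe { guard.defer_destroy(old); }
            |     }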
        
             | quotemstr wrote:
             | Right. Plus GCs can compact memory, which we can't do with
             | manual memory management. I don't understand why people
             | fetishize manual memory management --- do you want to spend
             | your short time on Earth twiddling references?
        
               | titzer wrote:
               | After 28 years of programming in C and 25 years
               | programming in C++ (on and off), I am sick to death of
               | managing memory. Programmers can't do it right or well.
               | Let the super-intelligent idiot savant with billions of
               | cycles of computation per second and trillions of bits of
               | storage do the thinking for me.
               | 
                | Personally I'm glad the world is moving away from unsafe
               | manual memory management in C++, be that with better RAII
               | and managed pointers, Rust's type system, etc. But those
               | things are still breakable and IMHO, ultimately a big
               | waste of people's time. If Rust's lifetime annotations
               | could be inferred by a compiler, then by all means. But
               | if they can't, just GC already. All that extra thinking
               | pushes people to think about the wrong things and designs
               | end up ossified, difficult to refactor because of the
               | extra overhead of radically changing ownership. Forget it
               | already! Let computers figure out the whole memory
               | management problem.
               | 
               | Want your program to go fast? Don't allocate a bunch of
               | garbage. Redesign it to reuse data structures carefully.
               | Don't burden yourself forever twiddling with managing the
               | large and small alike, obsessing over every dang byte.
        
             | mcguire wrote:
             | Rust's Arc _is_ an atomic reference counter. But, as the
              | article describes, those aren't free.
        
             | duped wrote:
             | One could argue that using a GC for reclamation isn't lock-
             | free if it's not a concurrent GC.
        
               | hayley-patton wrote:
               | There are very few GCs that are actually lock-free. While
               | uncooperative mutators cannot block other mutators in an
               | on-the-fly collector, for example, the collector is
               | blocked until all mutators cooperate. Nonetheless I'd
               | guess that a sufficiently concurrent collector is good
               | enough.
        
       ___________________________________________________________________
       (page generated 2021-12-31 23:00 UTC)