[HN Gopher] An Introduction to Lockless Algorithms
___________________________________________________________________
 
An Introduction to Lockless Algorithms
 
Author : signa11
Score  : 89 points
Date   : 2021-03-04 18:19 UTC (4 hours ago)
 
(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)
 
| dragontamer wrote:
| pthread_join is a blocking call, and therefore against the spirit
| of lockless programming. pthread_join will block until the child
| thread exits, which means there's a lock in there somewhere. I
| personally think the first example they show is a terrible
| example as a result.
|
| ------------
|
| However, the smp_store_release and smp_load_acquire discussion is
| good... very good. But I'm not sure if there's enough discussion
| or theory about why this is required.
|
| The gist is: L1 caches on POWER9 or ARM are allowed to reorder
| loads/stores.
|
|     thread 1
|     --------------------------------
|     a.x = 1;
|     message = &a;
|
| "a.x = 1" should happen-before "message = &a". However, L1
| caches do NOT always follow this logic. "message = &a" could
| happen first from the perspective of DDR4 RAM.
|
| How? Why? Well, maybe the cache line holding message is least-
| recently-used, while the thread continues to read/write the a.x
| variable. So later on, when the cache updates DDR4 RAM,
| "message = &a" happens-before "a.x = 1" (from a DDR4 RAM
| perspective).
|
| This never happens on x86 (x86 cores preserve all store
| orderings). But it is possible on ARM or POWER9. (There are
| other possible causes of reordering: out-of-order execution on
| the CPU, load/store forwarding in the core, as well as compiler
| optimizations.)
|
| It turns out that relaxing the total-store-ordering requirement
| allows L1 caches and CPU cores to operate slightly faster.
| Thanks to overly aggressive multicore processors and
| architectures (like ARM or POWER9), we have to deal with this
| non-obvious ordering violation.
|
| ---------
|
| The barrier forces the happens-before relationship to be
| preserved. You need a barrier on both sides: the write side
| (release) and the read side (acquire).
|
| ---------
|
| Anyway... "happens-before" is about as bloody simple as you think
| it is. The issue is that "happens-before" is now violated on a
| regular basis on modern CPUs in multithreaded environments.
|
| Forcing "happens-before" relationships to happen in the expected
| order is now a requirement for multithreaded programmers.
|
| Note: all lock() and unlock() functions innately have the proper
| happens-before relationships in the right place. So a typical
| programmer who uses mutex_lock() will never have to worry about
| this reordering. All synchronization mechanisms (locks,
| semaphores, condition variables, and so on) need to be carefully
| written with these acquire/release barriers to work on modern
| architectures.
|
| However, if you wish to step into the realm of high-performance
| lock-free algorithms, you absolutely must understand the modern
| processor and how it violates happens-before relationships (as
| well as the memory barriers that RESTORE the happens-before
| relationship).
| anonymousDan wrote:
| But isn't there also the issue that the compiler could reorder
| (and not just the hardware)? Edit: just noticed you also
| mentioned that.
| CodeMage wrote:
| > _pthread_join is a blocking call, and therefore against the
| spirit of lockless programming. pthread_join will block until
| the child thread exits, which means there's a lock in there
| somewhere. I personally think the first example they show is a
| terrible example as a result._
|
| But they didn't use it as an example of lockless programming,
| did they? They were explaining what "acquire" and "release"
| mean, and they wanted to explain it in terms the reader should
| already be familiar with.
|
| In fact, they explicitly say it's not lockless:
|
|     In the previous paragraph we have seen how the acquire and
|     release semantics of pthread_create() and pthread_join()
|     allow the creator of a thread to exchange information with
|     that thread and vice versa. We will now see how this kind of
|     communication can happen in a lockless manner while threads
|     run.
| wtallis wrote:
| Bruce Dawson recently put a bit of a different spin on things
| with his explanation:
| https://randomascii.wordpress.com/2020/11/29/arm-and-lock-fr...
|
| > _If you have circa-2005 code without these memory barriers
| then your code is broken, and has always been broken, on all
| CPUs, because compilers have always been allowed to rearrange
| writes. With these memory barriers (implemented as needed for
| different compilers and platforms) your code is beautiful and
| portable._
|
| > _It turns out that ARM's weak memory model really doesn't
| make things any more complicated. If you are writing lock-free
| code and not using any sort of memory barriers then your code
| is potentially broken everywhere due to compiler reordering. If
| you are using memory barriers then it should be easy to extend
| them to include hardware memory barriers._
| dragontamer wrote:
| The 00s were all about this multithreaded / multi-CPU debate.
| For Java programmers, the solution was to establish memory
| ordering at the language level (which then defined which
| optimizations a Java compiler was, or was not, allowed to do).
|
| C++ programmers chose performance, though it wasn't until
| C++11 that the memory model was fully standardized. Indeed,
| ARM CPU designers were still arguing for a weaker model than
| acquire/release: they wanted to implement consume/release
| semantics instead. (Consume/release is "weaker": it allows
| more reorderings and therefore more opportunities for higher-
| performance CPUs and/or code optimizations.)
|
| Java's sequentially-consistent model is too strong (not enough
| optimizations available) for most programmers and most
| situations. Consume/release is in the C++ standard, but few
| people understand it. Acquire/release is the model that seems
| to be winning out after all of these years.
|
| As a result, ARM and POWER are updating their CPUs to better
| provide acquire/release semantics (as the consume/release
| model of ARMv7 is simply not as popular).
| pdkl95 wrote:
| > For this to happen, the store and load must be endowed with
| release and acquire semantics respectively. To that end, we
| have the "store-release" and "load-acquire" operations.
|
| I suppose release and acquire semantics are baked into Rust as
| the language's "ownership" concept. A variable is moved
| (release) out of the calling scope and the function takes
| ownership (acquire) at the function call, which synchronizes
| the symmetric operations. In the pthread_create/pthread_join
| example, thread 2 is borrowing the variable "s".
| millstone wrote:
| Acquire/release semantics restrict reordering of _other_ loads
| and stores. That is, in `a = 1; b.store(2, release);` the
| release operation on b affects the previous store to a. Rust's
| ownership model cannot express this, as it treats separate
| variables as independent.
|
| Your analogy to borrowing is correct in the pthread case,
| because it has no race. But typically lockless programming has
| races, and I don't think it's useful to think of it in terms
| of ownership.
| rurban wrote:
| Thanks Paolo!
___________________________________________________________________
(page generated 2021-03-04 23:00 UTC)