[HN Gopher] An Introduction to Lockless Algorithms
       ___________________________________________________________________
        
       An Introduction to Lockless Algorithms
        
       Author : signa11
       Score  : 89 points
       Date   : 2021-03-04 18:19 UTC (4 hours ago)
        
 (HTM) web link (lwn.net)
 (TXT) w3m dump (lwn.net)
        
       | dragontamer wrote:
       | pthread_join is a blocking call, and therefore against the spirit
       | of lockless programing. pthread_join will block until the child-
       | thread exists, which means there's a lock in there somewhere. I
       | personally think the first example they show is a terrible
       | example as a result.
       | 
       | ------------
       | 
       | However, the smp_store_release and smp_load_acquire discussion is
       | good... very good. But I'm not sure if there's enough discussion
       | or theory to discuss why this is required.
       | 
       | The gist is: L1 caches on POWER9 or ARM are allowed to reorder
       | loads/stores.                   thread 1
       | --------------------------------           a.x = 1;
       | message = &a;
       | 
       | "a.x = 1" should happens-before "message = &a". However, L1
       | caches do NOT always follow this logic. "message=&a" could happen
       | first from the perspective of DDR4 RAM.
       | 
       | How? Why? Well, maybe message=&a is least-recently-used, while
       | the thread continues to read/write to a.x variable. So later on,
       | when the cache updates DDR4 RAM, the message=&a happens-before
       | a.x=1 (from a DDR4 RAM perspective).
       | 
       | This never happens on x86 (the x86 Cores preserves all store
       | orderings). But it is potentially possible on ARM or POWER9.
       | (There are other possible reasons for reorderings: Out-of-order
       | computations on the CPU, Load/store forwarding on the Core, as
       | well as compiler optimizations).
       | 
       | It turns out that violating the total-store-ordering requirement
       | allows for L1 caches and CPU-cores to operate slightly faster.
       | Thanks to overly aggressive multicore processors and
       | architectures (like ARM or POWER9), we have to deal with this
       | non-obvious ordering violation.
       | 
       | ---------
       | 
       | The barrier forces the happens-before relationship to be
       | preserved. You need a barrier on two sides: the write side
       | (release) and the read-side (acquire).
       | 
       | ---------
       | 
       | Anyway... "happens-before" is about as bloody simple as you think
       | it is. The issue is that "happens-before" is now violated on a
       | regular basis on modern CPUs in multithreaded environments.
       | 
       | Forcing "happens-before" relationships to happen in the expected
       | order is now a requirement for multithreaded programmers.
       | 
       | Note: all lock() and unlock() functions innately have the proper
       | happens-before relationships in the right place. So a typical
       | programmer who uses mutex_lock() will never have to worry about
       | this happens-before reordering. All synchronization mechanisms:
       | locks, semaphores, condition variables, etc. etc. need to be
       | carefully written with these acquire / release barriers to work
       | on modern architectures.
       | 
       | However, if you wish to step into the realm of high-performance
       | lock-free algorithms, you must absolutely understand the modern
       | processor, and how it violates happens-before relationships. (as
       | well as understanding the memory barriers that RESTORE the
       | happens-before relationship).
        
         | anonymousDan wrote:
         | But isn't there also the issue that the compiler could reorder
         | (and not just the hardware)? Edit: just noticed you also
         | mentioned that
        
         | CodeMage wrote:
         | > _pthread_join is a blocking call, and therefore against the
         | spirit of lockless programing. pthread_join will block until
         | the child-thread exists, which means there 's a lock in there
         | somewhere. I personally think the first example they show is a
         | terrible example as a result._
         | 
         | But they didn't use it as an example of lockless programming,
         | did they? They were explaining what "acquire" and "release"
         | mean, and they wanted to explain it in terms the reader should
         | already be familiar with.
         | 
         | In fact, they explicitly say it's not lockless:
         | In the previous paragraph we have seen how the acquire and
         | release semantics of pthread_create() and pthread_join() allow
         | the creator of a thread to exchange information with that
         | thread and vice versa. We will now see how this kind of
         | communication can happen in a lockless manner while threads
         | run.
        
         | wtallis wrote:
         | Bruce Dawson recently put a bit of a different spin on things
         | with his explanation:
         | https://randomascii.wordpress.com/2020/11/29/arm-and-lock-fr...
         | 
         | > _If you have circa-2005 code without these memory barriers
         | then your code is broken, and has always been broken, on all
         | CPUs, because compilers have always been allowed to rearrange
         | writes. With these memory barriers (implemented as needed for
         | different compilers and platforms) your code is beautiful and
         | portable._
         | 
         | > _It turns out that ARM's weak memory model really doesn't
         | make things any more complicated. If you are writing lock-free
         | code and not using any sort of memory barriers then your code
         | is potentially broken everywhere due to compiler reordering. If
         | you are using memory barriers then it should be easy to extend
         | them to include hardware memory barriers._
        
           | dragontamer wrote:
           | The 00s were all about this multithreaded / multiCPU debate.
           | For Java programmers, the solution was to establish memory-
           | ordering at the language level (which then defined which
           | optimizations a Java compiler was allowed, or not allowed to
           | do).
           | 
           | C++ programmers chose performance, though it wasn't until
           | C++11 when the memory model was fully standardized. Indeed:
           | ARM CPU-designers were still arguing for a weaker model than
           | acquire/release: ARM CPU-designers wanted to implement
           | consume/release semantics instead. (consume/release is
           | "weaker": allowing for more reorderings and therefore more
           | opportunities for higher-performance CPUs and/or code
           | optimizations).
           | 
           | Java's sequentially-consistent models are too strong (not
           | enough optimizations available) for most programmers and most
           | situations. Consume/release is in C++ standard, but few
           | people understand it. Acquire/release is the model that seems
           | to be winning out after all of these years.
           | 
           | As a result, ARM and POWER are updating their CPUs to better-
           | provide acquire/release semantics. (as the consume/release
           | model of ARMv7 is simply not as popular).
        
       | pdkl95 wrote:
       | > For this to happen, the store and load must be endowed with
       | release and acquire semantics respectively. To that end, we have
       | the "store-release" and "load-acquire" operations.
       | 
       | I suppose release and acquire semantics are baked into Rust as
       | the language's "ownership" concept. A variable is moved (release)
       | out of the calling scope and the function takes ownership
       | (acquire) at the function call, which synchronizes the symmetric
       | operations. In the pthread_create/pthread_join example, thread 2
       | is borrowing the variable "s".
        
         | millstone wrote:
         | acquire/release semantics restricts reordering of _other_ loads
         | and stores. That is, `a=1; b.store(2, release);` here the
         | release operation on b affects the previous store to a. Rust 's
         | ownership model cannot express this, as it treats separate
         | variables as independent.
         | 
         | Your analogy to borrowing is correct in the pthread case,
         | because it has no race. But typically lockless programming has
         | races, and I don't think it's useful to think of it in terms of
         | ownership.
        
       | rurban wrote:
       | Thanks Paolo!
        
       ___________________________________________________________________
       (page generated 2021-03-04 23:00 UTC)