[HN Gopher] Programming Language Memory Models ___________________________________________________________________ Programming Language Memory Models Author : thinxer Score : 94 points Date : 2021-07-06 16:22 UTC (6 hours ago) (HTM) web link (research.swtch.com) (TXT) w3m dump (research.swtch.com) | raphlinus wrote: | A GPU follow-up to this article. | | While on CPU sequentially consistent semantics are efficient to | implement, that seems to be much less true on GPU. Thus, Vulkan | completely eliminates sequential consistency and provides only | acquire/release semantics[1]. | | It is extremely difficult to reason about programs using these | advanced memory semantics. For example, there is a discussion | about whether a spinlock implemented in terms of acquire and | release can be reordered in a way to introduce deadlock (see | reddit discussion linked from [2]). I was curious enough about | this that I tried to model it in CDSChecker, but did not get | definitive results (the deadlock checker in that tool is enabled | for mutexes provided by the API, but not for mutexes built out of | primitives). I'll also note that AcqRel semantics are not | provided by the Rust version of compare_exchange_weak (perhaps a | nit on TFA's assertion that Rust adopts the C++ memory model | wholesale), so if acquire to lock the spinlock is not adequate, | it's likely it would need to go to SeqCst. | | Thus, I find myself quite unsure whether this kind of spinlock | would work on Vulkan or would be prone to deadlock. It's also | possible it could be fixed by putting a release barrier before | the lock loop. | | We have some serious experts on HN, so hopefully someone who | knows the answer can enlighten us - mixed in of course with all | the confidently wrong assertions that inevitably pop up in | discussions about memory model semantics. | | [1]: https://www.khronos.org/blog/comparing-the-vulkan-spir-v-mem...
| | [2]: https://rigtorp.se/spinlock/ | raphlinus wrote: | Also: it remains difficult to fully nail down the semantics of | sequential consistency as well, especially when it's mixed with | other memory semantics. Very likely next time Russ updates his | article he should add a reference to Repairing Sequential | Consistency in C/C++11[1]. | | [1]: https://plv.mpi-sws.org/scfix/full.pdf | dragontamer wrote: | GPU-spinlocks are a bad idea, unless the spinlock is applied | over the entire Thread-group. | | Even then, I'm pretty sure the spinlock is a bad idea, because | you probably should be using GPUs as a coprocessor and | enforcing "orderings" over CUDA-Streams or OpenCL Task Graphs. | The kernel-spawn and kernel-end mechanism provides your | synchronization functionality ("happens-before") when you need | it. | | --------- | | From there on out: the GPU low-level synchronization of choice | is the thread-barrier (which can extend out beyond a wavefront, | but only up to a block). | | -------- | | So that'd be my advice: use a thread-barrier at the lowest | level for thread blocks (synchronization between 1024 threads | and below). And use kernel-start / kernel-end graphs (aka: CUDA | Streams and/or OpenCL Task Graphs) for synchronizing groups of | more than 1024 threads together. | | Otherwise, I've done some experiments with acquire/release and | basic lock/unlock mechanisms. They seem to work as expected. | You get deadlocks immediately on older hardware because of the | implicit SIMD-execution (so you want only thread#0 or | active-thread#0 to perform the lock for the whole wavefront / thread | block). You'll still want to use thread-barriers for higher | performance synchronization. | | Frankly, I'm not exactly sure why you'd want to use a spinlock | since thread-barriers are simply higher performance in the GPU | world. | raphlinus wrote: | In _general_ spinlocks are a bad idea, but you do see them in | contexts like decoupled look-back.
As you say, thread | granularity is a problem (unless you're on CUDA on Volta+ | hardware, which has independent thread scheduling), so you | want threadgroup or workgroup granularity. | | In any case, I'm interested in pushing the boundaries of | lock-free algorithms. It is of course easy to reason about | kernel-{start/end} synchronization, but the granularity may | be too coarse for some interesting applications. | dragontamer wrote: | This is the first time I've heard of the term "decoupled | look-back". But I see that it refers to CUB's | implementation of device-wide scan. | | I briefly looked at the code, and came across: | https://github.com/NVIDIA/cub/blob/main/cub/agent/agent_scan... | | I'm seeing lots of calls to "CTA_SYNC()", which ends up | being just a "__syncthreads" (a simple thread-barrier). | See: https://github.com/NVIDIA/cub/blob/a8910accebe74ce043a13026f... | | I admit that I'm looking rather quickly though, but... I'm | not exactly seeing where this mysterious "spinlock" is that | you're talking about. I haven't tried very hard yet but | maybe you can point out what code exactly in this | device_scan / decoupled look-back uses a spinlock? Cause | I'm just not seeing it. | | ---------- | | And of course: a call to cub's "device scan" is innately | ordered to kernel-start / kernel-end. So there's your | synchronization mechanism right there and then. | raphlinus wrote: | I don't think CUB is doing decoupled look-back, the | reference you want is: | https://research.nvidia.com/publication/single-pass-parallel... | | It doesn't use the word "spin" but repeated polling (step | 4 in the algorithm presented in section 4.1, particularly | when the flag is X) is basically the same.
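For readers who want something concrete to poke at: the acquire/release spinlock under discussion can be sketched on the CPU side with Java's post-Java-9 atomics, which expose the same ordering vocabulary (acquire CAS, release store, relaxed/opaque polling loads). The `SpinLock` and `SpinDemo` classes below are a hypothetical illustration, not code from TFA, rigtorp.se, or CUB; the `java.util.concurrent.atomic.AtomicBoolean` methods used are real API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical test-and-test-and-set spinlock using only
// acquire/release orderings (no SeqCst), in the spirit of the
// rigtorp.se spinlock discussed above.
class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        // compareAndExchangeAcquire returns the value it witnessed:
        // witnessing `false` means the CAS succeeded and we hold the lock.
        while (locked.compareAndExchangeAcquire(false, true)) {
            // Opaque loads are roughly C++ relaxed: coherent and
            // eventually see the release store, but impose no ordering.
            while (locked.getOpaque()) Thread.onSpinWait();
        }
    }

    void unlock() {
        // The release store publishes everything written inside the
        // critical section to the next acquiring thread.
        locked.setRelease(false);
    }
}

class SpinDemo {
    static int counter = 0; // deliberately plain; protected by the lock

    static int run() {
        SpinLock lock = new SpinLock();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                lock.lock();
                counter++;
                lock.unlock();
            }
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        try { a.join(); b.join(); }
        catch (InterruptedException e) { throw new IllegalStateException(e); }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run()); // expect 200000
    }
}
```

Whether this lock-with-acquire / unlock-with-release pairing is also deadlock-safe under Vulkan's weaker model is exactly the open question in the parent comment; the sketch only shows the orderings involved, on a CPU where they are known to suffice.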
| dragontamer wrote: | 3rd paragraph: | | > In this report, we describe the decoupled-lookback | method of single-pass parallel prefix scan and its | implementation within the open-source CUB library of GPU | parallel primitives | | The CUB-library also states: | | https://nvlabs.github.io/cub/structcub_1_1_device_scan.html | | >> As of CUB 1.0.1 (2013), CUB's device-wide scan APIs | have implemented our "decoupled look-back" algorithm for | performing global prefix scan with only a single pass | through the input data, as described in our 2016 | technical report [1] | | Where [1] is a footnote pointing at the exact paper you | just linked. | | ----------- | | > It doesn't use the word "spin" but repeated polling | (step 4 in the algorithm presented in section 4.1, | particularly when the flag is X) is basically the same. | | That certainly sounds spinlock-ish. At least that gives | me what to look for in the code. | raphlinus wrote: | Ah, good then, I didn't know that, thanks for the cite. I | haven't studied the CUB code on it carefully. | knz42 wrote: | A lot of the complexity comes from the lack of expressivity in | languages to relate variables (or data structure fields) | semantically to each other. If there was a way to tell the | compiler "these variables are always accessed in tandem", the | compiler could be smart about ordering and memory fences. | | The idea to extend programming languages and type systems in that | direction is not new: folk who've been using distributed | computing for computations have to think about this already, and | could teach a few things to folk who use shared memory | multi-processors. | | Here's an idea for ISA primitives that could help a language | group variables together: bind/propagate operators on | (combinations of) address ranges.
| https://pure.uva.nl/ws/files/1813114/109501_19.pdf | smasher164 wrote: | Even with that expressivity, someone who incorrectly relates or | forgets to relate two variables could experience the same | issues. It's still important to address what happens when the | program has data races or when it is data-race-free but the | memory model permits overreaching optimizations. The language | and implementation should strive to make a program | approximately correct. | dragontamer wrote: | That's Java's object monitor (synchronized) mechanism. | | All variables inside of an object (aka: any class) are assumed | to be related to each other. synchronized(foobar_object){ | baz(); } ensures that all uses of foobar_object inside the | synchronized{} block are sequential (and therefore correct). | | -------- | | The issue is that some people (a minority) are interested in | "beating locks" and making something even more efficient. | karmakaze wrote: | In Java, any object can be used to synchronize any data, e.g. | synchronized(foobar_object){ foo(); } | synchronized(foobar_object){ bar(); } | synchronized(foobar_object){ baz(); } | | will have the foo, bar, baz methods well behaved in any data that | they share regardless of whether they are foobar methods or | methods of any other class(es). It is exactly analogous to | the S(a) -> S(a) synchronizing instruction from the article | that establishes a happens-before partitioning each thread | into before/after the S(a). | | The only time synchronized(explicit_object) relates to | anything else is when also using the keyword form, where | `synchronized void foo()` is equivalent (with a minor | performance difference) to `synchronized(this) { ... }` | wrapping the entire body of the foo method. | voidnullnil wrote: | These "memory models" are too complex for languages intended | for dilettante developers. It was a disaster in Java/C#.
No more than | a handful of programmers in existence know in depth how | it works, as in, can they understand any given trivial program in | their language. At best they know only some vague stuff, like that | locking prevents visibility issues. It goes far deeper | than that though (which is also the fault of complex language | designs like Java and C#). | | The common programmer does not understand that you've just | transformed their program - for which they were taught merely | that multiple threads need synchronization - into a new game, | which has an entire separate specification, where every shared | variable obeys a set of abstruse rules revolving around the | happens-before relationship. Locks, mutexes, and atomic variables are | all one thing. Fences are a completely different thing. At least | in the way most people intuit programs to work. | | Go tries to appeal to programmers as consumers (that is, when | given a choice between cleaner design and pleasing the user who | just wants to "get stuff done", they choose the latter), yet | also adds in traditional complexities like this. Yes, there is a | performance trade-off to having shared memory behave intuitively, | but that's much better than bugs that 99% of your CHOSEN userbase | do not know how to avoid. Also remember Go has lots of weird edge | cases, like how sharing a slice across threads can lead to memory | corruption (in the C / assembly sense, not merely within that | array) despite the rest of the language being memory-safe. | Multiply that by the "memory model". | | Edit: forgot spaces between paragraphs. | filleduchaos wrote: | > These "memory models" are too complex for languages intended | for dilettante developers. | | It would be nice if sometime we stopped pretending that | beginners are too slow to know/understand things and instead | faced the fact that their instructors and mentors are bad at | teaching.
| temac wrote: | They are not too complex for "languages intended for dilettante | developers": they can (and should) use sequential consistency | or locks everywhere. | voidnullnil wrote: | How many programmers do you know put locks around polling | variables for seemingly no reason (as they are not cognizant | of the memory model)? | pjmlp wrote: | Well, while it may appear gatekeeping, maybe those | dilettante developers should be using something else | instead, like BASIC? | formerly_proven wrote: | CPython has a pleasant memory model from what I've heard | :) | electricshampo1 wrote: | "Java and JavaScript have avoided introducing weak | (acquire/release) synchronizing atomics, which seem tailored for | x86." | | This is not true for Java; see | | http://gee.cs.oswego.edu/dl/html/j9mm.html | | https://docs.oracle.com/en/java/javase/16/docs/api/java.base... | dragontamer wrote: | It's not true in general. x86 CANNOT have weak acquire/release | semantics. x86 is "too strong": you get total-store ordering by | default. | | If you want to test out weaker acquire/release semantics, you | need to buy an ARM or POWER9 processor. | bullen wrote: | In 100 years the main languages used will still be C on the | client (with a C++ compiler) and Java on the server. | | Go has no VM but it has a GC. WASM has a VM but no GC. | | Everything has been tried and Java still kicks everything's ass to | the moon on the server. | | Fragmentation is bad, let's stop using bad languages and focus on | the products we build instead. | | "While I'm on the topic of concurrency I should mention my far | too brief chat with Doug Lea. He commented that multi-threaded | Java these days far outperforms C, due to the memory management | and a garbage collector. If I recall correctly he said "only 12 | times faster than C means you haven't started optimizing"."
- | Martin Fowler https://martinfowler.com/bliki/OOPSLA2005.html | | "Many lock-free structures offer atomic-free read paths, notably | concurrent containers in garbage collected languages, such as | ConcurrentHashMap in Java. Languages without garbage collection | have fewer straightforward options, mostly because safe memory | reclamation is a hard problem..." - Travis Downs | https://travisdowns.github.io/blog/2020/07/06/concurrency-co... | pphysch wrote: | Pretty sure Java is/was popular (and has enormous momentum) | because "it just works" at a time when the Internet was taking | off, not because it is some linguistic or technological marvel. | It will definitely stick around, just like COBOL and C and Go. | bullen wrote: | COBOL is not around in anything interesting, and trust me, Go | is not going to be used to build anything that we'll use in | 20 years. | pphysch wrote: | Do I hear "containers are a temporary fad"? | bullen wrote: | Yep | skohan wrote: | I'm sorry, is this comment from 1998? I've been working in | software for over a decade, and I haven't seen server work | being done in Java in ages. | | From my perspective, Go in the context of serverless | programming seems to currently be the best choice for | server-side programming. | | In the next 20 years I expect Go will be supplanted by a | language which is a lot like Go (automatic memory management, | simple, easy to learn & write and performant enough) but with | the addition of algebraic data types, named parameters, and a | _slightly_ higher level of abstraction. | pjmlp wrote: | And I haven't seen server work done in C++ since 2006; we | keep replacing those systems with Java and .NET ones. | | To each its own. | bullen wrote: | Infinite growth does not exist, everything peaks at some | point.
You have to wonder, if a memory model from 2005 still | kicks Go's ass in 2021, how your prediction that "there | will always be something new and shiny to distract us from | the focus we need to leverage the real value of the internet" | will play out? | | What have you built with Go that is interesting? | skohan wrote: | It's not about new and shiny. Programming is still a | relatively new field when compared to other fields. For | instance, mathematics and physics took centuries to land on | the right way to formalize things. | | C is maybe the only good programming language invented so | far. Java was a failed attempt at improving C. I think | we're rapidly converging on the second good programming | language, and it's not going to have null pointer | exceptions. | valbaca wrote: | > In the next 20 years I expect Go will be supplanted by a | language which is a lot like go (automatic memory management, | simple, easy to learn & write and performant enough) but with | the addition of algebraic data types, named parameters, and a | slightly higher level of abstraction. | | I'd love for this to be Crystal: https://crystal-lang.org/ | | > I haven't seen server work being done in Java in ages. | | In the meantime, I've been doing a large amount of Java | backend server work for the past 10 years. | skohan wrote: | I have a feeling it's going to have C-like syntax, and | frankly I hope so, because using an `end` keyword instead of | braces makes no sense to me. | filleduchaos wrote: | Arguably using curly braces to delineate blocks makes no | inherent sense either. We just do it because that's what | everybody else does. | jqpabc123 wrote: | _If thread 2 copies done into a register before thread 1 | executes, it may keep using that register for the entire loop, | never noticing that thread 1 later modifies done._ | | Alternative solution: Forget all the "atomic" semantics and | simply avoid "optimization" of global variables.
Access to any | global variable should always occur directly from memory. Sure, | this will be less than optimal in some cases, but such is the | price of using globals. Their use should be discouraged anyway. | | In other words, make "atomic" the sensible and logical default | with globals. Assignment is an "atomic" operation; just don't | circumvent it by using a local copy as an "optimization". | dragontamer wrote: | Removing optimizations on "global" variables will leave the bug | in Singleton objects (which are very similar to global | variables, but the compiler doesn't know that they're global). | | --------- | | "Volatile" is close but not good enough semantically to | describe what we want. That's why these new Atomic-variables | are being declared with seqcst semantics (be it in Java, C++, | C, or whatever you program in). | | That's the thing: we need a new class of variables that wasn't | known 20 years ago. Variables that follow the | sequential-consistency requirements, for this use case. | | --------- | | Note: on ARM systems, even if the compiler doesn't mess up, the | L1 cache could mess you up. ARM has multiple load/store | semantics available. If you have relaxed (default) semantics on | a load, it may return a "stale" value from DDR4. | | That is to say: your loop may load a value into L1 cache, then | your core will read the variable over and over from L1 cache | (not realizing that L3 cache has been updated to a new value). | Not only does your compiler need to know to "not store the | value in a register", the memory subsystem also needs to know | to read the data from L3 cache over and over again (never using | L1 cache). | | Rearranging loads/stores on x86 is not allowed in this manner. | But ARM is more "relaxed" than x86. If reading the "freshest" | value is important, you need to have the appropriate memory | barriers on ARM (or PowerPC).
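The failure mode quoted above (a `done` flag hoisted into a register, so the loop never sees the other thread's write) is easy to demonstrate and fix in Java, where `volatile` is exactly the per-variable "new class of variables" the parent describes. A minimal sketch; the `DoneFlag` class name is made up for illustration:

```java
class DoneFlag {
    // With a plain `boolean done`, the JIT may hoist the load out of
    // the loop and spin on a register forever -- the exact bug in the
    // quoted example. `volatile` forbids that caching and gives the
    // write a happens-before edge to every subsequent volatile read.
    static volatile boolean done = false;
    static int payload = 0; // plain field, published safely via `done`

    static int run() {
        Thread t = new Thread(() -> {
            payload = 42;   // ordinary write...
            done = true;    // ...published by the volatile store
        });
        t.start();
        while (!done) {     // volatile read: re-reads memory each time
            Thread.onSpinWait();
        }
        return payload;     // the happens-before edge guarantees 42
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 42
    }
}
```

Note that `payload` itself can stay a plain field: the volatile store/load pair on `done` is what orders and publishes it, which is the same acquire/release-style reasoning the rest of the thread is wrestling with.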
| jqpabc123 wrote: | _which are very similar to global variables, but the compiler | doesn't know that they're global_ | | Since, as you say, they are very similar, wouldn't it be | reasonable to assume for access purposes that they are | effectively global? | dragontamer wrote: | Let's do this example in Java (but it should be simple | enough that C#, Python, Javascript and other programmers | would understand it). public void | myFunction(FooObject o){ o.doSomething(); | } | | How does the compiler know if "FooObject o" is a singleton | or not? That's the thing about the "Singleton" pattern: you | have an effective "global-ish" variable, but all of your | code is written with normal pass-the-object style. | | EDIT: If you're not aware, the way this works is that you | call myFunction(getTheSingleton());, where | "getTheSingleton()" fetches the Singleton object. | myFunction() has no way of "knowing" it's actually | interacting with the singleton. This is a useful pattern, | because you can create unit tests over the "global" | variable by simply mocking out the Singleton for a mock | object (or maybe an object with preset state for better | unit testing). Among other benefits (but also similar | downsides to using a global variable: difficult to reason | about because you have this "shared state" being used all over | the place) | mumblemumble wrote: | Is there anything special about singletons here? | | In Java, there's no real difference between a singleton | and any other object. A singleton is an object that just | happens to have a single instance. Practically speaking, | they're typically used as a clever design pattern to | "work around" Java's lack of language-level support for | global variables, so there's that. But I think that that | fact might not be relevant to the issue at hand?
| | The more basic issue is, if you have two different | threads concurrently executing `myFunction`, what happens | when they're both operating on the same instance of | `FooObject`? | dragontamer wrote: | > Is there anything special about singletons here? | | No, aside from the fact that the root commenter clearly | understands the issue with global variables, but not | necessarily singletons. | | I'm trying to use the singleton concept as a "teaching | bridge" moment, as the Singleton is clearly "like a | global variable" in terms of the data race, but | generalizes to any object in your code. | | The commenter I'm replying to seems to think that global | variables are the only kind of variable where this | problem occurs. He's wrong. All objects and all variables | have this problem. | ekiwi wrote: | > Access to any global variable should always occur direct from | memory. | | What if your function takes a pointer that might be pointing to | a global variable? Does that mean that all accesses through a | pointer are now exempt from optimization unless the compiler | can prove that the pointer will never point to a global | variable? | jqpabc123 wrote: | What if your function takes a pointer that might be pointing | to an "atomic" variable? | | Pointers can be used to circumvent most safety measures. If | you obscure the access, you should assume responsibility for | the result. | cbsmith wrote: | The great thing about simple memory models is they work | really well until you think about it. ;-) | mumblemumble wrote: | This problem isn't specific to global variables; it happens | with all shared mutable state. I would assume that the author | only used global variables because that lets them keep the | working examples as short as possible, and minimize irrelevant | details. | | And yes, you _can_ put a full memory fence around every access | to a variable that is shared across threads. But doing so would | just destroy the performance of your program.
Compared to using | a register, accessing main memory typically takes something on | the order of 100 times as long. Given that we're talking about | concerns that are specific to a relatively low-level approach | to parallelism, I think it's safe to assume that performance is | the whole point, so that would be an unacceptable tradeoff. | dragontamer wrote: | > And yes, you can put a full memory fence around every | access to a variable that is shared across threads. But doing | so would just destroy the performance of your program. Given | that we're talking about concerns that are specific to a | relatively low-level approach to parallelism, I think it's | safe to assume that performance is the whole point, so that | would be an unacceptable tradeoff. | | Indeed. | | Just a reminder to everyone: your pthread_mutex_lock() and | pthread_mutex_unlock() functions already contain the | appropriate compiler / cache memory barriers in the correct | locations. | | This "Memory Model" discussion is only for people who want to | build faster systems: for people searching for a "better | spinlock", or for writing lock-free algorithms / lock-free | data structures. | | This is the stuff of cutting-edge research right now: it's a | niche subject. Your typical programmer _SHOULD_ just stick a | typical pthread_mutex_t onto an otherwise single-threaded | data structure and call it a day. Locks work. They're not "the | best", but "the best" is constantly being researched / | developed right now. I'm pretty sure that any new lock-free | data structure with decent performance is pretty much instant | Ph.D. thesis material.
| | ----------- | | Anyway, the reason why "single-threaded data structure behind | a mutex" works is because your data structure still keeps all | of its performance benefits (from sticking to L1 cache, or | letting the compiler "manually cache" data in registers when | appropriate), and then you only lose performance | associated with the lock() or unlock() calls (which will | innately have memory barriers to publish the results). | | That's 2 memory barriers (one barrier for lock() and one | barrier for unlock()). The thing about lock-free algorithms | is that they __might__ get you down to __1__ memory barrier | per operation if you're a really, really good programmer. But | it's not exactly easy. (Or: they might still have 2 memory | barriers, but the lock-free aspects of "always forward progress" | and/or deadlock-freedom might be easier to prove.) | | Writing a low-performance but otherwise correct lock-free | algorithm isn't actually that hard. Writing a lock-free | algorithm that beats your typical mutex + data structure, | however, is devilishly hard. | voidnullnil wrote: | > This "Memory Model" discussion is only for people who | want to build faster systems: for people searching for a | "better spinlock", or for writing lock-free algorithms / | lock-free data structures. | | Actually, most practitioner code has bugs from their | implicit assumptions that shared-variable writes are | visible or ordered the way they think they are. | dragontamer wrote: | But the practitioner doesn't need to know the memory | model (aside from "memory models are complicated"). | | To solve that problem, the practitioner only needs to | know that "mutex.lock()" and "mutex.unlock()" order | reads/writes in a clearly defined manner. If the | practitioner is wondering about the difference between | load-acquire and load-relaxed, they've probably gone too | deep.
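The "plain data structure behind a lock" advice above translates directly into Java: the lock()/unlock() pair supplies all the barriers, so the structure itself needs no atomics or volatiles. A sketch using the real java.util.concurrent.locks.ReentrantLock API; the `LockedCounter` class is illustrative, not code from the thread:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// A single-threaded HashMap made thread-safe the boring, correct way:
// every access goes through one lock. lock()/unlock() carry the
// acquire/release barriers, so the map itself stays a plain, fully
// optimizable structure between critical sections.
class LockedCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private final Map<String, Integer> counts = new HashMap<>();

    void add(String key) {
        lock.lock();
        try {
            counts.merge(key, 1, Integer::sum);
        } finally {
            lock.unlock(); // release barrier publishes the update
        }
    }

    int get(String key) {
        lock.lock();
        try {
            return counts.getOrDefault(key, 0);
        } finally {
            lock.unlock();
        }
    }

    static int demo() {
        LockedCounter c = new LockedCounter();
        Runnable work = () -> { for (int i = 0; i < 10_000; i++) c.add("x"); };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        try { a.join(); b.join(); }
        catch (InterruptedException e) { throw new IllegalStateException(e); }
        return c.get("x");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // expect 20000
    }
}
```

This is the two-barriers-per-operation baseline the parent describes; a lock-free design only pays off if it can beat it, which is the hard part.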
| voidnullnil wrote: | > To solve that problem, the practitioner only needs to | know that "mutex.lock()" | | This is true, but they do not know that. If you do not | give some kind of substantiation, they will shrug it off | and go back to "nah, this thing doesn't need a mutex", | like with a polling variable (contrived example). | Kranar wrote: | Can you explain what you mean by a "polling variable" | needing a mutex? Usually polling is done using atomic | instructions instead of a mutex. Are you referring to | condition variables? | mumblemumble wrote: | In a lot of code I've seen, there are threads polling | some variable without using any sort of special guard. | The assumption (based, I assume, on how you really could | get away with this back in the days of single-core, | single-CPU computers) is that you only need to worry | about race conditions when writing to primitive | variables, and that simply reading them is always safe. | Kranar wrote: | Okay, but the poster mentioned a mutex, which would not be | a good way to go about polling a variable in Java. All | you need to guarantee synchronization of primitive values | in Java is the use of volatile [1]. If you need to | compose atomic operations together, then you can use an | atomic or a mutex, but it would not occur to me to use a | mutex to perform a single atomic read or write on a | variable in Java. | | [1] https://docs.oracle.com/javase/specs/jls/se8/html/jls-8.html...
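Kranar's distinction can be made concrete: a single volatile read or write needs no lock, but a compound read-modify-write does need either a mutex or a CAS retry loop, because the check and the update are separate steps. A sketch using the real `java.util.concurrent.atomic.AtomicInteger` API; the `BoundedCounter` class and `boundedIncrement` name are made up for illustration:

```java
import java.util.concurrent.atomic.AtomicInteger;

// "Composing atomic operations": increment-if-below-limit is two
// steps (read, then conditional write), so a volatile int is not
// enough -- two threads could both pass the check and overshoot.
// A CAS retry loop makes the compound operation atomic without a mutex.
class BoundedCounter {
    private final AtomicInteger value = new AtomicInteger(0);
    private final int limit;

    BoundedCounter(int limit) { this.limit = limit; }

    // Returns true if we incremented, false if already at the limit.
    boolean boundedIncrement() {
        while (true) {
            int cur = value.get();
            if (cur >= limit) return false;
            // CAS fails if another thread raced us; retry against
            // the freshly observed value.
            if (value.compareAndSet(cur, cur + 1)) return true;
        }
    }

    int get() { return value.get(); }

    static int demo() {
        BoundedCounter c = new BoundedCounter(1_000);
        Runnable work = () -> { for (int i = 0; i < 5_000; i++) c.boundedIncrement(); };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        try { a.join(); b.join(); }
        catch (InterruptedException e) { throw new IllegalStateException(e); }
        return c.get();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // never exceeds 1000
    }
}
```

With a plain or even volatile `int` and the same two-step logic, the counter could blow past its limit under contention; the CAS loop (or, equivalently, a mutex around the check-and-increment) is what makes the composition atomic.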
| wcarss wrote: | The prior article in this series from ~a week ago is 'Hardware | Memory Models', at https://research.swtch.com/hwmm, with some hn- | discussion here: https://news.ycombinator.com/item?id=27684703 | | Another somewhat recently posted (but years-old) page with | different but related content is 'Memory Models that Underlie | Programming Languages': http://canonical.org/~kragen/memory- | models/ | | a few previous hn discussions of that one: | | https://news.ycombinator.com/item?id=17099608 | | https://news.ycombinator.com/item?id=27455509 | | https://news.ycombinator.com/item?id=13293290 ___________________________________________________________________ (page generated 2021-07-06 23:00 UTC)