[HN Gopher] A viable solution for Python concurrency
___________________________________________________________________

A viable solution for Python concurrency

Author : zorgmonkey
Score  : 392 points
Date   : 2021-10-15 17:54 UTC (5 hours ago)

(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)

| ferdowsi wrote:
| If this effort succeeds (and I hope it does), Python developers
| will now need to contend with the event-loop albatross of
| asyncio and all of its weird complexity.
|
| In an alternate Python timeline, asyncio was not introduced into
| the Python standard library, and instead we got a natively
| supported, robust, easy-to-use concurrency paradigm built around
| green/virtual threading that accommodates both IO and CPU bound
| work.
| BiteCode_dev wrote:
| asyncio is not in competition with threads, it's complementary.
|
| In fact, it's a perfectly viable strat in Python to have
| several processes, each having several threads, each having an
| event loop.
|
| And it will still be so, once this comes out. You will
| certainly use threads more, and processes less, but replacing
| 1000000 coroutines with 1000000 system threads is not
| necessarily the right strategy for your task. See nginx vs
| apache.
| dralley wrote:
| Multiple threads with one asyncio loop per thread would be
| absolutely pointless in Python, because of the GIL.
|
| With that said, sure, threads and asyncio are complementary
| in the sense that you can run tasks on threadpool executors
| and treat them as if they were coroutines on an event loop.
| But that serves no purpose unless you're trying to do
| blocking IO without blocking your whole process.
| bonzini wrote:
| In Python it would be pointless, but for example it's how
| Seastar/ScyllaDB work: each thread is bound to a CPU on the
| host and has its own reactor (event loop) with coroutines
| on it. QEMU has a similar design.
| yellowapple wrote:
| It's also (to my knowledge) how Erlang's VMs (e.g. BEAM)
| work: one thread per CPU core, and a VM on each thread
| preemptively switching between processes.
| BiteCode_dev wrote:
| It would not be pointless at all, because while one thread
| may lock on CPU, context switching will let another one
| deal with IO. This can let you smooth out the progress of
| each part of your program, and can be useful for workloads
| where you don't want anything to block for too long.
| heavyset_go wrote:
| I read it as each process having multiple threads _and_ an
| event loop. If the threads are performing I/O or calling
| out to compiled code and releasing the GIL, said GIL won't
| block the event loop.
| azinman2 wrote:
| This entire article is about removing the GIL
| Zarathust wrote:
| "Viable" as in "you have no other choice sometimes". This
| forces you to deal with 3 libraries, each with their own
| quirks, pitfalls and incompatibilities. Sometimes you even
| deal with dependencies reimplementing some parts in a 4th or
| 5th library to deal with shortcomings.
|
| I really don't care that much which of them survive, I just
| want to rely on fewer of them
| BiteCode_dev wrote:
| No, it's just useful. They are techs with different
| trade-offs, and life is full of opportunities.
| throwaway81523 wrote:
| Python Zen = one obvious way to do it. Having a bunch of
| very different ones, each with serious disadvantages, is
| a bad look.
| heavyset_go wrote:
| Zen of Python is an ideal, and at this point, kind of
| tongue-in-cheek.
|
| This is the same language that shipped with at least 3
| different methods to apply functions across iterables when the
| Zen of Python was adopted as a PEP in 2004.
| throwaway81523 wrote:
| There is at least some recognition in those cases that they
| introduced the new thing because they got it wrong in the old
| thing. That's different than saying they should co-exist on
| equal terms.
| BiteCode_dev wrote:
| It's a technical thread, not a political one. If you were so
| sure of your argument, you wouldn't use a throwaway.
|
| Besides, it's weird, like saying we should not have int,
| float and complex, there should be one way to do it.
|
| Just because those are 3 numbers doesn't mean they don't each
| have their own specific benefit.
| throwaway81523 wrote:
| int, float, and complex are for different purposes. async
| and threads paper over each others' weaknesses, instead of
| fixing the weaknesses at the start. Async itself is an
| antipattern (technical opinion, so there) but Python uses it
| because of the hazards and high costs of threads. Chuck
| Moore figured out 50 years ago how to keep the async stuff
| out of the programmer's way, when he put multitasking into
| Polyforth, which ran on tiny machines. Python (and Node)
| still make the programmer deal with it.
|
| If you look at Haskell, Erlang/Elixir, and Go, they all let
| you write performant sequential code by pushing the async
| into the runtime where the programmer doesn't have to see
| it. Python had an opportunity to do the same, but stayed
| with async and coroutines. What a pain.
| throwaway81523 wrote:
| Yes, I have a big sense of tragedy about Python 3. Python
| should run on something like (or maybe the actual) Erlang BEAM
| with lightweight isolated processes. All my threaded Python
| code is written using that style anyway (threads communicating
| through synchronized queues) and I've almost never needed
| traditional shared mutable objects. Maybe completely never,
| but I'm not sure about a certain program any more.
|
| Added: I don't understand the downvotes. If Python 3 was going
| to make an incompatible departure from Python 2, they might as
| well have done stuff like the above, that brought real
| benefits. Instead they had 10+ years of pain over relatively
| minor changes that arguably weren't all improvements.
| BeetleB wrote:
| You are likely being downvoted because most claims about the
| pain of a Python 3 transition are inflated/hyperbole.
|
| It took less than a day to migrate all my code to Python 3.
| And by "less than a day" I mean "less than 2 hours". Granted,
| bigger projects would take longer, but saying stuff like "10+
| years of pain" is ridiculous. Probably less than 1% of
| projects had serious issues with the migration. We just hear
| of a few popular ones that had some pain and assume that was
| representative.
| KaiserPro wrote:
| > easy-to-use concurrency paradigm
|
| Well it has queues and threads already.
|
| It's just that asyncio, for socket handling at least (in the
| testing that I did), is about 5% faster. (one asyncio socket
| "server" vs ten threads [with a number of ways to monitor for
| new connections])
|
| I always assumed that people wanted asyncio because they
| looked at JavaScript and thought "hey, I want GOTOs cosplaying
| as a fun paradigm"
| BiteCode_dev wrote:
| GOTO cosplaying should go away with structured concurrency
| (via TaskGroup) being adopted in 3.11, as pioneered by Trio.
|
| Check out anyio if you want to use them now.
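
A minimal sketch of the structured-concurrency style referred to
above, using anyio's task groups (anyio 3.x API; the coroutine
names are illustrative):

    import anyio

    async def greet(name):
        await anyio.sleep(0.1)  # stand-in for real async work
        print(f"hello, {name}")

    async def main():
        # The task group owns every task started inside it: the
        # "async with" block does not exit until all of them have
        # finished, and an error in one cancels the others.
        async with anyio.create_task_group() as tg:
            tg.start_soon(greet, "alice")
            tg.start_soon(greet, "bob")

    anyio.run(main)
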
| nine_k wrote:
| BTW I wonder why async is so painless in ES6 compared to Python.
| Why the presence of the GIL (which JS also has) did not make
| running async coroutines completely transparent, as it made
| running generators (which are, well, coroutines already). Why
| the whole event loop thing is even visible at all.
| laurencerowe wrote:
| Because JavaScript never had threads, so I/O in JavaScript has
| always been non-blocking and the whole ecosystem surrounding
| it has grown up under that assumption.
|
| JavaScript doesn't need a GIL because it doesn't really have
| threads. WebWorkers are more akin to multiprocessing than
| threads in Python. Objects cannot be shared directly across
| WebWorkers so transferring data comes with the expense of
| serializing/deserializing at the boundary.
| catlifeonmars wrote:
| JS now has shared array buffers.
| laurencerowe wrote:
| SharedArrayBuffer is just raw memory, similar to using mmap
| from Python multiprocessing. The developer experience is
| very different to simply sharing objects across threads.
| BiteCode_dev wrote:
| I used them both extensively, and here are the main reasons I
| can think of:
|
| - The event loop in JS is invisible and implicit. V8 proved it
| can be done without paying a cost for it, and in fact most
| real life Python projects are using uvloop because it's faster
| than asyncio's default loop. JS devs don't think of the loop
| at all, because it's always been there. They don't have to
| choose a loop, or think about its lifecycle or scheduling. The
| API doesn't show the loop at all.
|
| - Asynchronous functions in JS are scheduled automatically. In
| Python, calling a coroutine function does... nothing. You have
| to either await it, or pass it to something like
| asyncio.create_task(). The latter is not only verbose, it's
| not intuitive.
|
| - Async JS functions can be called from sync functions
| transparently. It just returns a Promise after all, and you
| can use good old callbacks. Instantiating a Python coroutine
| does... nothing, as we said. You need to schedule it AND await
| it. If you don't, it may or may not be executed. Which is why
| asyncio.gather() and co are to be used in Python. Most people
| don't know that, and even if you know, it's verbose, and you
| can forget. All that, again, because using the event loop must
| be explicit. That's one thing TaskGroup from trio will help
| with in the next Python versions...
|
| - The early asyncio API sucked. The new one is ok;
| asyncio.run() and create_task() with an implicit loop are a
| huge improvement. But you better use 3.7 at least. And you
| have to think about all the options for awaiting:
| https://stackoverflow.com/questions/42231161/asyncio-gather-...
|
| - asyncio tutorials and docs are not great, people have no
| idea how to use it. Since it's more complex, it compounds.
|
| E.g., if you use await:
|
| With node v14.8+:
|
|     await async_func(params)
|
| With python 3.7+:
|
|     import asyncio
|
|     async def main():
|         # no top level await, it must happen in a loop
|         await async_func(params)
|
|     asyncio.run(main())  # explicit loop, but easy thanks to 3.7
|
| E.g., deep inside function calls, but no await:
|
| With node:
|
|     ...
|     async_func(params)
|
| With python 3.7+:
|
|     ...
|     # async_func(params) alone would do nothing
|     res = asyncio.create_task(async_func(params))
|     ...
|     # you MAY get away with not using gather() or wait()
|     # but you also may get "coroutine is never awaited"
|     # RuntimeWarning: coroutine 'async_func' was never awaited
|     asyncio.gather(res)
|
| Of course, you could use "run_until_complete()", but then you
| would be blocking. Which is just not possible in JS: there is
| one way to do it, and it's always non-blocking and easy.
| Ironic, isn't it? Besides, which Python dev knows all this?
| I'm guessing most readers of this post will have heard of it
| for the first time.
|
| Python is my favorite language, and I can live with the
| explicit loop, but explicit scheduling is ridiculous. Just run
| the damn coroutine, I'm not instantiating it for the beauty of
| it. If I want a lazy construct, I can always make a factory.
|
| Now, thanks to the trio nursery concept, we will get TaskGroup
| in the next release (also, you can already use them with
| anyio):
|
|     async with asyncio.TaskGroup() as tg:
|         tg.start_soon(async_func, params)
|
| Which, while still verbose, is way better:
|
| - no gather or wait. Schedule it, it will run or be cleaned
| up.
|
| - no need to choose an awaiting strat, or learn about a 1000
| things. This works for every case. Wanna use it in a sync
| call? Pass the tg reference in it.
|
| - lifecycle is cleanly scoped, a real problem with a lot of
| async code (including in JS, where it doesn't have a clean
| solution)
| heavyset_go wrote:
| > _Why the whole event loop thing is even visible at all._
|
| It isn't anymore.
|
|     In [3]: from asyncio import run
|
|     In [4]: async def async_func():
|                 print('Ran async_func()')
|
|     In [5]: run(async_func())
|     Ran async_func()
|
| Top-level async/await is also available in the Python REPL
| and IPython, and there are discussions on the Python mailing
| list about making top-level async/await the default for
| Python[1].
|
|     In [1]: async def async_func():
|                 print('Ran async_func()')
|
|     In [2]: await async_func()
|     Ran async_func()
|
| [1] https://groups.google.com/g/python-ideas/c/PN1_j7Md4j0/m/0xy...
| BiteCode_dev wrote:
| Oh, top level await... I missed that.
|
| Not sure it will get there, but it would be nice. I think
| putting a top level "await" is explicit enough for stating
| you want an event loop anyway.
|
| Now, with TaskGroup in 3.11, things are going to get pretty
| nice, especially if this top level await plays out, provided
| they include async for and async with in the mix.
|
| Now, if they could just make it so that coroutines are
| automatically scheduled to the nearest task group, we would
| almost have something usable.
| dekhn wrote:
| so true. I've been writing thread-callback code for decades
| (common in network and gui event loops, see QtPy as an example)
| and when I looked at asyncio my first thought was "this is not
| better". It's entirely nontrivial to analyze code using asyncio
| (or yield) compared to callbacks.
| harpiaharpyja wrote:
| If you are ever considering making use of asyncio for your
| project, I would strongly recommend taking a look at curio [1]
| as an alternative. It's like asyncio but far, far easier to
| use.
|
| [1] https://curio.readthedocs.io/en/latest/index.html
| acidbaseextract wrote:
| The video (or blog post) below is one of the best
| explanations I've seen about what subtle bugs are easy to
| make with asyncio, why it's easy to make them, and how the
| trio library addresses them.
|
| But yes, consider alternatives before you pick asyncio as
| your approach!
|
| Talk: https://www.youtube.com/watch?v=oLkfnc_UMcE
|
| Blog post: https://vorpus.org/blog/notes-on-structured-concurrency-or-g...
| VWWHFSfQ wrote:
| Highly recommend curio
| [deleted]
| BiteCode_dev wrote:
| While the design of Curio is quite interesting, it may not be
| a good choice, not for technical reasons, but for logistical
| reasons: the chances it gets wide adoption are slim to None.
|
| And since we are stuck with colored functions in Python, the
| choice of stack matters very much.
|
| Now, if you want easier concurrency, and a solution to a lot
| of the concurrency problems that curio solves, while still
| being compatible with asyncio, use anyio:
|
| https://anyio.readthedocs.io/en/stable/
|
| It's a layer that works on top of asyncio, so it's compatible
| with all of it. But it features the nursery concept from Trio,
| which makes async programming so much simpler and safer.
| heavyset_go wrote:
| anyio is also compatible with asyncio and Trio, so you can
| use it with either library or paradigm.
| sandGorgon wrote:
| Uvloop/uvicorn - the production-grade ASGI server - only
| works with asyncio.
|
| Hypercorn works with trio... but you lose a LOT of
| performance
| quietbritishjim wrote:
| Curio's spiritual successor is Trio [1], which was written by
| one of the main Curio contributors and is more actively
| maintained (and, at this point, much more widely used). Like
| Curio, it's much easier to use than asyncio, although ideas
| from it are gradually being incorporated back into asyncio,
| e.g. asyncio.run() was inspired by curio.run()/trio.run().
|
| I have used Trio in real projects and I thoroughly recommend
| it.
|
| This blog post [2] by the creator of Trio explains some of
| the benefits of those libraries in a very readable way.
|
| [1] https://trio.readthedocs.io/en/stable/
|
| [2] https://vorpus.org/blog/some-thoughts-on-asynchronous-api-de...
| btown wrote:
| > instead we got a natively supported, robust, easy-to-use
| concurrency paradigm built around green/virtual threading that
| accommodates both IO and CPU bound work
|
| Minus the "natively supported" part, we have this today in
| http://www.gevent.org/ ! It's so, so empowering to be able to
| access the entire historical body of work of synchronous-I/O
| Python libraries, and with a single monkey patch cause every
| I/O operation, no matter how deep in the stack, to yield to
| your greenlet pool _without code changes_.
|
| We fire up one process per core (gevent doesn't have good
| support for multiprocessing, but if you're relying on that,
| you're stuck on one machine anyways), spend perhaps 1
| person-day a quarter dealing with its quirks, and in turn we
| never need to worry about the latencies of external services;
| our web servers and batch workers have throughput limited only
| by CPU and RAM, for which there's relatively little (though
| nonzero) overhead.
|
| IMO Python should have leaned into official adoption of
| gevent. It may not beat asyncio in raw performance numbers,
| because asyncio can rely on custom-built bytecode instructions
| whereas gevent has "userspace" code that must execute upon
| every yield. And, as with asyncio, you have to be careful
| about CPU-intensive code that may prevent you from yielding.
| But it's perfect for most horizontal-scaling soft-realtime
| web-style use cases.
| int_19h wrote:
| How would those green/virtual threads interface with native
| async APIs (e.g. the entirety of WinRT)?
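
A minimal sketch of the monkey-patching workflow btown describes,
using gevent's public API (the URL list is illustrative):

    from gevent import monkey
    monkey.patch_all()  # patch sockets etc. before other imports

    import gevent
    import urllib.request

    def fetch(url):
        # Plain blocking code: the patched socket module yields
        # to the greenlet hub instead of blocking the process.
        return urllib.request.urlopen(url).read()

    urls = ["http://example.com"] * 3  # illustrative
    jobs = [gevent.spawn(fetch, u) for u in urls]
    gevent.joinall(jobs)
    print([len(j.value) for j in jobs])
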
| tbabb wrote:
| What specifically is the problem with asyncio? I quite like
| using it, so I'm curious if there's some aspect that makes it
| unsustainable?
| fullstop wrote:
| I like using it as well, but I've been bit several times by
| having runtime exceptions completely swallowed.
| throwaway81523 wrote:
| > What specifically is the problem with asyncio?
|
| Watch the very NSFW (lots of swearing) but hysterically funny
| video "node.js is bad ass rock star tech" on youtube sometime
| ;). https://www.youtube.com/watch?v=bzkRVzciAZg
| calpaterson wrote:
| The key disadvantage is largely that it bifurcates the library
| base. Async libraries and sync libraries co-exist uneasily in
| the same program.
|
| For nearly every popular library there is now a (usually
| inferior, less robust) async one. The benefits of Linus' Law
| are reduced.
| Redoubts wrote:
| Trifurcates, since now there's stdlib asyncio, and a popular
| trio async flavor too.
| tbabb wrote:
| Fair point. Async is a big enough idea that it probably
| warrants designing the language with it in mind. I guess
| another way of phrasing it would be that it violates the
| "there's only one way to do it" maxim, and the "two ways of
| doing it" circumstance necessarily came about because the idea
| was discovered long after the core language and libraries were
| already written.
| aeyes wrote:
| It solves only one problem, the name says it: async I/O.
|
| If you do anything on the CPU, or if you have any I/O which is
| not async, you stall the event loop and everything grinds to a
| halt.
|
| Imagine a program which needs to send heartbeats or data to a
| server in a short interval to show liveness, Kafka for
| example. Asyncio alone can't reliably do this; you need to
| take great care to not stall the event loop. You only have
| exactly one CPU core to work with: if you do work on the CPU,
| you stall the event loop.
|
| We see web frameworks built on asyncio, but even simple
| API-only applications constantly need to serialize data, which
| is CPU-bound. These frameworks make no effort (and asyncio
| doesn't give us any tools) to protect the event loop from
| getting stalled by your code. They work great in simple
| benchmarks and for a few types of applications, but you have
| to know the limits. And I feel that the general public does
| not know the limitations of asyncio: it wasn't made for
| building web frameworks on the async event loop. It was made
| for communicating with external services like databases and
| calling APIs.
| twic wrote:
| > With this scheme, the reference count in each object is split
| in two, with one "local" count for the owner (creator) of the
| object and a shared count for all other threads. Since the owner
| has exclusive access to its count, increments and decrements can
| be done with fast, non-atomic instructions. Any other thread
| accessing the object will use atomic operations on the shared
| reference count.
|
| > Whenever the owning thread drops a reference to an object, it
| checks both reference counts against zero. If both the local and
| the shared count are zero, the object can be freed, since no
| other references exist. If the local count is zero but the
| shared count is not, a special bit is set to indicate that the
| owning thread has dropped the object; any subsequent decrements
| of the shared count will then free the object if that count goes
| to zero.
|
| So in this program:
|
|     import threading
|
|     def produce():
|         global global_foo
|         local_foo = "potato"
|         global_foo = local_foo
|
|     def consume():
|         global global_foo
|         local_foo = global_foo
|         global_foo = None
|
|     if __name__ == '__main__':
|         produce()
|         thread = threading.Thread(target=consume)
|         thread.start()
|         thread.join()
|
| What happens to the counts on the string "potato"?
|
| In produce, the main thread creates it and puts it in a local,
| and increments the local count. It assigns it to a global, and
| increments the local count. It then drops the local when produce
| returns, and decrements the local count. In consume, the second
| thread copies the global to a local, and increments the shared
| count. It clears out the global, and decrements the shared
| count. It then drops the local when consume returns, and
| decrements the shared count.
|
| That leaves the local count at 1 and the shared count at -1!
|
| You might think that there must be special handling around
| globals, but that doesn't fix it. Wrap the string in a perfectly
| ordinary list, and put the list in the global, and you have the
| same problem.
|
| I imagine this is explained in the paper by Choi et al, but i
| have not read it!
| ameixaseca wrote:
| I couldn't find this in the design document but the only
| obvious solution is to track globals via the shared count.
| Since a global reference is part of all threads
| simultaneously, it cannot be treated as local.
|
| If you follow this reasoning, the operations above result in
| local=0/shared=0 after the last assignment.
| twic wrote:
| As i said in the comment, that doesn't work. Put a list in
| the global, and then push and pop the string on the list.
| Even better, push the string into a local list, then put that
| list in another local list, then put that in a global, etc.
| You would need to dynamically keep every object reachable
| from a global marked as such, and that's a non-starter.
| twic wrote:
| The paper:
|
| > When the shared counter for an object becomes negative for
| the first time, the non-owner thread updating the counter also
| sets the object's Queued flag. In addition, it puts the object
| in a linked list belonging to the object's owner thread called
| QueuedObjects. Without any special action, this object would
| leak. This is because, even after all the references to the
| object are removed, the biased counter will not reach zero --
| since the shared counter is negative. As a result, the owner
| would trigger neither a counter merge nor a potential
| subsequent object deallocation.
|
| > To handle this case, BRC provides a path for the owner thread
| to explicitly merge the counters called the ExplicitMerge
| operation. Specifically, each thread has its own thread-safe
| QueuedObjects list. The thread owns the objects in the list. At
| regular intervals, a thread examines its list. For each queued
| object, the thread merges the object's counters by accumulating
| the biased counter into the shared counter. If the sum is zero,
| the thread deallocates the object. Otherwise, the thread
| unbiases the object, and sets the Merged flag. Then, when a
| thread sets the shared counter to zero, it will deallocate the
| object. Overall, as shown in invariant I4, an owner only gives
| up ownership when it merges the counters.
|
| Well, that works, but it's a bit naff.
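
A toy, single-process model of the counter life cycle the paper
describes may make this concrete. This is a sketch only: the names
(biased, shared, merged, merge) mirror the paper's terminology, but
the real implementation is in C, uses atomic operations for the
shared counter, and queues objects to the owner thread instead of
merging inline.

    import threading

    def deallocate(obj):
        print("freed")  # stand-in for actually releasing memory

    class BRCObject:
        def __init__(self):
            self.owner = threading.get_ident()
            self.biased = 1      # owner-only; plain non-atomic updates
            self.shared = 0      # other threads; atomic in the real scheme
            self.merged = False  # owner has given up ownership

    def incref(obj):
        if threading.get_ident() == obj.owner and not obj.merged:
            obj.biased += 1      # fast path, no atomics
        else:
            obj.shared += 1

    def decref(obj):
        if threading.get_ident() == obj.owner and not obj.merged:
            obj.biased -= 1
            if obj.biased == 0:
                merge(obj)       # owner folds the counters together
        else:
            obj.shared -= 1      # may go negative, as in twic's example
            if obj.merged and obj.shared == 0:
                deallocate(obj)

    def merge(obj):
        # ExplicitMerge: accumulate the biased count into the shared
        # count; free the object if nothing else holds a reference,
        # otherwise unbias it and let the shared count decide later.
        obj.shared += obj.biased
        obj.biased = 0
        if obj.shared == 0:
            deallocate(obj)
        else:
            obj.merged = True
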
| Jtsummers wrote:
| Two spaces in front of each line of the code block. As written,
| right now, your comment is hard to parse:
|
|     import threading def produce(): global global_foo local_foo
|     = "potato" global_foo = local_foo def consume(): global
|     global_foo local_foo = global_foo global_foo = None if
|     __name__ == '__main__': produce() thread =
|     threading.Thread(target=consume) thread.start() thread.join()
| twic wrote:
| Sorry about that. I did indent before pasting the code - but
| gedit indents with tabs, which HN ignores!
| kzrdude wrote:
| It can be configured. I remember when Gedit was quite the
| potent editor, language plugins, snippets and stuff. But it
| still has the basics :)
| overgard wrote:
| I feel like GvR just doesn't want to change things. Feels
| doomed.
|
| This has been a problem for like 20 years and they have refused
| fixes before. And there have been fixes. They just don't see
| this as important.
|
| It's practically a religion that it's a thing they won't
| change.
| heavyset_go wrote:
| I disagree entirely. The last few releases of Python have made
| significant changes to the language, coinciding with the
| project becoming community-led after Guido stepped down.
| mixmastamyk wrote:
| Yes, stepped down as a result of him forcing the walrus
| operator change into the language over significant
| opposition.
| randlet wrote:
| Guido is no longer the BDFL and spoke fairly positively about
| this change in the mailing list thread[1].
|
| "To be clear, Sam's basic approach is a bit slower for single-
| threaded code, and he admits that. But to sweeten the pot he
| has also applied a bunch of unrelated speedups that make it
| faster in general, so that overall it's always a win. But
| presumably we could upstream the latter easily, separately from
| the GIL-freeing part."
|
| [1] https://mail.python.org/archives/list/python-dev@python.org/...
| Waterluvian wrote:
| I can see it now:
|
| "My program has 2^64 references to an object, which caused it to
| become immortal"
|
| =)
| notriddle wrote:
| In a 64-bit address space, with objects requiring more than one
| word to store, that's literally impossible.
| The_rationalist wrote:
| Or you could just use GraalVM Python
| https://github.com/oracle/graalpython
| misnome wrote:
| > This "optimization" actually slows single-threaded accesses
| down slightly, according to the design document, but that
| penalty becomes worthwhile once multi-threaded execution becomes
| possible.
|
| My understanding was that CPython viewed any single-threaded
| performance regression as a blocker to GIL-removal attempts,
| regardless of if other work by the developer has sped up the
| interpreter? This article seems to somewhat gloss over that with
| "it's only small". I'd be interested in knowing other
| estimations of what the "better-than-average chance" of this
| (promising sounding) attempt were.
|
| Breaking C extensions (especially the less-conforming ones,
| which seem likely to be the least maintained) also seems like it
| would be a very hard pill to swallow, and the sort of thing that
| might make it a Python 3-to-4 breaking change, which I imagine
| would also be approached extremely carefully given there are
| still people to this day who believe that Python 3 is a mistake
| and one day everyone will realise it and go back to Python 2
| (yes, really).
| singhrac wrote:
| From the article:
|
| > Gross has also put some significant work into improving the
| performance of the CPython interpreter in general.
| This was done to address the concern that has blocked
| GIL-removal work in the past: the performance impact on
| single-threaded code. The end result is that the new
| interpreter is 10% faster than CPython 3.9 for single-threaded
| programs.
| singhrac wrote:
| Sorry, to be clear, I missed your point "regardless of if
| other work by the developer has sped up the interpreter".
| That's fair, though my personal opinion is that that seems
| like an incredibly high bar for any language.
| a1369209993 wrote:
| > and one day everyone will realise it
|
| No? Why would we think that? There are people who willingly use
| _java_; compared to that, the problems with Python 3 are
| downright non-obvious as long as you never need to work with
| things like non-Unicode text.
| Kranar wrote:
| C extensions can continue to be supported. Said extensions
| already explicitly lock/release the GIL, so to keep things
| backwards compatible it would be perfectly fine if there was a
| GIL that existed strictly for C extension compatibility.
| masklinn wrote:
| > My understanding was that CPython viewed any single-threaded
| performance regression as a blocker to GIL-removal attempts,
| regardless of if other work by the developer has sped up the
| interpreter?
|
| Previous GILectomy attempts incurred significant single-
| threaded performance penalties, on the order of 50% or above.
| If Gross's work yields a low single-digit performance penalty,
| it's pretty likely to be accepted, as this is the sort of
| impact which can happen semi-routinely as part of interpreter
| updates.
|
| The complete breakage of C extensions would be a much bigger
| issue.
| ajkjk wrote:
| There are people who believe all kinds of crazy things; that
| doesn't make them true. Going back to Python 2 is not going to
| ever happen (and no one working on Py3 would ever want to,
| anyway).
|
| A hard pill to swallow... ain't that bad if it also benefits
| you tremendously, which fixing the GIL would do.
| EamonnMR wrote:
| I do wish for a world where Python 3 had handled
| unicode/bytes very differently.
| fatbird wrote:
| It was Guido's requirement that GIL removal not degrade single-
| threaded performance at all, but in the talk I attended at
| PyCon 2019, the speaker mentioned nothing about qualifications
| on that. Guido's restriction was presented, quite reasonably,
| as "no one should have to suffer because of removing the GIL".
| So a net break-even or performance improvement is fine.
|
| And on top of that, Guido has retired now, and the steering
| committee may feel differently as long as the spirit of the
| restrictions is upheld.
| fatbird wrote:
| Guido has replied to Gross's announcement to observe that his
| performance improvements are not tied to removing the GIL and
| could be accepted separately. But he doesn't reject Gross's
| work outright, and if the same release that includes the GIL
| removal also delivers a concrete performance upgrade, I
| suspect that Guido would be fine with it. His concern is,
| after all, practical, to do with the actual use of Python and
| not some architectural principle.
| efoto wrote:
| "The biggest source of problems might be multi-threaded programs
| with concurrency-related bugs that have been masked by the GIL
| until now."
| zinodaur wrote:
| > concurrency-related bugs that have been masked by the GIL
|
| Yeah... could phrase this as "All programs written with the
| assumption of a GIL are now broken" instead.
| Wish they had done this as part of the breaking changes for
| Python 3; I guess they'll have to wait for Python 4 for this?
| Animats wrote:
| Yes. I once discovered that CPickle was not thread-safe. The
| response was that much of the library didn't really work in
| multi-threaded programs.
| formerly_proven wrote:
| You mean programs where you put an object into pickle and
| some other threads modify it while pickle is processing it?
| Doesn't surprise me - the equivalent written in plain Python
| would be very thread unsafe as well.
| Animats wrote:
| No, I mean several threads doing completely separate
| CPickle streams with no shared data or variables at the
| Python level.
| kzrdude wrote:
| Has it since been fixed?
| toyg wrote:
| Probably not. CPickle is famously shunned by anyone who
| has to do serious, performance-critical
| serialization/deserialization.
| kzrdude wrote:
| I was curious, and an issue that fits the description was
| fixed in Py 3.7.x here: https://bugs.python.org/issue34572
| but other threading bugs remain:
| https://bugs.python.org/issue38884
| nomdep wrote:
| If the Python maintainers don't want to approve this, Gross
| should talk to the PyPy developers.
| otterley wrote:
| This may be a silly question, but if you really need
| concurrency, why not use a language that's built for
| concurrency from the ground up instead? Elixir is a great
| example.
| klyrs wrote:
| I rarely need concurrency, and do a lot of Python because it's
| what all my dependencies are written in. But sometimes, I find
| myself bottlenecked on a trivially parallelizable operation.
| In that state (my dependencies are in Python, I have a working
| Python implementation), there's _no way in hell_ that (rewrite
| my dependencies in Elixir, rewrite my code in Elixir) is a
| sensible next move.
| lucb1e wrote:
| Are you proposing to write anything that will need concurrency
| anywhere in your favorite language, or just call into the
| concurrent code from Python? (Since comments like
| https://news.ycombinator.com/item?id=28883990 seem to be
| taking it as the former whereas I took it as the latter.)
| ska wrote:
| To a first approximation, people don't use Python for itself,
| they use it for the vast ecosystem and network effect. If you
| jump to another language for better concurrency, what are you
| giving up?
|
| Unless you really are doing greenfield development in an
| isolated application, these considerations often trump any
| language feature.
| otterley wrote:
| Don't get me wrong; I'm not suggesting that anyone dump
| Python altogether to switch to a different language for any
| arbitrary project or purpose. Many businesses I work with use
| different languages for different components or applications,
| using the network or storage to intercommunicate when
| necessary. The right tool for the job, as it were.
| ferdowsi wrote:
| There are some organizations with lots of domain knowledge and
| expertise around developing, securing and deploying Python,
| and they don't have the Innovation Currency to spend on
| investing in a new language.
|
| Specific to your point, recruiting for Elixir talent is a
| problem compared to more mainstream languages. Recruiting in
| general is extremely hard at this moment.
| otterley wrote:
| Given all the corner cases people are going to continue to
| find whilst trying to coax Python into behaving correctly in
| a highly concurrent program -- especially one that utilizes
| random libraries from the ecosystem -- I can't help but
| wonder whether the Innovation Currency is better spent
| replacing the components that require high concurrency (which
| often is only a subset of them) instead of getting stuck in
| the mire of bug-smashing.
| pmontra wrote:
| A possible answer is that everybody in the company knows Python
| and no other language. Another one is that they have to reuse
| or extend a bunch of existing Python code. The latter happened
| to me. Performance was definitely not a concern, but I suddenly
| needed threads doing extra functionality over the original
| single-threaded algorithm. BTW, I used a queue to pass messages
| between them.
| otterley wrote:
| Using multiple interpreters with message passing is a
| workable, if expensive, way to deal with the problem. It is
| trading one cost for another. (These sorts of tradeoffs are
| encountered all the time in business, to be sure.)
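
A minimal sketch of the queue-based message passing pmontra
describes, using only the standard library (the worker logic is
illustrative):

    import queue
    import threading

    tasks = queue.Queue()

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: time to shut down
                break
            print("processed", item)

    t = threading.Thread(target=worker)
    t.start()
    for i in range(3):
        tasks.put(i)   # Queue does the locking; the threads never
    tasks.put(None)    # touch shared mutable state directly
    t.join()
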
| Fordec wrote:
| Sometimes a project starts off aiming to solve a problem. Maybe
| it's a data science problem, so support already exists in
| Python, so let's do that. Ok, it worked great and it's catching
| on with users. Now we need to scale, but we are running into
| concurrency issues. What is a better answer? Ok, we will work
| on improving Python concurrency under the hood, or completely
| scrap the code base and switch to a different language?
|
| Very few people set out asking themselves about such low-level
| details on day one of a project. Especially something that was
| an MVP or POC.
| otterley wrote:
| I'll plead ignorance here: Do data science workflows often
| require high concurrency using a single interpreter? I
| thought all that stuff was compute-bound and parceled out to
| workers that farm out calculations to CPUs and GPUs.
| Animats wrote:
| Or you could just use PyPy, which uses a garbage collector,
| does more compile-time analysis, and runs much faster.
|
| CPython is a naive interpreter, like original JavaScript.
| There's been progress since then.
| llimllib wrote:
| lots of people need C extensions, which you can't* have on
| pypy.
|
| *: mostly true
| willvarfar wrote:
| Pypy is still single threaded.
| https://doc.pypy.org/en/latest/faq.html#does-pypy-have-a-gil...
|
| This work is super exciting! Can pypy use the same recipe to
| offer true parallelism plus the jit??
|
| Will be really interesting to see what pypy devs think of this
| work and how they might also leverage it!
| nas wrote:
| I think it can't use the same recipe. Sam's approach for
| CPython uses biased reference counting. Internally, Pypy uses
| a tracing garbage collector, not reference counting. I don't
| know how difficult it would be to make their GC thread-safe.
| Probably you don't want to "stop the world" on every GC pass,
| so I guess the changes are non-trivial.
|
| Sam's changes to CPython's container objects (dicts, lists),
| to make them thread-safe, might also be hard to port directly
| to Pypy. Pypy implements those objects differently.
| willvarfar wrote:
| I think the biggest thing it will give is a need to go
| there. Until now, pypy has been able to not do parallelism.
| But if cpython is suddenly faster for a big class of
| programs, pypy will have to bite the bullet to stay
| relevant?
| masklinn wrote:
| pypy also has a GIL.
| nerdponx wrote:
| PyPy being stuck on 3.7 hurts. If 3.8 support comes out soon,
| I'll be happy to switch for general-purpose work. 3.9 would be
| even nicer, to support the type annotation improvements. I
| donate every month, but I'm just an individual donating pocket
| change; it'd be great to see some corporate support for PyPy.
| calpaterson wrote:
| There are very few new features in 3.8.
|
| It is a much less important release (for features) than 3.7,
| which for example added dataclasses and lots of typing and
| asyncio stuff.
|
| The most significant change in 3.8 is a notoriously
| controversial new infix operator. Even its supporters would
| say that it's a niche use case.
| masklinn wrote:
| > There are very few new features in 3.8.
|
| > It is a much less important release (for features) than
| 3.7, which for example added dataclasses and lots of typing
| and asyncio stuff.
|
| That's funny, because my take is the exact opposite:
| dataclasses are not very useful (attrs exists and does more),
| deferred type annotations are meh, and contextvars,
| breakpoint(), and module-level getattr/setattr are not
| exactly anything you can't do without.
|
| Assignment expressions provide for great cleanups in some
| contexts (and avoiding redundant evaluations in e.g.
| comprehensions), expr= is tremendous for printf-debugging,
| posonly args are really useful, and \N in regex can much
| improve their readability when relevant.
|
| $dayjob has migrated to Python 3.7 and there's really nothing
| I'm excited to use (possibly aside from doing weird things
| with breakpoint), whereas 3.8 would be a genuine improvement
| to my day-to-day enjoyment.
| nerdponx wrote:
| Deferred type annotations with `from __future__ import
| annotations` are a game-changer IMO. You can use them in
| 3.7, which is good enough for me. The big improvement in 3.9
| is not having to use `typing.*` for a lot of basic data
| types.
|
| The biggest improvements between 3.7, 3.8, 3.9, and 3.10 are
| in `asyncio`, which was pretty rough in 3.7 and very usable
| in 3.9. I use the 3rd-party `anyio` library in a lot of
| cases anyway (https://anyio.readthedocs.io/), but it's not
| always feasible.
| laurencerowe wrote:
| It's been a few years since I last played around with PyPy,
| but while it provided amazing performance gains for simple
| algorithmic code I saw no speed-up on a more complex web
| application.
| typical182 wrote:
| This is a great list of influences on the design (from the
| article comments, where the prototype author Sam Gross
| responded to someone wishing for more cross-pollination across
| language communities):
|
| ----------
|
| "... but I'll give a few more examples specific to this project
| of ideas (or code) taken from other communities:
|
| - Biased reference counting (originally implemented for Swift)
|
| - mimalloc (originally developed for Koka and Lean)
|
| - The design of the internal locks is taken from WebKit
| (https://webkit.org/blog/6161/locking-in-webkit/)
|
| - The collection thread-safety adapts some code from FreeBSD
| (https://github.com/colesbury/nogil/blob/nogil/Python/qsbr.c)
|
| - The interpreter took ideas from LuaJIT and V8's ignition
| interpreter (the register-accumulator model from ignition, fast
| function calls and other perf ideas from LuaJIT)
|
| - The stop-the-world implementation is influenced by Go's design
| (https://github.com/golang/go/blob/fad4a16fd43f6a72b6917eff65...
| )"
| [deleted]
| Ericson2314 wrote:
| > Gross has also put some significant work into improving the
| performance of the CPython interpreter in general.
|
| Earmarks work, folks!
| a1369209993 wrote:
| > With this scheme, the reference count in each object is split
| in two, with one "local" count for the owner (creator) of the
| object and a shared count for all other threads. Since the owner
| has exclusive access to its count, increments and decrements can
| be done with fast, non-atomic instructions. Any other thread
| accessing the object will use atomic operations on the shared
| reference count.
|
| > Whenever the owning thread drops a reference to an object, it
| checks both reference counts against zero. If both the local and
| the shared count are zero, the object can be freed, since no
| other references exist. If the local count is zero but the
| shared count is not, _a special bit is set to indicate that the
| owning thread has dropped the object_; any subsequent decrements
| of the shared count will then free the object if that count goes
| to zero.
|
| This seems... off. Wouldn't it work better for the owning thread
| to hold (exactly) one atomic reference, which is released (using
| the same decref code as other threads) when the local reference
| count goes to zero?
|
| Edit: I probably should have explicitly noted that, as jetrink
| points out, the object is initialized with an atomic refcount of
| one (the "local refcount is nonzero" reference), and destroyed
| when the atomic refcount is one and to-be-decremented, so a
| purely local object never has atomic writes.
| [deleted]
| Someone wrote:
| > which is released (using the same decref code as other
| threads) when the local reference count goes to zero?
|
| (I may misunderstand your remark, as 'releasing' is a bit
| ambiguous. It could mean decreasing the reference count and
| freeing the memory if the count goes to zero, or just plain
| freeing the memory.)
|
| The local ref count can go to zero while other threads still
| have references to the object (e.g. when the allocating thread
| sends an object as a message to another thread and, knowing the
| message arrived, releases it), so freeing the memory when it
| does would be a serious bug.
|
| Also, the shared ref count can go negative. From the paper:
|
| > _As an example, consider two threads T1 and T2. Thread T1
| creates an object and sets itself as the owner of it. It points
| a global pointer to the object, setting the biased counter to
| one. Then, T2 overwrites the global pointer, decrementing the
| shared counter of the object. As a result, the shared counter
| becomes negative._
|
| That can't happen with the biased counter because, when it
| would end up going negative, the object gets unbiased, and the
| shared counter gets decreased instead.
|
| That asymmetry is what ensures that only a single thread
| updates the biased counter, so that no locks are needed to do
| that.
| a1369209993 wrote:
| > I may misunderstand your remark, as 'releasing' is a bit
| ambiguous.
|
| The _reference_ is released; i.e. the (atomic) reference
| count is decremented (and the object is only freed if that
| caused the atomic reference count to go to zero).
|
| > From the paper
|
| I missed that there was a paper and was referring to the
| proposed implementation in Python that was described in TFA.
| IIUC, biased refcount (in paper) is local (in my
| description), and shared is atomic, correct?
|
| > the shared ref count can go negative
|
| And _that_ makes sense. Thanks.
| (And also explains how to deal with references added by one
| thread and removed by another, when one of those threads is
| the object owner.)
| morelisp wrote:
| This seems like it would be less efficient if most objects
| don't escape their owning thread (you would need one atomic
| inc/dec versus zero), which is probably true of most objects.
| a1369209993 wrote:
| Sorry, should have been more clear; edited.
| johntb86 wrote:
| Suppose thread A (the owner) keeps a reference, but also puts
| another reference in a global variable. This would increment
| its local refcount to 1 and have a shared refcount of 1.
|
| Then thread B clears the global variable. With your scheme the
| local refcount would be 1 but the shared refcount would be 0,
| so thread B would destroy the object even though it's
| referenced by thread A.
| kccqzy wrote:
| I like this idea. In fact another possibility is to have a
| thread-local reference count for each thread that uses the
| object, which can use fast non-atomic operations, and then
| each thread can use a shared atomic reference count that
| counts how many threads use the object. When each thread-local
| count goes to zero, the shared count is decremented by one.
|
| This way, if an object is created in one thread and
| transferred to another, the other thread wouldn't even need to
| do a lot of atomic reference count manipulations. There
| wouldn't be surprising behavior in which different threads run
| the same code with different speed, just by virtue of whether
| they created the objects or not.
| Fronzie wrote:
| Good point. Windows COM did not follow your suggestion,
| leading to all sorts of awkwardness in applications that
| have compute- and UI-threads and share objects between the
| two. Object destruction becomes non-predictable and can hold
| up a UI thread.
| twic wrote:
| How would the storage be laid out?
|
| With the proposed scheme, there are two counters (and, I
| assume, the ID of the owning thread), so a small fixed-size
| structure, which can sit directly in the object header. With
| your scheme, you need a variable and unbounded number of
| counters. Where would they go?
| wishawa wrote:
| Maybe each thread could have a mapping storing the number of
| references held (in that thread) for each object? This way
| only the atomic refcount has to be in the object header.
| Also I don't think there would be an owning thread at all
| with this idea, so no ID needed.
| wishawa wrote:
| I think yours is a much cleaner design. In the original plan,
| if the owning thread just set the special bit, but before that
| set is propagated, another thread drops the shared refcount to
| zero, the object would never be released, would it?
|
| EDIT: never mind the question, I just read that the special
| bit is atomic.
| dennisafa wrote:
| I think that idea was mentioned earlier in the article:
|
| > The simplest change would be to replace non-atomic reference
| count operations with their atomic equivalents. However,
| atomic instructions are more expensive than their non-atomic
| counterparts. Replacing Py_INCREF and Py_DECREF with atomic
| variants would result in a 60% average slowdown on the
| pyperformance benchmark suite.
| a1369209993 wrote:
| > to replace non-atomic reference count operations with
| their atomic equivalents.
|
| Nope, my proposal still uses two reference counts (one
| atomic, one local); it just avoids having a separate flag
| bit to indicate that the owning thread is done.
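
A toy model of the variant proposed in this subthread: the shared
(atomic) count is born at one, standing in for "the local count is
nonzero", so a purely thread-local object never touches the atomic
counter after creation. The names are illustrative, and the atomics
are only gestured at in comments:

    class Obj:
        def __init__(self):
            self.local = 1   # owner-only count; non-atomic updates
            self.shared = 1  # atomic; this 1 represents "local > 0"

    def owner_decref(obj):
        obj.local -= 1
        if obj.local == 0:
            # release the implicit shared reference, using the same
            # code path any other thread would use
            shared_decref(obj)

    def shared_decref(obj):
        obj.shared -= 1      # atomic in a real implementation
        if obj.shared == 0:
            print("freed")   # no references remain anywhere
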
| cogman10 wrote:
| Yeah, I'm not exactly getting all the complexity here.
|
| I'm digging the 2 reference counters, that makes sense to me,
| but I don't know why it isn't something more like:
|
| "every time a new thread takes a reference, atomic +1, every
| time a new thread's local count hits 0, atomic -1. If the
| shared reference is 0, free".
|
| IDK what special purpose the flags are serving here.
| [deleted]
| jeremyjh wrote:
| Most objects are never shared, so there would be a performance
| impact from incrementing (and decrementing) an atomic counter
| even just once.
| jetrink wrote:
| That is true, but what if the shared count were initialized
| to one, and the creator thread freed an object when the
| shared count is equal to one and the local count is
| decremented to zero? (Since it knows it holds one shared
| reference.) Then the increment and decrement would be avoided
| for non-shared objects.
| a1369209993 wrote:
| I probably should have mentioned that explicitly; edited.
| kzrdude wrote:
| To have one local count per thread would add memory overhead,
| I think? In his solution there are only two counters per
| object, local and shared.
|
| Any other thread can't know if it's the first time or not it's
| taking a reference to an object.
| AgentME wrote:
| > Edit: I probably should have explicitly noted that, as
| jetrink points out, the object is initialized with an atomic
| refcount of one (the "local refcount is nonzero" reference),
| and destroyed when the atomic refcount is one and
| to-be-decremented, so a purely local object never has atomic
| writes.
|
| I think you're under the impression that the refcount would
| only ever need to be incremented if the object was shared to
| another thread, but that's not the case. The refcount isn't a
| count of how many threads have a reference to the object; it's
| a count of how many objects and closures have a reference to
| the object. Even objects that never leave their creator thread
| will be likely to have their reference count incremented and
| decremented a few times over their life.
| a1369209993 wrote:
| > Even objects that never leave their creator thread will be
| likely to have their reference count incremented and
| decremented a few times over their life.
|
| I think you're under the impression that there's only one
| refcount. The point of the original design (and this one) is
| that there are two refcounts: one that's updated only by the
| thread that created the object, and therefore doesn't need
| to use slow atomic accesses, and one that's updated
| atomically, and therefore can be adjusted by arbitrary
| threads.
| AgentME wrote:
| Oh, I misunderstood you then. I thought you were trying to
| get rid of the local refcount and make the atomic one handle
| its job too, but what you're suggesting is a possible
| simplification of the logic that detects when it's time to
| destroy the object. That makes sense, just seems more minor
| than I thought you were going at, and I guess I missed it.
| jeremyjh wrote:
| Threads don't hold references - other objects do, and we have
| to know how many do. If threads held a reference it might
| never be released. Since most objects are never shared, we
| wouldn't want to increment an atomic counter even once for
| those.
| a1369209993 wrote:
| I don't _think_ the objection you're actually making is valid
| (the extra atomic reference is just a representation of the
| fact that the local refcount is nonzero), but come to think
| of it, even in the original version, how the heck does a
| thread know whether a reference held by (say) a dictionary
| that is itself accessible to multiple threads was increfed by
| the owning thread or another thread?
| lormayna wrote:
| Why not use something like trio or curio? They are quite easy
| to learn, very powerful, and have an approach similar to
| channels in golang.
| synchronizing wrote:
| async != multithreading
| calpaterson wrote:
| Those are not multithreading, they are asynchronous IO, which
| is different. With asynchronous IO in Python the only
| concurrency/parallelism you can do is for IO.
|
| Multithreading in Python currently has the same limitation,
| though it needn't.
| rich_sasha wrote:
| This is some of the best news I read in a while!
|
| Multiprocessing sort of works but it's really sucky.
| jeremyis wrote:
| Way to go Sam! Mark my words: our generation's Carmack!
| dsr_ wrote:
| I'm going to assume that there is a reason that this isn't a
| switch control, so that the default is a single-threaded
| program and the programmer needs to state explicitly that this
| one will be multi-threaded, upon which the interpreter changes
| into the atomic mode for the rest of execution?
| toxik wrote:
| That would be expensive.
| mikepurvis wrote:
| Basically no one would get the glorious single-threaded
| performance then, since the first time you pip install
| anything, you're going to discover that it spins up a thread
| under the hood that you're never exposed to.
|
| Or worse, you end up with the async schism all over again,
| with new "threadless" versions of popular libraries springing
| up.
| __s wrote:
| Most references are thread-local, where this implementation
| will still beat out atomic refcounts in a multi-threaded app
| sandGorgon wrote:
| has anyone built and run this in docker? would love to test
| this out - i don't have a lot of experience in compiling
| python inside docker
|
| EDIT: there is a dockerfile in there
| https://raw.githubusercontent.com/colesbury/nogil/nogil/Dock...
| cormacrelf wrote:
| > _"biased reference counts" and is described in this paper by
| Jiho Choi et al. With this scheme, the reference count in each
| object is split in two, with one "local" count for the owner
| (creator) of the object and a shared count for all other
| threads_
|
| > _The interpreter's memory allocator has been replaced with
| mimalloc_
|
| These are very similar ideas!
|
| Mimalloc is notable for its use of separate local and remote
| free lists, where objects that are being freed from a different
| thread than the page's heap's owner are placed in a separate
| queue. The local free list is (IIRC) non-atomic until it is
| empty and local allocs start pulling from the remote queue.
|
| The general idea is clearly lazy support for concurrency,
| matching up perfectly with Python's need to keep any single-
| threaded perf it has. I'm impressed with the application of all
| of these things at once.
| marris wrote:
| How big a problem is the possible breakage of C extensions for
| new code? Is there currently some standard "future proofed for
| multi-thread" way of writing them that will reduce the odds of
| the C extension breaking? And maybe also being compatible with
| PyPy?
| Or do developers today need to write a separate version for
| each interpreter that they want to support?
| heavyset_go wrote:
| There are projects[1] that are abstracting away the C extension
| interface in order to standardize C extensions across
| implementations and prevent breaking changes.
|
| [1] https://github.com/hpyproject/hpy
| singhrac wrote:
| Notably the dev proposing this (Sam Gross aka colesbury) is/was
| a major PyTorch developer, so someone quite familiar with high
| performance Python and extensions.
| ajtulloch wrote:
| and he's a genius!
| jeremyis wrote:
| +1 ! Though, not a very good Oculus player (yet)!
| mzs wrote:
| Yikes, C extensions can't assume they are under the GIL by
| default:
|
| https://github.com/colesbury/numpy/commits/v1.19.3-nogil
| kzrdude wrote:
| It looks like a total of four lines needed changing in numpy
| due to his change. That's a very good score in my book; numpy
| is huge.
| Jweb_Guru wrote:
| Unfortunately, every C extension will need to undergo manual
| review for safety, unless there's some very easy way to have
| the C extension opt into using the GIL. And some of them will
| be close to impossible to detangle in this way.
| veryupwork wrote:
| no
| int_19h wrote:
| It really depends on how the library is written, and how much
| shared data it has. It has been very common to use the GIL as
| a general-purpose synchronization mechanism in native Python
| modules, since you have to pay that tax either way.
| ikiris wrote:
| I was half expecting a link to go or rust.
| jeffybefffy519 wrote:
| I don't know about others, but I really enjoy content about the
| Python GIL. It's a fascinatingly complex problem.
| lucb1e wrote:
| For a minute I thought I finally found someone else who likes
| the GIL, but then you said _content about_. Programs that just
| divide up work across processes are much easier to write
| without introducing obscure bugs due to the lack of atomicity.
| I'm definitely excited for a GIL-less Python, even if it's a
| rare scenario where it makes sense to try to do performant
| code in Python in the first place rather than offloading a few
| lines to another language to be fast, but I am a bit afraid
| that people (particularly beginner programmers) will grab this
| with too many hands. Having also seen recommendations for
| this-or-that threading method going around in other languages,
| threads are recommended much more often than where it makes
| sense, and beginners won't have a comparative experience yet
| of writing multi-process code instead.
|
| That said, I am also always interested in GIL-related content
| like this! Loved the article.
| solarmist wrote:
| Yup. Me too, but I'm not sad to see it go.
| phkahler wrote:
| >> If that bit is set, the interpreter doesn't bother tracking
| references for the relevant object at all. That avoids
| contention (and cache-line bouncing) for the reference counts
| in these heavily-used objects. This "optimization" actually
| slows single-threaded accesses down slightly, according to the
| design document, but that penalty becomes worthwhile once
| multi-threaded execution becomes possible.
|
| Was going to say do the opposite. Set the bit if you want
| counting and then modify the increment and decrement to add or
| subtract the bit, thereby eliminating condition checking and
| branching. But it sounds like the concern is cache behavior
| when the count is written. Checking the bit can avoid any
| modification at all.
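
A toy illustration of the branchless variant phkahler sketches:
fold the "counting enabled" bit into the amount added, instead of
testing it and branching. The layout is invented for illustration
and is not how the nogil interpreter's immortal objects actually
work:

    MORTAL = 1    # counting enabled: increments/decrements apply
    IMMORTAL = 0  # counting disabled: the refcount never moves

    class Obj:
        def __init__(self, mortal=True):
            self.flag = MORTAL if mortal else IMMORTAL
            self.refcnt = 1

    def incref(obj):
        obj.refcnt += obj.flag  # adds 1 or 0; no branch needed

    def decref(obj):
        obj.refcnt -= obj.flag
        if obj.refcnt == 0:     # immortal objects never reach zero
            print("freed")

As phkahler notes, this still writes to the count (and so still
bounces cache lines), which is why the design described in the
article checks the bit and skips the write entirely.
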
| Jtsummers wrote: | Under this scheme, objects get freed when both local and shared | counts are zero. By using a special value that makes the shared | count non-zero (for eternal and long-lived objects), it ensures | that should the owner (for some reason) drop them, they will | not be freed. No extra logic has to be introduced, the shared | count is non-zero and that's all that's needed to prevent | freeing. ___________________________________________________________________ (page generated 2021-10-15 23:00 UTC)