[HN Gopher] A viable solution for Python concurrency
       ___________________________________________________________________
        
       A viable solution for Python concurrency
        
       Author : zorgmonkey
       Score  : 392 points
       Date   : 2021-10-15 17:54 UTC (5 hours ago)
        
 (HTM) web link (lwn.net)
 (TXT) w3m dump (lwn.net)
        
       | ferdowsi wrote:
        | If this effort succeeds (and I hope it does), Python
        | developers will still need to contend with the event-loop
        | albatross of asyncio and all of its weird complexity.
       | 
       | In an alternate Python timeline, asyncio was not introduced into
       | the Python standard library, and instead we got a natively
       | supported, robust, easy-to-use concurrency paradigm built around
       | green/virtual threading that accommodates both IO and CPU bound
       | work.
        
         | BiteCode_dev wrote:
          | asyncio is not a competitor to threads, it's complementary.
         | 
         | In fact, it's a perfectly viable strat in python to have
         | several processes, each having several threads, each having an
         | event loop.
         | 
         | And it will still be so, once this comes out. You will
         | certainly use threads more, and processes less, but replacing
         | 1000000 coroutines by 1000000 system threads is not necessarily
         | the right strategy for your task. See nginx vs apache.
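          | 
          | A minimal sketch of that layered strategy (illustrative names;
          | each worker thread runs its own event loop via asyncio.run(),
          | and you would add a process pool on top for CPU-bound work):
          | 
          |     import asyncio
          |     import threading
          | 
          |     async def handle_io(worker_id):
          |         # stand-in for real async IO work
          |         await asyncio.sleep(0.1)
          |         print(f"worker {worker_id} done")
          | 
          |     def thread_main(worker_id):
          |         # asyncio.run() gives each thread its own event loop
          |         asyncio.run(handle_io(worker_id))
          | 
          |     threads = [threading.Thread(target=thread_main, args=(i,))
          |                for i in range(4)]
          |     for t in threads:
          |         t.start()
          |     for t in threads:
          |         t.join()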
        
           | dralley wrote:
           | Multiple threads with one asyncio loop per thread would be
           | absolutely pointless in Python, because of the GIL.
           | 
            | With that said, sure, threads and asyncio are complementary
           | in the sense that you can run tasks on threadpool executors
           | and treat them as if they were coroutines on an event loop.
           | But that serves no purpose unless you're trying to do
           | blocking IO without blocking your whole process.
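            | 
            | For example, a rough sketch of that pattern (the blocking
            | function is a stand-in):
            | 
            |     import asyncio
            |     import urllib.request
            | 
            |     def blocking_fetch(url):
            |         # ordinary blocking IO, e.g. a synchronous HTTP client
            |         return urllib.request.urlopen(url).read()
            | 
            |     async def main():
            |         loop = asyncio.get_running_loop()
            |         # runs in the default ThreadPoolExecutor, awaitable
            |         # like any coroutine, so the loop is never blocked
            |         data = await loop.run_in_executor(
            |             None, blocking_fetch, "https://example.com")
            |         print(len(data))
            | 
            |     asyncio.run(main())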
        
             | bonzini wrote:
             | In Python it would be pointless, but for example it's how
             | Seastar/ScyllaDB work: each thread is bound to a CPU on the
             | host and has its own reactor (event loop) with coroutines
             | on it. QEMU has a similar design.
        
               | yellowapple wrote:
               | It's also (to my knowledge) how Erlang's VMs (e.g. BEAM)
               | work: one thread per CPU core, and a VM on each thread
               | preemptively switching between processes.
        
             | BiteCode_dev wrote:
              | It would not be pointless at all, because while one thread
              | may be stuck on CPU work, context switching will let
              | another one deal with IO. This can let you smooth out the
              | progress of each part of your program, and can be useful
              | for workloads where you don't want anything to block for
              | too long.
        
             | heavyset_go wrote:
             | I read it as each process having multiple threads _and_ an
              | event loop. If the threads are performing I/O or calling
             | out to compiled code and releasing the GIL, said GIL won't
             | block the event loop.
        
             | azinman2 wrote:
             | This entire article is about removing the GIL
        
           | Zarathust wrote:
           | "Viable" as in "you have no other choice sometimes". This
           | forces you to deal with 3 libraries each with their own
           | quirks, pitfalls and incompatibilities. Sometimes you even
           | deal with dependencies reimplementing some parts in a 4th or
           | 5th library to deal with shortcomings.
           | 
           | I really don't care that much which of them survive, I just
           | want to rely on less of them
        
             | BiteCode_dev wrote:
              | No, it's just useful. They are techs with different
              | trade-offs, and life is full of opportunities.
        
               | throwaway81523 wrote:
               | Python Zen = one obvious way to do it. Having a bunch of
               | very different ones, each with serious disadvantages, is
               | a bad look.
        
               | heavyset_go wrote:
               | Zen of Python is an ideal, and at this point, kind of
               | tongue-in-cheek.
               | 
               | This is the same language that shipped with at least 3
               | different methods to apply functions across iterables
               | when the Zen of Python was adopted as a PEP in 2004.
        
               | throwaway81523 wrote:
               | There is at least some recognition in those cases that
               | they introduced the new thing because they got it wrong
               | in the old thing. That's different than saying they
               | should co-exist on equal terms.
        
               | BiteCode_dev wrote:
               | It's a technical thread, not a political one. If you were
               | so sure of your argument, you wouldn't use a throwaway.
               | 
               | Besides, it's weird, like saying we should not have int,
               | float and complex, there should be one way to do it.
               | 
               | Just because those are 3 numbers doesn't mean they don't
               | have each their own specific benefit.
        
               | throwaway81523 wrote:
               | int, float, and complex are for different purposes. async
               | and threads paper over each others' weaknesses, instead
               | of fixing the weaknesses at the start. Async itself is an
               | antipattern (technical opinion, so there) but Python uses
               | it because of the hazards and high costs of threads.
               | Chuck Moore figured out 50 years ago to keep the async
               | stuff out of the programmer's way, when he put
               | multitasking into Polyforth, which ran on tiny machines.
               | Python (and Node) still make the programmer deal with it.
               | 
               | If you look at Haskell, Erlang/Elixir, and Go, they all
               | let you write performant sequential code by pushing the
               | async into the runtime where the programmer doesn't have
               | to see it. Python had an opportunity to do the same, but
               | stayed with async and coroutines. What a pain.
        
         | throwaway81523 wrote:
         | Yes, I have a big sense of tragedy about Python 3. Python
         | should run on something like (or maybe the actual) Erlang BEAM
         | with lightweight isolated processes. All my threaded Python
         | code is written using that style anyway (threads communicating
         | through synchronized queues) and I've almost never needed
         | traditional shared mutable objects. Maybe completely never, but
         | I'm not sure about a certain program any more.
         | 
         | Added: I don't understand the downvotes. If Python 3 was going
         | to make an incompatible departure from Python 2, they might as
         | well have done stuff like the above, that brought real
         | benefits. Instead they had 10+ years of pain over relatively
         | minor changes that arguably weren't all improvements.
        
           | BeetleB wrote:
           | You are likely being downvoted because most claims about the
           | pain of a Python 3 transition are inflated/hyperbole.
           | 
           | It took less than a day to migrate all my code to Python 3.
           | And by "less than a day" I mean "less than 2 hours". Granted,
           | bigger projects would take longer, but saying stuff like "10+
           | years of pain" is ridiculous. Probably less than 1% of
           | projects had serious issues with the migration. We just hear
           | of a few popular ones that had some pain and assume that was
           | representative.
        
         | KaiserPro wrote:
         | > easy-to-use concurrency paradigm
         | 
         | Well it has queues and threads already.
         | 
          | It's just that asyncio, for socket handling at least (in the
          | testing that I did), is about 5% faster (one asyncio socket
          | "server" vs ten threads [with a number of ways to monitor for
          | new connections]).
          | 
          | I always assumed that people wanted asyncio because they looked
          | at javascript and thought "hey I want GOTOs cosplaying as a fun
          | paradigm"
        
           | BiteCode_dev wrote:
           | GOTO cosplaying should go away with structured concurrency
           | (via TaskGroup) being adopted in 3.11, as pioneered by Trio.
           | 
           | Check out anyio if you want to use them now.
        
         | nine_k wrote:
         | BTW I wonder why async is so painless in ES6 compared to
         | Python. Why the presence of GIL (which JS also has) did not
         | make running async coroutines completely transparent, as it
         | made running generators (which are, well, coroutines already).
          | Why the whole event loop thing is even visible at all.
        
           | laurencerowe wrote:
           | Because JavaScript never had threads so I/O in JavaScript has
           | always been non-blocking and the whole ecosystem surrounding
           | it has grown up under that assumption.
           | 
           | JavaScript doesn't need a GIL because it doesn't really have
           | threads. WebWorkers are more akin to multiprocessing than
           | threads in Python. Objects cannot be shared directly across
           | WebWorkers so transferring data comes with the expense of
           | serializing/deserializing at the boundary.
        
             | catlifeonmars wrote:
             | JS now has shared array buffers.
        
               | laurencerowe wrote:
               | SharedArrayBuffer is just raw memory similar to using
               | mmap from Python multiprocessing. The developer
               | experience is very different to simply sharing objects
               | across threads.
        
           | BiteCode_dev wrote:
           | I used them both extensively, and here are the main reasons I
           | can think of:
           | 
           | - The event loop in JS is invisible and implicit. V8 proved
           | it can be done without paying a cost for it, and in fact most
           | real life python projects are using uvloop because it's
            | faster than asyncio's default loop. JS devs don't think of the
            | loop at all, because it's always been there. They don't have
            | to choose a loop, or think about its lifecycle or
            | scheduling. The API doesn't show the loop at all.
           | 
           | - Asynchronous functions in JS are scheduled automatically.
           | On python, calling a coroutine function does...nothing. You
           | have to either await it, or pass it to something like
            | asyncio.create_task(). The latter is not only verbose, it's
            | not intuitive.
           | 
           | - Async JS functions can be called from sync functions
           | transparently. It just returns a Promise after all, and you
           | can use good old callbacks. Instantiating a Python coroutine
           | does... nothing as we said. You need to schedule it AND await
           | it. If you don't, it may or may not be executed. Which is why
           | asyncio.gather() and co are to be used in python. Most people
           | don't know that, and even if you know, it's verbose, and you
           | can forget. All that, again, because using the event loop
           | must be explicit. That's one thing TaskGroup from trio will
           | help with in the next Python versions...
           | 
           | - the early asyncio API sucked. The new one is ok,
           | asyncio.run() and create_task() with implicit loop is a huge
           | improvement. But you better use 3.7 at least. And you have to
           | think about all the options for awaiting:
           | https://stackoverflow.com/questions/42231161/asyncio-
           | gather-...
           | 
           | - asyncio tutorials and docs are not great, people have no
           | idea how to use it. Since it's more complex, it compounds.
           | 
            | E.g., if you use await:
            | 
            | With node v14.8+:
            | 
            |     await async_func(params)
            | 
            | With python 3.7+:
            | 
            |     import asyncio
            | 
            |     async def main():
            |         # no top level await, it must happen in a loop
            |         await async_func(params)
            | 
            |     asyncio.run(main())  # explicit loop, but easy one thanks to 3.7
            | 
            | E.g., deep inside function calls, but no await:
            | 
            | With node:
            | 
            |     ...
            |     async_func(params)
            | 
            | With python 3.7+:
            | 
            |     ...
            |     # async_func(params) alone would do nothing
            |     res = asyncio.create_task(async_func(params))
            |     ...
            |     # you MAY get away with not using gather() or wait()
            |     # but you also may get "coroutine is never awaited"
            |     # RuntimeWarning: coroutine 'async_func' was never awaited
            |     asyncio.gather(res)
           | 
           | Of course, you could use "run_until_complete()", but then you
           | would be blocking. Which is just not possible in JS, there is
           | one way to do it, and it's always non blocking and easy.
           | Ironic, isn't it? Beside, which Python dev knows all this?
           | I'm guessing most readers of this post will have heard of it
           | for the first time.
           | 
           | Python is my favorite language, and I can live with the
           | explicit loop, but explicit scheduling is ridiculous. Just
           | run the damn coroutine, I'm not instantiating it for the
           | beauty of it. If I want a lazy construct, I can always make a
           | factory.
           | 
            | Now, thanks to the trio nursery concept, we will get
            | TaskGroup in the next release (also you can already use the
            | equivalent with anyio):
            | 
            |     async with asyncio.TaskGroup() as tg:
            |         tg.create_task(async_func(params))
           | 
           | Which, while still verbose, is way better:
           | 
           | - no gather or wait. Schedule it, it will run or be cleaned
           | up.
           | 
            | - no need to choose an awaiting strat, or learn about a
            | thousand things. This works for every case. Wanna use it in a
            | sync call? Pass the tg reference into it.
           | 
           | - lifecycle is cleanly scoped, a real problem with a lot of
           | async code (including in JS, where it doesn't have a clean
           | solution)
        
           | heavyset_go wrote:
            | > _Why the whole event loop thing is even visible at all._
           | 
            | It isn't anymore.
            | 
            |     In [3]: from asyncio import run
            | 
            |     In [4]: async def async_func():
            |        ...:     print('Ran async_func()')
            | 
            |     In [5]: run(async_func())
            |     Ran async_func()
           | 
           | Top-level async/await is also available in the Python REPL
           | and IPython, and there are discussions on the Python mailing
           | list about making top-level async/await the default for
            | Python[1].
            | 
            |     In [1]: async def async_func():
            |        ...:     print('Ran async_func()')
            | 
            |     In [2]: await async_func()
            |     Ran async_func()
           | 
           | [1] https://groups.google.com/g/python-
           | ideas/c/PN1_j7Md4j0/m/0xy...
        
             | BiteCode_dev wrote:
             | Oh, top level await... I missed that.
             | 
             | Not sure it will get there, but it would be nice. I think
             | putting a top level "await" is explicit enough for stating
             | you want an event loop anyway.
             | 
              | Now, with TaskGroup in 3.11, things are going to get pretty
              | nice, especially if this top level await plays out,
              | provided they include async for and async with in the mix.
              | 
              | Now, if they could just make it so that coroutines are
              | automatically scheduled to the nearest task group, we would
              | almost have something usable.
        
         | dekhn wrote:
         | so true. I've been writing thread-callback code for decades
         | (common in network and gui event loops, see QtPy as an example)
         | and when I looked at asyncio my first thought is "this is not
         | better". It's entirely nontrivial to analyze code using asyncio
         | (or yield) compared to callbacks.
        
         | harpiaharpyja wrote:
         | If you are ever considering making use of asyncio for your
         | project, I would strongly recommend taking a look at curio [1]
         | as an alternative. It's like asyncio but far, far easier to
         | use.
         | 
         | [1] https://curio.readthedocs.io/en/latest/index.html
        
           | acidbaseextract wrote:
           | The video (or blog post) below is one of the best
           | explanations I've seen about what subtle bugs are easy to
           | make with asyncio, why it's easy to make them, and how the
           | trio library addresses them.
           | 
           | But yes, consider alternatives before you pick asyncio as
           | your approach!
           | 
           | Talk: https://www.youtube.com/watch?v=oLkfnc_UMcE
           | 
           | Blog post: https://vorpus.org/blog/notes-on-structured-
           | concurrency-or-g...
        
           | VWWHFSfQ wrote:
           | Highly recommend curio
        
           | [deleted]
        
           | BiteCode_dev wrote:
            | While the design of Curio is quite interesting, it may not
            | be a good choice, not for technical reasons, but for
            | logistical reasons: the chances it gets wide adoption are
            | slim to None.
           | 
           | And since we are stuck with colored functions in python, the
           | choice of stack matters very much.
           | 
           | Now, if you want easier concurrency, and a solution to a lot
           | of concurrency problems that curio solves, while still being
           | compatible with asyncio, use anyio:
           | 
           | https://anyio.readthedocs.io/en/stable/
           | 
           | It's a layer that works on top of asyncio, so it's compatible
           | with all of it. But it features the nursery concept from
           | Trio, which makes async programming so much simpler and
           | safer.
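            | 
            | A rough sketch of the nursery/task-group style with anyio
            | (worker names are illustrative):
            | 
            |     import anyio
            | 
            |     async def worker(name):
            |         await anyio.sleep(1)
            |         print(f"{name} finished")
            | 
            |     async def main():
            |         # the task group waits for all children and
            |         # propagates any errors when the block exits
            |         async with anyio.create_task_group() as tg:
            |             tg.start_soon(worker, "a")
            |             tg.start_soon(worker, "b")
            | 
            |     anyio.run(main)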
        
             | heavyset_go wrote:
             | anyio is also compatible with asyncio and Trio, so you can
             | use it with either library or paradigm.
        
               | sandGorgon wrote:
                | Uvloop/uvicorn - the production-grade ASGI server - only
                | works with asyncio.
                | 
                | Hypercorn works with trio... but you lose a LOT of
               | performance
        
           | quietbritishjim wrote:
           | Curio's spiritual successor is Trio [1], which was written by
           | one of the main Curio contributors and is more actively
           | maintained (and, at this point, much more widely used). Like
           | Curio, it's much easier to use than asyncio, although ideas
           | from it are gradually being incorporated back into asyncio
           | e.g. asyncio.run() was inspired by curio.run()/trio.run().
           | 
           | I have used Trio in real projects and I thoroughly recommend
           | it.
           | 
           | This blog post [2] by the creator of Trio explains some of
           | the benefits of those libraries in a very readable way.
           | 
           | [1] https://trio.readthedocs.io/en/stable/
           | 
           | [2] https://vorpus.org/blog/some-thoughts-on-asynchronous-
           | api-de...
        
         | btown wrote:
         | > instead we got a natively supported, robust, easy-to-use
         | concurrency paradigm built around green/virtual threading that
         | accommodates both IO and CPU bound work
         | 
         | Minus the "natively supported" part, we have this today in
         | http://www.gevent.org/ ! It's so, so empowering to be able to
         | access the entire historical body of work of synchronous-I/O
         | Python libraries, and with a single monkey patch cause every
         | I/O operation, no matter how deep in the stack, to yield to
         | your greenlet pool _without code changes_.
         | 
         | We fire up one process per core (gevent doesn't have good
         | support for multiprocessing, but if you're relying on that,
         | you're stuck on one machine anyways), spend perhaps 1 person-
         | day a quarter dealing with its quirks, and in turn we never
         | need to worry about the latencies of external services; our web
         | servers and batch workers have throughput limited only by CPU
         | and RAM, for which there's relatively little (though nonzero)
         | overhead.
         | 
         | IMO Python should have leaned into official adoption of gevent.
         | It may not beat asyncio in raw performance numbers because
         | asyncio can rely on custom-built bytecode instructions, whereas
         | gevent has "userspace" code that must execute upon every yield.
         | And, as with asyncio, you have to be careful about CPU-
         | intensive code that may prevent you from yielding. But it's
         | perfect for most horizontal-scaling soft-realtime web-style use
         | cases.
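          | 
          | A minimal sketch of that monkey-patching approach (the URL is
          | illustrative; the patch must run before anything else imports
          | socket/ssl):
          | 
          |     from gevent import monkey
          |     monkey.patch_all()  # patch socket, ssl, time, etc. first
          | 
          |     import gevent
          |     import urllib.request
          | 
          |     def fetch(url):
          |         # looks like blocking IO, but yields to other greenlets
          |         # while waiting on the network
          |         return urllib.request.urlopen(url).read()
          | 
          |     jobs = [gevent.spawn(fetch, "https://example.com")
          |             for _ in range(10)]
          |     gevent.joinall(jobs)
          |     print([len(job.value) for job in jobs])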
        
         | int_19h wrote:
         | How would those green/virtual threads interface with native
         | async APIs (e.g. the entirety of WinRT)?
        
         | tbabb wrote:
         | What specifically is the problem with asyncio? I quite like
         | using it, so I'm curious if there's some aspect that makes it
         | unsustainable?
        
           | fullstop wrote:
           | I like using it as well, but I've been bit several times by
           | having runtime exceptions completely swallowed.
        
           | throwaway81523 wrote:
           | > What specifically is the problem with asyncio?
           | 
           | Watch the very NSFW (lots of swearing) but hysterically funny
           | video "node.js is bad ass rock star tech" on youtube sometime
           | ;). https://www.youtube.com/watch?v=bzkRVzciAZg
        
           | calpaterson wrote:
           | The key disadvantage is largely that it bifurcates the
           | library base. Async libraries and sync libraries co-exist
           | uneasily in the same program.
           | 
           | For nearly every popular library there is now a (usually
           | inferior, less robust) async one. The benefits of Linus' Law
           | are reduced.
        
             | Redoubts wrote:
              | Trifurcates, since now there's stdlib asyncio, and a popular
             | trio async flavor too.
        
             | tbabb wrote:
             | Fair point. Async is a big enough idea that it probably
             | warrants designing the language with it in mind. I guess
             | another way of phrasing it would be that it violates the
             | "there's only one way to do it" maxim, and the "two ways of
             | doing it" circumstance necessarily came about because the
             | idea was discovered long after the core language and
             | libraries were already written.
        
           | aeyes wrote:
           | It solves only one problem, the name says it: Async I/O
           | 
           | If you do anything on the CPU or if you have any I/O which is
           | not async you stall the event loop and everything grinds to a
           | halt.
           | 
           | Imagine a program which needs to send heartbeats or data to a
           | server in a short interval to show liveness, Kafka for
           | example. Asyncio alone can't reliably do this, you need to
           | take great care to not stall the event loop. You only have
           | exactly one CPU core to work with, if you do work on the CPU
           | you stall the event loop.
           | 
           | We see web frameworks built on asyncio but even simple API
           | only applications constantly need to serialize data which is
           | CPU-bound. These frameworks make no effort (and asyncio
           | doesn't give us any tools) to protect the event loop from
           | getting stalled by your code. They work great in simple
           | benchmarks and for a few types of applications but you have
           | to know the limits. And I feel that the general public does
           | not know the limitations of asyncio, it wasn't made for
           | building web frameworks on the async event loop. It was made
           | for communicating with external services like databases and
           | calling APIs.
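            | 
            | The usual workaround (sketch only; cpu_heavy is a stand-in)
            | is to push CPU-bound work off the loop yourself, e.g. into a
            | process pool:
            | 
            |     import asyncio
            |     import json
            |     from concurrent.futures import ProcessPoolExecutor
            | 
            |     def cpu_heavy(rows):
            |         # stand-in for expensive serialization / crunching
            |         return json.dumps(rows)
            | 
            |     async def handler(rows, pool):
            |         loop = asyncio.get_running_loop()
            |         # the loop stays responsive while a worker process
            |         # does the heavy lifting
            |         return await loop.run_in_executor(pool, cpu_heavy, rows)
            | 
            |     async def main():
            |         with ProcessPoolExecutor() as pool:
            |             rows = [{"n": i} for i in range(1000)]
            |             print(len(await handler(rows, pool)))
            | 
            |     if __name__ == "__main__":
            |         asyncio.run(main())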
        
       | twic wrote:
       | > With this scheme, the reference count in each object is split
       | in two, with one "local" count for the owner (creator) of the
       | object and a shared count for all other threads. Since the owner
       | has exclusive access to its count, increments and decrements can
       | be done with fast, non-atomic instructions. Any other thread
       | accessing the object will use atomic operations on the shared
       | reference count.
       | 
       | > Whenever the owning thread drops a reference to an object, it
       | checks both reference counts against zero. If both the local and
       | the shared count are zero, the object can be freed, since no
       | other references exist. If the local count is zero but the shared
       | count is not, a special bit is set to indicate that the owning
       | thread has dropped the object; any subsequent decrements of the
       | shared count will then free the object if that count goes to
       | zero.
       | 
        | So in this program:
        | 
        |     import threading
        | 
        |     def produce():
        |         global global_foo
        |         local_foo = "potato"
        |         global_foo = local_foo
        | 
        |     def consume():
        |         global global_foo
        |         local_foo = global_foo
        |         global_foo = None
        | 
        |     if __name__ == '__main__':
        |         produce()
        |         thread = threading.Thread(target=consume)
        |         thread.start()
        |         thread.join()
       | 
       | What happens to the counts on the string "potato"?
       | 
       | In produce, the main thread creates it and puts it in a local,
       | and increments the local count. It assigns it to a global, and
       | increments the local count. It then drops the local when produce
       | returns, and decrements the local count. In consume, the second
       | thread copies the global to a local, and increments the shared
       | count. It clears out the global, and decrements the shared count.
       | It then drops the local when consume returns, and decrements the
       | shared count.
       | 
       | That leaves the local count at 1 and the shared count at -1!
       | 
       | You might think that there must be special handling around
       | globals, but that doesn't fix it. Wrap the string in a perfectly
       | ordinary list, and put the list in the global, and you have the
       | same problem.
       | 
        | I imagine this is explained in the paper by Choi et al, but I
       | have not read it!
        
         | ameixaseca wrote:
         | I couldn't find this in the design document but the only
         | obvious solution is to track globals via the shared count.
         | Since a global reference is part of all threads simultaneously,
         | it cannot be treated as local.
         | 
         | If you follow this reasoning, the operations above result in
         | local=0/shared=0 after the last assignment.
        
           | twic wrote:
            | As I said in the comment, that doesn't work. Put a list in
           | the global, and then push and pop the string on the list.
           | Even better, push the string into a local list, then put that
           | list in another local list, then put that in a global, etc.
           | You would need to dynamically keep every object reachable
           | from a global marked as such, and that's a non-starter.
        
         | twic wrote:
         | The paper:
         | 
         | > When the shared counter for an object becomes negative for
         | the first time, the non-owner thread updating the counter also
         | sets the object's Queued flag. In addition, it puts the object
         | in a linked list belonging to the object's owner thread called
         | QueuedObjects. Without any special action, this object would
         | leak. This is because, even after all the references to the
         | object are removed, the biased counter will not reach zero --
         | since the shared counter is negative. As a result, the owner
         | would trigger neither a counter merge nor a potential
         | subsequent object deallocation.
         | 
         | > To handle this case, BRC provides a path for the owner thread
         | to explicitly merge the counters called the ExplicitMerge
         | operation. Specifically, each thread has its own thread-safe
         | QueuedObjects list. The thread owns the objects in the list. At
         | regular intervals, a thread examines its list. For each queued
         | object, the thread merges the object's counters by accumulating
         | the biased counter into the shared counter. If the sum is zero,
         | the thread deallocates the object. Otherwise, the thread
         | unbiases the object, and sets the Merged flag. Then, when a
         | thread sets the shared counter to zero, it will deallocate the
         | object. Overall, as shown in invariant I4, an owner only gives
         | up ownership when it merges the counters.
         | 
         | Well, that works, but it's a bit naff.
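          | 
          | If it helps, here is a toy, single-threaded model of the
          | bookkeeping described above (plain ints and flags standing in
          | for the atomics and per-thread queues of the real thing):
          | 
          |     class Obj:
          |         def __init__(self):
          |             self.biased = 1      # owner-only, non-atomic count
          |             self.shared = 0      # atomic count for other threads
          |             self.queued = False
          |             self.merged = False
          | 
          |     def owner_decref(o):
          |         o.biased -= 1
          |         if o.biased == 0:
          |             if o.shared == 0:
          |                 print("deallocate")   # both counts at zero
          |             else:
          |                 o.merged = True       # shared count now decides
          | 
          |     def other_decref(o, owner_queue):
          |         o.shared -= 1
          |         if o.shared == 0 and o.merged:
          |             print("deallocate")       # last shared ref after merge
          |         elif o.shared < 0 and not o.queued:
          |             o.queued = True
          |             owner_queue.append(o)     # owner will ExplicitMerge later
          | 
          |     def explicit_merge(owner_queue):
          |         for o in owner_queue:
          |             o.shared += o.biased      # fold biased count into shared
          |             o.biased = 0
          |             if o.shared == 0:
          |                 print("deallocate")   # merge found a total of zero
          |             else:
          |                 o.merged = True
          |         owner_queue.clear()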
        
         | Jtsummers wrote:
         | Two spaces in front of each line of the code block. As written,
         | right now, your comment is hard to parse:
         | import threading            def produce():         global
         | global_foo         local_foo = "potato"         global_foo =
         | local_foo            def consume():         global global_foo
         | local_foo = global_foo         global_foo = None            if
         | __name__ == '__main__':         produce()         thread =
         | threading.Thread(target=consume)         thread.start()
         | thread.join()
        
           | twic wrote:
           | Sorry about that. I did indent before pasting the code - but
           | gedit indents with tabs, which HN ignores!
        
             | kzrdude wrote:
             | It can be configured. I remember when Gedit was quite the
             | potent editor, language plugins, snippets and stuff. But it
             | has the basics left still :)
        
       | overgard wrote:
        | I feel like GvR just doesn't want to change things. Feels doomed.
        | 
        | This has been a problem for like 20 years and they have refused
        | fixes before. And there have been fixes. They just don't see this
        | as important.
        | 
        | It's practically a religion that it's a thing they won't change.
        
         | heavyset_go wrote:
         | I disagree entirely. The last few releases of Python have made
         | significant changes to the language, coinciding with the
         | project becoming community-led after Guido stepped down.
        
           | mixmastamyk wrote:
           | Yes, stepped down as a result of him forcing the walrus
           | operator change into the language over significant
           | opposition.
        
         | randlet wrote:
          | Guido is no longer the BDFL and spoke fairly positively about
         | this change in the mailing list thread[1].
         | 
         | "To be clear, Sam's basic approach is a bit slower for single-
         | threaded code, and he admits that. But to sweeten the pot he
         | has also applied a bunch of unrelated speedups that make it
         | faster in general, so that overall it's always a win. But
         | presumably we could upstream the latter easily, separately from
         | the GIL-freeing part."
         | 
         | [1] https://mail.python.org/archives/list/python-
         | dev@python.org/...
        
       | Waterluvian wrote:
       | I can see it now:
       | 
       | "My program has 2^64 references to an object, which caused it to
       | become immortal"
       | 
       | =)
        
         | notriddle wrote:
         | In a 64-bit address space, with objects requiring more than one
         | word to store, that's literally impossible.
        
       | The_rationalist wrote:
       | Or you could just use GraalVM python
       | https://github.com/oracle/graalpython
        
       | misnome wrote:
       | > This "optimization" actually slows single-threaded accesses
       | down slightly, according to the design document, but that penalty
       | becomes worthwhile once multi-threaded execution becomes
       | possible.
       | 
       | My understanding was that CPython viewed any single-threaded
       | performance regression as a blocker to GIL-removal attempts,
        | regardless of whether other work by the developer has sped up the
        | interpreter? This article seems to somewhat gloss over that with
        | "it's only small". I'd be interested in other estimates of the
        | "better-than-average chance" this (promising-sounding) attempt
        | has.
       | 
       | Breaking C extensions (especially the less-conforming ones, which
       | seem likely to be the least maintained) also seems like it would
       | be a very hard pill to swallow, and the sort of thing that might
       | make it a Python 3-to-4 breaking change, which I imagine would
       | also be approached extremely carefully given there are still
       | people to-this-day who believe that python 3 is a mistake and one
       | day everyone will realise it and go back to python 2 (yes,
       | really).
        
         | singhrac wrote:
         | From the article:
         | 
         | > Gross has also put some significant work into improving the
         | performance of the CPython interpreter in general. This was
         | done to address the concern that has blocked GIL-removal work
         | in the past: the performance impact on single-threaded code.
         | The end result is that the new interpreter is 10% faster than
         | CPython 3.9 for single-threaded programs.
        
           | singhrac wrote:
           | Sorry, to be clear, I missed your point "regardless of if
           | other work by the developer has sped up the interpreter".
           | That's fair, though my personal opinion is that that seems
           | like an incredibly high bar for any language.
        
         | a1369209993 wrote:
         | > and one day everyone will realise it
         | 
         | No? Why would we think that? There are people who willingly use
          | _java_; compared to that the problems with python 3 are
         | downright non-obvious as long as you never need to work with
         | things like non-Unicode text.
        
         | Kranar wrote:
         | C extensions can continue to be supported. Said extensions
         | already explicitly lock/release the GIL, so to keep things
         | backwards compatible it would be perfectly fine if there was a
         | GIL that existed strictly for C extension compatibility.
        
         | masklinn wrote:
         | > My understanding was that CPython viewed any single-threaded
         | performance regression as a blocker to GIL-removal attempts,
         | regardless of if other work by the developer has sped up the
         | interpreter?
         | 
         | Previous GILectomy attempts incurred significant single-
          | threaded performance penalties, on the order of 50% or
          | above. If Gross's work yields a low single-digit performance
          | penalty it's pretty likely to be accepted, as this is the sort
          | of impact which can happen semi-routinely as part of
          | interpreter updates.
         | 
         | The complete breakage of C extensions would be a much bigger
         | issue.
        
         | ajkjk wrote:
         | There are people who believe all kinds of crazy things; it
         | doesn't reflect their truth. Going back to Python 2 is not
         | going to ever happen (and no one working on Py3 would ever want
         | to, anyway).
         | 
         | A hard pill to swallow.. ain't that bad if it also benefits you
         | tremendously, which fixing the GIL would do.
        
           | EamonnMR wrote:
           | I do wish for a world where Python 3 had handled
           | unicode/bytes very differently.
        
         | fatbird wrote:
         | It was Guido's requirement that GIL removal not degrade single
          | threaded performance at all, but in the talk I attended at PyCon
         | 2019, the speaker mentioned nothing about qualifications on
         | that. Guido's restriction was presented, quite reasonably, as
         | "no one should have to suffer because of removing the GIL". So
         | a net break-even or performance improvement is fine.
         | 
         | And on top of that, Guido has retired now, and the steering
         | committee may feel differently as long as the spirit of the
         | restrictions is upheld.
        
           | fatbird wrote:
           | Guido has replied to Gross's announcement to observe that his
           | performance improvements are not tied to removing the GIL and
           | could be accepted separately. But he doesn't reject Gross's
           | work outright, and if the same release that includes the GIL
           | removal also delivers a concrete performance upgrade, I
           | suspect that Guido would be fine with it. His concern is,
           | after all, practical, to do with the actual use of python and
           | not some architecture principle.
        
       | efoto wrote:
       | "The biggest source of problems might be multi-threaded programs
       | with concurrency-related bugs that have been masked by the GIL
       | until now."
        
         | zinodaur wrote:
         | > concurrency-related bugs that have been masked by the GIL
         | 
         | Yeah... could phrase this as "All programs written with the
         | assumption of a GIL are now broken" instead. Wish they had done
         | this as part of the breaking changes for python 3, I guess
         | they'll have to wait for Python 4 for this?
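          | 
          | The classic latent bug looks like this (the race already
          | exists under the GIL, because the read-modify-write is not
          | atomic; it just loses far less often, and without the GIL the
          | unlocked version would fail constantly):
          | 
          |     import threading
          | 
          |     counter = 0
          |     lock = threading.Lock()
          | 
          |     def unsafe():
          |         global counter
          |         for _ in range(100_000):
          |             counter += 1          # racy LOAD / ADD / STORE
          | 
          |     def safe():
          |         global counter
          |         for _ in range(100_000):
          |             with lock:            # explicit locking survives GIL removal
          |                 counter += 1
          | 
          |     threads = [threading.Thread(target=unsafe) for _ in range(4)]
          |     for t in threads:
          |         t.start()
          |     for t in threads:
          |         t.join()
          |     print(counter)  # may well print less than 400000 even today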
        
         | Animats wrote:
         | Yes. I once discovered that CPickle was not thread-safe. The
         | response was that much of the library didn't really work in
         | multi-threaded programs.
        
           | formerly_proven wrote:
           | You mean programs where you put an object into pickle and
           | some other threads modify it while pickle is processing it?
           | Doesn't surprise me - the equivalent written in plain Python
           | would be very thread unsafe as well.
        
             | Animats wrote:
             | No, I mean several threads doing completely separate
             | CPickle streams with no shared data or variables at the
             | Python level.
        
               | kzrdude wrote:
               | Has it since been fixed?
        
               | toyg wrote:
               | Probably not. CPickle is famously shunned by anyone who
               | has to do serious, performance-critical
               | serialization/deserialization.
        
               | kzrdude wrote:
               | I was curious, and an issue that fits the description was
               | fixed in Py 3.7.x here:
               | https://bugs.python.org/issue34572 but other threading
               | bugs remain: https://bugs.python.org/issue38884
        
       | nomdep wrote:
        | If the Python maintainers don't want to approve this, Gross
        | should talk to the PyPy developers.
        
       | otterley wrote:
       | This may be a silly question, but if you really need concurrency,
       | why not use a language that's built for concurrency from the
       | ground up instead? Elixir is a great example.
        
         | klyrs wrote:
         | I rarely need concurrency, and do a lot of Python because it's
         | what all my dependencies are written in. But sometimes, I find
          | myself bottlenecked on a trivially parallelizable operation. In
          | that state (my dependencies are in Python, I have a working
         | Python implementation), there's _no way in hell_ that (rewrite
         | my dependencies in Elixir, rewrite my code in Elixir) is a
         | sensible next move.
        
         | lucb1e wrote:
         | Are you proposing to write anything that will need concurrency
         | anywhere in your favorite language, or just call into the
         | concurrent code from python? (Since comments like
         | https://news.ycombinator.com/item?id=28883990 seem to be taking
         | it as the former whereas I took it as the latter.)
        
         | ska wrote:
         | To a first approximation, people don't use python for itself,
         | they use it for the vast ecosystem and network effect. If you
         | jump to another language for better concurrency, what are you
         | giving up?
         | 
         | Unless you really are doing greenfield development in an
         | isolated application, these considerations often trump any
         | language feature.
        
           | otterley wrote:
           | Don't get me wrong; I'm not suggesting that anyone dump
           | Python altogether to switch to a different language for any
           | arbitrary project or purpose. Many businesses I work with use
           | different languages for different components or applications,
           | using the network or storage to intercommunicate when
           | necessary. The right tool for the job, as it were.
        
         | ferdowsi wrote:
         | There are some organizations with lots of domain knowledge and
          | expertise around developing, securing and deploying
         | Python and they don't have the Innovation Currency to spend on
         | investing in a new language.
         | 
         | Specific to your point, recruiting for Elixir talent is a
         | problem compared to more mainstream languages. Recruiting in
         | general is extremely hard at this moment.
        
           | otterley wrote:
           | Given all the corner cases people are going to continue to
           | find whilst trying to coax Python into behaving correctly in
           | a highly concurrent program -- especially one that utilizes
           | random libraries from the ecosystem -- I can't help but
           | wonder whether the Innovation Currency is better spent
           | replacing the components that require high concurrency (which
           | often is only a subset of them) instead of getting stuck in
           | the mire of bug-smashing.
        
         | pmontra wrote:
         | A possible answer is that everybody in the company knows Python
         | and no other language. Another one is that they have to reuse
         | or extend a bunch of existing Python code. The latter happened
          | to me. Performance was definitely not a concern but I
         | suddenly needed threads doing extra functionality over the
         | original single threaded algorithm. BTW, I used a queue to pass
         | messages between them.
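          | 
          | That queue-based pattern has the nice property of working the
          | same with or without a GIL; roughly (names illustrative):
          | 
          |     import queue
          |     import threading
          | 
          |     q = queue.Queue()
          | 
          |     def worker():
          |         while True:
          |             item = q.get()
          |             if item is None:   # sentinel: shut down
          |                 break
          |             print("processed", item)
          |             q.task_done()
          | 
          |     t = threading.Thread(target=worker)
          |     t.start()
          |     for i in range(5):
          |         q.put(i)
          |     q.put(None)
          |     t.join()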
        
           | otterley wrote:
           | Using multiple interpreters with message passing is a
           | workable, if expensive, way to deal with the problem. It is
           | trading one cost for another. (These sort of tradeoffs are
           | encountered all the time in business, to be sure.)
        
         | Fordec wrote:
          | Sometimes a project starts off aiming to solve a problem. Maybe
          | it's a data science problem, so support already exists in
          | python, so let's do that. Ok, it worked great and it's catching
         | on with users. Now we need to scale, but we are running into
         | concurrency issues. What is a better answer? Ok we will work on
         | improving python concurrency under the hood, or completely
         | scrap the code base and switch to a different language?
         | 
          | Very few people set out asking themselves about such low-level
          | details on day one of a project. Especially something
          | that was an MVP or POC.
        
           | otterley wrote:
           | I'll plead ignorance here: Do data science workflows often
           | require high concurrency using a single interpreter? I
           | thought all that stuff was compute-bound and parceled out to
           | workers that farm out calculations to CPUs and GPUs.
        
       | Animats wrote:
       | Or you could just use PyPy, which uses a garbage collector, does
       | more compile-time analysis, and runs much faster.
       | 
       | CPython is a naive interpreter, like original JavaScript. There's
       | been progress since then.
        
         | llimllib wrote:
         | lots of people need C extensions, which you can't* have on
         | pypy.
         | 
         | *: mostly true
        
         | willvarfar wrote:
         | Pypy is still single threaded.
         | https://doc.pypy.org/en/latest/faq.html#does-pypy-have-a-gil...
         | 
         | This work is super exciting! Can pypy use the same recipe to
         | offer true parallelism plus the jit??
         | 
         | Will be really interesting to see what pypy devs think of this
         | work and how they might also lever it!
        
           | nas wrote:
           | I think it can't use the same recipe. Sam's approach for
           | CPython uses biased reference counting. Internally, Pypy uses
           | a tracing garbage collector, not reference counting. I don't
           | know how difficult it would be to make their GC thread-safe.
           | Probably you don't want to "stop the world" on every GC pass
           | so I guess changes are non-trivial.
           | 
           | Sam's changes to CPython's container objects (dicts, lists),
           | to make them thread safe might also be hard to port directly
           | to Pypy. Pypy implements those objects differently.
        
             | willvarfar wrote:
             | I think the biggest thing it will give is a need to go
             | there. Until now, pypy has been able to not do parallelism.
             | But if cpython is suddenly faster for a big class of
              | programs, pypy will have to bite the bullet to stay
             | relevant?
        
         | masklinn wrote:
         | pypy also has a GIL.
        
         | nerdponx wrote:
         | PyPy being stuck on 3.7 hurts. If 3.8 support comes out soon,
         | I'll be happy to switch for general-purpose work. 3.9 would be
         | even nicer, to support the type annotation improvements. I
         | donate every month, but I'm just an individual donating pocket
         | change; it'd be great to see some corporate support for PyPy.
        
           | calpaterson wrote:
           | There are very few new features in 3.8.
           | 
           | It is a much less important release (for features) than 3.7,
           | which for example added dataclasses and lots of typing and
           | asyncio stuff.
           | 
           | The most significant change in 3.8 is a notoriously
            | controversial new infix operator. Even its supporters would
            | say that it's a niche use case.
        
             | masklinn wrote:
             | > There are very few new features in 3.8.
             | 
             | > It is a much less important release (for features) than
             | 3.7, which for example added dataclasses and lots of typing
             | and asyncio stuff.
             | 
             | That's funny because my take is the exact opposite:
             | dataclasses are not very useful (attrs exists and does
             | more), deferred type annotations are meh, contextvars,
              | breakpoint(), and module-level getattr/setattr, but not
             | exactly anything you can't do without.
             | 
             | Assignment expressions provide for great cleanups in some
             | contexts (and avoiding redundant evaluations in e.g.
              | comprehensions), the f-string {expr=} specifier is
              | tremendous for printf-debugging, positional-only args are
              | really useful, \N in regex can much improve
             | their readability when relevant.
             | 
             | $dayjob has migrated to python 3.7 and there's really
             | nothing I'm excited to use (possibly aside from doing weird
             | things with breakpoint), whereas 3.8 would be a genuine
             | improvement to my day-to-day enjoyment.
        
               | nerdponx wrote:
               | Deferred type annotations with `from __future__ import
               | annotations` are a game-changer IMO. You can use them
               | 3.7, which is good enough for me. The big improvement in
               | 3.9 is not having to use `typing.*` for a lot of basic
               | data types.
               | 
               | The biggest improvements between 3.7, 3.8, 3.9, and 3.10
               | are in `asyncio`, which was pretty rough in 3.7 and very
               | usable in 3.9. I use the 3rd-party `anyio` library in a
               | lot of cases anyway (https://anyio.readthedocs.io/), but
               | it's not always feasible.
        
         | laurencerowe wrote:
         | It's been a few years since I last played around with PyPy but
         | while it provided amazing performance gains for simple
         | algorithmic code I saw no speed up on a more complex web
         | application.
        
       | typical182 wrote:
       | This is a great list of influences on the design (from the
       | article comments where the prototype author Sam Gross responded
       | to someone wishing for more cross pollination across language
       | communities):
       | 
       | ----------
       | 
       | "... but I'll give a few more examples specific to this project
       | of ideas (or code) taken from other communities:
       | 
       | - Biased reference counting (originally implemented for Swift)
       | 
       | - mimalloc (originally developed for Koka and Lean)
       | 
       | - The design of the internal locks is taken from WebKit
       | (https://webkit.org/blog/6161/locking-in-webkit/)
       | 
       | - The collection thread-safety adapts some code from FreeBSD
       | (https://github.com/colesbury/nogil/blob/nogil/Python/qsbr.c)
       | 
       | - The interpreter took ideas from LuaJIT and V8's ignition
       | interpreter (the register-accumulator model from ignition, fast
       | function calls and other perf ideas from LuaJIT)
       | 
       | - The stop-the-world implementation is influenced by Go's design
       | (https://github.com/golang/go/blob/fad4a16fd43f6a72b6917eff65...
       | )"
        
         | [deleted]
        
       | Ericson2314 wrote:
       | > Gross has also put some significant work into improving the
       | performance of the CPython interpreter in general.
       | 
       | Earmarks work, folks!
        
       | a1369209993 wrote:
       | > With this scheme, the reference count in each object is split
       | in two, with one "local" count for the owner (creator) of the
       | object and a shared count for all other threads. Since the owner
       | has exclusive access to its count, increments and decrements can
       | be done with fast, non-atomic instructions. Any other thread
       | accessing the object will use atomic operations on the shared
       | reference count.
       | 
       | > Whenever the owning thread drops a reference to an object, it
       | checks both reference counts against zero. If both the local and
       | the shared count are zero, the object can be freed, since no
       | other references exist. If the local count is zero but the shared
        | count is not, _a special bit is set to indicate that the
        | owning thread has dropped the object_; any subsequent
       | decrements of the shared count will then free the object if that
       | count goes to zero.
       | 
       | This seems... off. Wouldn't it work better for the owning thread
       | to hold (exactly) one atomic reference, which is released (using
       | the same decref code as other threads) when the local reference
       | count goes to zero?
       | 
       | Edit: I probably should have explicitly noted that, as jetrink
        | points out, the object is initialized with an atomic refcount of
        | one (the "local refcount is nonzero" reference), and destroyed
        | when the atomic refcount is one and about to be decremented, so a
        | purely local object never needs atomic writes.
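        | 
        | Concretely, the variant I'm describing would look something like
        | this toy model (plain ints standing in for the atomic counter):
        | 
        |     class Obj:
        |         def __init__(self):
        |             self.local = 1    # owner's fast, non-atomic count
        |             self.shared = 1   # atomic; 1 means "local is nonzero"
        | 
        |     def owner_decref(o):
        |         o.local -= 1
        |         if o.local == 0:
        |             shared_decref(o)  # drop the owner's single shared ref
        | 
        |     def shared_decref(o):     # same path for owner and non-owners
        |         o.shared -= 1         # atomic decrement in the real thing
        |         if o.shared == 0:
        |             print("deallocate")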
        
         | [deleted]
        
         | Someone wrote:
         | > which is released (using the same decref code as other
         | threads) when the local reference count goes to zero?
         | 
         | (I may misunderstand your remark, as 'releasing' is a bit
         | ambiguous. It could mean decreasing reference count and freeing
         | the memory if the count goes to zero or just plain freeing the
         | memory)
         | 
         | The local ref count can go to zero while other threads still
         | have references to the object (e.g. when the allocating thread
         | sends an object as a message to another thread and, knowing the
         | message arrived, releases it), so freeing the memory when it
         | does would be a serious bug.
         | 
         | Also, the shared ref count can go negative. From the paper:
         | 
         | > _As an example, consider two threads T1 and T2. Thread T1
         | creates an object and sets itself as the owner of it. It points
         | a global pointer to the object, setting the biased counter to
         | one. Then, T2 overwrites the global pointer, decrementing the
         | shared counter of the object. As a result, the shared counter
         | becomes negative._
         | 
         | That can't happen with the biased counter because, when it
         | would end up going negative, the object gets unbiased, and the
         | shared counter gets decreased instead.
         | 
         | That asymmetry is what ensures that only a single thread
         | updates the biased counter, so that no locks are needed to do
         | that.
        
           | a1369209993 wrote:
           | > I may misunderstand your remark, as 'releasing' is a bit
           | ambiguous.
           | 
           | The _reference_ is released; ie the (atomic) reference count
           | is decremented (and the object is only freed if that caused
           | the atomic reference count to go to zero).
           | 
           | > From the paper
           | 
           | I missed that there was a paper and was referring to the
           | proposed implementation in python that was described in TFA.
           | IIUC, biased refcount (in paper) is local (in my
           | description), and shared is atomic, correct?
           | 
           | > the shared ref count can go negative
           | 
           | And _that_ makes sense. Thanks. (And also explains how to
           | deal with references added by one thread and removed by
           | another, when one of those threads is the object owner.)
        
         | morelisp wrote:
         | This seems like it would be less efficient for objects that
         | don't escape their owning thread (you would need one atomic
         | inc/dec versus zero), which is probably most objects.
        
           | a1369209993 wrote:
           | Sorry, should have been more clear; edited.
        
         | johntb86 wrote:
         | Suppose thread A (the owner) keeps a reference, but also puts
         | another reference in a global variable. This would increment
         | its local refcount to 1 and have a shared refcount of 1.
         | 
         | Then thread B clears the global variable. With your scheme the
         | local refcount would be 1 but the shared refcount would be 0,
         | so thread B would destroy the object even though it's
         | referenced by thread A.
        
         | kccqzy wrote:
         | I like this idea. In fact, another possibility is to have a
         | thread-local reference count for each thread that uses the
         | object, which can use fast non-atomic operations, plus a shared
         | atomic reference count that counts how many threads use the
         | object. When a thread-local count goes to zero, the shared
         | count is decremented by one.
         | 
         | This way, if an object is created in one thread and transferred
         | to another, the other thread wouldn't even need to do a lot of
         | atomic reference count manipulations. There wouldn't be
         | surprising behavior in which different threads run the same
         | code at different speeds, just by virtue of whether they
         | created the objects or not.
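         | 
         | As a toy sketch of that (invented names, and it dodges the
         | storage question twic raises below by just using a fixed-size
         | per-thread array in the object):
         | 
         |   #include <stdatomic.h>
         |   #include <stdlib.h>
         |   
         |   #define MAX_THREADS 64  /* toy bound, just for the sketch */
         |   
         |   typedef struct {
         |       long        local[MAX_THREADS]; /* per-thread counts, non-atomic */
         |       atomic_long threads;            /* how many threads hold it */
         |   } obj;                              /* assume zero-initialized (calloc) */
         |   
         |   /* tid = a small dense id for the calling thread, 0..MAX_THREADS-1 */
         |   void obj_incref(obj *o, int tid) {
         |       if (o->local[tid]++ == 0)        /* first reference in this thread */
         |           atomic_fetch_add(&o->threads, 1);
         |   }
         |   
         |   void obj_decref(obj *o, int tid) {
         |       if (--o->local[tid] == 0)        /* last reference in this thread */
         |           if (atomic_fetch_sub(&o->threads, 1) == 1)
         |               free(o);
         |   }
         | 
         | The atomic only gets touched when a thread's own count crosses
         | between zero and one, so a thread hammering on an object it was
         | handed pays for one atomic increment rather than one per incref.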
        
           | Fronzie wrote:
           | Good point. Windows COM did not follow your suggestion,
           | leading to all sorts of awkwardness in applications that have
           | compute and UI threads and share objects between the two.
           | Object destruction becomes unpredictable and can hold up a
           | UI thread.
        
           | twic wrote:
           | How would the storage be laid out?
           | 
           | With the proposed scheme, there are two counters (and, I
           | assume, the ID of the owning thread), so a small fixed-size
           | structure, which can sit directly in the object header. With
           | your scheme, you need a variable and unbounded number of
           | counters. Where would they go?
        
             | wishawa wrote:
             | Maybe each thread could have a mapping storing the number
             | of references held (in that thread) for each object? This
             | way only the atomic refcount has to be in the object
             | header. Also I don't think there would be an owning thread
             | at all with this idea, so no ID needed.
        
         | wishawa wrote:
         | I think yours is a much cleaner design. In the original plan,
         | if the owning thread has just set the special bit, but another
         | thread drops the shared refcount to zero before that write
         | propagates, the object would never be released, would it?
         | 
         | EDIT: never mind the question, I just read that the special bit
         | is atomic.
        
         | dennisafa wrote:
         | I think that idea was mentioned earlier in the article:
         | 
         | > The simplest change would be to replace non-atomic reference
         | count operations with their atomic equivalents. However, atomic
         | instructions are more expensive than their non-atomic
         | counterparts. Replacing Py_INCREF and Py_DECREF with atomic
         | variants would result in a 60% average slowdown on the
         | pyperformance benchmark suite.
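         | 
         | For context, the difference being measured is roughly the one
         | below (simplified strawman types, not CPython's actual structs;
         | the real Py_INCREF/Py_DECREF live in CPython's Include/object.h):
         | 
         |   #include <stdatomic.h>
         |   
         |   typedef struct { long ob_refcnt; } obj_plain;
         |   typedef struct { atomic_long ob_refcnt; } obj_atomic;
         |   
         |   static inline void incref_plain(obj_plain *op) {
         |       op->ob_refcnt++;    /* one ordinary add on a (probably hot) line */
         |   }
         |   
         |   static inline void incref_atomic(obj_atomic *op) {
         |       /* A locked read-modify-write (e.g. LOCK XADD on x86): the core
         |        * must hold the cache line exclusively, which is what makes
         |        * doing this on every incref/decref so much more expensive. */
         |       atomic_fetch_add_explicit(&op->ob_refcnt, 1, memory_order_relaxed);
         |   }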
        
           | a1369209993 wrote:
           | > to replace non-atomic reference count operations with their
           | atomic equivalents.
           | 
           | Nope, my proposal still uses two reference counts (one
           | atomic, one local); it just avoids having a separate flag bit
           | to indicate that the owning thread is done.
        
         | cogman10 wrote:
         | Yeah, I'm not exactly getting all the complexity here.
         | 
         | I'm digging the 2 reference counters, that makes sense to me,
         | but I don't know why it isn't something more like:
         | 
         | "every time a new thread takes a reference, atomic +1, every
         | time a thread's local count hits 0, atomic -1. If the shared
         | count is 0, free".
         | 
         | IDK what special purpose the flags are serving here.
        
           | [deleted]
        
           | jeremyjh wrote:
           | Most objects are never shared so there would be a performance
           | impact from incrementing (and decrementing) an atomic counter
           | even just once.
        
             | jetrink wrote:
             | That is true, but what if the shared count were initialized
             | to one and the creator thread frees an object when the
             | shared count is equal to one and the local count is
             | decremented to zero? (Since it knows it holds one shared
             | reference.) Then the increment and decrement would be
             | avoided for non-shared objects.
        
               | a1369209993 wrote:
               | I probably should have mentioned that explicitly; edited.
        
           | kzrdude wrote:
           | To have one local count per thread would add memory overhead,
           | I think? In his solution there are only two counters per
           | object, local and shared.
           | 
           | Also, another thread can't know whether or not it's the first
           | time it's taking a reference to an object.
        
         | AgentME wrote:
         | >Edit: I probably should have explicitly noted that, as jetrink
         | points out, the object is initialized with a atomic refcount of
         | one (the "local refcount is nonzero" reference), and destroyed
         | when the atomic refcount is one and to-be-decremented, so a
         | purely local object never has atomic writes.
         | 
         | I think you're under the impression that the refcount would
         | only ever need to be incremented if the object was shared to
         | another thread, but that's not the case. The refcount isn't a
         | count of how many threads have a reference to the object; it's
         | a count of how many objects and closures have a reference to
         | the object. Even objects that never leave their creator thread
         | are likely to have their reference count incremented and
         | decremented a few times over their life.
        
           | a1369209993 wrote:
           | > Even objects that never leave their creator thread will be
           | likely to have their reference count incremented and
           | decremented a few times over their life.
           | 
           | I think you're under the impression that there's only one
           | refcount. The point of the original design (and this one) is
           | that there are two refcounts: one that's updated only by the
           | thread that created the object, and therefore doesn't need to
           | use slow atomic accesses, and one that's updated atomically,
           | and therefore can be adjusted by arbitrary threads.
        
             | AgentME wrote:
             | Oh, I misunderstood you then. I thought you were trying to
             | get rid of the local refcount and make the atomic one
             | handle its job too, but what you're suggesting is a
             | possible simplification of the logic that detects when it's
             | time to destroy the object. That makes sense; it just seems
             | like a more minor change than I thought you were going for,
             | and I guess I missed it.
        
         | jeremyjh wrote:
         | Threads don't hold references - other objects do and we have to
         | know how many do. If threads held a reference it might never be
         | released. Since most objects are never shared we wouldn't want
         | to increment an atomic counter even once for those.
        
           | a1369209993 wrote:
           | I don't _think_ the objection you're actually making is
           | valid (the extra atomic reference is just a representation of
           | the fact that the local refcount is nonzero), but come to
           | think of it, even in the original version, how the heck does
           | a thread know whether a reference held by (say) a dictionary
           | that is itself accessible to multiple threads was increfed by
           | the owning thread or another thread?
        
       | lormayna wrote:
       | Why not use something like trio or curio? They are quite easy to
       | learn, very powerful, and have an approach similar to channels in
       | golang.
        
         | synchronizing wrote:
         | async != multithreading
        
         | calpaterson wrote:
         | Those are not multithreading, they are asynchronous IO, which is
         | different. With asynchronous IO in Python, the only
         | concurrency/parallelism you can do is for IO.
         | 
         | Multithreading in Python currently has the same limitation,
         | though it needn't.
        
       | rich_sasha wrote:
       | This is some of the best news I read in a while!
       | 
       | Multiprocessing sort of works but it's really sucky.
        
       | jeremyis wrote:
       | Way to go Sam! Mark my words: our generation's Carmack!
        
       | dsr_ wrote:
       | I'm going to assume that there is a reason this isn't controlled
       | by a switch, so that the default is a single-threaded program
       | and the programmer needs to state explicitly that this one will
       | be multi-threaded, upon which the interpreter changes into the
       | atomic mode for the rest of execution?
        
         | toxik wrote:
         | That would be expensive.
        
         | mikepurvis wrote:
         | Basically no one would get the glorious single-threaded
         | performance then, since the first time you pip install
         | anything, you're going to discover that it spins up a thread
         | under the hood that you're never exposed to.
         | 
         | Or worse, you end up with the async schism all over again, with
         | new "threadless" versions of popular libraries springing up.
        
         | __s wrote:
         | Most references are thread-local, so this implementation will
         | still beat out atomic refcounts in a multi-threaded app.
        
       | sandGorgon wrote:
       | Has anyone built and run this in Docker? I'd love to test this
       | out - I don't have a lot of experience compiling Python inside
       | Docker.
       | 
       | EDIT: there is a dockerfile in there
       | https://raw.githubusercontent.com/colesbury/nogil/nogil/Dock...
        
       | cormacrelf wrote:
       | > _"biased reference counts" and is described in this paper by
       | Jiho Choi et al. With this scheme, the reference count in each
       | object is split in two, with one "local" count for the owner
       | (creator) of the object and a shared count for all other threads_
       | 
       | > _The interpreter's memory allocator has been replaced with
       | mimalloc_
       | 
       | These are very similar ideas!
       | 
       | Mimalloc is notable for its use of separate local and remote free
       | lists, where objects that are being freed from a different thread
       | than the page's heap's owner are placed in a separate queue. The
       | local free list is (IIRC) non-atomic until it is empty and local
       | allocs start pulling from the remote queue.
       | 
       | The general idea is clearly lazy support for concurrency,
       | matching up perfectly with Python's need to keep any single-
       | threaded perf it has. I'm impressed with the application of all
       | of these things at once.
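       | 
       | Loosely, the shape of that design (a generic sketch of the
       | local/remote free-list idea, not mimalloc's actual code; its
       | per-page lists and heap ownership are more involved):
       | 
       |   #include <stdatomic.h>
       |   #include <pthread.h>
       |   #include <stddef.h>
       |   
       |   typedef struct block { struct block *next; } block_t;
       |   
       |   typedef struct {
       |       pthread_t          owner;
       |       block_t           *local_free;   /* owner only, no atomics */
       |       _Atomic(block_t *) remote_free;  /* any thread, lock-free  */
       |   } page_t;
       |   
       |   void page_free(page_t *pg, block_t *b) {
       |       if (pthread_equal(pthread_self(), pg->owner)) {
       |           b->next = pg->local_free;    /* fast, non-atomic push */
       |           pg->local_free = b;
       |       } else {
       |           /* Push onto the remote list with a CAS loop. */
       |           block_t *head = atomic_load(&pg->remote_free);
       |           do { b->next = head; }
       |           while (!atomic_compare_exchange_weak(&pg->remote_free, &head, b));
       |       }
       |   }
       |   
       |   /* Owner-side allocation: use the local list; only when it runs
       |    * dry, take the whole remote list in a single atomic swap. */
       |   block_t *page_alloc(page_t *pg) {
       |       if (!pg->local_free)
       |           pg->local_free = atomic_exchange(&pg->remote_free, NULL);
       |       block_t *b = pg->local_free;
       |       if (b) pg->local_free = b->next;
       |       return b;
       |   }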
        
       | marris wrote:
       | How big a problem is the possible breakage of C extensions for
       | new code? Is there currently some standard "future proofed for
       | multi-thread" way of writing them that will reduce the odds of
       | the C extension breaking? And maybe also being compatible with
       | PyPy? Or do developers today need to write a separate version for
       | each interpreter that they want to support?
        
         | heavyset_go wrote:
         | There are projects[1] that are abstracting away the C extension
         | interface in order to standardize C extensions across
         | implementations and prevent breaking changes.
         | 
         | [1] https://github.com/hpyproject/hpy
        
       | singhrac wrote:
       | Notably the dev proposing this (Sam Gross aka colesbury) is/was a
       | major PyTorch developer, so someone quite familiar with high
       | performance Python and extensions.
        
         | ajtulloch wrote:
         | and he's a genius!
        
           | jeremyis wrote:
           | +1 ! Though, not a very good Oculus player (yet)!
        
       | mzs wrote:
       | Yikes, C extensions can't assume they are under the GIL by default:
       | 
       | https://github.com/colesbury/numpy/commits/v1.19.3-nogil
        
         | kzrdude wrote:
         | It looks like a total of four lines needed changing in numpy
         | due to his change. That's a very good score in my book; numpy
         | is huge.
        
           | Jweb_Guru wrote:
           | Unfortunately, every C extension will need to undergo manual
           | review for safety, unless there's some very easy way to have
           | the C extension opt into using the GIL. And some of them will
           | be close to impossible to detangle in this way.
        
             | veryupwork wrote:
             | no
        
           | int_19h wrote:
           | It really depends on how the library is written, and how much
           | shared data it has. It has been very common to use the GIL as a
           | general-purpose synchronization mechanism in native Python
           | modules, since you have to pay that tax either way.
        
       | ikiris wrote:
       | I was half expecting a link to Go or Rust.
        
       | jeffybefffy519 wrote:
       | I don't know about others, but I really enjoy content about the
       | Python GIL. It's a fascinatingly complex problem.
        
         | lucb1e wrote:
         | For a minute I thought I finally found someone else who likes
         | the GIL, but then you said _content about_. Programs that just
         | divide up work across processes are much easier to write
         | without introducing obscure bugs due to the lack of atomicity.
         | I'm definitely excited for a GIL-less Python, even if it's a
         | rare scenario where it makes sense to try to write performant
         | code in Python in the first place rather than offloading a few
         | lines to another language, but I am a bit afraid that people
         | (particularly beginner programmers) will reach for this too
         | eagerly. Having seen recommendations for this-or-that threading
         | method going around in other languages, threads get recommended
         | much more often than they make sense, and beginners won't yet
         | have the comparative experience of writing multi-process code
         | instead.
         | 
         | That said, I am also always interested in GIL-related content
         | like this! Loved the article.
        
         | solarmist wrote:
         | Yup. Me too, but I'm not sad to see it go.
        
       | phkahler wrote:
       | >> If that bit is set, the interpreter doesn't bother tracking
       | references for the relevant object at all. That avoids contention
       | (and cache-line bouncing) for the reference counts in these
       | heavily-used objects. This "optimization" actually slows single-
       | threaded accesses down slightly, according to the design
       | document, but that penalty becomes worthwhile once multi-threaded
       | execution becomes possible.
       | 
       | I was going to say do the opposite: set the bit if you want
       | counting, and then modify the increment and decrement to add or
       | subtract the bit, thereby eliminating condition checking and
       | branching. But it sounds like the concern is cache behavior when
       | the count is written. Checking the bit can avoid any modification
       | at all.
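       | 
       | i.e. something like this (invented names; note the += still writes
       | the field even when the flag is 0, so the cache line gets dirtied
       | either way, which is exactly the concern above):
       | 
       |   typedef struct {
       |       long     refcnt;
       |       unsigned counted;   /* 1 = normally counted object, 0 = "immortal" */
       |   } obj;
       |   
       |   /* Branchless: immortal objects just add or subtract zero. */
       |   static inline void obj_incref(obj *o) { o->refcnt += o->counted; }
       |   static inline void obj_decref(obj *o) { o->refcnt -= o->counted; }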
        
         | Jtsummers wrote:
         | Under this scheme, objects get freed when both local and shared
         | counts are zero. By using a special value that makes the shared
         | count non-zero (for eternal and long-lived objects), it ensures
         | that should the owner (for some reason) drop them, they will
         | not be freed. No extra logic has to be introduced, the shared
         | count is non-zero and that's all that's needed to prevent
         | freeing.
        
       ___________________________________________________________________
       (page generated 2021-10-15 23:00 UTC)