[HN Gopher] Goroutines Under the Hood (2020)
       ___________________________________________________________________
        
       Goroutines Under the Hood (2020)
        
       Author : s3micolon0
       Score  : 107 points
       Date   : 2022-06-03 12:21 UTC (2 days ago)
        
 (HTM) web link (osmh.dev)
 (TXT) w3m dump (osmh.dev)
        
       | rounakdatta wrote:
       | While the blog is a great introductory post,
       | https://www.youtube.com/watch?v=KBZlN0izeiY is a great watch if
       | you're interested in the magical optimizations in goroutine
       | scheduling.
        
       | qwertox wrote:
        | I've been in love with Python's asyncio for some time now,
        | but I know that Go has coroutines integrated as first-class
        | citizens.
       | 
       | This article (which I have not read but just skimmed) made me
       | search for a simple example, and I landed at "A Tour of Go -
       | Goroutines"[0]
       | 
       | That is one of the cleanest examples I've ever seen on this
       | topic, and it shows just how well integrated they are in the
       | language.
       | 
       | [0] https://go.dev/tour/concurrency/1
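        | 
        | For reference, the tour's example is roughly this shape (a
        | minimal sketch from memory, so check the link for the real
        | thing):
        | 
        |   package main
        |   
        |   import (
        |       "fmt"
        |       "time"
        |   )
        |   
        |   // say prints s five times, sleeping between prints so
        |   // the two goroutines visibly interleave.
        |   func say(s string) {
        |       for i := 0; i < 5; i++ {
        |           time.Sleep(100 * time.Millisecond)
        |           fmt.Println(s)
        |       }
        |   }
        |   
        |   func main() {
        |       go say("world") // runs concurrently with main
        |       say("hello")    // main exits after this returns
        |   }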
        
         | throwaway894345 wrote:
         | Having used both in production for many years, Go's model is
         | waayyyy better, mostly because Python's model results in a
         | bunch of runtime bugs.
         | 
          | The least serious are type errors, like forgetting to
          | await an async function--these can be caught with a type
          | checker (although that means running a type checker in
          | your CI and having annotations for all of your
          | dependencies).
          | 
          | The most serious are the ones where someone calls a sync
          | or CPU-heavy function (directly or transitively) and it
          | starves the event loop, causing timeouts in unrelated
          | endpoints and eventually bringing down the entire
          | application (load shedding can help mitigate this
          | somewhat). Go dodges these problems by not having sync
          | functions at all (everything is async under the covers),
          | and its parallelism means CPU-bound workloads don't block
          | the whole event loop.
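          | 
          | To make the contrast concrete, a contrived sketch (my
          | example, assuming Go 1.14+ async preemption); the
          | equivalent sync loop inside an asyncio event loop would
          | freeze every other task:
          | 
          |   package main
          |   
          |   import (
          |       "fmt"
          |       "time"
          |   )
          |   
          |   func main() {
          |       // A CPU-bound goroutine with no I/O and no
          |       // explicit yields. In Python asyncio this would
          |       // starve the event loop; in Go the loop below
          |       // keeps running regardless.
          |       go func() {
          |           n := 0
          |           for {
          |               n++
          |           }
          |       }()
          |   
          |       for i := 0; i < 3; i++ {
          |           time.Sleep(100 * time.Millisecond)
          |           fmt.Println("still responsive:", i)
          |       }
          |   }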
        
       | aaronbwebber wrote:
       | I love Go and goroutines, but...
       | 
       | > A newly minted goroutine is given a few kilobytes
       | 
       | a line later
       | 
       | > It is practical to create hundreds of thousands of goroutines
       | in the same address space
       | 
        | So it's not practical to create 100s of Ks of goroutines.
        | It's possible, sure, but since you incur GBs of memory
        | overhead when you actually create that many, for any
        | practical problem you are going to want to stick to a few
        | thousand goroutines. I can almost guarantee you have
        | something better to do with those GBs of memory than store
        | goroutine stacks.
        | 
        | Asking the scheduler to handle 100s of Ks of goroutines is
        | not a great idea in my experience either.
        
         | barsonme wrote:
         | In addition to the other comments about memory usage, I'll
         | mention that there is a proposal (that's either going to make
         | it into Go 1.19 or 1.20?) that uses heuristics to determine a
         | good starting stack size for goroutines.
        
         | sumy23 wrote:
          | My experience is that whatever you're doing with the
          | goroutine usually becomes the bottleneck before the
          | goroutine itself. E.g. if you make a network request, you
          | become network-bound before you become memory-bound from
          | goroutines.
        
         | hamburglar wrote:
         | If you asked me what "a few kb" times "hundreds of thousands"
         | is, I'd have characterized it as "more than a few hundreds of
         | thousands of kb", not necessarily "gigabytes," and that doesn't
         | sound impractical at all. My JVM heaps are usually 16GB.
         | 
          | And Go actually does a pretty good job of scheduling
          | hundreds of thousands of threads. 6 months ago I did some
          | fairly abusive high-thread-count experiments, solving
          | problems in silly ways that required all N goroutines to
          | participate, and I didn't see much perf falloff on my
          | laptop until I got to 1.5-2 million goroutines.
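          | 
          | If anyone wants to eyeball the per-goroutine memory cost
          | themselves, a rough sketch (my code, and it will want a
          | couple of GB of RAM; numbers vary by Go version and by
          | what the goroutines actually do):
          | 
          |   package main
          |   
          |   import (
          |       "fmt"
          |       "runtime"
          |       "sync"
          |   )
          |   
          |   func main() {
          |       const n = 500_000
          |       var ms runtime.MemStats
          |       runtime.ReadMemStats(&ms)
          |       before := ms.Sys // memory obtained from the OS
          |   
          |       var wg sync.WaitGroup
          |       stop := make(chan struct{})
          |       wg.Add(n)
          |       for i := 0; i < n; i++ {
          |           go func() {
          |               wg.Done()
          |               <-stop // park until measured
          |           }()
          |       }
          |       wg.Wait()
          |   
          |       runtime.ReadMemStats(&ms)
          |       fmt.Printf("~%d KiB per goroutine\n",
          |           (ms.Sys-before)/n/1024)
          |       close(stop)
          |   }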
        
         | uluyol wrote:
          | Why is spending GBs on stack space a bad thing?
          | Ultimately, in a server, you need to store state for each
          | request. Whether that's on the stack or the heap, it's
          | still memory that has to be used.
        
           | guenthert wrote:
            | Despite popular belief, not everything is a (web)
            | server. I can imagine many threads being appealing in
            | e.g. simulations.
        
           | jjnoakes wrote:
           | If you need the stack space then there is no difference. The
           | difference arises because if you preallocate all that stack
           | space using worst case stack sizes and don't use most of it,
           | you've wasted lots of memory.
           | 
           | Also there is a ton of nuance here like overcommitted pages
           | and large address spaces which mitigate some of those
           | downsides.
        
         | scottlamb wrote:
          | > So it's not practical to create 100s of Ks of
          | goroutines. It's possible, sure, but since you incur GBs
          | of memory overhead when you actually create that many,
          | for any practical problem you are going to want to stick
          | to a few thousand goroutines. I can almost guarantee you
          | have something better to do with those GBs of memory than
          | store goroutine stacks.
         | 
         | You lost me in a couple places:
         | 
         | 1) "GBs of memory overhead" being a lot. A rule of thumb I've
         | seen in a datacenter situation is that (iirc) 1 hyperthread and
         | 6 GiB of RAM are roughly equivalent in cost. (I'm sure it
         | varies over processor/RAM generations, so you should probably
         | check this on your platform rather than take my word for it.) I
         | think most engineers are way too stingy with RAM. It often
         | makes sense to use more of it to reduce CPU, and to just spend
         | it on developer convenience. Additionally, often one goroutine
         | matches up to one incoming or outgoing socket connection (see
         | below). How much RAM are you spending per connection on socket
         | buffers? Probably a lot more than a few kilobytes...
         | 
         | 2) The idea that you target a certain number of goroutines.
         | They model some activity, often a connection or request. I
         | don't target a certain number of those; I target filling the
         | machine. (Either the most constrained resource of
         | CPU/RAM/SSD/disk/network if you're the only thing running
         | there, or a decent chunk of it with Kubernetes or whatever,
         | bin-packing to use all dimensions of the machine as best as
          | possible.) Unless the goroutines' work is exclusively
          | CPU-bound, of course, in which case you want their count
          | to match the number of CPUs available, so thousands is
          | already too many.
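          | 
          | For the mixed case, one pattern is to keep one goroutine
          | per connection but gate the CPU-heavy part with a
          | semaphore sized to the machine (a sketch of mine;
          | "handle" and "crunch" are made-up names):
          | 
          |   package main
          |   
          |   import "runtime"
          |   
          |   // One slot per CPU; goroutines blocked on I/O never
          |   // hold a slot.
          |   var cpuSlots = make(chan struct{}, runtime.NumCPU())
          |   
          |   // handle runs once per connection/request.
          |   func handle(req []byte) {
          |       cpuSlots <- struct{}{}        // acquire a CPU slot
          |       defer func() { <-cpuSlots }() // release it
          |       crunch(req)
          |   }
          |   
          |   func crunch(req []byte) { /* CPU-bound work */ }
          |   
          |   func main() { handle([]byte("example")) }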
        
           | deepsun wrote:
           | > I think most engineers are way too stingy with RAM. It
           | often makes sense to use more of it to reduce CPU, and to
           | just spend it on developer convenience.
           | 
           | Hey! That's Java's argument!
        
           | morelisp wrote:
            | I agree that GBs for 100Ks of goroutines is not, in
            | some sense, "a lot", in that you might still be using
            | memory pretty effectively. But I don't see that a "6GB
            | vs 1 core" tradeoff makes any sense to talk about.
           | 
           | We have HTTP ingress that needs ~100 cores but could
           | theoretically all fit in 1GB. We have k/v stores that need
           | only 16 cores but would like 500GB. And we have data points
           | at most places in-between. We can't give the ingress 600GB
           | instead, and we can't give the k/v stores 100 cores. So the
           | fact they're financially interchangeable is meaningless for
           | capacity planning.
           | 
           | Arguably, for _most code_ and especially in a GCd language,
           | using less memory and less CPU go hand-in-hand.
        
             | scottlamb wrote:
             | If you are in aggregate making good use of all the
             | dimensions of the available machines/VMs, great. I think
             | often people either leave one dimension unused or (when
             | buying their own hardware / selecting a VM shape) could be
             | adding more RAM cheaply.
             | 
             | > Arguably, for _most code_ and especially in a GCd
             | language, using less memory and less CPU go hand-in-hand.
             | 
              | Agreed in general. Even in a non-GC language, less
              | dense data structures mean worse CPU cache
              | utilization. But on the other hand, memoization and
              | the like can provide a real trade-off.
             | 
             | In this case, I don't think it's costing much CPU. The GC
             | isn't traversing beyond the bounds of the stack, and it
             | mostly shouldn't end up in the CPU cache either. (Just a
             | partial cache line at the boundary, and some more after a
             | goroutine's stack shrinks or the goroutine exits.)
        
       | geodel wrote:
        | I mean, it has many diagrams and logical explanations of
        | goroutines and concurrency concepts in general, but it is
        | definitely not an under-the-hood description.
        
         | geertj wrote:
          | I came here to write the same thing. Things I had hoped
          | this would go into are how goroutines grow their stacks
          | and how they are preempted.
        
           | ignoramous wrote:
            | There you go: Vicki Niu, _Goroutines: Under the hood_,
            | https://www.youtube-nocookie.com/embed/S-MaTH8WpOM
            | (2020).
        
       | jeffbee wrote:
       | It didn't really lift the hood at all, unfortunately. Luckily for
       | us the runtime is extensively commented, e.g.
       | https://github.com/golang/go/blob/master/src/runtime/proc.go
        
       | guenthert wrote:
       | So it's just coroutines on top of n:m scheduling, similar to what
       | SysV offered a while ago?
        
       | captainmuon wrote:
       | One thing that really goes against my intuition is that user
        | space threads (lightweight threads, goroutines) are faster than
       | kernel threads. Without knowing too much assembly, I would assume
       | any modern processor would make a context switch a one
       | instruction affair. Interrupt -> small scheduler code picks the
       | thread to run -> LOAD THREAD instruction and the processor swaps
       | in all the registers and the instruction pointer.
       | 
       | You probably can't beat that in user space, especially if you
       | want to preempt threads yourself. You'd have to check after every
       | step, or profile your own process or something like that. And
       | indeed, Go's scheduler is cooperative.
       | 
       | But then, why can't you get the performance of Goroutines with OS
       | threads? Is it just because of legacy issues? Or does it only
       | work with cooperative threading, which requires language support?
       | 
       | One thing I'm missing from that article is how the
       | cooperativeness is implemented. I think in Go (and in Java's
       | Project Loom), you have "normal code", but then deep down in
       | network and IO functions, you have magic "yield" instructions. So
       | all the layers above can pretend they are running on regular
       | threads, and you avoid the "colored function problem", but you
       | get runtime behavior similar to coroutines. Which only works if
       | really every blocking IO is modified to include yielding
       | behavior. If you call a blocking OS function, I assume something
       | bad will happen.
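        | 
        | The explicit form of such a yield exists in Go as
        | runtime.Gosched. A toy sketch of mine (newer Go versions
        | can also preempt asynchronously, as the replies note):
        | 
        |   package main
        |   
        |   import (
        |       "fmt"
        |       "runtime"
        |   )
        |   
        |   func main() {
        |       runtime.GOMAXPROCS(1) // one OS thread: yields matter
        |       done := make(chan struct{})
        |       go func() {
        |           fmt.Println("other goroutine got to run")
        |           close(done)
        |       }()
        |       for i := 0; i < 1000; i++ {
        |           runtime.Gosched() // explicit yield to scheduler
        |       }
        |       <-done
        |   }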
        
         | heavyset_go wrote:
          | I don't have any specific recommendations to give you,
          | but skim through an operating systems textbook, or a
          | college course that puts its slides and whatnot online,
          | for the part on kernel context switching. It'll give you
          | an idea of what kind of work a kernel must do when
          | context switching between threads and processes, and why
          | userspace multitasking can be more efficient.
        
         | garaetjjte wrote:
         | One of the issues is that OS schedulers are complex, and
          | actually much more expensive than the context switch itself. You
         | can mitigate this with user-mode scheduling of kernel threads:
         | https://www.youtube.com/watch?v=KXuZi9aeGTw
        
         | danachow wrote:
         | > modern processor would make a context switch a one
         | instruction affair.
         | 
         | Reduced instruction set (complexity) is a hallmark of modern
         | processor designs, not the other way around.
         | 
          | You might want to read about what is involved in a task
          | switch (either a "thread" with the same memory mapping,
          | or a "process"), but it is not something that can
          | reasonably be carried out in one instruction.
        
         | duped wrote:
         | > I would assume any modern processor would make a context
         | switch a one instruction affair. Interrupt -> small scheduler
         | code picks the thread to run -> LOAD THREAD instruction and the
         | processor swaps in all the registers and the instruction
         | pointer.
         | 
         | It can't be a single instruction, since the details of what a
         | "context" contains depends on the OS and ABI. For example on
         | Linux, the signal mask is a part of the OS thread context (but
         | usually not user thread contexts) which requires a syscall to
         | retrieve it from kernel memory before saving it in the context.
         | 
         | The reason why user threads are so much faster than OS threads
         | is precisely because it can be reduced to a handful of
         | instructions without caring about all the details that OS
         | threads need to care about.
         | 
         | > Which only works if really every blocking IO is modified to
         | include yielding behavior. If you call a blocking OS function,
         | I assume something bad will happen.
         | 
          | That's exactly what Go does: it introduces yield points
          | into function prologues and I/O ops. You don't have
          | direct FFI calls in Go, so it's not as big of an issue.
          | It's roughly the same problem as GC safepoints in
          | multithreaded interpreters that support FFI.
        
         | pron wrote:
         | There are a few reasons why context switching in user mode
         | could be faster, but that has little if anything to do with the
         | performance benefits of usermode threads. The performance
         | benefit of usermode threads is a result of their _quantity_ ,
         | due to Little's law. They're not "faster", just more numerous,
         | and that's what you need for higher throughput. More precisely,
         | OS threads, because of their scarcity, introduce an artificial
         | bound on throughput that's lower than what the hardware can
         | support, and usermode threads remove that bound.
         | 
         | More here: https://inside.java/2020/08/07/loom-performance/
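          | 
          | To make that concrete with Little's law (L = lambda * W;
          | my numbers, not the article's): sustaining 100,000
          | requests/s at an average of 50 ms per request means
          | ~5,000 requests in flight; at 500 ms it's 50,000, well
          | past what OS threads comfortably give you but trivial
          | for usermode threads.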
         | 
         | As to why it's hard for the OS to allow that many threads, the
         | OS would need to keep thread stacks small and resizable, and
         | that is hard to do if you don't know the specifics of how the
         | language uses the stack. For example, to accommodate low-level
         | languages that allow pointers into the stack you would need to
         | manipulate virtual memory (to keep addresses valid), but that
         | only works at a page granularity, or to introduce split stacks,
         | which would require a new kind of ABI known to compilers (and
         | would probably have a cost to performance).
        
           | ibraheemdev wrote:
           | > More precisely, OS threads, because of their scarcity,
           | introduce an artificial bound on throughput that's lower than
           | what the hardware can support, and usermode threads remove
           | that bound.
           | 
           | Why are OS threads scarce? The OS allocates thread stacks
           | lazily. Given a kernel stack of ~8kb (two pages) and a user
           | stack of ~4kb, one could spawn 100k threads with just over
           | 1GB. A userspace runtime will allow you to bring that number
            | down, but if you're at that scale of concurrency it is
            | unlikely to matter much.
        
         | masklinn wrote:
         | > And indeed, Go's scheduler is cooperative.
         | 
          | It hasn't been cooperative for a few versions now; the
          | scheduler became preemptive in 1.14. And before that
          | there were yield points at every function prologue (as
          | well as in all IO primitives), so there were relatively
          | few situations where cooperation was necessary.
         | 
         | > Without knowing too much assembly, I would assume any modern
         | processor would make a context switch a one instruction affair.
         | 
          | Any context switch (to the kernel) is expensive, and way
          | more than a single operation. The kernel also has a ton
          | of stuff to do; it's not just "pick the thread to run".
          | You have to restore the IP and SP, but you may also have
          | to restore FPU/SSE/AVX state (AVX-512 is over 2KB of
          | state) and trap state.
         | 
         | Kernel-level context switching costs on the order of 10x what
         | userland context switching does:
         | https://eli.thegreenplace.net/2018/measuring-context-switchi...
         | 
         | > LOAD THREAD
         | 
          | There is no LOAD THREAD instruction.
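          | 
          | A crude way to see the userland number on your own
          | machine (a sketch of mine; one round trip is two channel
          | handoffs, and absolute numbers vary a lot by hardware
          | and Go version):
          | 
          |   package main
          |   
          |   import (
          |       "fmt"
          |       "time"
          |   )
          |   
          |   func main() {
          |       const rounds = 1_000_000
          |       ping := make(chan struct{})
          |       pong := make(chan struct{})
          |       go func() {
          |           for i := 0; i < rounds; i++ {
          |               <-ping
          |               pong <- struct{}{}
          |           }
          |       }()
          |       start := time.Now()
          |       for i := 0; i < rounds; i++ {
          |           ping <- struct{}{}
          |           <-pong
          |       }
          |       fmt.Println("per round trip:",
          |           time.Since(start)/rounds)
          |   }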
        
           | jaytaylor wrote:
            | > It hasn't been cooperative for a few versions now;
            | the scheduler became preemptive in 1.14. And before
            | that there were yield points at every function prologue
            | (as well as in all IO primitives), so there were
            | relatively few situations where cooperation was
            | necessary.
           | 
            | Since co-op was mostly unnecessary, do you know why it
            | was changed to preemptive, or what the specific cases
            | were that preemptive scheduling resolves?
        
             | mseepgood wrote:
             | Tight loops without function calls.
        
               | kgeist wrote:
               | IIRC in earlier versions, an infinite loop without
               | function calls could freeze the entire runtime: GC's stop
               | the world event is triggered => goroutines are being
               | suspended cooperatively => but this one goroutine never
               | enters a function prolog and thus never suspends => all
               | goroutines except one are suspended and there's no
               | progress. Preemptive scheduling is much more robust.
                | It's solvable in cooperative scheduling with an
                | additional suspension check at the end of each loop
                | iteration, but that adds overhead to all loops. If
                | I remember correctly,
               | .NET or JVM implement safe points for GC (which can be
               | used to switch contexts cooperatively as well) by simply
               | reading a pointer from a special preallocated virtual
               | memory page which is remapped to nothing when a stop-the-
               | world event is triggered, so such a read traps into an
               | invalid memory handler where you can park your thread.
               | But I'm not sure how costly it is for thousands/millions
               | of coroutines.
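                | 
                | A minimal repro of the pre-1.14 failure mode, if
                | memory serves (hangs on Go <= 1.13 as written,
                | completes on 1.14+):
                | 
                |   package main
                |   
                |   import (
                |       "fmt"
                |       "runtime"
                |   )
                |   
                |   func main() {
                |       // A single P, shared by all goroutines.
                |       runtime.GOMAXPROCS(1)
                |       go func() {
                |           // No function calls => no yield
                |           // points before Go 1.14.
                |           for {
                |           }
                |       }()
                |       runtime.Gosched() // hand the P to the loop
                |       // Pre-1.14 we never get here: the loop
                |       // cannot be preempted. With async
                |       // preemption (1.14+) this prints.
                |       fmt.Println("preempted the busy loop")
                |   }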
        
               | jesboat wrote:
               | > But I'm not sure how costly it is for
               | thousands/millions of coroutines.
               | 
               | Still cheap: you only need to preempt the threads which
               | are actively running user code. If a coroutine is ready
               | to run, but not actually running, you don't have to do
               | anything with it (as long as you check for safepoints
               | before entering user code.) That means your safepoints
               | cost is `O(os threads currently running user code)` which
               | in most runtimes is `O(num cores)`
        
               | throwaway894345 wrote:
               | Replace "tight" with "long-running/infinite" but yeah,
               | otherwise this is correct.
        
         | YZF wrote:
         | There is a ton of context associated with OS/kernel threads.
         | Virtual memory, security, I/O. While there is some hardware
         | acceleration for those in modern processors there isn't
         | anything like LOAD THREAD and even with CPU support it's still
         | very expensive.
         | 
         | You get an interrupt, then the kernel needs to load its own
         | context (the tables it needs to access), then the kernel needs
         | to do the expensive switch.
         | 
         | In user space you have a lot less context. The actual switching
         | is pretty much the cost of a function call. If you need
         | preemption that's a different story and mostly depends on what
         | facilities are available for that. Inserting preemption checks
         | is a little hacky (hello Go ;) ) but what can you do.
         | 
          | EDIT: It's worth noting there are indirect costs, like
          | stale caches. Lightweight/green threads will often work
          | on shared data structures, so the caches are more likely
          | to have something useful in them. They may even share
          | the code.
        
         | asien wrote:
         | >I would assume any modern processor would make a context
         | switch a one instruction affair.
         | 
          | That has been the historic assumption, and it has been
          | proven wrong by every possible benchmark.
          | 
          | Consider TechEmpower[0] for raw stack performance:
          | runtime-level threads outperform OS threads, since OS
          | threads were designed to be mapped onto physical cores.
          | This is very expensive and inefficient.
          | 
          | Creating one thread for every request you have (Apache +
          | PHP) will exhaust the hardware after a few thousand qps.
          | 
          | A runtime can indeed have millions of those "lightweight
          | threads" without killing your machine, since it creates
          | a pool of physical threads and taps into IO events to
          | efficiently switch or resume contexts. This is far
          | faster.
         | 
         | [0]
         | https://www.techempower.com/benchmarks/#section=data-r20&hw=...
        
         | gpderetta wrote:
          | Just the interrupt itself is going to cost a couple of
          | orders of magnitude more than the whole userspace context
          | switch.
        
       | [deleted]
        
       | yellow_lead wrote:
       | (2020) But so high level it's still relevant.
        
       | bogomipz wrote:
       | In the conclusion the author states:
       | 
       | >"Go run-time scheduler multiplexes goroutines onto threads and
       | when a thread blocks, the run-time moves the blocked goroutines
       | to another runnable kernel thread to achieve the highest
       | efficiency possible."
       | 
        | Why would the Go run-time move the blocked goroutines to
        | another runnable kernel thread? If it is currently blocked,
        | it won't be schedulable regardless, no?
        
       ___________________________________________________________________
       (page generated 2022-06-05 23:00 UTC)