[HN Gopher] Goroutines Under the Hood (2020) ___________________________________________________________________ Goroutines Under the Hood (2020) Author : s3micolon0 Score : 107 points Date : 2022-06-03 12:21 UTC (2 days ago) (HTM) web link (osmh.dev) (TXT) w3m dump (osmh.dev) | rounakdatta wrote: | While the blog is a great introductory post, | https://www.youtube.com/watch?v=KBZlN0izeiY is a great watch if | you're interested in the magical optimizations in goroutine | scheduling. | qwertox wrote: | I've been in love with Python's asyncio for some time now, but | I know that Go has coroutines integrated as first-class | citizens. | | This article (which I have not read but just skimmed) made me | search for a simple example, and I landed at "A Tour of Go - | Goroutines"[0] | | That is one of the cleanest examples I've ever seen on this | topic, and it shows just how well integrated they are in the | language. | | [0] https://go.dev/tour/concurrency/1 | throwaway894345 wrote: | Having used both in production for many years, Go's model is | waayyyy better, mostly because Python's model results in a | bunch of runtime bugs. | | The least serious are type errors like forgetting to | await an async function--these can be caught with a type | checker (although this means you need to have a type checker | running in your CI and annotations for all of your | dependencies). | | The most serious are the ones where someone calls a sync or | CPU-heavy function (directly or transitively) and it starves | the event loop, causing timeouts in unrelated endpoints and | eventually bringing down the entire application (load shedding | can help mitigate this somewhat). Go dodges these problems by | not having sync functions at all (everything is async under the | covers), and parallelism means CPU-bound workloads don't block | the whole event loop. | aaronbwebber wrote: | I love Go and goroutines, but...
| | > A newly minted goroutine is given a few kilobytes | | a line later | | > It is practical to create hundreds of thousands of goroutines | in the same address space | | So it's not practical to create 100s of Ks of goroutines - it's | possible, sure, but the GBs of memory overhead you incur if | you are actually creating that many goroutines mean that for any | practical problem you are going to want to stick to a few | thousand goroutines. I can almost guarantee you that you have | something better to do with those GBs of memory than store | goroutine stacks. | | Asking the scheduler to handle scheduling 100s of Ks of | goroutines is also not a great idea, in my experience. | barsonme wrote: | In addition to the other comments about memory usage, I'll | mention that there is a proposal (that's either going to make | it into Go 1.19 or 1.20?) that uses heuristics to determine a | good starting stack size for goroutines. | sumy23 wrote: | My experience is that whatever you're doing with the goroutine | is usually the bottleneck before the goroutine itself. E.g. if | you make a network request, you become network bound before | memory bound from goroutines. | hamburglar wrote: | If you asked me what "a few kb" times "hundreds of thousands" | is, I'd have characterized it as "more than a few hundreds of | thousands of kb", not necessarily "gigabytes," and that doesn't | sound impractical at all. My JVM heaps are usually 16GB. | | And Go actually does a pretty good job of scheduling hundreds | of thousands of threads. 6 months ago I did some fairly abusive | high-thread-count experiments solving problems in silly ways | that required all N goroutines to participate, and I didn't see | much perf falloff on my laptop until I got to 1.5-2 million | goroutines. | uluyol wrote: | Why is spending GB on stack space a bad thing? Ultimately, in a | server, you need to store state for each request.
Whether | that's on the stack or heap, it's still memory that necessarily | has to be used. | guenthert wrote: | Despite popular belief, not everything is a (web) server. I | can imagine many threads to be appealing in e.g. simulations. | jjnoakes wrote: | If you need the stack space then there is no difference. The | difference arises because if you preallocate all that stack | space using worst case stack sizes and don't use most of it, | you've wasted lots of memory. | | Also there is a ton of nuance here like overcommitted pages | and large address spaces which mitigate some of those | downsides. | scottlamb wrote: | > So it's not practical to create 100s of Ks of goroutines - | it's possible, sure, but because you incur GBs of memory | overhead if you are actually creating that many goroutines | means that for any practical problem you are going to want to | stick to a few thousand goroutines. I can almost guarantee you | that you have something better to do with those GBs of memory | than store goroutine stacks. | | You lost me in a couple places: | | 1) "GBs of memory overhead" being a lot. A rule of thumb I've | seen in a datacenter situation is that (iirc) 1 hyperthread and | 6 GiB of RAM are roughly equivalent in cost. (I'm sure it | varies over processor/RAM generations, so you should probably | check this on your platform rather than take my word for it.) I | think most engineers are way too stingy with RAM. It often | makes sense to use more of it to reduce CPU, and to just spend | it on developer convenience. Additionally, often one goroutine | matches up to one incoming or outgoing socket connection (see | below). How much RAM are you spending per connection on socket | buffers? Probably a lot more than a few kilobytes... | | 2) The idea that you target a certain number of goroutines. | They model some activity, often a connection or request. I | don't target a certain number of those; I target filling the | machine. 
(Either the most constrained resource of | CPU/RAM/SSD/disk/network if you're the only thing running | there, or a decent chunk of it with Kubernetes or whatever, | bin-packing to use all dimensions of the machine as best as | possible.) Unless the goroutines' work is exclusively CPU-bound, | of course, in which case you want them to match the number of | CPUs available, so thousands is already too many. | deepsun wrote: | > I think most engineers are way too stingy with RAM. It | often makes sense to use more of it to reduce CPU, and to | just spend it on developer convenience. | | Hey! That's Java's argument! | morelisp wrote: | I agree that GBs for 100Ks of goroutines is not in some | sense "a lot", in that you might still be using memory pretty | effectively. But I don't see that a "6GB vs 1 core" tradeoff | makes any sense to talk about. | | We have HTTP ingress that needs ~100 cores but could | theoretically all fit in 1GB. We have k/v stores that need | only 16 cores but would like 500GB. And we have data points | at most places in-between. We can't give the ingress 600GB | instead, and we can't give the k/v stores 100 cores. So the | fact they're financially interchangeable is meaningless for | capacity planning. | | Arguably, for _most code_ and especially in a GCd language, | using less memory and less CPU go hand-in-hand. | scottlamb wrote: | If you are in aggregate making good use of all the | dimensions of the available machines/VMs, great. I think | often people either leave one dimension unused or (when | buying their own hardware / selecting a VM shape) could be | adding more RAM cheaply. | | > Arguably, for _most code_ and especially in a GCd | language, using less memory and less CPU go hand-in-hand. | | Agreed in general. Even in a non-GC language, less dense | data structures mean worse CPU cache utilization. But on | the other hand, memoization and the like can provide a real | trade-off. | | In this case, I don't think it's costing much CPU.
The GC | isn't traversing beyond the bounds of the stack, and it | mostly shouldn't end up in the CPU cache either. (Just a | partial cache line at the boundary, and some more after a | goroutine's stack shrinks or the goroutine exits.) | geodel wrote: | I mean, it has many diagrams and logical explanations of goroutines | and concurrency concepts in general, but it is definitely not | an under-the-hood description. | geertj wrote: | I came here to write the same thing. Things I had hoped this | would go into are how goroutines grow their stacks and how they | are preempted. | ignoramous wrote: | There you go: Vicki Niu, _Goroutines: Under the hood_, | https://www.youtube-nocookie.com/embed/S-MaTH8WpOM (2020). | jeffbee wrote: | It didn't really lift the hood at all, unfortunately. Luckily for | us the runtime is extensively commented, e.g. | https://github.com/golang/go/blob/master/src/runtime/proc.go | guenthert wrote: | So it's just coroutines on top of n:m scheduling, similar to what | SysV offered a while ago? | captainmuon wrote: | One thing that really goes against my intuition is that user | space threads (lightweight threads, goroutines) are faster than | kernel threads. Without knowing too much assembly, I would assume | any modern processor would make a context switch a one | instruction affair. Interrupt -> small scheduler code picks the | thread to run -> LOAD THREAD instruction and the processor swaps | in all the registers and the instruction pointer. | | You probably can't beat that in user space, especially if you | want to preempt threads yourself. You'd have to check after every | step, or profile your own process or something like that. And | indeed, Go's scheduler is cooperative. | | But then, why can't you get the performance of Goroutines with OS | threads? Is it just because of legacy issues? Or does it only | work with cooperative threading, which requires language support?
| | One thing I'm missing from that article is how the | cooperativeness is implemented. I think in Go (and in Java's | Project Loom), you have "normal code", but then deep down in | network and IO functions, you have magic "yield" instructions. So | all the layers above can pretend they are running on regular | threads, and you avoid the "colored function problem", but you | get runtime behavior similar to coroutines. Which only works if | really every blocking IO call is modified to include yielding | behavior. If you call a blocking OS function, I assume something | bad will happen. | heavyset_go wrote: | I don't have any specific recommendations to give you, but skim | through an operating systems textbook, or a college course that | puts its slides and whatnot online, when it comes to kernel | context switching. It'll give you an idea of what kind of work | a kernel must do when context switching between threads and | processes, and why userspace multitasking can be more | efficient. | garaetjjte wrote: | One of the issues is that OS schedulers are complex, and | actually much more expensive than the context switch itself. You | can mitigate this with user-mode scheduling of kernel threads: | https://www.youtube.com/watch?v=KXuZi9aeGTw | danachow wrote: | > modern processor would make a context switch a one | instruction affair. | | Reduced instruction set (complexity) is a hallmark of modern | processor designs, not the other way around. | | You might want to read about what is involved in a task switch | (either "thread" with the same memory mapping, or "process"), but it | is not something that can reasonably be carried out in | one instruction. | duped wrote: | > I would assume any modern processor would make a context | switch a one instruction affair. Interrupt -> small scheduler | code picks the thread to run -> LOAD THREAD instruction and the | processor swaps in all the registers and the instruction | pointer.
| | It can't be a single instruction, since the details of what a | "context" contains depend on the OS and ABI. For example on | Linux, the signal mask is a part of the OS thread context (but | usually not user thread contexts), which requires a syscall to | retrieve it from kernel memory before saving it in the context. | | The reason why user threads are so much faster than OS threads | is precisely that a switch can be reduced to a handful of | instructions without caring about all the details that OS | threads need to care about. | | > Which only works if really every blocking IO call is modified to | include yielding behavior. If you call a blocking OS function, | I assume something bad will happen. | | That's exactly what Go does: it introduces yield points into | function prologues and I/O ops. You don't have direct FFI calls | in Go so it's not as big of an issue. It's roughly the same | problem as GC safepoints in multithreaded interpreters that | support FFI. | pron wrote: | There are a few reasons why context switching in user mode | could be faster, but that has little if anything to do with the | performance benefits of usermode threads. The performance | benefit of usermode threads is a result of their _quantity_, | due to Little's law. They're not "faster", just more numerous, | and that's what you need for higher throughput. More precisely, | OS threads, because of their scarcity, introduce an artificial | bound on throughput that's lower than what the hardware can | support, and usermode threads remove that bound. | | More here: https://inside.java/2020/08/07/loom-performance/ | | As to why it's hard for the OS to allow that many threads, the | OS would need to keep thread stacks small and resizable, and | that is hard to do if you don't know the specifics of how the | language uses the stack.
For example, to accommodate low-level | languages that allow pointers into the stack you would need to | manipulate virtual memory (to keep addresses valid), but that | only works at a page granularity, or to introduce split stacks, | which would require a new kind of ABI known to compilers (and | would probably have a cost to performance). | ibraheemdev wrote: | > More precisely, OS threads, because of their scarcity, | introduce an artificial bound on throughput that's lower than | what the hardware can support, and usermode threads remove | that bound. | | Why are OS threads scarce? The OS allocates thread stacks | lazily. Given a kernel stack of ~8kb (two pages) and a user | stack of ~4kb, one could spawn 100k threads with just over | 1GB. A userspace runtime will allow you to bring that number | down, but if you're at that scale of concurrency it is | unlikely to matter much. | masklinn wrote: | > And indeed, Go's scheduler is cooperative. | | It hasn't been cooperative for a few versions now; the | scheduler became preemptive in 1.14. And before that there were | yield points at every function prologue (as well as in all IO | primitives), so there were relatively few situations where | cooperation was necessary. | | > Without knowing too much assembly, I would assume any modern | processor would make a context switch a one instruction affair. | | Any context switch (to the kernel) is expensive, and way more | than a single operation. The kernel also has a ton of stuff to | do; it's not just "pick the thread to run": you have to | restore the ip and sp, but you may also have to restore FPU/SSE/AVX | state (AVX512 is over 2KB of state) and trap state. | | Kernel-level context switching costs on the order of 10x what | userland context switching does: | https://eli.thegreenplace.net/2018/measuring-context-switchi...
| | > LOAD THREAD | | There is no load-thread instruction | jaytaylor wrote: | > It hasn't been cooperative for a few versions now; the | scheduler became preemptive in 1.14. And before that there | were yield points at every function prologue (as well as in all IO | primitives), so there were relatively few situations where | cooperation was necessary. | | Since co-op was mostly unnecessary, do you know why it was | changed to preemptive, or what the specific cases were that | are resolved with preemptive scheduling? | mseepgood wrote: | Tight loops without function calls. | kgeist wrote: | IIRC in earlier versions, an infinite loop without | function calls could freeze the entire runtime: the GC's | stop-the-world event is triggered => goroutines are being | suspended cooperatively => but this one goroutine never | enters a function prologue and thus never suspends => all | goroutines except one are suspended and there's no | progress. Preemptive scheduling is much more robust. | It's solvable in cooperative scheduling with an | additional suspension check at the end of each loop iteration, but | that adds overhead for all loops. If I remember correctly, | .NET or the JVM implement safepoints for GC (which can be | used to switch contexts cooperatively as well) by simply | reading a pointer from a special preallocated virtual | memory page which is remapped to nothing when a stop-the-world | event is triggered, so such a read traps into an | invalid-memory handler where you can park your thread. | But I'm not sure how costly it is for thousands/millions | of coroutines. | jesboat wrote: | > But I'm not sure how costly it is for | thousands/millions of coroutines. | | Still cheap: you only need to preempt the threads which | are actively running user code. If a coroutine is ready | to run, but not actually running, you don't have to do | anything with it (as long as you check for safepoints | before entering user code.)
That means your safepoint | cost is `O(os threads currently running user code)`, which | in most runtimes is `O(num cores)` | throwaway894345 wrote: | Replace "tight" with "long-running/infinite" but yeah, | otherwise this is correct. | YZF wrote: | There is a ton of context associated with OS/kernel threads. | Virtual memory, security, I/O. While there is some hardware | acceleration for those in modern processors, there isn't | anything like LOAD THREAD, and even with CPU support it's still | very expensive. | | You get an interrupt, then the kernel needs to load its own | context (the tables it needs to access), then the kernel needs | to do the expensive switch. | | In user space you have a lot less context. The actual switching | is pretty much the cost of a function call. If you need | preemption that's a different story and mostly depends on what | facilities are available for that. Inserting preemption checks | is a little hacky (hello Go ;) ) but what can you do. | | EDIT: It's worthwhile noting there are indirect costs like caches | being stale. Lightweight/green threads will often work on | shared data structures so the caches are more likely to have | something useful in them. They may even share the code. | asien wrote: | >I would assume any modern processor would make a context | switch a one instruction affair. | | That has been the historic assumption, and it has been proven wrong | by every possible benchmark. | | Consider TechEmpower[0] for raw stack performance: runtime-level | threads outperform OS threads, since OS threads were | designed to be mapped onto physical cores. | | This is very expensive and inefficient. | | Creating one thread for every request you have (Apache + PHP) | will exhaust the hardware after a few thousand qps. | | Runtimes can indeed have millions of those "lightweight threads" | without killing your machine, since they create a pool from | physical threads and tap into IO events to efficiently switch | or resume contexts.
This is far faster. | | [0] | https://www.techempower.com/benchmarks/#section=data-r20&hw=... | gpderetta wrote: | Just the interrupt itself is going to cost a couple of orders of | magnitude more than the whole userspace context switch. | [deleted] | yellow_lead wrote: | (2020) But so high level it's still relevant. | bogomipz wrote: | In the conclusion the author states: | | >"Go run-time scheduler multiplexes goroutines onto threads and | when a thread blocks, the run-time moves the blocked goroutines | to another runnable kernel thread to achieve the highest | efficiency possible." | | Why would the Go run-time move the blocked goroutines to another | runnable kernel thread? If it is currently blocked it won't be | schedulable regardless, no? ___________________________________________________________________ (page generated 2022-06-05 23:00 UTC)