[HN Gopher] C++ Coroutines ___________________________________________________________________ C++ Coroutines Author : todsacerdoti Score : 98 points Date : 2023-02-22 16:31 UTC (6 hours ago) (HTM) web link (nigeltao.github.io) (TXT) w3m dump (nigeltao.github.io) | jmull wrote: | > Asynchronous code (e.g. callbacks) is more efficient (letting | you do other work while waiting for things) but is also more | complicated (manually saving and restoring state). | | This is a persistent misunderstanding about async... it is _not_ | generally more efficient than non-async code. In fact, the | overhead of async may slow things down. (There can be other | downsides as well. E.g. debugging may be more difficult.) | | For it to help, you need to have something better for the current | thread to be doing than waiting for whatever it is you're waiting | for. | | Whatever that is, another worker thread could be doing it, so | what we're really talking about is that async may let you reduce | the number of threads you're using. | | (Async is so useful in JS because it lets you reduce the number | of threads to one, which lets you sidestep all kinds of hairy | issues.) | | That might be good enough, depending on your situation, to | make the downsides of async worthwhile. But it may very well | not. | | Anyway: don't use async because it's "more efficient". | the_duke wrote: | You are leaving out an important aspect: context switches are | expensive, especially with all the mitigations introduced after | Spectre et al. | | Doing more work in a single thread saves a lot of context | switches, especially if IO makes up a large portion of the | workload. And even more so with other mechanisms to minimize | syscalls, like io_uring on Linux. | | Async overhead is highly dependent on the way the async runtime | is implemented.
| | It's true that async requires scheduling and task switching to | be implemented in userspace, but assuming that async is used as | an alternative to threads, then the OS would be doing a similar | amount of work anyway. | | The OS may be able to do that more efficiently because it has | more information about the system. Or it may be the opposite, | because the application has more information about the | workloads. | | "async = overhead" as a general statement is not correct. | otabdeveloper4 wrote: | Context switches happen regardless of whether you're using | kernel mode ("threads") or user mode ("async") scheduling. | | And kernel mode actually has much more efficient and | performant context switches, because the kernel can access | hardware directly. | | (It is sometimes useful to be able to do custom user mode | scheduling, but certainly not because of "context switches".) | [deleted] | the_duke wrote: | I don't quite follow your argument there. | | This is unrelated to kernel threading. | | If you have 1 thread handling 1000 requests with some async | io mechanism (epoll, io_uring, ...), instead of 1000 | threads each handling one request, there are far fewer | threads fighting over CPU cores and the 1 thread can stay | active much longer, hence reducing the number of context | switches. | | Especially with a mechanism like io_uring, which helps | minimize syscalls (and hence switching to kernel threads). | otherjason wrote: | Do you have a citation for kernel mode having more | efficient context switches? What kind of direct hardware | access are you referring to that would be better than | pushing the register context onto the stack? | | In my experience, the exact opposite is true, particularly | in the era of CPU mitigations that require TLB flushes upon | every kernel-mode context switch. | messe wrote: | You're right, kernel-level context switching is much | slower than user-level context switching.
| | User-level can also have the advantage of having more | actual context about the task that is running, meaning | that it's often able to avoid saving/restoring as much | data as a kernel-level switch would. See Go's green | threads for a great example of this kind of cooperation | between runtime and language. | | > Do you have a citation for kernel mode having more | efficient context switches? What kind of direct hardware | access are you referring to that would be better than | pushing the register context onto the stack? | | The closest thing to this that I can think of is on | 32-bit x86, which did have hardware-assisted context | switching via the TSS (task state segment). | | As it happens, everybody stopped using it because it was | too slow, and a bit painful unless you fully bought into | x86's awful segmentation model. Early Linux kernels used | it, if you want to see it in action. | jmull wrote: | Thread context switches are about scheduling threads on | physical cores. Async is about executing in a scope on a | thread. There's no direct conversion here. Packing a thread | with async tasks could let you reduce the number of threads | you have, which will probably reduce the thread context | switching if there is contention for cores. Whether that's | significantly better than the cost of async context switches | really depends on specifics (even if async context switch | cost is near zero, though we're no longer talking about C++ | coroutines in that case). But keep in mind: If your threads | are mostly waiting, they aren't triggering a lot of context | switches. More async means fewer busier threads, less async | means more less busy threads. To run into a problem with less | async your threads need to be forcing switching at a pretty | fine-grained level. (BTW, most blocking for IO isn't | fine-grained from the perspective of instructions executing on a | core) That kind of thing happens, but has solutions other | than async.
| naasking wrote: | > Anyway: don't use async because it's "more efficient". | | I don't think efficiency has only one meaning. For instance, | you can create millions of coroutines while you're limited to | only thousands of threads. Doesn't that mean your program is | making better use of your hardware and thus is more efficient? | otabdeveloper4 wrote: | Coroutines and threads are not that different. | | Threads usually preallocate some stack space, while | coroutines usually don't. But that isn't such a big deal. | josephg wrote: | Coroutines also don't need to flush the TLB when they get | scheduled. This is a pretty big deal for performance, when | you have a lot of them. | naasking wrote: | That's not the only difference. Coroutines also typically | don't have nearly as many concurrency hazards because you | can control when you yield. | dymk wrote: | coroutines can be either cooperative or not, depending on | implementation, so it depends | jmull wrote: | Sure, they _can be_ more efficient in _particular | situations_. | | But coroutines have overhead (so do threads). Your millions | of coroutines might be worse than your thousands of threads | in terms of making efficient use of the cpu cores your app or | service has available. | nmilo wrote: | No, your processor can only do n things at a time, where n is | the number of cores (probably like 8) (simplifying hugely of | course). It's pretty easy to fill out 8 cores with 1000s of | threads, the coroutines only become useful when almost all of | them are blocking on a socket or something, periodically | switching on and off to handle it. | [deleted] | naasking wrote: | Yes, but contexts where you're creating a million | coroutines are almost certainly I/O bound (like a web | server). | KMag wrote: | Agreed in general, that async code is generally less efficient | than epoll/kqueue/poll/select/etc. non-blocking I/O. 
However, | note that async code is often quite a bit more efficient than | naive Java 1.3-style (pre-NIO) tons-of-blocked-threads code. | | Now, io_uring async code is yet another story. | | I was working at LimeWire when Apple finally shipped a 1.4 JVM | as their system JVM. The whole thing got much lighter weight | when we could use non-blocking I/O. | junon wrote: | Eh? Async code almost always uses the APIs you describe. | KMag wrote: | The async runtime is almost always built on top of | non-blocking I/O APIs, and the extra layer of abstraction adds | overhead. You're rarely calling epoll in your async code, | but rather your runtime is handling all of that for you. | | That is, unless you're using io_uring, POSIX AIO, etc., in | which case your async code might not have the extra layer | of abstraction on top of the system APIs. | mr_00ff00 wrote: | But a single thread with async can, and for many tasks, is more | efficient than a single thread without async, because you | aren't waiting on requests. | | One point also is that until recently most personal computers | did not have multiple cores, aka you could think at the lowest | level that even multiple threads were just more dangerous async | lol | | Even making new threads in the JVM isn't guaranteed to result | in more than a single OS thread. | jcelerier wrote: | > One point also is that until recently most personal | computers did not have multiple cores, | | the Core 2 Duo was released 17 years ago | jcranmer wrote: | Dual core _started_ coming out ~2005. It took a few more | years for it to dominate the product lineup. Computers tend | to have a replacement cycle of ~5 years. Consequently, | it's not until about 2012-ish that you could reasonably | expect a personal computer to have a multicore processor. | | (Yeah, that's still a decade ago, but that's a good deal | more recent than 17 years.)
| dymk wrote: | A decade is a long time for computers | duped wrote: | Asynchronous execution is more efficient if and only if the | time spent waiting for a call to complete is longer than the | time spent checking if the call has completed plus the time | spent dispatching to another asynchronous call (and waiting vs | checking if it is completed, and so on). | | Remember that concurrency is not parallelism, it's not about | the number of threads being used. | danuker wrote: | Why implement coroutines when you have multithreading? | | Coroutines are to me a brain-twisty way to make the most out of a | single core, because your language doesn't support | multithreading. For example: JavaScript with a single thread in | the browser, Python with the GIL. | | C++ has threads which ACTUALLY run in parallel on the CPU. Why | bother complicating the language further? | eestrada wrote: | Because native multithreading is expensive when it comes to (a) | memory consumption per thread of execution and (b) context | switching. Also, if your program spends a lot of time just | waiting on outside resources (most notably slow/blocking IO), | coroutine concurrency is both faster and more lightweight. If | your program is CPU bound, then coroutines/async have | little-to-no value over native multithreading. | | That being said, yeah, most async workflows are obtuse and | difficult to follow and they force all consumers of your | interface to also use the async primitives if they want the | benefits. Go seems to be one of the few languages that does | async/coroutines right. | | See: https://eli.thegreenplace.net/2018/go-hits-the-concurrency-n... | rcme wrote: | What I like about Go is that each co-routine is entirely | synchronous, and the only concurrency primitives you're given | are channels and mutexes. This makes your code so simple and | easy to understand.
| | I just looked up Kotlin's co-routines, and there are a number | of annoying hoops you need to jump through: | | 1. Co-routines can only be built by calling runBlocking or | coroutineScope. | | 2. Functions that are async but should act like blocking | calls in a co-routine need to be marked with `suspend` | | 3. There are a bunch of primitives you need to use like | `launch` or `async`. | | 4. There are built-in options to make the co-routines lazy, | but it makes the API feel sloppy. | | 5. Rather than locks, Kotlin encourages explicit thread | dispatching. Sometimes you need to do this in Go too, like | when using a C library, but in general it's rare. | ackfoobar wrote: | `launch` is just the `go` keyword. `async` stores the | result in a `Deferred` (or `Promise`), which is an | abstraction for parallel decomposition. | | Other than confining UI work to the main thread, I don't | think I have seen thread confinement as an alternative to | locks. | rcme wrote: | That's kind of my point. Go has the `go` keyword. Kotlin | has runBlocking, coroutineScope, launch, async, and | suspend. Go's simplicity feels nicer to me. | ackfoobar wrote: | `runBlocking` and `suspend` are the coloured function | point. Some hate it, I don't mind it. It has its | benefits. | | --- | | A quick search on SO gives this example of parallel | decomposition. | https://stackoverflow.com/questions/57662739/pattern-for-fet... | | Starting several `async` tasks, then awaiting them is | cleaner for me. But that's subjective. | | --- | | Kotlin's structured concurrency handles the context stuff | by default. When you handle those concerns (cancellation | for example) in Go, it's just as complex, but more | verbose. | rcme wrote: | Go's context object is used for cancellation. I think | this is pretty simple and has the benefit of avoiding any | type of exception handling while still giving you the | ability to clean up anything you were doing.
| | In general, if you like Kotlin, I can see why you'd like | their approach to concurrency. Kotlin really likes a | large standard library with generic blocks that act as | syntactic sugar. I used to like that too, but as I've | gotten older I've gotten lazier. Kotlin now has too much | mental overhead for my old brain. | int_19h wrote: | The other consequence, however, is that Go coroutines only | work with other Go code, and bespoke Go threads and stacks | require the runtime to jump through a lot of hoops to do | FFI correctly, with the result that the ecosystem tends to | rewrite rather than reuse existing code in C and other | languages. | | In contrast, the approach with explicit promises and await | can be mapped all the way down to a C struct with a | function and a data pointer representing a callback. That | is, it is C ABI compatible, and thus it can be easily | composed across any boundary that respects that ABI. | rcme wrote: | The only runtime hoop I've ever had to jump through was | to invoke runtime.LockOSThread() inside a go-routine. In | a lot of ways:

        go func() {
            runtime.LockOSThread()
            // ...
        }()

is simpler than creating a thread in other languages. | int_19h wrote: | The runtime jumps through those hoops for you as needed, | mostly, but there are consequences to that wrt | performance. | PaulDavisThe1st wrote: | The cost of context switching between threads sharing the | same address space is much lower than between tasks, because | there's no need for a TLB flush. It just comes down to the | cost of saving/restoring register state (more or less), and | that's a nice known fixed cost on any given processor. | | Context switching between tasks has a variable cost, because | the implications of the TLB flush depend on the working set | size of the switched-to task. If it doesn't touch much | memory, the variable cost is low; if it touches a lot, the | variable cost can be much, much higher than the register | save/restore (fixed cost).
| nordsieck wrote: | > It just comes down to the cost of saving/restoring | register state (more or less), and that's a nice known | fixed cost on any given processor. | | There's also the cost of cache eviction. Not something you | have to manually manage, but it's a cost you pay | nonetheless. Maybe. | PaulDavisThe1st wrote: | Indeed, I ought to have mentioned the cost of cache | misses post-context switch too. It too is working set | size dependent, but the relationship looks different from | the cost of TLB misses. However, my understanding is that | an intra-task thread switch would also not cause cache | invalidation (there's no need; the address space remains | the same). | GuB-42 wrote: | Coroutines are for when you don't want to run in parallel, because | sequence matters. A typical example is iterators, you want the | loop and iterator (routine and coroutine) to run in lockstep. | | You could use threads instead, but then you risk getting in all | sorts of concurrency problems. And if you synchronize | everything, then great, you have reinvented coroutines. In | fact, you can think of coroutines as threads with built-in | synchronization, which may be how they are implemented. | [deleted] | layer8 wrote: | Coroutines are for when you would normally use callbacks or a | state machine, but handling the callback's state or the state | machine is complex enough that it becomes much simpler when | expressed as a coroutine (simpler to reason about). It isn't | really an alternative to multithreading. | okamiueru wrote: | To more elegantly write code that is async in nature like | IO/Web requests? Let stuff happen and just await where needed? | | Seems like a different type of concern than what you need | multiprocessing for. | injidup wrote: | Coroutines don't really have anything to do with threads. They | are just a way for you to model suspending and resuming | functions.
| | If you have ever tried to implement a complicated iterator you | will find that you model a for loop in your head but that the | iteration of the loop is controlled from outside using the ++ | operator. The ++ operator allows the iterator loop to go one | more time round. Normally you have to manage this with state | stored in some object. With coroutines you can actually write a | loop and co_yield from the middle of the loop to implement the | ++ operator. | grogers wrote: | This is actually really convenient. | | In our code we have an abstraction that wraps a set using a | variant. Instead of implementing an iterator/begin/end based | on which data the variant holds, we have an iter() method | that unwraps the variant and loops through and co_yields. | Since you can consume a generator in a for-each loop, you use | it the exact same way (except calling iter). | | In our case it doesn't save that much complexity, but it's | definitely some. I could imagine more complicated cases where | it is even more convenient. | 112233 wrote: | Coroutines can be used as another way to manage object | lifetime, and make some control flow much easier to write. | This is another language construct like lambdas. It is | still possible to use threads if you have coroutines. | [deleted] | [deleted] | bmurphy1976 wrote: | Good for embedded devices (e.g. a single core ESP32). | [deleted] | jcelerier wrote: | even without any threads or async, coroutines are very nice for | writing generators and super useful when you want generic code | which can iterate, without leaking the container implementation | to client code | steveklabnik wrote: | Why use a hammer when you already have screwdrivers? | com2kid wrote: | > Why implement coroutines when you have multithreading? | | Because coroutines are easier to think about than threads, and | avoid entire (large) classes of bugs. | | Because coroutines (typically) have a lower overhead than | threads.
| | Because if you are IO bound and not CPU bound, and therefore | you don't need to spread work across cores, threads may not | provide a given workload with much, or any, benefit. | demindiro wrote: | For an I/O heavy application (not in C++) I avoid threads for | the main "loop" as there is a central cache that is accessed | often. Synchronizing access to it is likely more expensive and | error-prone than using just a single thread. It is also | possible a few hundred or thousand tasks are running | concurrently, which becomes very expensive very quickly with | threads. | | There are a few CPU-intensive tasks but those are easy to | isolate and send to a threadpool. | akmittal wrote: | Real threads can be too expensive sometimes. A single coroutine in | Go uses just a few KB of memory, so we can easily spawn hundreds of | thousands of coroutines, which is not possible using threads | dragontamer wrote: | Your point is correct but your magnitudes are off. | | IIRC, a thread takes up ~32kB or so worth of data in | pthreads/Linux, or at least... somewhere around there. So on | a rather cheap server with 64GB of RAM, you can easily fit | 1 million threads. More expensive servers have 512GB, 1TB or | more RAM and can easily handle this kind of load. | | Of course, coroutines are even smaller/more efficient and can | go into 10 million, 100 million or higher. | Matheus28 wrote: | A thread on Windows is MUCH heavier than on Linux. 1 MB stack | from what I remember, plus it takes a lot longer to create. | dragontamer wrote: | dwStackSize can be set as low as 64kB though. So if | you're in the habit of making millions-of-threads, maybe | make them smaller on Windows? | | It does default to 1MB, but these sorts of things are | configurable. | com2kid wrote: | Ugh I really need to finish my blog post where we had async | with only a few dozen bytes of overhead per task. | | Cooperative async runtimes are obscenely efficient.
| dragontamer wrote: | > C++ has threads which ACTUALLY run in parallel on the CPU. | Why bother complicating the language further? | | Actual parallelism is surprisingly slow and resource heavy in | practice. When everything is on one core, you keep things local | in L3, L2, L1. | | However, add a 2nd core and you suddenly have "ping | ponging". Let's say Core#0 has data in L1 cache, and now Core#1 | needs to read-modify-write to it. That means the cores need to: | | 1. Core#1 begins to mark that cache-line as exclusive. | | 2. Core#0 needs to then respond by marking the line as invalid. | Then Core#0 ejects the data from L1, and passes it to Core#1. | | 3. Core#1 can finally begin to work on the data. | | 4. If Core#0 uses the data again, Core#1 must do step 2 again. | | -------- | | This is called "ping-ponging". L1 cache reads/writes take about | 1 nanosecond, but ping-pongs can be 30 nanoseconds or slower, | 30x slower for no reason. | | You don't want to add another core to your problem unless | you're _actually_ getting a speed benefit. It's more complex | than it may seem at first glance. You very well could add a | bunch of cores and then suddenly your program is way slower | because of these kinds of issues. | | Another problem: False sharing. Core#0 is working on "int x", | and Core#1 is working on "int y", but x and y are on the same | cacheline. So they have to ping-pong even though they never | actually touch the same data. | PaulDavisThe1st wrote: | Your example implies some combination of (a) excessive data | sharing between threads and (b) possible need for core pinning. | | If thread A is going to need to read _all_ the data touched | by thread B, then it's unclear why you've split the task | across threads. If there are still good reasons, then | probably pin A and B to core N (never pin anything to core 0, | unrelated story), and let them interleave there. | | If that doesn't make sense, then yep, you'll have to face the | cost of ping-ponging.
| Night_Thastus wrote: | EDIT: I think I follow how this works now. I didn't understand | the sieve itself, so the co-routines were throwing me for a loop. | | We get all the numbers from 2 to 40, print out 2, then eliminate | all the entries divisible by 2. Now we're left with 3, 5, 7, 9, | etc. | | We print out 3, then remove all the entries divisible by 3. | | This repeats until we've hit the end (40). The next number after | a round of eliminations will always be a prime. | layer8 wrote: | Yeah, the sieve use-case is a bit too clever IMO to be a good | introductory example. | zabzonk wrote: | perhaps the take here is that many features of modern C++ are not | intended for application writers, but more for library writers? | ghostwriter wrote: | They serve both to a great extent, as long as code owners are | willing to refactor their codebases. The changes do actually | make code significantly safer and more pleasant to work with. | My favourite ones are the concepts and class template argument | deductions. | steveklabnik wrote: | I can't speak to modern C++ specifically, but in many, many | languages, more complex features are often for library writers | than application writers. | corysama wrote: | That's definitely a thing. A whole lot of the features added | since C++11 have been specifically intended to enable library | writers to do very complex things under the hood so they can | present powerful yet simple interfaces to users. | mrfox321 wrote: | Users of coroutine-friendly libraries will have to adopt the | async function coloring (co_await, co_yield, co_return, | Task<T>). | | So saying that coroutines are low-level / not application code | is misguided. | nly wrote: | Not necessarily. Coroutines are invisible at the ABI level | (that is coroutines are just functions). | | It's pretty easy to write a library with them (using your own | executor) and not reveal their use to the user.
| | Obviously sharing an executor between application and library | is another matter. | Joker_vD wrote: | > sharing an executor between application and library is | another matter. | | Especially when there is not one, but several libraries, | and the application itself doesn't actually have or use any | executors, now that's a fun challenge. | mrfox321 wrote: | Good point. | | Although doesn't that defeat the purpose of structured | concurrency? | | i.e. creating tasks that are plumbed to some background | thread/executor. | | By directly invoking the coroutine, lifetime issues become | much easier to avoid. | artemonster wrote: | The fact that they went for stackless is a testament to how bad | the committee-driven design process is. Just a bunch of bureaucrats | playing important actors, similarly to stackoverflow mods. | colejohnson66 wrote: | Can you ELI5 why stackless is bad? | artemonster wrote: | For coroutines to be _really_ useful, they have to be | stackful. The guy that originally did the proposal for C++ is | also an author of multiple coro libraries and did research on | this. You can emulate stackful coros with stackless via | trampoline, but heck, you can emulate stackless coros with | closures and trampoline too! The requirement of this extra | trampoline makes their use extra convoluted and negates the | very slight advantages that such functionality may really | bring. Making stackful "right" was hard, so a useless | compromise was made, which is basically like adding | syntactic sugar to the language. | [deleted] | int_19h wrote: | It may be a compromise, but what exactly makes it | "useless"? A similar style of async has been in use in e.g. | C# for over a decade now, and it has been very successful | there. Yes, it is syntactic sugar - but so are e.g. | lambdas. What matters in practice is that it makes some | kinds of code much shorter and easier to read.
| artemonster wrote: | Stackful coroutines are a feature that is on par with | the power of delimited continuations (proven fact in | academia, you can easily find somewhat easy to follow | papers on this topic), stackless coroutines are a stupid | gimmick. You see the "compromise"? You argued the case | for adding a car, but after long debate, politics | and "compromise" you got a _TOY_ car instead. | [deleted] | malkia wrote: | I'm only sitting here, waiting for my thread_locals to get | broken.... or maybe they are handled magically somehow | jcelerier wrote: | ..why would they be broken? Coroutines are not related to | threads | grogers wrote: | I think the point is that if the coroutine suspends it could | be resumed on a different thread. So you don't really want to | mix thread locals and coroutines. | jcelerier wrote: | I mean, for sure but that's not a property of coroutines, | any callback system can do that | | e.g. if you have this code with boost:

        #include <boost/asio.hpp>
        #include <chrono>
        #include <iostream>
        #include <mutex>
        #include <thread>
        #include <vector>

        std::mutex print_mutex;

        int main() {
            using namespace boost::asio;
            using namespace std::literals;
            io_context io_context(20);
            auto work_guard = make_work_guard(io_context);
            for (int i = 0; i < 100; i++) {
                post(io_context, [&] {
                    thread_local int x = 0;
                    int z = x++;
                    std::lock_guard _{print_mutex};
                    std::cout << z << "\n";
                });
            }
            std::vector<std::jthread> v;
            for (int i = 0; i < 20; i++)
                v.emplace_back([&] {
                    io_context.run_for(2s);
                });
        }

 | | I hope you don't expect the output to be "0, 1, 2, ..., 99" ___________________________________________________________________ (page generated 2023-02-22 23:01 UTC)