[HN Gopher] Cost of a thread in C++ under Linux
Author : eaguyhn Score : 144 points Date : 2020-03-01 12:41 UTC (10 hours ago)
(HTM) web link (lemire.me) (TXT) w3m dump (lemire.me)
| brainscdf wrote: | My personal best practice is to always create a thread pool on | program startup and distribute your tasks among the thread pool. | I use the same best practice in all other languages too. Is this | best practice sound or can it lead to problems in some corner | cases? | cjfd wrote: | The thing I would worry about here is that perhaps not all of | your tasks have the same performance demands. There may be | tasks related to RPC that should run as quickly as possible and | tasks related to computation that could take a long time. If | all of the threads in the threadpool are busy with an expensive | computation there might not be any left to quickly handle RPC | requests. | | I personally prefer to do as much as possible just in one | thread, where you can run things asynchronously with a single | threaded message loop and then have a thread pool next to that | for expensive computations. This also tends to reduce the | number of things that need to be protected with a mutex. | hinoki wrote: | There are lots of details that might cause problems: | | * Do your tasks block? How many threads do you need to make | sure you can use all your CPUs? | | * Do your tasks access different sets of memory? Would keeping | similar tasks on the same CPUs reduce cache misses? | | * Do your tasks have different priorities? You might need a | pool for each priority. 
| | For a UI program that isn't doing anything really intensive or | real-time, having a common thread pool makes a lot of sense, | and can reduce resource use (stacks add up once you get to many | 10s or 100s of threads...), and improve latency (a work queue | with many threads will get more CPU than another with the same | amount of work but fewer threads) | londons_explore wrote: | Case in point: | | I used nodejs for a project, and assumed that "it's all | javascript on one thread" would leave threading issues | behind. | | My application curiously stopped responding whenever I had 5 | or more users. Connected users could continue to do anything, | but new users couldn't connect, and existing users sessions | would hang when executing any code that wrote to a logfile, | making debugging even harder. Using the nodejs debugger, the | internals of write(...., cb) were just never calling the done | callback. | | After hours of head scratching I found that most IO from | nodejs is _not_ asynchronous and callback based as the docs | suggest, but is in fact blocking IO done from worker threads. | My process was using pipes to communicate with other | processes, and those pipes were doing blocking writes, and | when blocked, the worker thread was blocked. | | There are 4 worker threads by default, so whenever 5 users | were using the system, all worker threads were tied up and it | would fail. It would have been nice for nodejs to at least | have printed to the console "All worker threads busy for | >1000ms. See nodejs.com/troubleshooting/blockingfileio.htm" | or something. | dirtydroog wrote: | As far as I'm aware, node.js is a wrapper over libuv which | is a truly asynchronous socket IO library. It fakes file IO | async ops with thread pools because on Linux file IO isn't | async at all. | clarry wrote: | Also: | | * Do you have sufficiently large batches that you can | efficiently assign to one thread? 
| | If not, then you're just wasting a lot of time waking up to | receive inputs, assigning them to threads (-> put them on a | work queue or similar, with all the locking / atomics), and | waking up a thread to pull an item (locking / atomics), | process it, go to sleep... | | It's easy to end up spending more time juggling tasks and | switching tasks than performing any useful work. | maayank wrote: | Why the relatively high cost of threads on ARM? If anything, I'd | imagine it is more geared towards "massive parallel" scenarios | (i.e. dozens of cores). | Koshkin wrote: | Intel's excellent TBB library is the answer to all your worries | about threads in C++. (IMHO it should be made part of the | standard library.) | ncmncm wrote: | All your worries, if throughput is all you worry about, and not | latency. Or, if you have interaction between threads. Or, if | you might need to run on other archs. | | An equivalent to TBB or GCD will be in C++23 std libraries, but | you can often do better with coroutines, in C++20. | | TBB and GCD still need to synchronize sometimes, and they | randomize workload assignment, which is bad for cache locality | (i.e. bad). If you can arrange static assignment and avoid the need | to synchronize, you can do better, sometimes much better. | pjmlp wrote: | The problem with C++23 is that it will be mostly usable | around 2025, and C++20 co-routines still don't have a | co-routine aware standard library, right? | Koshkin wrote: | > _other archs_ | | See, for instance, | https://www.theimpossiblecode.com/blog/intel-tbb-on-raspberr.... | saagarjha wrote: | Is a std::thread a thin wrapper around pthreads on Linux? | signa11 wrote: | yes. | abjKT26nO8 wrote: | With the caveat that the destructor crashes your program if you | neither join nor explicitly detach the thread. | ncmncm wrote: | But, don't detach the thread. | pjmlp wrote: | C++20 apparently has a fix for it with std::jthread, though. 
| | With all the possible learnings from Java, .NET, Erlang, TBB, | Concurrency Runtime, ISO C++ still did not manage to get a | proper concurrency story, and it is full of traps like the one | you mention. | | Another one is std::async, which might actually be | synchronous, depending on a set of factors. | foo101 wrote: | A related question if anyone knows good answers here. | | What programming languages' de-facto thread implementations are | not wrappers around pthreads? I think Go has its own thread | implementation? Or am I mistaken? | signa11 wrote: | erlang has its own process/thread implementation with, iirc, | 64b per process. | toast0 wrote: | The docs [1] say: | | > A newly spawned Erlang process uses 309 words of memory | in the non-SMP emulator without HiPE support. (SMP support | and HiPE support both add to this size.) | | And a word is the native register size, so 4 or 8 bytes | these days, so fairly small, but not 64 bytes small. | | [1] http://erlang.org/doc/efficiency_guide/processes.html | [deleted] | saagarjha wrote: | Right, Go uses green (userspace) threads. | daurnimator wrote: | zig optionally uses pthreads (depending on if you link | against libc or not) | ghostwriter wrote: | GHC Haskell's runtime has a "default" light-weight thread | system (forkIO) that schedules logical threads on the | available operating system threads and parallelises them | across available CPUs: | | - https://wiki.haskell.org/Parallelism#Multicore_GHC | | - https://stackoverflow.com/a/41485705 | | - https://www.aosabook.org/en/posa/warp.html | pjmlp wrote: | Java does not specify the actual threading model, so you can | get green threads (user space) or red threads (kernel | threads). | | The upcoming Project Loom intends to make it so that green | threads become the default (aka virtual threads on Loom), but | you can still ask for kernel threads, given that is what most | JVM implementations have converged on. | fwsgonzo wrote: | Yep. 
And all the serialization is futex wait/wake. | isatty wrote: | Why is there such a big difference in timing between Skylake and | Rome? Something compiler specific? The number of steps required | to create a thread should be identical. | | I'll also be interested to see the same benchmark but using | pthread_create directly. | [deleted] | thedance wrote: | Could be as basic as clock speed differences. | sys_64738 wrote: | Benchmarking in C++. Who knew! | saagarjha wrote: | Daniel does most of his benchmarks in C++; it's fairly | well-suited for the task. | hrgiger wrote: | Using taskset pinning my numbers improve: | | $taskset --cpu-list 8 ./costofthread avg: 11000~ | | $taskset --cpu-list 8,11 ./costofthread avg: 33000~ | | $./costofthread avg: 60000~ | known wrote: | On any architecture, you may need to reduce the amount of stack | space allocated for each thread to avoid running out of virtual | memory | | http://www.kegel.com/c10k.html#limits.threads | CJefferson wrote: | Is this even possible on a 64-bit architecture? The default | stack size is, I think, 2 MB, and I have previously allocated | terabytes of VM space without issues. | wbkang wrote: | No, this is more of a 32-bit issue. | nurettin wrote: | I can't believe this link is still relevant after more than 15 | years. | Koshkin wrote: | Not making a jab at what you are saying, but to me "running out | of virtual memory" has always sounded like a crazy thing, like | running out of address space. Sure, given enough disk space, | your program might get (quite) a bit slower, but it should | still chug along just fine. Yet, running out of virtual memory | is indeed still a thing, especially in Windows (a workaround | being using memory-mapped files). | shin_lao wrote: | Great reminder. | | Even if you pre-create a thread (thread pool), when the task is | small enough (less than 1,000 cycles), it is less expensive to do | it in place (for example, with fibers), because of the cost of | context switching. 
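The trade-off shin_lao describes can be probed with a minimal sketch like the one below. It is not the benchmark from the linked article: the function names `avg_ns_new_thread` and `avg_ns_in_place` and the trivial task are assumptions made for illustration, and the numbers it produces depend heavily on the machine, kernel, and compiler flags.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Tiny unit of work. The volatile counter keeps the compiler from
// deleting the in-place loop entirely.
inline void tiny_task(volatile unsigned long& counter) { counter = counter + 1; }

// Average nanoseconds per task when every task gets a fresh std::thread.
double avg_ns_new_thread(int iterations) {
    assert(iterations > 0);
    volatile unsigned long counter = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::thread worker([&counter] { tiny_task(counter); });
        worker.join();  // join before the next spawn: one extra thread at a time
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
}

// Average nanoseconds per task when the same work runs on the calling thread.
double avg_ns_in_place(int iterations) {
    assert(iterations > 0);
    volatile unsigned long counter = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        tiny_task(counter);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
}
```

Calling both functions with the same iteration count (for example 1000) and printing the two averages gives a rough feel for how large a task has to be before handing it to another thread pays off.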
| iforgotpassword wrote: | Agree. A few years ago I noticed a C program we used in | production spawned a new thread for each incoming connection. | Since the vast majority of these just served two small requests | (think two HTTP gets) I tried adding a very simple thread pool | that would keep up to four idle threads around. To make a | thread wait for work I used an eventfd (Linux). I tried a | linked list and an array for the idle threads. I tried | protecting the get/return code with a mutex and spin lock, and | then made it lock free with C11's atomics. Two days later I | still couldn't get this to be faster than just spawning a new | thread every time, so I gave up this experiment. | | It seems at least the Linux folks optimized the crap out of | clone() over the last years. | cesarb wrote: | > It seems at least the Linux folks optimized the crap out of | clone() over the last years. | | The most essential Linux benchmark is compiling the Linux | kernel (since it's something the Linux kernel developers do | all the time, so they really feel the impact). The clone() | system call is used both to create new threads and to create | new processes, and the Linux kernel compilation uses a large | number of short-lived processes (each C file is a new C | compiler process). It's only natural that clone() is heavily | optimized, together with the filesystem caches (each new C | compiler process reads the source code files from scratch). | proverbialbunny wrote: | I imagine this is why coroutines and the like are often used | as threads within a thread pool. | mac01021 wrote: | Why is the cost of switching threads so much higher than the | cost of switching fibers? | gpderetta wrote: | Switching threads requires entering the kernel, which costs | from a few hundred to thousands of clock cycles (and it got | worse from all the spectre/meltdown mitigations). | | A fiber switch can be done in less than 10 clock cycles. 
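A rough way to see the kernel-thread side of this comparison is to bounce a turn flag between two threads and time the handoffs. The sketch below is an assumption-laden illustration, not a measurement from the thread: `avg_ns_per_handoff` is a hypothetical name, and the result includes mutex and condition-variable overhead and cross-core wakeup latency, so it is an upper bound on a pure context switch rather than the switch itself.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Average nanoseconds per handoff between two kernel threads that take turns.
double avg_ns_per_handoff(int round_trips) {
    assert(round_trips > 0);
    std::mutex m;
    std::condition_variable cv;
    bool callers_turn = true;  // true: calling thread runs next, false: partner

    std::thread partner([&] {
        for (int i = 0; i < round_trips; ++i) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !callers_turn; });
            callers_turn = true;  // hand the turn back
            cv.notify_one();
        }
    });

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < round_trips; ++i) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return callers_turn; });
        callers_turn = false;  // hand the turn to the partner
        cv.notify_one();
    }
    {
        // Wait for the partner's final handback before stopping the clock.
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return callers_turn; });
    }
    const auto stop = std::chrono::steady_clock::now();
    partner.join();

    // Each round trip consists of two handoffs.
    return std::chrono::duration<double, std::nano>(stop - start).count() /
           (2.0 * round_trips);
}
```

Running it under `taskset` with both threads restricted to a single CPU forces every handoff through the scheduler on that core, which is closer to what the comments above call a context switch.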
| fwsgonzo wrote: | Because threads are traditionally created and scheduled by | the OS, so it inevitably involves a costly context switch | both first into the kernel and then back again into the next | thread, if one is ready. | | Userspace threads are more light-weight, but probably still | worse than just using fibers and co-routines. Depends on your | needs, I suppose. | boulos wrote: | I find Eli Bendersky's writeup [1] more useful as it actually | goes closer to the details. For readers less familiar, it also | makes it more clear what the time spent will depend on (how much | state there is to copy). Eli's post is actually a sub-post of his | "cost of context switching" post [2] which is more often | applicable (and helps answer all the questions below about | threadpools). | | [1] https://eli.thegreenplace.net/2018/launching-linux- | threads-a... | | [2] https://eli.thegreenplace.net/2018/measuring-context- | switchi... | drmeister wrote: | Threads are very expensive if you start throwing C++ exceptions | within them in parallel. You see the overall time to join the | threads increases with each thread you add. There is a mutex in | the unwinding code and as the threads grab the mutex they | invalidate each other's cache line. I wrote a demo to illustrate | the problem https://github.com/clasp-developers/ctak | | MacOS doesn't have this problem but Linux and FreeBSD do. | ajross wrote: | Did gcc/libstdc++ have the same problem? The report you link is | for clang, it looks like. | drmeister wrote: | Yes, gcc/libstdc++ have the same problem. That's where I saw | it first and then I tried llvm/libunwind and saw the same | thing. | monocasa wrote: | What's the actual lock on? The unwinding shouldn't involve | shared mutable state (as someone who's been deep into the | DWARF unwinding VM bytecode). | brandmeyer wrote: | I think that this is a lock around dlopen(). 
dlopen | changes the list of mapped objects (and therefore, the | mapping from instruction pointer to unwind information). | drmeister wrote: | In libgcc it's in Unwind_Find_FDE - we think it's a lock | around walking the loaded dynamic libraries. I haven't | personally dug much deeper into it but my folks here and | the llvm engineers seem to be pretty certain that's the | problem (this: | https://github.com/gcc-mirror/gcc/blob/master/libgcc/unwind-...). Right now we | are rearranging our compiler so we throw fewer exceptions | because you don't have to optimize things that you don't | do :-). | viraptor wrote: | Looks to me like it only tries to protect the building of | the shared / sorted "seen_objects". You don't want two | threads rebuilding it at the same time. Although there | must be a way to work around this. Maybe something like | optimistically walk through the seen list, then grab the | lock to update and walk again without a lock? You should | be able to safely walk a linked list forwards even with | another thread inserting into it, right? | gpderetta wrote: | There is some work on the libc side to make the lock optional | and enable it only at the first dlopen. | | Edit: the last time I investigated the issue I ended up here: | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744 | Koshkin wrote: | > _start throwing C++ exceptions_ | | Well, don't. "Exception-based programming" is an anti-pattern. | Exceptions should be thrown in, well, exceptional situations. | monocasa wrote: | They were using it in another language's runtime. Idiomatic | advice like that doesn't always transfer. | rumanator wrote: | Limiting exceptions to handle exceptional events is not | idiomatic advice, it's stating their use case, and the whole | reason they exist. 
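The unwinder contention discussed in this subthread can be probed with a small experiment along the following lines. This is not drmeister's ctak demo; `throw_in_parallel` and `ThrowResult` are hypothetical names, and whether the elapsed time grows with the thread count depends on the platform, the unwinder library, and its version.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <stdexcept>
#include <thread>
#include <vector>

struct ThrowResult {
    long long caught;     // total exceptions caught across all threads
    double milliseconds;  // wall-clock time from first spawn to last join
};

// Spawn `threads` workers; each throws and catches `throws_per_thread` exceptions.
ThrowResult throw_in_parallel(int threads, int throws_per_thread) {
    assert(threads > 0 && throws_per_thread > 0);
    std::atomic<long long> caught{0};
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(threads));
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&caught, throws_per_thread] {
            for (int i = 0; i < throws_per_thread; ++i) {
                try {
                    throw std::runtime_error("probe");  // forces a stack unwind
                } catch (const std::runtime_error&) {
                    caught.fetch_add(1, std::memory_order_relaxed);
                }
            }
        });
    }
    for (auto& w : workers) w.join();

    const auto stop = std::chrono::steady_clock::now();
    const double ms =
        std::chrono::duration<double, std::milli>(stop - start).count();
    return ThrowResult{caught.load(), ms};
}
```

Calling it with a fixed `throws_per_thread` and a thread count of 1, 2, 4, and so on shows whether the time to join stays roughly flat (no shared lock in the unwind path) or climbs with each added thread.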
| gumby wrote: | There's an easy optimization to avoid inspecting every frame | when unwinding which c++ could not implement (for policy | reasons) though a platform could: add a pointer to the next | frame that needs unwinding to the frame setup. This is like | move elision. | | If my caller has destructors to run or a catch clause this | pointer is null and inspection proceeds as normal. If It does | not it stores the value from _its_ frame there. Then if I throw | an exception I jump to the next frame that needs inspection; if | I don't then any throw further down the call stack won't even | look at me. | | The C++ standard can't call for this because of the "zero cost | if you don't use it" rule. But a Linux ABI could. The MacOS | takes advantage of this kind of freedom. | ajross wrote: | To be fair: very few C++ applications are limited by | exception performance, it's a feature that's very much out of | favor at the moment. So penalizing everyone else (despite the | fact that most new code doesn't use them, it's not at all | uncommon to find projects with exception generation enabled | for the benefit of one library or two) to make parallel | exceptions faster actually does seem like a bad trade to me | in the broad sense. | | Apple does indeed have more freedom, and it may be that | specific MacOS components need this in ways that the general | community doesn't seem to. But I'd want to see numbers from a | bunch of real world environments before declaring this a | uniformly good optimization. | gumby wrote: | > To be fair: very few C++ applications are limited by | exception performance, it's a feature that's very much out | of favor at the moment. | | It is distressing that Sutter's survey showed that half the | respondents had to disable exceptions for part of all the | code. 
I've often heard the argument "well google's coding | standard prohibits exceptions" which is bizarre, as | google's standard says "exceptions are great but we have | some legacy code that can't use them, so we're stuck" | | The biggest argument seems to be that they are expensive, | which is crazy because there's no cost if you don't raise | one and if you do you're already in trouble and generally | have plenty of time to deal with it (this is different | from, say, Lisp signalling which not only permits | continuing (!) but is in theory supposed to be common. | Probably a mistake in retrospect). But they allow you to | make the uncommon stuff uncommon (as opposed to error codes | which must be sprayed like shrapnel through your code). | | There are two legit arguments against exceptions: one is | when you are constrained in space (e.g. embedded systems) | and/or time (hard realtime systems that need predictable | timing, even if it is slower). The other is a philosophical | argument that it embodies a second, parallel flow of | control. Since C++'s exception system is an error system | only, and since destructors are run automatically, it's | hard for me to find this second argument convincing. | pjmlp wrote: | Sometimes I miss C++'s flexibility from the managed | languages that I usually use, then I remember that the | community is now driven by the performance at all costs | crowd, without exceptions, RTTI, STL and let that thought | go. | | That is not the C++ I enjoy using, rather the language I | got to love via Turbo Vision, OWL, VCL, MFC, Qt, which is | not what drives the language nowadays. | gumby wrote: | I wouldn't characterize that group as "the community". | True there are a lot of such people, mostly clustered in | the game industry where superstition is rife. | | Take a look at C++ (or C++20!) as if it were a brand new | language you'd never seen before and forgetting that its | name includes "c". 
_That_ language is a pretty clean, | expressive and straightforward language IMHO. I like | programming in it. | | I'm not claiming it's unicorns farting rainbows, but | it's definitely pretty good. | pjmlp wrote: | If the community wasn't busy discussing those issues, and | constexpr of all things, we would already have | reflection, with a concurrency and networking story that | isn't put to shame by what Java 5 already had, let alone | by modern managed languages. | | Yeah, if everyone plays ball, it might come in 5 years | from now, assuming C++23 gets done on time, plus the | compiler support stabilization. | | Right now SG14 seems to drive some of those decisions, at | least from outside. | gumby wrote: | Those are important issues and people who care about them | work on them and come to committee meetings. There is | less consensus on the concurrency and networking side | which I also find frustrating but as I'm not pushing | those balls forward I can't complain. I do think at least | that the direction they're moving in is a fruitful one. | | The standard can move quickly: consider formatted output | which lingered unchanged with a broken model but was | rapidly reformed when someone with a good model _and_ | implementation was encouraged to come forward. Admittedly | a smaller topic than concurrency or networking! | pjmlp wrote: | Right now, the way I see it, I would rather help the managed | languages I work on reach the point where binding to C++ | is kind of the last option when nothing else helps. | | The other language communities manage to drive language | progress over the Internet, which ISO apparently has yet | to figure out. | gumby wrote: | C++ does this as well, for example with boost, where | several things that entered the standard got their start. | And the format example I gave. It's the ISO blessing that | is complex, but it also acts as a forcing function | to make new features as orthogonal as possible. 
| Sure, it's not to everybody's taste, but you don't need | to follow ISO if you don't wish to. | detaro wrote: | That seems like a misconception about the driving forces | behind C++ today. | pjmlp wrote: | Not when one looks into the recent ABI discussions. | gumby wrote: | You mean a refusal to break ABI? | pjmlp wrote: | Yes, as means to achieve performance improvements that | aren't that relevant to average Joe C++ dev. | | The one doing application stuff in Qt, MFC, wxWidgets. | | Or those like myself, where C++ only matters as means to | implement native bindings to system libraries, or GPGPU | shading languages based on C++. | gumby wrote: | Well I can afford a complete ABI break (complete) given | the kind of code I work on. Most people cannot. Binary | incompatibilities are very hard for most people to | manage. | | So I would benefit from any number of abi-breaking | proposals but can understand the committees reticence. | barrkel wrote: | Yeah, when the moderates move out, the hard core that | remains swings towards what makes C++ unique, and that's | not general purpose application programming and language | features that support it. | pjmlp wrote: | Which looking from its use in mainstream OS SDKs means | drivers, composition engine, shaders and real time audio | engines, the SQL of systems programming, kind of. | fpoling wrote: | Writing exceptions-safe code is not free in itself. | Surely one can do it, but it requires more mental energy | to write and even more efforts to review the code. | keldaris wrote: | > which is crazy because there's no cost if you don't | raise one | | This is just false in the general case. The presence | (potential or actual) of exceptions often just serves as | an optimization barrier in current compilers. That's not | to even invoke bizarre but not infrequent issues like | this [1]. I too have had codebases that miraculously sped | up upon disabling exceptions despite not throwing | anything. 
Identifying the exact causes of these | situations is hard and typically not done, because it's | far easier to just add a compiler switch and pretend | there are no exceptions in C++ and get back to work. | | Many people in performance sensitive domains just don't | find it remotely worthwhile to care about features that | have these sorts of difficult to predict and debug costs. | When your workflow already consists of writing highly | explicit, simple to reason about code that you frequently | inspect in disassembled form, exceptions (and RTTI for a | host of obvious reasons) are the last thing you'd want to | enable. At best it's just extraneous noise in the | assembly, at worst you take a sizable perf hit and have | no idea why. | | [1] https://twitter.com/timsweeneyepic/status/12230774046 | 6037145... | throwaway17_17 wrote: | This description of C++ usage you gave: | | A workflow consisting of writing highly explicit, simple | to reason about code that you frequently inspect in | disassembled form | | Is possibly the most descriptive and succinct description | of my coding practice. I really like this formulation and | I am going to shamelessly steal it in the future, | repeatedly. | mehrdadn wrote: | We need sample code for stuff like this so they can be | referred back to as canonical examples. Frequently | asserted C++ misconceptions? | acqq wrote: | > it's a feature that's very much out of favor at the | moment. | | I'm glad if it is so. Exceptions should actually be | "exceptional" and not the part of the normal execution | flow. Whoever has other ideas has the wrong model of what, | at the lower levels, exceptions actually do. | brandmeyer wrote: | This sounds a lot like the SJLJ runtime model that was used | in G++ for years. | | > The C++ standard can't call for this because of the "zero | cost if you don't use it" rule. But a Linux ABI could. The | MacOS takes advantage of this kind of freedom. | | That's not really true. 
It has nothing to do with the | standard. It has everything to do with the compiler's users | complaining about the performance hit relative to DWARF EH. | It is part of the social contract between the standards body, | the compiler author community, and the user community that | unused features don't cost us in runtime performance. | gumby wrote: | Yes, it's similar I suppose, though much lower overhead. I | was a bigger advocate for frame inspection than Michael was | in the early years because of my Lisp background. He was | (correctly) more concerned with performance. | | As for "policy" vs "social contract" I think we basically | agree. | drmeister wrote: | This sounds interesting - thank you! We need to interoperate | with C++ - so we couldn't use this, could we? We could add | this pointer to our own frames but C++ frames won't have this | info - so I'm not sure how they would interoperate. We need | to be able to throw an exception and invoke both Clasp frame | cleanups and C++ frame cleanups up the stack. We have a crazy | mix of C++ and CL frames on the stack at any time. | gumby wrote: | Sure you could. You presumably already have a mechanism for | doing frame unwinding either via compatibility with the C++ | runtime's throw() implementation or by supplying your own. | So for your own stack frames you can do what you like. | Exceptions raised by c++ code called from lisp would also | work the same way as they do now. | | And when you are unwinding a lisp->C++ boundary (that is, | lisp code called by a C++ function) you are free to do what | you like until you get to the first Lisp frame; if it | doesn't have an unwind-protect then its "ignore me" pointer | just points up to its caller, which is examined by the C++ | runtime anyway. 
| | The nice part of that second paragraph is that if that | first lisp callee was called by a non-c++ function (say a | fortran function) you might even have an opportunity to set | the "parent frame for inspection pointer" to skip over all | the fortran frames and point directly to the lowest C++ | function below you...which you could manage via a small | change to gold or llvm-ld. | ampdepolymerase wrote: | What about Windows? | lachlan-sneff wrote: | Unrelated to this thread, but are you still working on the | matter compiler project? | drmeister wrote: | That's the point of everything. We have spun up a company and | big things are in the works: http://www.thirdlaw.tech/ | lachlan-sneff wrote: | Are you calling the software the "matter compiler" or is a | "matter compiler"/"nanofactory" an eventual end goal of the | project? | wwarner wrote: | Wow, I wouldn't have guessed that! Nice data! | drmeister wrote: | Thank you. We have developed a Common Lisp implementation | that uses LLVM and interoperates with C++ and uses C++ | exception handling to unwind the stack. Common Lisp code | relies on stack unwinding a fair bit. Imagine my surprise | when my fancy multi-threaded compiler can't get out of first | gear (tops out at ~150% cpu) on Linux. Sheesh. | gumby wrote: | I've been really interested in clasp; what's the current | state? Can't see any recent posts that summarize it. | drmeister wrote: | It's going well. I'll post something soon. We've just | been working on it quietly. We have multithreading, | unicode, cffi etc, good debugging support, cross-language | profiling and more. | yjftsjthsd-h wrote: | Any idea why Darwin is beating Linux and FreeBSD here? Are they | doing something different that could/should be implemented in | the others? | signa11 wrote: | imho, if _cost_ of thread creation is where the bottleneck is, | then more likely than not, you are doing things wrong. 
| Ensorceled wrote: | This is just another way of saying what the article just said. | bluetomcat wrote: | For CPU-bound tasks, it is best to pre-create a number of threads | whose count roughly corresponds to the number of logical execution | cores. Every thread is then a worker with a main loop and not | just spawned on demand. Pin their affinity to a specific core and | you are as close as possible to the "perfect" arrangement with | minimized context switches and core-local cache data being there | most of the time. | emilfihlman wrote: | A valid reason to have more is IO, and particularly file IO. | inetknght wrote: | File IO should be using asynchronous methods too. If your OS | doesn't support asynchronous file IO then you're not using | any of the big 3. | tomlu wrote: | Exactly this. You want one thread pool of size ~= core | count, then you want to have a completely _different_ "max | number of IO jobs" type deal that doesn't use threads at | all. | | C# async/await is pretty good for this (IO operations do | not count towards CPU task count). | fwsgonzo wrote: | Yep. We do this on real hardware and never preempt. There is | nothing more performant than that. Using IPIs (inter-processor | interrupts) you can trigger events like "more work has been | added" on each CPU's queue. Additionally, when you put all | interrupts of a device solely on one specific CPU you won't | have to lock anything. | | Some other things: pthreads generally have high cost, and that | means C++ threads do too. pthreads have quite a few features | that you regularly don't use, which you can skip completely | using fibers and coroutines. | londons_explore wrote: | What percentage of performance do you think you gain by never | pre-empting? | | If we're talking 50%, the complexity sounds worth it, but if | it's 1% I think I'd prefer to stick with standard scheduling | and know my program will 'just work' on any CPU or OS, and | with any libraries I choose to use. 
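The arrangement bluetomcat describes (one long-lived worker per logical core, each pinned to a core) might look roughly like the sketch below. It is Linux-specific and illustrative only: `PinnedPool` and `sum_with_pool` are hypothetical names, pinning uses the glibc extension `pthread_setaffinity_np` on a best-effort basis, and a single mutex-protected queue is assumed for brevity where a tuned pool would likely use per-worker queues.

```cpp
#include <pthread.h>  // pthread_setaffinity_np (glibc extension, Linux only)
#include <sched.h>    // cpu_set_t, CPU_ZERO, CPU_SET
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class PinnedPool {
public:
    explicit PinnedPool(unsigned workers = std::thread::hardware_concurrency()) {
        if (workers == 0) workers = 1;  // hardware_concurrency() may return 0
        threads_.reserve(workers);
        for (unsigned cpu = 0; cpu < workers; ++cpu) {
            threads_.emplace_back([this] { run(); });
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            // Best effort: if this CPU is not available to the process the
            // call fails and the worker simply stays unpinned.
            pthread_setaffinity_np(threads_.back().native_handle(), sizeof(set), &set);
        }
    }

    ~PinnedPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& t : threads_) t.join();  // workers drain the queue first
    }

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

private:
    void run() {  // each worker's main loop
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> threads_;
    bool stopping_ = false;
};

// Usage example: add 1..n using the pool; returns n * (n + 1) / 2.
long sum_with_pool(unsigned workers, int n) {
    std::atomic<long> total{0};
    {
        PinnedPool pool(workers);
        for (int i = 1; i <= n; ++i)
            pool.submit([&total, i] { total.fetch_add(i, std::memory_order_relaxed); });
    }  // destructor runs the remaining tasks and joins the workers
    return total.load();
}
```

The tasks in `sum_with_pool` are far too small to benefit from a pool; they are there only to show submission and shutdown. As hinoki and kccqzy note further down, pinning also means taking over part of the OS scheduler's job, so it suits dedicated machines better than shared desktops.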
| fwsgonzo wrote: | Probably a lot on bare metal, but we are a special case | here. It really brings down the latency to respond to | network events. A single context switch is in the range of | 100K-1M CPU cycles. | | The reason why is because we avoid all the indirect cost of | context switching, which is all the various caches that has | to be flushed. And also the context switching itself, of | course. | | However, you can still do a lot on Linux to equalize things | if you really want to get down to it. For anything but | special cases Linux really does a good job with scheduling. | After all, you are likely not running much else other than | your intended service. | | That said, for me this thread was a slight wakeup-call that | made me look more into fibers and co-routines. I have been | wanting to use these for a long time for some things. | boulos wrote: | > A single context switch is in the range of 100K-1M CPU | cycles. | | 1M cycles is roughly 300 microseconds (assume 3 GHz | processor, so 3 cycles is 1 nanosecond). Eli's graph from | the post I referenced above, has a context switch in the | 1-3 microsecond range [1] depending on taskset/core | pinning. The high end (3 microseconds) is about 10000 | cycles then. | | Maybe you mean fork() or pthread_create for your 1M | cycles? | | [1] https://eli.thegreenplace.net/images/2018/plot- | launch-switch... | jandrewrogers wrote: | The difference can be quite large, details are workload and | software dependent. It isn't just the context-switching | overhead (which is prohibitively high these days), it also | significantly improves average cache locality, which is the | bottleneck for many high-performance codes. | | Some types of software optimizations require the ability to | correctly infer local CPU cache contents, which is | difficult when arbitrary processes are semi-randomly | stepping all over that cache. 
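The cycle and time conversions boulos uses above follow directly from the clock rate: at f GHz there are f x 1000 cycles per microsecond. A small sketch, assuming the same 3 GHz clock (the helper names `cycles_to_us` and `us_to_cycles` are made up for illustration):

```cpp
#include <cassert>

// At `ghz` GHz a core executes ghz * 1000 cycles per microsecond.
inline double cycles_to_us(double cycles, double ghz) {
    assert(ghz > 0.0);
    return cycles / (ghz * 1000.0);
}

inline double us_to_cycles(double microseconds, double ghz) {
    assert(ghz > 0.0);
    return microseconds * ghz * 1000.0;
}

// At 3 GHz:
//   cycles_to_us(1e6, 3.0)  is about 333 microseconds
//   cycles_to_us(1e5, 3.0)  is about 33 microseconds
//   us_to_cycles(3.0, 3.0)  is 9000 cycles
```

So the quoted 100K-1M cycle range corresponds to roughly 33-333 microseconds at 3 GHz, while a 1-3 microsecond switch corresponds to roughly 3000-9000 cycles.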
| Koshkin wrote: | > _never preempt_ | | I don't think this is a realistic expectation on Linux and, | especially, Windows, which runs hundreds of threads of its own | you don't want to know about. (Besides, we must remember that | multithreading was invented and found quite useful in the era | of "single-core" processors.) | cma wrote: | Mask off all those other processes to only run on core 0. | emidln wrote: | Linux/CFS has an isolcpus feature that can be used to tell | the kernel to never schedule things on a given CPU. This is | useful when latency matters. | ncmncm wrote: | I learned recently that even with isolcpus, _and_ nohz, | _and_ interrupts directed elsewhere, the kernel will | still pause the thread on the isolated CPU if it has mmapped a | file (e.g., to report stats) and the kernel decides it's | time to copy the bits to disk. If you don't want stalls, | only map writable files on a tmpfs volume. To snapshot | the file, copy it to another file on the same volume, and | then snapshot the copy. | gpderetta wrote: | Yes, it is called TLB shootdown and it is required to | preserve the integrity of the TLB across CPUs on an unmap | or a dirty bit change. If latency is important, don't use | disk-backed writeable mappings. | | Edit: to clarify: any writeable mapping or any unmap will | cause TLB shootdown interrupts to be broadcast to all | currently running threads of a process. | ncmncm wrote: | Of course one never unmaps the file, or any mapped | memory, so this has nothing to do with TLB problems | (which are also a thing -- another reason processes are | better than threads). | | These pauses happen even without unmapping. Even though | no synchronization is available, and you may be right in the | middle of whatever, the kernel decides a static snapshot | of the pages' state must be written, so it write-protects | the pages first and blocks your process until the write | is done. It's just rude.
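The tmpfs workaround ncmncm describes (keep the writable mapping on a tmpfs volume, then copy the file and snapshot the copy) might look something like the sketch below. Everything named here is hypothetical: the `Stats` layout, the helper names, and the file paths are illustrations only, and the caller is assumed to pass paths on a tmpfs mount such as `/dev/shm`. The copy is not atomic with respect to concurrent writers, so a snapshot may mix old and new field values.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <system_error>

// Hypothetical stats record that the latency-sensitive thread updates.
struct Stats {
    std::uint64_t events;
    std::uint64_t errors;
};

// Create (or open) the live stats file and map it shared and writable.
// Returns nullptr on failure. The mapping is never unmapped, in line with
// the "one never unmaps" remark in the thread.
Stats* map_stats(const std::filesystem::path& live) {
    const int fd = ::open(live.c_str(), O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (::ftruncate(fd, sizeof(Stats)) != 0) {
        ::close(fd);
        return nullptr;
    }
    void* mem = ::mmap(nullptr, sizeof(Stats), PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    ::close(fd);  // the mapping stays valid after the descriptor is closed
    return mem == MAP_FAILED ? nullptr : static_cast<Stats*>(mem);
}

// Copy the live file to a sibling on the same volume; slower consumers
// then read or archive the copy instead of the mapped file.
bool snapshot_stats(const std::filesystem::path& live,
                    const std::filesystem::path& copy) {
    std::error_code ec;
    return std::filesystem::copy_file(
        live, copy, std::filesystem::copy_options::overwrite_existing, ec);
}

// Read a snapshot back, for example from a reporting thread.
bool read_snapshot(const std::filesystem::path& copy, Stats& out) {
    std::ifstream in(copy, std::ios::binary);
    in.read(reinterpret_cast<char*>(&out), sizeof(out));
    return in.gcount() == static_cast<std::streamsize>(sizeof(out));
}
```

The point of the extra copy is that only the reporting side ever touches a disk-backed file, so the hot thread's mapping is never subject to writeback stalls.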
| bluetomcat wrote: | On a server running heterogeneous CPU-bound tasks of | various users it is hardly a realistic expectation, but on | a single-user device with a single application in the | foreground I would say it is realistic, since most of these | running processes are blocked on something most of the | time. Hundreds of mostly-idle resident processes are | insignificant next to one that puts the CPU through its paces. | fwsgonzo wrote: | This thread is about the performance of threads, which we | have established can be a high cost for some. In our case | we don't run on Linux or Windows, but you can still do the | same on Linux afaik, although you will have to write a | kernel object for some things. | hinoki wrote: | One thing to worry about is that you're effectively taking over | the job of the OS scheduler. This can be a good thing since you | know more about your workload than the generic heuristics the | scheduler uses, but it also means that you might need to | reimplement some things. | | For example, only scheduling work on logical cores that share a | physical core after all physical cores have a busy logical core | (i.e., fill up the even cores first). | kccqzy wrote: | > One thing to worry about is that you're effectively taking | over the job of the OS scheduler. | | Exactly. And some apps automatically do that by default, as | if they are so arrogant as to think they must be the only | program running on that machine. Maybe good for a server app, | terrible advice for a general desktop app. | jcelerier wrote: | Thankfully we have very nice task-based schedulers nowadays | such as https://github.com/cpp-taskflow/cpp-taskflow or TBB | flowgraph | jandrewrogers wrote: | Taking over the job of the OS scheduler is explicitly the | reason for doing it; there are some classes of macro- | optimization that have this as a prerequisite. It is done for | the same reasons that high-performance database kernels | replace the I/O scheduler too.
| | To your point, it is a double-edged sword. Writing your own | schedulers requires a much higher degree of sophistication | than using the one in the OS. It is a skill that takes a long | time to develop and requires a lot of first principles | thinking. There is loads of subtlety; you can't just copy | something you found on a blog. It also isn't just about being | able to predict the behavior of your workload better than the | OS; you can also adapt your workload to the schedule state | since it is exposed to your application, the latter being a | greatly overlooked capability. | | Once you know how to design software this way, it not only | generates large increases in throughput but also enables many | elegant solutions to difficult software design problems that | simply aren't possible any other way. While the learning | curve is steep, once you are accustomed to writing software | this way it becomes pretty mechanical. | Koshkin wrote: | Paraphrasing the famous Greenspun's Tenth Rule, any home-made | threading library for C++ always ends up being "an ad hoc, | informally-specified, bug-ridden, slow implementation of half | of" Intel TBB. (Been there, done that.) | bluetomcat wrote: | This is not about coming up with a thread library. The | described scenario can be realized entirely with standard | Pthread primitives and calls like pthread_setaffinity_np. | Koshkin wrote: | Well, a good library would take the issue of the "cost" | into account for you. ___________________________________________________________________ (page generated 2020-03-01 23:00 UTC)