[HN Gopher] Cost of a thread in C++ under Linux
       ___________________________________________________________________
        
       Cost of a thread in C++ under Linux
        
       Author : eaguyhn
       Score  : 144 points
       Date   : 2020-03-01 12:41 UTC (10 hours ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
       | brainscdf wrote:
        | My personal best practice is to always create a thread pool on
        | program startup and distribute tasks among its threads. I use
        | the same practice in all other languages too. Is this best
        | practice sound, or can it lead to problems in some corner
        | cases?
        
         | cjfd wrote:
         | The thing I would worry about here is that perhaps not all of
         | your tasks have the same performance demands. There may be
         | tasks related to RPC that should run as quickly as possible and
         | tasks related to computation that could take a long time. If
          | all of the threads in the thread pool are busy with expensive
          | computations, there may be none left to handle RPC requests
          | quickly.
         | 
         | I personally prefer to do as much as possible just in one
         | thread, where you can run things asynchronously with a single
         | threaded message loop and then have a thread pool next to that
         | for expensive computations. This also tends to reduce the
         | number of things that need to be protected with a mutex.
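The loop-plus-pool shape described above can be sketched roughly as follows. `MessageLoop` and `offload` are hypothetical names for this illustration, and `std::async` stands in for a dedicated computation pool; this is a minimal sketch, not production code.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <utility>

// A minimal single-threaded message loop: everything posted here runs on
// the one thread that calls run(), so handlers need no locking of their
// own. Only the queue itself is protected by a mutex.
class MessageLoop {
public:
    void post(std::function<void()> task) {
        std::lock_guard<std::mutex> lk(mu_);
        q_.push(std::move(task));
        cv_.notify_one();
    }
    // Process tasks on the calling thread until stop() and an empty queue.
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stopped_ || !q_.empty(); });
                if (stopped_ && q_.empty()) return;
                task = std::move(q_.front());
                q_.pop();
            }
            task();  // runs without the lock held
        }
    }
    void stop() {
        std::lock_guard<std::mutex> lk(mu_);
        stopped_ = true;
        cv_.notify_one();
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    bool stopped_ = false;
};

// Ship an expensive computation elsewhere and post the result back, so
// the loop thread never blocks on it. std::async stands in for a pool.
std::future<void> offload(MessageLoop& loop, int input,
                          std::function<void(int)> on_done) {
    return std::async(std::launch::async, [&loop, input, on_done] {
        int result = input * input;  // placeholder "expensive" work
        loop.post([on_done, result] { on_done(result); });
    });
}
```

Because every completion callback is executed on the loop thread, the handlers themselves need no mutexes, which is the reduction in shared state the comment describes.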
        
         | hinoki wrote:
         | There are lots of details that might cause problems:
         | 
          | * Do your tasks block? How many threads do you need to make
          | sure you can use all your CPUs?
          | 
          | * Do your tasks access different sets of memory? Would keeping
          | similar tasks on the same CPUs reduce cache misses?
         | 
         | * Do your tasks have different priorities? You might need a
         | pool for each priority.
         | 
         | For a UI program that isn't doing anything really intensive or
         | real-time, having a common thread pool makes a lot of sense,
         | and can reduce resource use (stacks add up once you get to many
         | 10s or 100s of threads...), and improve latency (a work queue
         | with many threads will get more CPU than another with the same
          | amount of work but fewer threads).
        
           | londons_explore wrote:
           | Case in point:
           | 
           | I used nodejs for a project, and assumed that "it's all
           | javascript on one thread" would leave threading issues
           | behind.
           | 
           | My application curiously stopped responding whenever I had 5
           | or more users. Connected users could continue to do anything,
            | but new users couldn't connect, and existing users' sessions
           | would hang when executing any code that wrote to a logfile,
           | making debugging even harder. Using the nodejs debugger, the
           | internals of write(...., cb) were just never calling the done
           | callback.
           | 
           | After hours of head scratching I found that most IO from
           | nodejs is _not_ asynchronous and callback based as the docs
           | suggest, but is in fact blocking IO done from worker threads.
           | My process was using pipes to communicate with other
           | processes, and those pipes were doing blocking writes, and
           | when blocked, the worker thread was blocked.
           | 
           | There are 4 worker threads by default, so whenever 5 users
           | were using the system, all worker threads were tied up and it
           | would fail. It would have been nice for nodejs to at least
           | have printed to the console "All worker threads busy for
           | >1000ms. See nodejs.com/troubleshooting/blockingfileio.htm"
           | or something.
        
             | dirtydroog wrote:
              | As far as I'm aware, node.js is a wrapper over libuv, which
              | is a truly asynchronous socket IO library. It fakes async
              | file IO with thread pools because on Linux file IO isn't
              | async at all.
        
           | clarry wrote:
           | Also:
           | 
           | * Do you have sufficiently large batches that you can
           | efficiently assign to one thread?
           | 
           | If not, then you're just wasting a lot of time waking up to
           | receive inputs, assigning them to threads (-> put them on a
           | work queue or similar, with all the locking / atomics), and
           | waking up a thread to pull an item (locking / atomics),
           | process it, go to sleep...
           | 
           | It's easy to end up spending more time juggling tasks and
           | switching tasks than performing any useful work.
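One mitigation for the overhead described above is to amortize the queue synchronization over a batch: one lock round trip per batch instead of per item. A hedged sketch (`pull_batch` is an illustrative helper, not from any library):

```cpp
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

// Drain up to max_batch items under a single lock acquisition. A worker
// that pulls batches pays the locking (and wake-up) cost once per batch
// rather than once per item.
std::vector<int> pull_batch(std::queue<int>& q, std::mutex& mu,
                            std::size_t max_batch) {
    std::vector<int> batch;
    std::lock_guard<std::mutex> lk(mu);
    while (!q.empty() && batch.size() < max_batch) {
        batch.push_back(q.front());
        q.pop();
    }
    return batch;
}
```

Whether batching wins depends on item size: for the tiny tasks the comment describes, the per-item dispatch cost is exactly what dominates.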
        
       | maayank wrote:
        | Why the relatively high cost of threads on ARM? If anything, I'd
       | imagine it is more geared towards "massive parallel" scenarios
       | (i.e. dozens of cores).
        
       | Koshkin wrote:
       | Intel's excellent TBB library is the answer to all your worries
       | about threads in C++. (IMHO it should be made part of the
       | standard library.)
        
         | ncmncm wrote:
         | All your worries, if throughput is all you worry about, and not
         | latency. Or, if you have interaction between threads. Or, if
         | you might need to run on other archs.
         | 
         | An equivalent to TBB or GCD will be in C++23 std libraries, but
         | you can often do better with coroutines, in 20.
         | 
          | TBB and GCD still need to synchronize sometimes, and they
          | randomize workload assignment, which is bad for cache
          | locality. If you can arrange static assignment and avoid the
          | need to synchronize, you can do better, sometimes much better.
        
           | pjmlp wrote:
            | The problem with C++23 is that it will be mostly usable
            | around 2025, and C++20 co-routines still don't have a co-
            | routine aware standard library, right?
        
           | Koshkin wrote:
           | > _other archs_
           | 
           | See, for instance,
           | https://www.theimpossiblecode.com/blog/intel-tbb-on-
           | raspberr....
        
       | saagarjha wrote:
       | Is a std::thread a thin wrapper around pthreads on Linux?
        
         | signa11 wrote:
         | yes.
        
         | abjKT26nO8 wrote:
         | With the caveat that the destructor crashes your program if you
         | neither join nor explicitly detach the thread.
        
           | ncmncm wrote:
           | But, don't detach the thread.
        
           | pjmlp wrote:
           | C++20 has apparently a fix for it with std::jthread, though.
           | 
            | With all the possible learnings from Java, .NET, Erlang,
            | TBB, and the Concurrency Runtime, ISO C++ still did not
            | manage to get a proper concurrency story, and it is full of
            | traps like the one you mention.
           | 
           | Another one is std::async, which might actually be
           | synchronous, depending on a set of factors.
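Both traps are easy to demonstrate. The guard below is a common pre-C++20 workaround for the destructor issue (C++20's std::jthread joins automatically, and adds cooperative cancellation), and `run_deferred_task` shows std::async executing synchronously; the names are illustrative:

```cpp
#include <future>
#include <thread>
#include <utility>

// Destroying a joinable std::thread calls std::terminate, so pre-C++20
// code often wraps it in a join-on-destruction guard. std::jthread does
// this automatically.
struct JoiningThread {
    std::thread t;
    explicit JoiningThread(std::thread th) : t(std::move(th)) {}
    ~JoiningThread() {
        if (t.joinable()) t.join();
    }
};

// The std::async trap: with launch::deferred no thread is spawned at
// all; the task runs on whichever thread later calls get(). The default
// policy lets the implementation pick either behaviour.
std::thread::id run_deferred_task() {
    auto fut = std::async(std::launch::deferred,
                          [] { return std::this_thread::get_id(); });
    return fut.get();  // the lambda executes right here, synchronously
}
```

`run_deferred_task()` returns the caller's own thread id, which is the "might actually be synchronous" behaviour the comment warns about.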
        
         | foo101 wrote:
         | A related question if anyone knows good answers here.
         | 
         | What programming languages' de-facto thread implementations are
         | not wrappers around pthreads? I think Go has its own thread
         | implementation? Or am I mistaken?
        
           | signa11 wrote:
           | erlang has its own process/thread implementation with, iirc,
           | 64b per process.
        
             | toast0 wrote:
             | The docs [1] say:
             | 
             | > A newly spawned Erlang process uses 309 words of memory
             | in the non-SMP emulator without HiPE support. (SMP support
             | and HiPE support both add to this size.)
             | 
             | And a word is the native register size, so 4 or 8 bytes
             | these days, so fairly small, but not 64 bytes small.
             | 
             | [1] http://erlang.org/doc/efficiency_guide/processes.html
        
           | [deleted]
        
           | saagarjha wrote:
           | Right, Go uses green (userspace) threads.
        
           | daurnimator wrote:
            | zig optionally uses pthreads (depending on whether you link
            | against libc or not)
        
           | ghostwriter wrote:
           | GHC Haskell's runtime has a "default" light-weight thread
           | system (forkIO) that schedules logical threads on the
           | available operating system threads and parallelises them
           | across available CPUs:
           | 
           | - https://wiki.haskell.org/Parallelism#Multicore_GHC
           | 
           | - https://stackoverflow.com/a/41485705
           | 
           | - https://www.aosabook.org/en/posa/warp.html
        
           | pjmlp wrote:
           | Java does not specify the actual threading model, so you can
           | get green threads (user space) or red threads (kernel
           | threads).
           | 
            | The upcoming Project Loom intends to make green threads the
            | default (aka virtual threads in Loom), but you can still ask
            | for kernel threads, given that is what most JVM
            | implementations have converged on.
        
         | fwsgonzo wrote:
          | Yep. And all the synchronization is futex wait/wake.
        
       | isatty wrote:
       | Why is there such a big difference in timing between Skylake and
       | Rome? Something compiler specific? The number of steps required
       | to create a thread should be identical.
       | 
       | I'll also be interested to see the same benchmark but using
       | pthread_create directly.
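A minimal version of the pthread_create variant of the benchmark might look like this (an unscientific sketch: a single no-op thread per iteration, wall-clock timing, link with -pthread):

```cpp
#include <pthread.h>

#include <chrono>

namespace {
void* noop(void*) { return nullptr; }
}  // namespace

// Average cost, in nanoseconds, of creating and joining one thread via
// pthread_create directly, analogous to the article's std::thread loop.
// Results will vary with kernel, libc and CPU, which is the point.
double avg_thread_cost_ns(int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        pthread_t t;
        pthread_create(&t, nullptr, noop, nullptr);
        pthread_join(t, nullptr);
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::nano>(elapsed).count() /
           iterations;
}
```

Since std::thread on Linux is a thin wrapper over pthreads (per the sibling thread), the two variants should differ mostly by the wrapper's small constant overhead.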
        
         | [deleted]
        
         | thedance wrote:
         | Could be as basic as clock speed differences.
        
       | sys_64738 wrote:
       | Benchmarking in C++. Who knew!
        
         | saagarjha wrote:
         | Daniel does most of his benchmarks in C++; it's fairly well-
         | suited for the task.
        
       | hrgiger wrote:
        | Pinning with taskset improves my numbers:
       | 
       | $taskset --cpu-list 8 ./costofthread avg: 11000~
       | 
       | $taskset --cpu-list 8,11 ./costofthread avg: 33000~
       | 
       | $./costofthread avg: 60000~
        
       | known wrote:
       | On any architecture, you may need to reduce the amount of stack
       | space allocated for each thread to avoid running out of virtual
       | memory
       | 
       | http://www.kegel.com/c10k.html#limits.threads
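With raw pthreads the per-thread stack reservation can be shrunk via thread attributes; std::thread exposes no portable equivalent, so it has to be done at this layer. A sketch (`spawn_with_stack` is an illustrative name):

```cpp
#include <pthread.h>

#include <cstddef>

namespace {
void* worker(void*) { return nullptr; }
}  // namespace

// Spawn one thread with a reduced stack reservation. On glibc the default
// is often 8 MB of address space per thread; the requested size must be
// at least PTHREAD_STACK_MIN. Returns 0 on success, an errno otherwise.
int spawn_with_stack(std::size_t stack_bytes) {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    int rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0) {
        pthread_t t;
        rc = pthread_create(&t, &attr, worker, nullptr);
        if (rc == 0) pthread_join(t, nullptr);
    }
    pthread_attr_destroy(&attr);
    return rc;
}
```

As the replies note, this mostly matters for 32-bit address spaces, where a few hundred default-sized stacks can exhaust virtual memory.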
        
         | CJefferson wrote:
            | Is this even possible on a 64-bit architecture? The default
            | stack size is, I think, 2 MB, and I have previously allocated
            | terabytes of VM space without issues.
        
           | wbkang wrote:
            | No, this is more of a 32-bit issue.
        
         | nurettin wrote:
         | I can't believe this link is still relevant after more than 15
         | years.
        
         | Koshkin wrote:
         | Not making a jab at what you are saying, but to me "running out
         | of virtual memory" has always sounded like a crazy thing, like
         | running out of address space. Sure, given enough disk space,
         | your program might get (quite) a bit slower, but it should
         | still chug along just fine. Yet, running out of virtual memory
         | is indeed still a thing, especially in Windows (a workaround
         | being using memory-mapped files).
        
       | shin_lao wrote:
       | Great reminder.
       | 
       | Even if you pre-create a thread (thread pool), when the task is
       | small enough (less than 1,000 cycles), it is less expensive to do
       | it in place (for example, with fibers), because of the cost of
       | context switching.
        
         | iforgotpassword wrote:
         | Agree. A few years ago I noticed a C program we used in
         | production spawned a new thread for each incoming connection.
         | Since the vast majority of these just served two small requests
         | (think two HTTP gets) I tried adding a very simple thread pool
         | that would keep up to four idle threads around. To make a
         | thread wait for work I used an eventfd (Linux). I tried a
         | linked list and an array for the idle threads. I tried
         | protecting the get/return code with a mutex and spin lock, and
         | then made it lock free with C11s atomics. Two days later I
         | still couldn't get this to be faster than just spawning a new
         | thread every time, so I gave up this experiment.
         | 
         | It seems at least the Linux folks optimized the crap out of
         | clone() over the last years.
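The eventfd parking mechanism mentioned above, reduced to its smallest form (Linux-only; the idle list and lock-free handoff from the experiment are omitted, and `wake_worker_once` is an illustrative name):

```cpp
#include <sys/eventfd.h>
#include <unistd.h>

#include <cstdint>
#include <thread>

// One worker parked in a blocking read() on an eventfd, released by a
// single 8-byte write(). This is the wake-up primitive; a real pool would
// keep the worker looping and hand it actual work items.
int wake_worker_once() {
    int efd = eventfd(0, 0);
    if (efd < 0) return -1;
    int done = 0;
    std::thread worker([efd, &done] {
        uint64_t n = 0;
        ssize_t r = read(efd, &n, sizeof n);  // blocks until posted
        if (r == static_cast<ssize_t>(sizeof n)) done = static_cast<int>(n);
    });
    uint64_t one = 1;
    write(efd, &one, sizeof one);  // hand the worker one unit of work
    worker.join();
    close(efd);
    return done;  // 1 if the worker was woken exactly once
}
```

Even this minimal handoff costs two system calls plus a context switch, which is consistent with the comment's finding that it can fail to beat a bare clone().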
        
           | cesarb wrote:
           | > It seems at least the Linux folks optimized the crap out of
           | clone() over the last years.
           | 
           | The most essential Linux benchmark is compiling the Linux
           | kernel (since it's something the Linux kernel developers do
           | all the time, so they really feel the impact). The clone()
           | system call is used both to create new threads and to create
            | new processes, and the Linux kernel compilation uses a large
            | number of short-lived processes (each C file is a new C
           | compiler process). It's only natural that clone() is heavily
           | optimized, together with the filesystem caches (each new C
           | compiler process reads the source code files from scratch).
        
           | proverbialbunny wrote:
           | I imagine this is why coroutines and the like are often used
           | as threads within a thread pool.
        
         | mac01021 wrote:
         | Why is the cost of switching threads so much higher than the
         | cost of switching fibers?
        
           | gpderetta wrote:
            | Switching threads requires entering the kernel, which costs
            | from a few hundred to thousands of clock cycles (and it got
            | worse with all the Spectre/Meltdown mitigations).
           | 
           | A fiber switch can be done in less than 10 clock cycles.
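For contrast with a kernel thread switch, here is a user-space switch via ucontext (illustrative only; note that glibc's swapcontext still makes a sigprocmask system call to manage the signal mask, which is precisely what hand-rolled fiber switches avoid in order to get near the ~10-cycle figure):

```cpp
#include <ucontext.h>

namespace {
ucontext_t main_ctx, fiber_ctx;
int fiber_ran = 0;
char fiber_stack[64 * 1024];  // the fiber's private stack

void fiber_fn() {
    fiber_ran = 1;
    swapcontext(&fiber_ctx, &main_ctx);  // yield back; no scheduler involved
}
}  // namespace

// Switch into a "fiber" and back entirely via register save/restore.
// Returns 1 if the fiber body ran.
int run_fiber_once() {
    fiber_ran = 0;
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
    fiber_ctx.uc_link = &main_ctx;
    makecontext(&fiber_ctx, fiber_fn, 0);
    swapcontext(&main_ctx, &fiber_ctx);  // switch into the fiber
    return fiber_ran;
}
```

Dedicated fiber libraries (Boost.Context, for instance) replace swapcontext with a few instructions of assembly that skip the signal-mask bookkeeping entirely.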
        
           | fwsgonzo wrote:
           | Because threads are traditionally created and scheduled by
           | the OS, so it inevitably involves a costly context switch
           | both first into the kernel and then back again into the next
           | thread, if one is ready.
           | 
           | Userspace threads are more light-weight, but probably still
           | worse than just using fibers and co-routines. Depends on your
           | needs, I suppose.
        
       | boulos wrote:
       | I find Eli Bendersky's writeup [1] more useful as it actually
       | goes closer to the details. For readers less familiar, it also
       | makes it more clear what the time spent will depend on (how much
       | state there is to copy). Eli's post is actually a sub-post of his
       | "cost of context switching" post [2] which is more often
       | applicable (and helps answer all the questions below about
       | threadpools).
       | 
       | [1] https://eli.thegreenplace.net/2018/launching-linux-
       | threads-a...
       | 
       | [2] https://eli.thegreenplace.net/2018/measuring-context-
       | switchi...
        
       | drmeister wrote:
       | Threads are very expensive if you start throwing C++ exceptions
       | within them in parallel. You see the overall time to join the
       | threads increases with each thread you add. There is a mutex in
       | the unwinding code and as the threads grab the mutex they
       | invalidate each other's cache line. I wrote a demo to illustrate
       | the problem https://github.com/clasp-developers/ctak
       | 
       | MacOS doesn't have this problem but Linux and FreeBSD do.
        
         | ajross wrote:
         | Did gcc/libstdc++ have the same problem? The report you link is
         | for clang, it looks like.
        
           | drmeister wrote:
           | Yes, gcc/libstdc++ have the same problem. That's where I saw
           | it first and then I tried llvm/libunwind and saw the same
           | thing.
        
             | monocasa wrote:
             | What's the actual lock on? The unwinding shouldn't involve
             | shared mutable state (as someone who's been deep into the
             | DWARF unwinding VM bytecode).
        
               | brandmeyer wrote:
               | I think that this is a lock around dlopen(). dlopen
               | changes the list of mapped objects (and therefore, the
               | mapping from instruction pointer to unwind information).
        
               | drmeister wrote:
               | In libgcc it's in Unwind_Find_FDE - we think it's a lock
               | around walking the loaded dynamic libraries. I haven't
               | personally dug much deeper into it but my folks here and
               | the llvm engineers seem to be pretty certain that's the
               | problem (this: https://github.com/gcc-
               | mirror/gcc/blob/master/libgcc/unwind-...). Right now we
               | are rearranging our compiler so we throw fewer exceptions
               | because you don't have to optimize things that you don't
               | do :-).
        
               | viraptor wrote:
               | Looks to me like it only tries to protect the building of
               | the shared / sorted "seen_objects". You don't want two
               | threads rebuilding it at the same time. Although there
               | must be a way to work around this. Maybe something like
               | optimistically walk through the seen list, then grab the
               | lock to update and walk again without a lock? You should
               | be able to safely walk a linked list forwards even with
               | another thread inserting into it, right?
        
               | gpderetta wrote:
                | There is some work on the libc side to make the lock
                | optional, enabling it only at the first dlopen.
               | 
               | Edit: last time I investigated the issue ended up here:
               | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744
        
         | Koshkin wrote:
         | > _start throwing C++ exceptions_
         | 
         | Well, don't. "Exception-based programming" is an anti-pattern.
         | Exceptions should be thrown in, well, exceptional situations.
        
           | monocasa wrote:
           | They were using it in another language's runtime. Idiomatic
           | advice like that doesn't always transfer.
        
             | rumanator wrote:
              | Limiting exceptions to handling exceptional events is not
              | idiomatic advice; it's stating their use case, and the
              | whole reason they exist.
        
         | gumby wrote:
          | There's an easy optimization to avoid inspecting every frame
          | when unwinding, which C++ could not implement (for policy
          | reasons) though a platform could: add a pointer to the next
         | frame that needs unwinding to the frame setup. This is like
         | move elision.
         | 
         | If my caller has destructors to run or a catch clause this
          | pointer is null and inspection proceeds as normal. If it does
          | not, it stores the value from _its_ frame there. Then if I throw
         | an exception I jump to the next frame that needs inspection; if
         | I don't then any throw further down the call stack won't even
         | look at me.
         | 
         | The C++ standard can't call for this because of the "zero cost
          | if you don't use it" rule. But a Linux ABI could. MacOS takes
          | advantage of this kind of freedom.
        
           | ajross wrote:
            | To be fair: very few C++ applications are limited by
            | exception performance; it's a feature that's very much out
            | of favor at the moment. Most new code doesn't use
            | exceptions, yet it's not at all uncommon to find projects
            | with exception generation enabled for the benefit of one
            | library or two. So penalizing everyone else to make parallel
            | exceptions faster actually does seem like a bad trade to me
            | in the broad sense.
           | 
           | Apple does indeed have more freedom, and it may be that
           | specific MacOS components need this in ways that the general
           | community doesn't seem to. But I'd want to see numbers from a
           | bunch of real world environments before declaring this a
           | uniformly good optimization.
        
             | gumby wrote:
             | > To be fair: very few C++ applications are limited by
             | exception performance, it's a feature that's very much out
             | of favor at the moment.
             | 
              | It is distressing that Sutter's survey showed that half
              | the respondents had to disable exceptions for part or all
              | of their code. I've often heard the argument "well,
              | Google's coding standard prohibits exceptions", which is
              | bizarre, as Google's standard says "exceptions are great
              | but we have some legacy code that can't use them, so
              | we're stuck".
             | 
             | The biggest argument seems to be that they are expensive,
             | which is crazy because there's no cost if you don't raise
             | one and if you do you're already in trouble and generally
             | have plenty of time to deal with it (this is different
             | from, say, Lisp signalling which not only permits
              | continuing (!) but is in theory supposed to be common.
             | Probably a mistake in retrospect). But they allow you to
             | make the uncommon stuff uncommon (as opposed to error codes
             | which must be sprayed like shrapnel through your code).
             | 
             | There are two legit arguments against exceptions: one is
             | when you are constrained in space (e.g. embedded systems)
             | and/or time (hard realtime systems that need predictable
             | timing, even if it is slower). The other is a philosophical
             | argument that it embodies a second, parallel flow of
             | control. Since C++'s exception system is an error system
             | only, and since destructors are run automatically, it's
             | hard for me to find this second argument convincing.
        
               | pjmlp wrote:
               | Sometimes I miss C++'s flexibility from the managed
               | languages that usually use, then I remember that the
               | community is now driven by the performance at all costs
               | crowd, without exceptions, RTTI, STL and let that thought
               | go.
               | 
               | That is not the C++ I enjoy using, rather the language I
               | got to love via Turbo Vision, OWL, VCL, MFC, Qt, which is
               | not what drives the language nowadays.
        
               | gumby wrote:
               | I wouldn't characterize that group as "the community".
                | True, there are a lot of such people, mostly clustered
                | in the game industry, where superstition is rife.
                | 
                | Take a look at C++ (or C++20!) as if it were a brand new
                | language you'd never seen before, forgetting that its
                | name includes "C". _That_ language is a pretty clean,
               | expressive and straightforward language IMHO. I like
               | programming in it.
               | 
                | I'm not claiming it's unicorns farting rainbows, but
               | it's definitely pretty good.
        
               | pjmlp wrote:
                | If the community wasn't busy discussing those issues,
                | and constexpr of all things, we would already have
                | reflection, with a concurrency and networking story that
                | isn't put to shame by what Java 5 already had, let alone
                | by modern managed languages.
               | 
               | Yeah, if everyone plays ball, it might come in 5 years
               | from now, assuming C++23 gets done on time, plus the
               | compiler support stabilization.
               | 
               | Right now SG14 seems to drive some of those decisions, at
               | least from outside.
        
               | gumby wrote:
               | Those are important issues and people who care about them
               | work on them and come to committee meetings. There is
               | less consensus on the concurrency and networking side
               | which I also find frustrating but as I'm not pushing
               | those balls forward I can't complain. I do think at least
               | that the direction they're moving in is a fruitful one.
               | 
               | The standard can move quickly: consider formatted output
               | which lingered unchanged with a broken model but was
               | rapidly reformed when someone with a good model _and_
               | implementation was encouraged to come forward. Admittedly
               | a smaller topic than concurrency or networking!
        
               | pjmlp wrote:
                | Right now, the way I see it, I'd rather help the managed
                | languages I work on reach the point where binding to C++
                | is the last option, used only when nothing else helps.
               | 
                | The other language communities manage to drive language
                | progress over the Internet, something ISO apparently has
                | yet to come to grips with.
        
               | gumby wrote:
                | C++ does this as well, for example with boost, where
                | several things that entered the standard got their
                | start. And the format example I gave. It's the ISO
                | blessing that is complex, but it also acts as a forcing
                | function to make new features as orthogonal as possible.
                | Sure, it's not to everybody's taste, but you don't need
                | to follow ISO if you don't wish to.
        
               | detaro wrote:
               | That seems like a misconception about the driving forces
               | behind C++ today.
        
               | pjmlp wrote:
               | Not when one looks into the recent ABI discussions.
        
               | gumby wrote:
               | You mean a refusal to break ABI?
        
               | pjmlp wrote:
                | Yes, as a means to achieve performance improvements that
                | aren't that relevant to the average Joe C++ dev.
               | 
               | The one doing application stuff in Qt, MFC, wxWidgets.
               | 
                | Or those like myself, where C++ only matters as a means
                | to implement native bindings to system libraries, or
                | GPGPU shading languages based on C++.
        
               | gumby wrote:
                | Well, I can afford a complete ABI break, given the kind
                | of code I work on. Most people cannot. Binary
                | incompatibilities are very hard for most people to
                | manage.
                | 
                | So I would benefit from any number of ABI-breaking
                | proposals, but can understand the committee's reticence.
        
               | barrkel wrote:
               | Yeah, when the moderates move out, the hard core that
               | remains swings towards what makes C++ unique, and that's
               | not general purpose application programming and language
               | features that support it.
        
               | pjmlp wrote:
                | Which, looking at its use in mainstream OS SDKs, means
                | drivers, composition engines, shaders and real-time
                | audio engines: the SQL of systems programming, kind of.
        
               | fpoling wrote:
                | Writing exception-safe code is not free in itself.
                | Surely one can do it, but it requires more mental energy
                | to write and even more effort to review.
        
               | keldaris wrote:
               | > which is crazy because there's no cost if you don't
               | raise one
               | 
               | This is just false in the general case. The presence
               | (potential or actual) of exceptions often just serves as
               | an optimization barrier in current compilers. That's not
               | to even invoke bizarre but not infrequent issues like
               | this [1]. I too have had codebases that miraculously sped
               | up upon disabling exceptions despite not throwing
               | anything. Identifying the exact causes of these
               | situations is hard and typically not done, because it's
               | far easier to just add a compiler switch and pretend
               | there are no exceptions in C++ and get back to work.
               | 
               | Many people in performance sensitive domains just don't
               | find it remotely worthwhile to care about features that
               | have these sorts of difficult to predict and debug costs.
               | When your workflow already consists of writing highly
               | explicit, simple to reason about code that you frequently
               | inspect in disassembled form, exceptions (and RTTI for a
               | host of obvious reasons) are the last thing you'd want to
               | enable. At best it's just extraneous noise in the
               | assembly, at worst you take a sizable perf hit and have
               | no idea why.
               | 
               | [1] https://twitter.com/timsweeneyepic/status/12230774046
               | 6037145...
        
               | throwaway17_17 wrote:
               | This description of C++ usage you gave:
               | 
               | A workflow consisting of writing highly explicit, simple
               | to reason about code that you frequently inspect in
               | disassembled form
               | 
               | Is possibly the most descriptive and succinct description
               | of my coding practice. I really like this formulation and
               | I am going to shamelessly steal it in the future,
               | repeatedly.
        
               | mehrdadn wrote:
               | We need sample code for stuff like this so they can be
               | referred back to as canonical examples. Frequently
               | asserted C++ misconceptions?
        
             | acqq wrote:
             | > it's a feature that's very much out of favor at the
             | moment.
             | 
             | I'm glad if it is so. Exceptions should actually be
             | "exceptional" and not the part of the normal execution
             | flow. Whoever has other ideas has the wrong model of what,
             | at the lower levels, exceptions actually do.
        
           | brandmeyer wrote:
           | This sounds a lot like the SJLJ runtime model that was used
           | in G++ for years.
           | 
           | > The C++ standard can't call for this because of the "zero
           | cost if you don't use it" rule. But a Linux ABI could. The
           | MacOS takes advantage of this kind of freedom.
           | 
           | That's not really true. It has nothing to do with the
           | standard. It has everything to do with the compiler's users
           | complaining about the performance hit relative to DWARF EH.
           | It is part of the social contract between the standards body,
           | the compiler author community, and the user community that
           | unused features don't cost us in runtime performance.
        
             | gumby wrote:
             | Yes, it's similar I suppose, though much lower overhead. I
             | was a bigger advocate for frame inspection than Michael was
             | in the early years because of my Lisp background. He was
             | (correctly) more concerned with performance.
             | 
             | As for "policy" vs "social contract" I think we basically
             | agree.
        
           | drmeister wrote:
           | This sounds interesting - thank you! We need to interoperate
           | with C++ - so we couldn't use this, could we? We could add
           | this pointer to our own frames but C++ frames won't have this
           | info - so I'm not sure how they would interoperate. We need
           | to be able to throw an exception and invoke both Clasp frame
           | cleanups and C++ frame cleanups up the stack. We have a crazy
           | mix of C++ and CL frames on the stack at any time.
        
             | gumby wrote:
             | Sure you could. You presumably already have a mechanism for
             | doing frame unwinding either via compatibility with the C++
             | runtime's throw() implementation or by supplying your own.
             | So for your own stack frames you can do what you like.
             | Exceptions raised by c++ code called from lisp would also
             | work the same way as they do now.
             | 
             | And when you are unwinding a lisp->C++ boundary (that is,
             | lisp code called by a C++ function) you are free to do what
             | you like until you get to the first Lisp frame; if it
             | doesn't have an unwind-protect then its "ignore me" pointer
             | just points up to its caller, which is examined by the C++
             | runtime anyway.
             | 
             | The nice part of that second paragraph is that if that
             | first lisp callee was called by a non-c++ function (say a
             | fortran function) you might even have an opportunity to set
             | the "parent frame for inspection pointer" to skip over all
             | the fortran frames and point directly to the lowest C++
             | function below you...which you could manage via a small
             | change to gold or llvm-ld.
        
         | ampdepolymerase wrote:
         | What about Windows?
        
         | lachlan-sneff wrote:
         | Unrelated to this thread, but are you still working on the
         | matter compiler project?
        
           | drmeister wrote:
           | That's the point of everything. We have spun up a company and
           | big things are in the works: http://www.thirdlaw.tech/
        
             | lachlan-sneff wrote:
             | Are you calling the software the "matter compiler" or is a
             | "matter compiler"/"nanofactory" an eventual end goal of the
             | project?
        
         | wwarner wrote:
         | Wow, I wouldn't have guessed that! Nice data!
        
           | drmeister wrote:
           | Thank you. We have developed a Common Lisp implementation
           | that uses LLVM and interoperates with C++ and uses C++
           | exception handling to unwind the stack. Common Lisp code
           | relies on stack unwinding a fair bit. Imagine my surprise
           | when my fancy multi-threaded compiler can't get out of first
           | gear (tops out at ~150% cpu) on Linux. Sheesh.
        
             | gumby wrote:
             | I've been really interested in clasp; what's the current
             | state? Can't see any recent posts that summarize it.
        
               | drmeister wrote:
               | It's going well. I'll post something soon. We've just
               | been working on it quietly. We have multithreading,
               | unicode, cffi etc, good debugging support, cross-language
               | profiling and more.
        
         | yjftsjthsd-h wrote:
         | Any idea why Darwin is beating Linux and FreeBSD here? Are they
         | doing something different that could/should be implemented in
         | the others?
        
       | signa11 wrote:
       | imho, if the _cost_ of thread creation is where the
       | bottleneck is, then more likely than not you are doing things
       | wrong.
        
         | Ensorceled wrote:
         | This is just another way of saying what the article just said.
        
       | bluetomcat wrote:
       | For CPU-bound tasks, it is best to pre-create a number of
       | threads roughly corresponding to the number of logical
       | execution cores. Every thread is then a worker with a main
       | loop, rather than being spawned on demand. Pin their affinity
       | to a specific core and you are as close as possible to the
       | "perfect" arrangement, with minimized context switches and
       | core-local cache data being there most of the time.
        
         | emilfihlman wrote:
         | A valid reason to have more is IO, and particularly file IO.
        
           | inetknght wrote:
           | File IO should be using asynchronous methods too. If your OS
           | doesn't support asynchronous file IO then you're not using
           | any of the big 3.
        
             | tomlu wrote:
             | Exactly this. You want one thread pool of size ~= core
             | count, then you want to have a completely _different_ "max
             | number of IO jobs" type deal that doesn't use threads at
             | all.
             | 
             | C# async/await is pretty good for this (IO operations do
             | not count towards CPU task count).
        
         | fwsgonzo wrote:
         | Yep. We do this on real hardware and never preempt. There is
         | nothing more performant than that. Using IPIs (inter-processor
         | interrupts) you can trigger events like "more work has been
         | added" on each CPUs queue. Additionally, when you put all
         | interrupts of a device solely on one specific CPU you won't
         | have to lock anything.
         | 
         | Some other things: pthreads generally have high cost, and that
         | means C++ threads do too. pthreads have quite a few features
         | that you regularly don't use, which you can skip completely
         | using fibers and coroutines.
        
           | londons_explore wrote:
           | How many percent performance do you think you gain by never
           | pre-empting?
           | 
           | If we're talking 50%, the complexity sounds worth it, but if
           | it's 1% I think I'd prefer to stick with standard scheduling
           | and know my program will 'just work' on any CPU or OS, and
           | with any libraries I choose to use.
        
             | fwsgonzo wrote:
             | Probably a lot on bare metal, but we are a special case
             | here. It really brings down the latency to respond to
             | network events. A single context switch is in the range of
             | 100K-1M CPU cycles.
             | 
              | The reason is that we avoid all the indirect costs of
              | context switching, namely the various caches that have
              | to be flushed. And also the cost of the context switch
              | itself, of course.
             | 
             | However, you can still do a lot on Linux to equalize things
             | if you really want to get down to it. For anything but
             | special cases Linux really does a good job with scheduling.
             | After all, you are likely not running much else other than
             | your intended service.
             | 
              | That said, for me this thread was a slight wake-up call
              | that made me look more into fibers and coroutines. I
              | have been wanting to use these for some things for a
              | long time.
        
               | boulos wrote:
               | > A single context switch is in the range of 100K-1M CPU
               | cycles.
               | 
                | 1M cycles is roughly 300 microseconds (assuming a 3
                | GHz processor, so 3 cycles per nanosecond). Eli's
                | graph from the post I referenced above has a context
                | switch in the 1-3 microsecond range [1] depending on
                | taskset/core pinning. The high end (3 microseconds)
                | is about 10000 cycles.
               | 
               | Maybe you mean fork() or pthread_create for your 1M
               | cycles?
               | 
               | [1] https://eli.thegreenplace.net/images/2018/plot-
               | launch-switch...
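A rough, self-contained way to see where a given machine lands: two threads hand control back and forth through a condition variable, and one round trip is assumed to cost roughly two thread switches. This sketch and its function name are mine (not Eli's benchmark), and it measures wakeup latency as well as the switch itself; numbers vary widely with pinning and load.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Ping-pong estimate of voluntary-switch-plus-wakeup cost in
// nanoseconds: each of `rounds` round trips is counted as two switches.
double estimate_switch_ns(int rounds) {
    std::mutex m;
    std::condition_variable cv;
    int turn = 0;  // 0: main thread's turn, 1: worker's turn
    auto start = std::chrono::steady_clock::now();
    std::thread worker([&] {
        std::unique_lock<std::mutex> lk(m);
        for (int i = 0; i < rounds; ++i) {
            cv.wait(lk, [&] { return turn == 1; });
            turn = 0;
            cv.notify_one();
        }
    });
    {
        std::unique_lock<std::mutex> lk(m);
        for (int i = 0; i < rounds; ++i) {
            turn = 1;
            cv.notify_one();
            cv.wait(lk, [&] { return turn == 0; });
        }
    }
    worker.join();
    auto elapsed = std::chrono::steady_clock::now() - start;
    double ns = std::chrono::duration<double, std::nano>(elapsed).count();
    return ns / (2.0 * rounds);
}
```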
        
             | jandrewrogers wrote:
             | The difference can be quite large, details are workload and
             | software dependent. It isn't just the context-switching
             | overhead (which is prohibitively high these days), it also
             | significantly improves average cache locality, which is the
             | bottleneck for many high-performance codes.
             | 
             | Some types of software optimizations require the ability to
             | correctly infer local CPU cache contents, which is
             | difficult when arbitrary processes are semi-randomly
             | stepping all over that cache.
        
           | Koshkin wrote:
           | > _never preempt_
           | 
           | I don't think this is a realistic expectation on Linux and,
           | especially, Windows which runs hundreds of threads of its own
           | you don't want to know about. (Besides, we must remember that
           | multithreading was invented and found quite useful in the era
           | of "single-core" processors.)
        
             | cma wrote:
             | Mask off all those other processes to only run on core 0.
        
             | emidln wrote:
              | Linux has an isolcpus boot parameter that tells the
              | kernel never to schedule anything on a given CPU. This
              | is useful when latency matters.
        
               | ncmncm wrote:
                | I learned recently that even with isolcpus, _and_
                | nohz, _and_ interrupts directed elsewhere, the kernel
                | will still pause the thread on the isolated cpu if it
                | has mmapped a file (e.g., to report stats) and the
                | kernel decides it's time to copy the bits to disk. If
                | you don't want stalls, only map writable files on a
                | tmpfs volume. To snapshot the file, copy it to
                | another file on the same volume, and then snapshot
                | the copy.
        
               | gpderetta wrote:
                | Yes, it is called TLB shootdown and it is required to
                | preserve the integrity of the TLB across CPUs on an
                | unmap or a dirty-bit change. If latency is important,
                | don't use disk-backed writeable mappings.
                | 
                | Edit: to clarify: any writeable mapping or any unmap
                | will cause TLB shootdown interrupts to be broadcast
                | to all currently running threads of a process.
        
               | ncmncm wrote:
               | Of course one never unmaps the file, or any mapped
               | memory, so this has nothing to do with TLB problems
               | (which are also a thing -- another reason processes are
               | better than threads).
               | 
                | These pauses happen even without unmapping. No
                | synchronization is available, so right in the middle
                | of whatever you are doing, the kernel decides a
                | static snapshot of the pages' state must be written,
                | write-protects the pages first, and blocks your
                | process until the write is done. It's just rude.
        
             | bluetomcat wrote:
              | On a server running heterogeneous CPU-bound tasks of
              | various users it is hardly a realistic expectation, but
              | on a single-user device with a single application in
              | the foreground I would say it is realistic, since most
              | of these running processes are blocked on something
              | most of the time. Hundreds of mostly-idle resident
              | processes are insignificant next to one that puts the
              | CPU through its paces.
        
             | fwsgonzo wrote:
              | This thread is about the performance of threads, which
              | we have established can be costly for some. In our case
              | we don't run on Linux or Windows, but you can still do
              | the same on Linux afaik, although you will have to
              | write a kernel object for some things.
        
         | hinoki wrote:
         | One thing to worry about is that you're effectively taking over
         | the job of the OS scheduler. This can be a good thing since you
         | know more about your workload than the generic heuristics the
         | scheduler uses, but it also means that you might need to
         | reimplement some things.
         | 
         | Like only scheduling work on logical cores that share a
         | physical core after all physical cores have a busy logical
         | core (i.e., fill up the even-numbered cores first).
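The "fill physical cores first" heuristic needs a list of which logical CPUs are hyperthread siblings, which Linux exposes through sysfs. A hedged sketch, assuming a typical Linux system; the helper name is invented for the example.

```cpp
#include <fstream>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Pick one logical CPU per physical core by reading Linux's sysfs
// topology. The (physical_package_id, core_id) pair identifies a
// physical core; logical CPUs sharing it are hyperthread siblings, so
// we keep only the first logical CPU seen for each pair.
std::vector<int> one_cpu_per_physical_core() {
    std::vector<int> picked;
    std::set<std::pair<int, int>> seen;  // (package_id, core_id)
    for (int cpu = 0;; ++cpu) {
        std::string base =
            "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/topology/";
        std::ifstream core(base + "core_id");
        std::ifstream pkg(base + "physical_package_id");
        int core_id = 0, pkg_id = 0;
        if (!(core >> core_id) || !(pkg >> pkg_id)) break;  // no more CPUs
        if (seen.insert({pkg_id, core_id}).second)
            picked.push_back(cpu);
    }
    return picked;
}
```

Scheduling CPU-bound work onto the returned CPUs first, and onto their siblings only once every physical core is busy, implements the heuristic in the comment above.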
        
           | kccqzy wrote:
           | > One thing to worry about is that you're effectively taking
           | over the job of the OS scheduler.
           | 
           | Exactly. And some apps automatically do that by default, as
           | if they are so arrogant as to think they must be the only
           | program running on that machine. Maybe good for a server app,
           | terrible advice for a general desktop app.
        
           | jcelerier wrote:
           | Thankfully we have very nice task-based schedulers nowadays
           | such as https://github.com/cpp-taskflow/cpp-taskflow or TBB
           | flowgraph
        
           | jandrewrogers wrote:
           | Taking over the job of the OS scheduler is explicitly the
           | reason for doing it, there are some classes of macro-
           | optimization that have this as a prerequisite. It is done for
           | the same reasons that high-performance database kernels
           | replace the I/O scheduler too.
           | 
           | To your point, it is a double-edged sword. Writing your
           | own schedulers requires a much higher degree of
           | sophistication than using the one in the OS. It is a
           | skill that takes a long time to develop and requires a
           | lot of first-principles thinking; there is loads of
           | subtlety, and you can't just copy something you found on
           | a blog. It also isn't just about being able to predict
           | the behavior of your workload better than the OS: you can
           | also adapt your workload to the scheduler state, since it
           | is exposed to your application; the latter is a greatly
           | overlooked capability.
           | 
           | Once you know how to design software this way, it not only
           | generates large increases in throughput but also enables many
           | elegant solutions to difficult software design problems that
           | simply aren't possible any other way. While the learning
           | curve is steep, once you are accustomed to writing software
           | this way it becomes pretty mechanical.
        
         | Koshkin wrote:
         | Paraphrasing the famous Greenspun's Tenth Rule, any home-made
         | threading library for C++ always ends up being "an ad hoc,
         | informally-specified, bug-ridden, slow implementation of half
         | of" Intel TBB. (Been there, done that.)
        
           | bluetomcat wrote:
           | This is not about coming up with a thread library. The
           | described scenario can be realized entirely with standard
           | Pthread primitives and calls like pthread_setaffinity.
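For reference, a minimal sketch of the kind of call mentioned above, assuming Linux with glibc (pthread_setaffinity_np and pthread_getaffinity_np are non-portable GNU extensions); the helper name `pin_self` is invented for the example.

```cpp
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one logical core and verify the result,
// using only the Pthreads affinity primitives the comment mentions.
bool pin_self(unsigned core) {
    cpu_set_t want;
    CPU_ZERO(&want);
    CPU_SET(core, &want);
    if (pthread_setaffinity_np(pthread_self(), sizeof(want), &want) != 0)
        return false;  // e.g. the core is not in our allowed set
    cpu_set_t got;
    CPU_ZERO(&got);
    if (pthread_getaffinity_np(pthread_self(), sizeof(got), &got) != 0)
        return false;
    // Success means exactly the requested core remains in the mask.
    return CPU_ISSET(core, &got) && CPU_COUNT(&got) == 1;
}
```

A worker-pool constructor would call this (via the thread's native handle) once per spawned thread, cycling through core numbers.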
        
             | Koshkin wrote:
             | Well, a good library would take the issue of the "cost"
             | into account for you.
        
       ___________________________________________________________________
       (page generated 2020-03-01 23:00 UTC)