[HN Gopher] C Runtime Overhead (2015)
       ___________________________________________________________________
        
       C Runtime Overhead (2015)
        
       Author : Zababa
       Score  : 128 points
       Date   : 2022-01-03 17:41 UTC (5 hours ago)
        
 (HTM) web link (ryanhileman.info)
 (TXT) w3m dump (ryanhileman.info)
        
       | cozzyd wrote:
       | So, on my system (Fedora 34, glibc 2.33, gcc11, Ryzen 5 3600) a
       | trivial program takes only about 1 ms according to time:
        | 
        |         $ cat trivial.c
        |         #include <stdio.h>
        |         int main(int nargs, char ** args)
        |         {
        |           FILE * f = fopen("/dev/null","w");
        |           for (int i = 0; i < nargs; i++)
        |           {
        |             fprintf(f,"%s\n",args[i]);
        |           }
        |           return 0;
        |         }
        | 
        |         $ make trivial  # just default CFLAGS
        |         $ time ./trivial
        | 
        |         real 0m0.001s
        |         user 0m0.000s
        |         sys  0m0.001s
       | 
        | But if I add strace -tt, it does indeed take 9 ms, even if I
        | redirect strace output:
        | 
        |         $ time strace -tt ./trivial 2> /dev/null
        | 
        |         real 0m0.009s
        |         user 0m0.005s
        |         sys  0m0.004s
       | 
       | So, is author just measuring strace overhead?
        
         | justicezyx wrote:
         | But the 1ms number is also measured with strace.
         | 
          | Plus, the article is from 2015, and the author did not
          | mention the CPU or other configuration, so that also makes
          | things muddy.
        
           | cozzyd wrote:
           | Yes, but if there are almost no syscalls, then strace is
           | doing a lot less.
           | 
           | Profiling my trivial program with callgrind shows
           | (unsurprisingly) that the majority of the time is in dynamic
           | library relocation and whatever __GI__tunables_init does:
           | https://i.imgur.com/Yligh7S.png
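            | 
            | glibc's dynamic loader can also report its own startup cost
            | directly, which is a quick way to confirm that without
            | callgrind (the exact fields vary between glibc versions),
            | and a statically linked build skips the relocation work
            | entirely:
            | 
            |         $ LD_DEBUG=statistics ./trivial
            |         ...: time needed for relocation: ...
            |         ...: time needed to load objects: ...
            | 
            |         $ gcc -static -O2 trivial.c -o trivial-static
            |         $ time ./trivial-static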
        
         | matu3ba wrote:
         | strace can have a significant overhead:
         | https://www.brendangregg.com/blog/2014-05-11/strace-wow-much...
         | What does the output of `perf trace` say?
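          | 
          | For example (stock perf subcommands; perf trace may need root
          | or a lowered perf_event_paranoid):
          | 
          |         $ perf trace ./trivial
          |         $ perf stat -r 100 ./trivial
          | 
          | perf trace hooks the syscall tracepoints instead of
          | ptrace-stopping the process twice per syscall the way strace
          | does, so its overhead is far lower, and perf stat -r 100
          | averages over many runs.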
        
       | dang wrote:
       | Discussed at the time:
       | 
       |  _C Runtime Overhead_ -
       | https://news.ycombinator.com/item?id=8958867 - Jan 2015 (31
       | comments)
        
       | ginko wrote:
       | I never quite understood why assembly wasn't part of programming
       | language shootout competitions. C was always treated as the speed
       | of light for computing, which didn't quite make sense to me.
        
         | pjscott wrote:
         | In my experience there's usually not much speed advantage to be
         | had from assembly _unless_ you have a specific thesis about how
          | you're going to do better than the code that a good compiler
         | would generate; e.g. the article author skipping C runtime
         | setup overhead, or Mike Pall dissecting the ARM implementation
         | of one of the bytecodes in LuaJIT (recommended!):
         | 
         | https://www.reddit.com/r/programming/comments/hkzg8/author_o...
         | 
         | Unless you can think of something you can do in assembly that a
         | compiler just wouldn't be able to think of, their ability to
         | generate good machine code is really quite impressive:
         | 
         | https://ridiculousfish.com/blog/posts/will-it-optimize.html
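          | 
          | A standard small illustration of this (not from the article,
          | just the usual strength-reduction example): any modern gcc or
          | clang at -O2 turns division by a constant into a
          | multiply-by-reciprocal plus shifts rather than a slow div
          | instruction:
          | 
          |         /* div7.c: check with `gcc -O2 -S div7.c` --
          |            the asm has a multiply and shifts, no div */
          |         unsigned div7(unsigned x)
          |         {
          |           return x / 7;
          |         }
          | 
          | To beat that by hand you'd have to already know the same
          | reciprocal trick the compiler applies automatically.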
        
           | jart wrote:
           | Normally it's better to focus on high-level time-complexity
           | improving optimizations but there's still a whole lot of
           | things where you really need assembly micro-optimizations,
           | since they usually offer a 10x or 100x speedup. It's just
           | that those things usually only concern core libraries. Stuff
           | like crc32. But the rest of the time we're just writing code
           | that glues all the assembly optimized subroutines together.
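            | 
            | For crc32 specifically, the hardware version is one
            | instruction per step via the SSE4.2 intrinsics. A minimal
            | sketch (note the x86 instruction computes CRC-32C, the
            | Castagnoli polynomial, not zlib's crc32; real
            | implementations also process wider chunks and multiple
            | streams in parallel):
            | 
            |         #include <stddef.h>
            |         #include <stdint.h>
            |         #include <nmmintrin.h> /* compile with -msse4.2 */
            | 
            |         uint32_t crc32c(uint32_t crc,
            |                         const unsigned char *p, size_t n)
            |         {
            |           crc = ~crc;
            |           while (n--) /* one crc32 instruction per byte */
            |             crc = _mm_crc32_u8(crc, *p++);
            |           return ~crc;
            |         }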
        
       | eatonphil wrote:
       | Awesome post! And (2015) for the title maybe.
        
         | Zababa wrote:
         | Thanks, added!
        
       | 2ton_jeff wrote:
        | A couple of years ago I did a "Terminal Video Series" segment
        | that also highlights the runtime overhead of 12 other
        | languages, with C of course actually being one of the best:
       | https://2ton.com.au/videos/tvs_part1/
        
         | tialaramex wrote:
          | I was actually surprised at how expensive C is there; I'd
          | expected maybe a dozen or so syscalls to set up the
          | environment the provided runtime wants, but that's a lot of
          | syscalls for
         | not very much value. Both C++ and Rust are more expensive but
         | they're setting up lots of stuff I might use in real programs,
         | even if it's wasted for printing "hello" - but what on Earth
         | can C need all those calls for when the C runtime environment
         | is so impoverished anyway?
         | 
          | Go was surprising to me in the opposite direction: they're
         | setting up a pretty nice hosted environment, and they're not
         | using many instructions or system calls to do that compared to
         | languages that would say they're much more focused on
         | performance. I'd have expected to find it closer to Perl or
         | something, and it's really much lighter.
        
           | aidenn0 wrote:
            | glibc isn't particularly optimized for startup time. It also
           | defaults to dynamic linkage which adds several syscalls, and
           | even if it is statically linked, it may dynamically load NSS
           | plugins. (musl gets around the latter by hardcoding support
           | for NSCD, a service that can cache the NSS results).
           | 
           | C11 adds several things that benefit from early
           | initialization (like everything around threads), but GCC had
           | some level of support for most of them before that.
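            | 
            | Easy to see the difference if musl is installed (via the
            | musl-gcc wrapper; package names vary by distro):
            | 
            |         $ musl-gcc -static -O2 trivial.c -o trivial-musl
            |         $ strace -c ./trivial-musl
            | 
            | strace -c just counts syscalls: a static musl binary makes
            | a small handful, versus dozens for a dynamically linked
            | glibc build that has to map ld.so.cache, libc.so, and so
            | on.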
        
       | stabbles wrote:
       | What's kinda neat about musl libc is that the interpreter/runtime
       | linker and libc.so are the same executable, so that if you just
        | dynamically link to libc, it does not need to open the ld.so
        | cache or locate and load libc.so.
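        | 
        | You can see this with readelf on a musl-linked binary (the
        | path is the usual x86_64 one; the symlink target varies by
        | distro):
        | 
        |         $ readelf -l hello | grep interpreter
        |           [Requesting program interpreter: /lib/ld-musl-x86_64.so.1]
        | 
        | and that interpreter path is just a symlink to musl's libc.so,
        | so startup never has to locate and map a separate libc.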
        
         | averne_ wrote:
         | More importantly this uses the devkitPro/libnx homebrew
         | toolchain, while the OP project uses the official SDK (the
         | "complicated legal reasons" behind the absence of code probably
         | being an NDA signature).
        
       | josefx wrote:
       | > I happened to run strace -tt against my solution (which
       | provides microsecond-accurate timing information for syscalls)
       | 
        | Weirdly, I always found strace's timing results unreliable
        | below about a millisecond; it just seemed to add too much
        | overhead itself.
       | 
        | Also, making system calls from your own code is somewhere
        | between prohibited and badly supported on most OSes. Some treat
        | calls that didn't pass through the system libc as a security
        | issue and will intercept them, while Linux may just silently
        | corrupt your process memory if you try something fancy, as the
        | Go team had to find out.
        
         | mananaysiempre wrote:
         | Umm, what's the story with Go on Linux? My understanding is
         | that Linux explicitly supports making syscalls from your own
         | code, it's just that on platforms where the syscall story is a
         | giant mess (32-bit x86) the prescription is to either use the
         | slow historical interface (INT 80h) or to jump to the vDSO
         | (which will do SYSENTER or SYSCALL or whatever--SYSENTER in
         | particular has the lovely property of _not saving the return
         | address_ , so a common stub is pretty much architecturally
         | required[1]).
         | 
         | If I'm guessing correctly about what you're referring to, the
         | Go people did something in between, got broken in a Linux
         | kernel release, complained at the kernel people, the kernel got
         | a patch to unbreak them. The story on MacOS or OpenBSD, where
         | the kernel developers will cheerfully tell you to take a hike
         | if you make a syscall yourself, seems much worse to me.
         | 
         | (And yes, I'd say there _is_ a meaningful difference between a
         | couple of instructions in the vDSO and Glibc's endless
         | wrappers.)
         | 
         | [1]:
         | https://lore.kernel.org/lkml/Pine.LNX.4.44.0212172225410.136...
         | 
         | ETA: Wait, no, I was thinking about the Android breakage[2].
         | What's the Go story then?
         | 
         | [2]:
         | https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/...
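          | 
          | (For the curious: finding the vDSO from your own runtime is
          | itself documented, see vdso(7). A minimal sketch of the first
          | step, which is all the kernel hands you:
          | 
          |         #include <stdio.h>
          |         #include <sys/auxv.h> /* getauxval, AT_SYSINFO_EHDR */
          | 
          |         int main(void)
          |         {
          |           /* address of the vDSO's ELF header; a runtime
          |              then parses its dynamic symbol table for
          |              __vdso_clock_gettime etc. */
          |           printf("vDSO at %#lx\n",
          |                  getauxval(AT_SYSINFO_EHDR));
          |           return 0;
          |         }
          | 
          | Everything past that is ordinary ELF symbol-table walking.)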
        
           | aw1621107 wrote:
            | This gets rather over my head, but my best guess is that
            | it's this bug, which involved the Go runtime underestimating
            | how much stack space vDSO code may need (blog post at [0],
            | fix at [1]).
           | 
           | [0]: https://marcan.st/2017/12/debugging-an-evil-go-runtime-
           | bug/
           | 
           | [1]: https://github.com/golang/go/commit/a158382b1c9c0b95a7d4
           | 1865...
        
             | mananaysiempre wrote:
             | > _I built 32 kernels, one for each bit of the SHA-1
             | prefix, which only took 29 minutes._
             | 
             | Oh, that _is_ evil, thank you. I even encountered a link to
             | this article a couple of months ago[1], but wasn't hooked
             | enough to go and read it.
             | 
             | Though the conclusion sounds somewhat dubious: the kernel
             | doesn't document or control stack usage limits for the
             | vDSO, they happen to blow way up on a system built with
             | obscure (if arguably well-motivated) compiler options, a
             | language runtime that tries to minimize stack usage crashes
             | as a result, and somehow the fault is with the runtime in
             | question for not going via libc (which happens to virtually
             | always run with a large stack and a guard page, thus
             | turning this insidious concurrent memory corruption bug
             | into a mere extremely unlikely crash)?
             | 
             | More like we're collectively garbage at accounting for our
             | stack usage. To be fair to the kernel developers, I would
             | also never guess, looking at this implementation of
             | clock_gettime() [2], that you could compile it in such a
             | way that it ends up requiring 4K of stack space on pain of
              | memory corruption _in concurrent programs only_ (its
              | originating in the kernel source tree has little to do with
              | the bug; it's just weirdly compiled userspace C code
             | executing on top of a small unguarded stack).
             | 
             | [1] https://utcc.utoronto.ca/~cks/space/blog/unix/UnixAPIAn
             | dCRun... via <https://utcc.utoronto.ca/~cks/space/blog/prog
             | ramming/CStackS...>, <https://utcc.utoronto.ca/~cks/space/b
             | log/unix/StackSizeLimit...>, and <https://utcc.utoronto.ca/
             | ~cks/space/blog/tech/StackSizeLimit...>.
             | 
             | [2] https://elixir.bootlin.com/linux/latest/C/ident/__cvdso
             | _cloc...
        
         | badsectoracula wrote:
         | > Also making system calls from your own code varies between
         | prohibited and badly supported on most OSes.
         | 
         | This is not the case on Linux though, the official API of the
         | kernel is its system calls, not any specific C library.
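          | 
          | Concretely, on x86-64 "the API is the syscalls" means the
          | register ABI below is stable. A minimal sketch (syscall
          | number in rax, args in rdi/rsi/rdx; rcx and r11 are
          | clobbered by the instruction):
          | 
          |         static long raw_write(int fd, const void *buf,
          |                               unsigned long n)
          |         {
          |           long ret;
          |           __asm__ volatile ("syscall"
          |                   : "=a"(ret)
          |                   : "a"(1 /* __NR_write */), "D"(fd),
          |                     "S"(buf), "d"(n)
          |                   : "rcx", "r11", "memory");
          |           return ret;
          |         }
          | 
          |         int main(void)
          |         {
          |           raw_write(1, "hello\n", 6);
          |           return 0;
          |         }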
        
         | mgaunard wrote:
         | That's the difference between precision and accuracy.
         | 
         | Having more precision doesn't magically make the data accurate.
        
         | commandlinefan wrote:
         | > making system calls from your own code varies between
         | prohibited and badly supported
         | 
         | Which is probably at least part of the reason that libc itself
         | ends up being slower than a specific, targeted solution.
        
         | scottlamb wrote:
         | josefx> Also making system calls from your own code varies
         | between prohibited and badly supported on most OSes. Some see
         | calls that didn't pass through the system libc as a security
         | issue and will intercept them, while Linux may just silently
         | corrupt your process memory if you try something fancy as the
         | Go team had to find out.
         | 
         | On Linux, making syscalls directly is fine. Good point about
         | other platforms, but many people only care about Linux, for
         | better or worse. And the author's last paragraph (quoted below)
         | suggests using an alternate/static-linking-friendly libc, not
         | making direct syscalls yourself. Presumably on platforms where
         | those alternate libcs aren't available, you continue using the
         | standard libc.
         | 
         | ryanhileman> If you're running into process startup time issues
         | in a real world scenario and ever actually need to do this, it
         | might be worth your time to profile and try one of the
         | alternative libc implementations (like musl libc or diet libc).
         | 
         | IMHO, the Go vDSO problem [1] wasn't due to making direct
         | syscalls but basically calling a userspace library without
         | meeting its assumptions. I'd describe Linux's vDSO as a
         | userspace library for making certain syscalls with less
         | overhead. (If you don't care about the overhead, you can call
         | them as you'd call any other syscall instead.) It assumed the
         | standard ABI, in which there's a guard page as well as
         | typically a generous amount of stack to begin with. Golang
         | called into it from a thread that used Go's non-standard ABI
         | with less stack space available (Go's own functions check the
         | stack size on entry and copy the stack if necessary) and no
         | guard page. On some Linux builds (with -fstack-check,
          | apparently used by Gentoo Hardened), it actually used enough
         | stack to overflow. Without a guard page, that caused memory
         | corruption.
         | 
         | [1] https://github.com/golang/go/issues/20427
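          | 
          | A small illustration of that distinction, assuming glibc
          | (clock_gettime normally runs in the vDSO fast path; routing
          | it through syscall(2) forces the real trap and involves no
          | vDSO stack assumptions at all):
          | 
          |         #include <stdio.h>
          |         #include <time.h>
          |         #include <unistd.h>
          |         #include <sys/syscall.h>
          | 
          |         int main(void)
          |         {
          |           struct timespec ts;
          |           /* direct trap into the kernel, no vDSO involved */
          |           syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
          |           printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
          |           /* libc path, usually satisfied in the vDSO */
          |           clock_gettime(CLOCK_MONOTONIC, &ts);
          |           printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
          |           return 0;
          |         }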
        
       ___________________________________________________________________
       (page generated 2022-01-03 23:00 UTC)