[HN Gopher] C Runtime Overhead (2015)
___________________________________________________________________

C Runtime Overhead (2015)

Author : Zababa
Score  : 128 points
Date   : 2022-01-03 17:41 UTC (5 hours ago)

(HTM) web link (ryanhileman.info)
(TXT) w3m dump (ryanhileman.info)

| cozzyd wrote:
| So, on my system (Fedora 34, glibc 2.33, gcc 11, Ryzen 5 3600) a
| trivial program takes only about 1 ms according to time:
|
|   $ cat trivial.c
|   #include <stdio.h>
|   int main(int nargs, char ** args) {
|     FILE * f = fopen("/dev/null", "w");
|     for (int i = 0; i < nargs; i++) {
|       fprintf(f, "%s\n", args[i]);
|     }
|     return 0;
|   }
|   $ make trivial   # just default CFLAGS
|   $ time ./trivial
|   real  0m0.001s
|   user  0m0.000s
|   sys   0m0.001s
|
| But if I add strace -tt, it does indeed take 9 ms, even if I
| redirect strace output:
|
|   $ time strace -tt ./trivial 2> /dev/null
|   real  0m0.009s
|   user  0m0.005s
|   sys   0m0.004s
|
| So, is the author just measuring strace overhead?
| justicezyx wrote:
| But the 1 ms number is also measured with strace.
|
| Plus, the article is from 2015, and the author did not mention
| the CPU and other configuration. So that also makes things
| muddy.
| cozzyd wrote:
| Yes, but if there are almost no syscalls, then strace is
| doing a lot less.
|
| Profiling my trivial program with callgrind shows
| (unsurprisingly) that the majority of the time is in dynamic
| library relocation and whatever __GI__tunables_init does:
| https://i.imgur.com/Yligh7S.png
| matu3ba wrote:
| strace can have a significant overhead:
| https://www.brendangregg.com/blog/2014-05-11/strace-wow-much...
| What does the output of `perf trace` say?
| dang wrote:
| Discussed at the time:
|
| _C Runtime Overhead_ -
| https://news.ycombinator.com/item?id=8958867 - Jan 2015 (31
| comments)
| ginko wrote:
| I never quite understood why assembly wasn't part of programming
| language shootout competitions. C was always treated as the speed
| of light for computing, which didn't quite make sense to me.
| pjscott wrote:
| In my experience there's usually not much speed advantage to be
| had from assembly _unless_ you have a specific thesis about how
| you're going to do better than the code that a good compiler
| would generate; e.g. the article author skipping C runtime
| setup overhead, or Mike Pall dissecting the ARM implementation
| of one of the bytecodes in LuaJIT (recommended!):
|
| https://www.reddit.com/r/programming/comments/hkzg8/author_o...
|
| Unless you can think of something you can do in assembly that a
| compiler just wouldn't be able to think of, compilers' ability
| to generate good machine code is really quite impressive:
|
| https://ridiculousfish.com/blog/posts/will-it-optimize.html
| jart wrote:
| Normally it's better to focus on high-level optimizations that
| improve time complexity, but there's still a whole lot of
| things where you really need assembly micro-optimizations,
| since they usually offer a 10x or 100x speedup. It's just
| that those things usually only concern core libraries. Stuff
| like crc32. But the rest of the time we're just writing code
| that glues all the assembly-optimized subroutines together.
| eatonphil wrote:
| Awesome post! And (2015) for the title maybe.
| Zababa wrote:
| Thanks, added!
| 2ton_jeff wrote:
| A couple of years ago I did a "Terminal Video Series" segment
| that also highlights the runtime overhead of 12 other languages,
| with C, of course, actually one of the best ones:
| https://2ton.com.au/videos/tvs_part1/
| tialaramex wrote:
| I was actually surprised by how expensive C is. I'd
| expected maybe a dozen or so syscalls to set up the environment
| the provided runtime wants, but that's a lot of syscalls for
| not very much value. Both C++ and Rust are more expensive, but
| they're setting up lots of stuff I might use in real programs,
| even if it's wasted for printing "hello" - but what on Earth
| can C need all those calls for when the C runtime environment
| is so impoverished anyway?
|
| Go was surprising to me in the opposite direction: they're
| setting up a pretty nice hosted environment, and they're not
| using many instructions or system calls to do that compared to
| languages that would say they're much more focused on
| performance. I'd have expected to find it closer to Perl or
| something, and it's really much lighter.
| aidenn0 wrote:
| glibc isn't particularly optimized for startup time. It also
| defaults to dynamic linkage, which adds several syscalls, and
| even if it is statically linked, it may dynamically load NSS
| plugins. (musl gets around the latter by hardcoding support
| for nscd, a service that can cache the NSS results.)
|
| C11 adds several things that benefit from early
| initialization (like everything around threads), but GCC had
| some level of support for most of them before that.
| stabbles wrote:
| What's kinda neat about musl libc is that the interpreter/runtime
| linker and libc.so are the same executable, so if you just
| dynamically link to libc, it does not need to open the ld.so
| cache or locate and load libc.so.
| averne_ wrote:
| More importantly this uses the devkitPro/libnx homebrew
| toolchain, while the OP project uses the official SDK (the
| "complicated legal reasons" behind the absence of code probably
| being an NDA signature).
| josefx wrote:
| > I happened to run strace -tt against my solution (which
| provides microsecond-accurate timing information for syscalls)
|
| Weirdly, I always found the timing results of strace at under one
| millisecond generally unreliable; it just seemed to add too much
| overhead itself.
|
| Also, making system calls from your own code varies between
| prohibited and badly supported on most OSes. Some see calls that
| didn't pass through the system libc as a security issue and will
| intercept them, while Linux may just silently corrupt your
| process memory if you try something fancy, as the Go team had to
| find out.
| mananaysiempre wrote:
| Umm, what's the story with Go on Linux? My understanding is
| that Linux explicitly supports making syscalls from your own
| code; it's just that on platforms where the syscall story is a
| giant mess (32-bit x86), the prescription is to either use the
| slow historical interface (INT 80h) or to jump to the vDSO
| (which will do SYSENTER or SYSCALL or whatever - SYSENTER in
| particular has the lovely property of _not saving the return
| address_, so a common stub is pretty much architecturally
| required[1]).
|
| If I'm guessing correctly about what you're referring to, the
| Go people did something in between, got broken in a Linux
| kernel release, complained at the kernel people, and the kernel
| got a patch to unbreak them. The story on macOS or OpenBSD,
| where the kernel developers will cheerfully tell you to take a
| hike if you make a syscall yourself, seems much worse to me.
|
| (And yes, I'd say there _is_ a meaningful difference between a
| couple of instructions in the vDSO and glibc's endless
| wrappers.)
|
| [1]:
| https://lore.kernel.org/lkml/Pine.LNX.4.44.0212172225410.136...
|
| ETA: Wait, no, I was thinking about the Android breakage[2].
| What's the Go story then?
|
| [2]:
| https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/...
| aw1621107 wrote:
| This gets rather over my head, but my best guess is this bug
| having to do with the Go runtime underestimating how much
| stack space vDSO code may need (blog post at [0], fix at [1]).
|
| [0]: https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/
|
| [1]: https://github.com/golang/go/commit/a158382b1c9c0b95a7d41865...
| mananaysiempre wrote:
| > _I built 32 kernels, one for each bit of the SHA-1 prefix,
| which only took 29 minutes._
|
| Oh, that _is_ evil, thank you. I even encountered a link to
| this article a couple of months ago[1], but wasn't hooked
| enough to go and read it.
|
| Though the conclusion sounds somewhat dubious: the kernel
| doesn't document or control stack usage limits for the vDSO;
| they happen to blow way up on a system built with obscure (if
| arguably well-motivated) compiler options; a language runtime
| that tries to minimize stack usage crashes as a result; and
| somehow the fault is with the runtime in question for not going
| via libc (which happens to virtually always run with a large
| stack and a guard page, thus turning this insidious concurrent
| memory corruption bug into a mere extremely unlikely crash)?
|
| More like we're collectively garbage at accounting for our
| stack usage. To be fair to the kernel developers, I would
| also never guess, looking at this implementation of
| clock_gettime() [2], that you could compile it in such a
| way that it ends up requiring 4K of stack space on pain of
| memory corruption _in concurrent programs only_ (it
| originating in the kernel source tree has little to do with
| the bug; it's just weirdly-compiled userspace C code
| executing on top of a small unguarded stack).
|
| [1] https://utcc.utoronto.ca/~cks/space/blog/unix/UnixAPIAndCRun...
| via <https://utcc.utoronto.ca/~cks/space/blog/programming/CStackS...>,
| <https://utcc.utoronto.ca/~cks/space/blog/unix/StackSizeLimit...>,
| and <https://utcc.utoronto.ca/~cks/space/blog/tech/StackSizeLimit...>.
|
| [2] https://elixir.bootlin.com/linux/latest/C/ident/__cvdso_cloc...
| badsectoracula wrote:
| > Also making system calls from your own code varies between
| prohibited and badly supported on most OSes.
|
| This is not the case on Linux, though: the official API of the
| kernel is its system calls, not any specific C library.
| mgaunard wrote:
| That's the difference between precision and accuracy.
|
| Having more precision doesn't magically make the data accurate.
| commandlinefan wrote:
| > making system calls from your own code varies between
| prohibited and badly supported
|
| Which is probably at least part of the reason that libc itself
| ends up being slower than a specific, targeted solution.
| scottlamb wrote:
| josefx> Also making system calls from your own code varies
| between prohibited and badly supported on most OSes. Some see
| calls that didn't pass through the system libc as a security
| issue and will intercept them, while Linux may just silently
| corrupt your process memory if you try something fancy as the
| Go team had to find out.
|
| On Linux, making syscalls directly is fine. Good point about
| other platforms, but many people only care about Linux, for
| better or worse. And the author's last paragraph (quoted below)
| suggests using an alternate/static-linking-friendly libc, not
| making direct syscalls yourself. Presumably on platforms where
| those alternate libcs aren't available, you continue using the
| standard libc.
|
| ryanhileman> If you're running into process startup time issues
| in a real-world scenario and ever actually need to do this, it
| might be worth your time to profile and try one of the
| alternative libc implementations (like musl libc or diet libc).
|
| IMHO, the Go vDSO problem [1] wasn't due to making direct
| syscalls but basically calling a userspace library without
| meeting its assumptions. I'd describe Linux's vDSO as a
| userspace library for making certain syscalls with less
| overhead. (If you don't care about the overhead, you can call
| them as you'd call any other syscall instead.) It assumed the
| standard ABI, in which there's a guard page as well as
| typically a generous amount of stack to begin with. Golang
| called into it from a thread that used Go's non-standard ABI
| with less stack space available (Go's own functions check the
| stack size on entry and copy the stack if necessary) and no
| guard page.
| On some Linux builds (with -fstack-check,
| apparently used by Gentoo Hardened), it actually used enough
| stack to overflow. Without a guard page, that caused memory
| corruption.
|
| [1] https://github.com/golang/go/issues/20427
___________________________________________________________________
(page generated 2022-01-03 23:00 UTC)