[HN Gopher] C performance mystery: delete unused string constant ___________________________________________________________________ C performance mystery: delete unused string constant Author : dolmen Score : 42 points Date : 2020-06-24 20:40 UTC (2 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | ufo wrote: | When weird performance things like this happens, it can be | helpful to test in other environments. If the performance change | does not happen in other operating systems or hardware then it | suggests that the weird performance you are observing could be | due to an unusual coincidence in your particular system. | | That said, if you do want to figure out why deleting that global | variable made a difference then using Linux's perf tool might | give you more informationto work with. One time I had a weird | program where inserting a NOP instruction in a certain location | made it run twice as fast. After investigation we found out that | the difference was with branch prediction. The presence or | absense of that NOP instruction affected the addresses of the | jump targets the inner loop's switch statement and the version | without the NOP for whatever reason led to lots of branch | mispredictions that particular moder of Intel CPU. Perhaps | because of a collision in the hash tables used by the branch | predictor. | rurban wrote: | He should experiment with shortening his ENV vars one byte by | one, to see similar benchmarking effects. It process affects | alignment, and if you unlucky bad alignment can cost a few | percent. | moonchild wrote: | > deleting this one line of code can have a dramatic effect on | seemingly unrelated performance micro-benchmarks. Some numbers | get better (e.g. +5%), some numbers get worse (e.g. -10%). The | same micro-benchmark can get faster on one C compiler but slower | on another | | There's a fundamental disconnect that makes it difficult for | humans to reason about performance in computer programs. Because | the speed of light is so slow, computer architecture as we know | it will always rely on cache and OoO to be fast. The human brain | does seem to work out of order, but it's only used to thinking | about a world that runs in order. When we use theory of mind, we | don't model other people's minds, we use our own as a model for | theirs; see mirror neurons[1]. | | Because of this, standard code benchmarks are not very useful, | unless they can demonstrate order-of-magnitude speedups. Even | something like a causal profiler[2][3][4], which attempts to | control for the volatile aspects of performance, is of limited | use; it cannot control for all variables and its results will | likely be invalidated by the same architectural variation it | tries to control for. Instead (with respect to performance) we | should focus on three factors: | | - Code maintainability | | - Algorithmic complexity | | - Cache coherency | | Everything else is a distraction. | | 1. https://en.wikipedia.org/wiki/Mirror_neuron | | 2. https://www.youtube.com/watch?v=r-TLSBdHe1A | | 3. https://arxiv.org/pdf/1608.03676v1.pdf | | 4. https://github.com/plasma-umass/coz | viraptor wrote: | I'm not sure I understand why they revert the change. You get | random small +/- changes in performance with the change, but you | get slightly cleaner code. What's the reason not to want that? | SomeoneFromCA wrote: | I remember changing a _constant_ int into a _volatile global | variable_ , and my code became 4 times faster. It was on Ubuntu | 16.04. I might actually find code and post here later. | sfblah wrote: | I worked on some code once for an app that would crash if you | added a comment in a specific place. We removed the comment and | then added a comment asking developers not to re-add the | comment. | IAmLiterallyAB wrote: | Was it a language where the comment would be compiled out | ahead of time? Or still present at runtime like JS | dzdt wrote: | Lots of performance mysteries, particularly in micro-benchmarks, | are explained by how either code or data aligns to memory cache | lines. A seemingly unrelated change will cause memory locations | to move. No idea if that is the case here or not but it looks | like a possibility. | nwallin wrote: | This is probably the answer. | | I watched a great talk about this, and how to avoid problems | that result from it. The presenter built a framework that made | random changes to alignment of various blocks of code with two | different versions of the same microbenchmark until it had high | statistical confidence that either one was faster than the | other, or it would eject and say that the difference between | the two was statistically insignificant. Or something along | those lines. | | I'm looking for the video now, but I can't find it. Pretty sure | it was CppCon but I'm not positive. | kristjansson wrote: | https://news.ycombinator.com/item?id=23634519 :) | moonchild wrote: | Possibly 'Performance Matters'? | (https://www.youtube.com/watch?v=r-TLSBdHe1A) | dkersten wrote: | This was the very first thing I thought of when I saw this | submission title! Great talk, well worth watching. | kristjansson wrote: | This[1] talk from Emery Berger at last year's Strange Loop does | a good job discussing the impact of alignment and other 'small' | changes on performance, and how to estimate the actual impact | of changes on performance | | [1]: https://www.youtube.com/watch?v=r-TLSBdHe1A | codezero wrote: | Yep, I'm not even a hobbyist in this area, and immediately | wondered if there was an alignment change because of it. | | With that said, it sounds like at least with GCC, a global | constant would go in the TEXT section of the binary and I can't | think of why that would affect performance, especially since | the variable was seemingly unused. Neat, I hope someone finds | the thread to pull here :) | kijiki wrote: | > ...a global constant would go in the TEXT section of the | binary... | | Wouldn't it go into the .rodata section, not .text? | codezero wrote: | I'm not sure, I am parroting what I thought I read on SO - | so you are much more likely to be right, sorry about that! | vivekseth wrote: | clang seems to work this way, and macOS aliases gcc -> | clang by default. Wrote more about this here: | https://news.ycombinator.com/item?id=23634495 | tom_mellior wrote: | > it sounds like at least with GCC, a global constant would | go in the TEXT section of the binary | | I don't know why you think it sounds like that, but GCC (on | Linux) puts global constants into the .rodata section (read- | only data) and other global variables into the .data section. | | > I can't think of why that would affect performance, | especially since the variable was seemingly unused. | | I'm guessing data alignment and corresponding changes in | cache hits and misses. But I wonder if there's a scenario | where this change changes the offsets of other variables from | the base of the data section, which changes the encodings and | therefore the sizes of the instructions accessing those | variables, which overall changes code size and might cause | code alignment issues. Probably too far-fetched. | vivekseth wrote: | On macOS, "gcc" puts strings that don't include a null byte | in the TEXT section. I put quotes around gcc, since it | seems that by default macOS aliases gcc to clang. | | You can run the program I built here to see for your self: | https://github.com/vivekseth/blog-posts/tree/master/Jump- | Add... | | Since the string in the TEXT section, we can actually | execute it as if it were code! | | After you build the program you can run `otool -t ./a.out` | to verify that the string `execString` is indeed in the | TEXT section. | tom_mellior wrote: | Interesting. And if you put a null byte at the end of | that character array, it would put it in a different | section? I don't have access to MacOS, but I would be | interested in seeing the output of clang -S on that code | (with and without a null byte), maybe you could add it to | the repository? | codezero wrote: | Thanks for the correction, hopefully my statement that I'm | hardly a hobbyist makes it clear I'm not super sure :) | | I guess I don't understand enough about how this variable | sitting in memory might affect hit/miss/alignment | especially if it's not used in the program as it's running. | | Would it be that some variable that _is_ used is pushed off | of a page in memory or something by the unused variable, so | access it is slower? I guess that makes sense. | muth02446 wrote: | here is a related paper: | | https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.... | AdmiralAsshat wrote: | Isn't this usually where someone sufficiently versed in Assembly | would look at the generated assembler code and figure out what | changed? | moonchild wrote: | The presence (or absence) of a global variable has no effect on | the rest of the generated code. Nothing changed. You could | possibly look at the alignment of the in-memory representation | generated by the relocator,runtime dynamic linker. Likely, some | bits of code are now on the same cache line that were | previously separate, and vice versa. | pascal_cuoq wrote: | You need to look at the disassembly of the generated binary to | make sense of this sort of performance variation (paying | attention to line cache boundaries for code and data), and even | so, it is highly non-trivial. The performance counters found in | modern processors sometimes help | (https://en.wikipedia.org/wiki/Hardware_performance_counter ). | | https://www.agner.org/optimize/microarchitecture.pdf contains | the sort of information you need to have absorbed before you | even start investigating. In most cases, it's not worth | acquiring the expertise for 5% one way or the other in micro- | benchmarks. If you care about these 5%, you shouldn't be | programming in C in the first place. | | And then there is this anecdote: | | My job is to make tools to detect subtle undefined behaviors in | C programs. I once had the opportunity to report a signed | arithmetic overflow in a library that its authors considered, | rightly or wrongly, to be performance-critical. My suggestion | was: | | ... this is not one of the subtle undefined behaviors that we | are the only ones to detect, UBSan would also have told you | that the library was doing something wrong with "x + y" where x | and y are ints. The good news is that you can write | "(int)((unsigned)x + y)", this _is_ defined and it behaves | exactly like you expected "x + y" to behave (but had no right | to). | | And the answer was "Ah, no, sorry, we can't apply this change, | I ran the benchmarks and the library was 2% slower with it. | It's a no, I'm afraid". | | The thing is, I am pretty sure that any modern optimizing C | compiler (the interlocutor was using Clang) has been generating | the exact same binary code for the two constructs for years | (unless it applies an optimization that relies on the addition | not overflowing in the "x + y" case, but then the authors would | have noticed). I would bet a house that the binary that was 2% | slower in benchmarks was byte-identical to the reference one. | gnufx wrote: | Performance counters are vital, and you don't need to grovel | the disassembly yourself in association with profiling, even | if it's feasible. Get a tool to do it for you; MAQAO is one | in that area (x86-specific, and unfortunately part- | proprietary). | | Anyway, yes, measurements are what you need, rather than | guesses, along with correctness checks, indeed. | voldacar wrote: | If I may ask, what was the use case for this code that they | cared so much about a 2% difference in benchmarks? Aerospace? | Game engine? Packet routing? | acqq wrote: | As far as I see, the source code of the library is not plain C | but .go, which is then transpiled to .c | | Anyway, even if what we see in the resulting .c is "just some | missing string" it's not necessarily obvious how the strings are | supposed to be handled in the whole .c domain. Specifically, it | can be that some other constants (even not necessarily sting | constants, but the constants in the same segment where the string | constants are!) are being used very frequently by the code which | is measured (let's call them "hotter" constants), and the removal | of that one string constant simply affected the placement of the | "hotter" constants). | | In short, I suspect that the deletion of the constant is not the | only way to get different results, but that the different results | would also happen even when changing the order of the definition | of constants, without removing any of them. | | So the way I would attack that problem, if it would be desired to | fix the performance issues, is: I'd measure the access of all the | constants in the same segment, and identify which are used during | the whole run of these benchmarks -- those are "hot." The | solution is then to modify the code to be less dependent on such | constants inside of the "hot" loops. Often the number of these | that have to be fixed is low, but it's also possible it's | otherwise. | | So I believe it's not that much a mystery as it appears to be. (I | have even more specific experiences and also suggestions. And I'm | also looking for a new dream _remote_ job. Anybody needs this | kind of expertise, for some reasonable longer term, or shorter | term but seriously paid?) ___________________________________________________________________ (page generated 2020-06-24 23:00 UTC)