[HN Gopher] C performance mystery: delete unused string constant
       ___________________________________________________________________
        
       C performance mystery: delete unused string constant
        
       Author : dolmen
       Score  : 42 points
       Date   : 2020-06-24 20:40 UTC (2 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | ufo wrote:
       | When weird performance things like this happens, it can be
       | helpful to test in other environments. If the performance change
       | does not happen in other operating systems or hardware then it
       | suggests that the weird performance you are observing could be
       | due to an unusual coincidence in your particular system.
       | 
       | That said, if you do want to figure out why deleting that global
       | variable made a difference then using Linux's perf tool might
       | give you more informationto work with. One time I had a weird
       | program where inserting a NOP instruction in a certain location
       | made it run twice as fast. After investigation we found out that
       | the difference was with branch prediction. The presence or
       | absense of that NOP instruction affected the addresses of the
       | jump targets the inner loop's switch statement and the version
       | without the NOP for whatever reason led to lots of branch
       | mispredictions that particular moder of Intel CPU. Perhaps
       | because of a collision in the hash tables used by the branch
       | predictor.
        
       | rurban wrote:
       | He should experiment with shortening his ENV vars one byte by
       | one, to see similar benchmarking effects. It process affects
       | alignment, and if you unlucky bad alignment can cost a few
       | percent.
        
       | moonchild wrote:
       | > deleting this one line of code can have a dramatic effect on
       | seemingly unrelated performance micro-benchmarks. Some numbers
       | get better (e.g. +5%), some numbers get worse (e.g. -10%). The
       | same micro-benchmark can get faster on one C compiler but slower
       | on another
       | 
       | There's a fundamental disconnect that makes it difficult for
       | humans to reason about performance in computer programs. Because
       | the speed of light is so slow, computer architecture as we know
       | it will always rely on cache and OoO to be fast. The human brain
       | does seem to work out of order, but it's only used to thinking
       | about a world that runs in order. When we use theory of mind, we
       | don't model other people's minds, we use our own as a model for
       | theirs; see mirror neurons[1].
       | 
       | Because of this, standard code benchmarks are not very useful,
       | unless they can demonstrate order-of-magnitude speedups. Even
       | something like a causal profiler[2][3][4], which attempts to
       | control for the volatile aspects of performance, is of limited
       | use; it cannot control for all variables and its results will
       | likely be invalidated by the same architectural variation it
       | tries to control for. Instead (with respect to performance) we
       | should focus on three factors:
       | 
       | - Code maintainability
       | 
       | - Algorithmic complexity
       | 
       | - Cache coherency
       | 
       | Everything else is a distraction.
       | 
       | 1. https://en.wikipedia.org/wiki/Mirror_neuron
       | 
       | 2. https://www.youtube.com/watch?v=r-TLSBdHe1A
       | 
       | 3. https://arxiv.org/pdf/1608.03676v1.pdf
       | 
       | 4. https://github.com/plasma-umass/coz
        
       | viraptor wrote:
       | I'm not sure I understand why they revert the change. You get
       | random small +/- changes in performance with the change, but you
       | get slightly cleaner code. What's the reason not to want that?
        
       | SomeoneFromCA wrote:
       | I remember changing a _constant_ int into a _volatile global
       | variable_ , and my code became 4 times faster. It was on Ubuntu
       | 16.04. I might actually find code and post here later.
        
         | sfblah wrote:
         | I worked on some code once for an app that would crash if you
         | added a comment in a specific place. We removed the comment and
         | then added a comment asking developers not to re-add the
         | comment.
        
           | IAmLiterallyAB wrote:
           | Was it a language where the comment would be compiled out
           | ahead of time? Or still present at runtime like JS
        
       | dzdt wrote:
       | Lots of performance mysteries, particularly in micro-benchmarks,
       | are explained by how either code or data aligns to memory cache
       | lines. A seemingly unrelated change will cause memory locations
       | to move. No idea if that is the case here or not but it looks
       | like a possibility.
        
         | nwallin wrote:
         | This is probably the answer.
         | 
         | I watched a great talk about this, and how to avoid problems
         | that result from it. The presenter built a framework that made
         | random changes to alignment of various blocks of code with two
         | different versions of the same microbenchmark until it had high
         | statistical confidence that either one was faster than the
         | other, or it would eject and say that the difference between
         | the two was statistically insignificant. Or something along
         | those lines.
         | 
         | I'm looking for the video now, but I can't find it. Pretty sure
         | it was CppCon but I'm not positive.
        
           | kristjansson wrote:
           | https://news.ycombinator.com/item?id=23634519 :)
        
           | moonchild wrote:
           | Possibly 'Performance Matters'?
           | (https://www.youtube.com/watch?v=r-TLSBdHe1A)
        
             | dkersten wrote:
             | This was the very first thing I thought of when I saw this
             | submission title! Great talk, well worth watching.
        
         | kristjansson wrote:
         | This[1] talk from Emery Berger at last year's Strange Loop does
         | a good job discussing the impact of alignment and other 'small'
         | changes on performance, and how to estimate the actual impact
         | of changes on performance
         | 
         | [1]: https://www.youtube.com/watch?v=r-TLSBdHe1A
        
         | codezero wrote:
         | Yep, I'm not even a hobbyist in this area, and immediately
         | wondered if there was an alignment change because of it.
         | 
         | With that said, it sounds like at least with GCC, a global
         | constant would go in the TEXT section of the binary and I can't
         | think of why that would affect performance, especially since
         | the variable was seemingly unused. Neat, I hope someone finds
         | the thread to pull here :)
        
           | kijiki wrote:
           | > ...a global constant would go in the TEXT section of the
           | binary...
           | 
           | Wouldn't it go into the .rodata section, not .text?
        
             | codezero wrote:
             | I'm not sure, I am parroting what I thought I read on SO -
             | so you are much more likely to be right, sorry about that!
        
               | vivekseth wrote:
               | clang seems to work this way, and macOS aliases gcc ->
               | clang by default. Wrote more about this here:
               | https://news.ycombinator.com/item?id=23634495
        
           | tom_mellior wrote:
           | > it sounds like at least with GCC, a global constant would
           | go in the TEXT section of the binary
           | 
           | I don't know why you think it sounds like that, but GCC (on
           | Linux) puts global constants into the .rodata section (read-
           | only data) and other global variables into the .data section.
           | 
           | > I can't think of why that would affect performance,
           | especially since the variable was seemingly unused.
           | 
           | I'm guessing data alignment and corresponding changes in
           | cache hits and misses. But I wonder if there's a scenario
           | where this change changes the offsets of other variables from
           | the base of the data section, which changes the encodings and
           | therefore the sizes of the instructions accessing those
           | variables, which overall changes code size and might cause
           | code alignment issues. Probably too far-fetched.
        
             | vivekseth wrote:
             | On macOS, "gcc" puts strings that don't include a null byte
             | in the TEXT section. I put quotes around gcc, since it
             | seems that by default macOS aliases gcc to clang.
             | 
             | You can run the program I built here to see for your self:
             | https://github.com/vivekseth/blog-posts/tree/master/Jump-
             | Add...
             | 
             | Since the string in the TEXT section, we can actually
             | execute it as if it were code!
             | 
             | After you build the program you can run `otool -t ./a.out`
             | to verify that the string `execString` is indeed in the
             | TEXT section.
        
               | tom_mellior wrote:
               | Interesting. And if you put a null byte at the end of
               | that character array, it would put it in a different
               | section? I don't have access to MacOS, but I would be
               | interested in seeing the output of clang -S on that code
               | (with and without a null byte), maybe you could add it to
               | the repository?
        
             | codezero wrote:
             | Thanks for the correction, hopefully my statement that I'm
             | hardly a hobbyist makes it clear I'm not super sure :)
             | 
             | I guess I don't understand enough about how this variable
             | sitting in memory might affect hit/miss/alignment
             | especially if it's not used in the program as it's running.
             | 
             | Would it be that some variable that _is_ used is pushed off
             | of a page in memory or something by the unused variable, so
             | access it is slower? I guess that makes sense.
        
         | muth02446 wrote:
         | here is a related paper:
         | 
         | https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13....
        
       | AdmiralAsshat wrote:
       | Isn't this usually where someone sufficiently versed in Assembly
       | would look at the generated assembler code and figure out what
       | changed?
        
         | moonchild wrote:
         | The presence (or absence) of a global variable has no effect on
         | the rest of the generated code. Nothing changed. You could
         | possibly look at the alignment of the in-memory representation
         | generated by the relocator,runtime dynamic linker. Likely, some
         | bits of code are now on the same cache line that were
         | previously separate, and vice versa.
        
         | pascal_cuoq wrote:
         | You need to look at the disassembly of the generated binary to
         | make sense of this sort of performance variation (paying
         | attention to line cache boundaries for code and data), and even
         | so, it is highly non-trivial. The performance counters found in
         | modern processors sometimes help
         | (https://en.wikipedia.org/wiki/Hardware_performance_counter ).
         | 
         | https://www.agner.org/optimize/microarchitecture.pdf contains
         | the sort of information you need to have absorbed before you
         | even start investigating. In most cases, it's not worth
         | acquiring the expertise for 5% one way or the other in micro-
         | benchmarks. If you care about these 5%, you shouldn't be
         | programming in C in the first place.
         | 
         | And then there is this anecdote:
         | 
         | My job is to make tools to detect subtle undefined behaviors in
         | C programs. I once had the opportunity to report a signed
         | arithmetic overflow in a library that its authors considered,
         | rightly or wrongly, to be performance-critical. My suggestion
         | was:
         | 
         | ... this is not one of the subtle undefined behaviors that we
         | are the only ones to detect, UBSan would also have told you
         | that the library was doing something wrong with "x + y" where x
         | and y are ints. The good news is that you can write
         | "(int)((unsigned)x + y)", this _is_ defined and it behaves
         | exactly like you expected "x + y" to behave (but had no right
         | to).
         | 
         | And the answer was "Ah, no, sorry, we can't apply this change,
         | I ran the benchmarks and the library was 2% slower with it.
         | It's a no, I'm afraid".
         | 
         | The thing is, I am pretty sure that any modern optimizing C
         | compiler (the interlocutor was using Clang) has been generating
         | the exact same binary code for the two constructs for years
         | (unless it applies an optimization that relies on the addition
         | not overflowing in the "x + y" case, but then the authors would
         | have noticed). I would bet a house that the binary that was 2%
         | slower in benchmarks was byte-identical to the reference one.
        
           | gnufx wrote:
           | Performance counters are vital, and you don't need to grovel
           | the disassembly yourself in association with profiling, even
           | if it's feasible. Get a tool to do it for you; MAQAO is one
           | in that area (x86-specific, and unfortunately part-
           | proprietary).
           | 
           | Anyway, yes, measurements are what you need, rather than
           | guesses, along with correctness checks, indeed.
        
           | voldacar wrote:
           | If I may ask, what was the use case for this code that they
           | cared so much about a 2% difference in benchmarks? Aerospace?
           | Game engine? Packet routing?
        
       | acqq wrote:
       | As far as I see, the source code of the library is not plain C
       | but .go, which is then transpiled to .c
       | 
       | Anyway, even if what we see in the resulting .c is "just some
       | missing string" it's not necessarily obvious how the strings are
       | supposed to be handled in the whole .c domain. Specifically, it
       | can be that some other constants (even not necessarily sting
       | constants, but the constants in the same segment where the string
       | constants are!) are being used very frequently by the code which
       | is measured (let's call them "hotter" constants), and the removal
       | of that one string constant simply affected the placement of the
       | "hotter" constants).
       | 
       | In short, I suspect that the deletion of the constant is not the
       | only way to get different results, but that the different results
       | would also happen even when changing the order of the definition
       | of constants, without removing any of them.
       | 
       | So the way I would attack that problem, if it would be desired to
       | fix the performance issues, is: I'd measure the access of all the
       | constants in the same segment, and identify which are used during
       | the whole run of these benchmarks -- those are "hot." The
       | solution is then to modify the code to be less dependent on such
       | constants inside of the "hot" loops. Often the number of these
       | that have to be fixed is low, but it's also possible it's
       | otherwise.
       | 
       | So I believe it's not that much a mystery as it appears to be. (I
       | have even more specific experiences and also suggestions. And I'm
       | also looking for a new dream _remote_ job. Anybody needs this
       | kind of expertise, for some reasonable longer term, or shorter
       | term but seriously paid?)
        
       ___________________________________________________________________
       (page generated 2020-06-24 23:00 UTC)