[HN Gopher] Bubble sort slower with -O3 than -O2 with GCC
       ___________________________________________________________________
        
       Bubble sort slower with -O3 than -O2 with GCC
        
       Author : asicsp
       Score  : 110 points
       Date   : 2021-10-17 12:13 UTC (10 hours ago)
        
 (HTM) web link (stackoverflow.com)
 (TXT) w3m dump (stackoverflow.com)
        
       | codr7 wrote:
       | I've written a lot of interpreters in C/++ lately, and spent a
       | lot of time profiling and optimizing them.
       | 
       | I can't remember -O3 ever being faster than -O2, it's usually
       | slightly slower for me.
        
         | DannyBee wrote:
         | For interpreters, that's highly likely to be true, especially
         | if they are mainly spending time in switch tables or executing
         | basic opcodes.
        
           | gnufx wrote:
           | Python, specifically, is supposed to get most benefit from
            | -fno-semantic-interposition.
        
           | infberg wrote:
           | Care to explain? -O3 generates larger code than -O2?
        
             | pclmulqdq wrote:
             | Yes, -O3 tends to include a lot of features that increase
             | code size, like aggressive loop unrolling. If you are
             | jumping around a large amount of code, -O3 generally
             | performs more poorly than -O2, but if you are running a
             | tight loop (like HPC code), -O3 is better.
             | 
             | In the past, at a time when I worked on a very performance
             | sensitive codebase that was also limited in scope, we
              | compiled with -Os and did all the loop optimizations we
              | wanted manually (and with pragmas). That produced faster
              | code than -O2 or -O3.
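              | 
              | The pragma approach might be sketched like this (GCC 8+;
              | the unroll factor is an illustrative guess, not a tuned
              | value):

```c
#include <stddef.h>

/* Hot loop: ask GCC to unroll it even when the file is built
 * with -Os. "#pragma GCC unroll N" applies to the loop that
 * immediately follows. */
long sum_buf(const int *buf, size_t n)
{
    long total = 0;
#pragma GCC unroll 4 /* factor chosen for illustration only */
    for (size_t i = 0; i < n; ++i)
        total += buf[i];
    return total;
}
```

              | Built with -Os, only the annotated loop is unrolled; the
              | rest of the translation unit stays compact.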
        
               | gnufx wrote:
                | Regarding unrolling, -O3 enables -floop-unroll-and-jam
                | but not -funroll-loops. You may want one or the other,
                | maybe
               | both, depending on circumstances. I don't see much
               | benefit from the available pragmas on HPC-type code
               | unless for OpenMP, and "omp simd" isn't necessary to get
               | vectorization in the places I've seen people say it is.
               | Mileage always varies somewhat, of course. (Before
               | second-guessing anything, use -fopt-info.)
        
               | jleahy wrote:
                | It's a shame that -Os can sometimes produce truly awful
                | code. There are a few optimisations in there that trade
                | a byte for a massive slowdown.
        
               | userbinator wrote:
               | You asked for minimum size, and that's what you got. I'd
               | say that's working as it should.
               | 
                | More granular control over optimisation would be good,
                | however.
        
               | jhgb wrote:
               | Surely a profile-guided build should be able to only
               | apply -Os to those functions where it doesn't cause a lot
               | of problems.
        
         | nly wrote:
         | What about when you factor in PGO? (-fprofile-generate &
         | -fprofile-use)
        
           | mhh__ wrote:
            | This is the rub: the compiler is flying completely blind
           | without PGO.
           | 
            | I turned on PGO for the D compiler recently and it compiled
            | its tests 15% faster. Boom.
        
           | codr7 wrote:
           | Haven't really been down that rabbit hole yet, thanks for
           | reminding me.
        
       | mgraczyk wrote:
       | To me the more interesting meta-point here is that some
        | algorithms are less microarchitecture-friendly than others.
       | Algorithms with larger distances between data dependencies are
       | less likely to stall.
       | 
        | For example I compared an insertion sort and see the same
        | performance between O2 and O3:
        | 
        |     void insertionsort(int *buf)
        |     {
        |       for (int i = 1; i < n; ++i) {
        |         const int x = buf[i];
        |         int j = i - 1;
        |         for (; j >= 0 && buf[j] > x; --j) buf[j + 1] = buf[j];
        |         buf[j + 1] = x;
        |       }
        |     }
       | 
       | bubble O2: 1.208s
       | 
       | bubble O3: 1.469s
       | 
       | insertion O2: 0.126s
       | 
       | insertion O3: 0.126s
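        | 
        | For contrast, a minimal bubble sort (a sketch; the question's
        | exact code may differ) makes the memory-carried dependency
        | visible:

```c
#include <stddef.h>

/* Each swap stores a value that the very next comparison
 * reloads, so successive iterations depend on each other
 * through memory -- the store-forwarding pattern the accepted
 * answer blames for the -O3 slowdown. */
void bubblesort(int *buf, size_t n)
{
    for (size_t i = 0; i + 1 < n; ++i)
        for (size_t j = 0; j + 1 < n - i; ++j)
            if (buf[j] > buf[j + 1]) {
                int tmp = buf[j];
                buf[j] = buf[j + 1];
                buf[j + 1] = tmp;
            }
}
```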
        
       | gnufx wrote:
       | I see something rather different with that example, using GCC 10
       | on SKX with that code and program argument, bound to a core, +/-
       | ~2%:
       | 
       | -O2 -g: 1.07s
       | 
       | -O3 -g: 1.17s
       | 
       | -O3 -g -fprofile-generate/use: 0.84s
       | 
       | (maqao cqa from https://maqao.org would give comprehensive info
       | on the generated code.)
       | 
        | --
        | 
        |     /* Tom, don't let me ever catch you using
        |        bubblesort again! -- Don */
       | 
       | -- Knuth in dvips/afm2tfm.c
        
       | jeffbee wrote:
       | Lots of things are slower at higher -O levels, many things get
       | slower if you compile them for your native instruction
       | extensions, and there are other surprises lurking all over the
       | compiler. The only way to be sure is to sweep out the parameter
       | space of your compiler, using benchmark results as your objective
       | function.
        
       | glandium wrote:
       | Relevant talk by Emery Berger: Performance matters
       | https://www.youtube.com/watch?v=r-TLSBdHe1A I'd recommend
       | everyone interested in the subject to watch it.
        
       | rkagerer wrote:
        | _this is pretty clearly an anti-optimization you should report
        | on https://gcc.gnu.org/bugzilla/_
       | 
       | Was it ever reported?
       | 
       | I did a quick search on https://gcc.gnu.org/bugzilla for any
       | issues, of any status, tagged with _missed-optimization_ and
       | containing the term  "store-forwarding" or "bubble sort" and
       | didn't see it (maybe I'm not searching right?)
        
       | mhh__ wrote:
       | How to keep the O3 goodness without pessimizations: Use
       | likely/unlikely liberally, unreachable(), and PGO.
       | 
       | If the code slows down, it's usually because the compiler has
       | generated a bunch of code because it doesn't know what hot path
       | to schedule for.
       | 
        | Compilers will generate hundreds to thousands of instructions
        | for factorial if you let them, because they turn the recursion
        | into a loop and then vectorize the loop.
       | 
       | Undefined Behaviour pub quiz question in there ^^
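        | 
        | A minimal sketch of those hints with the GCC/Clang builtins
        | (parse_digit is a made-up example, not from the thread):

```c
/* likely/unlikely are local convenience macros over
 * __builtin_expect; they are not standard C. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int parse_digit(int c)
{
    if (likely(c >= '0' && c <= '9'))
        return c - '0';           /* hot path, laid out first */
    if (unlikely(c < 0))
        __builtin_unreachable();  /* promise: never happens */
    return -1;                    /* cold error path */
}
```

        | If the "impossible" condition ever does occur,
        | __builtin_unreachable() is itself undefined behaviour, so the
        | promise has to be genuine.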
        
         | unwind wrote:
         | Just to clarify, in this context PGO = Profile-Guided
         | Optimization [1].
         | 
          | [1]: https://en.m.wikipedia.org/wiki/Profile-guided_optimization
        
           | jhgb wrote:
           | ...as opposed to? Off the top of my head I can't seem to
           | remember a single acronym I could possibly confuse it with,
           | and now you got me wondering which common acronym I
           | completely failed to learn.
        
             | toxik wrote:
             | Pose graph optimization
        
               | mhh__ wrote:
               | That seems like a bit of a reach given that the thread is
               | obviously about GCC.
        
         | MauranKilom wrote:
         | Not sure if that would have an impact here. GCC is just unaware
         | of the latency implications of store forwarding. I mean, it's
         | definitely worth a shot, but you'd just be more or less hoping
         | that your mentioned techniques disable the right optimization
         | pass.
        
         | gnufx wrote:
         | Yes to -fprofile-use. That brought gfortran to essentially the
         | same geometric mean as ifort with reasonable "good" flags for
         | me, running a benchmark set which I think is supposed to show
         | off proprietary compilers. ifort got little benefit from its
         | equivalent.
        
         | [deleted]
        
       | userbinator wrote:
       | My experience with MSVC and ICC is similar - O3 is often worse
       | than O2, and a lot of the time Os is actually the fastest (and
       | smallest).
        
         | StephanTLavavej wrote:
         | MSVC has /O1 and /O2 (and /Os), but not /O3. See
         | https://docs.microsoft.com/en-us/cpp/build/reference/compile...
         | . (I work on MSVC's STL.)
        
       | flatiron wrote:
       | Dude spent more time on the explanation than I spend on my "good"
       | projects at work!
        
         | mcraiha wrote:
         | Mandatory xkcd https://xkcd.com/664/
        
       | MauranKilom wrote:
       | If you've spent any amount of time on StackOverflow around
       | C/C++/assembly, you could probably tell within the first three
       | paragraphs of the answer who authored it. :)
        
       | digitallyfree wrote:
        | From my experience, -O2 includes optimizations that improve
        | performance in the majority of cases, whereas -O3 contains
        | additional optimizations that may improve performance but whose
        | results are less clear-cut - hence why they're included in -O3
        | as opposed to -O2. I've seen cases where -O3 is faster than
        | -O2, and vice versa.
        
         | amelius wrote:
         | Can you turn on optimizations on a per-function basis?
        
           | gnufx wrote:
           | Yes, and that's sometimes a good idea. GCC has a pragma and
           | function attribute (unfortunately not for Fortran, at least).
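            | 
            | As a sketch (GCC-specific; function names are illustrative):

```c
/* Override the command-line -O level for a single function
 * with the optimize attribute... */
__attribute__((optimize("O3")))
int dot3(const int *a, const int *b, int n)
{
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* ...or for a region with the pragma pair. */
#pragma GCC push_options
#pragma GCC optimize ("Os")
int cold_helper(int x) { return x * 2 + 1; }
#pragma GCC pop_options
```

            | Clang parses the optimize attribute but ignores it, so this
            | is best treated as GCC-only.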
        
         | phkahler wrote:
          | That's why it's infuriating that vectorization was not
          | enabled at -O2. Since x86-64 supports SSE2 as a baseline,
          | every program has been underperforming at -O2 for over 15
          | years. Without vectorization at -O2, the recent
          | -march=x86-64-v2 and v3 options don't help much either; v3
          | adds AVX2, which goes largely unused. See the pathetic test
          | results here:
          | https://github.com/solvespace/solvespace/issues/972
          | 
          | Notice that v3 is no better and may even be a hair slower.
        
           | gnufx wrote:
            | See https://gcc.gnu.org/gcc-12/changes.html. It's still not
           | necessarily a guaranteed win, I think.
        
           | jleahy wrote:
           | Vectorisation is not a clear win, and so doesn't belong in
           | O2. Certainly it can deliver massive improvements in some
           | cases (as loop unrolling could in Pentium III days), yet can
           | cause significant binary size increases (think icache costs)
           | and makes loops with small iteration counts potentially much
           | slower.
           | 
           | If for some reason you want O2 but with vectorisation you can
           | enable it, and if you want O3 without inlining or loop
           | unrolling (the same thing), you're welcome to specify that.
        
             | userbinator wrote:
              | Did you mean the P4? Its extremely long pipeline meant
              | loop unrolling helped it far more than it does x86
              | processors before or after. The PIII is more similar to
              | Core/2/i in performance characteristics.
             | 
             | ...and then there's Atom, which is also an odd one - I
             | believe it's more like a classic Pentium (P54).
        
               | jleahy wrote:
               | Well I did want to say the P4, but I never actually had
               | one so didn't want to say something that turned out to be
               | wrong.
               | 
               | Certainly on earlier microarchitectures it did help a lot
               | more than it does now, and there were some nice wins on
                | the PIII for it.
        
             | phkahler wrote:
             | The fact that they are changing to include some
              | vectorization options by default for -O2 suggests
              | otherwise. In particular, SLP vectorization should have
              | been there all along. Every 2D and 3D graphics library
              | defines vector classes that will easily go faster with
              | the simplest vectorization.
             | 
             | Arguments around code size have been largely irrelevant for
             | quite some time, so long as performance increases.
             | 
             | If I put on a tinfoil hat, I'd suspect Intel was at least
             | an influence. They're the ones adding AVX, AVX2, 512 and
             | providing their own compiler to take advantage of them,
             | while crippling AMD parts. That is until they switched to
             | LLVM.
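              | 
              | A sketch of the kind of code SLP vectorization targets (a
              | hypothetical vec4, not from any particular library):

```c
/* Four independent scalar adds on adjacent fields: exactly the
 * pattern the SLP (superword-level parallelism) vectorizer can
 * fuse into a single 128-bit SIMD add when it is enabled. */
typedef struct { float x, y, z, w; } vec4;

vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;
    return r;
}
```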
        
             | kzrdude wrote:
              | Maybe it could use a different heuristic in O2? For some
              | functions, autovectorization is vital for performance.
              | Neither -O3 nor -O2 is perfect as a blanket option for
              | all the functions in a big program.
        
       | torginus wrote:
       | I think bubble sort is kind of an optimization nightmare - a loop
       | where subsequent iterations depend on each other - and the
       | dependency is through memory. I'm pretty sure the compiler tries
       | to extract some level of parallelism from the loop and fails
       | miserably.
        
       | kiryin wrote:
        | I've always been under the impression that -O3, -Ofast and
        | related "performance" tweaks are not recommended, and shouldn't
        | be used unless the software in question is specifically written
        | with them in mind.
        
       ___________________________________________________________________
       (page generated 2021-10-17 23:00 UTC)