[HN Gopher] Bubble sort slower with -O3 than -O2 with GCC
___________________________________________________________________

Bubble sort slower with -O3 than -O2 with GCC

Author : asicsp
Score  : 110 points
Date   : 2021-10-17 12:13 UTC (10 hours ago)

(HTM) web link (stackoverflow.com)
(TXT) w3m dump (stackoverflow.com)

 | codr7 wrote:
 | I've written a lot of interpreters in C/C++ lately, and spent a
 | lot of time profiling and optimizing them.
 |
 | I can't remember -O3 ever being faster than -O2; it's usually
 | slightly slower for me.
 | DannyBee wrote:
 | For interpreters, that's highly likely to be true, especially
 | if they are mainly spending time in switch tables or executing
 | basic opcodes.
 | gnufx wrote:
 | Python, specifically, is supposed to get most of its benefit
 | from -fno-semantic-interposition.
 | infberg wrote:
 | Care to explain? -O3 generates larger code than -O2?
 | pclmulqdq wrote:
 | Yes, -O3 tends to include a lot of features that increase
 | code size, like aggressive loop unrolling. If you are
 | jumping around a large amount of code, -O3 generally
 | performs more poorly than -O2, but if you are running a
 | tight loop (like HPC code), -O3 is better.
 |
 | In the past, when I worked on a very performance-sensitive
 | codebase that was also limited in scope, we compiled with
 | -Osize and did all the loop optimizations we wanted
 | manually (and with pragmas). That produced faster code
 | than -O2 or -O3.
 | gnufx wrote:
 | Regarding unrolling, -O3 contains -funroll-and-jam but
 | not -funroll-loops. You may want one or the other, maybe
 | both, depending on circumstances. I don't see much
 | benefit from the available pragmas on HPC-type code
 | except for OpenMP, and "omp simd" isn't necessary to get
 | vectorization in the places I've seen people say it is.
 | Mileage always varies somewhat, of course. (Before
 | second-guessing anything, use -fopt-info.)
 | jleahy wrote:
 | It's a shame that -Osize can sometimes produce truly awful
 | code.
 | There are a few optimisations in there that trade a byte
 | for a massive slowdown.
 | userbinator wrote:
 | You asked for minimum size, and that's what you got. I'd
 | say that's working as it should.
 |
 | More granular control over optimisation would be good,
 | however.
 | jhgb wrote:
 | Surely a profile-guided build should be able to only
 | apply -Os to those functions where it doesn't cause a lot
 | of problems.
 | nly wrote:
 | What about when you factor in PGO? (-fprofile-generate &
 | -fprofile-use)
 | mhh__ wrote:
 | This is the rub: the compiler is flying completely blind
 | without PGO.
 |
 | I turned on PGO for the D compiler recently and it compiled
 | its tests 15% faster. Boom.
 | codr7 wrote:
 | Haven't really been down that rabbit hole yet, thanks for
 | reminding me.
 | mgraczyk wrote:
 | To me the more interesting meta-point here is that some
 | algorithms are less microarchitecture-friendly than others.
 | Algorithms with larger distances between data dependencies
 | are less likely to stall.
 |
 | For example, I compared an insertion sort and see the same
 | performance between O2 and O3:
 |
 |     void insertionsort(int *buf) {
 |         for (int i = 1; i < n; ++i) {
 |             const int x = buf[i];
 |             int j = i - 1;
 |             for (; j >= 0 && buf[j] > x; --j)
 |                 buf[j + 1] = buf[j];
 |             buf[j + 1] = x;
 |         }
 |     }
 |
 | bubble O2: 1.208s
 |
 | bubble O3: 1.469s
 |
 | insertion O2: 0.126s
 |
 | insertion O3: 0.126s
 | gnufx wrote:
 | I see something rather different with that example, using GCC 10
 | on SKX with that code and program argument, bound to a core, +/-
 | ~2%:
 |
 | -O2 -g: 1.07s
 |
 | -O3 -g: 1.17s
 |
 | -O3 -g -fprofile-generate/use: 0.84s
 |
 | (maqao cqa from https://maqao.org would give comprehensive info
 | on the generated code.)
 |
 |     -- /* Tom, don't let me ever catch you using
 |        bubblesort again!
 |        -- Don */
 |
 |     -- Knuth, in dvips/afm2tfm.c
 | jeffbee wrote:
 | Lots of things are slower at higher -O levels, many things get
 | slower if you compile them for your native instruction
 | extensions, and there are other surprises lurking all over the
 | compiler. The only way to be sure is to sweep out the parameter
 | space of your compiler, using benchmark results as your
 | objective function.
 | glandium wrote:
 | Relevant talk by Emery Berger: "Performance Matters"
 | https://www.youtube.com/watch?v=r-TLSBdHe1A
 | I'd recommend that everyone interested in the subject watch it.
 | rkagerer wrote:
 | _this is pretty clearly an anti-optimization you should report
 | on https://gcc.gnu.org/bugzilla/_
 |
 | Was it ever reported?
 |
 | I did a quick search on https://gcc.gnu.org/bugzilla for any
 | issues, of any status, tagged with _missed-optimization_ and
 | containing the term "store-forwarding" or "bubble sort", and
 | didn't see it (maybe I'm not searching right?).
 | mhh__ wrote:
 | How to keep the -O3 goodness without the pessimizations: use
 | likely/unlikely liberally, unreachable(), and PGO.
 |
 | If the code slows down, it's usually because the compiler has
 | generated a bunch of code, since it doesn't know which hot path
 | to schedule for.
 |
 | Compilers will generate hundreds to thousands of instructions
 | for factorial if you let them, because they turn it into a
 | loop, then vectorize the loop.
 |
 | Undefined Behaviour pub quiz question in there ^^
 | unwind wrote:
 | Just to clarify: in this context, PGO = Profile-Guided
 | Optimization [1].
 |
 | [1]: https://en.m.wikipedia.org/wiki/Profile-guided_optimization
 | jhgb wrote:
 | ...as opposed to? Off the top of my head I can't seem to
 | remember a single acronym I could possibly confuse it with,
 | and now you've got me wondering which common acronym I
 | completely failed to learn.
 | toxik wrote:
 | Pose graph optimization
 | mhh__ wrote:
 | That seems like a bit of a reach, given that the thread is
 | obviously about GCC.
 | MauranKilom wrote:
 | Not sure if that would have an impact here. GCC is just unaware
 | of the latency implications of store forwarding. I mean, it's
 | definitely worth a shot, but you'd just be more or less hoping
 | that your mentioned techniques disable the right optimization
 | pass.
 | gnufx wrote:
 | Yes to -fprofile-use. That brought gfortran to essentially the
 | same geometric mean as ifort with reasonable "good" flags for
 | me, running a benchmark set which I think is supposed to show
 | off proprietary compilers. ifort got little benefit from its
 | equivalent.
 | [deleted]
 | userbinator wrote:
 | My experience with MSVC and ICC is similar - O3 is often worse
 | than O2, and a lot of the time Os is actually the fastest (and
 | smallest).
 | StephanTLavavej wrote:
 | MSVC has /O1 and /O2 (and /Os), but not /O3. See
 | https://docs.microsoft.com/en-us/cpp/build/reference/compile...
 | (I work on MSVC's STL.)
 | flatiron wrote:
 | Dude spent more time on the explanation than I spend on my
 | "good" projects at work!
 | mcraiha wrote:
 | Mandatory xkcd: https://xkcd.com/664/
 | MauranKilom wrote:
 | If you've spent any amount of time on StackOverflow around
 | C/C++/assembly, you could probably tell within the first three
 | paragraphs of the answer who authored it. :)
 | digitallyfree wrote:
 | In my experience, -O2 includes optimizations that improve
 | performance in the majority of cases, whereas -O3 contains
 | additional optimizations that may improve performance but
 | whose results are less clear-cut - hence why they're included
 | in -O3 as opposed to -O2. I've seen cases where -O3 is faster
 | than -O2, and vice versa.
 | amelius wrote:
 | Can you turn on optimizations on a per-function basis?
 | gnufx wrote:
 | Yes, and that's sometimes a good idea. GCC has a pragma and a
 | function attribute for it (unfortunately not for Fortran, at
 | least).
 | phkahler wrote:
 | That's why it's infuriating that vectorization was not enabled
 | at -O2.
 | Since x86-64 supports SSE2 as a baseline, every program
 | has been underperforming at -O2 for over 15 years. Without
 | knowing this, it turns out the recently introduced
 | -march=x86-64-v2 and -v3 don't benefit either when both have
 | AVX2, which goes largely unused. See the pathetic test results
 | here: https://github.com/solvespace/solvespace/issues/972
 |
 | Notice that v3 is no better and may even be a hair slower.
 | gnufx wrote:
 | See https://gcc.gnu.org/gcc-12/changes.html It's still not
 | necessarily a guaranteed win, I think.
 | jleahy wrote:
 | Vectorisation is not a clear win, and so doesn't belong in
 | -O2. Certainly it can deliver massive improvements in some
 | cases (as loop unrolling could in Pentium III days), yet it
 | can cause significant binary-size increases (think icache
 | costs) and makes loops with small iteration counts potentially
 | much slower.
 |
 | If for some reason you want -O2 but with vectorisation, you
 | can enable it, and if you want -O3 without inlining or loop
 | unrolling (the same thing), you're welcome to specify that.
 | userbinator wrote:
 | Did you mean the P4? An extremely long pipeline meant loop
 | unrolling helps it far more than it does the x86 processors
 | before or after. The PIII is more similar to Core/2/i in
 | performance characteristics.
 |
 | ...and then there's Atom, which is also an odd one - I
 | believe it's more like a classic Pentium (P54).
 | jleahy wrote:
 | Well, I did want to say the P4, but I never actually had
 | one, so I didn't want to say something that turned out to be
 | wrong.
 |
 | Certainly on earlier microarchitectures it did help a lot
 | more than it does now, and there were some nice wins on
 | the PIII for it.
 | phkahler wrote:
 | The fact that they are changing to include some
 | vectorization options by default in -O2 suggests
 | otherwise. In particular, SLP vectorization should have been
 | there all along. Every 2D and 3D graphics library defines
 | vector classes that will easily go faster with the simplest
 | vectorization.
 |
 | Arguments around code size have been largely irrelevant for
 | quite some time, so long as performance increases.
 |
 | If I put on a tinfoil hat, I'd suspect Intel was at least
 | an influence. They're the ones adding AVX, AVX2, and AVX-512
 | and providing their own compiler to take advantage of them,
 | while crippling AMD parts. That is, until they switched to
 | LLVM.
 | kzrdude wrote:
 | Maybe it could use a different heuristic in -O2? Depending on
 | the function, it's vital to have autovectorization for
 | performance. Neither -O3 nor -O2 is perfect as a blanket
 | option for all the functions in a big program.
 | torginus wrote:
 | I think bubble sort is kind of an optimization nightmare - a
 | loop where subsequent iterations depend on each other - and
 | the dependency is through memory. I'm pretty sure the compiler
 | tries to extract some level of parallelism from the loop and
 | fails miserably.
 | kiryin wrote:
 | I've always been under the impression that -O3, -Ofast, and
 | related "performance" tweaks are not recommended, and
 | shouldn't be used unless the software in question is
 | specifically written with them in mind.
___________________________________________________________________
(page generated 2021-10-17 23:00 UTC)