[HN Gopher] Exploring Clang/LLVM optimization on programming horror
       ___________________________________________________________________
        
       Exploring Clang/LLVM optimization on programming horror
        
       Author : maattdd
       Score  : 167 points
       Date   : 2021-08-17 07:37 UTC (15 hours ago)
        
 (HTM) web link (blog.matthieud.me)
 (TXT) w3m dump (blog.matthieud.me)
        
       | woodruffw wrote:
       | Great writeup! This is a nice overview of how LLVM's optimization
       | passes compose to produce nicely optimized assembly.
       | 
       | One nitpicky observation: mem2reg doesn't "move values from
       | memory [...] to registers (directly inside the CPU)." It's an SSA
       | expansion pass, taking the minimal SSA form produced by the C/C++
        | frontend and turning it into a more aggressive SSA form by
       | eliminating as many `alloca`s as possible. SSA is characterized
       | by the presence of infinitely many "fresh" _abstract_ registers,
       | none of which correspond to actual CPU registers -- LLVM is
       | responsible at a later stage for performing efficient lowering
        | and allocation of registers/stack slots for the SSA registers.
        
         | slavik81 wrote:
         | Could you give an example of an alloca mem2reg would eliminate?
        
           | dragontamer wrote:
           | I don't think that would be helpful in learning what's going
           | on.
           | 
            | Instead, I'll point you to the keywords "Dominator
            | Frontier" and "SSA Construction Algorithm".
           | 
           | https://en.wikipedia.org/wiki/Static_single_assignment_form#.
           | ..
           | 
           | You should read that only after you understand SSA and Phi
           | functions in particular. Once you know how Phi works, the
           | next question is the algorithm for how they're constructed.
           | 
           | https://en.wikipedia.org/wiki/Dominator_(graph_theory)
           | 
           | This page may also help.
        
         | dragontamer wrote:
         | https://llvm.org/docs/Passes.html#passes-mem2reg
         | 
         | > -mem2reg: Promote Memory to Register
         | 
         | > This file promotes memory references to be register
         | references. It promotes alloca instructions which only have
         | loads and stores as uses. An alloca is transformed by using
         | dominator frontiers to place phi nodes, then traversing the
         | function in depth-first order to rewrite loads and stores as
         | appropriate. This is just the standard SSA construction
         | algorithm to construct "pruned" SSA form.
         | 
         | -----------
         | 
         | https://llvm.org/docs/LangRef.html#alloca-instruction
         | 
         | > The 'alloca' instruction allocates memory on the stack frame
         | of the currently executing function, to be automatically
         | released when this function returns to its caller. If the
         | address space is not explicitly specified, the object is
         | allocated in the alloca address space from the datalayout
         | string.
         | 
         | --------
         | 
         | Based on my reading and understanding of alloca and mem2reg
         | above, I have to disagree with your assessment. It seems like
         | alloca roughly corresponds to a "push" in your typical x86
          | assembly language, or maybe a "sub esp, #ofBytes".
         | 
         | By removing alloca instructions, the mem2reg step is turning
         | stack-memory into registers.
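          | 
          | As for an example of an alloca it can eliminate (my own toy
          | snippet, not from the article): at -O0 a local like this gets an
          | alloca plus loads and stores, and mem2reg promotes it so the
          | stack slot disappears entirely.
          | 
          |     // `x` is only ever loaded and stored, so its alloca is
          |     // eligible for promotion by mem2reg.
          |     int square_plus_one(int n) {
          |         int x = n * n;   // -O0: alloca + store/load
          |         return x + 1;    // after mem2reg: plain SSA values
          |     }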
        
           | woodruffw wrote:
           | > By removing alloca instructions, the mem2reg step is
           | turning stack-memory into registers.
           | 
           | This is true, but it's also misleading: the transformation of
           | stack slots into machine registers is a beneficial _side
           | effect_ of mem2reg. Reducing stack use is an important
           | optimization, but producing an optimal SSA form is even
           | _more_ important, since key passes like folding, DCE and SRoA
           | rely _heavily_ on the SSA form. The  "reg" in "mem2reg"
           | refers explicitly to the latter (SSA registers), as the text
           | you've excerpted directly says.
           | 
           | You can also prove this to yourself by contriving an IR
           | function that contains more SSA registers than machine
           | registers: you'll see that, in addition to any ABI
           | constraints, LLVM will be forced to spill any excess SSA
           | registers back onto the stack.
           | 
           | Edit: but also yes, to confirm: `alloca` corresponds more or
           | less directly to `add ESP, <adjust>` within the x86 stack
           | model. But it doesn't have to! LLVM's semantics are target
           | independent even when the IR isn't.
        
             | dragontamer wrote:
             | Gotcha. I think I see what you're talking about now.
             | 
             | I have crude familiarity with SSA (no personal experience
             | but I've read a book once) but not with LLVM's terminology.
             | I'm looking at the example in the blogpost, and I can see
              | that it's clearly adding in the important Phi functions
              | needed.
             | 
              | Strange that LLVM names their aggressive SSA step in this
             | manner. But yes, adding the phi functions in this manner is
             | a very important step before we can think about
             | optimizations.
             | 
             | The "inductive variable" is far easier to detect once its
             | in a "Phi-form" so to speak.
        
               | woodruffw wrote:
               | It's an extremely confusing naming choice! I don't know
               | why they did it, but to take a guess: a lot of compiler
               | literature refers to SSA values as "registers," since the
               | SSA model of a computer program directly corresponds to
               | the concept of an (infinite) register machine in abstract
               | computation.
               | 
               | The end result being that we have the same word
               | "register" referring to two _opposed_ resources (one
               | infinite, one extremely finite) :-).
        
               | gsnedders wrote:
               | At least historically LLVM has itself been inconsistent
               | in the naming of SSA "values", and mem2reg is arguably
               | just the most prominent example of the "register" name
               | being used.
        
       | eatonphil wrote:
       | The core simplification is due to LLVM's induction-detection
       | step. Even after doing induction-based proofs in school, until
       | reading this I had the sense that induction is kinda magical and
       | kinda not real.
       | 
       | I still cannot fathom how you can encode induction detection into
        | an algorithm; any pointers welcome (keeping it simple, please,
       | and ideally with working code).
       | 
       | The only case that makes sense to me is if you did numeric
       | computation against some fixed number of cases and if that worked
       | out then you assume it's right.
       | 
       | I guess this is what proof assistants do (among other things).
       | Maybe I should look into how to write a basic proof assistant.
        
         | jcranmer wrote:
         | Induction detection is relatively simple in LLVM IR. The first
         | thing to remember is that the compiler doesn't work on the same
         | source code that the human does; it works on a different
         | representation. The canonical loop structure would look
          | something like this:
          | 
          |   loop.preheader:
          |     ; This is the block that executes before the loop does.
          |     br label %loop.body
          | 
          |   loop.body:
          |     %i = phi i32 [0, %loop.preheader], [%i.next, %loop.body]
          |     %val = phi i1 [0, %loop.preheader], [%val.next, %loop.body]
          |     %val.next = xor i1 %val, -1
          |     %i.next = add i32 %i, 1
          |     %cmp = icmp slt i32 %i, %N
          |     br i1 %cmp, %loop.exit, %loop.body
          | 
          |   loop.exit:
          |     ; After the loop is done
         | 
         | Because of how phis work, in constructing this form, we
         | immediately realize how to describe a value either in terms of
         | its value in the first iteration (this is the value it takes
         | coming from the preheader), or in terms _solely_ of its value
         | in the previous iteration. In other words, %val.next = f(%val)
         | for some function f, which we can easily identify in the source
         | code.
         | 
         | Now, if we can compute the repeated composition of f easily, we
         | can replace the recurrence in the loop with repeated
         | composition. Repeated addition of the same value is
         | multiplication, and repeated multiplication of the same value
         | is exponentiation--that's two very easy examples to do so. We
          | recognize that there is no use of the value except for its final
         | occurrence, and so instead of doing the loop, we can generate
         | the composition itself. Thus, after N iterations of the loop,
         | we can conclude that %val.next = f^N(0), and if we can identify
         | the loop iteration count (that's %N in this case), we can
         | simply write out f^N instead of having to compute every
         | individual value.
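          | 
          | For the xor recurrence above, the repeated composition has a
          | simple closed form: applying "not" N times to 0 just gives the
          | parity of N. A toy C++ sketch of the rewrite (my own
          | illustration, not LLVM's actual output):
          | 
          |     // Loop form: val.next = f(val) with f = logical not.
          |     bool f_loop(int N) {
          |         bool val = false;
          |         for (int i = 0; i < N; i++)
          |             val = !val;      // %val.next = xor i1 %val, -1
          |         return val;
          |     }
          | 
          |     // Closed form f^N(0), valid for N >= 0: the parity of N.
          |     bool f_closed(int N) {
          |         return (N & 1) != 0;
          |     }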
        
         | vnorilo wrote:
          | Well, basically it's about converting iteratively computed
         | state to a function of iteration count.
         | 
         | Consider indvar[n] = indvar[n-1] + k, indvar[0] = a. It follows
         | that indvar[n] = a + n * k.
         | 
         | This is the type of indvar canonicalization that can zap loops
         | such as the one in the article. Suddenly the final state is
         | just a function of the loop trip count.
         | 
          | The return value is replaced with the canonicalized indvar. That
          | makes the loop itself dead (nobody observes any of its effects),
          | and it is removed by a subsequent dead code elimination pass.
         | 
         | My example transforms addition to multiplication. Another
         | common one is multiplication into power. That's close to what's
         | going on in the article example: xor is just a multiplication
         | of signed one-bit integers.
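          | 
          | As a toy sketch (mine, not from the article), that addition-to-
          | multiplication rewrite looks like:
          | 
          |     // Recurrence: indvar[n] = indvar[n-1] + k, indvar[0] = a
          |     int indvar_loop(int a, int k, int n) {
          |         int indvar = a;
          |         for (int i = 0; i < n; i++)
          |             indvar += k;
          |         return indvar;
          |     }
          | 
          |     // Closed form (for n >= 0): indvar[n] = a + n * k, so the
          |     // loop is dead and can be eliminated.
          |     int indvar_closed(int a, int k, int n) {
          |         return a + n * k;
          |     }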
        
         | benmmurphy wrote:
         | the induction step is pretty cool. it will remove the loop
         | calculating the sum of an arithmetic progression and replace it
         | with just multiplies/shifts/subtracts and adds.
        
         | contravariant wrote:
         | Well in this case it seems reasonable that they simply added a
          | rule for the specific line:
          | 
          |     even = !even
          | 
          | and the line
          | 
          |     numberCompare++
          | 
          | resulting in
          | 
          |     even = (indvar & 1)
          |     numberCompare = indvar
         | 
         | You can then solve from the exit condition that number ==
         | numberCompare, which gives you the indvar which can then be
         | substituted into 'even'.
         | 
         | I'm not saying it isn't magical, and certainly requires some
         | symbolic solving, but it's doable.
         | 
          | Of course the real goal is to eventually get it to optimize
          | 
          |     while (number != 1) {
          |         if (number & 1)
          |             number = number * 3 + 1;
          |         number >>= 1;
          |     }
        
           | [deleted]
        
           | onedognight wrote:
           | Maybe LLVM should add a Collatz[0] optimization pass that
            | would replace your loop with
            | 
            |     number = 1
            | 
            | for at least up to 64-bit types?
           | 
           | [0] https://en.m.wikipedia.org/wiki/Collatz_conjecture
        
           | missblit wrote:
           | > Of course the real goal is to eventually get it to optimize
           | `while (number != 1) { ... }`
           | 
           | Valid C++ programs are guaranteed to be able to make forward
           | progress [1].
           | 
           | So if `number` is a normal stack allocated integer (and not
           | volatile, etc), then the infinite looping case is undefined
           | behavior here.
           | 
           | So it would be a valid optimization to transform this to
           | `number = 1;`.
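            | 
            | Sketching the shape of it (my paraphrase; the godbolt link
            | below shows the real output):
            | 
            |     // Because the loop has no observable side effects, the
            |     // compiler may assume it makes forward progress, and the
            |     // only way out is number == 1 -- so with optimizations on
            |     // the whole body can collapse to "return 1".
            |     int collatz(int number) {
            |         while (number != 1) {
            |             if (number & 1)
            |                 number = number * 3 + 1;
            |             number >>= 1;
            |         }
            |         return number;
            |     }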
           | 
           | (And indeed: https://godbolt.org/z/eodhfWe6h )
           | 
           | [1] https://en.cppreference.com/w/cpp/language/memory_model
        
             | secondcoming wrote:
             | Amazing, but crazy.
        
             | [deleted]
        
         | geofft wrote:
         | This appears to be the source code:
         | https://github.com/llvm/llvm-project/blob/main/llvm/lib/Tran...
         | It's long, but it looks well-commented! Some Googling also
         | finds
         | https://llvm.org/devmtg/2009-10/ScalarEvolutionAndLoopOptimi...
         | which looks relevant.
         | 
         | I think "induction" here isn't exactly the sense of proofs by
         | mathematical induction as you learn in school (as in proving
         | something for all positive integers given a base case and a
         | recursive case) - I think all they mean by it is "a loop, over
         | a single integer." Quoting
         | https://en.wikipedia.org/wiki/Induction_variable :
         | 
         | > _In computer science, an induction variable is a variable
         | that gets increased or decreased by a fixed amount on every
         | iteration of a loop or is a linear function of another
         | induction variable._
         | 
         | The description at
         | https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/ive/
         | makes it clearer:
         | 
         | > _An induction variable is any variable whose value can be
         | represented as a function of: loop invariants; the number of
         | loop iterations that have executed; and other induction
         | variables._
         | 
         | That is to say, the link to proofs-by-induction is that if you
         | have a loop like "for (i = 0; i < 10; i++)", you can prove the
         | statement "after N loops, i = N". Why? Because after 0 loops, i
         | = 0 (the first part), and on each loop, i is one greater (the
         | i++ part), and nothing else that happens in the loop changes i
         | (which is what that post means by "loop invariants").
         | 
         | In other words, it's not that the compiler is _detecting_ that
         | the programmer is doing induction (the programmer is not, in
         | fact the programmer is simply writing a for loop and not
          | thinking about it), it's that the compiler is _performing_
         | induction on the loop.
         | 
          | So a simple case is if you have a loop like
          | 
          |     int i;
          |     for (i = 0; i < 10; i++)
          |         printf("Hi!\n");
          |     printf("i = %d\n", i);
         | 
         | You can prove that at the end of the loop, i = 10, and you
          | don't need to actually increment i during each iteration to
         | get there. And if you didn't have that printf (i.e., if the
         | loop body was empty), you could optimize the loop out entirely
         | and just print out 10.
         | 
          | So it's not terribly magical that LLVM is able to turn
          | 
          |     while (number != numberCompare)
          |     {
          |         even = !even;
          |         numberCompare++;
          |     }
          | 
          | into, in pseudocode,
          | 
          |     numberCompare = number;
          |     even = !!!...!!!even (numberCompare times);
         | 
         | i.e., that it's able to get numberCompare out of the loop, to
         | conclude that even in turn is only a function of numberCompare,
         | and then that there's no loop left.
         | 
         | What's more magical to me, I think, is that it's able to make
         | sense of "run the ! operation numberCompare times" and turn
         | that into "if (numberCompare % 2 == 1) even = !even". I don't
         | think that's too surprising either - a compiler should know
         | that, for booleans, !!foo can be simplified to foo. But
         | repeatedly applying that rule across an _abstract_ number of
         | calls to !!, plus possibly one more, seems pretty clever.
         | 
         | If I'm skimming the code right, that part of the simplification
         | is a different optimization pass entirely. I don't know off
         | hand where that's implemented. I'd be curious if someone can
         | find it.
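          | 
          | Putting it together, the end-to-end effect is roughly this (my
          | own sketch of equivalent C++ source, not the actual emitted
          | code):
          | 
          |     // The original "programming horror" shape...
          |     bool isEven_loop(int number) {
          |         int numberCompare = 0;
          |         bool even = true;
          |         while (number != numberCompare) {
          |             even = !even;
          |             numberCompare++;
          |         }
          |         return even;
          |     }
          | 
          |     // ...behaves, for number >= 0, like the closed form the
          |     // optimizer arrives at:
          |     bool isEven_closed(int number) {
          |         return (number & 1) == 0;
          |     }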
        
       | CalChris wrote:
        | These optimizations go from LLVM IR to LLVM IR. So basically
       | every backend benefits. I don't think most backend engineers
       | would even understand them. I don't.
        
         | dragontamer wrote:
         | None of these steps are too hard to think about in SSA form (a
         | specific transformation of code that LLVM does near the
         | beginning of its analysis).
         | 
         | So the key is to first understand SSA form. After that, a lot
         | of optimization passes become obvious to think about.
         | 
         | ---------
         | 
         | SSA has two bits to understand:
         | 
          | 1. Static single assignment -- variables can be assigned, but
         | once assigned they can never change. (This part is easy).
         | 
          | 2. Graph-based, and therefore requires understanding the 'Phi'
          | function -- a variable may take on different values depending on
          | which block control flow arrived from. EX: "r0" may come from
          | Node#1 or from Node#2, and it may have been renamed %1 in #1 and
          | %2 in #2. "Phi" is a "fake function" that resolves the ambiguity
          | and kinda makes SSA code "real" again.
         | 
         | Hmm, maybe that's a bad explanation. But whatever, my point is
         | that understanding "Phi" is the concept you want to focus on in
         | SSA. Phi is the harder part, but its still not that hard to
         | understand.
         | 
         | Once you get it, the whole form "clicks" in your brain and you
         | suddenly understand how optimizations can become possible.
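          | 
          | A tiny example (mine, with made-up SSA names) of where a Phi
          | shows up:
          | 
          |     // In SSA form each assignment gets a fresh name, and the
          |     // merge point needs a Phi to choose between them.
          |     int pick(int c) {
          |         int x;
          |         if (c)
          |             x = 1;    // SSA: %x.1 = 1  (then-block)
          |         else
          |             x = 2;    // SSA: %x.2 = 2  (else-block)
          |         // SSA: %x.3 = phi [%x.1, then], [%x.2, else]
          |         return x;     // returns %x.3
          |     }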
        
           | jcranmer wrote:
           | Another explanation of SSA:
           | 
           | SSA is static single assignment. Every variable has a single
            | point of definition. If you want to change the value of a
           | variable (say, i = i + 1), you instead have to create a new
           | variable. The advantage of this form is that tracking the
           | definition of a variable is trivial; in LLVM IR, to use a
           | variable, you literally pass in a pointer to the instruction
           | that defined it. The inverse, finding all uses of a value, is
           | done by building a use-list: whenever you create an
           | instruction, it adds itself to the use-list of all the values
           | it's using, which allows easy iteration of the users of a
           | value.
           | 
            | In short, SSA gives you a trivially correct-by-construction way
            | to understand the dataflow of a program (at least, that data
            | which lives in registers; memory is more complicated).
            | There's no separate analysis that has to be maintained
            | throughout optimizations, and hence no extra source of bugs
            | (as there is for failure to maintain the dominator tree through
            | CFG transformations, a historically bug-prone step in LLVM).
           | 
           | Now, as you may notice, there are cases where it would seem
           | hard to pin down a _single_ definition of a variable:
           | conditional assignment. SSA solves this by creating Phi
           | values. Phi values can be better thought of as basic block
           | arguments: when you enter a basic block, you must provide a
           | value for each of the phi values in that basic block. The
           | actual value of the phi is dependent on which edge you used
           | to enter a basic block, or in other words, the control
           | dependencies of the code.
           | 
           | In short, phis are the intersection of control flow in the
           | code with the dataflow--and they are the only things in that
           | intersection.
        
           | mhh__ wrote:
            | LLVM's IR has linear basic blocks inside a function graph;
            | there are IRs which are graphs all the way down as well.
        
         | woodruffw wrote:
         | > I don't think most backend engineers would even understand
         | them. I don't.
         | 
         | Most engineers are intimidated by compilers, but they really
         | shouldn't be! LLVM's source code is _very_ approachable and
         | well-isolated by concern, and most of the fundamental concepts
          | in an optimizing compiler are understandable with a beginner's
         | understanding of data structures, assembly, and resource
         | allocation.
         | 
         | One slightly funny thing, though: LLVM's optimizations are on
         | IR, so every frontend can _theoretically_ benefit from them.
          | However, many assume patterns originating from the C/C++
         | frontend or heavy canonicalization, so you need to be a
         | _little_ careful as a language frontend developer to take full
         | advantage of all optimizations!
        
           | jagger27 wrote:
           | > so you need to be a little careful as a language frontend
           | developer to take full advantage of all optimizations!
           | 
           | So does this mean that it's possible to express exotic
           | patterns in LLVM-IR that aren't typically produced when
           | compiling C or C++?
           | 
           | Are Rust's optimizer passes significantly different than
           | clang's? I would guess borrow checking has a special pass
           | that wouldn't be applicable to C.
           | 
           | This topic is really fascinating to me!
        
           | CalChris wrote:
           | I'm actually working on an LLVM backend. I'd read the Dragon
           | Book before, long before, but I think my real education was
           | the LLVM Developer Meeting tutorials.
           | 
           | LLVM is pass oriented but broadly there are front ends
           | (Clang, Rust, Fortran), a middle end (LLVM IR SSA
           | optimizations) and multiple in-tree and out-of-tree backends.
           | I was a little resentful of having to write all of 5 lines or
           | so inside of Clang. Otherwise, I happily live in the backend.
           | Doing a merge with a new release is an hour or so of conflict
           | resolution.
           | 
           | The pass structure importantly means that you can test your
           | backend code, even at the GlobalISel subpass level, without
           | needing to understand Clang, LLVM, .... Pretty sweet if you
           | ask me.
        
           | mhh__ wrote:
           | LLVM is approachable but some things are no better than GCC,
           | I find. In particular I have noticed there seems to be very
           | little high level documentation for the instruction
           | scheduler.
           | 
           | I find GCC machine def files actually slightly easier to read
           | than LLVM's
        
         | srcreigh wrote:
         | Every computer science student in Waterloo learns about this in
         | 2nd year (i.e. age ~19, in some cases ~1 year after learning to
         | code). The course is called CS241. You built a compiler from
         | scratch including tokenizer, assembler, parsing, type checking,
         | and optimizations. The class ends with an optional code size
         | optimization contest.
         | 
          | Here are some publicly available notes; search for "Optimization".
         | The passes discussed in the article are register allocation and
         | constant propagation.
         | 
         | http://anthony-zhang.me/University-Notes/CS241/CS241.html
        
           | CalChris wrote:
            | Your notes, while honestly very good, don't cover data flow
            | analysis or _Static Single Assignment_ (SSA), which is what
            | LLVM IR [1] uses and which are the tools that the article's
            | optimizations are based on. Hell, LLVM backend engineers
           | rarely even look at IR. We mostly look at Generic Machine IR
           | [2].
           | 
           | [1] https://llvm.org/docs/LangRef.html
           | 
           | [2] https://llvm.org/docs/GlobalISel/GMIR.html
        
           | dleslie wrote:
           | Interesting, in my day SFU kept this til 3rd year, as CMPT379
           | IIRC, and only as an option for Theoretical or Systems paths.
           | 
           | Looks like it's still 379; not sure about the optional
           | component.
           | 
           | http://anoopsarkar.github.io/compilers-class/
        
           | macintux wrote:
           | Decades ago I participated in some group programming contest
           | for universities (in person, not online). We did terribly,
           | but mainly I remember being in awe of how quickly Waterloo
           | was solving the challenges.
        
       | martincmartin wrote:
       | _you get a linear time O(n) isEven function_
       | 
        | In complexity theory, the size of a problem is the number of bits
        | to represent the input, so if the input integer is _i_, then the
        | size is _n = log_2(i)_, and the algorithm is actually exponential
        | in the number of bits it takes to represent the input.
        
         | trias wrote:
          | Is complexity theory constrained to binary representations? Why
          | should it be?
        
         | dragontamer wrote:
         | In a strict sense, you're right.
         | 
          | But in practice, you're not. Case in point: matrix
          | multiplication is commonly quoted as O(n^3) (naive), when in
          | fact, the amount of data used is O(n^2), and therefore it should
          | be quoted as O(size^1.5) with respect to data-size.
          | (EDIT: was bad at math for a second).
         | 
         | But no, people mean "n-by-n matrix" has "O(n^3) naive"
         | implementation. In practice, the "n" is just whatever is most
         | convenient for a problem.
         | 
         | In many cases, n is proportional to the input size. But in
         | other cases (such as the famous matrix multiplication
         | examples), n is the sqrt(input-size).
        
       | thrasumachos wrote:
        | Nice that it even provides the right answer for negative numbers
        | (which are undefined behavior), but only when optimizations are
        | enabled!
        
       ___________________________________________________________________
       (page generated 2021-08-17 23:00 UTC)