[HN Gopher] Exploring Clang/LLVM optimization on programming horror
___________________________________________________________________

Exploring Clang/LLVM optimization on programming horror

Author : maattdd
Score  : 167 points
Date   : 2021-08-17 07:37 UTC (15 hours ago)

(HTM) web link (blog.matthieud.me)
(TXT) w3m dump (blog.matthieud.me)

| woodruffw wrote:
| Great writeup! This is a nice overview of how LLVM's
| optimization passes compose to produce nicely optimized
| assembly.
|
| One nitpicky observation: mem2reg doesn't "move values from
| memory [...] to registers (directly inside the CPU)." It's an
| SSA expansion pass, taking the minimal SSA form produced by the
| C/C++ frontend and turning it into a more aggressive SSA form
| by eliminating as many `alloca`s as possible. SSA is
| characterized by the presence of infinitely many "fresh"
| _abstract_ registers, none of which correspond to actual CPU
| registers -- LLVM is responsible at a later stage for
| performing efficient lowering and allocation of registers/stack
| slots for the SSA registers.
| slavik81 wrote:
| Could you give an example of an alloca mem2reg would
| eliminate?
| dragontamer wrote:
| I don't think that would be helpful in learning what's going
| on.
|
| Instead, I'll point you to the keywords "Dominator Frontier"
| and "SSA Construction Algorithm".
|
| https://en.wikipedia.org/wiki/Static_single_assignment_form#...
|
| You should read that only after you understand SSA, and Phi
| functions in particular. Once you know how Phi works, the next
| question is the algorithm for how they're constructed.
|
| https://en.wikipedia.org/wiki/Dominator_(graph_theory)
|
| This page may also help.
| dragontamer wrote:
| https://llvm.org/docs/Passes.html#passes-mem2reg
|
| > -mem2reg: Promote Memory to Register
|
| > This file promotes memory references to be register
| references. It promotes alloca instructions which only have
| loads and stores as uses. An alloca is transformed by using
| dominator frontiers to place phi nodes, then traversing the
| function in depth-first order to rewrite loads and stores as
| appropriate. This is just the standard SSA construction
| algorithm to construct "pruned" SSA form.
|
| -----------
|
| https://llvm.org/docs/LangRef.html#alloca-instruction
|
| > The 'alloca' instruction allocates memory on the stack frame
| of the currently executing function, to be automatically
| released when this function returns to its caller. If the
| address space is not explicitly specified, the object is
| allocated in the alloca address space from the datalayout
| string.
|
| --------
|
| Based on my reading and understanding of alloca and mem2reg
| above, I have to disagree with your assessment. It seems like
| alloca roughly corresponds to a "push" in your typical x86
| assembly language, or maybe a "sub esp, #ofBytes".
|
| By removing alloca instructions, the mem2reg step is turning
| stack memory into registers.
| woodruffw wrote:
| > By removing alloca instructions, the mem2reg step is
| turning stack memory into registers.
|
| This is true, but it's also misleading: the transformation of
| stack slots into machine registers is a beneficial _side
| effect_ of mem2reg. Reducing stack use is an important
| optimization, but producing an optimal SSA form is even
| _more_ important, since key passes like folding, DCE and SRoA
| rely _heavily_ on the SSA form. The "reg" in "mem2reg" refers
| explicitly to the latter (SSA registers), as the text you've
| excerpted directly says.
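| To make that concrete (and to answer the example request
| upthread), here is a minimal sketch, assuming a toy function
| named @square -- real frontend output carries more annotations
| than this:
|
|     ; before mem2reg: the frontend's minimal form, where the
|     ; parameter is spilled to a stack slot and every use goes
|     ; through memory
|     define i32 @square(i32 %x) {
|       %x.addr = alloca i32
|       store i32 %x, i32* %x.addr
|       %a = load i32, i32* %x.addr
|       %b = load i32, i32* %x.addr
|       %mul = mul i32 %a, %b
|       ret i32 %mul
|     }
|
|     ; after mem2reg: the alloca and all of its loads and stores
|     ; are gone, rewritten into direct uses of the SSA register %x
|     define i32 @square(i32 %x) {
|       %mul = mul i32 %x, %x
|       ret i32 %mul
|     }
|
| When a local is assigned on more than one control-flow path,
| this is also where the dominance-frontier phi placement from
| the docs quoted above comes into play.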
| You can also prove this to yourself by contriving an IR
| function that contains more SSA registers than machine
| registers: you'll see that, in addition to any ABI constraints,
| LLVM will be forced to spill any excess SSA registers back onto
| the stack.
|
| Edit: but also yes, to confirm: `alloca` corresponds more or
| less directly to `sub ESP, <adjust>` within the x86 stack
| model. But it doesn't have to! LLVM's semantics are target
| independent even when the IR isn't.
| dragontamer wrote:
| Gotcha. I think I see what you're talking about now.
|
| I have crude familiarity with SSA (no personal experience, but
| I've read a book once) but not with LLVM's terminology. I'm
| looking at the example in the blogpost, and I can see that it's
| clearly adding in the important Phi functions needed.
|
| Strange that LLVM names its aggressive SSA step in this manner.
| But yes, adding the phi functions in this manner is a very
| important step before we can think about optimizations.
|
| The "inductive variable" is far easier to detect once it's in a
| "Phi form", so to speak.
| woodruffw wrote:
| It's an extremely confusing naming choice! I don't know why
| they did it, but to take a guess: a lot of compiler literature
| refers to SSA values as "registers," since the SSA model of a
| computer program directly corresponds to the concept of an
| (infinite) register machine in abstract computation.
|
| The end result being that we have the same word "register"
| referring to two _opposed_ resources (one infinite, one
| extremely finite) :-).
| gsnedders wrote:
| At least historically LLVM has itself been inconsistent in
| the naming of SSA "values", and mem2reg is arguably just the
| most prominent example of the "register" name being used.
| eatonphil wrote:
| The core simplification is due to LLVM's induction-detection
| step. Even after doing induction-based proofs in school, until
| reading this I had the sense that induction is kinda magical
| and kinda not real.
|
| I still cannot fathom how you can encode induction detection
| into an algorithm; any pointers welcome (keeping it simple,
| please, and ideally with working code).
|
| The only case that makes sense to me is if you did numeric
| computation against some fixed number of cases, and if that
| worked out then you assume it's right.
|
| I guess this is what proof assistants do (among other things).
| Maybe I should look into how to write a basic proof assistant.
| jcranmer wrote:
| Induction detection is relatively simple in LLVM IR. The first
| thing to remember is that the compiler doesn't work on the same
| source code that the human does; it works on a different
| representation. The canonical loop structure would look
| something like this:
|
|     loop.preheader:
|       ; This is the block that executes before the loop does.
|       br label %loop.body
|
|     loop.body:
|       %i = phi i32 [ 0, %loop.preheader ], [ %i.next, %loop.body ]
|       %val = phi i1 [ false, %loop.preheader ], [ %val.next, %loop.body ]
|       %val.next = xor i1 %val, true
|       %i.next = add i32 %i, 1
|       %cmp = icmp slt i32 %i.next, %N
|       br i1 %cmp, label %loop.body, label %loop.exit
|
|     loop.exit:
|       ; After the loop is done
|
| Because of how phis work, in constructing this form, we
| immediately realize how to describe a value either in terms of
| its value in the first iteration (this is the value it takes
| coming from the preheader), or in terms _solely_ of its value
| in the previous iteration. In other words, %val.next = f(%val)
| for some function f, which we can easily identify in the source
| code.
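| Concretely, for %val above, f is just logical negation (the
| xor with true). Negation composed with itself an even number of
| times is the identity, and an odd number of times is a single
| negation, so the whole recurrence reduces to the parity of the
| trip count. A sketch of that closed form, assuming the loop
| runs exactly %N times (illustrative only, not actual pass
| output):
|
|     ; the low bit of %N is exactly N mod 2, and truncating an
|     ; i32 to i1 keeps only the low bit
|     %val.final = trunc i32 %N to i1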
| Now, if we can compute the repeated composition of f easily, we
| can replace the recurrence in the loop with repeated
| composition. Repeated addition of the same value is
| multiplication, and repeated multiplication of the same value
| is exponentiation -- those are two very easy examples. We
| recognize that there is no use of the value except its final
| occurrence, and so instead of doing the loop, we can generate
| the composition itself. Thus, after N iterations of the loop,
| we can conclude that %val.next = f^N(0), and if we can identify
| the loop iteration count (that's %N in this case), we can
| simply write out f^N instead of having to compute every
| individual value.
| vnorilo wrote:
| Well, basically it's about converting iteratively computed
| state into a function of the iteration count.
|
| Consider indvar[n] = indvar[n-1] + k, indvar[0] = a. It follows
| that indvar[n] = a + n * k.
|
| This is the type of indvar canonicalization that can zap loops
| such as the one in the article. Suddenly the final state is
| just a function of the loop trip count.
|
| The return value is replaced with the canonicalized indvar.
| That makes the loop itself dead (nobody observes any of its
| effects), so it is removed by a subsequent dead code
| elimination pass.
|
| My example transforms repeated addition into multiplication.
| Another common one is repeated multiplication into a power.
| That's close to what's going on in the article's example: xor
| is just a multiplication of signed one-bit integers.
| benmmurphy wrote:
| the induction step is pretty cool. it will remove the loop
| calculating the sum of an arithmetic progression and replace it
| with just multiplies/shifts/subtracts and adds.
| contravariant wrote:
| Well in this case it seems reasonable that they simply added a
| rule for the specific line
|
|     even = !even
|
| and the line
|
|     numberCompare++
|
| resulting in
|
|     even = (indvar & 1)
|     numberCompare = indvar
|
| You can then solve from the exit condition that number ==
| numberCompare, which gives you the indvar, which can then be
| substituted into 'even'.
|
| I'm not saying it isn't magical, and it certainly requires some
| symbolic solving, but it's doable.
|
| Of course the real goal is to eventually get it to optimize
|
|     while (number != 1) {
|         if (number & 1)
|             number = number * 3 + 1;
|         number >>= 1;
|     }
| [deleted]
| onedognight wrote:
| Maybe LLVM should add a Collatz[0] optimization pass that
| would replace your loop with
|
|     number = 1;
|
| for at least up to 64-bit types?
|
| [0] https://en.m.wikipedia.org/wiki/Collatz_conjecture
| missblit wrote:
| > Of course the real goal is to eventually get it to optimize
| `while (number != 1) { ... }`
|
| Valid C++ programs are guaranteed to be able to make forward
| progress [1].
|
| So if `number` is a normal stack-allocated integer (and not
| volatile, etc.), then the infinite looping case is undefined
| behavior here.
|
| So it would be a valid optimization to transform this to
| `number = 1;`.
|
| (And indeed: https://godbolt.org/z/eodhfWe6h )
|
| [1] https://en.cppreference.com/w/cpp/language/memory_model
| secondcoming wrote:
| Amazing, but crazy.
| [deleted]
| geofft wrote:
| This appears to be the source code:
| https://github.com/llvm/llvm-project/blob/main/llvm/lib/Tran...
| It's long, but it looks well-commented! Some Googling also
| finds
| https://llvm.org/devmtg/2009-10/ScalarEvolutionAndLoopOptimi...
| which looks relevant.
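| As an illustration of the kind of rewrite scalar evolution
| supports, here is a sketch of the arithmetic-progression
| example mentioned upthread, the sum 0 + 1 + ... + (n-1). This
| is simplified: what clang actually emits also has to account
| for overflow in the intermediate product.
|
|     ; the loop accumulating the sum is replaced by the closed
|     ; form n*(n-1)/2; the product of two consecutive integers
|     ; is always even, so the shift is exact
|     %n.minus.1 = add i32 %n, -1
|     %prod = mul i32 %n, %n.minus.1
|     %sum = lshr i32 %prod, 1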
| | I think "induction" here isn't exactly the sense of proofs by | mathematical induction as you learn in school (as in proving | something for all positive integers given a base case and a | recursive case) - I think all they mean by it is "a loop, over | a single integer." Quoting | https://en.wikipedia.org/wiki/Induction_variable : | | > _In computer science, an induction variable is a variable | that gets increased or decreased by a fixed amount on every | iteration of a loop or is a linear function of another | induction variable._ | | The description at | https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/ive/ | makes it clearer: | | > _An induction variable is any variable whose value can be | represented as a function of: loop invariants; the number of | loop iterations that have executed; and other induction | variables._ | | That is to say, the link to proofs-by-induction is that if you | have a loop like "for (i = 0; i < 10; i++)", you can prove the | statement "after N loops, i = N". Why? Because after 0 loops, i | = 0 (the first part), and on each loop, i is one greater (the | i++ part), and nothing else that happens in the loop changes i | (which is what that post means by "loop invariants"). | | In other words, it's not that the compiler is _detecting_ that | the programmer is doing induction (the programmer is not, in | fact the programmer is simply writing a for loop and not | thinking about it), it 's that the compiler is _performing_ | induction on the loop. | | So a simple case is if you have a loop like | int i; for (i = 0; i < 10; i++) | printf("Hi!\n"); printf("i = %d\n", i"); | | You can prove that at the end of the loop, i = 10, and you | don't need to do actually increment i during each iteration to | get there. And if you didn't have that printf (i.e., if the | loop body was empty), you could optimize the loop out entirely | and just print out 10. | | So it's not terribly magical that LLVM is able to turn | while (number != numberCompare) { even = | !even; numberCompare++; } | | into, in pseudocode, numberCompare = number; | even = !!!...!!!even (numberCompare times); | | i.e., that it's able to get numberCompare out of the loop, to | conclude that even in turn is only a function of numberCompare, | and then that there's no loop left. | | What's more magical to me, I think, is that it's able to make | sense of "run the ! operation numberCompare times" and turn | that into "if (numberCompare % 2 == 1) even = !even". I don't | think that's too surprising either - a compiler should know | that, for booleans, !!foo can be simplified to foo. But | repeatedly applying that rule across an _abstract_ number of | calls to !!, plus possibly one more, seems pretty clever. | | If I'm skimming the code right, that part of the simplification | is a different optimization pass entirely. I don't know off | hand where that's implemented. I'd be curious if someone can | find it. | CalChris wrote: | These optimizations are on LLVM IR into LLVM IR. So basically | every backend benefits. I don't think most backend engineers | would even understand them. I don't. | dragontamer wrote: | None of these steps are too hard to think about in SSA form (a | specific transformation of code that LLVM does near the | beginning of its analysis). | | So the key is to first understand SSA form. After that, a lot | of optimization passes become obvious to think about. | | --------- | | SSA has two bits to understand: | | 1. 
| 1. Static single assignment -- variables can be assigned, but
| once assigned they can never change. (This part is easy.)
|
| 2. Graph-based, and therefore requires the 'Phi' function to
| understand -- a variable may take on two different values
| depending on which node control came from. EX: "r0" may come
| from Node#1 or from Node#2, and may have been renamed %1 in #1
| and %2 in #2. "Phi" is a "fake function" that resolves the
| ambiguity and kinda makes SSA code "real" again.
|
| Hmm, maybe that's a bad explanation. But whatever, my point is
| that "Phi" is the concept you want to focus on in SSA. Phi is
| the harder part, but it's still not that hard to understand.
|
| Once you get it, the whole form "clicks" in your brain and you
| suddenly understand how optimizations become possible.
| jcranmer wrote:
| Another explanation of SSA:
|
| SSA is static single assignment. Every variable has a single
| point of declaration. If you want to change the value of a
| variable (say, i = i + 1), you instead have to create a new
| variable. The advantage of this form is that tracking the
| definition of a variable is trivial; in LLVM IR, to use a
| variable, you literally pass in a pointer to the instruction
| that defined it. The inverse, finding all uses of a value, is
| done by building a use-list: whenever you create an
| instruction, it adds itself to the use-list of all the values
| it's using, which allows easy iteration over the users of a
| value.
|
| In short, SSA gives you a trivially correct-by-construction way
| to understand the dataflow of a program (at least, that data
| which lives in registers; memory is more complicated). There's
| no separate analysis that has to be maintained throughout
| optimizations, and hence no extra source of bugs (as there is
| for failure to maintain the dominator tree through CFG
| transformations, a historically bug-prone step in LLVM).
|
| Now, as you may notice, there are cases where it would seem
| hard to pin down a _single_ definition of a variable:
| conditional assignment. SSA solves this by creating Phi values.
| Phi values can be better thought of as basic block arguments:
| when you enter a basic block, you must provide a value for each
| of the phi values in that basic block. The actual value of the
| phi is dependent on which edge you used to enter the basic
| block, or in other words, the control dependencies of the code
| (a small IR sketch of this appears further down).
|
| In short, phis are the intersection of control flow in the code
| with the dataflow -- and they are the only things in that
| intersection.
| mhh__ wrote:
| LLVM's IR has linear basic blocks inside a function graph;
| there are IRs which are graphs all the way down as well.
| woodruffw wrote:
| > I don't think most backend engineers would even understand
| them. I don't.
|
| Most engineers are intimidated by compilers, but they really
| shouldn't be! LLVM's source code is _very_ approachable and
| well-isolated by concern, and most of the fundamental concepts
| in an optimizing compiler are understandable with a beginner's
| understanding of data structures, assembly, and resource
| allocation.
|
| One slightly funny thing, though: LLVM's optimizations are on
| IR, so every frontend can _theoretically_ benefit from them.
| However, many assume patterns originating from the C/C++
| frontend or heavy canonicalization, so you need to be a
| _little_ careful as a language frontend developer to take full
| advantage of all optimizations!
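| To make the phi-as-basic-block-argument idea from upthread
| concrete, here is a minimal sketch (hypothetical function, not
| from the post):
|
|     define i32 @max(i32 %a, i32 %b) {
|     entry:
|       %cmp = icmp sgt i32 %a, %b
|       br i1 %cmp, label %then, label %else
|     then:
|       br label %merge
|     else:
|       br label %merge
|     merge:
|       ; %max is %a if control arrived from %then, and %b if it
|       ; arrived from %else
|       %max = phi i32 [ %a, %then ], [ %b, %else ]
|       ret i32 %max
|     }
|
| The phi is the one point where a control dependency (which edge
| was taken) turns back into a data value.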
| jagger27 wrote:
| > so you need to be a little careful as a language frontend
| developer to take full advantage of all optimizations!
|
| So does this mean that it's possible to express exotic patterns
| in LLVM IR that aren't typically produced when compiling C or
| C++?
|
| Are Rust's optimizer passes significantly different from
| clang's? I would guess borrow checking has a special pass that
| wouldn't be applicable to C.
|
| This topic is really fascinating to me!
| CalChris wrote:
| I'm actually working on an LLVM backend. I'd read the Dragon
| Book before, long before, but I think my real education was
| the LLVM Developer Meeting tutorials.
|
| LLVM is pass oriented, but broadly there are front ends (Clang,
| Rust, Fortran), a middle end (LLVM IR SSA optimizations) and
| multiple in-tree and out-of-tree backends. I was a little
| resentful of having to write all of 5 lines or so inside of
| Clang. Otherwise, I happily live in the backend. Doing a merge
| with a new release is an hour or so of conflict resolution.
|
| The pass structure importantly means that you can test your
| backend code, even at the GlobalISel subpass level, without
| needing to understand Clang, LLVM, .... Pretty sweet if you ask
| me.
| mhh__ wrote:
| LLVM is approachable, but some things are no better than GCC, I
| find. In particular there seems to be very little high-level
| documentation for the instruction scheduler.
|
| I find GCC machine def files actually slightly easier to read
| than LLVM's.
| srcreigh wrote:
| Every computer science student at Waterloo learns about this in
| 2nd year (i.e. age ~19, in some cases ~1 year after learning to
| code). The course is called CS241. You build a compiler from
| scratch including tokenizer, assembler, parsing, type checking,
| and optimizations. The class ends with an optional code size
| optimization contest.
|
| Here are some publicly available notes; search "Optimization".
| The passes discussed in the article are register allocation and
| constant propagation.
|
| http://anthony-zhang.me/University-Notes/CS241/CS241.html
| CalChris wrote:
| Your notes, while honestly very good, don't cover data flow
| analysis or _Static Single Assignment_ (SSA), which is what
| LLVM IR [1] uses and which are the tools that the article's
| optimizations are based on. Hell, LLVM backend engineers
| rarely even look at IR. We mostly look at Generic Machine IR
| [2].
|
| [1] https://llvm.org/docs/LangRef.html
|
| [2] https://llvm.org/docs/GlobalISel/GMIR.html
| dleslie wrote:
| Interesting; in my day SFU kept this until 3rd year, as
| CMPT379 IIRC, and only as an option for the Theoretical or
| Systems paths.
|
| Looks like it's still 379; not sure about the optional
| component.
|
| http://anoopsarkar.github.io/compilers-class/
| macintux wrote:
| Decades ago I participated in some group programming contest
| for universities (in person, not online). We did terribly,
| but mainly I remember being in awe of how quickly Waterloo
| was solving the challenges.
| martincmartin wrote:
| _you get a linear time O(n) isEven function_
|
| In complexity theory, the size of a problem is the number of
| bits needed to represent the input. So if the input integer is
| _i_, then the size is _n = log_2(i)_, and the algorithm, which
| runs in O(i) = O(2^n) time, is actually exponential in the
| number of bits it takes to represent the input.
| trias wrote:
| is complexity theory constrained to binary representations?
| Why should it be?
| dragontamer wrote:
| In a strict sense, you're right.
|
| But in practice, you're not.
| Case in point: matrix multiplication is commonly quoted as
| O(n^3) (naive), when in fact the amount of data used is
| O(n^2), so with respect to data size it should be quoted as
| O(size^1.5). (EDIT: was bad at math for a second.)
|
| But no, people mean "n-by-n matrix" has an "O(n^3) naive"
| implementation. In practice, the "n" is just whatever is most
| convenient for a problem.
|
| In many cases, n is proportional to the input size. But in
| other cases (such as the famous matrix multiplication
| example), n is the sqrt(input-size).
| thrasumachos wrote:
| Nice that it even provides the right answer for negative
| numbers (undefined behavior in the original loop), but only
| when optimizations are enabled!
___________________________________________________________________
(page generated 2021-08-17 23:00 UTC)