# Speed Optimization Tips

*released under CC0 1.0, public domain*

This text attempts to document practices and language constructs that, on a randomly chosen platform with a randomly chosen compiler (and further unspecified flags), can help performance while being very unlikely to hurt it significantly. The text focuses on **speed** optimization, **not** memory optimization, but it also attempts not to waste too much memory (because these optimizations are often used on platforms that are limited in both speed and memory). No specific platform is assumed, but for convenience the assembly output of a specific platform may be shown as a demonstration of a possible result. The language used is C99, but the advice shouldn't be specific to this language: we attempt to present general **ideas** and **techniques**.

## optimizing at incorrect places achieves nothing

It is extremely important to know **where** in the code to focus the optimization effort, as targeting it wrongly will very likely achieve no performance gain.

Let's analyze an **example**: we have a game in which each frame our code updates 100 enemies, each taking 100 ns, and renders (only!) a 100 * 100 pixel screen, each pixel also taking 100 ns. This means one frame lasts some 1010000 ns, 99% of which is spent rendering pixels and 1% updating enemies. If we try to accelerate the enemy updating code, then even if we achieve a 1000% speedup, the whole program will not be accelerated even by 1%! On the other hand, if we accelerate the pixel rendering code by 50%, we accelerate the whole program by about the same amount.

It is therefore very important to identify the **bottlenecks** of your program and apply heavy optimizations there. Still, we shouldn't waste resources anywhere, and we should learn the habit of writing reasonably efficient code at all times. Optimizing heavily at wrong places will not only achieve practically no speedup, but can hurt us by lowering readability, increasing code size etc.

*"The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming."* --Donald Knuth

## what to avoid

- **branching** (`if` statements and ternary operators): Branching compiles into conditional jump instructions, which can lower performance. This is because the CPU preloads instructions ahead of time into a pipeline, which is what lets it execute them so fast – with a conditional jump it doesn't know which instructions are going to be executed and should therefore be preloaded. It tries to predict the branch, but this guess necessarily fails in a considerable number of cases, which results in a performance drop because the pipeline becomes ineffective. Some platforms may lack an instruction pipeline, in which case branch optimization won't matter, but with portable code we must not assume any platform.
- **division** (and modulo): Division is typically among the slowest basic arithmetic operations, and some platforms even lack a hardware divide instruction altogether, so that division has to be performed by a slow routine. Prefer multiplication, bit shifts and power of two constants where possible (see the sections below).
- **unpredictability**: Compilers, hardware and operating systems try to help our programs run quickly, but to be able to do this they usually have to predict what we are likely going to do. Writing code that accesses memory, branches and does other things in predictable ways helps them do this.

## comparison doesn't have to imply branching

As has been said, branching (the `if` statement) should be avoided if possible.
We may often be able to achieve this if we realize that **comparison doesn't have to mean branching!** For **example**, this simple function

TODO: **check with -O3... come up with another example?**

```C
int zeroClamp(int number)
{
  if (number >= 0) // branching, bad!
    return number;
  else
    return 0;
}
```

compiles to

```
0x000000000040052d <+7>:   cmp    DWORD PTR [rbp-0x4],0x0
0x0000000000400531 <+11>:  js     0x400538   # jump
0x0000000000400533 <+13>:  mov    eax,DWORD PTR [rbp-0x4]
0x0000000000400536 <+16>:  jmp    0x40053d   # jump
0x0000000000400538 <+18>:  mov    eax,0x0
```

while

```C
int zeroClamp(int number)
{
  return number * (number >= 0); // not branching, only comparing
}
```

compiles to

```
0x000000000040052d <+7>:   mov    eax,DWORD PTR [rbp-0x4]
0x0000000000400530 <+10>:  not    eax
0x0000000000400532 <+12>:  shr    eax,0x1f   # test of positivity optimized to bit shift
0x0000000000400535 <+15>:  movzx  eax,al
0x0000000000400538 <+18>:  imul   eax,DWORD PTR [rbp-0x4]
```

Both outputs consist of five instructions, but the former has two jumps, one conditional (`js`) and one unconditional (`jmp`), while the latter has no jumps at all! Measuring run speeds of both versions on the same sequence of a billion randomly generated numbers showed that on the given hardware the latter (branchless) version was 5% faster, which can already be considerable in critical code, and eliminating more branches with this technique can achieve a much greater performance gain.

## use powers of two

**If you can, use powers of two** (1, 2, 4, 8, 16, ...) for your constants, because these are very friendly to computers, which count in binary. Especially multiplications and, most importantly, divisions by such constants will be optimized by the compiler to extremely fast bit shifts (similarly to how we humans can very quickly multiply and divide by 10). These values can further help with alignment, not wasting memory etc.

For **example**, imagine a racing game in which we store the car position in game units, and we can choose how many of these units equal 1 meter. E.g.:

```C
#define GAME_UNITS_PER_METER 1000 // not a power of two! slow
```

says that 1000 game units equal 1 meter. However, the better way to go is

```C
#define GAME_UNITS_PER_METER 1024 // power of two! fast
```

Now let's write a function:

```C
uint distanceInMeters(uint distanceInGameUnits)
{
  return distanceInGameUnits / GAME_UNITS_PER_METER;
}
```

The latter (with the `1024` constant) may compile to something like this:

```
0x0000000000400751 <+0>:  mov    eax,edi
0x0000000000400753 <+2>:  shr    eax,0xa   # optimized to bit shift
0x0000000000400756 <+5>:  ret
```

While the former (with the `1000` constant) compiles to either this (with optimization for size):

```
0x0000000000400761 <+0>:   mov    ecx,0x3e8
0x0000000000400766 <+5>:   mov    eax,edi
0x0000000000400768 <+7>:   xor    edx,edx
0x000000000040076a <+9>:   div    ecx   # division, slow!
0x000000000040076c <+11>:  ret
```

or this (without size optimization the division gets replaced by multiplication, but that results in many more instructions):

```
0x0000000000400701 <+0>:   push   rbp
0x0000000000400702 <+1>:   mov    rbp,rsp
0x0000000000400705 <+4>:   mov    DWORD PTR [rbp-0x4],edi
0x0000000000400708 <+7>:   mov    eax,DWORD PTR [rbp-0x4]
0x000000000040070b <+10>:  mov    edx,0x10624dd3
0x0000000000400710 <+15>:  mul    edx
0x0000000000400712 <+17>:  mov    eax,edx
0x0000000000400714 <+19>:  shr    eax,0x6
0x0000000000400717 <+22>:  pop    rbp
0x0000000000400718 <+23>:  ret
```

## cache friendliness

Some platforms do not use a cache, in which case this optimization will have no effect, but when writing portable code we can't assume this. To make CPU caching effective, we should put related data close together in memory, so that at any given time we can keep working within a small memory range that can fit into the cache and so be worked with quickly. We should avoid accessing memory randomly, unpredictably and within great ranges.
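For illustration, here is a minimal sketch (the array size and all names are made up for the example) of two functions doing the same work, the first of which walks a matrix in the order it is laid out in memory, while the second jumps over a whole row with every step:

```C
#include <stdint.h>

#define N 1024

uint8_t matrix[N][N]; // C lays this out in memory row by row

uint32_t sumRowMajor(void)
{
  uint32_t sum = 0;

  for (int y = 0; y < N; ++y)
    for (int x = 0; x < N; ++x)
      sum += matrix[y][x]; // sequential memory access, cache friendly

  return sum;
}

uint32_t sumColumnMajor(void)
{
  uint32_t sum = 0;

  for (int x = 0; x < N; ++x)
    for (int y = 0; y < N; ++y)
      sum += matrix[y][x]; // jumps N bytes on every step, cache unfriendly

  return sum;
}
```

Both functions compute the same sum, but on a platform with a cache the first one can be expected to run noticeably faster, because every cache line it loads gets fully used before it moves on.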
## avoiding floating point

Some platforms, typically embedded, may lack a hardware floating point unit (FPU) and only emulate floating point with a slow software library. In time critical code that we intend to be portable we should therefore prefer integer operations whenever we can – and in practice we can **practically always** replace floating point math with integer math.

A very common way to avoid floating point is to use fixed point numbers, for which a lot of libraries exist. However, such libraries are only helpers, not a necessity, and if we want to keep maximum control and be able to optimize every expression, we can work just with integers, without any abstraction above them. This is not difficult to do; let us see how.

Let's assume 16 bit integers. A fixed point number allocates a certain number of bits to the integer and fractional parts, e.g. the Q8.8 format allocates 8 bits to the integer part and 8 bits to the fractional part. This means that the interval between any two consecutive integer values is divided into 256 smaller parts (thanks to the 8 fractional bits). When we are working with plain integers, we can achieve the same by simply saying that the numbers in our program represent a number of 256ths. So the number 1 in our code means 1/256, 128 represents 1/2 (128/256), 256 represents 1 (256/256), 512 represents 2 (512/256) etc.

Let's now see how floating point expressions compare to integer ones:

```C
#define UF 64 // how many fractions a unit (1.0) is divided into

// floating point: lowercase, integer: uppercase

float a = 2.5;
int16_t A = 2.5 * UF; // compiler reduces to a constant

float b = 0.3;
int16_t B = 0.3 * UF;

a += 5;
A += 5 * UF;

float expression = (a / b) * 20;
int16_t EXPRESSION = ((A * 20) / B) * UF; /* we need to pay attention to
  the ordering of operations so as to prevent rounding errors,
  overflows etc. */

printf("%f %d\n",expression,EXPRESSION / UF); // output: 499.999969 505
```

The price to pay for this technique is that we need to pay extra attention to writing the expressions, which however also allows extra optimizations. Often we as programmers, unlike the compiler, know what kind of values the variables will hold, and we can use this knowledge to tailor each specific expression. What we especially need to be careful about:

- **rounding errors**: E.g. if we want to express the ratio of two numbers, `a` and `b`, in percents, we would likely write the expression as `(a / b) * 100`. However, this can only give results of `0`, `100`, `200` etc., due to the rounding of integer division. We should rewrite the expression as `(a * 100) / b`.
- **overflows**: E.g. if we have an expression `(a * b) / (UF * UF)` and we know `a` and `b` can be large numbers, we can rewrite the expression as `(a / UF) * (b / UF)` to prevent overflow. If we don't know what values will come as inputs, we can check them at runtime and choose an appropriate version of the expression.

TODO: FP may be faster on platforms with an FPU (fewer instructions), but we can suppose these platforms will also be faster in general, so we should optimize for the slower platforms (those lacking an FPU).
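To show the technique in a reusable form, here is a minimal sketch (the names and the 32 bit intermediate type are our choice for the example) of multiplication and division of two fixed point values, widening the intermediate result to prevent the overflows mentioned above:

```C
#include <stdint.h>

#define UF 64 // fractions per unit (1.0), as above

// multiplies two fixed point values, result is again in UF fractions
int16_t fixedMul(int16_t a, int16_t b)
{
  return (((int32_t) a) * b) / UF; // 32 bit intermediate prevents overflow
}

// divides two fixed point values, result is again in UF fractions
int16_t fixedDiv(int16_t a, int16_t b)
{
  return (((int32_t) a) * UF) / b; // scale up first so precision isn't lost
}
```

With `UF` being a power of two, the division in `fixedMul` again compiles to little more than a bit shift (for signed values a few extra instructions are needed to get the rounding right).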
## fast interpolations via scaling

Linear interpolation is a very common operation in computer graphics and can often be part of the bottleneck, so it is good to know tricks that allow us to do it quickly. Imagine we want to fill a window, whose size can change, with a black-to-white gradient going from left to right. A common approach using floating point could look like this:

```C
for (int x = 0; x < screenWidth; ++x)
{
  int value = x / ((float) screenWidth) * 256.0;

  for (int y = 0; y < screenHeight; ++y)
    putPixel(x,y,rgb(value,value,value));
}
```

This is a very wasteful and slow approach for several reasons, such as using division in every iteration. We can achieve the same result using only integer operations, a single division outside the loop, and an addition and a bit shift inside the loop:

```C
#define SCALE_FACTOR 6 // can be set to different values

int scaledValue = 0;
int scaledStep = (255 << SCALE_FACTOR) / screenWidth;

for (int x = 0; x < screenWidth; ++x)
{
  int value = scaledValue >> SCALE_FACTOR;

  for (int y = 0; y < screenHeight; ++y)
    putPixel(x,y,rgb(value,value,value));

  scaledValue += scaledStep;
}
```

We are effectively using a fixed point real value with `SCALE_FACTOR` fractional bits (`scaledValue`), to which we add a fixed point step (`scaledStep`) in each iteration, and which we convert to an integer with a bit shift by `SCALE_FACTOR`.

## compiler flags and hints

We can usually achieve an instant great performance boost by simply telling the compiler we want to optimize for speed. With gcc we do this with the `-O1`, `-O2` and `-O3` flags.

TODO: compare assembly of all three options; inline, likely/unlikely, unroll, `_fast` data types from stdlib, other?

## else should be the less likely branch

When writing `if` statements, put the more likely code in the *then* branch and the less likely code in the *else* branch, because the *then* code is typically placed on the fall-through path of the conditional jump, which tends to be the cheaper one (e.g. on CPUs whose static prediction assumes forward jumps are not taken). So instead of

```C
if (probablyPositive < 0) // not very likely, bad!
  negativeSum += probablyPositive;
else
  positiveSum += probablyPositive;
```

write

```C
if (probablyPositive >= 0) // most likely, correct!
  positiveSum += probablyPositive;
else
  negativeSum += probablyPositive;
```

Testing this in a benchmark shows the latter version of the branch really is a few percent faster.

## put most frequently accessed struct member first

## approximations

Very often performance is wasted on computing unnecessarily precise results. With computers, everything we can ever get is an approximation of the real world, and such approximations are practically always all we need, so the question is only how accurate we **need** to be – or rather how inaccurate we can afford to be while still achieving our goal. Computing anything more precisely than needed is simply a waste of resources, and it very often happens only out of negligence, by forgetting to ask the question "do we really need this much accuracy?". We are not looking for **perfect** solutions, we are looking for **good enough** ones. Worse is better.

Especially in games and graphics we can extremely often get away with faking things – some would even say computer graphics is mostly the art of faking. Screen space effects, camera shearing instead of camera rotation, affine texturing instead of perspective correct texturing, environment maps, local illumination instead of global illumination... we would hardly find anything in graphics that is not an approximation.

A lot of mathematical functions can be approximated by other, much cheaper functions, typically polynomials. For example, the `sin(x)` function can be approximated simply by `x` for small values of `x`.
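As a small sketch of such a polynomial approximation (the function name and the stated usable range are just our illustration), we can take the first two terms of the Taylor series of sine:

```C
// approximates sin(x); the error stays below about 0.003 for |x| <= pi / 4
float sinApprox(float x)
{
  return x - (x * x * x) / 6.0f; // first two terms of the Taylor series
}
```

Whether such accuracy is good enough of course depends entirely on the use case, which is the whole point of this section.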
However, by approximations **we don't mean just different mathematical functions**! It is important to realize that an approximation can also mean **approximate behavior**. A great example can be shown with rendering: if we are rendering multiple 3D objects, we need to sort them by their distance from the camera and draw them in the correct order, so that close objects correctly obstruct the more distant ones (the so called painter's algorithm). The perfectly correct behavior is to sort the objects before rendering every frame, but doing this in real time for many objects can be slow. A behavior that is not perfect, but can be faster and still good enough, is to sort the objects only once every 16 frames, or to perform only one step of a sorting algorithm per frame. This can sometimes render the objects in a wrong order, but most of the time we will still be seeing the correct result.

For a simple **example**, in a top-down 2D game we may need to compute distances of objects, so that e.g. a door opens when the player comes near it. What we understand by *distance* is usually the Euclidean distance, and it is typically the *correct* distance to compute – the perfect result. It is computed like this:

```C
int distance(int x0, int y0, int x1, int y1) // Euclidean distance
{
  int dx = x1 - x0;
  int dy = y1 - y0;

  return sqrt(dx * dx + dy * dy); // expensive (sqrt from math.h)
}
```

However, we can mostly get away with the so called **Manhattan (taxicab) distance**:

```C
int distance(int x0, int y0, int x1, int y1) // Manhattan distance
{
  int dx = x1 - x0;
  int dy = y1 - y0;

  return abs(dx) + abs(dy); // cheap (abs from stdlib.h)
}
```

The latter is a good distance approximation and is much cheaper to compute, using only two absolute values and an addition as opposed to an addition, two multiplications and a square root.

## everything can be precomputed

The term **space-time tradeoff** refers to the possibility of trading memory usage for program run speed and vice versa. The important thing is that **we can almost always do this**, and **we can usually choose to what degree**, so if there is a part of code (e.g. inside a time critical inner loop) that we desperately need to speed up, we probably can, provided we have some spare memory.

Very commonly we use so called **look up tables** (LUTs) – tables containing precomputed values of a function that would be too expensive to compute at the time we need it, e.g. trigonometric functions like sin and cos, or even precomputed division results. In the most extreme case we can precompute the values of the function for every possible input, by which we make the function run extremely fast (as fast as retrieving a value from an array), but we probably sacrifice an extreme amount of memory too (depending on how many input possibilities exist). In practice we want to do **something in between**. For example in the case of the sin function we don't have to store all the values from 0 to 2 pi – it is enough to store only one quadrant, i.e. the values from 0 to pi / 2, as the function is symmetric and the remaining quadrants can be quickly computed using simple transformations (e.g. `sin(pi + pi/5) == -1 * sin(pi/5)`).
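Here is a minimal sketch of such a sine table (the table resolution, the fixed point scale and all names are our choice for the example), storing only the first quadrant and deriving the other three by mirroring and negating:

```C
#include <stdint.h>

#define SIN_TABLE_SIZE 8 // samples of the first quadrant (0 to pi/2)
#define SIN_UNIT 256     // fixed point scale: SIN_UNIT represents 1.0

// sin(i / SIN_TABLE_SIZE * pi/2) * SIN_UNIT, precomputed for i = 0 to 8:
static const int16_t sinTable[SIN_TABLE_SIZE + 1] =
  { 0, 50, 98, 142, 181, 213, 237, 251, 256 };

/* Input: angle such that 4 * SIN_TABLE_SIZE units == 2 pi (a full circle).
   Output: sine of the angle, scaled by SIN_UNIT. */
int16_t sinFixed(uint16_t angle)
{
  angle %= 4 * SIN_TABLE_SIZE; // wrap to a full circle; the constant is a
                               // power of two, so this is only a bit mask
  uint16_t quadrant = angle / SIN_TABLE_SIZE;
  uint16_t index = angle % SIN_TABLE_SIZE;

  if (quadrant & 1) // quadrants 1 and 3 run backwards: mirror the index
    index = SIN_TABLE_SIZE - index;

  int16_t result = sinTable[index];

  return (quadrant & 2) ? -result : result; // quadrants 2 and 3 are negative
}
```

In a real program the table would of course be bigger (and could itself be generated by a script, i.e. at compile time), but the principle stays the same.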
LUTs aren't the only way of gaining performance via precomputation. Another way is using e.g. **acceleration structures** such as search indices or tree structures like quad trees.

Precomputation can happen at both **compile time** and **run time**. For example, the famous pseudo 3D game Doom achieved its fast rendering by precomputing a tree structure (a BSP tree) of game sectors, which helped very quickly determine which walls to draw and in what order. This tree was computed at compile time; because of this precomputed structure the walls in levels could not move in any direction other than the vertical one. Good candidates for things to precompute are **static** things, i.e. things that don't move or otherwise change, such as the positions of unmovable items.

TODO: more **examples**? Doom BSPs, tan (compile time), sprite rendering (run time)

## early branching

This is another case of the space-time tradeoff, in which we try to speed up a loop by creating two (or more) versions of it, so that a branch it contains can be moved above the loop. By this we end up with longer code (more memory used) due to the multiple versions of the loop, but we achieve faster execution. This sounds complicated, but an example should show the concept is simple. Take for example this function that moves all positions in an array by a constant step, either left or right:

```C
void moveAll(int positions[POSITION_COUNT], int left)
{
  for (int i = 0; i < POSITION_COUNT; ++i)
    positions[i] += left ? -STEP : STEP;
}
```

We can see the loop contains a condition that will be evaluated over and over, in each iteration, and in each iteration it will give the same result. Knowing this, we can move the condition out of the loop:

```C
void moveAll(int positions[POSITION_COUNT], int left)
{
  if (left) // early branch, moved out of the loop
  {
    for (int i = 0; i < POSITION_COUNT; ++i) // one version of the loop
      positions[i] -= STEP;
  }
  else // right
  {
    for (int i = 0; i < POSITION_COUNT; ++i) // another version of the loop
      positions[i] += STEP;
  }
}
```

We optimize in **two** aspects: we move a computation (the condition evaluation) out of the loop, and we also eliminate branching (a possibility of pipeline slowdown) from it. (It is possible that the branch prediction used in CPUs could mostly prevent the latter in this simple case, but multiple branches in a more complex case could already break the prediction.) Benchmarking this simple example shows about a 13% speed increase for the optimization.

## replace division by multiplication

The expensive division operation (*A = B / C*) can be replaced by multiplication by a reciprocal value (*A = B * 1/C*). To compute the reciprocal (*1/C*) we still need to perform a division, so this fact won't help us when doing a single division, but it can help us when doing multiple divisions by the same number.

We are still dealing with integers, and the reciprocal value (*1/C*) is a number not greater than 1, which would usually get rounded to 0. We have to get around this by using an extra scaling constant *N*. Let's now derive the formula we can use:

```
A = B / C
A = B * 1/C
N * A = B * N/C          multiply both sides by a constant N (a power of 2)
A = (B * (N/C)) / N      divide both sides by N (only a bit shift with N = power of 2)
```

The last line shows that the division *A = B / C* can be computed as multiplying *B* by a precomputed value *N/C* (a scaled reciprocal) and dividing it all by *N*, which with *N* being a power of two amounts to a mere bit shift. Compilers use this trick to optimize division by constants, but they may be unable to optimize divisions by variables like this.
Let's test this in code:

```C
void divideAll(uint values[NUMBER_COUNT], uint results[NUMBER_COUNT],
  uint divideBy)
{
  for (int i = 0; i < NUMBER_COUNT; ++i)
    results[i] = values[i] / divideBy; // division in every iteration, bad!
}
```

We can now optimize the code to:

```C
void divideAll(uint values[NUMBER_COUNT], uint results[NUMBER_COUNT],
  uint divideBy)
{
  #define RECIP_SCALE 1024

  uint recip = RECIP_SCALE / divideBy; // only one division, bearable

  for (int i = 0; i < NUMBER_COUNT; ++i)
    results[i] = (values[i] * recip) / RECIP_SCALE;
    //                               ^ only a bit shift

  #undef RECIP_SCALE
}
```

We can see the latter version only performs one initial division; inside the loop only a multiplication and a bit shift happen. Benchmarking the snippets shows the optimized version is about 20% faster! Keep in mind though that due to the rounding of the precomputed reciprocal the results are only approximate (and `values[i] * recip` must not overflow), so the scaling constant has to be chosen with the expected inputs in mind.

## less flexibility for more performance

Performance can sometimes be gained at the price of limiting your program in ways you may not mind at all. Do not pay for something you don't need.

For example when programming a game for a handheld console, we typically know the resolution of the display, and because we optimize the game for this specific hardware, we won't allow any changes to this resolution. In this case we can therefore assume a constant resolution and make it a constant known to the compiler at compile time, as opposed to a runtime variable that can change, which is what we would use for a game running on a PC in a resizeable window. We simply don't need this flexibility, and so we exchange it for performance. So, instead of

```C
uint screenWidth = 320;
uint screenHeight = 240;
```

we write

```C
#define SCREEN_WIDTH 320
#define SCREEN_HEIGHT 240
```

Now whenever the code contains an expression working with the resolution, such as

```C
int screenMiddleColumn = SCREEN_WIDTH / 2;
```

the compiler can compute the value at compile time and replace the expression with a constant, saving instructions and with them both time and space (memory). This can be especially effective when we make the constants powers of two, as operations with such constants can be highly optimized by the compiler.

## move computations from run time to compile time

This is just general advice: think about which computations can be moved from run time to compile time.

## passing function parameters via global variables

This is not always possible (e.g. if we need recursion) or nice, but passing parameters to a function via global variables can save stack push and pop instructions and may even make accessing the variables faster (as the location of a global variable is known at compile time). Inlining a function can probably achieve a similar optimization automatically, but not every compiler may be able to do it, so it's good to know how to do this manually.

## create your own cache

We can use a small amount of memory – even just a few bytes – to store an expensively computed value if we suspect we may need to compute the same value again in the future, in which case we can skip the computation and only fetch the value from memory. However, we have to be careful not to actually slow the program down with the additional cost of managing this memory: only use such a cache for considerably expensive values, and when you know the cache will actually be hit often.

TODO: pixelfunc cache
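As a minimal sketch of the idea (the function and all names here are made up for the example), a single entry cache remembering the last input and result can look like this:

```C
#include <stdint.h>

uint32_t expensiveFunction(uint32_t input); // some expensive computation

uint32_t cachedExpensiveFunction(uint32_t input)
{
  static uint32_t lastInput = 0;
  static uint32_t lastResult = 0;
  static int cacheValid = 0;

  if (!cacheValid || input != lastInput) // cache miss: compute and remember
  {
    lastResult = expensiveFunction(input);
    lastInput = input;
    cacheValid = 1;
  }

  return lastResult; // on a hit we just return the remembered value
}
```

This obviously pays off only if the function tends to be called repeatedly with the same argument. Note that the static variables also make the function non-reentrant, which is exactly the kind of flexibility for performance exchange mentioned above.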
## align data for fast access

## avoid data type conversions

## arguments passed in registers (ARM)?

## prefer unsigned types

faster in theory, todo: test

## prefer comparing to zero?

Comparing against zero can be cheaper than comparing against other values, because many CPUs set a zero flag as a byproduct of arithmetic instructions: e.g. a loop counting down to zero may need no explicit comparison instruction at all, as the decrement itself sets the flag that the conditional jump tests. todo: test

## trying to be too smart can be counter productive

Trying to outsmart a compiler can backfire and even result in slower code, as we may end up confusing the compiler, which then may not be able to make sense of our code and optimize it well.