Compilers are a necessary technology to turn high-level, easier-to-write code into efficient machine code for computers to execute. Their sophistication at doing this is often overlooked. You may spend a lot of time carefully considering algorithms and fighting error messages but perhaps not enough time looking at what compilers are capable of doing.
This article introduces some compiler and code generation concepts, and then shines a torch over a few of the very impressive feats of transformation your compilers are doing for you, with some practical demonstrations of my favorite optimizations. I hope you will gain an appreciation for what kinds of optimizations you can expect your compiler to do for you, and how you might explore the subject further. Most of all, you may learn to love looking at the assembly output and may learn to respect the quality of the engineering in your compilers.
The examples shown here are in C or C++, which are the languages I have had the most experience with, but many of these optimizations are also available in other compiled languages. Indeed, the advent of front-end-agnostic compiler toolkits such as LLVM3 means most of these optimizations work in the exact same way in languages such as Rust, Swift, and D.
I have always been fascinated by what compilers are capable of. I spent a decade making video games where every CPU cycle counted in the war to get more sprites, explosions, or complicated scenes on the screen than our competitors. Writing custom assembly, and reading the compiler output to see what it was capable of, was par for the course.
Fast-forward five years and I was at a trading company, having switched out sprites and polygons for fast processing of financial data. Just as before, knowing what the compiler was doing with code helped inform the way we wrote the code.
Obviously, nicely written, testable code is extremely important—especially if that code has the potential to make thousands of financial transactions per second. Being fastest is great, but not having bugs is even more important.
In 2012, we were debating which of the new C++11 features could be adopted as part of the canon of acceptable coding practices. When every nanosecond counts, you want to be able to give advice to programmers about how best to write their code without being antagonistic to performance. While experimenting with how code uses new features such as auto
, lambdas, and range-based for
, I wrote a shell script (a)
to run the compiler continuously and show its filtered output. This proved so useful in answering all these "what if?" questions that I went home that evening and created Compiler Explorer.1
Many optimizations fall under the umbrella of strength reduction: taking expensive operations and transforming them to use less expensive ones.
Over the years I have been constantly amazed by the lengths to which compilers go in order to take our code and turn it into a work of assembly code art. I encourage all compiled language programmers to learn a little assembly in order to appreciate what their compilers are doing for them. Even if you cannot write it yourself, being able to read it is a useful skill.
All the assembly code shown here is for 64-bit x86 processors, as that is the CPU I'm most familiar with and is one of the most common server architectures. Some of the examples shown here are x86-specific, but in reality, many types of optimizations apply similarly on other architectures. Additionally, I cover only the GCC and Clang compilers, but equally clever optimizations show up on compilers from Microsoft and Intel.
This is far from a deep dive into compiler optimizations, but some concepts used by compilers are useful to know. In these pages, you will note a running column of examples of scripts and instructions for the processes and operations discussed. All are linked by the corresponding (letter)
.
Many optimizations fall under the umbrella of strength reduction: taking expensive operations and transforming them to use less expensive ones. A very simple example of strength reduction would be taking a loop with a multiplication involving the loop counter, as shown in (b)
. Even on today's CPUs, multiplication is a little slower than simpler arithmetic, so the compiler will rewrite that loop to be something like (c)
.
Here, strength reduction took a loop involving multiplication and turned it into a sequence of equivalent operations using only addition. There are many forms of strength reduction, more of which show up in the practical examples given later.
Another key optimization is inlining, in which the compiler replaces a call to a function with the body of that function. This removes the overhead of the call and often unlocks further optimizations, as the compiler can optimize the combined code as a single unit. You will see plenty of examples of this later.
Other optimization categories include:
The golden rule for helping the compiler optimize is to ensure it has as much information as possible to make the right optimization decisions. One source of information is your code: If the compiler can see more of your code, it's able to make better decisions. Another source of information is the compiler flags you use: telling your compiler the exact CPU architecture you are targeting can make a big difference. Of course, the more information a compiler has, the longer it could take to run, so there is a balance to be struck here.
Let's take a look at an example (d)
, counting the number of elements of a vector that pass some test (compiled with GCC, optimization level 3; https://godbolt.org/z/acm19_count1). If the compiler has no information about testFunc
, it will generate an inner loop like (e)
.
To understand this code, it's useful to know that a std::vector<>
contains some pointers: one to the beginning of the data; one to the end of the data; and one to the end of the storage currently allocated (f)
. The size of the vector is not directly stored, it's implied in the difference between the begin
() and end
() pointers. Note that the calls to vector<>:: size
() and vector<>: :operator[]
have been inlined completely.
In the assembly code (e), ebp points to the vector object, and the begin()
and end()
pointers are therefore QWORD PTR [rbp+0]
and QWORD PTR [rbp+8]
, respectively.
Another neat trick the compiler has done is to remove any branching: you might reasonably expect if (testFunc(...))
would turn into a comparison and branch. Here the compiler does the comparison cmp al, 1,
which sets the processor carry flag if testFunc()
returned false, otherwise it clears it. The sbb r12d, -1
instruction then subtracts -1 with borrow, the subtract equivalent of carrying, which also uses the carry flag. This has the desired side effect: If the carry is clear (testFunc
() returned true), it subtracts -1
, which is the same as adding 1
. If the carry is set, it subtracts -1 + 1
, which has no effect on the value. Avoiding branches can be advantageous in some cases if the branch is not easily predictable by the processor.
It may seem surprising that the compiler reloads the begin()
and end()
pointers each loop iteration, and indeed it rederives size()
each time too. However, the compiler is forced to do so: it has no idea what testFunc()
does and must assume the worst. That is, it must assume calls to testFunc()
may cause the vec to be modified. The const
reference here doesn't allow any additional optimizations for a couple of reasons: testFunc()
may have a non-const reference to vec (perhaps through a global variable), or testFunc()
might cast away const.
If, however, the compiler can see the body of testFunc(), and from this know that it does not in fact modify vec (g)
, the story is very different (https://godbolt.org/z/acm19_count2).
In this case the compiler has realized that the vector's begin()
and end()
are constant during the operation of the loop. As such it has been able to realize that the call to size()
is also a constant. Armed with this knowledge, it hoists these constants out of the loop, and then rewrites the index operation (vec[i]
) to be a pointer walk, starting at begin()
and walking up one int at a time to end()
. This vastly simplifies the generated assembly.
In this example, I gave the compiler a body to testFunc()
but marked it as non-inlineable (a GNU extension) to demonstrate this optimization in isolation. In a more realistic codebase, the compiler could inline testFunc()
if it believed it beneficial.
Another way to enable this optimization without exposing the body of the function to the compiler is to mark it as [[gnu:: pure]]
(another language extension). This promises the compiler that the function is pure—entirely a function of its arguments with no side effects.
Interestingly, using range-for in the initial example yields optimal assembly, even without knowing that testFunc()
does not modify vec
(https://godbolt.org/z/acm19_count3). This is because range-for
is defined as a source code transformation that puts begin()
and end()
into local variables as shown in (h)
and is is interpreted as (i)
.
All things considered, if you need to use a "raw" loop, the modern range-for
style is preferred: it's optimal even if the compiler cannot see the body of called functions, and it is clearer to the reader. Arguably better still is to use the STL's count_if
function to do all the work for you: the compiler still generates optimal code (https://godbolt.org/z/acm19_count4).
In the traditional single-translation-unit-at-a-time compilation model, function bodies are often hidden from call sites, as the compiler has seen only their declaration. Link time optimization (LTO, also known as LTCG for link time code generation) can be used to allow the compiler to see across translation unit boundaries. In LTO, individual translation units are compiled to an intermediate form instead of machine code. During the link process—when the entire program (or dynamic linked library) is visible—machine code is generated. The compiler can take advantage of this to inline across translation units, or at least use information about the side effects of called functions to optimize.
Enabling LTO for optimized builds can be a good win in general, as the compiler can see your whole program. I now rely on LTO to let me move more function bodies out of headers to reduce coupling, compile time, and build dependencies for debug builds and tests, while still giving me the performance I need in final builds.
Despite being a relatively established technology (I used LTCG in the early 2000s on the original Xbox), I have been surprised how few projects use LTO. In part this may be because programmers unintentionally rely on undefined behavior that becomes apparent only when the compiler gets more visibility: I know I have been guilty of this.
Over the years, I have collected a number of interesting real-world optimizations, both from first-hand experience optimizing my own code and from helping others understand their code on Compiler Explorer. Here are some of my favorite examples of how clever the compiler can be.
Integer division by a constant. It may be surprising to learn that—until very recently—about the most expensive thing you could do on a modern CPU is an integer divide. Division is more than 50 times more expensive than addition and more than 10 times more expensive than multiplication. (This was true until Intel's release of the Cannon Lake microarchitecture, where the maximum latency of a 64-bit divide was reduced from 96 cycles to 18.6 This is only around 20 times slower than an addition, and 5 times more expensive than multiplication.)
Thankfully, compiler authors have some strength reduction tricks up their sleeves when it comes to division by a constant. I am sure we have all realized that division by a power of two can often be replaced by a logical shift right—rest assured the compiler will do this for you. I would advise not writing a >> in your code to do division; let the compiler work it out for you. It's clearer, and the compiler also knows how to account properly for signed values: integer division truncates toward zero, and shifting down by itself truncates toward negative infinity.
However, what if you are dividing by a non-power-of-two value, as illustrated in (j)
? Are you out of luck? Luckily, the compiler has your back again. The code compiles to (k)
(https://godbolt.org/z/acm19_div3).
Not a divide instruction in sight. Just a shift, and a multiply by a strange large constant: the 32-bit unsigned inputvalue is multiplied by Oxaaaaaaab
, and the resulting 64-bit value is shifted down by 33 bits. The compiler has replaced division with a cheaper multiplication by the reciprocal, in fixed point. The fixed point in this case is at bit 33, and the constant is one-third expressed in these terms (it's actually 0.33333333337213844). The compiler has an algorithm for determining appropriate fixed points and constants to achieve the division while preserving the same rounding as an actual division operation with the same precision over the range of the inputs. Sometimes this requires a number of extra operations—for example, in dividing by 1022 as shown in (l
, https://godbolt.org/z/acm19_div1023).
The algorithm is well known and documented extensively in the excellent book, Hacker's Delight.8
In short, you can rely on the compiler to do a great job of optimizing division by a compile-time-known constant.
You might be thinking: Why is this such an important optimization? How often does one actually perform integer division, anyway? The issue is not so much with division itself as with the related modulus operation, which is often used in hash-map implementations as the operation to bring a hash value into the range of the number of hash buckets.
Knowing what the compiler can do here can lead to interesting hash-map implementations. One approach is to use a fixed number of buckets to allow the compiler to generate the perfect modulus operation without using the expensive divide instruction.
Most hash maps support rehashing to a different number of buckets. Naively, this would lead to a modulus with a number known only at runtime, forcing the compiler to emit a slow divide instruction. Indeed, this is what the GCC libstdc++ library implementation of std::unordered _ map
does.
Clang's libc++ goes a little further: it checks if the number of buckets is a power of two, and if so skips the divide instruction in favor of a logical AND
. Having a power-of-two bucket count is alluring as it makes the modulus operation fast, but in order to avoid excessive collisions it relies on having a good hash function. A prime-number bucket count gives decent collision resistance even for simplistic hash functions.
You can rely on the compiler to do a great job of optimizing division by a compile-time-known constant.
Some libraries such as boost::multi _ index
go a step further: instead of storing the actual number of buckets, they use a fixed number of prime-sized bucket counts (m)
.
That way, for all possible hash-table sizes the compiler generates the perfect modulus code, and the only extra cost is to dispatch to the correct piece of code in the switch
statement.
GCC 9 has a neat trick (n)
for checking for divisibility by a non-power-of-two (https://godbolt.org/z/acm19_multof3) and compiles to (o)
.
This apparent witchcraft is explained very well by Daniel Lemire in his blog.2 As an aside, it's possible to do these kinds of integer division tricks at runtime too. If you need to divide many numbers by the same value, you can use a library such as libdivide
.5
How often have you wondered, How many set bits are in this integer? Probably not all that often. But it turns out this simple operation is surprisingly useful in a number of cases. For example, calculating the Hamming distance between two bitsets, dealing with packed representations of sparse matrices, or handling the results of vector operations.
You might write a function to count the bits as shown in (p)
. Of note is the bit manipulation "trick" a &= (a - 1);
, which clears the bottom-most set bit. It's a fun one to prove to yourself how it works on paper. Give it a go.
When targeting the Haswell microarchitecture, GCC 8.2 compiles this code to the assembly shown in (q)
(https://godbolt.org/z/acm19_bits). Note how GCC has cleverly found the BLSR
bit-manipulation instruction to pick off the bottom set bit. Neat, right? But not as clever as Clang 7.0 illustrated in (r)
.
This operation is common enough that there is an instruction on most CPU architectures to do it in one go: POPCNT (population count). Clang is clever enough to take a whole loop in C++ and reduce it to a single instruction. This is a great example of good instruction selection: Clang's code generator recognizes this pattern and is able to choose the perfect instruction.
I was actually being a little unfair here: GCC 9 also implements this (s)
, and in fact shows a slight difference. At first glance this appears to be suboptimal: Why on earth would you write a zero value, only to overwrite it immediately with the result of the "population count" instruction popcnt?
A little research brings up Intel CPU erratum SKL029: "POPCNT
Instruction May Take Longer to Execute Than Expected"—there is a CPU bug! Although the popcnt
instruction completely overwrites the output register eax
, it is incorrectly tagged as depending on the prior value of eax. This limits the CPU's ability to schedule the instruction until any prior instructions writing to eax have completed—even though they have no impact.
GCC's approach here is to break the dependency on eax:
the CPU recognizes xor eax, eax
as a dependency-breaking idiom. No prior instruction can influence eax after xor eax, eax,
and the popcnt
can run as soon as its input operand edi
is available.
This affects only Intel CPUs and seems to be fixed in the Cannon Lake microarchitecture, although GCC still emits XOR
when targeting it.
Maybe you have never needed to count the number of set bits in an integer, but you have probably written code like the example in (t)
. Instinctively, I thought the code generation would be full of compares and branches, but both Clang and GCC use a clever trick to make this code pretty efficient. GCC 9.1's output is shown in (u
; https://godbolt.org/z/acm19_conds)
The compilers turn this sequence of comparisons into a lookup table. The magic value loaded into rax is a 33-bit lookup table, with a one-bit in the locations where you would return true (indices 32, 13, 10, and 9 for ' ', \r, \n,
and \t,
respectively). The shift and & then pick out the cth bit and return it. Clang generates slightly different but broadly equivalent code. This is another example of strength reduction.
I was pleasantly surprised to see this kind of optimization. This is definitely the kind of thing that—prior to investigating in Compiler Explorer—I would have written manually assuming I knew better than the compiler.
One unfortunate thing I did notice while experimenting is—for GCC, at least—the order of the comparisons can affect the compiler's ability to make this optimization. If you switch the order of the comparison of the \r
and \n
, GCC generates the code in (v)
.
There's a pretty neat trick with the and to combine the comparison of \r
and \t
, but this seems worse than the code generated before. That said, a simplistic benchmark on Quick Bench suggests the compare-based version might be a tiny bit faster in a predictable tight loop.a Who ever said this was simple, eh?
Sometimes you need to add a bunch of things up. Compilers are extremely good at taking advantage of the vectorized instructions available in most CPUs these days, so even a pretty straightforward piece of code such as (w)
gets turned into code whose core loop looks like (x)
(https://godbolt.org/z/acm19_sum).
The compiler has been able to process eight values per instruction, by separating the total into eight separate subtotals for each one. At the end it sums across those subtotals to make the final total. It's as if the code was rewritten for you to look more like the (y)
example.
Simply place the compiler's optimization level at a high enough setting and pick an appropriate CPU architecture to target, and vectorization kicks in. Fantastic!
This does rely on the fact that separating the totals into individual subtotals and then summing at the end is equivalent to adding them in the order the program specified. For integers, this is trivially true; but for floating-point data types this is not the case. Floating point operations are not associative: (a+b)+c is not the same as a+(b+c), as—among other things—the precision of the result of an addition depends on the relative magnitude of the two inputs.
This means, unfortunately, that changing the vector<int>
to be a vector<float>
does not result in the code you would ideally want. The compiler could use some vector operations (it can square eight values at once), but it is forced to sum across those values serially as shown in (z
; https://godbolt.org/z/acm19_sumf).
This is unfortunate, and there is not an easy way around it. If you are absolutely sure the order of addition is not important in your case, you can give GCC the dangerous (but amusingly named)-funsafe-math-optimizations
flag. This lets it generate this beautiful inner loop illustrated in (a'
; https://godbolt.org/z/acm19_sumf_unsafe).
Amazing stuff: processing eight floats at a time, using a single instruction to accumulate and square. The drawback is potentially unbounded precision loss. Additionally, GCC does not allow you to turn this feature on for just the functions you need it for—it's a per-compilation unit flag. Clang at least lets you control it in the source code with #pragma Clang fp contract.
While playing around with these kinds of optimizations, I discovered that compilers have even more tricks up their sleeves, check out (b')
. GCC generates fairly straightforward code for this, and with appropriate compiler settings will use vector operations as noted earlier. Clang, however, generates the code in (c')
; (https://godbolt.org/z/acm19_sum_up).
First, note there is no loop at all. Working through the generated code, you see that Clang returns:
It has replaced the iteration of a loop with a closed-form general solution of the sum. The solution differs from what I would naively write myself:
This is presumably a result of the general algorithm Clang uses.
Further experimentation shows that Clang is clever enough to optimize many of these types of loops. Both Clang and GCC track loop variables in a way that allows this kind of optimization, but only Clang chooses to generate the closed-form version. It's not always less work: for small values of x
the overhead of the closed-form solution might be more than just looping. Krister Walfridsson goes into great detail about how this is achieved in a blog post.7
It is also worth noting that in order to do this optimization, the compiler may rely on signed integer overflow being undefined behavior. As such, it can assume your code cannot pass a value of x that would overflow the result (65536, in this case). If Clang cannot make that assumption, it is sometimes unable to find a closed-form solution (https://godbolt.org/z/acm19_sum_fail).
Although it seems to have fallen out of favor a little, traditional virtual-function-based polymorphism has its place. Whether it's to allow for genuine polymorphic behavior, or add a "seam" for testability, or allow for future extensibility, polymorphism through virtual functions can be a convenient choice.
As we know, though, virtual functions are slow. Or are they? Let's see how they affect the sum-of-squares example from earlier—something like (d')
.
Of course, this is not polymorphic yet. A quick run through the compiler shows the same highly vectorized assembly (https://godbolt.org/z/acm19_poly1).
Now adding the virtual keyword in front of the int operator()
should result in a much slower implementation, filled with indirect calls, right? Well, sort of (https://godbolt.org/z/acm19_poly2). There is a lot more going on than before, but at the core of the loop is something perhaps surprising (e')
.
What is happened here is GCC has made a bet. Given that it has seen only one implementation of the Transform class, it is likely going to be that one implementation that is used. Instead of blindly indirecting through the virtual function pointer, it has taken the slight hit of comparing the pointer against the only known implementation. If it matches, then the compiler knows what to do: it inlines the body of the Transform::operator ()
and squares it in place.
That's right: the compiler has inlined a virtual call. This is amazing, and was a huge surprise when I first discovered this. This optimization is called speculative devirtualization and is the source of continued research and improvement by compiler writers. Compilers are capable of devirtualizing at LTO time too, allowing for whole-program determination of possible function implementations.
The compiler has missed a trick, however. Note that at the top of the loop it reloads the virtual function pointer from the vtable every time. If the compiler were able to notice that this value remains constant if the called function does not change the dynamic type of Transform, this check could be hoisted out of the loop, and then there would be no dynamic checks in the loop at all. The compiler could use loop-invariant code motion to hoist the vtable check out of the loop. At this point the other optimizations could kick in, and the whole code could be replaced with the vectorized loop from earlier in the case of the vtable check passing.
You would be forgiven for thinking that the dynamic type of the object could not possibly change, but it's actually allowed by the standard: an object can placement new over itself so long as it returns to its original type by the time it's destructed. I recommend that you never do this, though. Clang has an option to promise you never do such horrible things in your code: -fstrict-vtable-pointers.
Of the compilers I use, GCC is the only one that does this as a matter of course, but Clang is overhauling its type system to leverage this kind of optimization more.4
C++11 added the final specifier to allow classes and virtual methods to be marked as not being further overridden. This gives the compiler more information about which methods may profit from such optimizations, and in some cases may even allow the compiler to avoid a virtual call completely (https://godbolt.org/z/acm19_poly3). Even without the final keyword, sometimes the analysis phase is able to prove that a particular concrete class is being used (https://godbolt.org/z/acm19_poly4). Such static devirtualization can yield significant performance improvements.
Hopefully, after reading this article, you will appreciate the lengths to which the compiler goes to ensure efficient code generation. I hope that some of these optimizations are a pleasant surprise and will factor in your decisions to write clear, intention-revealing code and leave it to the compiler to do the right thing. I have reinforced the idea that the more information the compiler has, the better job it can do. This includes allowing the compiler to see more of your code at once, as well as giving the compiler the right information about the CPU architecture you are targeting. There is a trade-off to be made in giving the compiler more information: it can make compilation slower. Technologies such as link time optimization can give you the best of both worlds.
Optimizations in compilers continue to improve, and upcoming improvements in indirect calls and virtual function dispatch might soon lead to even faster polymorphism. I am excited about the future of compiler optimizations. Go take a look at your compiler's output.
Acknowledgments. The author would like to extend his thanks to Matt Hellige, Robert Douglas, and Samy Al Bahra, who gave feedback on drafts of this article.
Related articles
on queue.acm.org
C Is Not a Low-level Language
David Chisnall
https://queue.acm.org/detail.cfm?id=3212479
Uninitialized Reads
Robert C. Seacord
https://queue.acm.org/detail.cfm?id=3041020
You Don't Know Jack about Shared Variables or Memory Models
Hans-J. Boehm, Sarita V. Adve
https://queue.acm.org/detail.cfm?id=2088916
1. Godbolt, M. Compiler explorer, 2012; https://godbolt.org/.
2. Lemire, D. Faster remainders when the divisor is a constant: beating compilers and libdivide, 2019; http://bit.ly/33nzsy4/.
3. LLVM. The LLVM compiler infrastructure, 2003; https://llvm.org.
4. Padlewski, P. RFC: Devirtualization v2. LLVM; http://lists.llvm.org/pipermail/llvm-dev/2018-March/121931.html.
5. ridiculous_fish. Libdivide, 2010; https://libdivide.com/.
6. Uops. Uops.info Instruction Latency Tables; https://uops.info/table.html.
7. Walfridsson, K. How LLVM optimizes power sums, 2019; https://kristerw.blogspot.com/2019/04/how-llvm-optimizes-geometric-sums.html.
8. Warren, H.S. Hacker's Delight. 2nd edition. Addison-Wesley Professional, 2012.
Copyright held by author/owner. Publication rights licensed to ACM.
Request permission to publish from permissions@acm.org
The Digital Library is published by the Association for Computing Machinery. Copyright © 2020 ACM, Inc.
The line in the code block (c) on page 42
for (int iTimes1234 = 0; iTimes1234 < 100 * iTimes1234 += 1234) {
should probably be
for (int iTimes1234 = 0; iTimes1234 < 100 * 1234; iTimes1234 += 1234) {
Displaying 1 comment