While interpreters execute code on-the-fly, compilers take a different approach: translate everything to machine code before running anything. This [[AOT]] (Ahead-Of-Time) compilation pays an upfront cost - compilation takes time - but the resulting program runs at full native speed with no interpretation overhead.
Compilation is one of the most sophisticated automated transformations in computing. A compiler must understand your code deeply enough to preserve its meaning while restructuring it dramatically for performance. Modern optimizing compilers can produce code that's faster than what most humans could write by hand.
The three phases of compilation
Modern compilers are organized into three major phases, each handling different concerns:
Compiler Pipeline: everything starts from a human-readable program written by the developer, for example:

int main() {
    int x = 5;
    return x * 2;
}

This source code is:
- Written in a high-level language (C, Rust, etc.)
- Human-readable syntax
- Full of variables, functions, and control flow
Frontend: understanding the source
The frontend handles everything language-specific: lexing, parsing, semantic analysis. It transforms human-readable source code into an intermediate representation that captures the program's meaning without language-specific syntax.
A key insight: the frontend is where language design lives. Want to add a new language feature? Modify the frontend. The rest of the compiler doesn't care whether the IR came from C, Rust, Swift, or a new language you invented. This separation is why LLVM can support dozens of languages.
Middle end: optimization
The middle end transforms IR to make it faster, smaller, or both. This is where the magic happens - turning naive code into highly efficient code. These transformations are called [[optimization pass]]es, and modern compilers apply dozens of them in carefully ordered sequences.
Crucially, optimizations operate on IR, not source code or machine code. This means the same optimization benefits every language that compiles to that IR, and every target architecture benefits from the same analysis.
Backend: generating machine code
The backend transforms IR into machine code for a specific architecture. Want to target ARM instead of x86? Swap the backend. Want to compile to WebAssembly for browsers? Add a WebAssembly backend. The frontend and middle end stay the same.
Static Single Assignment (SSA)
Most modern compilers use [[SSA]] form for their intermediate representation. In SSA, each variable is assigned exactly once. If a variable is reassigned in the source, the SSA form creates new versions:
// Original code
int x = 5;
x = x + 1;
x = x * 2;
// In SSA form (conceptually)
x1 = 5
x2 = x1 + 1
x3 = x2 * 2
// With control flow (phi nodes handle merging)
if (condition) {
x = 1; // x2 = 1
} else {
x = 2; // x3 = 2
}
use(x); // x4 = φ(x2, x3) -- "phi" picks based on which branch ran

SSA might seem like extra complexity, but it simplifies many optimizations dramatically. When each variable has exactly one definition, tracking where values come from is trivial. Dead code elimination, constant propagation, and common subexpression elimination all become simpler.
Optimization techniques
Let's explore the optimizations that make compiled code fast:
Constant folding and propagation
[[constant folding]] computes expressions that can be evaluated at compile time:
// Before optimization
int x = 2 + 3 * 4;
int y = x * 2;
// After constant folding
int x = 14; // 2 + 3 * 4 computed at compile time
int y = 28; // 14 * 2 computed at compile time
// Zero runtime cost!

Constant propagation extends this by tracking known constant values through the program. If a variable is definitely constant at a certain point, the compiler substitutes the value, potentially enabling more folding.
const int SIZE = 1024;
const int BYTES = SIZE * sizeof(int); // Constant propagation
void process() {
for (int i = 0; i < SIZE; i++) { // SIZE is known: 1024
buffer[i] = i * BYTES; // BYTES is known: 4096
}
}

Dead code elimination
[[dead code elimination]] removes code that can never run or whose results are never used:
// Before optimization
int calculate(int x) {
int unused = expensive_function(); // Result never used (removable only if the call has no side effects)
if (false) {
do_something(); // Never executes
}
int temp = x * 2;
temp = x * 3; // Previous assignment dead
return x * 4; // temp never used either!
}
// After dead code elimination
int calculate(int x) {
return x * 4; // Only the essential logic remains
}

This optimization is more powerful than it might seem. Combined with constant propagation and inlining, entire branches of conditional logic can be proved unreachable and eliminated. Configuration checks that always go one way become zero-cost.
Function inlining
[[inlining]] replaces function calls with the function's body, eliminating call overhead:
// Before inlining
inline int square(int x) { return x * x; }
inline int add(int a, int b) { return a + b; }
int result = add(square(3), square(4));
// After inlining
int result = (3 * 3) + (4 * 4);
// After constant folding
int result = 25; // Entire computation gone!

Inlining is powerful because it enables other optimizations. Once code is inlined, the optimizer sees more context. Constants from the call site can propagate into the function. Dead code that depends on the specific arguments can be eliminated.
Loop optimizations
Programs spend most of their time in loops, so loop optimization is critical:
// Loop invariant code motion (LICM)
// Before:
for (int i = 0; i < n; i++) {
result[i] = data[i] * (a + b); // a + b recomputed every iteration!
}
// After LICM:
int temp = a + b; // Hoisted out of loop
for (int i = 0; i < n; i++) {
result[i] = data[i] * temp; // Computed once
}
// Loop unrolling - reduces branch overhead
// Before:
for (int i = 0; i < 1000; i++) {
process(i);
}
// After unrolling (by 4):
for (int i = 0; i < 1000; i += 4) {
process(i);
process(i + 1);
process(i + 2);
process(i + 3);
}
// 4x fewer loop iterations, fewer branches

Other loop optimizations include strength reduction (replacing expensive operations with cheaper ones - e.g., multiplication by the loop counter becomes repeated addition), loop fusion (combining adjacent loops), and loop tiling (restructuring for cache efficiency).
Vectorization (SIMD)
Modern CPUs can perform the same operation on multiple data elements simultaneously. This is SIMD - Single Instruction, Multiple Data. Compilers can automatically vectorize suitable loops:
// Original scalar code (a, b, c are float arrays)
for (int i = 0; i < 1000; i++) {
c[i] = a[i] + b[i];
}
// Compiler auto-vectorizes to (conceptually):
for (int i = 0; i < 1000; i += 8) {
// Load 8 floats at once (256 bits)
__m256 va = _mm256_load_ps(&a[i]);
__m256 vb = _mm256_load_ps(&b[i]);
// Add all 8 pairs simultaneously
__m256 vc = _mm256_add_ps(va, vb);
// Store 8 results at once
_mm256_store_ps(&c[i], vc);
}
// 8x throughput improvement!

Register allocation
[[register allocation]] is one of the trickiest parts of compilation. CPUs have a small number of registers - fast storage built directly into the processor. A typical x86-64 CPU has 16 general-purpose registers. A function might use dozens of variables.
The register allocator decides which values live in registers (fast) versus memory (slow). Accessing a register takes 1 cycle. Accessing memory might take 100+ cycles if it misses all caches. Good allocation can make a 2-10x performance difference.
; Good register allocation - everything in registers
calculate:
mov eax, edi ; a in eax (from first arg register)
imul eax, esi ; multiply by b (second arg register)
add eax, edx ; add c (third arg register)
ret ; result in eax, no memory touched
; Poor allocation - constant memory spills
calculate:
mov [rbp-4], edi ; Spill a to stack
mov [rbp-8], esi ; Spill b to stack
mov [rbp-12], edx ; Spill c to stack
mov eax, [rbp-4] ; Reload a
imul eax, [rbp-8] ; Multiply with memory operand
add eax, [rbp-12] ; Add with memory operand
ret ; Much slower!

Register allocation is NP-complete (computationally hard), so compilers use sophisticated heuristics. Graph coloring is the classic approach: build a graph where nodes are values and edges connect values that are live at the same time. The goal is to color the nodes (assign registers) so that no adjacent nodes share a color. If you run out of colors, some values must 'spill' to memory.
LLVM: the modular compiler infrastructure
LLVM has revolutionized compiler development by providing a reusable infrastructure. Instead of building a complete compiler from scratch, language designers can:
1. Write a frontend that emits LLVM IR
2. Get 100+ LLVM optimization passes for free
3. Target any LLVM backend (x86, ARM, RISC-V, WebAssembly, GPU...)
4. Use LLVM's debugger, profiler, and tooling integration

Languages using LLVM include Rust, Swift, C/C++/Objective-C (via Clang), Julia, Kotlin Native, and many others. When a new CPU architecture emerges, adding an LLVM backend gives all these languages support for it.
| LLVM Component | Purpose |
|---|---|
| Clang | C/C++/Objective-C frontend |
| LLVM Core | Optimizer and IR infrastructure |
| LLDB | Debugger |
| LLD | Fast linker |
| libc++ | C++ standard library |
| Polly | Polyhedral loop optimizer |
| MLIR | Multi-level IR for domain-specific compilers |
Linking: putting it together
Compilation produces object files - machine code with unresolved references. [[linking]] combines these files and resolves references:
main.o:
- main() defined
- calls calculate() [unresolved - where is it?]
- calls printf() [unresolved]
math.o:
- calculate() defined
- calls sin() [unresolved]
libm.a (math library):
- sin() defined
- cos() defined
- ...
libc.a (C library):
- printf() defined
- malloc() defined
- ...
Linker combines them:
→ myprogram (executable)
- All references resolved
- Entry point set
- Ready to run

There are two types of linking:
**Static linking** copies library code into the executable. The result is larger but self-contained - you can run it on any compatible system without installing libraries.
**Dynamic linking** references shared libraries (.dll, .so, .dylib) that are loaded at runtime. Smaller executables, shared memory between processes, easier updates - but requires the libraries to be installed.
Debug vs release builds
Compilers offer different optimization levels that trade compile time and debuggability for performance:
| Flag (GCC/Clang) | Optimization Level | Use Case |
|---|---|---|
| -O0 | None (default) | Debugging - predictable code mapping |
| -O1 | Basic | Faster builds with some optimization |
| -O2 | Standard | Production - good balance |
| -O3 | Aggressive | Maximum speed - may increase code size |
| -Os | Size | Embedded systems - minimize binary size |
| -Oz | Minimum size | Even smaller than -Os |
| -Ofast | Unsafe | Breaks strict standards for speed |
# Debug build - fast compile, slow run, good debugging
gcc -O0 -g main.c -o main_debug
# Release build - slow compile, fast run
gcc -O3 -flto main.c -o main_release
# Size-optimized build
gcc -Os main.c -o main_small
# Check what optimizations are enabled
gcc -O2 -Q --help=optimizers | grep enabled

The compilation speed trade-off
Compilation takes time. Complex optimizations on large codebases can take minutes or hours - a clean, fully optimized build of a project the size of the Chromium browser can run for hours on a single machine. This creates a fundamental trade-off:
**Fast compilation** → Better developer experience, faster iteration
**Heavy optimization** → Better runtime performance, slower builds
Modern tools attack this from multiple angles:
**Incremental compilation**: Only recompile changed code. Rust, Go, and most IDEs do this automatically.
**Caching**: Reuse previous compilation results. Tools like ccache, sccache, and Gradle's build cache help.
**Distributed compilation**: Spread compilation across many machines. distcc, Incredibuild, and Buck support this.
**Tiered compilation**: Quick compile for development, full optimize for release. Many CI systems use this.
Premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning; he will be wise to look carefully at the critical code, but only after that code has been identified.
— Donald Knuth