When you write `console.log('Hello')` and press run, something remarkable happens. That simple text - just characters on a screen - gets transformed into electrical signals that physically move electrons through silicon. But how does a computer understand what those characters mean? How does text become action?
The journey from source code to execution is one of the most elegant processes in computing. It involves multiple stages of transformation, each one taking something human-readable and converting it into something closer to what hardware understands. Understanding this pipeline doesn't just satisfy curiosity - it fundamentally changes how you think about programming, debugging, and performance.
The problem of understanding
Here's the fundamental challenge: computers don't understand English, or JavaScript, or any programming language. At their core, they only understand one thing - binary. Sequences of 1s and 0s that represent "voltage on" or "voltage off." So there's an enormous gap between `function add(a, b) { return a + b; }` and the actual instructions a CPU can execute.
Consider what you're really asking a computer to do when you write code. You're describing abstract concepts like 'variables' and 'functions' and 'objects' - concepts that have no physical reality inside a processor. The CPU knows nothing about JavaScript functions. It only knows how to fetch instructions from memory, decode their meaning, execute arithmetic or logic operations, and store results. Everything else is an abstraction built on top.
Bridging this gap requires a sophisticated pipeline of transformations. Each step takes code that's easy for humans to write and gradually converts it into something the machine can execute. Think of it like translation - not just from English to French, but from English to increasingly precise technical specifications, until finally you have exact mechanical instructions.
Step 1: Lexical analysis (tokenization)
The first step is breaking source code into meaningful chunks called [[token]]s. This process, called lexical analysis or tokenization, is performed by a [[lexer]] (also called a scanner or tokenizer). The lexer reads your source code character by character, grouping them into the smallest units of meaning.
Think of the lexer as the part of your brain that recognizes individual words when reading. Before you can understand a sentence, you need to identify where one word ends and another begins. The lexer does the same thing for code - it figures out that 'function' is a keyword, 'myVariable' is an identifier, and '42' is a number.
Consider this code: `let x = 5 + 3;`. The lexer produces these tokens:
Token: KEYWORD Value: "let" Position: 1:1
Token: IDENTIFIER Value: "x" Position: 1:5
Token: OPERATOR Value: "=" Position: 1:7
Token: NUMBER Value: "5" Position: 1:9
Token: OPERATOR Value: "+" Position: 1:11
Token: NUMBER Value: "3" Position: 1:13
Token: PUNCTUATION Value: ";" Position: 1:14

Notice how the lexer doesn't care about the meaning yet - it just identifies what kind of thing each piece of text is. Is it a keyword reserved by the language? An identifier (name) chosen by the programmer? A number? An operator? This classification is crucial for the next step.
The lexer also tracks position information (line and column numbers) which becomes essential for error reporting. When you see 'SyntaxError at line 42, column 15', that position came from the lexer. Without it, finding bugs in large codebases would be nearly impossible.
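A lexer like this is small enough to sketch by hand. The following is a minimal illustration, not any real engine's tokenizer - the token type names and the `tokenize` function are chosen to match the listing above:

```javascript
// Minimal lexer sketch: produces typed tokens with 1-based positions
// for single-line statements like `let x = 5 + 3;`.
const KEYWORDS = new Set(['let', 'const', 'var', 'function', 'return']);

function tokenize(source) {
  const tokens = [];
  let pos = 0;
  const line = 1; // Single-line input in this sketch
  while (pos < source.length) {
    const char = source[pos];
    const col = pos + 1;
    if (char === ' ') { pos++; continue; } // Skip whitespace
    if (/[0-9]/.test(char)) {
      let value = '';
      while (pos < source.length && /[0-9]/.test(source[pos])) value += source[pos++];
      tokens.push({ type: 'NUMBER', value, position: `${line}:${col}` });
    } else if (/[A-Za-z_]/.test(char)) {
      let value = '';
      while (pos < source.length && /[A-Za-z0-9_]/.test(source[pos])) value += source[pos++];
      tokens.push({
        type: KEYWORDS.has(value) ? 'KEYWORD' : 'IDENTIFIER',
        value,
        position: `${line}:${col}`,
      });
    } else if ('+-*/='.includes(char)) {
      tokens.push({ type: 'OPERATOR', value: char, position: `${line}:${col}` });
      pos++;
    } else if (char === ';') {
      tokens.push({ type: 'PUNCTUATION', value: char, position: `${line}:${col}` });
      pos++;
    } else {
      throw new SyntaxError(`Unexpected character '${char}' at ${line}:${col}`);
    }
  }
  return tokens;
}
```

Running `tokenize('let x = 5 + 3;')` yields the seven tokens shown above, positions included - which is exactly the information an error message like "SyntaxError at line 1, column 14" is built from.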
How lexers actually work
Most lexers are built using finite automata - state machines that transition between states based on input characters. When the lexer sees a digit, it enters a 'reading number' state. It continues reading digits (and possibly a decimal point and more digits) until it hits a non-digit, at which point it emits a NUMBER token and transitions back to the start state.
// Simplified lexer logic for numbers
const isDigit = (char) => char >= '0' && char <= '9';

function readNumber(input, position) {
  let value = '';
  let hasDecimal = false;
  while (position < input.length) {
    const char = input[position];
    if (isDigit(char)) {
      value += char;
      position++;
    } else if (char === '.' && !hasDecimal) {
      value += char;
      hasDecimal = true;
      position++;
    } else {
      break; // End of number
    }
  }
  return {
    type: hasDecimal ? 'FLOAT' : 'INTEGER',
    value: parseFloat(value),
    position
  };
}

String literals are trickier because they can contain escape sequences (`\n`, `\t`, `\"`) and span multiple characters. The lexer must track whether it's inside a string and handle escapes correctly. Unicode support adds another layer of complexity - modern lexers must handle characters from any language, including emoji.
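Here is a companion sketch for string literals, in the same style as the number reader. The `readString` helper and its `ESCAPES` table are illustrative; a production lexer would support many more escape forms (`\uXXXX`, `\xNN`, and so on):

```javascript
// Sketch of string-literal scanning with escape handling.
// Supports a small set of escapes; real lexers handle many more.
const ESCAPES = { n: '\n', t: '\t', '"': '"', '\\': '\\' };

function readString(input, position) {
  let value = '';
  position++; // Skip the opening quote
  while (position < input.length && input[position] !== '"') {
    if (input[position] === '\\') {
      const escaped = input[position + 1];
      if (!(escaped in ESCAPES)) {
        throw new SyntaxError(`Unknown escape: \\${escaped}`);
      }
      value += ESCAPES[escaped]; // Translate the escape sequence
      position += 2;             // Consume backslash and escaped character
    } else {
      value += input[position++];
    }
  }
  if (position >= input.length) {
    throw new SyntaxError('Unterminated string literal');
  }
  position++; // Skip the closing quote
  return { type: 'STRING', value, position };
}
```

Note how the lexer must distinguish the two-character source sequence `\n` from the single newline character it denotes - the token's value holds the decoded string, not the raw source text.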
Step 2: Parsing (syntax analysis)
Once we have tokens, we need to understand their structure. The [[parser]] takes the flat list of tokens and builds a tree structure called an [[AST]] (Abstract Syntax Tree). This tree represents the hierarchical relationship between different parts of your code.
Why a tree? Because code has nested structure. A function contains statements. Statements contain expressions. Expressions contain operators and operands. This nesting is naturally represented as a tree where each node can have children. The tree structure also encodes precedence - `2 + 3 * 4` must be parsed so that multiplication happens first.
For `let x = 5 + 3;`, the AST might look like:
VariableDeclaration
├── kind: "let"
├── name: Identifier("x")
└── initializer: BinaryExpression
    ├── operator: "+"
    ├── left: NumericLiteral(5)
    └── right: NumericLiteral(3)

The parser validates that your code follows the grammar rules of the language. These rules are typically expressed as a [[context-free grammar]] - a formal specification of which token sequences are valid. If you write `let = 5 x;`, the lexer will happily produce tokens, but the parser will reject it because that sequence doesn't match any valid grammatical pattern.
Parsing techniques
There are two main families of parsing algorithms: top-down and bottom-up. Top-down parsers (like recursive descent) start from the grammar's start symbol and try to derive the input. Bottom-up parsers (like LR parsers) start from the input and try to reduce it to the start symbol.
Recursive descent is the most intuitive approach. Each grammar rule becomes a function. The function for 'expression' might call the function for 'term', which calls the function for 'factor'. This mirrors the grammar's structure directly:
// Recursive descent parser for arithmetic
function parseExpression() {
  let left = parseTerm();
  while (currentToken.type === 'PLUS' || currentToken.type === 'MINUS') {
    const operator = currentToken.value;
    advance(); // Consume operator
    const right = parseTerm();
    left = { type: 'BinaryExpression', operator, left, right };
  }
  return left;
}

function parseTerm() {
  let left = parseFactor();
  while (currentToken.type === 'STAR' || currentToken.type === 'SLASH') {
    const operator = currentToken.value;
    advance();
    const right = parseFactor();
    left = { type: 'BinaryExpression', operator, left, right };
  }
  return left;
}

Notice how operator precedence emerges naturally from the grammar structure. Multiplication and division are handled in `parseTerm`, which is called by `parseExpression`. This means `*` and `/` bind more tightly than `+` and `-`.
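To see that precedence really does emerge this way, here is a self-contained version of the same idea. The token cursor (`current`/`advance`) and the `parseFactor` rule for number literals are filled in by hand and are illustrative, not from any particular parser library:

```javascript
// Self-contained recursive descent parser for + - * / over number tokens.
function parse(tokens) {
  let index = 0;
  const current = () => tokens[index] ?? { type: 'EOF' };
  const advance = () => tokens[index++];

  // factor -> NUMBER
  function parseFactor() {
    const token = current();
    if (token.type === 'NUMBER') {
      advance();
      return { type: 'NumericLiteral', value: token.value };
    }
    throw new SyntaxError(`Unexpected token: ${token.type}`);
  }

  // term -> factor (('*' | '/') factor)*
  function parseTerm() {
    let left = parseFactor();
    while (current().type === 'STAR' || current().type === 'SLASH') {
      const operator = advance().value;
      left = { type: 'BinaryExpression', operator, left, right: parseFactor() };
    }
    return left;
  }

  // expression -> term (('+' | '-') term)*
  function parseExpression() {
    let left = parseTerm();
    while (current().type === 'PLUS' || current().type === 'MINUS') {
      const operator = advance().value;
      left = { type: 'BinaryExpression', operator, left, right: parseTerm() };
    }
    return left;
  }

  return parseExpression();
}

// Parse 2 + 3 * 4: the '*' node ends up nested under the '+' node,
// so multiplication binds tighter without any explicit precedence table.
const ast = parse([
  { type: 'NUMBER', value: 2 },
  { type: 'PLUS', value: '+' },
  { type: 'NUMBER', value: 3 },
  { type: 'STAR', value: '*' },
  { type: 'NUMBER', value: 4 },
]);
```

The resulting root node is the `+` expression, with the `*` expression as its right child - the grammar's call structure did the precedence work.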
Step 3: Semantic analysis
[[semantic analysis]] goes beyond syntax to check meaning. The code `let x = y + 1;` is syntactically valid, but if `y` hasn't been defined, it's semantically invalid. This phase catches errors that the parser can't detect:
• Undefined variables and functions
• Type mismatches (in typed languages)
• Duplicate declarations in the same scope
• Invalid operations (calling a number as a function)
• Scope violations (accessing private members)
• Arity errors (wrong number of arguments)
• Break/continue outside loops
• Return with a value in a void function

The semantic analyzer builds and maintains a [[symbol table]] - a data structure that tracks every identifier in the program along with its type, scope, and other attributes. As it walks the AST, it adds new declarations to the symbol table and looks up references to verify they exist.
// Simplified symbol table operations
class SymbolTable {
  constructor(parent = null) {
    this.symbols = new Map();
    this.parent = parent; // For nested scopes
  }
  define(name, type, attributes) {
    if (this.symbols.has(name)) {
      throw new Error(`Duplicate declaration: ${name}`);
    }
    this.symbols.set(name, { type, ...attributes });
  }
  lookup(name) {
    if (this.symbols.has(name)) {
      return this.symbols.get(name);
    }
    if (this.parent) {
      return this.parent.lookup(name); // Search outer scope
    }
    throw new Error(`Undefined: ${name}`);
  }
}

For statically-typed languages like TypeScript or Rust, semantic analysis also performs type checking. It verifies that you're not trying to add a number to an array, call a method that doesn't exist, or pass the wrong type of argument to a function. Type inference - figuring out types that weren't explicitly declared - also happens here.
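For a flavor of what type checking looks like, here is a toy checker over the AST node shapes used earlier. The type names and rules are invented for illustration - real checkers like TypeScript's handle unions, generics, inference, and far more:

```javascript
// Toy type checker: infers 'number' or 'string' for literals and
// binary expressions, rejecting ill-typed combinations.
function inferType(node) {
  switch (node.type) {
    case 'NumericLiteral':
      return 'number';
    case 'StringLiteral':
      return 'string';
    case 'BinaryExpression': {
      const left = inferType(node.left);
      const right = inferType(node.right);
      // '+' works on two numbers (addition) or two strings (concatenation)
      if (node.operator === '+' && left === right) return left;
      // Other arithmetic operators require numbers on both sides
      if (node.operator !== '+' && left === 'number' && right === 'number') {
        return 'number';
      }
      throw new TypeError(
        `Cannot apply '${node.operator}' to ${left} and ${right}`
      );
    }
    default:
      throw new TypeError(`Unknown node: ${node.type}`);
  }
}
```

Notice that the checker is just another tree walk: types flow bottom-up from literals to the expressions that contain them, and a mismatch anywhere in the tree surfaces as a type error.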
Step 4: Intermediate representation
After validation, the code is transformed into an [[IR]] - a format that's lower-level than source code but not yet machine-specific. Think of it as a universal assembly language. The most famous IR is LLVM IR, used by compilers for C, C++, Rust, Swift, Julia, and many others.
IR is designed to be easy to optimize and easy to translate to machine code. It uses an infinite number of virtual registers (unlike real CPUs with limited registers), simple instructions, and explicit control flow. This simplicity makes analysis and transformation tractable.
; LLVM IR for: int add(int a, int b) { return a + b; }
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add i32 %a, %b
  ret i32 %sum
}

; More complex example with control flow
define i32 @abs(i32 %x) {
entry:
  %is_negative = icmp slt i32 %x, 0
  br i1 %is_negative, label %negate, label %done
negate:
  %negated = sub i32 0, %x
  br label %done
done:
  %result = phi i32 [ %x, %entry ], [ %negated, %negate ]
  ret i32 %result
}

Why bother with IR? Because it separates concerns. The frontend (lexer, parser, semantic analyzer) deals with language-specific details. The backend (optimizer, code generator) deals with machine-specific details. IR sits in the middle, enabling:
| Benefit | Explanation |
|---|---|
| Cross-platform compilation | One frontend can target multiple architectures through different backends |
| Language interoperability | Multiple frontends can share the same optimizer and backends |
| Optimization reuse | Optimizations written for IR benefit all languages using it |
| Simpler compilers | Frontend writers don't need to know machine details |
| Better optimizations | IR makes program structure explicit, enabling deeper analysis |
LLVM's design has been so successful that it's become the de facto standard for new language implementations. When a new language wants good performance across platforms, using LLVM is often the fastest path. The Rust compiler, for instance, generates LLVM IR and lets LLVM handle all the complex optimization and code generation.
Step 5: Optimization
Before generating machine code, the compiler applies optimizations to the IR. These transformations preserve the program's meaning while improving performance. Modern optimizers are remarkably sophisticated, performing dozens of different optimization passes.
// Original code
int sum_squares(int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        int sq = i * i;
        total = total + sq;
    }
    return total;
}

// After optimization, equivalent to:
int sum_squares(int n) {
    // Compiler might use closed-form formula
    // or heavily optimize the loop with:
    // - Loop unrolling
    // - Strength reduction (avoid multiplication)
    // - SIMD vectorization
    return n * (n - 1) * (2 * n - 1) / 6; // Mathematical identity
}

Key optimizations include constant folding (computing `2 + 3` at compile time), dead code elimination (removing unreachable code), function inlining (replacing calls with function bodies), loop unrolling (reducing loop overhead), and many more. We'll explore these in depth in the 'Compilers explained' chapter.
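Constant folding in particular is easy to sketch as an AST-to-AST pass, using the node shapes from the parsing examples. (A real optimizer would run on IR rather than the source AST, but the idea is the same.)

```javascript
// Constant folding pass: replace any BinaryExpression whose operands
// fold down to numeric literals with the computed literal.
function foldConstants(node) {
  if (node.type !== 'BinaryExpression') return node;
  const left = foldConstants(node.left);   // Fold children first
  const right = foldConstants(node.right);
  if (left.type === 'NumericLiteral' && right.type === 'NumericLiteral') {
    const ops = {
      '+': (a, b) => a + b,
      '-': (a, b) => a - b,
      '*': (a, b) => a * b,
      '/': (a, b) => a / b,
    };
    return { type: 'NumericLiteral', value: ops[node.operator](left.value, right.value) };
  }
  // Operands weren't constant; keep the expression with folded children
  return { ...node, left, right };
}
```

Folding the AST for `2 + 3 * 4` collapses the whole tree to a single `NumericLiteral(14)` - no arithmetic is left for runtime.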
Step 6: Machine code generation
Finally, the IR is converted to actual machine code - binary instructions that the CPU can execute directly. This involves several challenging problems:
**Instruction selection**: Choosing which CPU instructions to use. Different CPUs have different instruction sets (x86-64, ARM64, RISC-V), and even within an architecture, there are often multiple ways to accomplish the same thing. The code generator must pick the best options.
**Register allocation**: Deciding which CPU registers to use for which values. Registers are tiny, ultra-fast storage locations inside the CPU. A typical x86-64 processor has only 16 general-purpose registers, but a function might use dozens of variables. The allocator must decide what goes in registers versus memory, minimizing costly memory accesses.
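The classic approach here is linear scan allocation: walk values in order of first use, hand out free registers, and spill to memory when none remain. The following is a heavily simplified sketch - the interval format, register names, and `'SPILL'` marker are invented for illustration, and real allocators (graph coloring, LLVM's greedy allocator) are far more elaborate:

```javascript
// Linear scan register allocation sketch.
// Each live interval is { name, start, end } in instruction order.
function linearScan(intervals, registers) {
  const sorted = [...intervals].sort((a, b) => a.start - b.start);
  const free = [...registers];
  const active = [];       // Intervals currently holding a register
  const assignment = {};   // name -> register (or 'SPILL')
  for (const interval of sorted) {
    // Expire intervals that ended before this one starts, freeing registers
    for (let i = active.length - 1; i >= 0; i--) {
      if (active[i].end < interval.start) {
        free.push(assignment[active[i].name]);
        active.splice(i, 1);
      }
    }
    if (free.length > 0) {
      assignment[interval.name] = free.shift();
      active.push(interval);
    } else {
      assignment[interval.name] = 'SPILL'; // No register left: keep in memory
    }
  }
  return assignment;
}
```

With two registers and three overlapping live ranges, the third value gets spilled - exactly the pressure a code generator must manage when a function has more live variables than the CPU has registers.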
**Instruction scheduling**: Ordering instructions to maximize performance. Modern CPUs can execute multiple instructions simultaneously, but only if there are no dependencies between them. Good scheduling exposes parallelism and hides memory latency.
; x86-64 assembly for: int add(int a, int b) { return a + b; }
; System V AMD64 ABI: first args in rdi, rsi; return in rax
add:
    lea eax, [rdi + rsi]  ; Single instruction: eax = rdi + rsi
    ret                   ; Return (result in eax)

; ARM64 assembly for the same function
add:
    add w0, w0, w1        ; w0 = w0 + w1 (first args in w0, w1)
    ret                   ; Return (result in w0)

The complete picture
Let's trace through the entire journey for a simple expression: `2 + 3 * 4`
Source: "2 + 3 * 4"
  ↓ Lexer
Tokens: [NUM:2] [OP:+] [NUM:3] [OP:*] [NUM:4]
  ↓ Parser (respects precedence!)
AST: BinaryExpr(+)
     ├── NumericLiteral(2)
     └── BinaryExpr(*)
         ├── NumericLiteral(3)
         └── NumericLiteral(4)
  ↓ Code Generator
IR: %1 = mul i32 3, 4
    %2 = add i32 2, %1
  ↓ Optimizer (constant folding)
IR: ret i32 14
  ↓ Backend
x86: mov eax, 14
     ret

Notice how the parser correctly handles operator precedence - multiplication before addition. This is why `2 + 3 * 4` equals 14, not 20. The tree structure encodes this precedence. And the optimizer is clever enough to compute the entire expression at compile time!
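The whole trace can be compressed into a few lines of JavaScript. This toy `compileConstant` function is illustrative only: it lexes with a regular expression, parses with recursive descent, and folds constants as it parses rather than in a separate pass over IR.

```javascript
// End-to-end sketch: source text for a constant arithmetic expression
// in, its folded value out. Lexes, parses (with precedence), and folds.
function compileConstant(source) {
  // Lex: integer literals and single-character operators/parens
  const tokens = source.match(/\d+|[+\-*/()]/g).map((text) =>
    /\d/.test(text) ? { type: 'NUM', value: Number(text) } : { type: text }
  );
  let index = 0;

  function factor() {
    const token = tokens[index++];
    if (token.type === 'NUM') return token.value;
    if (token.type === '(') {
      const value = expr();
      index++; // Consume ')'
      return value;
    }
    throw new SyntaxError(`Unexpected token: ${token.type}`);
  }
  function term() {
    let value = factor();
    while (index < tokens.length && (tokens[index].type === '*' || tokens[index].type === '/')) {
      value = tokens[index++].type === '*' ? value * factor() : value / factor();
    }
    return value;
  }
  function expr() {
    let value = term();
    while (index < tokens.length && (tokens[index].type === '+' || tokens[index].type === '-')) {
      value = tokens[index++].type === '+' ? value + term() : value - term();
    }
    return value;
  }
  return expr();
}
```

`compileConstant('2 + 3 * 4')` returns 14, and parentheses override precedence just as the grammar dictates: `compileConstant('(2 + 3) * 4')` returns 20.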
Different paths for different languages
Not all languages follow the exact same path:
| Language Type | Compilation Strategy | Examples |
|---|---|---|
| Compiled (AOT) | Full compilation to machine code before execution | C, C++, Rust, Go, Zig |
| Interpreted | Execute from AST or simple bytecode at runtime | Python (CPython), Ruby, Perl |
| JIT Compiled | Compile hot paths to machine code during execution | JavaScript (V8), Java (HotSpot), C# (.NET), PyPy |
| Transpiled | Compile to another high-level language | TypeScript → JavaScript, CoffeeScript → JavaScript |
| Hybrid | Ahead-of-time compilation plus runtime JIT | Julia, Dart |
But regardless of the approach, they all start the same way: text goes in, tokens come out, trees get built, and eventually something executable emerges. Understanding this shared foundation helps you reason about any programming language.
Why this matters for you
Understanding this pipeline helps you:
**Debug better**: When you see "Unexpected token", you know the lexer found something it couldn't classify. "Syntax error" means the parser couldn't build a valid tree. "Type error" means semantic analysis caught a problem. "Linker error" means the compiled pieces couldn't be combined into a final executable. Each error type points to a different stage.
**Write better code**: Knowing that parsing is a real cost helps you appreciate why simpler syntax often performs better at compile time. Understanding optimization helps you write code that compilers can optimize effectively. Knowing about register allocation explains why fewer local variables can sometimes improve performance.
**Choose better tools**: Understanding the difference between compiled, interpreted, and JIT-compiled languages helps you pick the right tool for the job and set realistic performance expectations. AOT compilation gives predictable performance but slower iteration; JIT compilation offers flexibility with eventual speed.
**Build better tools**: Many developer tools - linters, formatters, refactoring tools, syntax highlighters - are essentially partial compilers. They lex, parse, and analyze code without generating machine code. Understanding the compilation pipeline is the foundation for building these tools.
Programs must be written for people to read, and only incidentally for machines to execute.
— Harold Abelson, Structure and Interpretation of Computer Programs