Two programs can solve the same problem with wildly different speeds. A naive search might take hours while a clever algorithm finishes in milliseconds. Understanding why requires diving into both algorithmic complexity and the hidden mechanics of modern hardware.
Performance is where software meets hardware. You can write the most elegant algorithm, but if it fights against how CPUs and memory actually work, it will be slow. Conversely, seemingly ugly code that plays nicely with hardware can outperform theoretically superior approaches. Let's understand both sides.
Algorithmic complexity: Big O
[[Big O]] notation describes how an algorithm's runtime grows with input size. It's not about exact timing - it's about scaling behavior. An O(n) algorithm that takes 1 second for 1,000 items will take about 10 seconds for 10,000 items. An O(n²) algorithm might take 100 seconds for the same growth.
This matters because data sizes can vary enormously. An algorithm that works fine for 100 items might be unusable for 100 million. Big O helps you understand which algorithms will scale and which will hit a wall.
Let's make this concrete with searching:
```javascript
// Linear search - O(n)
// Check every element until we find the target
function linearSearch(arr, target) {
  for (let i = 0; i < arr.length; i++) {
    if (arr[i] === target) return i;
  }
  return -1;
}

// Binary search - O(log n) - requires sorted array
// Eliminate half the remaining elements each step
function binarySearch(arr, target) {
  let low = 0, high = arr.length - 1;
  while (low <= high) {
    const mid = Math.floor((low + high) / 2);
    if (arr[mid] === target) return mid;
    if (arr[mid] < target) low = mid + 1;
    else high = mid - 1;
  }
  return -1;
}
```

For 1 million items, linear search might check 500,000 items on average. Binary search checks at most 20. That's a 25,000x difference! For 1 billion items, binary search still checks at most 30.
| Array Size | Linear Search (avg) | Binary Search (max) |
|---|---|---|
| 1,000 | 500 comparisons | 10 comparisons |
| 1,000,000 | 500,000 comparisons | 20 comparisons |
| 1,000,000,000 | 500,000,000 comparisons | 30 comparisons |
Common complexity classes
| Complexity | Name | Example | 1M items |
|---|---|---|---|
| O(1) | Constant | Array access, hash lookup | 1 op |
| O(log n) | Logarithmic | Binary search, balanced tree | 20 ops |
| O(n) | Linear | Simple loop, linear search | 1M ops |
| O(n log n) | Linearithmic | Merge sort, heap sort | 20M ops |
| O(n²) | Quadratic | Nested loops, bubble sort | 1T ops |
| O(2ⁿ) | Exponential | Subset enumeration | Impossible |
| O(n!) | Factorial | Permutation generation | Impossible |
Beyond Big O: constant factors matter
Big O tells you about scaling, but it hides constant factors. An O(n) algorithm that takes 100 microseconds per operation is slower than an O(n log n) algorithm that takes 1 nanosecond per operation; the crossover where the O(n log n) version finally loses is around n = 2^100,000, far beyond any input you will ever see.
This is why simple O(n²) algorithms like insertion sort outperform 'better' O(n log n) algorithms for small arrays. The overhead of merge sort's recursion and memory allocation isn't worth it for 10 elements. Many sorting libraries switch algorithms based on array size.
```javascript
// Real-world hybrid: use insertion sort for small subarrays
// (insertionSort and partition are the usual quicksort helpers,
// omitted here for brevity)
function hybridSort(arr, low = 0, high = arr.length - 1) {
  // For small arrays, insertion sort is faster despite O(n²)
  if (high - low < 16) {
    insertionSort(arr, low, high);
    return;
  }
  // For larger arrays, use quicksort O(n log n)
  const pivot = partition(arr, low, high);
  hybridSort(arr, low, pivot - 1);
  hybridSort(arr, pivot + 1, high);
}

// This is roughly how production sort implementations work!
```

The memory hierarchy: why cache matters
Modern CPUs are astonishingly fast. A 4 GHz processor completes 4 billion cycles per second. But memory can't keep up. This growing gap - the [[memory wall]] - is one of the most important factors in software performance.
A CPU cycle takes about 0.25 nanoseconds. Reading from main RAM takes 50-100 nanoseconds - a 200-400x slowdown. If the CPU had to wait for memory on every access, it would be idle 99% of the time.
The solution is caching: small, fast memory close to the CPU that stores recently accessed data. Modern CPUs have multiple cache levels, each larger but slower:
CPU Cache Hierarchy
| Level | Size | Latency | Bandwidth |
|---|---|---|---|
| L1 Cache | 32-64 KB | ~1 ns (4 cycles) | ~1 TB/s |
| L2 Cache | 256-512 KB | ~3 ns (12 cycles) | ~500 GB/s |
| L3 Cache | 8-64 MB | ~10 ns (40 cycles) | ~200 GB/s |
| Main RAM | 8-128 GB | ~50-100 ns | ~50 GB/s |
| SSD | 256 GB - 8 TB | ~100,000 ns | ~5 GB/s |
A [[cache miss]] - when data isn't in cache - forces a slow RAM access. Cache-friendly code minimizes misses by exhibiting good locality:
**[[spatial locality]]**: Access memory sequentially. When you read one byte, the CPU loads a whole cache line (typically 64 bytes). Sequential access uses the whole line; random access wastes most of it.
**[[temporal locality]]**: Reuse data while it's still cached. Process everything you need from the current data before moving to new data.
```c
// Cache-FRIENDLY: sequential access (uses spatial locality)
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        sum += matrix[i][j]; // Accesses memory in order
    }
}

// Cache-HOSTILE: strided access (wastes cache lines)
for (int j = 0; j < cols; j++) {
    for (int i = 0; i < rows; i++) {
        sum += matrix[i][j]; // Jumps around in memory
    }
}

// For a 1000x1000 matrix, the friendly version might be 10-100x faster!
```

Data structures and cache performance
Data structure choice dramatically affects cache performance. Arrays are cache-friendly because elements are contiguous. Linked lists are cache-hostile because elements are scattered throughout memory.
```javascript
// Array: elements contiguous in memory
// Traversal is cache-friendly
const arr = [1, 2, 3, 4, 5, 6, 7, 8]; // All in one cache line!

// Linked list: nodes scattered in memory
// Traversal can cause a cache miss for EACH node
class Node {
  constructor(val) {
    this.val = val;
    this.next = null; // Pointer to somewhere else in memory
  }
}
```

This is why arrays often beat linked lists in practice, even for operations where linked lists have better Big O complexity. Inserting into a linked list is O(1) after you find the spot, but traversing to find that spot causes cache misses. Inserting into an array requires O(n) shifting, but the shift is cache-friendly and often faster for reasonable sizes.
Hash tables illustrate the trade-off. Chained hash tables (buckets are linked lists) are cache-hostile. Open addressing (everything in one array) is cache-friendly. This is why modern hash table implementations usually use open addressing with linear or quadratic probing.
The CPU pipeline and instruction-level parallelism
Modern CPUs don't execute one instruction at a time. They use [[pipeline]]s to work on many instructions simultaneously, like an assembly line. While one instruction is executing, the next is being decoded, and the one after that is being fetched.
Classic 5-stage pipeline:
```
Time →     Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5    Cycle 6
──────────────────────────────────────────────────────────────────────
Instr 1    Fetch     Decode    Execute   Memory    Writeback
Instr 2              Fetch     Decode    Execute   Memory     Writeback
Instr 3                        Fetch     Decode    Execute    Memory
Instr 4                                  Fetch     Decode     Execute
Instr 5                                            Fetch      Decode

Result: after startup, one instruction completes every cycle!
```

Superscalar CPUs go further: they can execute multiple independent instructions per cycle. A modern Intel core might have 4-6 execution units working in parallel. This is measured as [[IPC]] - Instructions Per Cycle. A good IPC is 2-4; memory-bound code might see 0.5 or worse.
Branch prediction: guessing the future
Pipelines have a problem: what happens at a conditional branch (if/else)? The CPU doesn't know which way the branch goes until the condition is evaluated, but it needs to fetch the next instruction immediately.
The solution is [[branch prediction]]. The CPU guesses which way the branch will go and starts executing speculatively. If it guesses right, execution continues at full speed. If it guesses wrong, it must flush the pipeline and start over - a 10-20 cycle penalty.
```c
// Predictable branch - CPU predicts correctly almost every time
for (int i = 0; i < 1000000; i++) {
    if (i < 999999) { // True 999,999 times, false once
        sum += i;
    }
}

// Unpredictable branch - CPU wrong ~50% of the time
for (int i = 0; i < 1000000; i++) {
    if (data[i] > 128) { // Random 50/50 based on data
        sum += data[i];
    }
}

// The unpredictable version can be 5-10x slower!
```

A famous example: sorting data before processing can make code faster, even though sorting adds work! After sorting, the branch becomes predictable (first half always false, second half always true).
```c
// Unsorted: random branches, many mispredictions
for (int i = 0; i < n; i++) {
    if (data[i] >= 128) // Unpredictable
        sum += data[i];
}

// Sorted: predictable branches
sort(data, n); // any O(n log n) sort, e.g. qsort - extra work, but...
for (int i = 0; i < n; i++) {
    if (data[i] >= 128) // First half: always false. Second half: always true.
        sum += data[i];
}

// Often faster overall despite sorting cost!
```

Branchless programming
One way to avoid branch misprediction is to eliminate branches entirely. Many conditionals can be rewritten using arithmetic or SIMD instructions:
```c
// Branching version
int max_branching(int a, int b) {
    if (a > b) return a;
    else return b;
}

// Branchless version (compilers often do this automatically)
int max_branchless(int a, int b) {
    return a ^ ((a ^ b) & -(a < b)); // Bit manipulation, no branch
}

// Or use a conditional move (single instruction, no branch)
int max_cmov(int a, int b) {
    int result = a;
    if (b > a) result = b; // Compiler generates CMOV, not a branch
    return result;
}

// For summing conditionally:
// Branching:
if (data[i] >= 128) sum += data[i];
// Branchless:
int mask = (data[i] >= 128) ? -1 : 0; // All 1s or all 0s
sum += data[i] & mask;                // Adds data[i] or 0
```

Memory alignment and padding
CPUs access memory most efficiently when data is aligned to its natural boundary. A 4-byte integer should start at an address divisible by 4. An 8-byte double should start at an address divisible by 8.
```c
// Poor layout: lots of padding, wastes cache
struct BadLayout {
    char a;    // 1 byte
               // 7 bytes padding to align double
    double b;  // 8 bytes
    char c;    // 1 byte
               // 7 bytes padding
}; // Total: 24 bytes!

// Good layout: fields ordered by size
struct GoodLayout {
    double b;  // 8 bytes
    char a;    // 1 byte
    char c;    // 1 byte
               // 6 bytes padding at end
}; // Total: 16 bytes!

// Best layout: pack if alignment isn't critical
struct __attribute__((packed)) PackedLayout {
    double b;  // 8 bytes
    char a;    // 1 byte
    char c;    // 1 byte
}; // Total: 10 bytes (but may have alignment issues)
```

SIMD: doing more with each instruction
SIMD (Single Instruction, Multiple Data) lets you process multiple data elements with one instruction. Instead of adding two numbers, you add eight numbers simultaneously. Modern CPUs have wide SIMD registers: 256 bits (AVX2) or 512 bits (AVX-512).
```c
#include <immintrin.h>

// Scalar: process one element at a time
void add_arrays(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i]; // One addition per iteration
    }
}

// SIMD with AVX: process 8 elements at a time
// (assumes n is a multiple of 8 and the pointers are 32-byte
// aligned, as _mm256_load_ps/_mm256_store_ps require)
void add_arrays_simd(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]); // Load 8 floats
        __m256 vb = _mm256_load_ps(&b[i]); // Load 8 floats
        __m256 vc = _mm256_add_ps(va, vb); // Add 8 pairs
        _mm256_store_ps(&c[i], vc);        // Store 8 floats
    }
}

// 8x theoretical speedup (often 4-6x in practice)
```

Compilers can auto-vectorize simple loops, but complex code often needs manual SIMD or library help. Libraries like Eigen, NumPy, and Intel MKL use SIMD heavily for linear algebra, signal processing, and machine learning.
Putting it together: profiling-guided optimization
With so many factors affecting performance, how do you know what to optimize? The answer is profiling. Measure actual execution, find the hot spots, and focus your effort there.
Programs typically follow the 80/20 rule: 80% of time is spent in 20% of code. Often it's more extreme: 95% of time in 5% of code. Optimizing the cold 80% gives no perceptible improvement. Finding and optimizing the hot 20% (or 5%) can yield dramatic speedups.
```shell
# Linux: perf, a sampling profiler
perf record ./myprogram
perf report

# macOS: Instruments
xcrun xctrace record --template 'Time Profiler' --launch ./myprogram

# Cross-platform: flamegraphs
perf record -g ./myprogram
perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg
```

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered."
— Donald Knuth
The path to fast code is: make it work, make it right, make it fast - in that order. Profile to find bottlenecks. Consider algorithmic improvements first (Big O). Then consider memory access patterns (cache). Finally, consider low-level tricks (SIMD, branchless). Each level gives diminishing returns but the right optimization at the right level can yield 10-100x speedups.