Why some code runs faster

Hardware meets software


Two programs can solve the same problem with wildly different speeds. A naive search might take hours while a clever algorithm finishes in milliseconds. Understanding why requires diving into both algorithmic complexity and the hidden mechanics of modern hardware.

Performance is where software meets hardware. You can write the most elegant algorithm, but if it fights against how CPUs and memory actually work, it will be slow. Conversely, seemingly ugly code that plays nicely with hardware can outperform theoretically superior approaches. Let's understand both sides.

Algorithmic complexity: Big O

[[Big O]] notation describes how an algorithm's runtime grows with input size. It's not about exact timing - it's about scaling behavior. An O(n) algorithm that takes 1 second for 1,000 items will take about 10 seconds for 10,000 items. An O(n²) algorithm might take 100 seconds for the same growth.

This matters because data sizes can vary enormously. An algorithm that works fine for 100 items might be unusable for 100 million. Big O helps you understand which algorithms will scale and which will hit a wall.
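To see the scaling difference directly, here is a small sketch (illustrative helper names, not from any library) that counts the work done by a linear and a quadratic loop on the same input size:

```javascript
// Count the basic operations performed by linear vs quadratic loops.
// Multiplying n by 10 multiplies linearOps by 10 but quadraticOps by 100.
function linearOps(n) {
    let ops = 0;
    for (let i = 0; i < n; i++) ops++;
    return ops;
}

function quadraticOps(n) {
    let ops = 0;
    for (let i = 0; i < n; i++)
        for (let j = 0; j < n; j++) ops++;
    return ops;
}
```

Growing n from 1,000 to 10,000 multiplies `linearOps` by 10 but `quadraticOps` by 100 - exactly the scaling behavior Big O captures.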

Interactive demo: visualize how different complexity classes scale with input size (at n = 10, O(log n) is ~3 operations, O(n²) is ~100, and O(2ⁿ) is already ~1,024).

Let's make this concrete with searching:

// Linear search - O(n)
// Check every element until we find the target
function linearSearch(arr, target) {
    for (let i = 0; i < arr.length; i++) {
        if (arr[i] === target) return i;
    }
    return -1;
}

// Binary search - O(log n) - requires sorted array
// Eliminate half the remaining elements each step
function binarySearch(arr, target) {
    let low = 0, high = arr.length - 1;
    
    while (low <= high) {
        const mid = Math.floor((low + high) / 2);
        
        if (arr[mid] === target) return mid;
        if (arr[mid] < target) low = mid + 1;
        else high = mid - 1;
    }
    return -1;
}

For 1 million items, linear search might check 500,000 items on average. Binary search checks at most 20. That's a 25,000x difference! For 1 billion items, binary search still checks at most 30.

| Array size | Linear search (avg) | Binary search (max) |
|---|---|---|
| 1,000 | 500 comparisons | 10 comparisons |
| 1,000,000 | 500,000 comparisons | 20 comparisons |
| 1,000,000,000 | 500,000,000 comparisons | 30 comparisons |
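You can check these comparison counts yourself with an instrumented binary search (a sketch: the `steps` counter is added purely for illustration):

```javascript
// Binary search that also reports how many comparisons it made.
function binarySearchCount(arr, target) {
    let low = 0, high = arr.length - 1, steps = 0;
    while (low <= high) {
        steps++;  // One probe of the array per loop iteration
        const mid = Math.floor((low + high) / 2);
        if (arr[mid] === target) return { index: mid, steps };
        if (arr[mid] < target) low = mid + 1;
        else high = mid - 1;
    }
    return { index: -1, steps };
}
```

On a sorted array of one million elements, `steps` never exceeds 20, matching the table above.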

Common complexity classes

| Complexity | Name | Example | 1M items |
|---|---|---|---|
| O(1) | Constant | Array access, hash lookup | 1 op |
| O(log n) | Logarithmic | Binary search, balanced tree | 20 ops |
| O(n) | Linear | Simple loop, linear search | 1M ops |
| O(n log n) | Linearithmic | Merge sort, heap sort | 20M ops |
| O(n²) | Quadratic | Nested loops, bubble sort | 1T ops |
| O(2ⁿ) | Exponential | Subset enumeration | Impossible |
| O(n!) | Factorial | Permutation generation | Impossible |

Beyond Big O: constant factors matter

Big O tells you about scaling, but it hides constant factors. An O(n) algorithm that takes 100 microseconds per operation is slower than an O(n log n) algorithm that takes 1 nanosecond per operation, at least until n gets very large.

This is why simple O(n²) algorithms like insertion sort can outperform 'better' O(n log n) algorithms for small arrays: the overhead of merge sort's recursion and memory allocation isn't worth it for 10 elements. Many sorting libraries switch algorithms based on array size.

// Real-world hybrid: use insertion sort for small subarrays
function hybridSort(arr, low = 0, high = arr.length - 1) {
    // For small arrays, insertion sort is faster despite O(n²)
    if (high - low < 16) {
        insertionSort(arr, low, high);
        return;
    }
    
    // For larger arrays, use quicksort O(n log n)
    const pivot = partition(arr, low, high);
    hybridSort(arr, low, pivot - 1);
    hybridSort(arr, pivot + 1, high);
}

// Standard in-place helpers
function insertionSort(arr, low, high) {
    for (let i = low + 1; i <= high; i++) {
        const key = arr[i];
        let j = i - 1;
        while (j >= low && arr[j] > key) { arr[j + 1] = arr[j]; j--; }
        arr[j + 1] = key;
    }
}

function partition(arr, low, high) {
    const pivot = arr[high];  // Lomuto scheme: last element as pivot
    let i = low - 1;
    for (let j = low; j < high; j++) {
        if (arr[j] <= pivot) { i++; [arr[i], arr[j]] = [arr[j], arr[i]]; }
    }
    [arr[i + 1], arr[high]] = [arr[high], arr[i + 1]];
    return i + 1;
}

// This is roughly how production sort implementations work!

The memory hierarchy: why cache matters

Modern CPUs are astonishingly fast. A 4 GHz processor completes 4 billion cycles per second. But memory can't keep up. This growing gap - the [[memory wall]] - is one of the most important factors in software performance.

A CPU cycle takes about 0.25 nanoseconds. Reading from main RAM takes 50-100 nanoseconds - a 200-400x slowdown. If the CPU had to wait for memory on every access, it would be idle 99% of the time.

The solution is caching: small, fast memory close to the CPU that stores recently accessed data. Modern CPUs have multiple cache levels, each larger but slower:

Interactive demo: explore CPU cache levels and memory access patterns. Accessing RAM is roughly 50x slower than L1 cache, so code with good spatial and temporal locality keeps its working data in the fast levels.
| Level | Size | Latency | Bandwidth |
|---|---|---|---|
| L1 cache | 32-64 KB | ~1 ns (4 cycles) | ~1 TB/s |
| L2 cache | 256-512 KB | ~3 ns (12 cycles) | ~500 GB/s |
| L3 cache | 8-64 MB | ~10 ns (40 cycles) | ~200 GB/s |
| Main RAM | 8-128 GB | ~50-100 ns | ~50 GB/s |
| SSD | 256 GB - 8 TB | ~100,000 ns | ~5 GB/s |

A [[cache miss]] - when data isn't in cache - forces a slow RAM access. Cache-friendly code minimizes misses by exhibiting good locality:

**[[spatial locality]]**: Access memory sequentially. When you read one byte, the CPU loads a whole cache line (typically 64 bytes). Sequential access uses the whole line; random access wastes most of it.

**[[temporal locality]]**: Reuse data while it's still cached. Process everything you need from the current data before moving to new data.

// Cache-FRIENDLY: sequential access (uses spatial locality)
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        sum += matrix[i][j];  // Accesses memory in order
    }
}

// Cache-HOSTILE: strided access (wastes cache lines)
for (int j = 0; j < cols; j++) {
    for (int i = 0; i < rows; i++) {
        sum += matrix[i][j];  // Jumps around in memory
    }
}

// For a 1000x1000 matrix, the friendly version might be 10-100x faster!

Data structures and cache performance

Data structure choice dramatically affects cache performance. Arrays are cache-friendly because elements are contiguous. Linked lists are cache-hostile because elements are scattered throughout memory.

// Array: elements contiguous in memory
// Traversal is cache-friendly
const arr = [1, 2, 3, 4, 5, 6, 7, 8];  // All in one cache line!

// Linked list: nodes scattered in memory
// Traversal causes cache miss for EACH node
class Node { 
    constructor(val) { 
        this.val = val; 
        this.next = null;  // Pointer to somewhere else in memory
    }
}

This is why arrays often beat linked lists in practice, even for operations where linked lists have better Big O complexity. Inserting into a linked list is O(1) after you find the spot, but traversing to find that spot causes cache misses. Inserting into an array requires O(n) shifting, but the shift is cache-friendly and often faster for reasonable sizes.

Hash tables illustrate the trade-off. Chained hash tables (buckets are linked lists) are cache-hostile. Open addressing (everything in one array) is cache-friendly. This is why modern hash table implementations usually use open addressing with linear or quadratic probing.
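As a sketch of the open-addressing idea, here is a minimal hash set with linear probing (illustrative only: no resizing or deletion, and it assumes the load stays below capacity so probing always finds an empty slot):

```javascript
// Open-addressing hash set for integers: all entries live in one
// contiguous array, so probing walks sequentially through memory.
class ProbingSet {
    constructor(capacity = 16) {
        this.capacity = capacity;
        this.slots = new Array(capacity).fill(undefined);
    }
    _hash(key) {
        return ((key % this.capacity) + this.capacity) % this.capacity;
    }
    add(key) {
        let i = this._hash(key);
        while (this.slots[i] !== undefined) {
            if (this.slots[i] === key) return;  // Already present
            i = (i + 1) % this.capacity;        // Linear probe: next slot
        }
        this.slots[i] = key;
    }
    has(key) {
        let i = this._hash(key);
        while (this.slots[i] !== undefined) {
            if (this.slots[i] === key) return true;
            i = (i + 1) % this.capacity;        // Keep probing until an empty slot
        }
        return false;                           // Hit an empty slot: not present
    }
}
```

Because a probe sequence walks adjacent slots, collisions stay within a cache line or two, unlike chaining, where each bucket node can be anywhere in memory.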

The CPU pipeline and instruction-level parallelism

Modern CPUs don't execute one instruction at a time. They use [[pipeline]]s to work on many instructions simultaneously, like an assembly line. While one instruction is executing, the next is being decoded, and the one after that is being fetched.

Classic 5-stage pipeline:

Time →    Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5   Cycle 6
────────────────────────────────────────────────────────────────────
Instr 1   Fetch     Decode    Execute   Memory    Writeback
Instr 2             Fetch     Decode    Execute   Memory    Writeback
Instr 3                       Fetch     Decode    Execute   Memory
Instr 4                                 Fetch     Decode    Execute
Instr 5                                           Fetch     Decode

Result: After startup, one instruction completes every cycle!

Superscalar CPUs go further: they can execute multiple independent instructions per cycle. A modern Intel core might have 4-6 execution units working in parallel. This is measured as [[IPC]] - Instructions Per Cycle. A good IPC is 2-4; memory-bound code might see 0.5 or worse.
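You can sometimes expose more instruction-level parallelism by hand. Summing with four independent accumulators breaks the single dependency chain on `sum`, letting a superscalar core overlap the additions (a sketch; any actual speedup depends heavily on the CPU and the language runtime):

```javascript
// One accumulator: each addition must wait for the previous one.
function sumSingle(a) {
    let s = 0;
    for (let i = 0; i < a.length; i++) s += a[i];
    return s;
}

// Four independent accumulators: the four additions per iteration
// have no data dependency on each other, so they can run in parallel.
function sumUnrolled(a) {
    let s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    const n = a.length - (a.length % 4);
    for (let i = 0; i < n; i += 4) {
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    for (let i = n; i < a.length; i++) s0 += a[i];  // Handle the remainder
    return s0 + s1 + s2 + s3;
}
```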

Branch prediction: guessing the future

Pipelines have a problem: what happens at a conditional branch (if/else)? The CPU doesn't know which way the branch goes until the condition is evaluated, but it needs to fetch the next instruction immediately.

The solution is [[branch prediction]]. The CPU guesses which way the branch will go and starts executing speculatively. If it guesses right, execution continues at full speed. If it guesses wrong, it must flush the pipeline and start over - a 10-20 cycle penalty.
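A classic textbook predictor is the 2-bit saturating counter, which must guess wrong twice in a row before flipping its prediction. This small simulation (illustrative; real predictors track branch history and are far more elaborate) shows why a steady branch is nearly free while an erratic one is mispredicted about half the time:

```javascript
// Simulate a 2-bit saturating counter predictor for one branch.
// States 0-1 predict "not taken", states 2-3 predict "taken".
function mispredictionRate(outcomes) {
    let state = 2;  // Start weakly predicting "taken"
    let misses = 0;
    for (const taken of outcomes) {
        const predictTaken = state >= 2;
        if (predictTaken !== taken) misses++;
        // Saturating update: move toward the actual outcome
        state = taken ? Math.min(3, state + 1) : Math.max(0, state - 1);
    }
    return misses / outcomes.length;
}
```

A branch that is always taken is predicted perfectly, while one that alternates every iteration is wrong on every "not taken" outcome, a 50% miss rate.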

// Predictable branch - CPU predicts correctly almost every time
for (int i = 0; i < 1000000; i++) {
    if (i < 999999) {  // True 999,999 times, false once
        sum += i;
    }
}

// Unpredictable branch - CPU wrong ~50% of the time
for (int i = 0; i < 1000000; i++) {
    if (data[i] > 128) {  // Random 50/50 based on data
        sum += data[i];
    }
}

// The unpredictable version can be 5-10x slower!

A famous example: sorting data before processing can make code faster, even though sorting adds work! After sorting, the branch becomes predictable (first half always false, second half always true).

// Unsorted: random branches, many mispredictions
for (int i = 0; i < n; i++) {
    if (data[i] >= 128)  // Unpredictable
        sum += data[i];
}

// Sorted: predictable branches
sort(data, n);  // O(n log n) work, but...
for (int i = 0; i < n; i++) {
    if (data[i] >= 128)  // First half: always false. Second half: always true.
        sum += data[i];
}

// Often faster overall despite sorting cost!

Branchless programming

One way to avoid branch misprediction is to eliminate branches entirely. Many conditionals can be rewritten using arithmetic or SIMD instructions:

// Branching version
int max_branch(int a, int b) {
    if (a > b) return a;
    else return b;
}

// Branchless version (compilers often do this automatically)
int max_bits(int a, int b) {
    return a ^ ((a ^ b) & -(a < b));  // Bit manipulation, no branch
}

// Or a conditional-move-friendly version (single instruction, no branch)
int max_cmov(int a, int b) {
    int result = a;
    if (b > a) result = b;  // Compilers often emit CMOV here, not a branch
    return result;
}

// For summing conditionally:
// Branching:
if (data[i] >= 128) sum += data[i];

// Branchless:
int mask = -(data[i] >= 128);  // All 1s if true, all 0s if false
sum += data[i] & mask;         // Adds data[i] or 0

Memory alignment and padding

CPUs access memory most efficiently when data is aligned to its natural boundary: a 4-byte integer should start at an address divisible by 4, and an 8-byte double at an address divisible by 8.

// Poor layout: lots of padding, wastes cache
struct BadLayout {
    char a;       // 1 byte
    // 7 bytes padding to align double
    double b;     // 8 bytes
    char c;       // 1 byte
    // 7 bytes padding
};  // Total: 24 bytes!

// Good layout: fields ordered by size
struct GoodLayout {
    double b;     // 8 bytes
    char a;       // 1 byte
    char c;       // 1 byte
    // 6 bytes padding at end
};  // Total: 16 bytes!

// Best layout: pack if alignment isn't critical
struct __attribute__((packed)) PackedLayout {
    double b;     // 8 bytes
    char a;       // 1 byte
    char c;       // 1 byte
};  // Total: 10 bytes (but may have alignment issues)

SIMD: doing more with each instruction

SIMD (Single Instruction, Multiple Data) lets you process multiple data elements with one instruction. Instead of adding one pair of numbers, you add eight pairs simultaneously. Modern CPUs have wide SIMD registers: 256 bits (AVX2) or 512 bits (AVX-512).

// Scalar: process one element at a time
void add_arrays(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];  // One addition per iteration
    }
}

// SIMD with AVX: process 8 floats at a time
// (assumes n is a multiple of 8 and the arrays are 32-byte aligned;
// otherwise use _mm256_loadu_ps/_mm256_storeu_ps and handle the tail)
#include <immintrin.h>
void add_arrays_simd(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]);  // Load 8 floats
        __m256 vb = _mm256_load_ps(&b[i]);  // Load 8 floats
        __m256 vc = _mm256_add_ps(va, vb);  // Add 8 pairs
        _mm256_store_ps(&c[i], vc);         // Store 8 floats
    }
}

// 8x theoretical speedup (often 4-6x in practice)

Compilers can auto-vectorize simple loops, but complex code often needs manual SIMD or library help. Libraries like Eigen, NumPy, and Intel MKL use SIMD heavily for linear algebra, signal processing, and machine learning.

Putting it together: profiling-guided optimization

With so many factors affecting performance, how do you know what to optimize? The answer is profiling. Measure actual execution, find the hot spots, and focus your effort there.

Programs typically follow the 80/20 rule: 80% of time is spent in 20% of code. Often it's more extreme: 95% of time in 5% of code. Optimizing the cold 80% gives no perceptible improvement. Finding and optimizing the hot 20% (or 5%) can yield dramatic speedups.

# Linux: perf for sampling profiler
perf record ./myprogram
perf report

# macOS: Instruments
xcrun xctrace record --template 'Time Profiler' --launch ./myprogram

# Cross-platform: flamegraphs
perf record -g ./myprogram
perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg

> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered.

Donald Knuth

The path to fast code is: make it work, make it right, make it fast - in that order. Profile to find bottlenecks. Consider algorithmic improvements first (Big O). Then consider memory access patterns (cache). Finally, consider low-level tricks (SIMD, branchless). Each level gives diminishing returns but the right optimization at the right level can yield 10-100x speedups.

How Things Work - A Visual Guide to Technology