What makes a GPU different

Parallel processing explained


Your computer has two very different brains. One is famous - the [[CPU]] that everyone talks about when comparing laptops or building PCs. The other sits quietly on your graphics card (or integrated into modern chips), often possessing more raw computational power than its celebrated sibling: the [[GPU]]. Understanding why we need both processors, and how they differ at a fundamental level, explains everything from why your computer can render millions of pixels at 60 frames per second to why AI researchers cannot buy enough graphics cards.

The story of the GPU is really a story about specialization versus generalization - a tradeoff that appears everywhere in engineering. A Swiss Army knife can do many things adequately; a chef's knife does one thing superbly. CPUs are Swiss Army knives. GPUs are like having a thousand butter knives: individually simple, but collectively capable of spreading an enormous amount of butter very quickly.

The fundamental tradeoff

Imagine you need to solve a thousand simple math problems - just basic addition and multiplication. You have two options for help. Option A is a brilliant mathematician, a Fields Medal winner who can solve incredibly complex problems, handle unexpected curveballs, work through intricate proofs, and switch between different types of problems effortlessly. They are phenomenal at hard problems but can only work on one thing at a time, maybe two or three with some mental juggling.

Option B is a gymnasium full of a thousand high school students. Each one can only do basic arithmetic - addition, multiplication, nothing fancy. They cannot handle complex logic or switch between different types of problems quickly. But they can all work simultaneously, and they are extremely good at following simple, repetitive instructions.

Need to prove a theorem or debug a complex piece of logic? You absolutely want the mathematician. Need to add up a thousand numbers? Give each student one addition and you are done in seconds. The mathematician, working sequentially, would take a thousand times longer. Neither is better in absolute terms - they are optimized for different workloads.

This is the CPU versus GPU tradeoff, and it is not a matter of one being more advanced than the other. A modern CPU has a small number (typically 4-24) of incredibly sophisticated cores, each one a marvel of engineering with deep instruction pipelines, branch predictors, out-of-order execution engines, and large multi-level caches. A modern GPU has thousands (up to 16,000+) of comparatively simple cores that share control logic and execute the same instruction across many data elements.

[Interactive demo: CPU vs GPU processing 1,000 pixels. Four CPU cores work through the pixels sequentially while thousands of GPU cores process them in parallel - watch how parallel processing dramatically accelerates repetitive tasks compared to sequential execution.]

Why graphics demand parallelism

The reason GPUs evolved this parallel architecture comes directly from what graphics rendering actually requires. Think about what happens when you play a video game or watch a video. Every single frame - and you want at least 60 of them per second for smooth motion - the computer needs to figure out what color every single pixel on your screen should be.

On a 1080p display, that is 1920 × 1080 = 2,073,600 pixels. At 60 frames per second, that is over 124 million pixel calculations every second. For a 4K display at 60fps, you are looking at over 497 million pixel calculations per second. And each of those calculations might involve texture lookups, lighting equations, shadow tests, and blending operations.
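This arithmetic is easy to check for yourself. A minimal sketch, using the resolutions and frame rate quoted above:

```python
# Pixel calculations per second at a given resolution and frame rate.
def pixels_per_second(width, height, fps):
    return width * height * fps

hd = pixels_per_second(1920, 1080, 60)   # 1080p at 60 fps
uhd = pixels_per_second(3840, 2160, 60)  # 4K at 60 fps

print(f"1080p@60: {hd:,} pixel calculations/s")   # 124,416,000
print(f"4K@60:    {uhd:,} pixel calculations/s")  # 497,664,000
```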

Here is the critical insight that makes GPUs possible: calculating the color of pixel (100, 200) does not depend on the color of pixel (500, 300). They are completely independent. You do not need to know one to compute the other. Computer scientists call this property 'embarrassingly parallel' - the work can be split up trivially because there are no dependencies between the pieces.

Given this property of graphics workloads, it makes no sense to have a sophisticated processor working on one pixel at a time. You want as many simple processors as possible, all working on different pixels simultaneously. This is exactly what a GPU provides.
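Because each pixel is independent, the work splits trivially. A toy sketch in Python - the `shade` function here is a hypothetical stand-in for a real pixel shader - showing that the same result comes out whether pixels are processed one at a time or concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def shade(coord):
    # Hypothetical per-pixel computation: it depends only on this
    # pixel's own coordinate, never on any other pixel's result.
    x, y = coord
    return (x * 31 + y * 17) % 256

coords = [(x, y) for y in range(4) for x in range(4)]

sequential = [shade(c) for c in coords]
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(shade, coords))

assert sequential == parallel  # no dependencies, so order never matters
```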

Inside the silicon: CPU architecture

To understand what makes a GPU different, we first need to understand what makes a CPU sophisticated. A modern CPU core like those in an Intel Core i9 or AMD Ryzen 9 is optimized for single-threaded performance and handling unpredictable workloads. Each core contains several key components that make it powerful but expensive in terms of transistor count and power consumption.

First, there is branch prediction. Programs are full of if-statements and loops. When the CPU encounters a conditional branch, it cannot just stop and wait to see which way the code will go - that would waste dozens of clock cycles. Instead, sophisticated branch predictors use historical patterns to guess which branch will be taken, allowing the CPU to speculatively execute instructions ahead of time. Modern branch predictors are correct over 95% of the time.

Second, CPUs use out-of-order execution. Instructions do not always execute in the order they appear in the program. The CPU analyzes dependencies between instructions and executes them in whatever order maximizes throughput, as long as the final result is correct. This requires complex tracking of instruction dependencies and register renaming to avoid conflicts.

Third, CPUs have large, multi-level caches. Memory access is slow - hundreds of clock cycles to fetch data from RAM. CPUs mitigate this with hierarchical caches: a tiny, blazing-fast L1 cache (typically 32-64KB per core), a larger L2 cache (256KB-512KB), and a shared L3 cache (multiple megabytes). These caches exploit the fact that programs tend to access the same data repeatedly (temporal locality) and data near recently accessed data (spatial locality).
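A toy direct-mapped cache simulation (illustrative only - the sizes are made up, and real caches are set-associative) shows why spatial locality pays off: with 64-byte lines, a sequential walk over 4-byte integers misses once per line and hits on the other 15 accesses.

```python
LINE_SIZE = 64   # bytes per cache line
NUM_LINES = 64   # direct-mapped: one slot per line index

def simulate(addresses):
    cache = [None] * NUM_LINES  # which line each slot currently holds
    hits = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        slot = line % NUM_LINES
        if cache[slot] == line:
            hits += 1
        else:
            cache[slot] = line  # miss: fetch the whole line from memory
    return hits / len(addresses)

# Sequential walk over 4096 bytes of 4-byte integers: spatial locality.
seq = list(range(0, 4096, 4))
print(f"sequential hit rate: {simulate(seq):.1%}")  # 93.8% (15 of 16)
```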

| Component | Purpose | Why GPUs minimize it |
| --- | --- | --- |
| Branch predictor | Guess conditional outcomes to avoid pipeline stalls | GPU code should minimize branches - all threads do the same thing |
| Out-of-order engine | Execute instructions in optimal order regardless of program order | Simpler in-order execution is sufficient when you have thousands of threads |
| Large caches | Hide memory latency by keeping data close to compute | GPUs hide latency with thread switching instead of caching |
| Speculative execution | Execute ahead of branches to avoid stalls on misprediction | Not needed when code paths are uniform across threads |

Inside the silicon: GPU architecture

A [[GPU]] takes the opposite approach. An NVIDIA RTX 4090 has over 16,000 [[CUDA Core]]s organized into 128 [[Streaming Multiprocessor]]s. But calling them 'cores' is somewhat misleading - they are not independent processors like CPU cores. They are more like arithmetic units that share control logic and execute in lockstep.

The basic building block of an NVIDIA GPU is the Streaming Multiprocessor (SM). Each SM contains multiple CUDA cores (the exact number varies by architecture), shared memory, an instruction cache, and warp schedulers. All the CUDA cores within an SM execute under common control - they fetch the same instruction and execute it on different data.

Each SM can manage thousands of threads simultaneously, but it executes them in groups called [[warp]]s - 32 threads that run in perfect lockstep. When you dispatch a shader program, the GPU assigns threads to warps, warps to SMs, and the SMs execute the shader across all assigned threads. The key is that every thread in a warp executes the same instruction at the same clock cycle, just operating on different data in their private registers.
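The bookkeeping implied here is simple arithmetic. A sketch, assuming the warp size of 32 and the 128 SMs of the RTX 4090 mentioned above:

```python
import math

WARP_SIZE = 32
NUM_SMS = 128  # e.g. an RTX 4090

def dispatch(num_threads):
    """How many warps a thread count becomes, and how they spread over SMs."""
    warps = math.ceil(num_threads / WARP_SIZE)
    warps_per_sm = math.ceil(warps / NUM_SMS)
    return warps, warps_per_sm

# One thread per pixel of a 1080p frame.
warps, per_sm = dispatch(1920 * 1080)
print(f"{warps:,} warps total, ~{per_sm} warps per SM")  # 64,800 warps, ~507 per SM
```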

[Interactive demo: multiply each element of an 8-element input array (4, 7, 2, 9, 1, 8, 3, 6) by 2, comparing the SISD (CPU-style, one element per instruction) and SIMD (GPU-style, all elements per instruction) execution models, with counters for elements processed, instructions executed, and the resulting speedup.]

SIMD and SIMT: the execution model

The magic that makes GPUs efficient is [[SIMD]] - Single Instruction, Multiple Data. Instead of each core running its own independent program fetching its own instructions, groups of cores execute the exact same instruction at the exact same time, just on different data elements. This massively reduces the overhead of instruction fetching, decoding, and scheduling.

NVIDIA calls their execution model [[SIMT]] - Single Instruction, Multiple Threads. It is essentially SIMD with some additional flexibility. The difference is that SIMT allows limited thread divergence: if some threads in a warp take a different branch than others, the GPU can handle it, though inefficiently - it executes both paths in turn, masking off the threads that did not take each one. In pure SIMD, every lane must execute the same operation, so divergent control flow has to be handled explicitly with masks or predication.
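That lockstep-with-masking behavior can be sketched directly. In this toy model (not NVIDIA's actual hardware), a divergent if/else is executed by running both paths over the whole warp, with a per-thread mask selecting which lanes commit results - so divergence costs the sum of both paths:

```python
WARP_SIZE = 32

def run_warp(data):
    """Execute 'if x is even: x // 2, else: 3 * x + 1' in lockstep."""
    mask = [x % 2 == 0 for x in data]  # which threads take the if-branch
    results = list(data)

    # Pass 1: every lane steps through the 'even' path together;
    # only the masked-on lanes commit a result.
    for i in range(WARP_SIZE):
        if mask[i]:
            results[i] = data[i] // 2

    # Pass 2: every lane steps through the 'odd' path together;
    # now the other lanes commit. Two passes = the cost of divergence.
    for i in range(WARP_SIZE):
        if not mask[i]:
            results[i] = 3 * data[i] + 1

    return results

out = run_warp(list(range(WARP_SIZE)))
print(out[:4])  # [0, 4, 1, 10]
```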

// This fragment shader runs on EVERY pixel simultaneously
// Each invocation has different gl_FragCoord values
// but executes the exact same instructions

#version 330 core

uniform sampler2D inputImage;  // the source texture
uniform vec2 resolution;       // output size in pixels

out vec4 fragColor;

void main() {
    // Each thread has its own pixel coordinate
    vec2 uv = gl_FragCoord.xy / resolution;

    // Each thread samples a different part of the texture
    vec4 color = texture(inputImage, uv);

    // Each thread computes brightness for its pixel
    float brightness = dot(color.rgb, vec3(0.299, 0.587, 0.114));

    // Each thread outputs a different color
    fragColor = vec4(vec3(brightness), 1.0);
}

This shader converts an image to grayscale. When you dispatch it for a 1920×1080 image, the GPU creates over 2 million threads - one for each pixel. These threads are organized into warps of 32, and each warp executes this program in lockstep. Thread 0 might be processing pixel (0,0), thread 1 processing pixel (1,0), and so on. They all execute the same texture lookup instruction at the same time, just with different coordinates.
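The per-pixel computation itself is ordinary arithmetic; only the execution model is exotic. Here is the same luminance formula as plain Python - the 0.299/0.587/0.114 weights are the standard Rec. 601 luma coefficients the shader uses:

```python
def to_grayscale(pixel):
    """pixel is (r, g, b) with each channel in [0, 1]; returns luma."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

# The GPU applies this to ~2 million pixels at once; a CPU loops.
image = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
gray = [to_grayscale(p) for p in image]
print([round(v, 3) for v in gray])  # [0.299, 0.587, 0.114, 1.0]
```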

Memory: bandwidth over latency

GPU memory ([[VRAM]]) is designed around a completely different set of priorities than CPU memory. CPUs optimize for latency - getting any single piece of data as fast as possible. GPUs optimize for [[memory bandwidth]] - moving enormous amounts of data per second, even if any individual access takes a while.

A modern high-end GPU like the RTX 4090 has over 1 terabyte per second of memory bandwidth - roughly 10× what a high-end CPU can achieve. But accessing that memory takes hundreds of clock cycles. A single memory fetch might take 400-800 cycles to complete.

How do GPUs cope with this latency? The answer is [[latency hiding]] through massive parallelism. When one warp issues a memory request and has to wait, the SM immediately switches to executing a different warp that already has its data. With thousands of threads in flight, there is always something useful to do while waiting for memory. The latency does not disappear - it is just hidden behind other useful work.
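The arithmetic of latency hiding is worth seeing once. If a memory fetch takes ~400 cycles and each warp can issue only a handful of compute instructions before it next needs memory, the scheduler just needs enough resident warps to cover the gap. A back-of-the-envelope sketch - the cycle counts are the illustrative figures from the text, not measured values:

```python
import math

def warps_to_hide_latency(mem_latency_cycles, compute_cycles_per_warp):
    """Resident warps needed to keep the SM busy while one waits on memory."""
    return math.ceil(mem_latency_cycles / compute_cycles_per_warp) + 1

# Assume each warp does ~8 cycles of arithmetic between memory
# requests, and a fetch takes ~400 cycles to come back.
needed = warps_to_hide_latency(400, 8)
print(f"~{needed} resident warps hide the latency completely")  # ~51
```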

This is fundamentally different from the CPU approach of using caches to reduce latency. CPUs try to make memory access fast. GPUs accept that memory access is slow and keep themselves busy with other work while waiting. Both approaches work; they are just optimized for different access patterns.

| Metric | Modern CPU | Modern GPU | Ratio |
| --- | --- | --- | --- |
| Cores | 8-24 | 10,000-16,000+ | ~500-1000× |
| Clock speed | 4-5+ GHz | 1.5-2.5 GHz | ~0.5× |
| Memory bandwidth | 50-100 GB/s | 500-1000+ GB/s | ~10× |
| Memory latency | ~50-100 ns | ~400-800 ns | ~8× slower |
| Cache per core | 1-2 MB | ~10-20 KB | ~100× less |
| Transistors | ~5-10 billion | ~50-80 billion | ~10× |

Beyond graphics: GPGPU and the AI revolution

For most of their history, GPUs could only be programmed through graphics APIs like OpenGL and DirectX. You had to frame your computation as if it were a rendering problem. In 2007, NVIDIA changed everything by releasing CUDA (Compute Unified Device Architecture), which allowed programmers to write general-purpose code that runs on the GPU without the graphics pretense.

This opened the floodgates for General-Purpose GPU computing (GPGPU). Scientists realized that many computationally intensive problems - fluid dynamics, molecular simulations, financial modeling - are embarrassingly parallel and map beautifully onto GPU architectures. Supercomputers started incorporating GPUs. Cryptocurrency miners discovered that the hash calculations underlying proof-of-work were perfect for parallel execution.

But the biggest revolution came from machine learning. Neural networks are, at their core, enormous matrix multiplications. Training a model means doing the same mathematical operations across millions or billions of parameters. This is exactly the kind of work GPUs excel at.
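At the core of that claim is nothing more exotic than this. A naive matrix multiply in Python: every output element is an independent dot product, so all of them can be computed in parallel, which is exactly why GPUs (and Tensor Cores) eat this workload.

```python
def matmul(a, b):
    """Naive matrix multiply. Each output element is an independent
    dot product of a row of `a` with a column of `b`."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

# A tiny 'layer': a 2-element input times a 2x2 weight matrix.
x = [[1.0, 2.0]]
w = [[0.5, -1.0],
     [0.25, 0.75]]
print(matmul(x, w))  # [[1.0, 0.5]]
```

Training a real network repeats this at enormous scale, with matrices of millions of elements, which is why the parallel architecture matters so much.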

Modern GPUs have evolved to serve this demand. NVIDIA's Tensor Cores are specialized units for matrix operations central to deep learning. Their architectures now include dedicated hardware for AI inference. The line between 'graphics processor' and 'AI accelerator' has become thoroughly blurred.

When to use which

Understanding the CPU/GPU tradeoff helps you know when each is appropriate. CPUs excel at tasks with complex control flow, unpredictable branches, pointer-chasing through data structures, and workloads that cannot be parallelized. Operating systems, compilers, web servers, databases with complex queries - these run best on CPUs.

GPUs excel at tasks where you do the same thing to lots of data: graphics rendering, image and video processing, scientific simulations, matrix operations, training and running neural networks. If you can express your problem as 'do this simple thing to each of these million elements,' a GPU will crush a CPU.

Modern applications often use both. A video game might run its game logic, physics, and AI decision-making on the CPU while offloading rendering to the GPU. A machine learning pipeline might preprocess data on the CPU, train on the GPU, and run the trained model for inference on either depending on latency requirements.

The future likely holds even more specialization. We already have TPUs (Tensor Processing Units) for matrix operations, NPUs (Neural Processing Units) for on-device AI, and various domain-specific accelerators. The principle remains the same: match the architecture to the workload. Sometimes you need a Swiss Army knife. Sometimes you need a thousand butter knives.

How Things Work - A Visual Guide to Technology