The mechanical marvel
Hard disk drives are mechanical engineering at its absolute finest. Inside that sealed metal enclosure, [[platter]]s spin at 5,400 to 15,000 RPM - that's up to 250 rotations per second. A read/write head flies just nanometers above the surface on an air bearing, reading and writing bits by manipulating tiny magnetic domains smaller than a bacterium.
The head doesn't touch the platter - it flies on a cushion of air so thin that a human hair would be a mountain in comparison. At roughly 10 nanometers, the fly height is a stack of only about 50 atoms. A particle of smoke would cause a head crash. This is why HDDs are assembled in cleanrooms and permanently sealed.
The precision is extraordinary. Modern HDDs pack over a trillion bits per square inch using technologies like perpendicular magnetic recording (PMR), shingled magnetic recording (SMR), and heat-assisted magnetic recording (HAMR). Each magnetic domain is tens of nanometers across, and the head must position itself within nanometers of accuracy while the platter screams past beneath it.
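A quick back-of-envelope check of the density claim above. The figures in the sketch are illustrative round numbers, not a specific product's specification:

```python
import math

# At ~1 terabit per square inch, how large is each bit cell?
areal_density_bits_per_in2 = 1e12        # ~1 Tbit/in^2 (order of magnitude)
in2_in_nm2 = (25.4e6) ** 2               # 1 inch = 25.4 mm = 25.4e6 nm

bit_area_nm2 = in2_in_nm2 / areal_density_bits_per_in2
bit_side_nm = math.sqrt(bit_area_nm2)    # side of an idealized square bit cell

print(f"Area per bit: {bit_area_nm2:.0f} nm^2")
print(f"Square bit cell side: {bit_side_nm:.1f} nm")
```

The result, about 25 nm on a side, is consistent with magnetic domains "tens of nanometers across."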
Storage technologies compared
| Metric | HDD | SATA SSD | NVMe |
|---|---|---|---|
| Random Read | 5-10 ms | 0.1 ms | 0.02 ms |
| Sequential Read | 150 MB/s | 550 MB/s | 7,000 MB/s |
| Sequential Write | 130 MB/s | 520 MB/s | 5,000 MB/s |
| Interface | SATA III | SATA III | PCIe 4.0 |
| Power (Active) | 6-8W | 2-3W | 5-8W |
| Shock Resistance | Low | High | High |
| Cost per TB | ~$15 | ~$50 | ~$80 |
Key insight: NVMe drives bypass the SATA bottleneck by connecting directly to the CPU via PCIe lanes, enabling parallel data transfer across multiple lanes simultaneously.
But mechanical systems have fundamental limitations. [[Seek time]] - the time to move the head to the right track - is typically 5-10 milliseconds. That sounds fast until you realize a modern CPU can execute 50 million instructions in that time. For random access patterns, HDDs are bottlenecks that make even the fastest CPUs wait. The head can only be in one place at a time.
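The cost of one random access can be sketched with rough, assumed figures: an average seek plus half a rotation of latency at 7200 RPM, against a hypothetical 4 GHz core retiring one instruction per cycle:

```python
# All numbers are assumptions for a back-of-envelope estimate.
seek_ms = 7.5                        # average seek, middle of the 5-10 ms range
rotational_ms = 60_000 / 7200 / 2    # half a rotation at 7200 RPM ~= 4.17 ms
access_ms = seek_ms + rotational_ms

cpu_hz = 4e9                         # one core at 4 GHz, ~1 instruction/cycle
wasted = access_ms / 1000 * cpu_hz   # instructions the CPU could have retired

print(f"One random access: ~{access_ms:.2f} ms")
print(f"Instructions a 4 GHz core could retire meanwhile: {wasted:,.0f}")
```

The answer lands in the tens of millions, which is where the "50 million instructions" figure comes from.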
How magnetic storage works
Data on a hard drive is stored in magnetic domains - tiny regions where all the magnetic dipoles align in the same direction. North-facing might represent a 1, south-facing a 0. To write, the head generates a magnetic field that flips the domain's orientation. To read, the head detects the weak magnetic field emanating from each domain using a magnetoresistive sensor.
Hard Drive Geometry:
Platter (typically 2-4 per drive)
├── Side 0 and Side 1 (both surfaces used)
└── Tracks (concentric circles, ~100,000+ per surface)
└── Sectors (512 bytes or 4096 bytes each)
└── Bits (magnetic domains)
Example: 2TB drive with 4 platters
- 8 surfaces (both sides of each platter)
- ~250,000 tracks per surface
- ~250 sectors per track (varies by zone)
- 4,096 bytes per sector
- 8 × 250,000 × 250 × 4,096 ≈ 2 trillion bytes total
Addressing: CHS (Cylinder-Head-Sector) historically
Modern: LBA (Logical Block Addressing)
The drive firmware translates LBA to physical location.

Reading sequential data is fast because the head stays in place while data streams by - limited only by rotational speed (RPM) and data density. A 7200 RPM drive can sustain 150+ MB/s sequential reads. But random reads require seeking, and those 5-10 ms seek times add up fast. Reading 1,000 random sectors takes 5-10 seconds on an HDD vs milliseconds on an SSD.
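The LBA-to-physical translation can be illustrated with the classic fixed-geometry CHS scheme that legacy BIOSes assumed (16 heads, 63 sectors per track). Real firmware uses zoned, variable geometry that this sketch deliberately ignores:

```python
# Legacy fixed geometry, purely for illustration of the arithmetic.
HEADS = 16
SECTORS_PER_TRACK = 63

def lba_to_chs(lba: int) -> tuple[int, int, int]:
    """Translate a logical block address to (cylinder, head, sector)."""
    cylinder = lba // (HEADS * SECTORS_PER_TRACK)
    head = (lba // SECTORS_PER_TRACK) % HEADS
    sector = lba % SECTORS_PER_TRACK + 1   # sectors are 1-indexed by convention
    return cylinder, head, sector

print(lba_to_chs(0))        # (0, 0, 1): the very first sector on the drive
print(lba_to_chs(123_456))  # (122, 7, 40)
```

Modern drives expose only the flat LBA space; the physical mapping (including remapped bad sectors) stays hidden inside the firmware.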
The flash revolution
Solid-state drives have no moving parts at all. Instead, they use [[NAND flash]] memory - billions of tiny floating-gate transistors that trap electrons to store data. No seeking, no spinning, no mechanical delays. Access time drops from milliseconds to microseconds - a 1,000x improvement that fundamentally changes what's possible.
Each flash cell stores charge in a floating gate - a conductive layer completely surrounded by insulating oxide. To write a 0 (program), a high voltage forces electrons through the oxide onto the floating gate via quantum tunneling. The trapped electrons shift the transistor's threshold voltage. To read, the controller applies a reference voltage and checks whether the transistor conducts - the shifted threshold reveals the stored value. To erase (return to 1), an even higher voltage in the opposite direction pulls electrons back out.
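The relationship between bits per cell and read complexity can be sketched directly: n bits require 2^n distinguishable charge states, separated by 2^n - 1 reference voltages. This is a simplification - real controllers interleave reads and use soft-decision decoding - but it explains the trade-offs in the cell-type list below:

```python
def read_references(bits_per_cell: int) -> int:
    """Reference voltages needed to distinguish 2**n charge states."""
    return 2 ** bits_per_cell - 1

for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3), ("QLC", 4)]:
    states = 2 ** bits
    print(f"{name}: {bits} bit(s)/cell -> {states} charge states, "
          f"{read_references(bits)} reference voltage(s)")
```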
Flash Cell Types and Their Trade-offs:
SLC (Single-Level Cell)
├── 1 bit per cell: threshold detects 2 states
├── Fastest: clear voltage separation
├── Most durable: ~100,000 program/erase cycles
├── Most expensive: 1 bit = 1 cell
└── Used in: enterprise SSDs, write-intensive workloads
MLC (Multi-Level Cell)
├── 2 bits per cell: 4 voltage levels
├── Slower: tighter margins between levels
├── Less durable: ~10,000 cycles
├── Cheaper: 2 bits per cell
└── Used in: high-end consumer SSDs
TLC (Triple-Level Cell)
├── 3 bits per cell: 8 voltage levels
├── Even slower: read requires multiple measurements
├── Lower endurance: ~3,000 cycles
├── Much cheaper: dominates consumer market
└── Used in: mainstream consumer SSDs
QLC (Quad-Level Cell)
├── 4 bits per cell: 16 voltage levels
├── Slowest: very tight voltage margins
├── Lowest endurance: ~1,000 cycles
├── Cheapest per GB: approaches HDD prices
└── Used in: read-heavy workloads, archival storage
The write amplification problem
Flash has a fundamental quirk that shapes all SSD design: you can program (write) individual pages (typically 4-16KB), but you can only erase entire blocks (typically 256KB-4MB). To modify a single page, you must read the entire block, modify the page in RAM, erase the whole block, then write everything back.
This leads to write amplification - the SSD might write several times more data internally than what the host requested. If you update a 4KB file in a 256KB block, that's potentially 64x amplification in the worst case. The [[FTL]] (Flash Translation Layer) firmware works hard to minimize this through clever algorithms.
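The worst case quoted above can be made concrete with a minimal model of the naive read-modify-erase-write cycle, using the sizes from the text. A real FTL remaps pages precisely to avoid this; the model only shows why it must:

```python
PAGE_SIZE = 4 * 1024
BLOCK_SIZE = 256 * 1024
PAGES_PER_BLOCK = BLOCK_SIZE // PAGE_SIZE   # 64 pages

def naive_update(block: list[bytes], page_index: int,
                 new_data: bytes) -> tuple[list[bytes], int]:
    """Rewrite one page the naive way; return (new block, bytes physically written)."""
    staged = list(block)            # 1. read the whole block into RAM
    staged[page_index] = new_data   # 2. modify the one target page
    # 3. erase the block, 4. program every page back:
    bytes_written = PAGES_PER_BLOCK * PAGE_SIZE
    return staged, bytes_written

block = [bytes(PAGE_SIZE)] * PAGES_PER_BLOCK
_, written = naive_update(block, 5, b"\xff" * PAGE_SIZE)
print(f"Host wrote {PAGE_SIZE} B; flash wrote {written} B "
      f"({written // PAGE_SIZE}x amplification)")
```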
[[Wear leveling]] addresses another critical issue: flash cells wear out after a limited number of program/erase cycles. If the same cells were written repeatedly (like the file system metadata area), they'd die in weeks. The FTL spreads writes evenly across all cells, ensuring uniform wear. This is why SSDs report "total bytes written" lifetime ratings.
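A toy wear-leveling policy - always program the block with the fewest erase cycles - can be sketched with a min-heap. This is a pedagogical sketch, not a production FTL algorithm (real firmware also weighs data hotness and static wear leveling):

```python
import heapq

class WearLeveler:
    """Allocate the least-erased block so wear stays uniform."""

    def __init__(self, n_blocks: int):
        # Min-heap of (erase_count, block_id) pairs.
        self.heap = [(0, b) for b in range(n_blocks)]
        heapq.heapify(self.heap)

    def allocate_block(self) -> int:
        erases, block = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (erases + 1, block))  # record the erase/program
        return block

wl = WearLeveler(n_blocks=4)
used = [wl.allocate_block() for _ in range(8)]
print(used)  # every one of the 4 blocks gets programmed exactly twice
```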
SSD architecture
A modern SSD isn't just flash chips - it's a sophisticated computer in its own right. The controller is a powerful processor running complex firmware that manages the flash translation layer, wear leveling, garbage collection, and error correction. There's also DRAM cache (for metadata and write buffering) and multiple flash channels for parallelism.
SSD Internal Architecture:
┌─────────────────────────────────────────────────┐
│ Host Interface │
│ (SATA, NVMe over PCIe) │
├─────────────────────────────────────────────────┤
│ SSD Controller │
│ ┌───────────────────────────────────────────┐ │
│ │ • Flash Translation Layer (FTL) │ │
│ │ • Wear Leveling algorithms │ │
│ │ • Garbage Collection │ │
│ │ • Error Correction (LDPC codes) │ │
│ │ • Encryption engine (AES) │ │
│ └───────────────────────────────────────────┘ │
├─────────────────────────────────────────────────┤
│ DRAM Cache (256MB-4GB) │
│ (mapping table, write buffer) │
├──────────┬──────────┬──────────┬───────────────┤
│ Channel 0│ Channel 1│ Channel 2│ ... Channel N │
├──────────┼──────────┼──────────┼───────────────┤
│ Flash Die│ Flash Die│ Flash Die│ Flash Die │
│ Flash Die│ Flash Die│ Flash Die│ Flash Die │
│ Flash Die│ Flash Die│ Flash Die│ Flash Die │
└──────────┴──────────┴──────────┴───────────────┘
Multiple channels enable parallel access.
8-16 channels are typical for high-end SSDs.
NVMe: unleashing flash potential
Early SSDs connected via SATA, an interface designed for spinning disks. SATA's protocol has high overhead and limited queue depth (32 commands), topping out around 550 MB/s. The interface became the bottleneck long before the flash chips were maxed out.
NVMe (Non-Volatile Memory Express) was designed from scratch for flash storage, connecting directly to PCIe lanes. It supports 65,536 queues with 65,536 commands each, minimal protocol overhead, and multiple CPU cores submitting commands in parallel. Modern NVMe SSDs hit 7,000+ MB/s sequential and over 1 million random IOPS - roughly 13x the throughput of SATA, though still well short of DRAM bandwidth.
| Metric | HDD (7200 RPM) | SATA SSD | NVMe SSD (Gen4) | NVMe SSD (Gen5) |
|---|---|---|---|---|
| Sequential Read | 150 MB/s | 550 MB/s | 7,000 MB/s | 12,000 MB/s |
| Sequential Write | 150 MB/s | 520 MB/s | 5,000 MB/s | 10,000 MB/s |
| Random Read IOPS | 100 | 90,000 | 1,000,000 | 1,500,000 |
| Random Write IOPS | 100 | 80,000 | 800,000 | 1,200,000 |
| Latency | 5-10 ms | 0.1 ms | 0.02 ms | 0.015 ms |
| Power (active) | 6-8W | 2-3W | 5-8W | 7-10W |
The future: denser and faster
3D NAND stacks flash cells vertically - over 200 layers in some current designs, with 300+ layer parts coming. This dramatically increases density without shrinking cell size (which would hurt reliability). Instead of cramming more cells into a flat plane, we build skyscrapers of storage. Samsung, Micron, and SK Hynix are racing to stack higher.
QLC (4 bits per cell) and even PLC (5 bits per cell) push capacity further, though at the cost of endurance and speed. These work well for read-heavy workloads like media storage. Sophisticated SLC caching (using a portion of the drive as fast SLC, then migrating to denser modes) helps maintain burst write performance.
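The SLC-caching behavior can be modeled as a simple two-speed pipeline: writes land in a fast SLC region until it fills, then fall to native QLC speed. All figures here are made-up but plausible assumptions:

```python
# Assumed drive parameters (illustrative, not a real product).
SLC_CACHE_GB = 100
SLC_SPEED_GBPS = 5.0     # GB/s while the SLC cache has room
QLC_SPEED_GBPS = 0.5     # GB/s writing natively to QLC

def write_time_s(total_gb: float) -> float:
    """Seconds to absorb a burst write under the two-speed model."""
    in_cache = min(total_gb, SLC_CACHE_GB)
    overflow = max(0.0, total_gb - SLC_CACHE_GB)
    return in_cache / SLC_SPEED_GBPS + overflow / QLC_SPEED_GBPS

for gb in (50, 100, 400):
    t = write_time_s(gb)
    print(f"{gb:>3} GB burst: {t:6.1f} s  (avg {gb / t:.2f} GB/s)")
```

Small bursts see full cache speed; once the burst exceeds the cache, average throughput collapses toward the native QLC rate - the cliff reviewers observe in sustained-write benchmarks.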
Beyond flash, emerging technologies promise even better characteristics. Intel's Optane (3D XPoint) offered byte-addressability and lower latency than NAND, bridging the gap between DRAM and storage. Computational storage puts processing directly on the SSD. And researchers are exploring DNA storage (incredibly dense, stable for millennia, but very slow) and holographic storage for the future.