From spinning disks to flash

How data is physically stored


The mechanical marvel

Hard disk drives are mechanical engineering at its absolute finest. Inside that sealed metal enclosure, [[platter]]s spin at 5,400 to 15,000 RPM - that's up to 250 rotations per second. A read/write head flies just nanometers above the surface on an air bearing, reading and writing bits by manipulating tiny magnetic domains smaller than a bacterium.

The head doesn't touch the platter - it flies on a cushion of air so thin that a human hair would be a mountain in comparison. At 10 nanometers, the fly height is just 50 atoms. A particle of smoke would cause a head crash. This is why HDDs are assembled in cleanrooms and permanently sealed.

The precision is extraordinary. Modern HDDs pack over a trillion bits per square inch using technologies like perpendicular magnetic recording (PMR), shingled magnetic recording (SMR), and heat-assisted magnetic recording (HAMR). Each magnetic domain is tens of nanometers across, and the head must position itself with nanometer accuracy while the platter screams past beneath it.

Storage Technologies Compared

Metric             HDD         SATA SSD    NVMe
Random Read        5-10 ms     0.1 ms      0.02 ms
Sequential Read    150 MB/s    550 MB/s    7,000 MB/s
Sequential Write   130 MB/s    520 MB/s    5,000 MB/s
Interface          SATA III    SATA III    PCIe 4.0
Power (Active)     6-8 W       2-3 W       5-8 W
Shock Resistance   Low         High        High
Cost per TB        ~$15        ~$50        ~$80

Key insight: NVMe drives bypass the SATA bottleneck by connecting directly to the CPU via PCIe lanes, enabling parallel data transfer across multiple lanes simultaneously.


But mechanical systems have fundamental limitations. [[Seek time]] - the time to move the head to the right track - is typically 5-10 milliseconds. That sounds fast until you realize a modern CPU can execute 50 million instructions in that time. For random access patterns, HDDs are bottlenecks that make even the fastest CPUs wait. The head can only be in one place at a time.
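
To make that gap concrete, here is a back-of-the-envelope calculation. The clock speed and instructions-per-cycle figures below are illustrative assumptions, not measurements:

```python
# How much CPU work fits inside a single HDD seek?
# Illustrative assumptions: 7.5 ms average seek, 4 GHz clock, 2 IPC.

SEEK_TIME_S = 0.0075   # ~7.5 ms, the middle of the 5-10 ms range
CLOCK_HZ = 4e9         # 4 GHz CPU clock (assumed)
IPC = 2                # instructions retired per cycle (assumed)

instructions_per_seek = int(SEEK_TIME_S * CLOCK_HZ * IPC)
print(f"{instructions_per_seek:,} instructions per seek")  # 60,000,000
```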

How magnetic storage works

Data on a hard drive is stored in magnetic domains - tiny regions where all the magnetic dipoles align in the same direction. North-facing might represent a 1, south-facing a 0. To write, the head generates a magnetic field that flips the domain's orientation. To read, the head detects the weak magnetic field emanating from each domain using a magnetoresistive sensor.

Hard Drive Geometry:

Platter (typically 2-4 per drive)
└── Side 0 and Side 1 (both surfaces used)
    └── Tracks (concentric circles, ~100,000+ per surface)
        └── Sectors (512 bytes or 4096 bytes each)
            └── Bits (magnetic domains)

Example: 2TB drive with 4 platters
- 8 surfaces (both sides of each platter)
- ~250,000 tracks per surface
- ~250 sectors per track on average (varies by zone)
- 4,096 bytes per sector
- 8 × 250,000 × 250 × 4,096 ≈ 2 trillion bytes total

Addressing: CHS (Cylinder-Head-Sector) historically
Modern: LBA (Logical Block Addressing)
The drive firmware translates LBA to physical location.
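
That translation can be sketched with the classic CHS-to-LBA formula. The 16-head, 63-sectors-per-track geometry below is an assumed legacy example, not any real drive's layout:

```python
def chs_to_lba(cyl, head, sector, heads, sectors_per_track):
    # Classic formula: sectors number from 1, cylinders and heads from 0.
    return (cyl * heads + head) * sectors_per_track + (sector - 1)

def lba_to_chs(lba, heads, sectors_per_track):
    # Inverse mapping: peel off cylinder, then head, then sector.
    cyl, rest = divmod(lba, heads * sectors_per_track)
    head, sec = divmod(rest, sectors_per_track)
    return cyl, head, sec + 1

# Round trip with the assumed geometry (16 heads, 63 sectors/track)
lba = chs_to_lba(2, 3, 4, 16, 63)
print(lba, lba_to_chs(lba, 16, 63))  # 2208 (2, 3, 4)
```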

Reading sequential data is fast because the head stays in place while data streams by - limited only by rotational speed (RPM) and data density. A 7200 RPM drive can sustain 150+ MB/s sequential reads. But random reads require seeking, and those 5-10ms seek times add up fast. Reading 1000 random sectors takes 5-10 seconds on HDD vs milliseconds on SSD.
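
A quick sanity check on those numbers, using drive characteristics assumed from the figures above:

```python
# 1,000 random 4 KiB reads vs reading the same 4 MB sequentially.
# Assumed figures: 7.5 ms per random access, 150 MB/s sequential.

ACCESS_MS = 7.5        # seek + rotational latency per random read
SEQ_MB_PER_S = 150     # sustained sequential throughput

random_s = 1000 * ACCESS_MS / 1000          # 1,000 accesses back to back
sequential_ms = 4 / SEQ_MB_PER_S * 1000     # 4 MB in one streaming pass
print(f"random: {random_s:.1f} s, sequential: {sequential_ms:.0f} ms")
```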

The flash revolution

Solid-state drives have no moving parts at all. Instead, they use [[NAND flash]] memory - billions of tiny floating-gate transistors that trap electrons to store data. No seeking, no spinning, no mechanical delays. Access time drops from milliseconds to microseconds - a 1,000x improvement that fundamentally changes what's possible.

Each flash cell stores charge in a floating gate - a conductive layer completely surrounded by insulating oxide. To write a 0 (program), a high voltage forces electrons through the oxide onto the floating gate via quantum tunneling. The trapped electrons shift the transistor's threshold voltage. To read, the controller applies a reference voltage and checks whether the transistor conducts - the shifted threshold reveals the stored bit. To erase (return to 1), a high voltage in the opposite direction pulls the electrons back out.

Flash Cell Types and Their Trade-offs:

SLC (Single-Level Cell)
├── 1 bit per cell: threshold detects 2 states
├── Fastest: clear voltage separation
├── Most durable: ~100,000 program/erase cycles
├── Most expensive: 1 bit = 1 cell
└── Used in: enterprise SSDs, write-intensive workloads

MLC (Multi-Level Cell)
├── 2 bits per cell: 4 voltage levels
├── Slower: tighter margins between levels
├── Less durable: ~10,000 cycles
├── Cheaper: 2 bits per cell
└── Used in: high-end consumer SSDs

TLC (Triple-Level Cell)
├── 3 bits per cell: 8 voltage levels
├── Even slower: read requires multiple measurements
├── Lower endurance: ~3,000 cycles
├── Much cheaper: dominates consumer market
└── Used in: mainstream consumer SSDs

QLC (Quad-Level Cell)
├── 4 bits per cell: 16 voltage levels
├── Slowest: very tight voltage margins
├── Lowest endurance: ~1,000 cycles
├── Cheapest per GB: approaches HDD prices
└── Used in: read-heavy workloads, archival
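
The pattern behind the list above: each extra bit per cell doubles the number of voltage levels the sense circuitry must tell apart, while adding only one bit of capacity. A small sketch of that arithmetic:

```python
# Voltage levels and distinguishing thresholds per flash cell type.
# This is the core density-vs-precision trade-off of multi-level flash.

for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3), ("QLC", 4)]:
    levels = 2 ** bits
    thresholds = levels - 1   # reference voltages separating the levels
    print(f"{name}: {bits} bit(s) -> {levels} levels, {thresholds} thresholds")
```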

The write amplification problem

Flash has a fundamental quirk that shapes all SSD design: you can program (write) individual pages (typically 4-16KB), but you can only erase entire blocks (typically 256KB-4MB). To modify a single page, you must read the entire block, modify the page in RAM, erase the whole block, then write everything back.

This leads to write amplification - the SSD might write several times more data internally than what the host requested. If you update a 4KB file in a 256KB block, that's potentially 64x amplification in the worst case. The [[FTL]] (Flash Translation Layer) firmware works hard to minimize this through clever algorithms.
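
The worst case in that example falls directly out of the page and block sizes quoted above:

```python
# Worst-case write amplification for a naive in-place update:
# the host writes one 4 KiB page, but the SSD must rewrite a 256 KiB block.

PAGE_BYTES = 4 * 1024
BLOCK_BYTES = 256 * 1024

amplification = BLOCK_BYTES / PAGE_BYTES
print(f"worst-case write amplification: {amplification:.0f}x")  # 64x
```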

[[Wear leveling]] addresses another critical issue: flash cells wear out after a limited number of program/erase cycles. If the same cells were written repeatedly (like the file system metadata area), they'd die in weeks. The FTL spreads writes evenly across all cells, ensuring uniform wear. This is why SSDs report "total bytes written" lifetime ratings.
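
A real FTL's wear leveling is far more involved, but the core idea can be sketched in a few lines: direct each new write to the least-worn block so erase counts stay uniform. Everything here (class name, block count) is illustrative, not a real controller algorithm:

```python
class WearLeveler:
    """Toy wear-leveling sketch: greedy least-worn block selection."""

    def __init__(self, n_blocks):
        self.erase_counts = [0] * n_blocks

    def pick_block(self):
        # Choose the block with the fewest erases so far, then charge
        # it one erase cycle for the incoming write.
        block = min(range(len(self.erase_counts)),
                    key=self.erase_counts.__getitem__)
        self.erase_counts[block] += 1
        return block

wl = WearLeveler(4)
for _ in range(8):
    wl.pick_block()
print(wl.erase_counts)  # [2, 2, 2, 2] - wear stays even
```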

SSD architecture

A modern SSD isn't just flash chips - it's a sophisticated computer in its own right. The controller is a powerful processor running complex firmware that manages the flash translation layer, wear leveling, garbage collection, and error correction. There's also DRAM cache (for metadata and write buffering) and multiple flash channels for parallelism.

SSD Internal Architecture:

┌─────────────────────────────────────────────────┐
│                 Host Interface                  │
│             (SATA, NVMe over PCIe)              │
├─────────────────────────────────────────────────┤
│                 SSD Controller                  │
│  ┌───────────────────────────────────────────┐  │
│  │ • Flash Translation Layer (FTL)           │  │
│  │ • Wear Leveling algorithms                │  │
│  │ • Garbage Collection                      │  │
│  │ • Error Correction (LDPC codes)           │  │
│  │ • Encryption engine (AES)                 │  │
│  └───────────────────────────────────────────┘  │
├─────────────────────────────────────────────────┤
│              DRAM Cache (256MB-4GB)             │
│          (mapping table, write buffer)          │
├──────────┬──────────┬──────────┬────────────────┤
│ Channel 0│ Channel 1│ Channel 2│ ... Channel N  │
├──────────┼──────────┼──────────┼────────────────┤
│ Flash Die│ Flash Die│ Flash Die│ Flash Die      │
│ Flash Die│ Flash Die│ Flash Die│ Flash Die      │
│ Flash Die│ Flash Die│ Flash Die│ Flash Die      │
└──────────┴──────────┴──────────┴────────────────┘

Multiple channels enable parallel access.
8-16 channels is typical for high-end SSDs.
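
One simple way a controller can exploit those channels is round-robin striping, so consecutive logical pages land on different channels and a sequential read can hit all of them at once. This is an assumed illustration; real FTLs use more elaborate mappings:

```python
# Round-robin page-to-channel striping across 8 channels (assumed).

N_CHANNELS = 8

def channel_for_page(page):
    # Consecutive pages rotate through the channels.
    return page % N_CHANNELS

print([channel_for_page(p) for p in range(10)])  # [0, 1, ..., 7, 0, 1]
```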

NVMe: unleashing flash potential

Early SSDs connected via SATA, an interface designed for spinning disks. SATA's protocol has high overhead and limited queue depth (32 commands), topping out around 550 MB/s. The interface became the bottleneck long before the flash chips were maxed out.

NVMe (Non-Volatile Memory Express) was designed from scratch for flash storage, connecting directly to PCIe lanes. It supports up to 65,536 queues with 65,536 commands each, adds minimal protocol overhead, and lets multiple CPU cores submit commands in parallel. Modern NVMe SSDs hit 7,000+ MB/s sequential and over 1 million random IOPS - more than 12x SATA's ceiling.
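
The queue-depth difference alone shows the scale of the redesign, using the command counts from the paragraph above:

```python
# Maximum outstanding commands: SATA's single 32-entry NCQ queue
# vs NVMe's 65,536 queues of 65,536 commands each.

sata = 1 * 32
nvme = 65536 * 65536
print(f"SATA: {sata:,} in flight; NVMe: {nvme:,} in flight")
```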

Metric              HDD (7200 RPM)   SATA SSD    NVMe SSD (Gen4)   NVMe SSD (Gen5)
Sequential Read     150 MB/s         550 MB/s    7,000 MB/s        12,000 MB/s
Sequential Write    150 MB/s         520 MB/s    5,000 MB/s        10,000 MB/s
Random Read IOPS    100              90,000      1,000,000         1,500,000
Random Write IOPS   100              80,000      800,000           1,200,000
Latency             5-10 ms          0.1 ms      0.02 ms           0.015 ms
Power (active)      6-8 W            2-3 W       5-8 W             7-10 W

The future: denser and faster

3D NAND stacks flash cells vertically - over 200 layers in some current designs, with 300+ layer parts coming. This dramatically increases density without shrinking cell size (which would hurt reliability). Instead of cramming more cells into a flat plane, we build skyscrapers of storage. Samsung, Micron, and SK Hynix are racing to stack higher.

QLC (4 bits per cell) and even PLC (5 bits per cell) push capacity further, though at the cost of endurance and speed. These work well for read-heavy workloads like media storage. Sophisticated SLC caching (using a portion of the drive as fast SLC, then migrating to denser modes) helps maintain burst write performance.

Beyond flash, emerging technologies promise even better characteristics. Intel's Optane (3D XPoint) offered byte-addressability and lower latency than NAND, bridging the gap between DRAM and storage. Computational storage puts processing directly on the SSD. And researchers are exploring DNA storage (incredibly dense, stable for millennia, but very slow) and holographic storage for the future.

How Things Work - A Visual Guide to Technology