Squeezing data smaller

Compression algorithms


The compression paradox

Here's something that sounds impossible: take a file, make it smaller, then later reconstruct the original file perfectly. No information lost. How can you store the same information in less space? The answer lies in a fundamental truth about most real-world data: it's not random. It has patterns, redundancy, and structure that we can exploit.

Consider this sentence: "the the the the the". In plain ASCII, that's 19 bytes. But I could encode it as "5×'the '", which is only 8 characters. Same information, less space. That's compression in a nutshell. The key insight is that [[entropy]] - the true information content - is usually much lower than the raw data size suggests.

Claude Shannon, the father of information theory, proved in 1948 that there's a theoretical minimum size for any data based on its entropy. This is the Shannon limit, and it tells us exactly how compressible any data can be. Remarkably, modern compression algorithms get within a few percent of this theoretical limit. We're essentially squeezing data as small as the mathematics allows.
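Shannon's entropy formula is easy to compute for a concrete string. This sketch treats each character independently (a "zero-order" model) and reports the average bits per symbol an optimal encoder would need, compared to ASCII's fixed 8:

```python
import math
from collections import Counter

def shannon_entropy(data):
    """Average bits per symbol needed by an optimal encoder
    that treats each symbol independently: H = -sum(p * log2(p))."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = "the the the the the"
print(f"{shannon_entropy(text):.2f} bits/char vs 8 bits/char in ASCII")
```

For the repetitive sentence above this comes out near 2 bits per character - and even that overstates the true information content, since a zero-order model can't see that the same word repeats.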

Run-length encoding: the simplest trick

The most intuitive compression algorithm is run-length encoding (RLE). When you see the same value repeated, just store the value and the count. It's dead simple and works beautifully for certain types of data - particularly images with large areas of solid color.

Original: AAAAAABBBCCCCCCCCDDDD
RLE encoded: 6A3B8C4D

Original bytes: 21
Compressed bytes: 8
Space saved: 62% (8 / 21 ≈ 38% of the original size)

But watch what happens with varied data:
Original: ABCDEFGHIJK
RLE encoded: 1A1B1C1D1E1F1G1H1I1J1K

Compressed bytes: 22 (WORSE than original!)

RLE only helps when there are actual runs.
It can even hurt on data without repetition.

RLE is still used today in places where runs are common: simple graphics formats like BMP, fax machines (which transmit mostly white with occasional black), and as a preprocessing step in more complex algorithms. JPEG uses RLE after quantization because many DCT coefficients become zero. The PCX image format is essentially just RLE. It's fast and requires almost no memory, which matters in embedded systems.

The lesson from RLE is important: compression algorithms make assumptions about the data. RLE assumes there are runs. When the assumption holds, compression is great. When it doesn't, you can actually make the data larger. Good compression algorithms adapt to the data rather than assuming a fixed structure.
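The whole scheme fits in a few lines. This is a sketch of the textbook count-prefix form used in the examples above; note it would break on input that itself contains digits, which is why real formats like PCX use a binary escape byte instead:

```python
def rle_encode(data):
    """Run-length encode a string as count-prefixed runs, e.g. '6A3B'.
    Toy scheme: assumes the input contains no digit characters."""
    out = []
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i]:
            run += 1
        out.append(f"{run}{data[i]}")
        i += run
    return "".join(out)

def rle_decode(encoded):
    """Invert rle_encode: read digits as a count, then repeat the char."""
    out, count = [], ""
    for ch in encoded:
        if ch.isdigit():
            count += ch
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)

print(rle_encode("AAAAAABBBCCCCCCCCDDDD"))  # 6A3B8C4D
print(rle_encode("ABCDEFGHIJK"))            # 1A1B1C... longer than the input!
```

Running it on both examples shows the failure mode directly: the varied string round-trips correctly but the "compressed" form is twice the size.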

Huffman coding: shorter codes for common symbols

In standard ASCII, every character takes 8 bits. But if 'e' appears 100 times and 'z' appears once, shouldn't 'e' get a shorter code? That's exactly what [[Huffman coding]] does. It builds a binary tree where more frequent symbols are closer to the root, giving them shorter codes.

Example: "ABRACADABRA" (A ×5, B ×2, R ×2, C ×1, D ×1)

Generated codes:
A → 0
C → 100
D → 101
B → 110
R → 111

Original: 88 bits (11 chars × 8 bits)
Compressed: 23 bits (5×1 + 1×3 + 1×3 + 2×3 + 2×3)
Saved: 73.9%

The algorithm is elegant: start with each symbol as a leaf node weighted by its frequency. Repeatedly combine the two lowest-frequency nodes into a parent (whose weight is their sum). Continue until only one node remains - that's the root. Then assign 0 to left branches and 1 to right branches. The result is a prefix-free code - no code is a prefix of another - which means you can decode unambiguously without needing separators between symbols.

import heapq

def build_huffman_tree(text):
    """Build Huffman tree and return code mapping"""
    # Count frequencies
    freq = {}
    for char in text:
        freq[char] = freq.get(char, 0) + 1
    
    # Create heap of (frequency, id, node) tuples
    # id is used to break ties consistently
    heap = [(f, i, char) for i, (char, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    
    counter = len(heap)
    while len(heap) > 1:
        # Pop two smallest
        freq1, _, node1 = heapq.heappop(heap)
        freq2, _, node2 = heapq.heappop(heap)
        
        # Combine into new internal node
        combined = (freq1 + freq2, counter, (node1, node2))
        counter += 1
        heapq.heappush(heap, combined)
    
    # Build code table by traversing tree
    codes = {}
    def traverse(node, code=""):
        if isinstance(node, str):
            codes[node] = code or "0"  # Handle single-char case
        else:
            traverse(node[0], code + "0")
            traverse(node[1], code + "1")
    
    traverse(heap[0][2])
    return codes

# Example: "AAAAABBC" → A: '1', B: '01', C: '00'
# (exact codes depend on tie-breaking; the lengths are what matter)
# 5 A's = 5 bits, 2 B's = 4 bits, 1 C = 2 bits = 11 bits total
# vs 8 chars × 8 bits = 64 bits uncompressed

Huffman coding is optimal for symbol-by-symbol coding - you can't do better if you must encode each symbol independently. However, it requires transmitting the code table along with the data (overhead), and it can't exploit correlations between symbols. Still, it's used in JPEG, MP3, ZIP, and countless other formats, usually combined with other techniques.
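The prefix-free property is what makes decoding so simple: accumulate bits until they match a code, emit the symbol, repeat. A sketch, using one valid Huffman table for "AAAAABBC" (any prefix-free table works the same way):

```python
def huffman_decode(bits, codes):
    """Decode a bit string by accumulating bits until they match a code.
    Unambiguous because no code is a prefix of another."""
    inverse = {code: char for char, code in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

# One valid Huffman code for "AAAAABBC" (codes depend on tie-breaking):
codes = {"A": "1", "B": "01", "C": "00"}
bits = "11111010100"  # AAAAA + BB + C = 5 + 4 + 2 = 11 bits
print(huffman_decode(bits, codes))  # AAAAABBC
```

No separators are needed between symbols - the code structure itself tells the decoder where each symbol ends.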

Arithmetic coding: beyond Huffman

Huffman coding has a limitation: it must assign an integer number of bits to each symbol. If a symbol's optimal code length is 2.3 bits, Huffman rounds to 2 or 3 bits. [[Arithmetic coding]] breaks this barrier by encoding the entire message as a single number between 0 and 1.

The idea is to divide the interval [0, 1) according to symbol probabilities. If 'A' has probability 0.5 and 'B' has probability 0.5, then 'A' maps to [0, 0.5) and 'B' maps to [0.5, 1). Each subsequent symbol subdivides the current interval. After encoding all symbols, any number in the final interval uniquely identifies the message.

Encoding "BAA" with P(A)=0.5, P(B)=0.5:

Start: [0, 1)
After 'B': [0.5, 1)
After 'A': [0.5, 0.75)   ← A is lower half of [0.5, 1)
After 'A': [0.5, 0.625)  ← A is lower half of [0.5, 0.75)

Any number in [0.5, 0.625) encodes "BAA"
We might output 0.5 or 0.5625 (binary: 0.1001)

Key insight: More probable symbols shrink the interval less,
effectively using fewer bits. The math works out to exactly
the entropy!

Arithmetic coding achieves compression within 1-2 bits of the theoretical entropy limit, beating Huffman for many data types. It's used in JPEG 2000, H.264/H.265 video codecs, and modern compression libraries. The main downside is that it's more computationally expensive and requires careful implementation to avoid numerical precision issues.
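The interval-narrowing process can be sketched in a few lines. This toy version uses floats and just returns the final interval (production codecs use integer arithmetic to avoid exactly the precision problems mentioned above):

```python
def arithmetic_encode(message, probs):
    """Toy arithmetic encoder. Returns the final interval [low, high);
    any number inside it uniquely identifies the message.
    Real implementations use integer arithmetic and emit bits incrementally."""
    # Assign each symbol a sub-interval of [0, 1) by cumulative probability
    cum, start = {}, 0.0
    for sym, p in probs.items():
        cum[sym] = start
        start += p
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        high = low + width * (cum[sym] + probs[sym])
        low = low + width * cum[sym]
    return low, high

low, high = arithmetic_encode("BAA", {"A": 0.5, "B": 0.5})
print(low, high)  # 0.5 0.625, matching the walkthrough above
```

Decoding reverses the process: see which symbol's sub-interval contains the number, emit that symbol, rescale, and repeat.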

LZ77: the algorithm behind ZIP

Huffman and arithmetic coding exploit symbol frequency, but what about repeated phrases? The sentence "to be or not to be" has "to be" twice. LZ77, invented by Abraham Lempel and Jacob Ziv in 1977, handles this with a sliding window approach that forms the foundation of almost all modern compression.

The idea is beautifully simple: as you scan through the data, look backwards in a "window" of recently seen data (typically 32KB to 64KB). If you find a match, output a reference (distance back, length) instead of the literal bytes. This is called [[dictionary coding]] because the window acts as an implicit dictionary of recently seen patterns.

Input: "ABRACADABRA"

Position 0: A - no match found, output literal 'A'
Position 1: B - no match, output literal 'B'
Position 2: R - no match, output literal 'R'
Position 3: A - match 'A' at distance 3! But length 1 isn't worth it
            output literal 'A'
Position 4: C - no match, output literal 'C'
Position 5: A - no match, output literal 'A'
Position 6: D - no match, output literal 'D'
Position 7: ABRA - match! "ABRA" at distance 7, length 4
            output (7, 4) meaning "go back 7, copy 4 bytes"

Encoded: A B R A C A D (7,4)

The longer the match, the better the compression!
This is why ZIP works well on text (many repeated words)
but poorly on random data.
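The trace above can be reproduced with a minimal greedy matcher. This is a deliberately naive sketch - it scans the whole window for every position, whereas real implementations like DEFLATE use hash chains and require matches of at least 3 bytes:

```python
def lz77_encode(data, window=32 * 1024, min_match=2):
    """Greedy LZ77: emit literal characters or (distance, length)
    back-references into the sliding window of previous data.
    Toy version with brute-force match search."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            out.append((best_dist, best_len))
            i += best_len
        else:
            out.append(data[i])
            i += 1
    return out

print(lz77_encode("ABRACADABRA"))
# ['A', 'B', 'R', 'A', 'C', 'A', 'D', (7, 4)]
```

Note that the single-character matches at positions 3, 5, and 6 are rejected as not worth a reference, exactly as in the trace.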

DEFLATE, the algorithm used in ZIP files, gzip, and PNG images, combines LZ77 with Huffman coding. First it finds repeated patterns with LZ77, then it Huffman-encodes both the literals and the distance/length pairs. This two-stage approach is remarkably effective across many data types and remains the workhorse of compression 40+ years later.

Modern variants like LZ4 optimize for speed (decompression at 4+ GB/s), while Zstandard (zstd) achieves better compression with faster speeds than gzip by using larger dictionaries, entropy coding improvements (finite state entropy instead of Huffman), and better match finding. LZMA (used in 7-Zip and xz) goes even further with range coding and sophisticated context modeling.

Why some things don't compress

Try compressing a ZIP file with ZIP. Or compress random noise. You'll find it doesn't get smaller - it might even get bigger due to metadata overhead. This is because already-compressed data and random data have high [[entropy]] - there are no patterns left to exploit. This is fundamental, not a limitation of the algorithm.

This leads to an important principle: compression ratio depends entirely on the data. English text might compress to 20% of original size because of highly predictable patterns (after 'q' comes 'u', spaces follow words). A database with lots of repeated values might compress to 5%. Random bytes? 100% or more. There's no universal "50% compression" - it varies wildly with the data's inherent entropy.

Data Type            Typical Ratio   Why
English text         20-30%          High redundancy, predictable patterns
Source code          15-25%          Very repetitive structure, keywords
Log files            5-15%           Extremely repetitive timestamps, formats
Databases            10-30%          Repeated values, structured data
PNG images           40-80%          Already uses lossless compression
Already compressed   ~100%           Entropy already maximized
Random bytes         >100%           No patterns, overhead makes it bigger
Encrypted data       >100%           Designed to look random

Modern compression: context and prediction

Modern algorithms like LZMA and Zstandard go beyond simple pattern matching. They use multiple techniques together: large dictionaries, sophisticated match finding with hash chains and binary trees, and context modeling that predicts what byte is likely to come next based on recent context.

The key insight is that the better you can predict the next byte, the less information you need to encode it. If the model is 99% sure the next letter is 'e', you only need a tiny fraction of a bit to confirm or correct that prediction. This is why context matters: after 'th', the letter 'e' is much more likely than 'x'.
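The "tiny fraction of a bit" claim is just Shannon's formula again: an outcome the model assigns probability p costs -log2(p) bits under an optimal coder such as an arithmetic coder. A quick calculation:

```python
import math

def bits_needed(p):
    """Bits an optimal coder spends on an outcome of probability p."""
    return -math.log2(p)

# A model that's 99% sure of the next letter pays almost nothing
# when it's right, but pays heavily when it's wrong:
print(f"{bits_needed(0.99):.4f}")  # ~0.0145 bits to confirm 'e'
print(f"{bits_needed(0.01):.4f}")  # ~6.64 bits to encode a surprise
```

Averaged over the whole stream, a better predictor means a smaller expected cost per byte - which is why compression quality is, at bottom, prediction quality.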

This connection between compression and prediction is profound. A perfect compressor would be a perfect predictor. This is why large language models like GPT can be viewed as compression algorithms - they're modeling the probability distribution of text, which is exactly what's needed for optimal compression. In fact, researchers have shown that language models achieve near-optimal compression ratios on text.

How Things Work - A Visual Guide to Technology