Generating pixels with AI

Diffusion models and GANs


Teaching machines to create images seemed impossible for decades - the space of possible images is vast, and quality requires capturing intricate details while maintaining global coherence. Then came GANs, which learned to generate images through adversarial training. Now diffusion models produce even more stunning results by learning to reverse the process of adding noise. This chapter explores both approaches.

Generative Adversarial Networks

GANs, introduced by Ian Goodfellow and colleagues in 2014, frame image generation as a game between two networks. The generator creates fake images from random noise. The discriminator tries to distinguish real images from fakes. As each improves, the generator produces increasingly realistic images.

The generator transforms a random vector, typically 100-512 dimensions of Gaussian noise, into an image through a series of upsampling layers. It never sees real images directly - its only feedback is whether it fooled the discriminator. This indirect learning signal drives surprisingly effective generation.

The generator architecture typically uses transposed convolutions (sometimes loosely called deconvolutions) to progressively upsample from a small spatial resolution to the full image size. Starting from a 4×4 feature map, each layer doubles the resolution: 4→8→16→32→64→128→256. Batch normalization and ReLU activations stabilize training.
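The upsampling stack described above can be sketched in PyTorch. This is a minimal DCGAN-style illustration rather than a specific published architecture: the class name `TinyGenerator`, the channel widths, and the 64×64 output size are assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    """Hypothetical DCGAN-style generator: latent vector -> 64x64 RGB image."""

    def __init__(self, latent_dim=100, base_channels=64):
        super().__init__()
        c = base_channels

        def up(in_ch, out_ch):
            # kernel 4, stride 2, padding 1 doubles the spatial resolution
            return nn.Sequential(
                nn.ConvTranspose2d(in_ch, out_ch, 4, 2, 1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )

        self.net = nn.Sequential(
            # 1x1 -> 4x4: project the latent vector to a small feature map
            nn.ConvTranspose2d(latent_dim, c * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(c * 8),
            nn.ReLU(inplace=True),
            up(c * 8, c * 4),                    # 4 -> 8
            up(c * 4, c * 2),                    # 8 -> 16
            up(c * 2, c),                        # 16 -> 32
            nn.ConvTranspose2d(c, 3, 4, 2, 1),   # 32 -> 64
            nn.Tanh(),                           # pixel values in [-1, 1]
        )

    def forward(self, z):
        # Treat the latent vector as a 1x1 feature map with latent_dim channels
        return self.net(z.view(z.size(0), -1, 1, 1))

z = torch.randn(2, 100)
images = TinyGenerator()(z)
print(images.shape)  # torch.Size([2, 3, 64, 64])
```

Reaching 256×256 would take two more doubling blocks of the same form.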

StyleGAN revolutionized GAN architecture by injecting the latent code at multiple layers rather than just the beginning. A mapping network transforms the initial random vector into an intermediate latent space, and this style vector is injected at each resolution through adaptive instance normalization. This separation of content and style enables remarkable control over generated images.
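Adaptive instance normalization itself is a short operation. The sketch below assumes the per-channel scale and bias have already been produced from the style vector by a learned linear layer; the function name `adain` and the tensor sizes are illustrative.

```python
import torch

def adain(features, style_scale, style_bias, eps=1e-5):
    """Normalize each channel of each sample, then apply style-derived scale and bias."""
    mean = features.mean(dim=(2, 3), keepdim=True)
    std = features.std(dim=(2, 3), keepdim=True) + eps
    normalized = (features - mean) / std
    return style_scale[:, :, None, None] * normalized + style_bias[:, :, None, None]

torch.manual_seed(0)
features = torch.randn(2, 8, 16, 16)   # (batch, channels, height, width)
style_scale = torch.rand(2, 8) + 0.5   # stand-in for values derived from the style vector
style_bias = torch.randn(2, 8)

out = adain(features, style_scale, style_bias)
# After AdaIN, each channel's mean is set by the style bias
print(torch.allclose(out.mean(dim=(2, 3)), style_bias, atol=1e-4))
```

Because the feature statistics are overwritten at every resolution, the style vector decides them independently at each scale.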

[Interactive figure: Diffusion process visualization. Noise gradually transforms into a coherent image through iterative denoising. The forward process (training) adds Gaussian noise step by step from x_0 (clean) to x_T (noise), and the model learns to predict that noise; the reverse process (sampling) starts from random noise and iteratively removes the predicted noise to end with a clean sample.]

The discriminator's role

The discriminator is a binary classifier that receives an image - either real from the training set or fake from the generator - and outputs a probability that it is real. It learns from labeled examples: real images should produce high probabilities, generated images should produce low probabilities.

As the generator improves, the discriminator must become more sophisticated to detect fakes. This arms race drives both networks toward better performance. The discriminator becomes an expert at detecting the subtle artifacts that distinguish generated images from real ones.

The discriminator typically mirrors the generator architecture in reverse - a series of downsampling convolutions that progressively reduce spatial resolution while increasing channel depth. Strided convolutions are preferred over pooling to maintain spatial information. LeakyReLU activations (slope 0.2 for negative values) are standard.

PatchGAN discriminators classify whether each NxN patch of an image is real or fake, rather than classifying the whole image. This focuses the discriminator on local texture quality and enables efficient training on high-resolution images. The patches overlap, and the discriminator's output is averaged across all patches.

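import torch
import torch.nn.functional as F

def train_gan_step(generator, discriminator, real_images, noise):
    """Compute both losses for a single GAN training step.

    Assumes the discriminator ends in a sigmoid, so its outputs are
    probabilities. In a full loop, each loss is backpropagated through
    its own optimizer.
    """
    # Generate fake images
    fake_images = generator(noise)

    # Train discriminator: real images are labeled 1, fakes 0.
    # detach() keeps the discriminator loss from updating the generator.
    real_pred = discriminator(real_images)
    fake_pred = discriminator(fake_images.detach())

    d_loss_real = F.binary_cross_entropy(real_pred, torch.ones_like(real_pred))
    d_loss_fake = F.binary_cross_entropy(fake_pred, torch.zeros_like(fake_pred))
    d_loss = d_loss_real + d_loss_fake

    # Train generator (fool discriminator): label the fakes as real
    fake_pred = discriminator(fake_images)
    g_loss = F.binary_cross_entropy(fake_pred, torch.ones_like(fake_pred))

    return d_loss, g_loss

# Alternative: Wasserstein loss for more stable training
def wasserstein_loss(real_pred, fake_pred):
    """WGAN critic loss - no log, no saturation.

    real_pred and fake_pred are unbounded critic scores, not probabilities.
    The matching generator loss is -fake_pred.mean().
    """
    return fake_pred.mean() - real_pred.mean()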

The minimax game

GAN training optimizes a minimax objective: the discriminator maximizes its ability to distinguish real from fake, while the generator minimizes the discriminator's success. Mathematically: min_G max_D E[log D(x)] + E[log(1 - D(G(z)))].

At equilibrium, the generator produces images indistinguishable from real data, and the discriminator outputs 0.5 for everything - unable to tell the difference. In practice, this equilibrium is hard to achieve and training often oscillates or collapses.

The GAN objective is a two-player zero-sum game. Game theory tells us such games have a Nash equilibrium, but finding it through gradient descent is not guaranteed. The players alternate updates, and the loss landscape shifts each step. What was a good generator move becomes bad when the discriminator adapts.

In practice, the generator's loss is often modified. Instead of minimizing log(1 - D(G(z))), which has vanishing gradients when the discriminator wins, we maximize log(D(G(z))). This non-saturating loss provides stronger gradients early in training when the generator is poor.
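The difference between the two generator losses is easy to see numerically. The sketch below assumes the discriminator output is a sigmoid of a logit a, so D(G(z)) = sigmoid(a), and compares the gradient of each loss with respect to that logit; the function names are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the original generator loss."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the non-saturating generator loss."""
    return -(1.0 - sigmoid(a))

# Negative logits mean the discriminator confidently rejects the fake
for logit in (-6.0, -2.0, 0.0, 2.0):
    d = sigmoid(logit)
    print(f"D(G(z))={d:.4f}  saturating={grad_saturating(logit):+.4f}  "
          f"non-saturating={grad_non_saturating(logit):+.4f}")
```

When the discriminator is winning (D(G(z)) near 0), the saturating gradient is close to zero while the non-saturating gradient is close to -1, which is why the modified loss helps early in training.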

Mode collapse

The most notorious GAN failure mode is mode collapse, where the generator produces only a few distinct outputs instead of diverse samples. The generator finds a few images that reliably fool the discriminator and exploits them, ignoring the rest of the data distribution.

Various techniques combat mode collapse. Minibatch discrimination lets the discriminator see batches of images, detecting when they are too similar. Unrolled GANs give the generator foresight into discriminator updates. Progressive growing starts with low resolution and gradually increases detail, which mainly stabilizes training at high resolution.

Spectral normalization constrains the discriminator's Lipschitz constant by normalizing weight matrices by their largest singular value. This prevents the discriminator from becoming too powerful too quickly, maintaining useful gradients for the generator. It's now standard in most GAN architectures.
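The normalization step can be sketched with power iteration, which estimates the largest singular value without a full decomposition. This is an illustration on a single 2-D matrix; the function name `spectral_normalize` and the iteration count are assumptions. PyTorch also ships a built-in wrapper, `torch.nn.utils.spectral_norm`, for use inside real models.

```python
import torch
import torch.nn.functional as F

def spectral_normalize(weight, n_iters=50):
    """Divide a 2-D weight matrix by an estimate of its largest singular value."""
    u = torch.randn(weight.size(0))
    for _ in range(n_iters):
        # Power iteration alternates between right and left singular vectors
        v = F.normalize(weight.t() @ u, dim=0)
        u = F.normalize(weight @ v, dim=0)
    sigma = u @ weight @ v  # estimated largest singular value
    return weight / sigma

torch.manual_seed(0)
w = torch.randn(64, 128)
w_sn = spectral_normalize(w)

# The spectral norm (largest singular value) of the result is close to 1
print(torch.linalg.matrix_norm(w_sn, ord=2))
```

Practical implementations typically run one power-iteration step per training update and reuse the vector u between updates.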

Wasserstein GANs (WGANs) use the Earth Mover's distance instead of Jensen-Shannon divergence, providing smoother gradients that correlate better with image quality. The discriminator becomes a 'critic' that estimates Wasserstein distance rather than classifying real vs fake. This estimate is only valid if the critic is 1-Lipschitz; the original WGAN enforced this by clipping weights, and WGAN-GP replaces clipping with a gradient penalty.
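A gradient penalty can be sketched as follows. The function name `gradient_penalty` and the toy critic are assumptions for illustration; in training, the penalty is multiplied by a coefficient and added to the critic loss.

```python
import torch
import torch.nn.functional as F

def gradient_penalty(critic, real, fake):
    """Push the critic's gradient norm toward 1 on points between real and fake samples."""
    real, fake = real.detach(), fake.detach()
    mix = torch.rand(real.size(0), 1, 1, 1)
    interpolated = (mix * real + (1 - mix) * fake).requires_grad_(True)
    scores = critic(interpolated)
    # create_graph=True lets the penalty itself be backpropagated through
    (grads,) = torch.autograd.grad(scores.sum(), interpolated, create_graph=True)
    grad_norm = grads.flatten(1).norm(dim=1)
    return ((grad_norm - 1.0) ** 2).mean()

# Toy critic: a linear function with a unit-norm weight has gradient norm 1 everywhere
torch.manual_seed(0)
w = F.normalize(torch.randn(3 * 8 * 8), dim=0)
critic = lambda x: x.flatten(1) @ w
real, fake = torch.randn(4, 3, 8, 8), torch.randn(4, 3, 8, 8)

penalty = gradient_penalty(critic, real, fake)
print(penalty)  # approximately 0 for this critic
```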

Diffusion models: A different approach

Diffusion models take a radically different approach. Instead of learning to generate from scratch, they learn to reverse a gradual noising process. Start with a real image, add small amounts of Gaussian noise over many steps until it becomes pure noise, then train a network to reverse each step.

The forward diffusion process is fixed - we know exactly how noise is added at each step. The network learns the reverse process: given a noisy image and the current noise level, predict the slightly less noisy version. Chaining many small denoising steps reconstructs clean images.

The forward process is a Markov chain that gradually adds Gaussian noise. At step t, we have: x_t = √(α_t) × x_{t-1} + √(1-α_t) × ε, where ε is standard Gaussian noise and α_t follows a noise schedule. After T steps (typically 1000), x_T is nearly pure Gaussian noise.

A key insight enables efficient training: we can jump directly from x_0 to any x_t without computing intermediate steps. Let ᾱ_t = ∏_{s=1}^{t} α_s (cumulative product). Then x_t = √ᾱ_t × x_0 + √(1-ᾱ_t) × ε. This allows training on arbitrary timesteps without sequential computation.
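The closed form can be checked numerically by tracking the coefficient on x_0 and the accumulated noise variance one step at a time. The linear schedule endpoints below (1e-4 and 0.02) are an assumed, commonly used choice, not something fixed by the text.

```python
import math

T = 1000
# Assumed linear beta schedule, with alpha_t = 1 - beta_t
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]

signal, noise_var, alpha_bar = 1.0, 0.0, 1.0
for a in alphas:
    signal *= math.sqrt(a)                  # coefficient on x_0 after this step
    noise_var = a * noise_var + (1.0 - a)   # variance of the accumulated Gaussian noise
    alpha_bar *= a                          # cumulative product

print(signal, math.sqrt(alpha_bar))   # step-by-step and closed form agree
print(noise_var, 1.0 - alpha_bar)     # likewise for the noise variance
print(alpha_bar)                      # very small: x_T is almost pure noise
```

Independent Gaussian noise terms add in variance, which is why T separate noise draws collapse into a single ε in the closed form.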

Denoising Diffusion Probabilistic Models

DDPMs formalize diffusion as a Markov chain. The forward process adds Gaussian noise: q(x_t | x_{t-1}) = N(x_t; √(1-β_t) x_{t-1}, β_t I). With α_t = 1 - β_t, this is the same update as x_t = √(α_t) × x_{t-1} + √(1-α_t) × ε. After enough steps, the image becomes indistinguishable from random noise. The variance schedule β_t controls how fast noise accumulates.

The reverse process learns to denoise: p_θ(x_{t-1} | x_t). A neural network, typically a U-Net, predicts the noise that was added. Subtracting the predicted noise yields a cleaner image. The loss function is simply mean squared error between predicted and actual noise.

The noise schedule critically affects generation quality. Linear schedules increase β_t linearly with t; for images this destroys information quickly, so many of the late, heavily noised steps contribute little. Cosine schedules change ᾱ_t slowly at both ends and roughly linearly in between, spending more steps in the intermediate regime where details emerge. Learned schedules can adapt to the data distribution.
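The two schedules can be compared through the cumulative signal level ᾱ_t they leave at each step. The sketch uses the cosine form ᾱ_t = f(t)/f(0) with f(t) = cos²(((t/T + s)/(1 + s)) × π/2) and offset s = 0.008; the linear endpoints 1e-4 and 0.02 are an assumed, commonly used choice.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level for the cosine schedule, normalized so the value at t=0 is 1."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def linear_alpha_bar(t, T, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level after t steps of a linear beta schedule."""
    level = 1.0
    for i in range(t):
        level *= 1.0 - (beta_start + (beta_end - beta_start) * i / (T - 1))
    return level

T = 1000
for t in (0, 250, 500, 750, 1000):
    print(t, round(linear_alpha_bar(t, T), 4), round(cosine_alpha_bar(t, T), 4))
```

Halfway through, the linear schedule has already removed most of the signal, while the cosine schedule still retains roughly half of it.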

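Three equivalent parameterizations exist: predicting the noise ε, predicting the clean image x_0, or predicting the 'velocity' v = √ᾱ_t × ε - √(1-ᾱ_t) × x_0. They are equivalent in that any one can be converted into the others given x_t and ᾱ_t, but they weight errors differently across noise levels. Noise prediction works well at low and moderate noise but becomes unstable near pure noise, where recovering x_0 from ε divides by a very small √ᾱ_t; x_0 prediction behaves better at high noise. Velocity prediction interpolates between the two and often trains stably across the whole range.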
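Because x_t, x_0, ε and v are linearly related, a velocity prediction can be turned back into the other two quantities. The sketch below checks the conversion on random tensors; the function names are illustrative.

```python
import torch

def to_velocity(x_0, eps, alpha_bar):
    """v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x_0"""
    a, s = alpha_bar.sqrt(), (1 - alpha_bar).sqrt()
    return a * eps - s * x_0

def from_velocity(x_t, v, alpha_bar):
    """Recover (x_0, eps) from the noisy sample x_t and the velocity v."""
    a, s = alpha_bar.sqrt(), (1 - alpha_bar).sqrt()
    x_0 = a * x_t - s * v
    eps = s * x_t + a * v
    return x_0, eps

torch.manual_seed(0)
x_0 = torch.randn(4, 3, 8, 8)
eps = torch.randn_like(x_0)
alpha_bar = torch.tensor(0.3)

x_t = alpha_bar.sqrt() * x_0 + (1 - alpha_bar).sqrt() * eps
v = to_velocity(x_0, eps, alpha_bar)
x_0_rec, eps_rec = from_velocity(x_t, v, alpha_bar)

print(torch.allclose(x_0_rec, x_0, atol=1e-5), torch.allclose(eps_rec, eps, atol=1e-5))
```

Neither conversion divides by √ᾱ_t or √(1-ᾱ_t), so the velocity form stays well-conditioned at both ends of the schedule.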

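import torch
import torch.nn.functional as F

# Noise schedule. A linear schedule with these endpoints is assumed here;
# the functions below work with any schedule of length T.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_cumprods = torch.cumprod(alphas, dim=0)

def diffusion_loss(model, x_0, t, noise=None):
    """Training loss for diffusion model. t is a LongTensor of shape (batch,)."""
    if noise is None:
        noise = torch.randn_like(x_0)

    # Per-sample coefficients, reshaped to broadcast over (C, H, W)
    a_bar = alpha_cumprods.to(x_0.device)[t].view(-1, 1, 1, 1)

    # Add noise to create x_t
    x_t = a_bar.sqrt() * x_0 + (1 - a_bar).sqrt() * noise

    # Predict the noise
    predicted_noise = model(x_t, t)

    # Simple MSE loss
    return F.mse_loss(predicted_noise, noise)

@torch.no_grad()
def sample(model, shape, num_steps=T):
    """Generate image by iterative denoising. num_steps must equal the schedule length."""
    x = torch.randn(shape)  # Start from pure noise

    for t in reversed(range(num_steps)):
        # Predict noise at this step; the model receives one timestep per sample
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        predicted_noise = model(x, t_batch)

        alpha = alphas[t]
        alpha_cumprod = alpha_cumprods[t]
        beta = betas[t]

        # Compute the mean of x_{t-1} from x_t and predicted noise
        x = (x - beta / torch.sqrt(1 - alpha_cumprod) * predicted_noise) / torch.sqrt(alpha)

        # Add fresh noise on every step except the last
        if t > 0:
            x = x + torch.sqrt(beta) * torch.randn_like(x)

    return x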

Why diffusion works so well

Diffusion models avoid many GAN pitfalls. Training is stable - we are just doing regression to predict noise. There are no adversarial dynamics, no mode collapse, no careful balancing of competing objectives. The loss directly measures prediction quality.

The iterative generation also provides control. We can stop early for rough outputs or run longer for refined details. We can guide the process toward desired attributes. We can interpolate between images by mixing their noise trajectories. This flexibility enables many applications.

Diffusion models also provide likelihood bounds, unlike GANs, which only generate samples. The evidence lower bound (ELBO) gives a principled training objective. It is a bound rather than the exact likelihood that autoregressive models compute, but this connection to probability theory still enables analysis and comparison.

The coarse-to-fine generation process matches human perception. Early denoising steps establish global structure - where objects are, their rough shapes. Middle steps add intermediate details - textures, lighting, basic features. Final steps refine fine details - sharp edges, subtle gradients. This hierarchy produces coherent results.

Latent diffusion and Stable Diffusion

Raw pixel-space diffusion is computationally expensive - every denoising step processes the full image. Latent diffusion models first encode images into a compressed latent space using a variational autoencoder, then run diffusion in this lower-dimensional space.

Stable Diffusion, released in 2022, made latent diffusion accessible. It compresses images 8x in each spatial dimension before diffusion, then decodes the final latent back to pixels. The denoiser therefore works on 64x fewer spatial positions, which cuts computation sharply while maintaining quality and enables generation on consumer GPUs.

The VAE encoder compresses 512×512×3 images to 64×64×4 latents - a 48x reduction in size. The encoder uses downsampling convolutions; the decoder uses upsampling. Both are trained to reconstruct images while maintaining a regularized latent space. The KL divergence term encourages latents to follow a standard Gaussian.
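The compression figures follow directly from the two shapes. A short worked calculation, using only the sizes given above:

```python
from math import prod

image_shape = (512, 512, 3)   # height, width, RGB channels
latent_shape = (64, 64, 4)    # height, width, latent channels

# Downsampling factor along each spatial dimension
spatial_factor = image_shape[0] // latent_shape[0]

# Ratio of spatial positions the denoiser has to process
position_ratio = (image_shape[0] * image_shape[1]) // (latent_shape[0] * latent_shape[1])

# Ratio of total values stored (the latent has 4 channels instead of 3)
size_ratio = prod(image_shape) // prod(latent_shape)

print(spatial_factor, position_ratio, size_ratio)  # 8 64 48
```

The 64x figure counts spatial positions, while the 48x figure counts stored values, which is why the two numbers differ.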

Working in latent space has another advantage: the latent representation captures semantic content while discarding imperceptible details. The diffusion model doesn't waste capacity modeling exact pixel values that the decoder will regenerate. Instead, it focuses on meaningful image structure.

Conditioning: Controlling generation

Unconditional generation produces random samples. For practical use, we need control. Text-to-image models condition on text embeddings, typically from a frozen CLIP or T5 encoder. The denoising network receives the embedding and learns to generate images matching the description.

Cross-attention layers enable conditioning. Queries come from the image features, while keys and values come from the text embeddings, so each spatial location can draw on the text tokens most relevant to it. This mechanism lets the model attend to relevant words when generating different image regions.
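A single-head version of this layer can be sketched in a few lines. The class name `CrossAttention`, the feature widths, and the token counts are illustrative assumptions; production models use multiple heads and a residual connection around the layer.

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: image tokens attend to text tokens."""

    def __init__(self, image_dim, text_dim, attn_dim=64):
        super().__init__()
        self.to_q = nn.Linear(image_dim, attn_dim, bias=False)  # queries from image features
        self.to_k = nn.Linear(text_dim, attn_dim, bias=False)   # keys from text embeddings
        self.to_v = nn.Linear(text_dim, attn_dim, bias=False)   # values from text embeddings
        self.to_out = nn.Linear(attn_dim, image_dim)

    def forward(self, image_tokens, text_tokens):
        q = self.to_q(image_tokens)                               # (B, HW, D)
        k = self.to_k(text_tokens)                                # (B, L, D)
        v = self.to_v(text_tokens)                                # (B, L, D)
        scores = q @ k.transpose(1, 2) / math.sqrt(q.size(-1))    # (B, HW, L)
        weights = scores.softmax(dim=-1)                          # one distribution per location
        return self.to_out(weights @ v), weights

torch.manual_seed(0)
attn = CrossAttention(image_dim=320, text_dim=768)
image_tokens = torch.randn(2, 16 * 16, 320)   # flattened 16x16 feature map
text_tokens = torch.randn(2, 77, 768)         # 77 text token embeddings
out, weights = attn(image_tokens, text_tokens)
print(out.shape, weights.shape)
```

Each row of `weights` shows how strongly one spatial location attends to each text token.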

CLIP embeddings capture semantic similarity between images and text - they were trained on hundreds of millions of image-caption pairs to align visual and textual representations. Using CLIP embeddings for conditioning transfers this alignment to the diffusion model, enabling natural language control.

T5 embeddings, from a large language model, provide richer text understanding. Longer, more complex prompts benefit from T5's deeper language comprehension. Imagen and several newer models use T5 encoders, which tends to improve prompt following over models conditioned on CLIP text embeddings alone.

Classifier-free guidance

Classifier-free guidance improves prompt adherence dramatically. During training, we randomly drop the conditioning, teaching the model to generate both conditionally and unconditionally. At inference, we extrapolate away from the unconditional prediction toward the conditional one.

The formula: output = unconditional + scale × (conditional - unconditional). Higher guidance scales produce images that more strongly reflect the prompt, at the cost of diversity. Typical values range from 7 to 15 for text-to-image generation.
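In code, the guidance formula is a single line applied to the two noise predictions at every sampling step. The function name `guided_noise` is illustrative, and the random tensors stand in for two forward passes of the denoiser, one without and one with the prompt.

```python
import torch

def guided_noise(noise_uncond, noise_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction."""
    return noise_uncond + scale * (noise_cond - noise_uncond)

torch.manual_seed(0)
noise_uncond = torch.randn(1, 4, 64, 64)   # stand-in for the prediction with the prompt dropped
noise_cond = torch.randn(1, 4, 64, 64)     # stand-in for the prediction with the prompt

# scale = 0 ignores the prompt, scale = 1 is plain conditional sampling,
# and scale > 1 pushes past the conditional prediction
for scale in (0.0, 1.0, 7.5):
    out = guided_noise(noise_uncond, noise_cond, scale)
    print(scale, (out - noise_uncond).norm().item())
```

The cost is two denoiser evaluations per step, which implementations usually batch together.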

Classifier-free guidance can be understood as implicit classifier guidance. The difference (conditional - unconditional) points in the direction of increasing the probability of the condition - exactly what a classifier gradient would provide. But we get this signal without training a separate classifier.

Too-high guidance scales cause oversaturation and artifacts. The model exaggerates features it associates with the prompt, producing unrealistic colors and textures. Negative prompts provide another control axis - specifying what to avoid, with the model moving away from those embeddings.

ControlNet and fine control

ControlNet adds spatial control to pretrained diffusion models. It copies the encoder of a frozen model, trains this copy on paired data showing images with their control signals, and adds the trained copy's outputs back to the frozen decoder. This preserves the base model's knowledge while adding new capabilities.

Control signals include edge maps, depth maps, pose skeletons, and segmentation masks. An artist can sketch rough edges and let the model fill in photorealistic details. A designer can specify precise layouts while the AI handles texture and lighting.

IP-Adapter enables image-based conditioning - provide a reference image and generate variations matching its style or content. The reference image is encoded by CLIP's image encoder, and these embeddings condition generation through additional cross-attention layers. This enables style transfer, character consistency, and visual prompting.

LoRA (Low-Rank Adaptation) enables efficient fine-tuning of diffusion models. Instead of updating all weights, LoRA adds small trainable matrices to attention layers. These adaptations capture new concepts - a specific person's face, an art style, a particular object - with minimal storage (typically 10-200MB vs 4GB for full model).
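A LoRA layer can be sketched as a frozen linear layer plus a trainable low-rank update. The class name `LoRALinear`, the rank, and the 768-wide projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + scale * B(A x)."""

    def __init__(self, base: nn.Linear, rank=4, alpha=4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)             # the pretrained weights stay fixed
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A: project to rank r
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B: project back
        nn.init.zeros_(self.up.weight)          # start as an exact no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

torch.manual_seed(0)
base = nn.Linear(768, 768)   # stand-in for one attention projection
lora = LoRALinear(base, rank=4)

trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in lora.parameters() if not p.requires_grad)
print(trainable, frozen)  # 6144 590592
```

Only the small `down` and `up` matrices need to be saved, which is where the storage savings come from.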

The U-Net architecture

Most diffusion models use U-Net architectures with skip connections between encoder and decoder. The encoder downsamples through convolutions, capturing global structure. The decoder upsamples, guided by skip-connected features that preserve local details. Attention layers at various resolutions enable global coherence.

Time embedding informs the network of the current noise level. Different noise levels require different behaviors - early steps establish global structure, late steps refine details. The network learns to adapt its predictions based on where we are in the denoising process.
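One common way to build the time embedding is with sinusoidal features, as in transformer position encodings. The text does not specify the form used, so the sketch below is an assumption; the function name `timestep_embedding` and the width of 128 are illustrative.

```python
import math
import torch

def timestep_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps of shape (batch,) to sinusoidal features of shape (batch, dim)."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

t = torch.tensor([0, 10, 999])
emb = timestep_embedding(t, dim=128)
print(emb.shape)  # torch.Size([3, 128])
```

The embedding is usually passed through a small MLP and then added to, or used to scale, the feature maps in each block.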

Diffusion Transformers (DiT) replace the U-Net with a pure transformer architecture. Images are patchified as in Vision Transformers, and transformer blocks process the sequence of patches. DiT scales more predictably than U-Nets and has become the backbone of state-of-the-art models like Sora.
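Patchifying is a reshape that turns a feature map into a token sequence. The sketch assumes a 4-channel 32×32 latent and a patch size of 2; the function name `patchify` is illustrative.

```python
import torch

def patchify(x, patch_size):
    """Split (B, C, H, W) into non-overlapping patches: (B, num_patches, C * p * p)."""
    b, c, h, w = x.shape
    p = patch_size
    x = x.reshape(b, c, h // p, p, w // p, p)
    x = x.permute(0, 2, 4, 1, 3, 5)            # (B, H/p, W/p, C, p, p)
    return x.reshape(b, (h // p) * (w // p), c * p * p)

torch.manual_seed(0)
latent = torch.randn(2, 4, 32, 32)
tokens = patchify(latent, patch_size=2)
print(tokens.shape)  # torch.Size([2, 256, 16])
```

Each token is then linearly projected to the transformer width, with a positional embedding added to preserve spatial layout.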

The shift to transformers enables training techniques from language models: larger batch sizes, more aggressive learning rate schedules, and better scaling laws. DiT models show smoother loss curves and more predictable compute-quality tradeoffs than U-Net based models.

Sampling algorithms

The original DDPM requires 1000 steps - too slow for practical use. DDIM reformulates the process deterministically, enabling larger steps with fewer total iterations. Euler and Heun methods from numerical ODE solving provide further speedups. DPM-Solver and related methods achieve high-quality results in 20-50 steps.

The choice of sampler affects both speed and quality. Stochastic samplers add noise at each step for diversity; deterministic samplers always produce the same output from the same starting noise. Different samplers suit different applications.
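A deterministic DDIM update can be written directly in terms of the cumulative signal levels ᾱ at the current and target steps. The sketch below is the noise-free variant; the function name `ddim_step` is illustrative, and the true noise stands in for a perfect model prediction.

```python
import torch

def ddim_step(x_t, predicted_noise, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update from noise level alpha_bar_t to alpha_bar_prev."""
    # Estimate the clean image implied by the current noise prediction
    x_0_pred = (x_t - (1 - alpha_bar_t).sqrt() * predicted_noise) / alpha_bar_t.sqrt()
    # Re-noise that estimate to the target level using the same predicted noise
    return alpha_bar_prev.sqrt() * x_0_pred + (1 - alpha_bar_prev).sqrt() * predicted_noise

torch.manual_seed(0)
x_0 = torch.randn(1, 3, 8, 8)
eps = torch.randn_like(x_0)
alpha_bar_t = torch.tensor(0.2)
x_t = alpha_bar_t.sqrt() * x_0 + (1 - alpha_bar_t).sqrt() * eps

# With a perfect noise prediction, one jump to alpha_bar = 1 recovers x_0
x_prev = ddim_step(x_t, eps, alpha_bar_t, torch.tensor(1.0))
print(torch.allclose(x_prev, x_0, atol=1e-5))
```

Because the target level is a free argument, the sampler can skip over many training timesteps in one update; a real model's prediction is imperfect, so several such jumps are still needed.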

Consistency models learn to map any point on the diffusion trajectory directly to the final clean image. Once trained, they generate in a single step - or can be used for few-step refinement. This represents a fundamental speedup: instead of iterating through the trajectory, jump directly to the end.

Rectified flows straighten the probability path from noise to data. The model learns to predict 'flow' vectors along the path between noise and clean images, and straighter paths need fewer integration steps, down to a single step when the path is close to a straight line. Stable Diffusion 3 is trained with a rectified flow objective.
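The link between straight paths and step count can be shown with a toy Euler sampler. The sketch assumes the interpolation x_t = (1 - t) × x_0 + t × ε, whose velocity is ε - x_0, and uses an oracle velocity in place of a trained network; the function name `euler_sample` is illustrative.

```python
import torch

def euler_sample(velocity_fn, noise, num_steps):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data) with Euler steps."""
    x = noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x

torch.manual_seed(0)
x_0 = torch.randn(1, 3, 8, 8)
noise = torch.randn_like(x_0)

# Oracle velocity for a perfectly straight path; a trained model only approximates this
oracle = lambda x, t: noise - x_0

one_step = euler_sample(oracle, noise, num_steps=1)
four_steps = euler_sample(oracle, noise, num_steps=4)
print(torch.allclose(one_step, x_0, atol=1e-5), torch.allclose(four_steps, x_0, atol=1e-5))
```

With a constant velocity, Euler integration is exact at any step count; learned flows are only approximately straight, so a few steps are typically still used.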

We are entering an era where creating images is limited only by imagination, not skill.

Anonymous researcher, 2023
How Things Work - A Visual Guide to Technology