Pattern matching with regex

Regular expressions decoded


Regular expressions are one of the most powerful - and most feared - tools in a programmer's arsenal. They look like someone fell asleep on their keyboard: `/^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/`. But behind that cryptic syntax lies an elegant mathematical foundation dating back to the 1950s and the birth of theoretical computer science. Understanding regex isn't just about memorizing symbols - it's about grasping a fundamental model of computation that powers everything from text editors to network security systems.

What even is a regular expression?

At its core, a [[regular expression]] (regex) is a pattern that describes a set of strings. Think of it as a highly compressed specification for text matching. When you write the regex `cat`, you're describing the set of all strings containing the sequence 'c', then 'a', then 't'. When you write `ca+t`, you're describing 'cat', 'caat', 'caaat', and so on - the `+` means 'one or more of the preceding element'. This compact notation lets you express complex matching rules in a few characters that would otherwise require dozens of lines of conditional code.
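These descriptions are easy to verify directly in JavaScript, the language used for examples throughout this article:

```javascript
// The regex 'cat' describes strings containing 'c', then 'a', then 't'
console.log(/cat/.test("concatenate"));  // true - "con[cat]enate"
console.log(/ca+t/.test("caaat"));       // true - '+' allows repeated 'a'
console.log(/ca+t/.test("ct"));          // false - '+' requires at least one 'a'
```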

But regex isn't just about convenience - it's mathematically equivalent to a class of abstract machines called [[finite automata]]. This connection, discovered by mathematician Stephen Kleene in 1956, means that any pattern you can express with a regex can be recognized by a simple machine with a finite number of states. This constraint is what makes regex both powerful and limited in specific, well-understood ways. The mathematical foundation gives us guarantees about what regex can and cannot do, which is invaluable when choosing the right tool for a job.

A brief history: from theory to grep

The story of regular expressions begins in the realm of pure mathematics. In 1951, mathematician Stephen Kleene was studying neural networks - not the modern AI kind, but theoretical models of how neurons might compute. He discovered that these simple networks could recognize exactly the same patterns as a notation he called 'regular events.' This notation used three operations: concatenation (ab means 'a followed by b'), alternation (a|b means 'a or b'), and the Kleene star (a* means 'zero or more a's').

For over a decade, regular expressions remained a theoretical curiosity. Then in 1968, Ken Thompson - who would later co-create Unix - published a paper describing how to efficiently implement regex matching using a technique now called Thompson's construction. He built this algorithm into the text editor QED, and later into ed, the standard Unix editor. The famous grep command - named after the ed command g/re/p, 'globally search for a regular expression and print matching lines' - emerged from this work, putting regex into the hands of every Unix user.

The evolution continued as Henry Spencer wrote a widely-used regex library in C, which was adopted by Perl in 1987. Perl's regex implementation added powerful features like backreferences and lookahead assertions that technically go beyond regular languages. These 'extended' regular expressions became the de facto standard, implemented in PCRE (Perl Compatible Regular Expressions), which powers PHP, Python, JavaScript, and countless other languages. Today, virtually every programming language includes regex support, though implementations vary in features and performance characteristics.

NFA Regex Engine Simulation

[Interactive demo: watch an NFA process the input string "abcbcd" character by character through states q0–q6. The NFA simultaneously tracks all possible states; epsilon (ε) transitions happen without consuming input.]

The building blocks: metacharacters

Regular expressions use special characters called [[metacharacters]] to express patterns beyond literal text matching. Understanding these is like learning an alphabet - once you know them, you can read and write any regex. The key insight is that metacharacters have special meaning only in specific contexts, and you can always escape them with a backslash to match the literal character.

| Symbol | Name | Meaning | Example |
| --- | --- | --- | --- |
| `.` | Dot | Match any single character (except newline by default) | `c.t` matches 'cat', 'cot', 'c9t' |
| `^` | Caret | Start of string (or line in multiline mode) | `^Hello` matches 'Hello world' |
| `$` | Dollar | End of string (or line in multiline mode) | `world$` matches 'Hello world' |
| `*` | Asterisk | Zero or more of previous element | `ca*t` matches 'ct', 'cat', 'caaat' |
| `+` | Plus | One or more of previous element | `ca+t` matches 'cat', 'caat', not 'ct' |
| `?` | Question | Zero or one of previous element (optional) | `colou?r` matches 'color', 'colour' |
| `\|` | Pipe | Alternation (logical OR) | `cat\|dog` matches 'cat' or 'dog' |
| `[ ]` | Brackets | Character class (match any single char inside) | `[aeiou]` matches any vowel |
| `( )` | Parentheses | Grouping and capturing | `(ab)+` matches 'ab', 'abab', 'ababab' |
| `{ }` | Braces | Specific quantifier | `a{2,4}` matches 'aa', 'aaa', 'aaaa' |

Operator precedence: what binds tighter?

Like arithmetic operators, regex operators have precedence rules that determine how patterns are parsed. From highest to lowest: escaped characters and bracket expressions bind tightest, then grouping with parentheses, then quantifiers (*, +, ?, {n,m}), then concatenation (implicit between adjacent elements), and finally alternation (|) binds loosest. This means `ab|cd` is parsed as `(ab)|(cd)`, not `a(b|c)d`. Understanding precedence helps you avoid subtle bugs and unnecessary parentheses.
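A few quick checks make these precedence rules concrete:

```javascript
// Alternation binds loosest: ^ab|cd$ means (^ab)|(cd$)
console.log(/^ab|cd$/.test("abX"));    // true - "abX" starts with "ab"
console.log(/^(ab|cd)$/.test("abX"));  // false - grouping anchors the whole string

// Quantifiers bind tighter than concatenation
console.log(/ab*/.exec("abbb")[0]);    // "abbb" - * applies only to 'b'
console.log(/(ab)*/.exec("ababX")[0]); // "abab" - parentheses make * apply to "ab"
```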

Character classes: defining sets

[[Character classes]] let you match any character from a defined set. The pattern `[aeiou]` matches any single vowel. You can specify ranges: `[a-z]` matches any lowercase letter, `[0-9]` matches any digit. Combine them: `[a-zA-Z0-9]` matches any alphanumeric character. The range notation uses ASCII/Unicode ordering, so `[A-z]` actually includes some punctuation characters between uppercase and lowercase letters - a common gotcha.

Negation flips the logic: `[^aeiou]` matches any character that is NOT a vowel. The caret has different meanings depending on context - at the start of a character class it means 'not', but outside brackets it means 'start of string'. Inside a character class, most metacharacters lose their special meaning - `[.*+]` matches a literal dot, asterisk, or plus. The exceptions are backslash (for escapes), caret (for negation at start), hyphen (for ranges), and closing bracket.

// Common character class shortcuts
\d  // Any digit [0-9]
\w  // Any word character [a-zA-Z0-9_]
\s  // Any whitespace (space, tab, newline, carriage return, form feed)
\D  // NOT a digit [^0-9]
\W  // NOT a word character [^a-zA-Z0-9_]
\S  // NOT whitespace

// POSIX character classes (in some engines)
[[:alpha:]]  // Alphabetic characters
[[:digit:]]  // Digits (same as \d)
[[:alnum:]]  // Alphanumeric
[[:space:]]  // Whitespace including vertical tab
[[:punct:]]  // Punctuation characters

// Unicode-aware classes (with /u flag)
\p{L}   // Any letter (including accented, Cyrillic, Chinese, etc.)
\p{N}   // Any number
\p{Emoji}  // Emoji characters

// Examples
const phonePattern = /\d{3}-\d{3}-\d{4}/
phonePattern.test("555-123-4567")  // true

const wordPattern = /\w+/g
"Hello, World!".match(wordPattern)  // ["Hello", "World"]

// Unicode-aware matching
const intlWord = /\p{L}+/gu
"Привет мир 你好世界".match(intlWord)  // ["Привет", "мир", "你好世界"]

Quantifiers: greedy vs lazy vs possessive

Quantifiers specify how many times an element should match. By default, they're [[greedy]] - they match as much as possible while still allowing the overall pattern to succeed. This can cause surprising behavior that trips up even experienced developers.

// Greedy matching (default) - matches as MUCH as possible
const html = "<div>Hello</div><div>World</div>"
html.match(/<div>.*<\/div>/)
// Matches: "<div>Hello</div><div>World</div>" (entire string!)
// The .* gobbles up everything, then backtracks just enough

// Lazy/reluctant matching (add ?) - matches as LITTLE as possible
html.match(/<div>.*?<\/div>/)
// Matches: "<div>Hello</div>" (first occurrence only)
// The .*? takes the minimum needed to complete the match

// Possessive matching (add +, in some engines) - no backtracking
// Not supported in JavaScript, but available in Java, PCRE
// /<div>.*+<\/div>/  // Would FAIL because .* takes everything
//                     // and refuses to give any back

// Quantifier comparison
*    // Zero or more (greedy)
*?   // Zero or more (lazy)
*+   // Zero or more (possessive, some engines)

+    // One or more (greedy)
+?   // One or more (lazy)
++   // One or more (possessive)

?    // Zero or one (greedy)
??   // Zero or one (lazy)
?+   // Zero or one (possessive)

{n,m}   // Between n and m (greedy)
{n,m}?  // Between n and m (lazy)
{n,m}+  // Between n and m (possessive)

// Practical example: parsing quoted strings
const greedyQuote = /".*"/
const lazyQuote = /".*?"/
const input = '"first" and "second"'

greedyQuote.exec(input)[0]  // '"first" and "second"' (too much!)
lazyQuote.exec(input)[0]    // '"first"' (just right)

Capture groups and backreferences

Parentheses do double duty in regex: they group elements together AND capture the matched text for later use. [[Capture groups]] are numbered left-to-right starting from 1 (group 0 is the entire match), and you can reference them with backreferences. This enables powerful patterns like finding repeated words or validating matching delimiters.

// Basic capturing groups
const datePattern = /(\d{4})-(\d{2})-(\d{2})/
const match = "2024-03-15".match(datePattern)
// match[0] = "2024-03-15" (full match)
// match[1] = "2024" (first group - year)
// match[2] = "03" (second group - month)
// match[3] = "15" (third group - day)

// Named capture groups (ES2018+)
const namedPattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
const result = "2024-03-15".match(namedPattern)
result.groups.year   // "2024"
result.groups.month  // "03"
result.groups.day    // "15"

// Non-capturing groups (?:...) - group without capturing
const nonCapture = /(?:https?:\/\/)?([^/]+)/
"https://example.com/path".match(nonCapture)
// match[1] = "example.com" (protocol not captured)

// Backreferences - match the same text again
const duplicateWords = /\b(\w+)\s+\1\b/gi
"the the quick brown fox".match(duplicateWords)
// Matches "the the" - \1 references what group 1 captured

// Powerful: matching quotes with same delimiter
const quotedString = /(['"])(.*?)\1/g
`He said "hello" and 'goodbye'`.match(quotedString)
// Matches: ['"hello"', "'goodbye'"]

// Named backreference
const xmlTag = /<(?<tag>\w+)>.*?<\/\k<tag>>/
xmlTag.test("<div>content</div>")  // true
xmlTag.test("<div>content</span>") // false - tags don't match

Lookahead and lookbehind assertions

[[Lookarounds]] are zero-width assertions - they check for patterns without consuming characters. They answer the question 'is this pattern present?' without including it in the match. This is incredibly powerful for complex validation and for matching text based on context without including that context in the result.

// LOOKAHEAD - check what comes AFTER current position

// Positive lookahead (?=...) - must be followed by
/\d+(?=px)/.exec("width: 100px margin: 50em")
// Matches "100" (followed by "px", but "px" not included)

// Negative lookahead (?!...) - must NOT be followed by
/\d+(?!\d|px)/.exec("width: 100px margin: 50em")
// Matches "50" - the extra (?!\d) is needed; with plain /\d+(?!px)/
// the engine would backtrack and match "10" inside "100px"

// LOOKBEHIND - check what comes BEFORE current position (ES2018+)

// Positive lookbehind (?<=...) - must be preceded by
/(?<=\$)\d+/.exec("Price: $50 or €40")
// Matches "50" (preceded by "$", but "$" not included)

// Negative lookbehind (?<!...) - must NOT be preceded by
/(?<![$\d])\d+/.exec("Price: $50 or 40 items")
// Matches "40" - digits are excluded too; with plain /(?<!\$)\d+/
// the engine would start mid-number and match "0" inside "$50"

// PRACTICAL EXAMPLES

// Password validation with multiple lookaheads
const strongPassword = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&]).{8,}$/
// Must contain: lowercase, uppercase, digit, special char, 8+ chars total
// Each lookahead checks for requirement without consuming chars

// Extract numbers NOT in parentheses
const notInParens = /(?<!\()\b\d+\b(?!\))/g
"values: 10 (20) 30 (40)".match(notInParens)
// Matches: ["10", "30"]

// Match word boundaries more precisely
const wordStart = /(?<=\s|^)\w+/g  // Words after space or start
const wordEnd = /\w+(?=\s|$)/g    // Words before space or end

// Replace without including context
"100px 50em 25rem".replace(/\d+(?=px)/g, n => n * 2)
// "200px 50em 25rem" - only numbers before "px" doubled

Under the hood: finite automata

Every regex can be converted into a [[finite automaton]] - a state machine with a finite number of states and transitions between them. There are two types: NFAs (Non-deterministic Finite Automata) and DFAs (Deterministic Finite Automata). Understanding these helps you write efficient patterns and avoid performance pitfalls.

An [[NFA]] can be in multiple states simultaneously and can have epsilon transitions (state changes without consuming input). When the regex engine encounters a choice (like in `a|b`), an NFA explores both paths in parallel. A [[DFA]] is always in exactly one state and every input leads to exactly one next state. NFAs are easier to construct from regex, but DFAs are faster to execute - they guarantee linear time matching.

| Property | NFA | DFA |
| --- | --- | --- |
| States at any moment | Multiple (explored in parallel) | Exactly one |
| Epsilon transitions | Allowed (move without consuming input) | Not allowed |
| Construction from regex | Direct via Thompson's algorithm | Requires NFA→DFA conversion |
| Space complexity | O(n) states for regex of length n | Up to O(2^n) states (exponential) |
| Match time complexity | O(nm) where m is input length | O(m) guaranteed linear |
| Backtracking required | Depends on implementation | Never |

Thompson's construction, invented by Ken Thompson in 1968, provides a direct way to convert any regex into an NFA. Each regex operator maps to a specific NFA fragment: literals create two-state automata, alternation creates parallel paths with epsilon transitions, and Kleene star creates loops. These fragments compose together to form the complete NFA.
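The state-set simulation those fragments enable can be sketched in a few lines of JavaScript. This is a toy, not a real engine: the NFA below is hand-built to mirror what Thompson's construction would produce for `a(b|c)*d`, and the simulator simply tracks every active state at once.

```javascript
// Hand-built NFA for a(b|c)*d. Transitions map state -> {symbol: [next states]};
// the special key "eps" holds epsilon transitions.
const nfa = {
  start: 0,
  accept: 5,
  trans: {
    0: { a: [1] },
    1: { eps: [2, 4] },   // enter the (b|c)* loop, or skip straight to 'd'
    2: { b: [3], c: [3] },
    3: { eps: [2, 4] },   // loop back, or exit the loop
    4: { d: [5] },
  },
};

// Expand a set of states by following epsilon transitions until closed
function epsilonClosure(states) {
  const stack = [...states], seen = new Set(states);
  while (stack.length) {
    const s = stack.pop();
    for (const t of nfa.trans[s]?.eps ?? []) {
      if (!seen.has(t)) { seen.add(t); stack.push(t); }
    }
  }
  return seen;
}

// Track ALL active states simultaneously - linear in input length, no backtracking
function matches(input) {
  let current = epsilonClosure([nfa.start]);
  for (const ch of input) {
    const next = [];
    for (const s of current) next.push(...(nfa.trans[s]?.[ch] ?? []));
    current = epsilonClosure(next);
  }
  return current.has(nfa.accept);
}

console.log(matches("abcd"));  // true
console.log(matches("ad"));    // true - (b|c)* matches zero times
console.log(matches("abe"));   // false
```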

Engine implementations: Thompson vs backtracking

Modern regex engines fall into two camps. Thompson-style engines (used by grep, RE2, Rust's regex crate) simulate the NFA directly, tracking all active states. This guarantees linear time complexity but limits features - no backreferences or lookaround. Backtracking engines (used by Perl, PCRE, JavaScript, Python, Java) explore paths one at a time, backtracking on failure. This enables powerful features but risks exponential time on pathological patterns.

ENGINE COMPARISON

Thompson/NFA-based (grep, RE2, ripgrep, Rust regex):
├── Time: O(nm) guaranteed
├── Features: Core regex only, no backreferences
├── Memory: O(n) for regex states
└── Best for: Log searching, security-critical applications

Backtracking/PCRE-style (Perl, JavaScript, Python, Java):
├── Time: O(nm) typical, O(2^n) worst case  
├── Features: Full PCRE including backreferences, lookaround
├── Memory: O(m) for backtrack stack
└── Best for: Complex text processing, when you need advanced features

Hybrid engines (newer):
├── RE2: Falls back to backtracking for rare features
├── Hyperscan: Uses multiple algorithms, selects best
└── .NET: JIT compilation for hot patterns

Catastrophic backtracking: the performance trap

Some regex engines use [[backtracking]] to handle NFAs - when there are multiple possible paths, they try one and backtrack if it fails. This can lead to catastrophic exponential time complexity with certain patterns and inputs, potentially bringing down servers or enabling denial-of-service attacks.

// DANGEROUS patterns that cause exponential backtracking

// Pattern 1: Nested quantifiers
const bad1 = /^(a+)+$/
bad1.test("aaaaaaaaaaaaaaaaaaaaaaaaaaaaab")
// This can take MINUTES because the engine tries every
// possible way to divide the a's among the groups

// Why? For input "aaaa", the engine tries:
// (a)(a)(a)(a), (aa)(a)(a), (a)(aa)(a), (aaa)(a),
// (a)(a)(aa), (aa)(aa), (a)(aaa), (aaaa)
// With n a's, there are 2^(n-1) ways to partition them!

// Pattern 2: Overlapping alternatives
const bad2 = /^(a|a)+$/
// Same problem - each 'a' can match via either alternative

// Pattern 3: Optional overlapping
const bad3 = /^(.*a){20}$/
// Must find 20 'a's with any chars between - exponential paths

// Pattern 4: Real-world email regex gone wrong
const badEmail = /^([a-zA-Z0-9]+)*@/
// Nested quantifier on character class

// SAFE alternatives
const good1 = /^a+$/           // Simple, no nesting
const good2 = /^(?:a)+$/       // Non-capturing, but still just one path
const goodEmail = /^[a-zA-Z0-9]+@/  // No nested quantifier

// DETECTION: Tools and techniques
// - regex101.com shows step count
// - Time your regex with increasing input sizes
// - Look for: (X+)+, (X+)*, (X*)+, (X*X*)*, (X|X)+

// MITIGATION strategies:
// 1. Use atomic groups (?>...) when available (not in JS)
// 2. Use possessive quantifiers *+ ++ (not in JS)
// 3. Restructure to eliminate ambiguity
// 4. Add timeouts in production code
// 5. Use RE2 or other safe engines for untrusted patterns

Real-world patterns decoded

Let's decode some common regex patterns you'll encounter in the wild. Understanding these will help you both read existing code and write your own patterns. Each pattern represents accumulated wisdom about handling real-world data.

// EMAIL (simplified - RFC 5322 compliant regex is 6,000+ chars!)
/^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/
// ^                    Start of string
// [a-zA-Z0-9._%+-]+    One or more allowed chars in local part
// @                    Literal @ symbol
// [a-zA-Z0-9.-]+       Domain name (letters, digits, dots, hyphens)
// \.                  Literal dot (escaped)
// [a-zA-Z]{2,}         TLD: 2+ letters (com, org, museum, etc.)
// $                    End of string

// URL (reasonably comprehensive)
/^(https?:\/\/)?([\da-z.-]+)\.([a-z.]{2,6})([\/\w .-]*)\/?$/
// (https?:\/\/)?     Optional protocol (http:// or https://)
// ([\da-z.-]+)       Subdomain and domain
// \.([a-z.]{2,6})    TLD (including compound like .co.uk)
// ([\/\w .-]*)       Path segments - note a single *, because the
//                    nested form ([...]*)*  invites catastrophic backtracking
// \/?$               Optional trailing slash, end of string

// IPv4 ADDRESS (with proper range validation)
/^(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$/
// Each octet matches: 250-255, 200-249, 100-199, 10-99, 0-9
// Repeated 3 times with dots, then once more without

// PHONE NUMBER (flexible US format)
/^\+?1?[-. ]?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$/
// \+?1?              Optional +1 country code
// [-. ]?             Optional separator (dash, dot, space)
// \(?\d{3}\)?        Area code, optionally in parens
// [-. ]?\d{3}        Exchange
// [-. ]?\d{4}        Subscriber number

// DATE (ISO 8601 format with validation)
/^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$/
// Year: any 4 digits
// Month: 01-12
// Day: 01-31 (doesn't validate month-specific limits)

// HTML TAG (simple, for learning - don't parse HTML with regex!)
/<([a-z]+)([^>]*)>(.*?)<\/\1>/gi
// <([a-z]+)           Opening tag name (captured)
// ([^>]*)             Attributes (anything except >)
// >                   End of opening tag
// (.*?)               Content (non-greedy)
// <\/\1>            Closing tag matching opening (backreference)

Beyond regular: what regex can't do

Regular expressions, despite their power, have fundamental limitations rooted in their mathematical foundation. They can only recognize [[regular languages]] - a specific class in the Chomsky hierarchy of formal languages. Understanding these limits helps you choose the right tool for each job.

The classic example: regex cannot match arbitrarily nested balanced parentheses like ((())). To understand why, consider that matching nested structures requires counting - you need to remember how many opening parens you've seen to know how many closing ones to expect. But finite automata have... finite memory. They can't count arbitrarily high. No matter how many states you add, there's always a deeper nesting level that will break the pattern.
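By contrast, a single unbounded integer counter solves the problem in a few lines, and that counter is precisely what a finite automaton cannot have:

```javascript
// Checking balanced parentheses needs a counter that can grow without bound
// (the current nesting depth) - impossible with a finite number of states.
function isBalanced(s) {
  let depth = 0;
  for (const ch of s) {
    if (ch === "(") depth++;
    else if (ch === ")" && --depth < 0) return false; // closed more than opened
  }
  return depth === 0; // every opener must have been closed
}

console.log(isBalanced("((()))"));  // true - any depth works
console.log(isBalanced("(()"));     // false - unclosed paren
console.log(isBalanced(")("));      // false - close before open
```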

CHOMSKY HIERARCHY OF LANGUAGES

Type 0: Recursively Enumerable (Turing machines)
        ↑
Type 1: Context-Sensitive (linear bounded automata)
        ↑
Type 2: Context-Free (pushdown automata) ← Balanced parens live here
        ↑
Type 3: Regular (finite automata) ← Standard regex

Each level is strictly more powerful than the one below.
Regular expressions can ONLY recognize Type 3 languages.

EXAMPLES OF WHAT REGEX CAN'T DO:
- Match balanced parentheses: ((())) 
- Match palindromes: abcba
- Match a^n b^n (equal numbers): aaabbb
- Parse HTML/XML properly
- Match valid JSON
- Count occurrences beyond a fixed limit

EXCEPTIONS - "regex" extensions that exceed regular:
- Backreferences: \1 makes (a+)\1 match aaaa but not aaa
- Recursive patterns: (?R) in PCRE can match balanced parens
- These are NOT regular expressions in the mathematical sense!

Regex across languages and engines

| Feature | JavaScript | Python | PCRE (PHP/Perl) | RE2 (Go/Rust) |
| --- | --- | --- | --- | --- |
| Named groups | `(?<name>...)` | `(?P<name>...)` | `(?P<name>...)` or `(?<name>...)` | `(?P<name>...)` |
| Lookbehind | ES2018+ (variable length) | Fixed length only (`regex` module allows variable) | Fixed length per branch | Not supported |
| Atomic groups | No | Yes `(?>...)` (3.11+) | Yes `(?>...)` | Not needed (no backtracking) |
| Possessive quantifiers | No | Yes `*+ ++ ?+` (3.11+) | Yes `*+ ++ ?+` | Not needed (no backtracking) |
| Unicode categories `\p{...}` | Yes with `/u` | `regex` module only | Yes | Yes |
| Recursion | No | `regex` module only | Yes `(?R)` | No |
| Conditionals | No | Yes `(?(id)yes\|no)` | Yes | No |
| Backtracking limit | No | No built-in | `pcre.backtrack_limit` | N/A (linear time) |

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

— Jamie Zawinski

Despite the joke, regex remains indispensable for text processing. The key is knowing when to use it and when to reach for a proper parser. For validating formats, searching logs, find-and-replace operations, and text transformation, nothing beats a well-crafted regex. For parsing HTML, XML, JSON, or programming languages - use a real parser. The boundary is roughly: if the structure can be arbitrarily nested, you need more than regex.

How Things Work - A Visual Guide to Technology