Regular Expressions Reference

A regex reference covering anchors, quantifiers, character classes, groups, and lookarounds, with explanations of patterns such as ^start, .*, \d+, [a-z], (?:non-capture), and (?=lookahead).

What it is

A reference for understanding and constructing regular expressions, the powerful pattern-matching language used in text processing and searching.

Installation

Regular expressions are a language, not a standalone tool. They are integrated into many programming languages and command-line utilities. You don’t "install" regular expressions themselves, but you’ll use them with tools like:

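  • grep:
    • Linux/macOS: Usually preinstalled. If it is missing, use sudo apt update && sudo apt install grep (Debian/Ubuntu) or brew install grep (Homebrew, which installs GNU grep as ggrep).
    • Windows: Available through Git Bash, WSL, or Cygwin.
  • sed:
    • Linux/macOS: Usually preinstalled. If it is missing, use sudo apt update && sudo apt install sed (Debian/Ubuntu) or brew install gnu-sed (Homebrew, which installs GNU sed as gsed).
    • Windows: Available through Git Bash, WSL, or Cygwin.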
  • Python: python3 your_script.py (built-in re module)
  • JavaScript: node your_script.js (built-in RegExp object)

Core Concepts

  • Literals: Characters that match themselves (e.g., a matches "a").
  • Metacharacters: Characters with special meanings (e.g., ., *, +, ?, ^, $, [, ], {, }, (, ), |, \).
  • Character Classes: Define a set of characters that can match at a single position.
  • Quantifiers: Specify how many times a preceding element (character, group, or character class) must occur.
  • Anchors: Assert a position within the string without consuming characters.
  • Grouping and Capturing: Parentheses create groups, which can be treated as a unit and their matched content can be "captured" for later use.
  • Alternation: The | operator allows matching one pattern OR another.
  • Escaping: The backslash \ is used to escape metacharacters, treating them as literal characters, or to introduce special sequences.
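
As a minimal sketch of how these concepts combine, the following Python snippet (using the built-in re module) builds one pattern from literals, a character class, a quantifier, anchors, a group, alternation, and an escape. The pattern and the sample file names are illustrative assumptions.

```python
import re

# ^ and $ are anchors, (cat|dog) is a capturing group with alternation,
# [0-9] is a character class, + is a quantifier, and \. is an escaped dot.
pattern = re.compile(r"^(cat|dog)[0-9]+\.txt$")

names = ["cat1.txt", "dog42.txt", "bird7.txt", "cat.txt", "dog5xtxt"]
matched = [name for name in names if pattern.search(name)]
print(matched)

# The capturing group exposes which alternative matched.
animal = pattern.search("dog42.txt").group(1)
print(animal)
```

Only the first two names match: "cat.txt" has no digits, and "dog5xtxt" has no literal dot.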

Commands / Usage

This section describes common regular expression patterns and their meanings.

Basic Characters

  • a : Matches the literal character "a".
  • 1 : Matches the literal character "1".
  • " " (a single space) : Matches a literal space character.

Metacharacters (Special Characters)

  • . : Matches any single character except a newline.
    • a.b matches "aab", "axb", "a b", but not "a\nb".
  • \ : Escapes a metacharacter or introduces a special sequence.
    • \. : Matches a literal dot character.
    • \\ : Matches a literal backslash character.
  • ^ : Matches the beginning of the string, or the beginning of each line in multiline mode.
    • ^Hello matches "Hello world" but not "Say Hello".
  • $ : Matches the end of the string, or the end of each line in multiline mode.
    • world$ matches "Hello world" but not "world Hello".
  • | : Acts as an OR operator. Matches either the expression before or after the pipe.
    • cat|dog matches "cat" or "dog".
  • ( ) : Groups expressions. Creates a capturing group.
    • (abc)+ matches "abc", "abcabc", "abcabcabc".
  • [ ] : Defines a character set. Matches any single character within the brackets.
    • [aeiou] matches any vowel.
    • [0-9] matches any digit.
    • [a-zA-Z] matches any uppercase or lowercase letter.
    • [^0-9] matches any character that is NOT a digit (negated set).
  • { } : Quantifier for specific counts.
    • a{3} matches exactly three "a"s ("aaa").
    • a{2,4} matches between two and four "a"s ("aa", "aaa", "aaaa").
    • a{2,} matches two or more "a"s ("aa", "aaa", "aaaa", …).
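    • a{,3} matches zero to three "a"s ("", "a", "aa", "aaa") in engines that allow an omitted lower bound, such as Python's re. Some engines, including JavaScript, do not treat {,3} as a quantifier, so a{0,3} is the portable form.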
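
The examples in this list can be checked with a short Python sketch. It assumes Python's built-in re module; other engines may differ in details, and the sample strings are illustrative.

```python
import re

# The dot matches any single character except a newline.
dot_hits = [s for s in ["aab", "axb", "a b", "a\nb"] if re.fullmatch(r"a.b", s)]

# ^ anchors the match at the start of the string.
starts_with_hello = bool(re.search(r"^Hello", "Hello world"))
hello_not_at_start = bool(re.search(r"^Hello", "Say Hello"))

# A negated set matches any character that is not listed.
non_digits = re.findall(r"[^0-9]", "a1b2")

# Counted quantifier: the whole string must be two to four "a"s.
counted = [s for s in ["a", "aa", "aaaa", "aaaaa"] if re.fullmatch(r"a{2,4}", s)]

print(dot_hits, starts_with_hello, hello_not_at_start, non_digits, counted)
```

re.fullmatch is used where the whole string must match; re.search finds a match anywhere in the string.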

Quantifiers

  • * : Matches the preceding element zero or more times.
    • a* matches "", "a", "aa", "aaa", …
    • .* matches any sequence of characters other than newlines (including an empty string).
  • + : Matches the preceding element one or more times.
    • a+ matches "a", "aa", "aaa", … (but not "").
  • ? : Matches the preceding element zero or one time.
    • colou?r matches "color" and "colour".
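
A small Python sketch of the three quantifiers above, assuming the built-in re module; the test strings are illustrative.

```python
import re

candidates = ["b", "ab", "aaab"]

# * allows zero repetitions, so "b" alone still matches a*b.
star = [s for s in candidates if re.fullmatch(r"a*b", s)]

# + requires at least one repetition, so "b" alone is rejected.
plus = [s for s in candidates if re.fullmatch(r"a+b", s)]

# ? makes the preceding element optional (zero or one "u").
optional = re.findall(r"colou?r", "color, colour, colouur")

print(star, plus, optional)
```

The misspelling "colouur" is not matched because ? allows at most one "u".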

Special Character Sequences (Often start with \)

  • \d : Matches any digit (equivalent to [0-9] for ASCII text; some engines also match other Unicode digits).
    • \d\d matches two consecutive digits (e.g., "12").
  • \D : Matches any non-digit character (equivalent to [^0-9]).
  • \w : Matches any word character (alphanumeric plus underscore: [a-zA-Z0-9_]).
    • \w+ matches one or more word characters (e.g., "hello", "user_123").
  • \W : Matches any non-word character (equivalent to [^a-zA-Z0-9_]).
  • \s : Matches any whitespace character (space, tab, newline, etc.).
    • hello\sworld matches "hello world", "hello\tworld".
  • \S : Matches any non-whitespace character.
  • \b : Matches a word boundary. This is a zero-width assertion, meaning it matches a position, not a character. It matches the position between a word character and a non-word character, or at the beginning/end of the string if the first/last character is a word character.
    • \bcat\b matches "cat" as a whole word, but not "catalog" or "tomcat".
  • \B : Matches a non-word boundary.
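  • \A : Matches only at the absolute beginning of the string, even in multiline mode (where ^ also matches at the start of each line). Not available in JavaScript.
  • \Z : Matches at the end of the string, even in multiline mode (where $ also matches at the end of each line). In Python it matches only at the very end; in Perl and PCRE it also matches before a final newline, and \z is the strict form. Not available in JavaScript.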
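
The following sketch exercises a few of these sequences in Python, assuming the built-in re module. The sample sentence is made up for illustration.

```python
import re

text = "user_123 paid 45 dollars for the tomcat catalog, not the cat"

# \d+ finds runs of digits.
digits = re.findall(r"\d+", text)

# \w+ finds runs of word characters (letters, digits, underscore).
identifiers = re.findall(r"\w+", "hello user_123!")

# Without \b, "cat" is also found inside "tomcat" and "catalog".
any_cat = re.findall(r"cat", text)
whole_cat = re.findall(r"\bcat\b", text)

# \s+ treats any run of whitespace (tabs, spaces) as one separator.
tokens = re.split(r"\s+", "hello\tworld  again")

print(digits, identifiers, len(any_cat), len(whole_cat), tokens)
```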

Grouping and Capturing

  • (expression) : Creates a capturing group. The matched content can be referred to later (backreferences) or extracted.
    • (\d{3})-(\d{4}) captures two groups: the first three digits and the following four digits.
  • (?:expression) : Creates a non-capturing group. Useful for grouping without capturing: the engine does not store the matched text, and the numbering of the remaining capturing groups stays simple.
    • (?:abc)+ matches "abc", "abcabc", etc., but doesn’t capture each "abc" individually.
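
A brief Python sketch of capturing groups, backreferences, and non-capturing groups, assuming the built-in re module. The phone-like string and the sentence with repeated words are illustrative.

```python
import re

# Two capturing groups: three digits, then four digits.
m = re.search(r"(\d{3})-(\d{4})", "Call 555-1234 today")
whole = m.group(0)   # the full match
parts = m.groups()   # the captured groups, in order

# Groups can be referenced in a replacement string as \1, \2, ...
swapped = re.sub(r"(\d{3})-(\d{4})", r"\2-\1", "555-1234")

# A non-capturing group still groups for the + quantifier but stores nothing.
nc = re.search(r"(?:abc)+", "abcabcabc")
nc_whole, nc_groups = nc.group(0), nc.groups()

# Inside a pattern, \1 is a backreference: it must match the same text
# that group 1 matched, which finds doubled words here.
doubled = re.findall(r"\b(\w+) \1\b", "it is is what it it is")

print(whole, parts, swapped, nc_whole, nc_groups, doubled)
```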

Lookarounds (Zero-Width Assertions)

Lookarounds check for patterns ahead or behind the current position without consuming characters.

  • (?=pattern) : Positive Lookahead. Asserts that pattern matches after the current position.
    • \w+(?=\s+world) matches "hello" in "hello world".
  • (?!pattern) : Negative Lookahead. Asserts that pattern does not match after the current position.
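    • \w+(?!\s+world) matches "hello" in "hello there". In "hello world" it still matches "hell", because the engine backtracks to a shorter match; \b\w+\b(?!\s+world) prevents this.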
  • (?<=pattern) : Positive Lookbehind. Asserts that pattern matches before the current position.
    • (?<=hello\s)\w+ matches "world" in "hello world".
  • (?<!pattern) : Negative Lookbehind. Asserts that pattern does not match before the current position.
    • (?<!\$)\b\d+ matches "42" but not "99" in "42 items at $99".
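
Lookarounds are easy to misread, so a short Python sketch may help. It assumes the built-in re module; the sample strings are illustrative. It also shows how an unpinned negative lookahead can succeed on a shorter match through backtracking.

```python
import re

# Positive lookahead: a word followed by whitespace and "world".
ahead = re.findall(r"\w+(?=\s+world)", "hello world")

# Negative lookahead without a boundary: \w+ backtracks from "hello"
# to "hell", where the lookahead no longer sees whitespace + "world".
unpinned = re.findall(r"\w+(?!\s+world)", "hello world")

# Pinning the word with \b on both sides prevents the partial match.
pinned = re.findall(r"\b\w+\b(?!\s+world)", "hello world")

# Positive lookbehind: a word preceded by "hello" and one whitespace character.
behind = re.findall(r"(?<=hello\s)\w+", "hello world")

# Negative lookbehind: whole numbers that are not preceded by "$".
not_behind = re.findall(r"(?<!\$)\b\d+", "42 items at $99")

print(ahead, unpinned, pinned, behind, not_behind)
```

In every case the lookaround text itself is excluded from the match, because the assertion consumes no characters.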

Flags/Modifiers

These are often applied to the regex as a whole, depending on the tool or language.

  • i : Case-insensitive matching.
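  • g : Global matching (find all occurrences, not just the first). This is a flag in JavaScript and sed; in Python, re.findall and re.finditer serve the same purpose.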
  • m : Multiline mode. ^ matches the start of each line, and $ matches the end of each line.
  • s : Dotall mode. . matches any character, including newline characters.
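
In Python these modifiers are passed as flags rather than written as letters after the pattern. The sketch below assumes the built-in re module and uses a made-up three-line log string.

```python
import re

text = "Error: disk full\nerror: retry\nOK"

# i -> re.IGNORECASE. findall already returns every match, so no "g" is needed.
case_insensitive = re.findall(r"error", text, flags=re.IGNORECASE)

# m -> re.MULTILINE: ^ matches at the start of every line.
all_line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
first_only = re.findall(r"^\w+", text)

# s -> re.DOTALL: the dot is allowed to match the newline between lines.
dot_default = re.search(r"full.error", text)
dot_all = re.search(r"full.error", text, flags=re.DOTALL)

print(case_insensitive, all_line_starts, first_only, dot_default, dot_all.group(0))
```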

Common Patterns

  • Matching email addresses:
    • [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
    • Matches common email formats.
  • Matching URLs:
    • (https?:\/\/)?([\da-zA-Z\.-]+)\.([a-zA-Z\.]{2,6})([\/\w \.-]*)*\/?
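    • A simplified URL pattern. The nested quantifier ([\/\w \.-]*)* can cause catastrophic backtracking when the pattern is anchored or embedded in a larger one; [\/\w .-]* matches the same text without the nesting.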
  • Matching dates (YYYY-MM-DD):
    • \d{4}-\d{2}-\d{2}
    • Matches the format only. It does not check that the month and day are valid, so "9999-99-99" also matches.
  • Matching phone numbers (e.g., XXX-XXX-XXXX):
    • \d{3}-\d{3}-\d{4}
    • Matches phone numbers in a specific format.
  • Finding all words in a text:
    • \w+
    • Use with the global flag (g) to find all sequences of word characters.
  • Extracting numbers from a string:
    • grep -oP '\d+' your_file.txt
    • -o prints only the matched parts, and -P enables Perl-compatible syntax (available in GNU grep). \d+ matches one or more digits.
  • Replacing multiple spaces with a single space:
    • sed -E 's/ +/ /g'
    • s/pattern/replacement/g is the sed command for substitution. With -E (extended regular expressions), the + after the space matches one or more spaces.
  • Finding lines that do not contain a specific word:
    • grep -v 'specific_word'
    • The -v flag in grep inverts the match.
  • Finding lines that do contain a specific word (case-insensitive):
    • grep -i 'specific_word'
    • The -i flag in grep makes it case-insensitive.
  • Extracting captured groups:
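    • grep prints whole matches, not individual groups. Within a grep -P pattern, \1, \2, etc. are backreferences to earlier groups. To print only part of a match, use \K.
    • echo "First: John, Last: Doe" | grep -oP 'First: \K\w+' -> John
    • \K resets the start of the reported match, so the text matched before it is left out of the output.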
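
Several of these patterns translate directly to Python's built-in re module. The sketch below is one possible rendering under that assumption. The sample text is made up, and is_real_date is a hypothetical helper that pairs the date pattern with datetime.strptime, since the pattern alone checks only the format.

```python
import re
from datetime import datetime

text = "Order 66 shipped on 2024-03-15;   call 555-867-5309."

dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
phones = re.findall(r"\d{3}-\d{3}-\d{4}", text)
numbers = re.findall(r"\d+", text)  # like grep -oP '\d+'

# Collapse runs of spaces, like the sed substitution above.
collapsed = re.sub(r" +", " ", "too   many    spaces")

# A capturing group plays the role of \K in the grep example.
first_name = re.search(r"First: (\w+)", "First: John, Last: Doe").group(1)


def is_real_date(value: str) -> bool:
    """Hypothetical helper: check the format, then check the calendar."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


print(dates, phones, numbers, collapsed, first_name)
print(is_real_date("2024-03-15"), is_real_date("2024-02-30"))
```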

Gotchas

  • Greediness: By default, quantifiers (*, +, ?, {n,m}) are "greedy," meaning they match as much as possible.
    • "<a><b><c>" with the regex <.*> will match the entire string "<a><b><c>".
    • To make them "lazy" (match as little as possible), append a ?: <.*?> will match <a>, then <b>, then <c> individually.
  • Anchors in grep: grep reads its input one line at a time, so ^ and $ always anchor to the start and end of each line, and \A and \Z behave the same way. To match across lines, GNU grep's -z option treats the input as NUL-separated records, so a file without NUL bytes is read as a single string.
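  • Backslashes: Backslashes need to be escaped in many contexts, especially within string literals in programming languages. For example, to match a literal backslash in Python, the pattern is "\\\\" as an ordinary string literal or r"\\" as a raw string. Raw strings are the usual choice for patterns.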
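  • Character Classes vs. Alternation: [abc] and a|b|c both match a single 'a', 'b', or 'c', but a class can only list single characters, while alternation can choose between longer alternatives (cat|dog). Most metacharacters lose their special meaning inside a class, so [.] matches a literal dot.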
  • Word Boundaries (\b): \b matches a position, not a character. Punctuation counts as a non-word character, so \bfoo\b matches "foo" in both "foo." and ".foo". Digits and underscores are word characters, so it does not match in "foo1" or "foo_bar".
  • Lookbehind Limitations: Some regex engines have limitations on lookbehind patterns, such as requiring fixed-width patterns. Python's re module, for example, rejects (?<=a+)b because a+ has no fixed width.
  • Unicode: Shorthand classes differ between engines. In Python 3, \w, \d, and \s match Unicode characters in str patterns unless the re.ASCII flag is set. In JavaScript, \w and \d stay ASCII-only, and Unicode property escapes such as \p{L} require the u flag.
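
Several of these gotchas can be reproduced in a few lines of Python. The sketch assumes the built-in re module; behavior in other engines may differ, and the sample strings are illustrative.

```python
import re

# Greedy vs. lazy quantifiers.
html = "<a><b><c>"
greedy = re.findall(r"<.*>", html)
lazy = re.findall(r"<.*?>", html)

# Backslashes: both patterns below match one literal backslash.
path = "C:\\temp"  # the seven-character string C:\temp
raw_pos = re.search(r"\\", path).start()
escaped_pos = re.search("\\\\", path).start()

# Word boundaries: "." is a non-word character, "_" and "1" are word characters.
boundary = {s: bool(re.search(r"\bfoo\b", s)) for s in ["foo.", "foo_bar", "foo1"]}

# Lookbehind must be fixed-width in Python's re module.
try:
    re.compile(r"(?<=a+)b")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Unicode: \d also matches non-ASCII digits unless re.ASCII is set.
sample = "7\u0663"  # "7" followed by ARABIC-INDIC DIGIT THREE
unicode_digits = re.findall(r"\d", sample)
ascii_digits = re.findall(r"\d", sample, flags=re.ASCII)

print(greedy, lazy, raw_pos, escaped_pos, boundary, lookbehind_ok)
print(len(unicode_digits), len(ascii_digits))
```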