Regular Expressions Reference

Regex reference — anchors, quantifiers, character classes, groups, lookaheads. ^start, .*, \d+, [a-z], (?:non-capture), (?=lookahead). Every regex pattern explained.


What it is

A reference for understanding and constructing regular expressions, the powerful pattern-matching language used in text processing and searching.

Installation

Regular expressions are a language, not a standalone tool. They are integrated into many programming languages and command-line utilities. You don’t "install" regular expressions themselves, but you’ll use them with tools like:

grep:
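- Linux/macOS: usually preinstalled. On Debian/Ubuntu, sudo apt update && sudo apt install grep installs or updates it; on macOS, brew install grep adds GNU grep as ggrep alongside the built-in BSD grep.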
- Windows: Available through Git Bash, WSL, or Cygwin.
sed:
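- Linux/macOS: usually preinstalled. On Debian/Ubuntu, sudo apt update && sudo apt install sed installs or updates it; on macOS, brew install gnu-sed adds GNU sed as gsed alongside the built-in BSD sed.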
- Windows: Available through Git Bash, WSL, or Cygwin.
Python: python3 your_script.py (built-in re module)
JavaScript: node your_script.js (built-in RegExp object)

Core Concepts

Literals: Characters that match themselves (e.g., a matches "a").
Metacharacters: Characters with special meanings (e.g., ., *, +, ?, ^, $, [, ], {, }, (, ), |, \).
Character Classes: Define a set of characters that can match at a single position.
Quantifiers: Specify how many times a preceding element (character, group, or character class) must occur.
Anchors: Assert a position within the string without consuming characters.
Grouping and Capturing: Parentheses create groups, which can be treated as a unit and their matched content can be "captured" for later use.
Alternation: The | operator allows matching one pattern OR another.
Escaping: The backslash \ is used to escape metacharacters, treating them as literal characters, or to introduce special sequences.
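
The concepts above can be seen working together in one small pattern. This is a minimal sketch using Python's built-in re module; the "key=value" line format, the PATTERN name, and the parse helper are hypothetical and exist only for illustration.

```python
import re

# Hypothetical format: a "key=value" line where the key is "id" or "name"
# and the value looks like 1.25. re.VERBOSE lets the pattern carry comments
# and ignores the whitespace used for layout.
PATTERN = re.compile(
    r"""
    ^                    # anchor: start of the string
    (id|name)            # group 1 with alternation: "id" OR "name"
    =                    # literal: the equals sign
    ([0-9]+ \. [0-9]+)   # group 2: character classes, + quantifiers, escaped dot
    $                    # anchor: end of the string
    """,
    re.VERBOSE,
)


def parse(line):
    """Return (key, value) if the line fits the pattern, otherwise None."""
    m = PATTERN.match(line)
    return m.groups() if m else None
```

Calling parse("id=1.25") returns the two captured groups, while a line with an unknown key or a malformed number returns None.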

Commands / Usage

This section describes common regular expression patterns and their meanings.

Basic Characters

a : Matches the literal character "a".
1 : Matches the literal character "1".
(space) : Matches a literal space character.

Metacharacters (Special Characters)

. : Matches any single character except a newline.
- a.b matches "aab", "axb", "a b", but not "a\nb".
\ : Escapes a metacharacter or introduces a special sequence.
- \. : Matches a literal dot character.
- \\ : Matches a literal backslash character.
^ : Matches the beginning of the string, or the beginning of each line in multiline mode.
- ^Hello matches "Hello world" but not "Say Hello".
$ : Matches the end of the string, or the end of each line in multiline mode.
- world$ matches "Hello world" but not "world Hello".
| : Acts as an OR operator. Matches either the expression before or after the pipe.
- cat|dog matches "cat" or "dog".
( ) : Groups expressions. Creates a capturing group.
- (abc)+ matches "abc", "abcabc", "abcabcabc".
[ ] : Defines a character set. Matches any single character within the brackets.
- [aeiou] matches any vowel.
- [0-9] matches any digit.
- [a-zA-Z] matches any uppercase or lowercase letter.
- [^0-9] matches any character that is NOT a digit (negated set).
{ } : Quantifier for specific counts.
- a{3} matches exactly three "a"s ("aaa").
- a{2,4} matches between two and four "a"s ("aa", "aaa", "aaaa").
- a{2,} matches two or more "a"s ("aa", "aaa", "aaaa", …).
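- a{,3} matches zero to three "a"s ("", "a", "aa", "aaa") in Python. Many other engines, including JavaScript, treat a missing lower bound as literal text, so a{0,3} is the portable form.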
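
As a quick check of the metacharacters listed above, the following sketch runs several of the same examples through Python's re module. The variable names are arbitrary; re.fullmatch is used where the whole string must fit the pattern.

```python
import re

# "." matches any character except a newline.
dot_hits = [s for s in ["aab", "axb", "a b", "a\nb"] if re.fullmatch(r"a.b", s)]

# "^" and "$" anchor the match to the start and end of the string.
starts = bool(re.search(r"^Hello", "Hello world"))
no_start = bool(re.search(r"^Hello", "Say Hello"))
ends = bool(re.search(r"world$", "Hello world"))

# "|" picks either alternative; findall returns every match in order.
pets = re.findall(r"cat|dog", "cat, dog, bird")

# Negated set: every character that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b2")

# Counted quantifier: between two and four "a"s, nothing else.
counts = [s for s in ["a", "aa", "aaaa", "aaaaa"] if re.fullmatch(r"a{2,4}", s)]

print(dot_hits, pets, non_digits, counts)
```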

Quantifiers

* : Matches the preceding element zero or more times.
- a* matches "", "a", "aa", "aaa", …
- .* matches any sequence of characters, including the empty string, up to the end of the line (by default . does not match a newline).
+ : Matches the preceding element one or more times.
- a+ matches "a", "aa", "aaa", … (but not "").
? : Matches the preceding element zero or one time.
- colou?r matches "color" and "colour".
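
The three basic quantifiers can be compared side by side. A minimal sketch in Python, with test strings chosen only for illustration:

```python
import re

candidates = ["", "a", "aaa", "b"]

# "*" allows zero repetitions, so the empty string fits.
star = [s for s in candidates if re.fullmatch(r"a*", s)]

# "+" needs at least one repetition, so the empty string does not fit.
plus = [s for s in candidates if re.fullmatch(r"a+", s)]

# "?" makes the "u" optional but allows at most one of it.
optional = re.findall(r"colou?r", "color colour colouur")

# ".*" stops at a newline unless dotall mode is enabled.
first_line = re.match(r".*", "one\ntwo").group()
```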

Special Character Sequences (Often start with `\`)

\d : Matches any digit (equivalent to [0-9]).
- \d\d matches two consecutive digits (e.g., "12").
\D : Matches any non-digit character (equivalent to [^0-9]).
\w : Matches any word character (alphanumeric plus underscore: [a-zA-Z0-9_]).
- \w+ matches one or more word characters (e.g., "hello", "user_123").
\W : Matches any non-word character (equivalent to [^a-zA-Z0-9_]).
\s : Matches any whitespace character (space, tab, newline, etc.).
- hello\sworld matches "hello world", "hello\tworld".
\S : Matches any non-whitespace character.
\b : Matches a word boundary. This is a zero-width assertion, meaning it matches a position, not a character. It matches the position between a word character and a non-word character, or at the beginning/end of the string if the first/last character is a word character.
- \bcat\b matches "cat" as a whole word, but not "catalog" or "tomcat".
\B : Matches a non-word boundary.
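\A : Matches only at the very start of the string, even in multiline mode (where ^ also matches at the start of each line). Supported in Python, PCRE, and Perl, but not in JavaScript.
\Z : Matches at the very end of the string, even in multiline mode (where $ also matches at the end of each line). In Python it matches only at the end; in PCRE and Perl it also matches just before a final newline, and \z is the strict form. Not supported in JavaScript.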
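
The shorthand sequences above behave as follows in Python's re module. This is a small sketch with made-up sample text; note that for str patterns Python 3 treats \d, \w, and \s as Unicode-aware unless re.ASCII is passed.

```python
import re

# Hypothetical sample text for illustration.
text = "user_123 paid 45 dollars\ton 2024-01-15"

# \d+ : runs of digits.
digits = re.findall(r"\d+", text)

# \w+ : runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", "hello user_123!")

# \s+ : runs of whitespace, here used as a separator.
parts = re.split(r"\s+", "hello world\tagain")

# \b : word boundary, so only the standalone "cat" matches.
whole_cats = re.findall(r"\bcat\b", "cat catalog tomcat cat.")

# \A anchors to the very start of the string even in multiline mode.
only_first = re.findall(r"\A\w+", "one\ntwo", flags=re.MULTILINE)
```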

Grouping and Capturing

(expression) : Creates a capturing group. The matched content can be referred to later (backreferences) or extracted.
- (\d{3})-(\d{4}) captures two groups: the first three digits and the following four digits.
(?:expression) : Creates a non-capturing group. Useful when you need grouping (for a quantifier or alternation) but not the matched text: it keeps the numbering of the remaining groups simple and avoids storing text you will not use.
- (?:abc)+ matches "abc", "abcabc", etc., without creating a numbered group.
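
To show what capturing buys you in practice, here is a short Python sketch covering extraction, backreferences in a replacement, a backreference inside the pattern, and the difference a non-capturing group makes to re.findall. The sample strings are invented for illustration.

```python
import re

# Extract both captured groups from a match.
m = re.search(r"(\d{3})-(\d{4})", "call 555-1234 now")
first, second = m.groups()

# Backreferences in the replacement: swap the two groups.
swapped = re.sub(r"(\d{3})-(\d{4})", r"\2-\1", "555-1234")

# Backreference inside the pattern: find a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# With a capturing group, findall returns the group's last capture;
# with a non-capturing group, it returns the whole match.
capturing = re.findall(r"(abc)+", "abcabc abc")
non_capturing = re.findall(r"(?:abc)+", "abcabc abc")

# A pattern with only non-capturing groups has zero numbered groups.
group_count = re.compile(r"(?:abc)+").groups
```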

Lookarounds (Zero-Width Assertions)

Lookarounds check for patterns ahead or behind the current position without consuming characters.

(?=pattern) : Positive Lookahead. Asserts that pattern matches after the current position.
- \w+(?=\s+world) matches "hello" in "hello world".
(?!pattern) : Negative Lookahead. Asserts that pattern does not match after the current position.
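- \w+(?!\s+world) matches "hello" in "hello there". In "hello world" it still matches "hell", because \w+ backtracks to a position where the lookahead succeeds; write \b\w+\b(?!\s+world) to test whole words only.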
(?<=pattern) : Positive Lookbehind. Asserts that pattern matches before the current position.
- (?<=hello\s)\w+ matches "world" in "hello world".
(?<!pattern) : Negative Lookbehind. Asserts that pattern does not match before the current position.
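- (?<!hello\s)\bthere matches "there" in "hi there" but not in "hello there". Without the \b and the literal word, as in (?<!hello\s)\w+, the engine simply starts the match elsewhere: in "hello there" it matches "hello" and then "here", never "there".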
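
Because lookarounds interact with backtracking, it helps to run them rather than reason about them. The sketch below uses Python's re module on short invented strings and includes one case where a negative lookahead matches less than a reader might expect.

```python
import re

# Positive lookahead: the word must be followed by whitespace and "world".
ahead = re.findall(r"\w+(?=\s+world)", "hello world")

# Negative lookahead pitfall: \w+ backtracks from "hello" to "hell",
# where the text ahead ("o world") no longer starts with whitespace.
partial = re.search(r"\w+(?!\s+world)", "hello world").group()

# Anchoring the word with \b forces whole words to be tested, so "hello"
# is rejected and the next whole word is returned instead.
whole = re.search(r"\b\w+\b(?!\s+world)", "hello world").group()

# Positive lookbehind: the word must be preceded by "hello" and whitespace.
behind = re.findall(r"(?<=hello\s)\w+", "hello world")

# Negative lookbehind on a specific word: only the "there" that is
# not preceded by "hello " is matched.
m = re.search(r"(?<!hello\s)\bthere", "hi there, hello there")
not_behind_start = m.start()
```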

Flags/Modifiers

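These are often applied to the regex as a whole, depending on the tool or language. The letters below follow the /pattern/flags notation of JavaScript and Perl; Python passes the equivalents as re.IGNORECASE, re.MULTILINE, and re.DOTALL, and uses re.findall or re.finditer in place of g.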

i : Case-insensitive matching.
g : Global matching (find all occurrences, not just the first).
m : Multiline mode. ^ matches the start of each line, and $ matches the end of each line.
s : Dotall mode. . matches any character, including newline characters.
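
In Python the same modes are passed through the flags argument. A minimal sketch, assuming a three-line sample string made up for this example:

```python
import re

# Hypothetical sample text with three lines.
text = "First line\nsecond LINE\nthird line"

# i : case-insensitive matching.
ci = re.findall(r"line", text, flags=re.IGNORECASE)

# m : with MULTILINE, ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
string_start = re.findall(r"^\w+", text)

# s : with DOTALL, "." also matches newlines, so the match can span lines.
spans_lines = re.search(r"First.*third", text, flags=re.DOTALL) is not None
stays_on_line = re.search(r"First.*third", text) is None

# g : Python has no global flag; search returns the first match,
# findall returns all of them.
first_only = re.search(r"line", text).group()
```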

Common Patterns

Matching email addresses:
- [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
- Matches common email formats.
Matching URLs:
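- (https?:\/\/)?([\da-zA-Z.-]+)\.([a-zA-Z.]{2,6})([\/\w .-]*)\/?
- A simplified URL pattern. The path group uses a single quantifier; nesting one quantifier inside another, as in ([\/\w .-]*)*, can cause catastrophic backtracking on input that does not match.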
Matching dates (YYYY-MM-DD):
- \d{4}-\d{2}-\d{2}
- Matches the format only; it does not validate the values (for example, it accepts 2024-13-99).
Matching phone numbers (e.g., XXX-XXX-XXXX):
- \d{3}-\d{3}-\d{4}
- Matches phone numbers in a specific format.
Finding all words in a text:
- \w+
- Use with the global flag (g) to find all sequences of word characters.
Extracting numbers from a string:
- grep -oP '\d+' your_file.txt
- grep -oP outputs only the matched parts. \d+ matches one or more digits. -P (Perl-compatible regex) is a GNU grep option; BSD/macOS grep does not provide it.
Replacing multiple spaces with a single space:
- sed -E 's/ +/ /g'
- s/pattern/replacement/g is the sed command for substitution. With -E (extended regex, supported by GNU and BSD sed), + matches one or more spaces; in basic mode, \+ is a GNU extension.
Finding lines that do not contain a specific word:
- grep -v 'specific_word'
- The -v flag in grep inverts the match.
Finding lines that do contain a specific word (case-insensitive):
- grep -i 'specific_word'
- The -i flag in grep makes it case-insensitive.
Extracting captured groups:
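- grep cannot print individual capture groups; with grep -P, \1, \2, etc. are backreferences used inside the pattern itself. To print only part of a match, combine -o with \K or a lookaround.
- echo "First: John, Last: Doe" | grep -oP 'First: \K\w+' -> John
- \K resets the start of the reported match, so the text matched before it is left out of the output.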
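
For readers working in Python rather than grep and sed, the same tasks can be sketched with the re module. The sample strings and the DATE and PHONE names are hypothetical; the \b anchors are an addition so that the patterns do not match inside longer digit runs.

```python
import re

DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

# Hypothetical sample text for illustration.
sample = "Call 555-867-5309 before 2024-03-15, or   after 2024-13-99."

# Format matching only: the impossible date 2024-13-99 is still returned.
dates = DATE.findall(sample)
phones = PHONE.findall(sample)

# Equivalent of sed -E 's/ +/ /g': collapse runs of spaces.
collapsed = re.sub(r" +", " ", sample)

# Equivalent of grep -o with \d+: every run of digits.
numbers = re.findall(r"\d+", "3 apples, 12 pears")

# Where grep needs \K, Python can read a capture group directly.
first_name = re.search(r"First: (\w+)", "First: John, Last: Doe").group(1)

# Equivalents of grep -v and grep -i on a list of lines.
lines = ["error: disk full", "ok", "Error: timeout"]
without_word = [l for l in lines if not re.search(r"error", l)]
with_word_ci = [l for l in lines if re.search(r"error", l, re.IGNORECASE)]
```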

Gotchas

Greediness: By default, quantifiers (*, +, ?, {n,m}) are "greedy," meaning they match as much as possible.
- "<a><b><c>" with the regex <.*> will match the entire string "<a><b><c>".
- To make them "lazy" (match as little as possible), append a ?: <.*?> will match <a>, then <b>, then <c> individually.
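Anchors in grep: grep reads its input one line at a time, so ^ and $ always anchor to the start and end of each line rather than of the whole file. Switching to \A and \Z with grep -P is not a reliable way to get whole-file anchoring. To match across line breaks, GNU grep's -z option treats the input as NUL-separated records, so a file without NUL bytes is read as a single record.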
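Backslashes: A backslash must be escaped for the regex engine and again for the string literal that holds the pattern. To match one literal backslash in Python, the regex is \\, which is written "\\\\" as an ordinary string or r"\\" as a raw string. Prefer raw strings for patterns.
Character Classes vs. Alternation: [abc] and a|b|c match the same single characters, but a class only ever matches one character, while alternation can choose between multi-character alternatives (cat|dog). Inside brackets | is a literal, so [cat|dog] matches one of the characters c, a, t, |, d, o, g. Alternation also has the lowest precedence: ^cat|dog$ means (^cat)|(dog$), so group it as ^(?:cat|dog)$ when the anchors should apply to both.
Word Boundaries (\b): \b matches a position, not a character, and both punctuation and the edges of the string count as non-word context. \bfoo\b therefore matches "foo" in both "foo." and ".foo", but not in "foobar" or "foo_bar", because letters, digits, and the underscore are all word characters.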
Lookbehind Limitations: Some regex engines have limitations on lookbehind patterns, such as requiring fixed-width patterns.
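Unicode: Shorthand classes differ between engines once non-ASCII text is involved. In Python 3, \w, \d, and \s match non-ASCII letters, digits, and whitespace by default (re.ASCII restricts them); in JavaScript, \w and \d stay ASCII-only, and Unicode property escapes such as \p{L} require the u flag. Check how your engine defines these classes before relying on them for international text.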
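
Several of these gotchas are easy to reproduce. The following Python sketch is illustrative only, with invented sample strings; it demonstrates greedy versus lazy matching, backslash escaping, the class-versus-alternation difference, and word boundaries.

```python
import re

# Greedy vs. lazy: ".*" grabs as much as it can, ".*?" as little as it can.
html = "<a><b><c>"
greedy = re.findall(r"<.*>", html)
lazy = re.findall(r"<.*?>", html)

# Backslashes: both patterns below are the two-character regex \\ ,
# which matches one literal backslash.
path = "C:\\temp"  # the seven characters C:\temp
plain_index = re.search("\\\\", path).start()
raw_index = re.search(r"\\", path).start()

# Inside a character class "|" is a literal, not alternation.
pipe_in_class = re.fullmatch(r"[cat|dog]", "|") is not None

# Alternation has the lowest precedence: ^cat|dog$ is (^cat)|(dog$).
loose = re.search(r"^cat|dog$", "catalog") is not None
strict = re.search(r"^(?:cat|dog)$", "catalog") is not None

# Word boundaries: punctuation ends a word, an underscore does not.
boundary_hits = [s for s in ["foo.", "foobar", "foo_bar"]
                 if re.search(r"\bfoo\b", s)]
```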