What it is
A reference for understanding and constructing regular expressions, the powerful pattern-matching language used in text processing and searching.
Installation
Regular expressions are a language, not a standalone tool. They are integrated into many programming languages and command-line utilities. You don’t "install" regular expressions themselves, but you’ll use them with tools like:
- grep (usually preinstalled on Linux and macOS):
  - Linux/macOS: `sudo apt update && sudo apt install grep` (Debian/Ubuntu) or `brew install grep` (Homebrew)
  - Windows: Available through Git Bash, WSL, or Cygwin.
- sed (usually preinstalled on Linux and macOS):
  - Linux/macOS: `sudo apt update && sudo apt install sed` (Debian/Ubuntu) or `brew install gnu-sed` (Homebrew)
  - Windows: Available through Git Bash, WSL, or Cygwin.
- Python: `python3 your_script.py` (built-in `re` module)
- JavaScript: `node your_script.js` (built-in `RegExp` object)
Core Concepts
- Literals: Characters that match themselves (e.g., `a` matches "a").
- Metacharacters: Characters with special meanings (e.g., `.`, `*`, `+`, `?`, `^`, `$`, `[`, `]`, `{`, `}`, `(`, `)`, `|`, `\`).
- Character Classes: Define a set of characters that can match at a single position.
- Quantifiers: Specify how many times a preceding element (character, group, or character class) must occur.
- Anchors: Assert a position within the string without consuming characters.
- Grouping and Capturing: Parentheses create groups, which can be treated as a unit and whose matched content can be "captured" for later use.
- Alternation: The `|` operator allows matching one pattern OR another.
- Escaping: The backslash `\` is used to escape metacharacters, treating them as literal characters, or to introduce special sequences.
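As a rough illustration of the concepts above, the following sketch exercises each one with Python's built-in `re` module. The sample strings and variable names are invented for the example.

```python
import re

# Literal: each character matches itself.
literal = re.search(r"cat", "concatenate").group()

# Character class + quantifier: one or more digits.
digits = re.findall(r"[0-9]+", "room 12, floor 3")

# Anchor: ^ asserts a position and consumes nothing.
anchored = re.search(r"^room", "room 12") is not None

# Grouping and capturing: each pair of parentheses captures text.
captured = re.search(r"(\w+)@(\w+)", "user@example").groups()

# Alternation: match one pattern OR another.
pets = re.findall(r"cat|dog", "cat, dog, bird")

# Escaping: \. matches a literal dot instead of "any character".
dots = re.findall(r"\.", "a.b.c")

print(literal)   # cat
print(digits)    # ['12', '3']
print(anchored)  # True
print(captured)  # ('user', 'example')
print(pets)      # ['cat', 'dog']
print(dots)      # ['.', '.']
```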
Commands / Usage
This section describes common regular expression patterns and their meanings.
Basic Characters
- `a`: Matches the literal character "a".
- `1`: Matches the literal character "1".
- ` ` (a single space): Matches a space character.
Metacharacters (Special Characters)
- `.`: Matches any single character except a newline.
  - `a.b` matches "aab", "axb", "a b", but not "a\nb".
- `\`: Escapes a metacharacter or introduces a special sequence.
  - `\.`: Matches a literal dot character.
  - `\\`: Matches a literal backslash character.
- `^`: Matches the beginning of the string or line.
  - `^Hello` matches "Hello world" but not "Say Hello".
- `$`: Matches the end of the string or line.
  - `world$` matches "Hello world" but not "world Hello".
- `|`: Acts as an OR operator. Matches either the expression before or after the pipe.
  - `cat|dog` matches "cat" or "dog".
- `()`: Groups expressions. Creates a capturing group.
  - `(abc)+` matches "abc", "abcabc", "abcabcabc".
- `[]`: Defines a character set. Matches any single character within the brackets.
  - `[aeiou]` matches any lowercase vowel.
  - `[0-9]` matches any digit.
  - `[a-zA-Z]` matches any uppercase or lowercase ASCII letter.
  - `[^0-9]` matches any character that is NOT a digit (negated set).
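- `{}`: Quantifier for specific counts.
  - `a{3}` matches exactly three "a"s ("aaa").
  - `a{2,4}` matches between two and four "a"s ("aa", "aaa", "aaaa").
  - `a{2,}` matches two or more "a"s ("aa", "aaa", "aaaa", …).
  - `a{,3}` matches zero to three "a"s ("", "a", "aa", "aaa") in engines that accept an omitted lower bound, such as Python's `re`. Some other engines, JavaScript's among them, do not treat `{,3}` as a quantifier, so `a{0,3}` is the portable spelling.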
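Several of the claims in this list can be checked directly. The sketch below uses Python's `re` module with made-up inputs; results in other engines may differ where noted above.

```python
import re

# . matches any character except a newline (unless DOTALL is set).
candidates = ["aab", "axb", "a b", "a\nb"]
dot_hits = [s for s in candidates if re.fullmatch(r"a.b", s)]

# Negated set: [^0-9] matches any single non-digit character.
non_digits = re.findall(r"[^0-9]", "a1b2")

# Counted repetition is greedy: a{2,4} takes up to four "a"s at once.
runs = re.findall(r"a{2,4}", "a aa aaaaa")

# A quantifier after a group repeats the whole group.
repeated = re.fullmatch(r"(abc)+", "abcabc") is not None

print(dot_hits)    # ['aab', 'axb', 'a b']
print(non_digits)  # ['a', 'b']
print(runs)        # ['aa', 'aaaa']
print(repeated)    # True
```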
Quantifiers
- `*`: Matches the preceding element zero or more times.
  - `a*` matches "", "a", "aa", "aaa", …
  - `.*` matches any sequence of characters (including an empty string).
- `+`: Matches the preceding element one or more times.
  - `a+` matches "a", "aa", "aaa", … (but not "").
- `?`: Matches the preceding element zero or one time.
  - `colou?r` matches "color" and "colour".
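A minimal sketch of the three quantifiers in Python's `re` module. The test strings are illustrative only.

```python
import re

# * allows zero repetitions, so it also matches the empty string.
star_matches_empty = re.fullmatch(r"a*", "") is not None

# + needs at least one repetition, so the empty string fails.
plus_matches_empty = re.fullmatch(r"a+", "") is not None

# ? makes the preceding element optional (zero or one "u").
spellings = re.findall(r"colou?r", "color colour colouur")

print(star_matches_empty)  # True
print(plus_matches_empty)  # False
print(spellings)           # ['color', 'colour']
```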
Special Character Sequences (Often start with \)
- `\d`: Matches any digit (equivalent to `[0-9]` when matching is ASCII-only; Unicode-aware engines, such as Python 3's `re` with `str` patterns, also accept other Unicode decimal digits).
  - `\d\d` matches two consecutive digits (e.g., "12").
- `\D`: Matches any non-digit character (equivalent to `[^0-9]`).
- `\w`: Matches any word character (alphanumeric plus underscore: `[a-zA-Z0-9_]`).
  - `\w+` matches one or more word characters (e.g., "hello", "user_123").
- `\W`: Matches any non-word character (equivalent to `[^a-zA-Z0-9_]`).
- `\s`: Matches any whitespace character (space, tab, newline, etc.).
  - `hello\sworld` matches "hello world", "hello\tworld".
- `\S`: Matches any non-whitespace character.
- `\b`: Matches a word boundary. This is a zero-width assertion, meaning it matches a position, not a character. It matches the position between a word character and a non-word character, or at the beginning/end of the string if the first/last character is a word character.
  - `\bcat\b` matches "cat" as a whole word, but not "catalog" or "tomcat".
- `\B`: Matches a non-word boundary.
- `\A`: Matches the absolute beginning of the string (similar to `^`, but `^` can also match at the start of a line in multiline mode).
- `\Z`: Matches the end of the string (similar to `$`, but `$` can also match at the end of a line in multiline mode). In Python, `\Z` matches only at the very end; in PCRE and Perl it also matches just before a final newline, and `\z` is the strict end-of-string anchor. JavaScript supports neither `\A` nor `\Z`.
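The sequences above can be tried out as follows. This is a sketch in Python's `re` module with invented sample text; other engines may treat `\d`, `\w`, and `\Z` slightly differently, as noted.

```python
import re

text = "cat catalog tomcat cat_1 42"

# \b...\b keeps only the standalone word; "_" counts as a word character.
whole_word = re.findall(r"\bcat\b", text)

# \w+ collects runs of letters, digits, and underscores.
words = re.findall(r"\w+", text)

# \d+ collects runs of digits, including the "1" inside "cat_1".
numbers = re.findall(r"\d+", text)

# \s+ matches any run of whitespace (tab, spaces, newlines).
pieces = re.split(r"\s+", "hello\tworld  again")

# \A ignores MULTILINE, while ^ follows it.
lines = "first\nsecond"
caret = re.findall(r"^\w+", lines, flags=re.MULTILINE)
slash_a = re.findall(r"\A\w+", lines, flags=re.MULTILINE)

print(whole_word)  # ['cat']
print(words)       # ['cat', 'catalog', 'tomcat', 'cat_1', '42']
print(numbers)     # ['1', '42']
print(pieces)      # ['hello', 'world', 'again']
print(caret)       # ['first', 'second']
print(slash_a)     # ['first']
```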
Grouping and Capturing
- `(expression)`: Creates a capturing group. The matched content can be referred to later (backreferences) or extracted.
  - `(\d{3})-(\d{4})` captures two groups: the first three digits and the following four digits.
- `(?:expression)`: Creates a non-capturing group. Useful for grouping without capturing, which can improve performance and simplify backreferences.
  - `(?:abc)+` matches "abc", "abcabc", etc., but doesn’t capture each "abc" individually.
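A short sketch of capturing, non-capturing groups, and backreferences using Python's `re` module. The phone-like string and the names are placeholders chosen for the example.

```python
import re

# Capturing groups: group(0) is the whole match, groups() the captures.
m = re.search(r"(\d{3})-(\d{4})", "call 555-1234 today")
whole = m.group(0)
first, second = m.groups()

# Non-capturing group: usable with a quantifier, but creates no group.
nc_groups = re.fullmatch(r"(?:abc)+", "abcabc").groups()

# Backreference in the pattern: \1 must repeat what group 1 matched.
doubled = re.findall(r"\b(\w+) \1\b", "it is is a test test")

# Group references in a replacement string.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Doe, John")

print(whole, first, second)  # 555-1234 555 1234
print(nc_groups)             # ()
print(doubled)               # ['is', 'test']
print(swapped)               # John Doe
```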
Lookarounds (Zero-Width Assertions)
Lookarounds check for patterns ahead or behind the current position without consuming characters.
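- `(?=pattern)`: Positive Lookahead. Asserts that `pattern` matches after the current position.
  - `\w+(?=\s+world)` matches "hello" in "hello world".
- `(?!pattern)`: Negative Lookahead. Asserts that `pattern` does not match after the current position.
  - `\w+(?!\s+world)` matches "hello" in "hello there". In "hello world" it still matches "hell", because `\w+` backtracks to a position that is not followed by `\s+world`; writing `\b\w+\b(?!\s+world)` prevents that.
- `(?<=pattern)`: Positive Lookbehind. Asserts that `pattern` matches before the current position.
  - `(?<=hello\s)\w+` matches "world" in "hello world".
- `(?<!pattern)`: Negative Lookbehind. Asserts that `pattern` does not match before the current position.
  - `(?<!hello\s)\bworld` matches "world" in "goodbye world" but not in "hello world". Note that `(?<!hello\s)\w+` applied to "hello there" matches "hello" and then "here", not "there": the assertion only rules out the one position directly after "hello ".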
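The four lookarounds, including the backtracking pitfalls, can be reproduced with this sketch in Python's `re` module. The input strings are illustrative.

```python
import re

# Positive lookahead: a word followed by whitespace and "world".
ahead = re.findall(r"\w+(?=\s+world)", "hello world")

# Negative lookahead without \b: the engine backtracks to "hell".
loose = re.findall(r"\w+(?!\s+world)", "hello world")

# Negative lookahead with \b: whole words only.
strict = re.findall(r"\b\w+\b(?!\s+world)", "hello world")

# Positive lookbehind: a word preceded by "hello" and one whitespace.
behind = re.findall(r"(?<=hello\s)\w+", "hello world")

# Negative lookbehind: "world" not preceded by "hello ".
not_behind = re.findall(r"(?<!hello\s)\bworld", "hello world, goodbye world")

# Unanchored negative lookbehind only skips a single position.
surprise = re.findall(r"(?<!hello\s)\w+", "hello there")

print(ahead)       # ['hello']
print(loose)       # ['hell', 'world']
print(strict)      # ['world']
print(behind)      # ['world']
print(not_behind)  # ['world']
print(surprise)    # ['hello', 'here']
```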
Flags/Modifiers
These are often applied to the regex as a whole, depending on the tool or language.
- `i`: Case-insensitive matching.
- `g`: Global matching (find all occurrences, not just the first).
- `m`: Multiline mode. `^` matches the start of each line, and `$` matches the end of each line.
- `s`: Dotall mode. `.` matches any character, including newline characters.
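How these flags are spelled varies by tool. As one possible mapping, the sketch below shows the Python `re` equivalents; Python has no `g` flag, and the choice of function (`search` versus `findall`) plays that role instead. The sample text is made up.

```python
import re

text = "Hello\nhello world"

# i -> re.IGNORECASE
case_insensitive = re.findall(r"hello", text, flags=re.IGNORECASE)

# g -> no flag in Python: search() stops at the first match,
# while findall() scans the whole string.
first_only = re.search(r"\w+", text).group()
all_words = re.findall(r"\w+", text)

# m -> re.MULTILINE: ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# s -> re.DOTALL: . also matches the newline between the two lines.
without_s = re.search(r"Hello.hello", text)
with_s = re.search(r"Hello.hello", text, flags=re.DOTALL)

print(case_insensitive)    # ['Hello', 'hello']
print(first_only)          # Hello
print(all_words)           # ['Hello', 'hello', 'world']
print(line_starts)         # ['Hello', 'hello']
print(without_s is None)   # True
print(with_s is not None)  # True
```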
Common Patterns
- Matching email addresses: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
  - Matches common email formats.
- Matching URLs: `(https?:\/\/)?([\da-zA-Z\.-]+)\.([a-zA-Z\.]{2,6})([\/\w \.-]*)*\/?`
  - A simplified URL pattern.
- Matching dates (YYYY-MM-DD): `\d{4}-\d{2}-\d{2}`
  - Matches the shape of the format only; it does not check that the month and day are valid (for example, "2024-99-99" also matches).
- Matching phone numbers (e.g., XXX-XXX-XXXX): `\d{3}-\d{3}-\d{4}`
  - Matches phone numbers in a specific format.
- Finding all words in a text: `\w+`
  - Use with the global flag (`g`) to find all sequences of word characters.
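- Extracting numbers from a string: `grep -oP '\d+' your_file.txt`
  - `-o` outputs only the matched parts, and `-P` enables Perl-compatible syntax (a GNU grep option; the BSD grep shipped with macOS does not support it). `\d+` matches one or more digits.
- Replacing multiple spaces with a single space: `sed 's/ \+/ /g'`
  - `s/pattern/replacement/g` is the `sed` command for substitution. ` \+` matches one or more spaces; `\+` is a GNU extension in basic regular expressions, so `sed -E 's/ +/ /g'` is the more portable form.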
- Finding lines that do not contain a specific word: `grep -v 'specific_word'`
  - The `-v` flag in `grep` inverts the match.
- Finding lines that do contain a specific word (case-insensitive): `grep -i 'specific_word'`
  - The `-i` flag in `grep` makes it case-insensitive.
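- Extracting captured groups:
  - In `grep -P`, `\1`, `\2`, etc. are backreferences to captured groups inside the pattern. `grep` cannot print a single group on its own, but `-o` combined with `\K` or a lookaround can isolate part of a match.
  - `echo "First: John, Last: Doe" | grep -oP 'First: \K\w+'` -> `John`
  - `\K` discards the match up to that point.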
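For comparison with the shell one-liners above, here is a sketch of the same patterns in Python's `re` module. The constant names and the sample sentence are assumptions made for illustration, and since `re` does not support `\K`, a capturing group is used for the last extraction.

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")

text = "Mail jane.doe@example.com by 2024-03-15 or call 555-867-5309."

emails = EMAIL.findall(text)
dates = DATE.findall(text)
phones = PHONE.findall(text)

# Roughly what grep -oP '\d+' prints, one match per list item.
numbers = re.findall(r"\d+", "3 apples, 12 pears")

# Roughly what the sed substitution does.
squeezed = re.sub(r" +", " ", "too    many   spaces")

# A capturing group stands in for grep's \K.
first_name = re.search(r"First: (\w+)", "First: John, Last: Doe").group(1)

print(emails)      # ['jane.doe@example.com']
print(dates)       # ['2024-03-15']
print(phones)      # ['555-867-5309']
print(numbers)     # ['3', '12']
print(squeezed)    # too many spaces
print(first_name)  # John
```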
Gotchas
- Greediness: By default, quantifiers (`*`, `+`, `?`, `{n,m}`) are "greedy," meaning they match as much as possible.
  - Applied to "<a><b><c>", the regex `<.*>` will match the entire string "<a><b><c>".
  - To make them "lazy" (match as little as possible), append a `?`: `<.*?>` will match `<a>`, then `<b>`, then `<c>` individually.
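- Anchors in `grep`: `grep` processes its input one line at a time, so `^` and `$` match the start and end of each line, not of the whole input. `\A` and `\Z` under `grep -P` are applied per line as well. To match against the entire input as one string, GNU grep's `-z` option (which uses NUL instead of newline as the line terminator) can be combined with `-P`.
- Backslashes: Backslashes need to be escaped in many contexts, especially within string literals in programming languages. For example, to match a literal backslash in Python, the regex `\\` must be written as `"\\\\"` in an ordinary string literal, or as `r"\\"` in a raw string.
- Character Classes vs. Alternation: `[abc]` matches exactly one character that is 'a', 'b', or 'c'. `a|b|c` matches the same single characters, but alternation can also choose between longer patterns (`cat|dog`). Inside brackets, `|` is a literal character, so `[cat|dog]` matches a single 'c', 'a', 't', '|', 'd', 'o', or 'g'.
- Word Boundaries (`\b`): `\b` matches a position, not a character, and the underscore counts as a word character. `\bfoo\b` will match "foo" in both "foo." and ".foo", but not in "foobar" or "foo_bar".
- Lookbehind Limitations: Some regex engines have limitations on lookbehind patterns, such as requiring fixed-width patterns (Python's `re` module, for example, rejects variable-width lookbehind).
- Unicode: Whether `\w`, `\d`, `\b`, and case-insensitive matching are Unicode-aware depends on the engine and its flags. In Python 3, `str` patterns are Unicode-aware by default and `re.ASCII` restricts them to ASCII. In JavaScript, `\w` and `\d` are ASCII-only, and Unicode property escapes such as `\p{L}` require the `u` flag.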
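Most of the gotchas above are easy to reproduce. The following sketch uses Python's `re` module with invented inputs; behavior in `grep`, `sed`, or JavaScript may differ as described.

```python
import re

# Greedy vs. lazy quantifiers.
html = "<a><b><c>"
greedy = re.findall(r"<.*>", html)
lazy = re.findall(r"<.*?>", html)

# Backslashes: both spellings are the two-character regex \\ .
path = "C:\\temp"  # the 7-character string C:\temp
escaped = re.findall("\\\\", path)
raw = re.findall(r"\\", path)

# Word boundaries: punctuation creates a boundary, "_" does not.
samples = ["foo.", ".foo", "foo_bar", "foobar"]
bounded = [s for s in samples if re.search(r"\bfoo\b", s)]

# Character class vs. alternation.
class_hits = re.findall(r"[cat|dog]", "cat")
alt_hits = re.findall(r"cat|dog", "cat")

# Unicode: \w is Unicode-aware for str patterns unless re.ASCII is set.
accented = "na\u00efve caf\u00e9"
unicode_words = re.findall(r"\w+", accented)
ascii_words = re.findall(r"\w+", accented, flags=re.ASCII)

print(greedy)         # ['<a><b><c>']
print(lazy)           # ['<a>', '<b>', '<c>']
print(escaped == raw) # True
print(bounded)        # ['foo.', '.foo']
print(class_hits)     # ['c', 'a', 't']
print(alt_hits)       # ['cat']
print(ascii_words)    # ['na', 've', 'caf']
```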