awk Text Processing

awk cheatsheet — print columns, filter rows, sum values, use regex. awk '{print $2}', awk -F: '{print $1}', awk '/pattern/ {sum+=$3}'. Real examples.

What it is

awk is a text-processing language that scans input for patterns and runs actions on the records that match. It is commonly used for data extraction, transformation, and reporting on structured text files.

Installation

Linux: an awk implementation (commonly gawk or mawk) is pre-installed on most distributions. To install GNU awk explicitly, use your package manager:

```
sudo apt update && sudo apt install gawk # Debian/Ubuntu
sudo yum install gawk # CentOS/RHEL
sudo dnf install gawk # Fedora
```

macOS: awk is pre-installed on macOS. You can also install gawk (GNU awk) for extended features:

```
brew install gawk
```

Windows: awk is not natively available on Windows. You can install it via:

Git Bash: Comes with awk.
Cygwin: Provides a Linux-like environment including awk.
WSL (Windows Subsystem for Linux): Install a Linux distribution (e.g., Ubuntu) and use its awk.

Core Concepts

Records (Lines): awk processes input line by line by default. Each line is considered a record. The special variable NR keeps track of the current record number.
Fields: Within each record, awk splits the line into fields based on a field separator. By default, whitespace (spaces and tabs) is the separator. Fields are accessed using $1, $2, $3, etc., where $0 represents the entire record.
Patterns: awk programs consist of pattern { action } pairs. A pattern can be a regular expression, a comparison, a range, or a special pattern like BEGIN or END. If a pattern matches the current record, the associated action is executed.
Actions: An action is a block of code enclosed in curly braces {} that awk executes when a pattern matches. Actions can include printing fields, performing calculations, manipulating strings, and controlling program flow.
Field Separator (FS): The character or pattern used to split records into fields. Defaults to whitespace.
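Record Separator (RS): The character used to split input into records. Defaults to a newline. An empty RS selects paragraph mode (records separated by blank lines); some implementations, such as gawk, also accept a regular expression.
Output Field Separator (OFS): The string placed between items that print separates with commas, and used to rebuild $0 when a field is modified. Defaults to a space.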
Output Record Separator (ORS): The character appended after each printed record. Defaults to a newline.
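
To make these terms concrete, here is a minimal sketch run on two made-up records (the name:score data is hypothetical). It sets FS and OFS in BEGIN, then prints NR, NF, and both fields of each record:

```shell
# Hypothetical sample input: one "name:score" record per line.
result=$(printf 'alice:90\nbob:85\n' |
  awk 'BEGIN { FS = ":"; OFS = " -> " }  # split input on ":", join output with " -> "
       { print NR, NF, $1, $2 }')         # record number, field count, both fields
printf '%s\n' "$result"
```

Each output line starts with the record number and the field count (2 for both records), followed by the two fields, all joined by the OFS.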

Commands / Usage

awk commands are typically structured as awk 'program' filename(s). The program is enclosed in single quotes.

Basic Data Extraction

Print the entire line:
```
awk '{ print }' data.txt
```
Prints every line from data.txt.
Print a specific field:
```
awk '{ print $1 }' data.txt
```
Prints the first field of each line from data.txt.
Print multiple fields:
```
awk '{ print $1, $3 }' data.txt
```
Prints the first and third fields of each line from data.txt, separated by the OFS (space by default).
Print fields with custom formatting:
```
awk '{ printf "Name: %s, Age: %s\n", $1, $2 }' users.txt
```
Prints the first field as "Name:" and the second as "Age:" for each line in users.txt.
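
printf also accepts width and alignment specifiers, which helps when columns need to line up. A small sketch on hypothetical name/age input:

```shell
# %-8s left-aligns the name in 8 columns; %3d right-aligns the number in 3.
table=$(printf 'alice 30\nbob 4\n' | awk '{ printf "%-8s|%3d\n", $1, $2 }')
printf '%s\n' "$table"
```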

Pattern Matching

Print lines containing a specific string:
```
awk '/error/ { print }' log.txt
```
Prints all lines from log.txt that contain the substring "error".
Print lines matching a regular expression:
```
awk '/^[0-9]{3}-[0-9]{2}-[0-9]{4}/ { print }' data.txt
```
Prints lines from data.txt that start with a pattern resembling an SSN (e.g., 123-45-6789).
Print lines NOT matching a pattern:
```
awk '!/debug/ { print }' log.txt
```
Prints all lines from log.txt that do not contain the substring "debug".
Print lines based on field content:
```
awk '$3 > 100 { print }' data.txt
```
Prints lines where the third field’s value is greater than 100.
Print lines based on string comparison:
```
awk '$1 == "apple" { print }' fruits.txt
```
Prints lines where the first field is exactly "apple".
Print lines based on multiple conditions (AND):
```
awk '$1 == "banana" && $2 < 50 { print }' data.txt
```
Prints lines where the first field is "banana" AND the second field is less than 50.
Print lines based on multiple conditions (OR):
```
awk '$1 == "orange" || $1 == "grape" { print }' fruits.txt
```
Prints lines where the first field is "orange" OR "grape".
Print lines within a range of record numbers:
```
awk 'NR >= 10 && NR <= 20 { print }' data.txt
```
Prints lines from record number 10 to 20 (inclusive).
Print lines between two patterns:
```
awk '/START_SECTION/, /END_SECTION/ { print }' config.txt
```
Prints lines from the one that matches /START_SECTION/ up to and including the one that matches /END_SECTION/.
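
A bare /regex/ pattern tests the whole record. To test a single field instead, awk provides the ~ (matches) and !~ (does not match) operators. A minimal sketch on made-up log lines:

```shell
# Print the first field only when the second field is exactly "error".
# The third line is skipped because "errors" does not match the anchored regex.
hits=$(printf 'web error 500\ndb ok 200\napi errors 503\n' |
  awk '$2 ~ /^error$/ { print $1 }')
printf '%s\n' "$hits"
```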

Special Patterns: BEGIN and END

Execute before processing any lines:
```
awk 'BEGIN { print "Starting processing..." } { print }' data.txt
```
Prints "Starting processing..." once before reading any input, then prints every line from data.txt.
Execute after processing all lines:
```
awk '{ sum += $1 } END { print "Total sum:", sum }' numbers.txt
```
Calculates the sum of the first field of all lines in numbers.txt and prints the total at the end.
Initialize variables in BEGIN:
```
awk 'BEGIN { FS=":"; count=0 } { count++ } END { print "Total records:", count }' /etc/passwd
```
Sets the field separator to ":" and initializes a counter to 0, then counts records and prints the total.
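
BEGIN, a main rule, and END can be combined in one program to produce a header, per-record output, and a footer. A sketch with hypothetical numeric input:

```shell
report=$(printf '3\n4\n5\n' |
  awk 'BEGIN { print "values:" }          # runs once, before any input is read
       { sum += $1; print "  " $1 }       # runs for every record
       END { print "total: " sum }')      # runs once, after the last record
printf '%s\n' "$report"
```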

Field and Record Separators

Set input field separator (e.g., comma):
```
awk -F',' '{ print $1, $2 }' data.csv
```
Processes data.csv using commas as field separators. This simple split does not handle quoted fields that themselves contain commas.
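Set input field separator to any of several characters:
```
awk -F'[ :]' '{ print $1, $2 }' data.txt
```
Uses either a space or a colon as the field separator. Every single occurrence separates two fields, so consecutive separators produce empty fields.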
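Set input record separator (e.g., blank line):
```
awk 'BEGIN { RS = "" } { print $1 }' data.txt
```
Treats paragraphs separated by blank lines as records. An empty RS enables paragraph mode in every awk; RS='\n\n' works only in implementations that accept a multi-character RS, such as gawk.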
Set output field separator:
```
awk 'BEGIN { OFS=" | " } { print $1, $2, $3 }' data.txt
```
Prints the first three fields separated by " | ".
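
Separators can be combined. A common case is paragraph mode: an empty RS makes each blank-line-separated block one record, and FS="\n" then makes each line of the block one field. A sketch on hypothetical two-line records:

```shell
# Each record is a block of lines; $1 is the first line, $2 the second.
people=$(printf 'Alice\n30\n\nBob\n25\n' |
  awk 'BEGIN { RS = ""; FS = "\n" } { print $1 ", " $2 }')
printf '%s\n' "$people"
```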

Variables and Arithmetic

Counting lines:
```
awk 'END { print NR }' data.txt
```
Prints the total number of records (lines) processed.
Summing values:
```
awk '{ total += $2 } END { print total }' prices.txt
```
Calculates the sum of the second column in prices.txt.

Calculating averages:
```
awk '{ sum += $1; count++ } END { if (count > 0) print sum / count }' numbers.txt
```
Calculates the average of the first column.

Counting occurrences of a pattern:
```
awk '/warning/ { count++ } END { print count + 0 }' log.txt
```
Counts how many lines contain "warning". Adding 0 makes the result print as 0 rather than an empty line when nothing matches.
Counting unique values in a field:
```
awk '{ seen[$1]++ } END { for (key in seen) print key, seen[key] }' data.txt
```
Counts the occurrences of each unique value in the first field. The order in which for (key in seen) visits keys is unspecified.
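
Values from the shell are best passed in with -v rather than spliced into the program text. A minimal sketch, where threshold and limit are illustrative names:

```shell
threshold=100   # shell variable (hypothetical value)
big=$(printf 'a 50\nb 150\nc 200\n' |
  awk -v limit="$threshold" '$2 > limit { print $1 }')   # limit is an awk variable
printf '%s\n' "$big"
```

Because the program stays inside single quotes, the shell never rewrites it; only the value of limit changes between runs.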

String Manipulation

Concatenating strings:
```
awk '{ print $1 "-" $2 }' data.txt
```
Prints the first and second fields joined by a hyphen.
Getting string length:
```
awk '{ print length($1) }' data.txt
```
Prints the length of the first field.
Substrings:
```
awk '{ print substr($1, 1, 3) }' data.txt
```
Prints the first 3 characters of the first field.
Finding string position:
```
awk '/example/ { print index($0, "example") }' data.txt
```
Prints the starting position of "example" within lines that contain it.
Replacing strings:
```
awk '{ gsub(/old/, "new"); print }' data.txt
```
Globally replaces all occurrences of "old" with "new" in each line and prints the modified line.
Replacing the first occurrence:
```
awk '{ sub(/old/, "new"); print }' data.txt
```
Replaces only the first occurrence of "old" with "new" in each line.
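
sub and gsub take an optional third argument naming the string to modify, and both return the number of replacements made. A sketch that strips hyphens from the second field only (the sample data is made up):

```shell
cleaned=$(printf 'id 555-12-34\n' |
  awk '{ n = gsub(/-/, "", $2); print $2, n }')   # n holds the replacement count
printf '%s\n' "$cleaned"
```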

Control Flow

If-else statements:
```
awk '{ if ($1 > 10) print $1, "High"; else print $1, "Low" }' numbers.txt
```
Prints "High" or "Low" based on the value of the first field.

Loops (for):
```
awk 'BEGIN { for (i=1; i<=5; i++) print i }'
```
Prints numbers 1 through 5.

Loops (while):
```
awk '{ i = 1; while (i <= NF) { print "Field " i ":", $i; i++ } }' data.txt
```
Iterates through all fields (NF is Number of Fields) of each line.
Next statement (skip to next record):
```
awk '/skip_this/ { next } { print }' data.txt
```
Skips processing for lines containing "skip_this" and proceeds to the next record.
Exit statement:
```
awk '/END_MARKER/ { print "Found marker, exiting."; exit } { print }' data.txt
```
Prints lines until "END_MARKER" is found, then prints a message and exits.
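
A for loop over the fields is a compact alternative to the while form, for example to total each row. A sketch on made-up numbers:

```shell
row_sums=$(printf '1 2 3\n10 20\n' |
  awk '{ total = 0                                  # reset for each record
         for (i = 1; i <= NF; i++) total += $i      # add every field
         print total }')
printf '%s\n' "$row_sums"
```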

Arrays

Associative arrays (key-value pairs):
```
awk '{ counts[$1]++ } END { for (word in counts) print word, counts[word] }' words.txt
```
Counts the frequency of each word in words.txt.

Collecting matching lines as array keys:
```
awk '/apple|banana/ { fruits[$0]++ } END { for (f in fruits) print f }' data.txt
```
Stores each line containing "apple" or "banana" as an array key, so repeated lines collapse into one entry. The END loop then prints each distinct line once, in unspecified order.
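
Because for (key in array) visits keys in no particular order, output that must be stable is usually piped through sort. The in operator also tests for a key without creating it. A minimal sketch on made-up input:

```shell
counts=$(printf 'b\na\nb\n' |
  awk '{ c[$1]++ }
       END {
         if (!("z" in c)) print "z 0"      # membership test; does not create c["z"]
         for (k in c) print k, c[k]
       }' | sort)
printf '%s\n' "$counts"
```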

Built-in Variables

NR: Number of records read so far across all input files (the current line number when reading a single file).
FNR: Number of the current record within the current file (useful when processing multiple files).
NF: Number of fields in the current record.
FS: Input Field Separator (default is whitespace).
OFS: Output Field Separator (default is space).
RS: Input Record Separator (default is newline).
ORS: Output Record Separator (default is newline).
FILENAME: Name of the current input file.
ARGC: Argument count (number of command-line arguments).
ARGV: Argument vector (the actual command-line arguments).
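
NR and FNR together enable a common two-file idiom: NR == FNR is true only while the first file is being read (assuming that file is not empty). A sketch using temporary files with hypothetical names and contents:

```shell
tmpdir=$(mktemp -d)
printf 'bob\n' > "$tmpdir/allow.txt"            # keys to keep
printf 'alice 1\nbob 2\n' > "$tmpdir/data.txt"  # records to filter

# First file: remember each key. Second file: print records whose key was seen.
kept=$(awk 'NR == FNR { allow[$1]; next } $1 in allow' \
  "$tmpdir/allow.txt" "$tmpdir/data.txt")
printf '%s\n' "$kept"
rm -rf "$tmpdir"
```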

Common Patterns

Extracting specific columns from a CSV file:
```
awk -F',' '{ print $1, $3 }' users.csv
```
Finding lines with values greater than a threshold in a specific column:
```
awk '$4 > 1000 { print $1, $4 }' sales_data.txt
```

Calculating the sum of a column:
```
awk '{ sum += $2 } END { print "Total:", sum }' data.txt
```

Calculating the average of a column (guarded against empty input, which would otherwise divide by zero):
```
awk '{ sum += $1; count++ } END { if (count > 0) print sum / count }' data.txt
```

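Printing the first line for each unique value in a column:
```
awk '!seen[$1]++' data.txt
```
Prints a line only the first time its first field appears. seen[$1]++ evaluates to 0 (false) on first sight, so the negation is true and the default action prints the whole line. To print only the value, use awk '!seen[$1]++ { print $1 }'.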

Counting occurrences of specific values:
```
awk '{ status[$3]++ } END { for (s in status) print s, status[s] }' log.txt
```

Reformatting data (e.g., space-separated to tab-separated):
```
awk 'BEGIN { OFS="\t" } { print $1, $2, $3 }' data.txt
```
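
The example above prints only three fields. To convert a whole record regardless of its field count, assigning a field to itself forces awk to rebuild $0 with the current OFS. A sketch:

```shell
tsv=$(printf 'a  b   c d\n' |
  awk 'BEGIN { OFS = "\t" } { $1 = $1; print }')   # $1 = $1 rebuilds $0 using OFS
printf '%s\n' "$tsv"
```

The uneven runs of spaces in the input become single tabs in the output.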

Processing multiple files and keeping track of line numbers:
```
awk '{ print FILENAME ": Line " FNR ": " $0 }' file1.txt file2.txt
```

Filtering logs for a specific IP address and counting the matching lines:
```
awk '/192\.168\.1\.100/ { n++ } END { print "IP Count:", n + 0 }' access.log
```

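Extracting data between two markers, ignoring lines outside:
```
awk '/END_DATA/ { inside = 0 } inside { print } /START_DATA/ { inside = 1 }' config.txt
```
Prints only the lines strictly between the markers: the flag is cleared before the END_DATA line can print, and set only after the START_DATA line has been passed over.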

Gotchas

Whitespace as default FS: with the default separator (FS=" "), awk splits on runs of spaces and tabs and ignores leading and trailing whitespace, so mixed spaces and tabs need no special handling. Setting FS=" " explicitly behaves exactly like the default. Empty fields appear only when FS is some other single character or a regex such as "[ ]" or "\t", where every occurrence separates two fields. FS="[ \t]+" collapses runs but, unlike the default, yields an empty first field when a line starts with whitespace.
Quoting: Always enclose your awk program in single quotes ('...') to prevent the shell from interpreting special characters within the program. A single quote cannot appear inside a single-quoted string, so close the string, add an escaped quote, and reopen it (e.g., 'this is a '\''quote'\'' example'). Inside an awk string, the octal escape \047 also produces a single quote.
print vs printf: print automatically adds the ORS (newline by default), while printf requires you to explicitly add newlines (\n) if needed. printf offers more control over formatting.
Arithmetic Operations: awk converts fields to numbers when they are used in arithmetic. The conversion uses the leading numeric portion of the string, so 12abc counts as 12 and a field with no leading number, such as abc, counts as 0.
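Array Initialization: Associative arrays are created automatically when you first reference an element, so no declaration is needed. Merely referencing arr[key] creates that element; use (key in arr) to test for a key without creating it. Uninitialized variables already act as 0 or the empty string, so initializing counters in a BEGIN block is optional, though it can make intent clearer.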
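String vs. Numeric Comparisons: the comparison operators (==, !=, <, >, <=, >=) work on both numbers and strings. awk compares numerically when both operands are numbers or numeric-looking input fields, and as strings otherwise. For a field containing 10.0, $1 == 10 is a numeric comparison and succeeds, while $1 == "10" is a string comparison and fails. Add 0 to force a number (x + 0) or concatenate an empty string to force a string (x "").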
Regular Expression Syntax: awk uses Extended Regular Expressions (ERE). Some characters might need escaping (e.g., . becomes \. to match a literal dot).
gawk Extensions: Features such as case-insensitive matching (IGNORECASE=1), gensub(), asort(), and arrays of arrays exist only in gawk (GNU awk). Scripts that rely on them should invoke gawk explicitly rather than awk.
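
Two of these gotchas are easy to check directly. The sketch below, on made-up input, contrasts numeric and string comparison, and default field splitting with a single-space regex:

```shell
# Numeric-looking input fields compare as numbers: 10 > 9.
cmp_fields=$(printf '10 9\n' | awk '{ print (($1 > $2) ? "greater" : "not greater") }')
# String constants compare as strings: "10" sorts before "9".
cmp_strings=$(awk 'BEGIN { print (("10" > "9") ? "greater" : "not greater") }')
# Default FS collapses runs of spaces; the regex [ ] splits on every space.
nf_default=$(printf 'a  b\n' | awk '{ print NF }')
nf_regex=$(printf 'a  b\n' | awk -F'[ ]' '{ print NF }')
printf '%s\n' "$cmp_fields" "$cmp_strings" "$nf_default" "$nf_regex"
```

The field comparison reports "greater" and the string comparison "not greater"; the same two-space line has 2 fields under the default separator and 3 under [ ], the middle one empty.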