What it is
awk is a powerful text-processing utility for pattern scanning and processing, often used for data extraction, transformation, and reporting from structured text files.
Installation
Linux:
awk is typically pre-installed on most Linux distributions. If not, you can install it using your package manager:
sudo apt update && sudo apt install gawk # Debian/Ubuntu
sudo yum install gawk # CentOS/RHEL
sudo dnf install gawk # Fedora
macOS:
awk is pre-installed on macOS. You can also install gawk (GNU awk) for extended features:
brew install gawk
Windows:
awk is not natively available on Windows. You can install it via:
- Git Bash: Comes with awk.
- Cygwin: Provides a Linux-like environment including awk.
- WSL (Windows Subsystem for Linux): Install a Linux distribution (e.g., Ubuntu) and use its awk.
Core Concepts
- Records (Lines): awk processes input line by line by default. Each line is a record. The special variable NR holds the current record number.
- Fields: Within each record, awk splits the line into fields based on a field separator. By default, whitespace (spaces and tabs) is the separator. Fields are accessed using $1, $2, $3, etc.; $0 represents the entire record.
- Patterns: awk programs consist of pattern { action } pairs. A pattern can be a regular expression, a comparison, a range, or a special pattern like BEGIN or END. If a pattern matches the current record, the associated action is executed.
- Actions: An action is a block of code enclosed in curly braces {} that awk executes when a pattern matches. Actions can include printing fields, performing calculations, manipulating strings, and controlling program flow.
- Field Separator (FS): The character or pattern used to split records into fields. Defaults to whitespace.
- Record Separator (RS): The character or pattern used to split input into records. Defaults to a newline character.
- Output Field Separator (OFS): The string placed between comma-separated items in a print statement. Defaults to a space.
- Output Record Separator (ORS): The string appended after each print statement. Defaults to a newline.
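As a minimal sketch of how these variables interact, the snippet below feeds two made-up colon-separated records to awk (the sample data is hypothetical, not taken from any real file). It sets FS and OFS in a BEGIN block and prints the record number, the field count, and two fields.

```shell
# Hypothetical sample input: two colon-separated records.
out=$(printf 'alice:30:admin\nbob:25:user\n' |
  awk 'BEGIN { FS = ":"; OFS = " | " }   # input and output field separators
       { print NR, NF, $1, $3 }          # record number, field count, two fields
  ')
printf '%s\n' "$out"
```

Each output line joins the printed items with OFS, so the first record comes out as "1 | 3 | alice | admin".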
Commands / Usage
awk commands are typically structured as awk 'program' filename(s). The program is enclosed in single quotes.
Basic Data Extraction
- Print the entire line:
      awk '{ print }' data.txt
  Prints every line from data.txt.
- Print a specific field:
      awk '{ print $1 }' data.txt
  Prints the first field of each line from data.txt.
- Print multiple fields:
      awk '{ print $1, $3 }' data.txt
  Prints the first and third fields of each line from data.txt, separated by the OFS (space by default).
- Print fields with custom formatting:
      awk '{ printf "Name: %s, Age: %s\n", $1, $2 }' users.txt
  Prints the first field after "Name:" and the second after "Age:" for each line in users.txt.
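The printf form is mostly useful for aligned output. The sketch below uses hypothetical inline data in place of users.txt and pads the name to six characters and the number to three.

```shell
# Hypothetical sample input standing in for users.txt.
out=$(printf 'alice 30 admin\nbob 25 user\n' |
  awk '{ printf "%-6s %3d\n", $1, $2 }')   # left-align name, right-align number
printf '%s\n' "$out"
```

Unlike print, printf adds no newline of its own, which is why the format string ends in \n.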
Pattern Matching
- Print lines containing a specific string:
      awk '/error/ { print }' log.txt
  Prints all lines from log.txt that contain the substring "error" (matching is case-sensitive).
- Print lines matching a regular expression:
      awk '/^[0-9]{3}-[0-9]{2}-[0-9]{4}/ { print }' data.txt
  Prints lines from data.txt that start with a pattern resembling an SSN (e.g., 123-45-6789). Interval expressions such as {3} are part of POSIX extended regular expressions, but some older awk implementations do not support them.
- Print lines NOT matching a pattern:
      awk '!/debug/ { print }' log.txt
  Prints all lines from log.txt that do not contain the substring "debug".
- Print lines based on field content:
      awk '$3 > 100 { print }' data.txt
  Prints lines where the third field's value is greater than 100.
- Print lines based on string comparison:
      awk '$1 == "apple" { print }' fruits.txt
  Prints lines where the first field is exactly "apple".
- Print lines based on multiple conditions (AND):
      awk '$1 == "banana" && $2 < 50 { print }' data.txt
  Prints lines where the first field is "banana" AND the second field is less than 50.
- Print lines based on multiple conditions (OR):
      awk '$1 == "orange" || $1 == "grape" { print }' fruits.txt
  Prints lines where the first field is "orange" OR "grape".
- Print lines within a range of record numbers:
      awk 'NR >= 10 && NR <= 20 { print }' data.txt
  Prints lines from record number 10 to 20 (inclusive).
- Print lines between two patterns:
      awk '/START_SECTION/, /END_SECTION/ { print }' config.txt
  Prints lines from the one that matches /START_SECTION/ up to and including the one that matches /END_SECTION/.
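To see a comparison pattern and a range pattern side by side, here is a small sketch over hypothetical inline data. The marker names START and END are illustrative only.

```shell
# Hypothetical sample input with a marked section in the middle.
data='apple 120
banana 40
START
cherry 300
END
date 90'

# Comparison pattern: names whose second field exceeds 100.
big=$(printf '%s\n' "$data" | awk '$2 > 100 { print $1 }')

# Range pattern: everything from START through END, markers included.
section=$(printf '%s\n' "$data" | awk '/^START$/, /^END$/ { print }')

printf '%s\n---\n%s\n' "$big" "$section"
```

Lines with no second field (START and END) compare as 0 here, so they fail the $2 > 100 test.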
Special Patterns: BEGIN and END
- Execute before processing any lines:
      awk 'BEGIN { print "Starting processing..." } { print }' data.txt
  Prints "Starting processing..." before reading any lines from data.txt.
- Execute after processing all lines:
      awk '{ sum += $1 } END { print "Total sum:", sum }' numbers.txt
  Calculates the sum of the first field of all lines in numbers.txt and prints the total at the end.
- Initialize variables in BEGIN:
      awk 'BEGIN { FS=":"; count=0 } { count++ } END { print "Total records:", count }' /etc/passwd
  Sets the field separator to ":" and initializes a counter to 0, then counts records and prints the total.
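The order of execution can be sketched with three rules in one program: BEGIN runs once before any input, the unlabelled rule runs per record, and END runs once after the last record. The three-line numeric input is hypothetical.

```shell
# Hypothetical sample input: one number per line.
out=$(printf '10\n20\n30\n' |
  awk 'BEGIN { print "start" }                      # before any input is read
       { sum += $1 }                                # once per record
       END { print "records:", NR, "sum:", sum }    # after the last record
  ')
printf '%s\n' "$out"
```

In the END block, NR still holds the number of the last record read, so it doubles as a record count.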
Field and Record Separators
- Set input field separator (e.g., comma):
      awk -F',' '{ print $1, $2 }' data.csv
  Processes data.csv using commas as field separators. This does not handle quoted CSV fields that themselves contain commas.
- Set input field separator to a regular expression:
      awk -F'[ :]' '{ print $1, $2 }' data.txt
  Uses either a single space or a single colon as the field separator.
- Set input record separator (e.g., blank line):
      awk -v RS='' '{ print $1 }' data.txt
  An empty RS selects paragraph mode: records are separated by one or more blank lines, and a newline always acts as a field separator. A multi-character RS such as '\n\n' is not portable across awk implementations.
- Set output field separator:
      awk 'BEGIN { OFS=" | " } { print $1, $2, $3 }' data.txt
  Prints the first three fields separated by " | ".
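A portable way to read blank-line-separated blocks is paragraph mode, selected by an empty RS. The sketch below assumes hypothetical two-line blocks and sets FS to a newline so that each line of a block becomes one field.

```shell
# Hypothetical sample input: two blocks separated by a blank line.
out=$(printf 'name: alice\nrole: admin\n\nname: bob\nrole: user\n' |
  awk 'BEGIN { RS = ""; FS = "\n"; OFS = " / " }   # paragraph mode, one field per line
       { print NR, $1, $2 }')
printf '%s\n' "$out"
```

Each block is one record, so NR counts blocks rather than lines.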
Variables and Arithmetic
- Counting lines:
      awk 'END { print NR }' data.txt
  Prints the total number of records (lines) processed.
- Summing values:
      awk '{ total += $2 } END { print total }' prices.txt
  Calculates the sum of the second column in prices.txt.
- Calculating averages:
      awk '{ sum += $1; count++ } END { if (count > 0) print sum / count }' numbers.txt
  Calculates the average of the first column; the count check avoids a division by zero on empty input.
- Counting occurrences of a pattern:
      awk '/warning/ { count++ } END { print count + 0 }' log.txt
  Counts how many lines contain "warning". Adding 0 makes the result print as 0 instead of an empty line when nothing matches.
- Counting occurrences of each unique value in a field:
      awk '{ seen[$1]++ } END { for (key in seen) print key, seen[key] }' data.txt
  Counts the occurrences of each unique value in the first field. The order in which for (key in seen) visits keys is unspecified; pipe the output through sort if order matters.
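The three idioms above can be tried together on a small hypothetical log. The sketch computes an average, counts a pattern that never matches, and tallies the first field, sorting the tally because awk's for-in order is unspecified.

```shell
# Hypothetical sample input: a status word and a number per line.
log='ok 12
warning 7
ok 5
warning 9
ok 4'

# Average of column 2, guarded against empty input.
avg=$(printf '%s\n' "$log" | awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }')

# Count of lines matching a pattern that is absent; "+ 0" forces a numeric 0.
errors=$(printf '%s\n' "$log" | awk '/error/ { c++ } END { print c + 0 }')

# Occurrences per distinct first field; sorted for a stable order.
counts=$(printf '%s\n' "$log" |
  awk '{ seen[$1]++ } END { for (k in seen) print k, seen[k] }' | sort)

printf '%s\n%s\n%s\n' "$avg" "$errors" "$counts"
```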
String Manipulation
- Concatenating strings:
      awk '{ print $1 "-" $2 }' data.txt
  Prints the first and second fields joined by a hyphen.
- Getting string length:
      awk '{ print length($1) }' data.txt
  Prints the length of the first field.
- Substrings:
      awk '{ print substr($1, 1, 3) }' data.txt
  Prints the first 3 characters of the first field.
- Finding string position:
      awk '/example/ { print index($0, "example") }' data.txt
  Prints the starting position (1-based) of "example" within lines that contain it.
- Replacing strings:
      awk '{ gsub(/old/, "new"); print }' data.txt
  Globally replaces all occurrences of "old" with "new" in each line and prints the modified line.
- Replacing the first occurrence:
      awk '{ sub(/old/, "new"); print }' data.txt
  Replaces only the first occurrence of "old" with "new" in each line.
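A quick sketch of the difference between sub and gsub, plus length, index, and substr, on one hypothetical input line. Passing a third argument to sub/gsub edits that variable instead of $0; the variable names first and all are illustrative.

```shell
# Hypothetical sample input: a single file name.
out=$(printf 'old-old-file.txt\n' |
  awk '{
    first = $0; sub(/old/, "new", first)    # replaces the first match only
    all   = $0; gsub(/old/, "new", all)     # replaces every match
    print first
    print all
    print length($0), index($0, "file"), substr($0, 1, 3)
  }')
printf '%s\n' "$out"
```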
Control Flow
- If-else statements:
      awk '{ if ($1 > 10) print $1, "High"; else print $1, "Low" }' numbers.txt
  Prints "High" or "Low" based on the value of the first field.
- Loops (for):
      awk 'BEGIN { for (i=1; i<=5; i++) print i }'
  Prints numbers 1 through 5.
- Loops (while):
      awk '{ i = 1; while (i <= NF) { print "Field " i ":", $i; i++ } }' data.txt
  Iterates through all fields (NF is the number of fields) of each line.
- Next statement (skip to next record):
      awk '/skip_this/ { next } { print }' data.txt
  Skips processing for lines containing "skip_this" and proceeds to the next record.
- Exit statement:
      awk '/END_MARKER/ { print "Found marker, exiting."; exit } { print }' data.txt
  Prints lines until "END_MARKER" is found, then prints a message and exits.
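The following sketch combines next, exit, and if-else in one program. The input, the STOP marker, and the label variable are all hypothetical.

```shell
# Hypothetical sample input: a comment line, a STOP marker, and data rows.
out=$(printf 'a 5\n# comment\nb 15\nSTOP\nc 20\n' |
  awk '/^#/    { next }                      # skip comment lines
       /^STOP/ { exit }                      # stop reading input here
       { if ($2 > 10) label = "high"; else label = "low"
         print $1, label }')
printf '%s\n' "$out"
```

Because rules run in order, the next and exit rules act as guards: the last rule only sees records that neither guard consumed.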
Arrays
- Associative arrays (key-value pairs):
      awk '{ counts[$1]++ } END { for (word in counts) print word, counts[word] }' words.txt
  Counts the frequency of each first-field word in words.txt.
- Collecting matching lines in an array:
      awk '/apple|banana/ { fruits[$0]++ } END { for (f in fruits) print f }' data.txt
  Prints each distinct line containing "apple" or "banana" once (de-duplicating repeated lines). The output order is unspecified.
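When the original order matters, a common alternative is to test the array in the pattern itself, so each line is printed the first time it is seen. The sketch below also shows the in operator for membership tests; the sample lines are hypothetical.

```shell
# Hypothetical sample input with repeated lines.
lines='apple pie
banana split
apple pie
cherry tart
banana split'

# Print matching lines once each, in first-seen order.
unique=$(printf '%s\n' "$lines" | awk '/apple|banana/ && !seen[$0]++')

# Membership test with "in": referencing seen[$1] creates the key.
member=$(printf '%s\n' "$lines" |
  awk '{ seen[$1] }
       END { if ("apple" in seen) print "has apple"
             if (!("kiwi" in seen)) print "no kiwi" }')

printf '%s\n%s\n' "$unique" "$member"
```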
Built-in Variables
- NR: Number of the current record (line) processed.
- FNR: Number of the current record within the current file (useful when processing multiple files).
- NF: Number of fields in the current record.
- FS: Input Field Separator (default is whitespace).
- OFS: Output Field Separator (default is space).
- RS: Input Record Separator (default is newline).
- ORS: Output Record Separator (default is newline).
- FILENAME: Name of the current input file.
- ARGC: Argument count (number of command-line arguments).
- ARGV: Argument vector (the actual command-line arguments).
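The difference between NR and FNR only shows up with more than one input file. This sketch creates two small hypothetical files in a temporary directory and prints FILENAME, FNR, NR, and NF for every record.

```shell
# Hypothetical files one.txt and two.txt, created in a scratch directory.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/one.txt"
printf 'c\n'    > "$tmp/two.txt"

# FNR restarts at 1 for each file; NR keeps counting across files.
out=$(cd "$tmp" && awk '{ print FILENAME, FNR, NR, NF }' one.txt two.txt)
rm -rf "$tmp"

printf '%s\n' "$out"
```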
Common Patterns
- Extracting specific columns from a CSV file:
      awk -F',' '{ print $1, $3 }' users.csv
- Finding lines with values greater than a threshold in a specific column:
      awk '$4 > 1000 { print $1, $4 }' sales_data.txt
- Calculating the sum of a column:
      awk '{ sum += $2 } END { print "Total:", sum }' data.txt
- Calculating the average of a column:
      awk '{ sum += $1; count++ } END { if (count > 0) print sum / count }' data.txt
- Printing unique values from a column:
      awk '!seen[$1]++' data.txt
  This is a concise way to print the first line seen for each distinct value of the first field. Add { print $1 } to print only the value.
- Counting occurrences of specific values:
      awk '{ status[$3]++ } END { for (s in status) print s, status[s] }' log.txt
- Reformatting data (e.g., space-separated to tab-separated):
      awk 'BEGIN { OFS="\t" } { print $1, $2, $3 }' data.txt
- Processing multiple files and keeping track of line numbers:
      awk '{ print FILENAME ": Line " FNR ": " $0 }' file1.txt file2.txt
- Filtering logs for specific IP addresses and counting them:
      awk '/192\.168\.1\.100/ { ip_counts["192.168.1.100"]++ } END { print "IP Count:", ip_counts["192.168.1.100"] }' access.log
- Extracting data between two markers, ignoring lines outside:
      awk '/START_DATA/, /END_DATA/ { if (!/START_DATA/ && !/END_DATA/) print }' config.txt
  Prints only the lines strictly between the markers, excluding the marker lines themselves.
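Extraction between markers can also be written with a flag variable instead of a range pattern, which some find easier to extend. This is a sketch on hypothetical input; the variable name inside is illustrative.

```shell
# Hypothetical sample input with one marked block.
out=$(printf 'header\nSTART_DATA\nrow 1\nrow 2\nEND_DATA\nfooter\n' |
  awk '/START_DATA/ { inside = 1; next }   # start capturing after the opening marker
       /END_DATA/   { inside = 0 }         # stop at the closing marker
       inside')                            # a bare true pattern prints the record
printf '%s\n' "$out"
```

Neither marker line is printed: the opening marker is skipped by next, and the closing marker clears the flag before the final rule is tested.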
Gotchas
- Whitespace as default FS: By default, awk treats any run of spaces and tabs as a single delimiter and ignores leading and trailing whitespace. Setting FS=" " (a single space) explicitly selects this same default behavior. Empty fields appear when FS is any other single character or a regular expression, such as FS="\t" or FS="[ ]", and the data contains consecutive delimiters. FS="[ \t]+" collapses runs, but unlike the default it produces an empty first field when a line starts with whitespace.
- Quoting: Always enclose your awk program in single quotes ('...') to prevent the shell from interpreting special characters within the program. If the program itself needs a single quote, close the quoted string, add an escaped quote, and reopen it ('\''), e.g. awk '{ print "it'\''s" }', or use the octal escape \047 inside an awk string.
- print vs printf: print automatically adds the ORS (newline by default), while printf requires you to explicitly add newlines (\n) if needed. printf offers more control over formatting.
- Arithmetic Operations: awk converts fields to numbers when they are used in arithmetic contexts. A leading numeric prefix is used ("12abc" becomes 12); a value with no numeric prefix is treated as 0.
- Array Initialization: Associative arrays in awk are created automatically when you first access an element. You don't need to declare them explicitly, and uninitialized variables already act as 0 or the empty string, so initializing counters or sums in a BEGIN block is optional but can make intent clearer. Merely referencing an element creates it; use (key in array) to test for membership.
- String vs. Numeric Comparisons: The operators ==, !=, <, and > work on both strings and numbers. The comparison is numeric when both operands are numbers or numeric-looking input fields; otherwise it is a string comparison. As strings, "10" sorts before "9". Force a numeric comparison with $1 + 0 and a string comparison with $1 "".
- Regular Expression Syntax: awk uses Extended Regular Expressions (ERE). Metacharacters must be escaped to match literally (e.g., \. matches a literal dot).
- gawk Extensions: If you need features like case-insensitive matching (IGNORECASE=1), more advanced functions, or better array handling, consider using gawk (GNU awk) and explicitly invoking gawk instead of awk.
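Several of these gotchas are easy to check directly. The sketch below, on hypothetical one-line inputs, compares a single-space FS with a bracketed one, shows numeric versus string comparison of the same fields, and shows how non-numeric text converts in arithmetic.

```shell
# A single space as FS is the default behavior: runs of blanks collapse.
n_default=$(printf 'a  b\n' | awk -F' ' '{ print NF }')

# A bracketed space is a regex matching exactly one space, so an empty field appears.
n_regex=$(printf 'a  b\n' | awk -F'[ ]' '{ print NF }')

# Numeric-looking fields compare as numbers: 10 > 9.
cmp_num=$(printf '10 9\n' |
  awk '{ if ($1 > $2) print "greater"; else print "not greater" }')

# Concatenating "" forces a string comparison: "10" sorts before "9".
cmp_str=$(printf '10 9\n' |
  awk '{ if (($1 "") > ($2 "")) print "greater"; else print "not greater" }')

# Arithmetic uses a leading numeric prefix, or 0 if there is none.
nums=$(printf '12abc x\n' | awk '{ print $1 + 0, $2 + 0 }')

printf '%s %s %s / %s / %s\n' "$n_default" "$n_regex" "$cmp_num" "$cmp_str" "$nums"
```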