awk Text Processing

awk cheatsheet — print columns, filter rows, sum values, use regex. awk '{print $2}', awk -F: '{print $1}', awk '/pattern/ {sum+=$3}'. Real examples.

What it is

awk is a powerful text-processing utility for pattern scanning and processing, often used for data extraction, transformation, and reporting from structured text files.

Installation

Linux: Most distributions ship an awk implementation by default. If yours does not, or you want GNU awk specifically, install gawk with your package manager:

sudo apt update && sudo apt install gawk # Debian/Ubuntu
sudo yum install gawk # CentOS/RHEL
sudo dnf install gawk # Fedora

macOS: awk is pre-installed on macOS. You can also install gawk (GNU awk) for extended features:

brew install gawk

Windows: awk is not natively available on Windows. You can install it via:

  • Git Bash: Comes with awk.
  • Cygwin: Provides a Linux-like environment including awk.
  • WSL (Windows Subsystem for Linux): Install a Linux distribution (e.g., Ubuntu) and use its awk.

Core Concepts

  • Records (Lines): awk processes input line by line by default. Each line is considered a record. The special variable NR keeps track of the current record number.
  • Fields: Within each record, awk splits the line into fields based on a field separator. By default, whitespace (spaces and tabs) is the separator. Fields are accessed using $1, $2, $3, etc., where $0 represents the entire record.
  • Patterns: awk programs consist of pattern { action } pairs. A pattern can be a regular expression, a comparison, a range, or a special pattern like BEGIN or END. If a pattern matches the current record, the associated action is executed.
  • Actions: An action is a block of code enclosed in curly braces {} that awk executes when a pattern matches. Actions can include printing fields, performing calculations, manipulating strings, and controlling program flow.
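  • Field Separator (FS): The character or regular expression used to split records into fields. The default is a single space, which awk treats specially: fields are separated by runs of spaces, tabs, and newlines, and leading and trailing whitespace is ignored.
  • Record Separator (RS): The character (or, in gawk and some other implementations, regular expression) used to split input into records. Defaults to a newline. An empty RS selects paragraph mode, where blank lines separate records.
  • Output Field Separator (OFS): The string placed between the comma-separated arguments of print, and between fields when $0 is rebuilt after a field assignment. Defaults to a single space.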
  • Output Record Separator (ORS): The character appended after each printed record. Defaults to a newline.
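
How these pieces fit together is easiest to see in a single run. The sketch below is illustrative only: the colon-separated sample input (name, department, salary) is made up, and the output separator " | " is an arbitrary choice.

```shell
# Hypothetical sample input: name:department:salary
records=$(printf 'alice:eng:120\nbob:ops:95\n' | awk '
  # BEGIN runs before any input: split on ":" and join output with " | "
  BEGIN { FS = ":"; OFS = " | " }
  # pattern { action }: only records whose third field exceeds 100
  $3 > 100 { print NR, NF, $1, $3 }
')
printf '%s\n' "$records"
```

With this input only the first record passes the pattern, and the printed line is 1 | 3 | alice | 120 (record number, field count, name, salary).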

Commands / Usage

awk is typically invoked as awk 'program' filename(s). The program is enclosed in single quotes so the shell passes it to awk unchanged. With no file arguments, awk reads standard input.

Basic Data Extraction

  • Print the entire line:

    awk '{ print }' data.txt
    

    Prints every line from data.txt.

  • Print a specific field:

    awk '{ print $1 }' data.txt
    

    Prints the first field of each line from data.txt.

  • Print multiple fields:

    awk '{ print $1, $3 }' data.txt
    

    Prints the first and third fields of each line from data.txt, separated by the OFS (space by default).

  • Print fields with custom formatting:

    awk '{ printf "Name: %s, Age: %s\n", $1, $2 }' users.txt
    

    Formats each line of users.txt as "Name: <field 1>, Age: <field 2>". Unlike print, printf does not append a newline, hence the explicit \n.
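
printf also accepts width and alignment flags, which is handy for lining up columns. A minimal sketch, assuming made-up input with a name and an age per line:

```shell
# Hypothetical input: name and age separated by whitespace
table=$(printf 'alice 30\nbob 4\n' | awk '
  # %-8s left-aligns the name in 8 columns; %3d right-aligns the age in 3
  { printf "%-8s|%3d\n", $1, $2 }
')
printf '%s\n' "$table"
```

The column widths here (8 and 3) are arbitrary; pick values that fit the longest expected field.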

Pattern Matching

  • Print lines containing a specific string:

    awk '/error/ { print }' log.txt
    

    Prints all lines from log.txt that contain the substring "error".

  • Print lines matching a regular expression:

    awk '/^[0-9]{3}-[0-9]{2}-[0-9]{4}/ { print }' data.txt
    

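    Prints lines from data.txt that start with a pattern resembling an SSN (e.g., 123-45-6789). Interval expressions such as {3} are part of POSIX extended regular expressions, but some older awk implementations do not support them.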

  • Print lines NOT matching a pattern:

    awk '!/debug/ { print }' log.txt
    

    Prints all lines from log.txt that do not contain the substring "debug".

  • Print lines based on field content:

    awk '$3 > 100 { print }' data.txt
    

    Prints lines where the third field’s value is greater than 100.

  • Print lines based on string comparison:

    awk '$1 == "apple" { print }' fruits.txt
    

    Prints lines where the first field is exactly "apple".

  • Print lines based on multiple conditions (AND):

    awk '$1 == "banana" && $2 < 50 { print }' data.txt
    

    Prints lines where the first field is "banana" AND the second field is less than 50.

  • Print lines based on multiple conditions (OR):

    awk '$1 == "orange" || $1 == "grape" { print }' fruits.txt
    

    Prints lines where the first field is "orange" OR "grape".

  • Print lines within a range of record numbers:

    awk 'NR >= 10 && NR <= 20 { print }' data.txt
    

    Prints lines from record number 10 to 20 (inclusive).

  • Print lines between two patterns:

    awk '/START_SECTION/, /END_SECTION/ { print }' config.txt
    

    Prints lines from the one that matches /START_SECTION/ up to and including the one that matches /END_SECTION/.
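
The regex patterns above are tested against the whole record. To restrict a regex to one field, awk provides the ~ match operator, and it combines with other conditions through && and ||. A small sketch, assuming hypothetical log lines whose first field is a level:

```shell
# Hypothetical log lines: level, then message
matches=$(printf 'INFO disk error cleared\nERROR disk full\nWARN error rate high\n' | awk '
  # field 1 must be exactly ERROR or WARN, and the record must mention "error" or "full"
  $1 ~ /^(ERROR|WARN)$/ && /error|full/ { print NR ": " $0 }
')
printf '%s\n' "$matches"
```

The first line is skipped even though it contains "error", because its level field does not match.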

Special Patterns: BEGIN and END

  • Execute before processing any lines:

    awk 'BEGIN { print "Starting processing..." } { print }' data.txt
    

    Prints "Starting processing..." before reading any input, then prints every line of data.txt.

  • Execute after processing all lines:

    awk '{ sum += $1 } END { print "Total sum:", sum }' numbers.txt
    

    Calculates the sum of the first field of all lines in numbers.txt and prints the total at the end.

  • Initialize variables in BEGIN:

    awk 'BEGIN { FS=":"; count=0 } { count++ } END { print "Total records:", count }' /etc/passwd
    

    Sets the field separator to ":" and initializes a counter to 0, then counts records and prints the total.
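
The three phases can be combined in one program: BEGIN for a header, a main rule for each record, and END for a summary. A minimal sketch with made-up numeric input:

```shell
report=$(printf '3\n4\n5\n' | awk '
  BEGIN { print "value" }               # once, before any input is read
  { sum += $1; print $1 }               # once per record
  END { print "total=" sum, "n=" NR }   # once, after the last record
')
printf '%s\n' "$report"
```

In the END rule, NR still holds the number of the last record read, so it doubles as a record count.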

Field and Record Separators

  • Set input field separator (e.g., comma):

    awk -F',' '{ print $1, $2 }' data.csv
    

    Splits each line of data.csv on commas and prints the first two fields. This simple split does not handle quoted fields that themselves contain commas.

  • Set input field separator to any of several characters:

    awk -F'[ :]' '{ print $1, $2 }' data.txt
    

    Uses either a space or a colon as the field separator.

  • Set input record separator (e.g., blank line):

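    awk -v RS='' '{ print $1 }' data.txt
    

    Treats paragraphs separated by blank lines as records (paragraph mode) and prints the first field of each. An empty RS is the portable form; a multi-character RS such as '\n\n' is an extension supported by gawk and mawk but not guaranteed by POSIX.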

  • Set output field separator:

    awk 'BEGIN { OFS=" | " } { print $1, $2, $3 }' data.txt
    

    Prints the first three fields separated by " | ".
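
Two separator behaviors are easy to miss: an empty RS switches awk to paragraph mode, and a new OFS only affects $0 once a field has been assigned. The sketch below uses made-up input to show both.

```shell
# Paragraph mode: RS="" makes each blank-line-separated block one record,
# and FS="\n" makes each line of the block one field.
first_lines=$(printf 'name: a\nsize: 1\n\nname: b\nsize: 2\n' |
  awk 'BEGIN { RS = ""; FS = "\n" } { print NR, $1 }')
printf '%s\n' "$first_lines"

# Assigning any field (even to itself) rebuilds the record using OFS,
# which converts comma-separated input to tab-separated output.
tsv=$(printf 'a,b,c\n' | awk 'BEGIN { FS = ","; OFS = "\t" } { $1 = $1; print }')
printf '%s\n' "$tsv"
```

Without the $1 = $1 assignment, a bare print would emit the original record with its commas intact.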

Variables and Arithmetic

  • Counting lines:

    awk 'END { print NR }' data.txt
    

    Prints the total number of records (lines) processed.

  • Summing values:

    awk '{ total += $2 } END { print total }' prices.txt
    

    Calculates the sum of the second column in prices.txt.

  • Calculating averages:

    awk '{ sum += $1; count++ } END { if (count > 0) print sum / count }' numbers.txt
    

    Calculates the average of the first column.

  • Counting occurrences of a pattern:

    awk '/warning/ { count++ } END { print count + 0 }' log.txt
    

    Counts how many lines contain the substring "warning". Adding 0 makes awk print 0 rather than an empty line when nothing matches.

  • Counting unique values in a field:

    awk '{ seen[$1]++ } END { for (key in seen) print key, seen[key] }' data.txt
    

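    Counts the occurrences of each distinct value in the first field. The iteration order of for (key in seen) is unspecified, so pipe the output through sort if order matters.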
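
Sums, counts, and arrays combine naturally into per-group statistics. A minimal sketch, assuming hypothetical input with a category and an amount per line:

```shell
# Hypothetical input: category and amount
totals=$(printf 'fruit 3\nveg 2\nfruit 4\n' | awk '
  { sum[$1] += $2; n[$1]++ }    # accumulate a total and a count per key
  END {
    for (k in sum)               # iteration order is unspecified
      printf "%s %d %.1f\n", k, sum[k], sum[k] / n[k]
  }
' | sort)                        # sort so the output order is stable
printf '%s\n' "$totals"
```

Each output line holds the category, its total, and its average.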

String Manipulation

  • Concatenating strings:

    awk '{ print $1 "-" $2 }' data.txt
    

    Prints the first and second fields joined by a hyphen.

  • Getting string length:

    awk '{ print length($1) }' data.txt
    

    Prints the length of the first field.

  • Substrings:

    awk '{ print substr($1, 1, 3) }' data.txt
    

    Prints the first 3 characters of the first field.

  • Finding string position:

    awk '/example/ { print index($0, "example") }' data.txt
    

    Prints the starting position of "example" within lines that contain it.

  • Replacing strings:

    awk '{ gsub(/old/, "new"); print }' data.txt
    

    Globally replaces all occurrences of "old" with "new" in each line and prints the modified line.

  • Replacing the first occurrence:

    awk '{ sub(/old/, "new"); print }' data.txt
    

    Replaces only the first occurrence of "old" with "new" in each line.
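
sub and gsub operate on $0 unless a third argument names another target, and both return the number of replacements made. The sketch below shows that together with split and toupper, two other standard awk built-ins not covered in the list above. The sample record is made up.

```shell
# Hypothetical input: name, then a comma-separated list of teams
result=$(printf 'alice:eng,ops\n' | awk -F: '{
  n = split($2, teams, ",")             # split field 2 into an array; n is the element count
  changed = gsub(/[aeiou]/, "*", $1)    # third argument limits gsub to field 1
  print $1, toupper(teams[1]), n, changed
}')
printf '%s\n' "$result"
```

The last two values printed are the number of teams and the number of vowels that gsub replaced.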

Control Flow

  • If-else statements:

    awk '{ if ($1 > 10) print $1, "High"; else print $1, "Low" }' numbers.txt
    

    Prints "High" or "Low" based on the value of the first field.

  • Loops (for):

    awk 'BEGIN { for (i=1; i<=5; i++) print i }'
    

    Prints numbers 1 through 5.

  • Loops (while):

    awk '{ i = 1; while (i <= NF) { print "Field " i ":", $i; i++ } }' data.txt
    

    Iterates through all fields (NF is Number of Fields) of each line.

  • Next statement (skip to next record):

    awk '/skip_this/ { next } { print }' data.txt
    

    Skips processing for lines containing "skip_this" and proceeds to the next record.

  • Exit statement:

    awk '/END_MARKER/ { print "Found marker, exiting."; exit } { print }' data.txt
    

    Prints lines until "END_MARKER" is found, then prints a message and exits.
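
next and exit are often used together with ordinary rules. One detail worth knowing is that exit in a main rule stops reading input but still runs the END rule. A small sketch with made-up input, where STOP is a hypothetical end marker:

```shell
summary=$(printf '# comment\n5\n12\nSTOP\n99\n' | awk '
  /^#/     { next }    # skip comment lines; later rules do not run for them
  /^STOP$/ { exit }    # stop reading input; the END rule still runs
  { label = ($1 > 10) ? "High" : "Low"; print $1, label; seen++ }
  END { print "processed", seen }
')
printf '%s\n' "$summary"
```

The value 99 after the marker is never read, so only two records are counted.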

Arrays

  • Associative arrays (key-value pairs):

    awk '{ counts[$1]++ } END { for (word in counts) print word, counts[word] }' words.txt
    

    Counts how often each value of the first field occurs in words.txt. This gives word frequencies when the file holds one word per line.

  • Using whole lines as array keys to de-duplicate matches:

    awk '/apple|banana/ { fruits[$0]++ } END { for (f in fruits) print f }' data.txt
    

    Prints each distinct line containing "apple" or "banana" once, in unspecified order.
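
Because for (key in array) visits keys in an unspecified order, a second array indexed by a counter is a common way to keep first-seen order. A minimal sketch with made-up input; the array names seen and order are arbitrary:

```shell
uniq_in_order=$(printf 'pear\napple\npear\nfig\napple\n' | awk '
  # (key in array) tests membership without creating the element
  !($0 in seen) { seen[$0] = 1; order[++n] = $0 }
  END { for (i = 1; i <= n; i++) print order[i] }
')
printf '%s\n' "$uniq_in_order"
```

The numeric loop in END replays the keys in the order they were first stored.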

Built-in Variables

  • NR: Total number of records (lines) read so far, across all input files.
  • FNR: Number of the current record within the current file (useful when processing multiple files).
  • NF: Number of fields in the current record.
  • FS: Input Field Separator (default is whitespace).
  • OFS: Output Field Separator (default is space).
  • RS: Input Record Separator (default is newline).
  • ORS: Output Record Separator (default is newline).
  • FILENAME: Name of the current input file.
  • ARGC: Argument count (number of command-line arguments).
  • ARGV: Argument vector (the actual command-line arguments).
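
The difference between NR and FNR only shows up with more than one input file. The sketch below creates two throwaway files in a temporary directory (the file names are arbitrary) and prints FILENAME, NR, FNR, and NF for each record.

```shell
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/one.txt"
printf 'c\n'    > "$tmp/two.txt"

# NR keeps counting across files; FNR restarts at 1 for each file
counters=$(cd "$tmp" && awk '{ print FILENAME, NR, FNR, NF }' one.txt two.txt)
printf '%s\n' "$counters"

rm -rf "$tmp"
```

For the single record of two.txt, NR is 3 while FNR is 1.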

Common Patterns

  • Extracting specific columns from a CSV file:

    awk -F',' '{ print $1, $3 }' users.csv
    
  • Finding lines with values greater than a threshold in a specific column:

    awk '$4 > 1000 { print $1, $4 }' sales_data.txt
    
  • Calculating the sum of a column:

    awk '{ sum += $2 } END { print "Total:", sum }' data.txt
    
  • Calculating the average of a column:

    awk '{ sum += $1; count++ } END { if (count > 0) print sum / count }' data.txt
    
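  • Printing the first line seen for each distinct value in a column:

    awk '!seen[$1]++' data.txt
    

    A pattern with no action prints the record. seen[$1]++ evaluates to 0 the first time a value appears, so the negated pattern is true only for that first occurrence. To print just the distinct values rather than whole lines, use awk '!seen[$1]++ { print $1 }' data.txt.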

  • Counting occurrences of specific values:

    awk '{ status[$3]++ } END { for (s in status) print s, status[s] }' log.txt
    
  • Reformatting data (e.g., space-separated to tab-separated):

    awk 'BEGIN { OFS="\t" } { print $1, $2, $3 }' data.txt
    
  • Processing multiple files and keeping track of line numbers:

    awk '{ print FILENAME ": Line " FNR ": " $0 }' file1.txt file2.txt
    
  • Filtering logs for specific IP addresses and counting them:

    awk '/192\.168\.1\.100/ { count++ } END { print "IP Count:", count + 0 }' access.log
    
  • Extracting data between two markers, excluding the marker lines themselves:

    awk '/START_DATA/, /END_DATA/ { if (!/START_DATA/ && !/END_DATA/) print }' config.txt
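
The de-duplication idiom !seen[$1]++ from the list above is terse enough to deserve a worked run. The sketch below uses made-up input (a user and an action per line) and shows both the whole-line form and a variant that prints only the column value.

```shell
# Hypothetical input: user and action
input='ann login
bob login
ann logout'

# Whole line, the first time each value of field 1 appears
first_per_user=$(printf '%s\n' "$input" | awk '!seen[$1]++')
# Only the distinct values of field 1
unique_users=$(printf '%s\n' "$input" | awk '!seen[$1]++ { print $1 }')

printf '%s\n' "$first_per_user" "$unique_users"
```

The third input line is dropped in both cases because ann was already seen on the first line.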
    

Gotchas

  • Whitespace as default FS: By default, awk treats any run of spaces, tabs, and newlines as a single delimiter and ignores leading and trailing whitespace. Setting FS=" " explicitly selects this same default behavior. A single-character separator such as FS="," or FS="\t", or a regex such as FS="[ ]" for one literal space, does produce empty fields when separators are adjacent. Note that FS="[ \t]+" is not equivalent to the default: a line with leading whitespace gets an empty first field.
  • Quoting: Always enclose your awk program in single quotes ('...') to prevent the shell from interpreting special characters within the program. If the program itself needs a single quote, close the quoted string, add an escaped quote, and reopen it: 'it'\''s' reaches awk as it's.
  • print vs printf: print automatically adds the ORS (newline by default), while printf requires you to explicitly add newlines (\n) if needed. printf offers more control over formatting.
  • Arithmetic Operations: awk converts a field to a number when it is used in an arithmetic context. Conversion uses the leading numeric prefix of the string, so "12px" becomes 12 and a value with no numeric prefix, such as "abc", becomes 0.
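  • Array Initialization: Associative arrays in awk are created automatically when you first access an element, and uninitialized variables act as 0 or the empty string. You don't need to declare anything, though initializing counters or sums in a BEGIN block can make intent clearer. Merely referencing an element, as in arr[k] == "", creates it; use (k in arr) to test membership without that side effect.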
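  • String vs. Numeric Comparisons: The comparison operators (==, !=, <, <=, >, >=) work on both strings and numbers. awk compares numerically when both operands are numbers or numeric-looking input (such as fields), and as strings otherwise, so $1 == 10 is a numeric test while $1 == "10" is a string test. To force a type, add 0 (numeric) or concatenate "" (string).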
  • Regular Expression Syntax: awk uses Extended Regular Expressions (ERE). Some characters might need escaping (e.g., . becomes \. to match a literal dot).
  • gawk Extensions: If you need features like case-insensitive matching (IGNORECASE=1), more advanced functions, or better array handling, consider using gawk (GNU awk) and explicitly invoking gawk instead of awk.
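
Several of these gotchas can be checked directly from the shell. The sketch below uses made-up one-line inputs; each variable captures the output of one experiment.

```shell
# 1. Comparison type depends on the operands, not on the operator.
cmp=$(printf '10 9\n' | awk '{
  numeric = ($1 > $2)    # both fields look numeric: 10 > 9 is true
  string  = ($1 > "9")   # a string constant forces string comparison: "10" sorts before "9"
  print numeric, string
}')

# 2. Text with no numeric prefix counts as 0; a leading numeric prefix is used.
nums=$(printf 'abc 12px\n' | awk '{ print $1 + 0, $2 + 0 }')

# 3. Default splitting collapses runs of spaces; a regex for one literal space does not.
nf=$(printf '  a   b\n' | awk '{ print NF }')
nf_single=$(printf '  a   b\n' | awk -F'[ ]' '{ print NF }')

# 4. A single quote inside a single-quoted program: close, escaped quote, reopen.
quoted=$(printf 'Ada\n' | awk '{ print "It'\''s " $1 }')

printf '%s\n' "$cmp" "$nums" "$nf" "$nf_single" "$quoted"
```

In experiment 3 the same line yields 2 fields with default splitting and 6 with the single-space regex, because every individual space then separates a (possibly empty) field.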