Regular expressions (regex) are powerful pattern-matching tools for complex searching and manipulation of text. They are available in most programming languages and are central to command-line text-processing tools such as sed and awk on Unix-like systems.

What are Regular Expressions?

A regular expression is a sequence of characters that defines a search pattern. Such a pattern can be used by string-searching algorithms to “find” or “find and replace” text.

Common Regular Expressions

The table below is based on RegExr’s cheatsheet.

| Type | Syntax | Description | Example pattern | Example | First match |
| --- | --- | --- | --- | --- | --- |
| Character Classes | `.` | Any character except newline | `a.` | `abc` | `ab` |
|  | `\w` `\d` `\s` | Word, digit, whitespace | `\s\d\w` | `November, 5th` | ` 5t` |
|  | `\W` `\D` `\S` | Not word, digit, whitespace | `\W\D\S` | `November, 5th` | `, 5` |
|  | `[xyz]` | Any of x, y or z | `[abc]` | `carrot` | `c` |
|  | `[^xyz]` | Not x, y or z | `[^abc]` | `carrot` | `r` |
|  | `[x-y]` | Character between x & y | `[a-c]` | `orange` | `a` |
| Anchors | `^a` `c$` | Start / end of the string | `^a.*e$` | `abcde` | `abcde` |
|  | `\b` `\B` | Word, not-word boundary | `\Ba.*\b` | `orange` | `ange` |
| Escaped Characters | `\.` `\*` `\\` | Escaped special characters | `\..*\*` | `*data.frame*` | `.frame*` |
|  | `\t` `\n` `\r` | Tab, newline, carriage return | `.\t.*\n.` | `apples\toranges\nbananas` | `s\toranges\nb` |
| Groups & Lookaround | `(abc)` | Capture group | `(ana).*` | `banana` | `anana` |
|  | `\1` | Backreference to group #1 | `s/([A-Z]{3}).([0-9]{2})/\1 20\2/` | `08/MAY/24` | `MAY 2024` |
|  | `(?:abc)` | Non-capturing group | `s/(?:ha)+ (ha+)/\1/` | `hahaha haa hah!` | `haa` |
|  | `(?=abc)` | Positive lookahead | `.{6}(?= hah)` | `hahaha haa hah!` | `ha haa` |
|  | `(?!abc)` | Negative lookahead | `.{6}(?! hah)` | `hahaha haa hah!` | `hahaha` |
| Quantifiers & Alternation | `a*` `a+` `a?` | 0 or more, 1 or more, 0 or 1 | `(ha)? (ha*) ([ha]+)` | `hahaha haa hah!` | `ha haa hah` |
|  | `a{3}` `a{2,}` | Exactly three, two or more | `(ha){3} ha{2,}` | `hahaha haa hah!` | `hahaha haa` |
|  | `a{1,3}` | Between 1 and 3 | ` ha{1,3}` | `hahaha haa hah!` | ` haa` |
|  | `a+?` `a{2,}?` | Match as few as possible | `[car]{3,}?` | `carrots` | `car` |
|  | `ab\|cd` | Match ab or cd | `ana\|ange` | `oranges bananas` | `ange` |
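
Most of these patterns can be tried directly from the terminal. Below is a minimal sketch, assuming GNU grep built with PCRE support (the -P flag), since shortcuts like \d and lookaheads are not part of basic grep:

# Print only the matched text (-o) using Perl-compatible regex (-P);
# assumes GNU grep compiled with PCRE support.
echo 'November, 5th' | grep -oP '\s\d\w'          # prints " 5t"
echo 'hahaha haa hah!' | grep -oP '.{6}(?= hah)'  # prints "ha haa"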

Using regex with bash

In Bash, several commands harness the power of regular expressions to manipulate text. Star players in this line-up include sed, awk, and grep, each offering its own functionality to elevate your command-line proficiency.

Files in a Directory

Imagine having a folder with hundreds of files, and you’re on a mission to dig out all CSV files whose name contains the word “orange”. Simply do:

cd path/to/folder
ls -l | grep "orange.*\.csv"

The given code locates all files within the specified directory (path/to/folder) whose names contain the word “orange” and have the .csv extension.

Here’s what each line does:

  • cd path/to/folder: Changes the current working directory to path/to/folder.

  • ls -l | grep "orange.*\.csv": The ls -l lists all files in the current directory in long format, providing details like permissions, number of links, owner, group, size, and time of last modification for each file. This list is then piped (|) to the grep command.

The grep command searches its input for lines containing a match to the specified pattern. The pattern defined here, “orange.*\.csv”, uses regex where:

  • orange matches lines containing “orange”.

  • .* matches any character (.) zero or more times (*).

  • \. is a literal dot and not a special character.

  • csv matches the literal characters “csv”.

Combining these, the grep command filters the output of ls -l for filenames that contain “orange” followed by “.csv”.
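
For instance, given a hypothetical folder containing orange_sales.csv, orange_notes.txt, and apple_sales.csv, only the first file would survive the filter:

# hypothetical directory contents: orange_sales.csv, orange_notes.txt, apple_sales.csv
ls -l | grep "orange.*\.csv"
# -rw-r--r-- 1 user user 2048 May  8 10:15 orange_sales.csv

# to require .csv at the very end of the name, anchor the pattern with $
ls -l | grep "orange.*\.csv$"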

PDF Text

Suppose you have a directory filled with literature review papers, and you recall reading about Global Value Chains (GVCs) but can’t pinpoint the exact document. You can utilize pdfgrep in the following way:

pdfgrep -i "(Global Value Chain|GVC)s{0,1}" path/to/file.pdf

This command uses pdfgrep, a tool used to search text in PDF files:

  • -i flag indicates the search is case-insensitive.

  • path/to/file.pdf is the PDF file in which to search for the keyword.

  • (Global Value Chain|GVC)s{0,1} is the regex and it operates as follows:

    • (Global Value Chain|GVC): This part of the pattern uses the | operator which acts like a logical OR. Matches the pattern before the | (Global Value Chain) or the pattern after the | (GVC).
    • s{0,1}: This part of the pattern looks for an ‘s’ character. The {0,1} quantifier controls how many times ‘s’ may appear: zero (no ‘s’ at all) or at most one. It is equivalent to the shorthand s?.

So the whole regular expression is saying to look for the strings ‘Global Value Chain’ or ‘GVC’ and it’s okay if there’s an ’s’ character after those, making it plural, or if there’s no ’s’ character after them.

The pdfgrep command will then search through the specified PDF file, and output any lines containing the regex, regardless of case.
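
If you are not even sure which file mentions GVCs, the same pattern can sweep a whole folder. This is a sketch assuming a pdfgrep version with the recursive -r flag and that the papers live under path/to/folder:

# search every PDF under the folder, case-insensitively;
# s? is a shorter way of writing s{0,1}
pdfgrep -ri "(Global Value Chain|GVC)s?" path/to/folder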

Search and Replace

Line by Line: The sed Command

The sed command, short for “stream editor,” processes text line by line and modifies it based on particular commands. It employs regular expressions to match patterns within a file.

An example of sed with regex is replacing all occurrences of the word “apple” with “orange” in a file:

sed 's/apple/orange/g' filename

Here, s stands for “substitute,” the g at the end enables global replacement for all matches, and the words “apple” and “orange” represent the search and replacement terms, respectively.
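
You can see the effect without touching a file by piping a throwaway string through sed:

# quick test on a string instead of a file
echo "apple pie with apple sauce" | sed 's/apple/orange/g'
# output: orange pie with orange sauce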

Subsetting data with awk

awk is a powerful text-processing command-line tool. It traverses the file line by line, splits each line into fields, and checks for pattern matching in each field.

Consider the following text files:

  • list.txt
John
Jane
Sara

containing a list of names, and

  • data.txt
1 John Doe
2 Jane Smith
3 Bob Builder
4 Sara Parker
5 Tim Cook

a dataset with 3 fields: id, first name and last name.

Now suppose that you want to create a new dataset named shortlist.txt that contains only the observations in data.txt whose first name matches an entry in list.txt. This can be achieved with the following command:

awk 'NR==FNR{a[$1]; next} $2 in a' list.txt data.txt

Here:

  • 'NR==FNR{a[$1]; next} $2 in a' is the awk program, which contains the following condition-action pairs:

  • NR==FNR{a[$1]; next}: This reads entries from list.txt (the first file to be read). NR and FNR are built-in awk variables: NR is the total number of records processed so far, and FNR is the number of records processed in the current input file. Thus, NR==FNR is only true while the first file is being read. {a[$1]; next} creates an array a in which each line of list.txt becomes a key, then skips to the next record without executing the rest of the program.

  • $2 in a: This condition is checked for each row from data.txt (second file to be read). It checks if the second field $2 (which corresponds to the first name in this case) exists in the array a that we created earlier.

  • list.txt data.txt: These are the input files. awk reads multiple input files one after the other as a single stream of records, which is why NR==FNR can be used to perform actions only on the first file.
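
Putting it together, redirect the output to create the new file. Given the list.txt and data.txt above, only the rows for John, Jane, and Sara are kept:

# keep only the rows of data.txt whose first name appears in list.txt
awk 'NR==FNR{a[$1]; next} $2 in a' list.txt data.txt > shortlist.txt
cat shortlist.txt
# 1 John Doe
# 2 Jane Smith
# 4 Sara Parker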

Wrapping Up

Regular expressions can appear daunting, but they are an invaluable tool in text processing. The power of sed and awk combined with regex allows Linux users to manipulate textual data with great efficiency and flexibility. To boost your programming toolbox, consider adopting these skills and mastering the art of pattern matching.

Where Can I Dive Deeper into Regex Learning?

If you want to practice using regex, I recommend online tools like RegExr, regex101, or RegexPlanet. These are excellent options for learning regex as well as for testing patterns against your own text/code.