Regular expressions (regex) are powerful string-matching tools that use pattern-matching techniques for complex searching and manipulation. They appear in many programming languages, and they also power command-line text-processing tools such as `sed` and `awk` on Unix-like systems.
What are Regular Expressions?
Regular expressions are sequences of characters that represent a search pattern. This pattern can be used within string-searching algorithms to “find” or “find and replace” textual data.
Common Regular Expressions
The table below is based on RegExr’s cheatsheet.
| Type | Syntax | Description | Example pattern | Example text | First Match |
|---|---|---|---|---|---|
| Character Classes | `.` | Any character except newline | `a.` | `abc` | `ab` |
| | `\w` `\d` `\s` | Word, digit, whitespace | `\s\d\w` | `November, 5th` | ` 5t` |
| | `\W` `\D` `\S` | Not word, digit, whitespace | `\W\D\S` | `November, 5th` | `, 5` |
| | `[xyz]` | Any of x, y or z | `[abc]` | `carrot` | `c` |
| | `[^xyz]` | Not x, y or z | `[^abc]` | `carrot` | `r` |
| | `[x-y]` | Character between x & y | `[a-c]` | `orange` | `a` |
| Anchors | `^a` `c$` | Start / end of the string | `^a.+e$` | `abcde` | `abcde` |
| | `\b` `\B` | Word, not-word boundary | `\Ba.*\b` | `orange` | `ange` |
| Escaped Characters | `\.` `\*` `\\` | Escaped special characters | `\..*\*` | `*data.frame*` | `.frame*` |
| | `\t` `\n` `\r` | Tab, newline, carriage return | `.\t.*\n.` | `apples\toranges\nbananas` | `s\toranges\nb` |
| Groups & Lookaround | `(abc)` | Capture group | `(ana).*` | `banana` | `anana` |
| | `\1` | Backreference to group #1 | `s/([A-Z]{3}).([0-9]{2})/\1 20\2/` | `08/MAY/24` | `MAY 2024` |
| | `(?:abc)` | Non-capturing group | `s/(?:ha)+ (ha+)/\1/` | `hahaha haa hah!` | `haa` |
| | `(?=abc)` | Positive lookahead | `.{6}(?= hah)` | `hahaha haa hah!` | `ha haa` |
| | `(?!abc)` | Negative lookahead | `.{6}(?! hah)` | `hahaha haa hah!` | `hahaha` |
| Quantifiers & Alternation | `a*` `a+` `a?` | 0 or more, 1 or more, 0 or 1 | `(ha)? (ha*) ([ha]+)` | `hahaha haa hah!` | `ha haa hah` |
| | `a{3}` `a{2,}` | Exactly three, two or more | `(ha){3} ha{2,}` | `hahaha haa hah!` | `hahaha haa` |
| | `a{1,3}` | Between 1 and 3 | ` ha{1,3}` | `hahaha haa hah!` | ` haa` |
| | `a+?` `a{2,}?` | Match as few as possible | `[car]{3,}?` | `carrots` | `car` |
| | `ab\|cd` | Match ab or cd | `ana\|ange` | `oranges bananas` | `ange` |
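To experiment with any of the patterns above, you can pipe a test string into `grep`. The lines below are a minimal sketch assuming GNU grep, whose `-P` flag enables the Perl-style classes and lookarounds used in the table (such as `\d`, `\w`, and `(?=...)`) and whose `-o` flag prints only the matched text:

echo "November, 5th" | grep -oP '\s\d\w'

echo "hahaha haa hah!" | grep -oP '.{6}(?= hah)'

The first command prints ` 5t` and the second prints `ha haa`, matching the table's examples.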
Using regex with bash
Within the universe of Bash, several commands harness the power of regular expressions to manipulate text. Star players in this line-up include `sed`, `awk`, and `grep`, each offering unique functionality to elevate your command-line proficiency.
Search
Files in a Directory
Imagine having a folder with hundreds of files, and you’re on a mission to dig out all CSV files whose name contains the word “orange”. Simply do:
cd path/to/folder
ls -l | grep "orange.*\.csv$"
This code locates all files within the specified directory (path/to/folder) whose name contains the word “orange” and ends with the .csv extension.
Here’s what each line does:
- `cd path/to/folder`: Changes the current working directory to `path/to/folder`.
- `ls -l | grep "orange.*\.csv$"`: The `ls -l` command lists all files in the current directory in long format, providing details like permissions, number of links, owner, group, size, and time of last modification for each file. This list is then piped (`|`) to the `grep` command.
The `grep` command searches the input it receives for lines containing a match to the specified pattern. The pattern defined here, `orange.*\.csv$`, uses regex where:

- `orange` matches lines containing “orange”.
- `.*` matches any character (`.`) any number of times (`*`).
- `\.` matches a literal dot rather than the special “any character”.
- `csv$` matches “csv” at the end of the line (the `$` anchors the match to the end).

Combining these, the `grep` command filters the output of `ls -l`, keeping only filenames that contain “orange” and end with .csv.
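If the file names might use different capitalization, or you only need a count of how many files match, grep's standard `-i` (ignore case) and `-c` (count) flags can be combined with the same pattern. A quick sketch:

ls -l | grep -i "orange.*\.csv$"

ls -l | grep -ic "orange.*\.csv$"

The first command also matches upper- or mixed-case names; the second prints only the number of matching lines.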
PDF Text
Suppose you have a directory filled with literature review papers, and you recall reading about Global Value Chains (GVCs) but can’t pinpoint the exact document. You can utilize `pdfgrep` in the following way:
pdfgrep -i "(Global Value Chain|GVC)s{0,1}" path/to/file.pdf
This command uses `pdfgrep`, a tool used to search text in PDF files:

- `-i`: This flag makes the search case-insensitive.
- `path/to/file.pdf`: The PDF file in which to search for the keyword.
- `(Global Value Chain|GVC)s{0,1}`: The regex, which operates as follows:
  - `(Global Value Chain|GVC)`: This part of the pattern uses the `|` operator, which acts like a logical OR. It matches the pattern before the `|` (Global Value Chain) or the pattern after it (GVC).
  - `s{0,1}`: This part looks for an ’s’ character. The `{0,1}` decides how many instances of ’s’ to match: `0` means it is fine if the ’s’ is missing, and `1` means it matches at most one ’s’.
So the whole regular expression looks for the strings ‘Global Value Chain’ or ‘GVC’, and it is okay whether or not an ’s’ follows them, making them plural.
The `pdfgrep` command will then search through the specified PDF file and output any lines that match the regex, regardless of case.
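If you are unsure which paper mentions the term, you can point `pdfgrep` at the whole folder rather than a single file. A sketch, assuming a reasonably recent `pdfgrep` that supports recursive search (`-r`) and printing page numbers (`-n`):

pdfgrep -rin "(Global Value Chain|GVC)s{0,1}" path/to/folder

This searches every PDF under path/to/folder and prefixes each match with the page on which it appears.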
Search and Replace
Line by Line: The `sed` Command
The `sed` command, short for “stream editor,” processes text line by line and modifies it based on particular commands. It employs regular expressions to match patterns within a file.
An example of `sed` with regex is replacing all occurrences of the word “apple” with “orange” in a file:
sed 's/apple/orange/g' filename
Here, `s` stands for “substitute,” the `g` at the end enables global replacement (every match on each line rather than just the first), and the words “apple” and “orange” represent the search and replacement terms, respectively.
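Building on the backreference row of the regex table, `sed` can also rearrange captured groups. The line below is a minimal sketch, assuming GNU sed (`-E` enables extended regex so the grouping parentheses need no escaping) and a hypothetical file dates.txt containing dates such as 08/MAY/24:

sed -E 's|([0-9]{2})/([A-Z]{3})/([0-9]{2})|20\3-\2-\1|g' dates.txt

Each matching date is rewritten in the form 2024-MAY-08; as before, sed prints the result to standard output and leaves the file untouched, while GNU sed’s `-i` flag would edit dates.txt in place.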
Subsetting data with awk
`awk` is a powerful text-processing command-line tool. It traverses a file line by line, splits each line into fields, and checks each field for pattern matches.
Consider the following text files:
- `list.txt`, containing a list of names:

John
Jane
Sara

- `data.txt`, a dataset with 3 fields (`id`, `first name`, and `last name`):

1 John Doe
2 Jane Smith
3 Bob Builder
4 Sara Parker
5 Tim Cook
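To see awk’s field splitting in action before filtering anything, you could print just the second field (the first name) of each row, as in this small sketch:

awk '{print $2}' data.txt

This should print John, Jane, Bob, Sara, and Tim, one name per line.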
Now suppose that you want to create a new dataset named `shortlist.txt` that contains only the observations in `data.txt` whose first name appears in `list.txt`. This can be achieved with the following command (writing the result to the new file is shown in the sketch after the breakdown below):
awk 'NR==FNR{a[$1]; next} $2 in a' list.txt data.txt
Here:
- `'NR==FNR{a[$1]; next} $2 in a'`: The awk program, which contains two condition-action pairs:
  - `NR==FNR{a[$1]; next}`: This reads the entries from `list.txt` (the first file to be read). `NR` and `FNR` are built-in `awk` variables: `NR` is the total number of records processed so far, and `FNR` is the number of records processed in the current input file, so `NR==FNR` is only true for the first file. `{a[$1]; next}` creates an array `a` in which each line of `list.txt` becomes a key, then jumps to the next record without executing the rest of the program.
  - `$2 in a`: This condition is checked for each row of `data.txt` (the second file to be read). It checks whether the second field `$2` (which corresponds to the first name in this case) exists as a key in the array `a` created earlier; rows for which it does are printed, which is awk’s default action.
- `list.txt data.txt`: These are the input files. awk treats multiple input files as a continuation of the first one, which is why we can use `NR==FNR` to perform actions only on the first file.
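Since the command above only prints the matching rows to the terminal, redirect its output to actually create the new file:

awk 'NR==FNR{a[$1]; next} $2 in a' list.txt data.txt > shortlist.txt

The resulting shortlist.txt should contain the John, Jane, and Sara rows from data.txt.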
Wrapping Up
Regular expressions can appear daunting, but they are an invaluable tool in text processing. The power of `grep`, `sed`, and `awk` combined with regex allows Linux users to manipulate textual data with great efficiency and flexibility. To boost your programming toolbox, consider adopting these skills and mastering the art of pattern matching.
Where Can I Dive Deeper into Regex Learning?
If you want to practice the use of regex, I recommend online tools like RegExr, regex101, or RegexPlanet. These are excellent options for understanding regex, as well as for testing patterns against your own text or code.