Regular expressions

Regular expressions (regex) are a domain-specific language for finding patterns and are one of the key functionalities in scripting languages such as Python, as well as the UNIX utilities sed, awk, and grep. We’ll just cover the basic use of regular expressions in bash, but once you know that, it would be easy to use them elsewhere (Python, R, etc.). At the level we’ll consider them, the syntax is quite similar.

Warning

POSIX.2 regular expressions come in two flavors: extended regular expressions and basic (or obsolete) regular expressions. The extended syntax has metacharacters () and {}, while the basic syntax requires the metacharacters to be designated \(\) and \{\}. In addition to the POSIX standard, Perl regular expressions are also widely used. While we won’t go into detail, we will see some examples of each syntax. In the examples that follow we’ll generally use the extended syntax by using the -E flag to grep.

1 Overview and core syntax

The basic idea of regular expressions is that they allow us to find matches of strings or patterns in strings, as well as do substitution. Regular expressions are good for tasks such as:

extracting pieces of text - for example finding all the phone numbers in a document;
creating variables from information found in text;
cleaning and transforming text into a uniform format;
mining text by treating documents as data; and
scraping the web for data.

Regular expressions are constructed from three things:

Literal characters are matched only by the characters themselves,
Character classes are matched by any single member in the class, and
Modifiers operate on either of the above or combinations of them.

Note that the syntax is very concise, so it’s helpful to break down individual regular expressions into their component parts to understand them. Since regex are their own language, it’s a good idea to build up a regex in pieces as a way of avoiding errors, just as we would with any computer code. You’ll also want to test your regex on examples, for which an online regex testing tool is helpful.

It is also helpful to search for common regex online before trying to craft your own. For instance, if you wanted to use a regex that matches valid email addresses, you would need to match anything that complies with the RFC 822 grammar. If you look over that document, you will quickly realize that implementing a correct regular expression to validate email addresses is extremely complex. So if you are writing a website that validates email addresses, it is best to look for a bug-vetted implementation rather than creating your own.

The special characters (meta-characters) used for defining regular expressions are:

* . ^ $ + ? ( ) [ ] { } | \

To use these characters literally as characters, we have to ‘escape’ them. In bash, you escape these characters by placing a single backslash before the character you want to escape. In R, we have to use two backslashes instead of a single backslash because R uses a single backslash to symbolize certain control characters, such as \n for newline.
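As a small illustration of escaping, the sketch below counts matching lines with and without a backslash before the period. The two sample lines are made up for this example, and the single quotes keep the shell from touching the backslash before grep sees it.

```shell
# Hypothetical sample input: only the first line contains a literal period.
sample=$(printf '%s\n' 'version 3.14' 'version 3x14')

# Unescaped, the period matches any single character, so both lines match.
unescaped=$(printf '%s\n' "$sample" | grep -c -E '3.14')

# Escaped, the period matches only a literal period, so one line matches.
escaped=$(printf '%s\n' "$sample" | grep -c -E '3\.14')

echo "unescaped: $unescaped, escaped: $escaped"
```

The -c flag makes grep report the number of matching lines rather than the lines themselves.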

To learn more about regular expressions, you can type:

$ man 7 regex

2 Character sets and character classes

We can use character sets to match any of the characters in a set.

Operators	Description
`[abc]`	Match any single character from the listed characters
`[a-z]`	Match any single character from the range of characters
`[^abc]`	Match any single character not among listed characters
`[^a-z]`	Match any single character not among listed range of characters
`.`	Match any single character except a newline
`\`	Turn off (escape) the special meaning of a metacharacter

If we want to search for any one of a set of characters, we use a character set, such as [13579] or [abcd] or [0-9] (where the dash indicates a sequence) or [0-9a-z]. To indicate any character not in a set, we place a ^ just inside the first bracket: [^abcd].

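Here’s an example of using regex with grep to find all lines in test.txt that contain at least one numeric digit. The pattern is quoted so that the shell passes it to grep unchanged rather than treating the brackets as a filename glob.

$ grep -E '[0-9]' test.txt

or with the -o flag to find and return only the actual digits

$ grep -E -o '[0-9]' test.txt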

There are a bunch of named character classes so that we don’t have to write out common sets of characters. The syntax is [:CLASS:] where CLASS is one of the following values:

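"alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
"lower", "print", "punct", "space", "upper", "word" or "xdigit".

Of these, "ascii" and "word" are extensions found in some regex engines (such as Perl-style ones); the POSIX syntax used by grep -E defines only the other twelve.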

So to find any line that contains a punctuation symbol:

$ grep -E '[[:punct:]]' test.txt

Note that to make a character set with a character class you need two square brackets, e.g., with the digit class: [[:digit:]]. Or we can make a combined character set such as [[:alnum:]_] (to find any alphabetic or numeric characters or an underscore). Or here, any line with a digit, a period, or a comma.

$ grep -E '[[:digit:].,]' test.txt

Interestingly, we don’t need to escape the period inside the character set, despite it being a meta-character (the comma is not a meta-character, so it never needs escaping).
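Negation can be combined with a named class in the same way. The following sketch, run on a few made-up sample lines, prints only the lines that contain some character other than a letter, digit, or underscore.

```shell
# Hypothetical sample lines.
sample=$(printf '%s\n' 'abc' 'a_b' 'x-y' '42')

# Keep lines with at least one character that is NOT alphanumeric or underscore.
nonword=$(printf '%s\n' "$sample" | grep -E '[^[:alnum:]_]')

echo "$nonword"
```

Only the line containing the hyphen is printed, since every character on the other lines belongs to the set being negated.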

3 Location-specific matches

We can use position anchors to make location-specific matches.

Operators	Description
`^`	Match the beginning of a line.
`$`	Match the end of a line.

To find a pattern at the beginning of the string, we use ^ (note this was also used for negation, but in that case occurs only inside square brackets) and to find it at the end we use $.

Here we’ll search for lines that start with a digit and for lines that end with a digit.

$ grep -E '^[0-9]' test.txt
$ grep -E '[0-9]$' test.txt
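The two anchors can also be used together to constrain the whole line. The sketch below uses three made-up lines to compare the effect of each anchor alone and of both at once.

```shell
# Hypothetical sample lines.
sample=$(printf '%s\n' '7 apples' 'room 7' '7')

# Count lines that start with a digit ('7 apples' and '7').
starts=$(printf '%s\n' "$sample" | grep -c -E '^[0-9]')

# Count lines that end with a digit ('room 7' and '7').
ends=$(printf '%s\n' "$sample" | grep -c -E '[0-9]$')

# Lines that consist of exactly one digit and nothing else.
exact=$(printf '%s\n' "$sample" | grep -E '^[0-9]$')

echo "starts: $starts, ends: $ends, exact: $exact"
```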

4 Repetitions, Grouping, and References

Now suppose we want to detect phone numbers, email addresses, and so on. To do this we often need to handle repetitions of characters or character sets.

Modifiers

Operators	Description
`*`	Match zero or more instances of the preceding character or regex.
`?`	Match zero or one instance of the preceding character or regex.
`+`	Match one or more instances of the preceding character or regex.
`{n,m}`	Match a range of occurrences (at least n, no more than m) of the preceding character or regex.
`|`	Match the character or expression to the left or right of the vertical bar.

Here are some examples of repetitions:

[[:digit:]]* : any number of digits (zero or more)
[[:digit:]]+ : at least one digit
[[:digit:]]? : zero or one digits
[[:digit:]]{1,3} : at least one and no more than three digits
[[:digit:]]{2,} : two or more digits
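To see these quantifiers in action, the sketch below applies three of them to one made-up string and uses grep -o to print each match. The paste command is only there to join the matches onto a single line for display.

```shell
# Hypothetical sample string with digit runs of length 1 to 4.
sample='a1 b22 c333 d4444'

# One or more digits: every run of digits is matched in full.
one_or_more=$(printf '%s\n' "$sample" | grep -o -E '[[:digit:]]+' | paste -s -d ' ' -)

# Two or more digits: the single digit after 'a' is skipped.
two_or_more=$(printf '%s\n' "$sample" | grep -o -E '[[:digit:]]{2,}' | paste -s -d ' ' -)

# Exactly three digits: this also matches the first three digits of a longer run.
exactly_three=$(printf '%s\n' "$sample" | grep -o -E '[[:digit:]]{3}' | paste -s -d ' ' -)

echo "+     : $one_or_more"
echo "{2,}  : $two_or_more"
echo "{3}   : $exactly_three"
```

Note that {3} on its own does not require the run to be exactly three digits long; it only requires three consecutive digits somewhere, which is why part of 4444 is matched.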

Another example is that \[.*\] is the pattern of a pair of square brackets with any number of characters (.*) inside:

$ grep -E "\[.*\]" test.txt

Note that the quotation marks ensure that the backslashes are passed to grep rather than interpreted by the shell, while the backslashes are needed so that [ and ] are treated as literal characters, since they are meta-characters in the regex syntax.

As shown above, we can use | to mean “or”. For example, to find each occurrence of either “http” or “ftp”:

$ grep -E -o "(http|ftp)" test.txt

Parentheses are used with a pipe (|) to delimit the alternatives when working with multi-character sequences, such as (http|ftp). Also, here we need the quotes; otherwise the shell tries to interpret ( and | as shell syntax rather than passing them to grep as part of the regular expression.
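The parentheses matter as soon as the alternation is followed by more of the pattern. The sketch below, using a made-up line with two URLs, compares a grouped and an ungrouped alternation.

```shell
# Hypothetical sample line.
sample='http://example.org ftp://example.org'

# Grouped: either "http" or "ftp", in both cases followed by "://".
grouped=$(printf '%s\n' "$sample" | grep -o -E '(http|ftp)://' | paste -s -d ' ' -)

# Ungrouped: either "http" on its own, or "ftp://".
ungrouped=$(printf '%s\n' "$sample" | grep -o -E 'http|ftp://' | paste -s -d ' ' -)

echo "grouped:   $grouped"
echo "ungrouped: $ungrouped"
```

Without the group, the | splits the entire pattern in two, so the "://" belongs only to the second alternative.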

Next let’s see the use of repetition to look for more complicated multi-character patterns. For example, if you wanted to match phone numbers whether they start with 1- or not you could use the following:

(1-)?[[:digit:]]{3}-[[:digit:]]{3}-[[:digit:]]{4}

The first part of the pattern, (1-)?, matches 0 or 1 occurrences of 1-. Then the pattern [[:digit:]]{3} matches any 3 digits. Similarly, the pattern [[:digit:]]{4} matches any 4 digits. So the whole pattern matches an optional 1-, followed by three digits, a -, three more digits, another -, and finally four digits.

Now let’s consider a file named file2.txt with the following content:

    Here is my number: 919-543-3300.
    hi John, good to meet you
    They bought 731 bananas
    Please call 1.919.554.3800
    I think he said it was 337.4355

Let’s use a regular expression pattern to print all lines containing phone numbers:

$ grep '(1-)?[[:digit:]]{3}-[[:digit:]]{4}' file2.txt

You will notice that this doesn’t match any lines. The reason is that the group syntax (1-) and the {} notation are not part of the basic syntax, which grep uses by default. To have grep use the extended syntax, you can either use the -E option (as we’ve been doing above):

$ grep -E '(1-)?[[:digit:]]{3}-[[:digit:]]{4}' file2.txt
Here is my number: 919-543-3300.

or use the egrep command (now deprecated in GNU grep in favor of grep -E):

$ egrep '(1-)?[[:digit:]]{3}-[[:digit:]]{4}' file2.txt
Here is my number: 919-543-3300.
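For comparison, the same search can be written in the basic syntax by backslash-escaping the grouping and interval metacharacters. The sketch below recreates the contents of file2.txt in a shell variable (an assumption made so the example is self-contained) and uses \{0,1\} in place of ?, since ? is not a metacharacter in the POSIX basic syntax.

```shell
# Contents of file2.txt, held in a variable for this sketch.
file2_text=$(printf '%s\n' \
  'Here is my number: 919-543-3300.' \
  'hi John, good to meet you' \
  'They bought 731 bananas' \
  'Please call 1.919.554.3800' \
  'I think he said it was 337.4355')

# Basic syntax (no -E): groups and intervals are written \( \) and \{ \}.
bre=$(printf '%s\n' "$file2_text" | grep '\(1-\)\{0,1\}[[:digit:]]\{3\}-[[:digit:]]\{4\}')

# Extended syntax (-E): the same metacharacters are written without backslashes.
ere=$(printf '%s\n' "$file2_text" | grep -E '(1-)?[[:digit:]]{3}-[[:digit:]]{4}')

echo "basic:    $bre"
echo "extended: $ere"
```

Both commands select the same single line; the extended form is simply easier to read.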

If we want to match regardless of whether the phone number is separated by a minus - or a period ., we could use the pattern [-.]:

$ egrep '(1[-.])?[[:digit:]]{3}[-.][[:digit:]]{4}' file2.txt
Here is my number: 919-543-3300.
Please call 1.919.554.3800
I think he said it was 337.4355
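The pattern above needs only three digits, a separator, and four digits, which is why the seven-digit number on the last line matches. As a sketch of a stricter variant, the ten-digit pattern introduced earlier can be given the same [-.] separators and combined with -o to extract just the numbers. The file contents are recreated in a variable so the example is self-contained.

```shell
# Contents of file2.txt, held in a variable for this sketch.
file2_text=$(printf '%s\n' \
  'Here is my number: 919-543-3300.' \
  'hi John, good to meet you' \
  'They bought 731 bananas' \
  'Please call 1.919.554.3800' \
  'I think he said it was 337.4355')

# Optional leading 1, then groups of 3, 3, and 4 digits separated by - or .
pattern='(1[-.])?[[:digit:]]{3}[-.][[:digit:]]{3}[-.][[:digit:]]{4}'

# -o prints only the matched text, one match per line.
numbers=$(printf '%s\n' "$file2_text" | grep -o -E "$pattern")

echo "$numbers"
```

The seven-digit number 337.4355 is no longer reported, because the stricter pattern requires all three groups of digits.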

Exercise

Explain what the following regular expression matches:

$ grep '^[^T]*is.*$' file1.txt

5 Greedy matching

Regular expression pattern matching is greedy: by default, the longest matching string is chosen.

Suppose we have the following file:

$ cat file1.txt
Do an internship <b> in place </b> of <b> one </b> course.

If we want to match the HTML tags (e.g., <b> and </b>), we might be tempted to use the pattern <.*>. Using the -o option to grep, we can have grep print out just the part of the text that the pattern matches:

$ grep -o "<.*>" file1.txt
<b> in place </b> of <b> one </b>

To get a non-greedy match, you can use the modifier ? after the quantifier. However, this requires that we use the Perl syntax. In order for grep to use the Perl syntax, we need to use the -P option (available in GNU grep):

$ grep -P -o "<.*?>" file1.txt
<b>
</b>
<b>
</b>

However, one can often avoid greedy matching by being more clever.
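Greedy matching has the same effect in substitutions, where it can silently remove more text than intended. As a sketch, the following uses sed (not otherwise used in this section) to delete whatever the greedy pattern matches in the example line.

```shell
# The example line from file1.txt.
line='Do an internship <b> in place </b> of <b> one </b> course.'

# Delete the text matched by <.*>; the match runs from the first < to the last >.
stripped=$(printf '%s\n' "$line" | sed 's/<.*>//')

echo "$stripped"
```

Rather than removing only the first tag, the substitution removes everything between the first < and the last >, leaving just the text before and after.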

Challenge: How could we change our regexp to avoid the greedy matching without using the ? modifier? Hint: Is there some character set that we don’t want to be inside the angle brackets?

Tip: globs vs. regex

Be sure you understand the difference between filename globbing and regular expressions. Filename globbing only works for filenames, while regular expressions are used to match patterns in text more generally. Although they share some of the same symbols, those symbols mean different things (e.g., * matches 0 or more characters when globbing but matches 0 or more repetitions of the character that precedes it when used in a regular expression).
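The difference can be checked directly by applying the same pattern text, ab*, once as a glob and once as a regular expression. The helper functions glob_matches and regex_matches below are hypothetical names introduced only for this sketch.

```shell
# Succeeds if the string $1 matches the glob pattern $2.
glob_matches() {
  case "$1" in
    $2) return 0 ;;
    *)  return 1 ;;
  esac
}

# Succeeds if the whole string $1 matches the extended regex $2.
regex_matches() {
  printf '%s\n' "$1" | grep -q -E "^($2)\$"
}

for s in abcd abbb; do
  if glob_matches "$s" 'ab*'; then g=yes; else g=no; fi
  if regex_matches "$s" 'ab*'; then r=yes; else r=no; fi
  echo "$s glob=$g regex=$r"
done
```

As a glob, ab* accepts any string starting with ab, so both strings match. As a regex anchored to the whole string, it accepts only an a followed by zero or more b characters, so abcd is rejected.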