
Regular Expressions

A regular expression is a description of a set of strings. This description can be used to search through a file by looking for text that matches the regular expression. Regular expressions are analogous to shell wildcards (see section 6.6), but they are both more complicated and more powerful.

A regular expression is made up of text and metacharacters. A metacharacter is just a character with a special meaning. Metacharacters include the following: . * [ ] - \ ^ $

If a regular expression contains only text (no metacharacters), it matches that text. For example, the regular expression ``my regular expression'' matches the text ``my regular expression,'' and nothing else. Regular expressions are usually case sensitive.

You can use the egrep command to display all lines in a file that contain a regular expression. Its syntax is as follows:

egrep 'regexp' filename1 ...
The single quotation marks are not always needed, but they never hurt.

For example, to find all lines in the GPL that contain the word GNU, you type

egrep 'GNU' /usr/doc/copyright/GPL
egrep will print the lines to standard output. If you want all lines that contain freedom followed by some indeterminate text, followed by GNU, you can do this:

egrep 'freedom.*GNU' /usr/doc/copyright/GPL 
The . means ``any character,'' and the * means ``zero or more of the preceding thing,'' in this case ``zero or more of any character.'' So .* matches pretty much any text at all. egrep only matches on a line-by-line basis, so freedom and GNU have to be on the same line.
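
A small experiment makes the line-by-line behavior visible. The sketch below pipes three made-up lines (they are not taken from the GPL) into grep -E, which is the same program as egrep under another name; with no filename given, it reads standard input. Only the line that has freedom and GNU together should be printed. The variable names sample and matches are just illustrations.

```shell
# Three made-up sample lines; only the second has both words on one line.
sample='freedom is mentioned here
the freedom to share GNU software
GNU appears alone on this line'

# grep -E behaves like egrep; with no filename it reads standard input.
matches=$(printf '%s\n' "$sample" | grep -E 'freedom.*GNU')
printf '%s\n' "$matches"
```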

Here's a summary of regular expression metacharacters:

.
Matches any single character except newline.
*
Matches zero or more occurrences of the preceding thing. So the expression a* matches zero or more lowercase a, and .* matches zero or more characters.
[characters]
The brackets must contain one or more characters; the whole bracketed expression matches exactly one character out of the set. So [abc] matches one a, one b, or one c; it does not match zero characters, and it does not match a character other than these three.
^
Anchors your search at the beginning of the line. The expression ^The matches The when it appears at the beginning of a line; there can't be spaces or other text before The. If you want to allow spaces, you can permit 0 or more space characters like this: ^ *The.
$
Anchors at the end of the line. end$ requires the text end to be at the end of the line, with no intervening spaces or text.
[^characters]
This reverses the sense of a bracketed character list. So [^abc] matches any single character, except a, b, or c.
[character-character]
You can include ranges in a bracketed character list. To match any lowercase letter, use [a-z]. You can have more than one range; so to match the first three or last three letters of the alphabet, try [a-cx-z]. To get any letter, any case, try [a-zA-Z]. You can mix ranges with single characters and with the ^ metacharacter; for example, [^a-zBZ] means ``anything except a lowercase letter, capital B, or capital Z.''
()
You can use parentheses to group parts of the regular expression, just as you do in a mathematical expression.
|
| means ``or.'' You can use it to provide a series of alternative expressions. Usually you want to put the alternatives in parentheses, like this: c(ad|ab|at) matches cad or cab or cat. Without the parentheses, it would match cad or ab or at instead.
\
Escapes any special characters; if you want to find a literal *, you type \*. The backslash means to ignore *'s usual special meaning.
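
If you want to try several of these metacharacters at once, something like the following sketch may help. It assumes a made-up word list held in a shell variable, and a small helper function called show that exists only for this illustration; grep -E is used as an equivalent of egrep.

```shell
# Made-up word list, one entry per line.
words='cad
cab
cat
scab
at
a*b
The end
  The end'

# Hypothetical helper: print the lines of $words that match the regexp in $1.
show() {
    printf '%s\n' "$words" | grep -E "$1"
}

grouped=$(show '^c(ad|ab|at)$')   # grouping and alternation: cad, cab, cat
anchored=$(show '^ *The')         # both "The end" lines, with or without leading spaces
literal=$(show 'a\*b')            # backslash escape: only the line with a literal *
negated=$(show '^[^a-c]')         # lines whose first character is not a, b, or c

printf '%s\n' "$grouped"
```

Printing the other variables in the same way shows what each expression selects from the list.
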
Here are some more examples to help you get a feel for things:
c.pe
matches cope, cape, caper.
c\.pe
matches c.pe, c.per.
sto*p
matches stp, stop, stoop.
car.*n
matches carton, cartoon, carmen.
xyz.*
matches xyz and anything after it; some tools, like egrep, only match until the end of the line.
^The
matches The at the beginning of a line.
atime$
matches atime at the end of a line.
^Only$
matches a line that consists solely of the word Only - no spaces, no other characters, nothing. Only the word Only is allowed.
b[aou]rn
matches barn, born, burn.
Ver[D-F]
matches VerD, VerE, VerF.
Ver[^0-9]
matches Ver followed by any non-digit.
the[ir][re]
matches their, therr, there, theie.
[A-Za-z][A-Za-z]*
matches any word that consists only of letters and has at least one letter. It will not match digits or spaces.
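
You can check examples like these yourself without creating a file. The following is only a sketch: matches is a hypothetical helper defined here, not a standard command, and it relies on the -q option of grep, which suppresses output so that only the exit status reports whether a match was found.

```shell
# Hypothetical helper: succeed if the text in $2 contains a match for the regexp in $1.
matches() {
    printf '%s\n' "$2" | grep -Eq "$1"
}

# Each line prints its message only if the regexp behaves as described.
matches 'c.pe'        'caper' && printf '%s\n' 'c.pe matches caper'
matches 'c\.pe'       'cope'  || printf '%s\n' 'c\.pe does not match cope'
matches 'sto*p'       'stp'   && printf '%s\n' 'sto*p matches stp'
matches '^Only$'      'Only ' || printf '%s\n' '^Only$ rejects a trailing space'
matches 'Ver[^0-9]'   'Ver2'  || printf '%s\n' 'Ver[^0-9] does not match Ver2'
matches 'the[ir][re]' 'theie' && printf '%s\n' 'the[ir][re] matches theie'
```

Note that the regexp only has to match somewhere in the text, which is why c.pe is found inside caper.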


John Goerzen / Ossama Othman