A regular expression is a description of a set of strings. This description
can be used to search through a file by looking for text that matches
the regular expression. Regular expressions are analogous to shell wildcards
(see section 6.6), but they are both
more complicated and more powerful.
A regular expression is made up of text and metacharacters. A metacharacter
is just a character with a special meaning. Metacharacters include the following:
. * [ ] - \ ^ $
If a regular expression contains only text (no metacharacters), it matches that
text. For example, the regular expression ``my regular expression''
matches the text ``my regular expression,'' and nothing else. Regular
expressions are usually case sensitive.
You can use the egrep command to display all lines in a file that contain
a match for a regular expression. Its syntax is as follows:
- egrep 'regexp' filename1 ...
The single quotation marks are not always needed, but they never hurt.
For example, to find all lines in the GPL that contain the word GNU, you type
- egrep 'GNU' /usr/doc/copyright/GPL
egrep will print the lines to standard output.
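As a rough illustration of literal matching and case sensitivity, the sketch below searches a small temporary file rather than the GPL. The file contents and variable names are made up for the example. It uses grep -E, the POSIX spelling of egrep, and the -i option (not otherwise covered in this section), which requests a case-insensitive match.

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf '%s\n' \
  'GNU is not Unix' \
  'a gnu is also an antelope' \
  'nothing to see here' > "$sample"

# Case-sensitive search: only the line with uppercase GNU is printed.
exact=$(grep -E 'GNU' "$sample")
printf '%s\n' "$exact"

# -i ignores case, so the lowercase "gnu" line is printed as well.
anycase=$(grep -E -i 'GNU' "$sample")
printf '%s\n' "$anycase"

rm -f "$sample"
```

The results are captured in variables only so they can be printed separately; running grep directly prints the same lines to standard output.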
If you want all lines that contain freedom followed by some indeterminate
text, followed by GNU, you can do this:
- egrep 'freedom.*GNU' /usr/doc/copyright/GPL
The . means ``any character,'' and the * means ``zero
or more of the preceding thing,'' in this case ``zero or more of any character.''
So .* matches pretty much any text at all. egrep only matches
on a line-by-line basis, so freedom and GNU have to be on
the same line.
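The line-by-line behavior can be checked with a small experiment. This is only a sketch: the sample lines are invented for the illustration, and grep -E stands in for egrep.

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf '%s\n' \
  'freedom to share and change GNU software' \
  'GNU comes before freedom on this line' \
  'this line mentions freedom only' \
  'and GNU appears on the next line' > "$sample"

# Only the first line has freedom followed, later on the same line, by GNU.
matches=$(grep -E 'freedom.*GNU' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

The second line is rejected because the order is wrong, and the third and fourth are rejected because the two words sit on different lines.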
Here's a summary of regular expression metacharacters:
- .
- Matches any single character except newline.
- *
- Matches zero or more occurrences of the preceding thing. So the
expression a* matches zero or more lowercase a, and .*
matches zero or more characters.
- [characters]
- The brackets must contain
one or more characters; the whole bracketed expression matches exactly one character
out of the set. So [abc] matches one a, one b, or one c; it does
not match zero characters, and it does not match a character other than these
three.
- ^
- Anchors your search at the beginning of the line. The expression
^The matches The when it appears at the beginning of a
line; there can't be spaces or other text before The. If you want to
allow spaces, you can permit 0 or more space characters like this: ^ *The.
- $
- Anchors at the end of the line. end$ requires the text
end to be at the end of the line, with no intervening spaces or text.
- [^characters]
- This reverses the sense
of a bracketed character list. So [^abc] matches any single
character, except a, b, or c.
- [character-character]
- You
can include ranges in a bracketed character list. To match any lowercase letter,
use [a-z]. You can have more than one range; so to match the first
three or last three letters of the alphabet, try [a-cx-z]. To get
any letter, any case, try [a-zA-Z]. You can mix ranges with single
characters and with the ^ metacharacter; for example, [^a-zBZ] means
``any single character except a lowercase letter, capital B, or capital Z.''
- ()
- You can use parentheses to group parts of the regular expression,
just as you do in a mathematical expression.
- |
- | means ``or.'' You can use it to provide a series of
alternative expressions. Usually you want to put the alternatives in parentheses,
like this: c(ad|ab|at) matches cad or cab or cat. Without the parentheses,
cad|ab|at would match cad or ab or at instead.
- \
- Escapes a special character; if you want to find
a literal *, you type \*. The backslash means
to ignore *'s usual special meaning.
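Several of the metacharacters summarized above can be tried out with a small helper. This is a sketch under stated assumptions: the function name try and the candidate lines are hypothetical, grep -E stands in for egrep, and LC_ALL=C is set so that ranges such as [a-z] cover only ASCII letters regardless of the local language settings.

```shell
# Hypothetical helper: the first argument is a pattern, the remaining
# arguments are candidate lines. Prints the lines that contain a match.
# LC_ALL=C keeps ranges such as [a-z] limited to ASCII letters.
try() {
  pattern=$1
  shift
  printf '%s\n' "$@" | LC_ALL=C grep -E -e "$pattern"
}

grouped=$(try 'c(ad|ab|at)' cad cab cat cot)        # cad, cab, cat
ungrouped=$(try 'cad|ab|at' cot lab bat)            # lab, bat
anchored=$(try '^ *The' 'The end' '  The end' 'At The end')
negated=$(try '^[^a-zBZ]$' a B Z C 7)               # C, 7
literal=$(try 'a\*b' 'a*b' aab)                     # a*b only

printf '%s\n' "$grouped" "$ungrouped" "$anchored" "$negated" "$literal"
```

The anchored pattern accepts the first two candidates and rejects the third, because At comes before The on that line.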
Here are some more examples to help you get a feel for things:
- c.pe
- matches cope, cape, caper.
- c\.pe
- matches c.pe, c.per.
- sto*p
- matches stp, stop, stoop.
- car.*n
- matches carton, cartoon, carmen.
- xyz.*
- matches xyz and anything after it; some tools, like egrep,
only match until the end of the line.
- ^The
- matches The at the beginning of a line.
- atime$
- matches atime at the end of a line.
- ^Only$
- matches a line that consists solely of the word Only:
no spaces, no other characters, nothing. Only Only is allowed.
- b[aou]rn
- matches barn, born, burn.
- Ver[D-F]
- matches VerD, VerE, VerF.
- Ver[^0-9]
- matches Ver followed by any non-digit.
- the[ir][re]
- matches their, therr, there, theie.
- [A-Za-z][A-Za-z]*
- matches any word that consists only of
letters and contains at least one letter. It will not match digits or spaces.
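A few of the examples above can be verified from the shell. The following is a minimal sketch: the word list and variable names are chosen for this illustration, and grep -E stands in for egrep. Because grep prints every line that contains a match, the last command adds ^ and $ so that the whole line has to be made of letters.

```shell
# Candidate lines, one per line (chosen for this illustration).
words='cope
cape
caper
c.pe
stp
stop
stoop
steep'

any_char=$(printf '%s\n' "$words" | grep -E 'c.pe')      # cope cape caper c.pe
literal_dot=$(printf '%s\n' "$words" | grep -E 'c\.pe')  # c.pe
star=$(printf '%s\n' "$words" | grep -E 'sto*p')         # stp stop stoop

# With ^ and $ added, the whole line must be a run of letters.
letters=$(printf '%s\n' 'Debian' 'GNU2' 'two words' |
  LC_ALL=C grep -E '^[A-Za-z][A-Za-z]*$')                # Debian

printf '%s\n' "$any_char" "$literal_dot" "$star" "$letters"
```

The unescaped dot in c.pe also matches the line c.pe itself, since a literal dot counts as ``any character.''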
John Goerzen / Ossama Othman