Copyright © 2002 Paul Sheer.



8. Streams and sed -- The Stream Editor

The ability to use pipes is one of the powers of UNIX, and their absence is one of the principal deficiencies of some non-UNIX systems. Pipes used on the command-line as explained in this chapter are a neat trick, but pipes used inside C programs enormously simplify program interaction. Without pipes, huge amounts of complex and buggy code usually need to be written to perform simple tasks. It is hoped that this chapter will give the reader an idea of why UNIX is such a ubiquitous and enduring standard.

8.1 Introduction

The commands grep, echo, df and so on print some output to the screen. In fact, what is happening on a lower level is that they are writing characters one by one into a data stream (also called a pipe) known as the stdout pipe. The terminal that the shell is running in reads those characters and displays them on the screen. The word pipe itself means exactly that: a program places data in one end of a funnel while another program reads that data from the other end. Pipes allow two separate programs to perform simple communications with each other. In this case, the program is merely communicating with the terminal in order to display some output.

The same is true of the cat command explained previously. This command, when run with no arguments, reads from the stdin pipe. By default, this pipe is connected to the keyboard. One further pipe is the stderr pipe, to which a program writes error messages. It is usually not possible to tell by looking whether a message was written to stderr or to stdout, because both are normally directed to the screen. Good programs, however, always write to the appropriate pipe so that output can be separated for diagnostic purposes if need be.

8.2 Tutorial

Create a text file with lots of lines that contain the word GNU and one line that contains the word GNU as well as the word Linux. Then run grep GNU myfile.txt. The result is printed to stdout as usual. Now try grep GNU myfile.txt > gnu_lines.txt. What is happening here is that the output of the grep command is being redirected into a file. The > gnu_lines.txt tells the shell to create a new file gnu_lines.txt and to fill it with any output from stdout instead of displaying the output as it usually does. If the file already exists, it will be truncated. [Shortened to zero length.]

Now suppose you want to append further output to this file. Using >> instead of > does not truncate the file, but appends output to it. Try

echo "morestuff" >> gnu_lines.txt

then view the contents of gnu_lines.txt.
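
The difference between > and >> can be checked with a small sketch like the following. The scratch file name redirect_demo.txt is made up for this example; any writable path would do.

```shell
# Scratch file name is hypothetical; any writable path works.
f=redirect_demo.txt

echo "first"  > "$f"    # creates (or truncates) the file
echo "second" > "$f"    # truncates again, so "first" is lost
echo "third"  >> "$f"   # appends, so "second" is kept

cat "$f"
lines=$(wc -l < "$f")   # number of lines left in the file
echo "$lines"
```

Only the lines written by the last > and the >> after it should remain in the file.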

8.3 Piping Using | Notation

The real power of pipes is realized when one program can read from the output of another program. Consider the grep command, which reads from stdin when given no file names. Run grep with the search pattern as its only argument on the command-line:

[root@cericon]# grep GNU
A line without that word in it
Another line without that word in it
A line with the word GNU in it
A line with the word GNU in it
I have the idea now

grep's default behavior is to read from stdin when no files are given. As you can see, it is doing its usual work of printing lines that have the word GNU in them. Hence, lines containing GNU will be printed twice--as you type them in and again when grep reads them and decides that they contain GNU.

Now try grep GNU myfile.txt | grep Linux. The first grep outputs all lines with the word GNU in them to stdout. The | specifies that all stdout is to be typed as stdin (as we just did above) into the next command, which is also a grep command. The second grep command scans that data for lines with the word Linux in them. grep is often used this way as a filter [Something that screens data.] and can be used multiple times, for example,

grep L myfile.txt | grep i | grep n | grep u | grep x

The < character redirects the contents of a file in place of stdin. In other words, the contents of a file replace what would normally come from a keyboard. Try

grep GNU < gnu_lines.txt
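
The following is a minimal sketch of both ideas on a few made-up lines of text, so that it does not depend on the contents of myfile.txt. The file name pipe_demo.txt and the sample sentences are assumptions for illustration only.

```shell
# Sample text is made up for this sketch.
sample='GNU is not UNIX
Linux is a kernel
GNU and Linux together'

# Two filters chained with a pipe: keep lines with GNU, then those with Linux.
both=$(printf '%s\n' "$sample" | grep GNU | grep Linux)
echo "$both"

# The same first filter, this time reading its stdin from a file via <.
printf '%s\n' "$sample" > pipe_demo.txt
from_file=$(grep -c GNU < pipe_demo.txt)   # -c counts matching lines
echo "$from_file"
```

Only the third sample line contains both words, while two of the three lines contain GNU.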

8.4 A Complex Piping Example

In Chapter 5 we used grep on a dictionary to demonstrate regular expressions. This is how a dictionary of words can be created (your dictionary might be under /var/share/ or under /usr/lib/aspell instead):

cat /usr/lib/ispell/english.hash | strings | tr 'A-Z' 'a-z' \
| grep '^[a-z]' | sort -u > mydict

[A backslash \ as the last character on a line indicates that the line is to be continued. You can leave out the \ but then you must leave out the newline as well -- this is known as line continuation.]

The file english.hash contains the UNIX dictionary normally used for spell checking. With a bit of filtering, you can create a dictionary that will make solving crossword puzzles a breeze. First, we use the command strings, explained previously, to extract readable bits of text. Here we are using its alternate mode of operation where it reads from stdin when no files are specified on its command-line. The command tr (abbreviated from translate--see tr(1)) then converts upper to lower case. The grep command then filters out lines that do not start with a letter. Finally, the sort command sorts the words in alphabetical order. The -u option stands for unique, and specifies that duplicate lines of text should be stripped. Now try less mydict.
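
If english.hash is not available, the filtering stages can still be tried on their own. The sketch below feeds a made-up word list into the same tr, grep and sort -u stages; the words themselves are assumptions chosen to show each stage doing something.

```shell
# A made-up word list stands in for the output of strings.
words=$(printf '%s\n' Zebra apple 42nd Apple banana zebra |
    tr 'A-Z' 'a-z' |      # fold upper case to lower case
    grep '^[a-z]' |       # keep lines that start with a letter
    sort -u)              # sort and drop duplicate lines
echo "$words"
```

The entry starting with a digit is dropped by grep, and the duplicates created by the case folding are removed by sort -u.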

8.5 Redirecting Streams with >&

Try the command ls nofile.txt > A. We expect that ls will give an error message if the file doesn't exist. The error message is, however, displayed on the screen and not written into the file A. The reason is that ls has written its error message to stderr while > has only redirected stdout. The way to get stdout and stderr to go to the same file is to use a redirection operator. As far as the shell is concerned, stdout is called 1 and stderr is called 2, and commands can be appended with a redirection like 2>&1 to dictate that stderr is to be mixed into the output of stdout. The actual words stderr and stdout are only used in C programming, where the numbers 1 and 2 are known as file numbers or file descriptors. Try the following:

touch existing_file
rm -f non-existing_file
ls existing_file non-existing_file

ls will output two lines: a line containing a listing for the file existing_file and a line containing an error message to explain that the file non-existing_file does not exist. The error message would have been written to stderr or file descriptor number 2, and the remaining line would have been written to stdout or file descriptor number 1.

Next we try

ls existing_file non-existing_file 2>A
cat A

Now A contains the error message, while the remaining output came to the screen. Now try

ls existing_file non-existing_file 1>A
cat A

The notation 1>A is the same as >A because the shell assumes that you are referring to file descriptor 1 when you don't specify a file descriptor. Now A contains the stdout output, while the error message has been redirected to the screen.

Now try

ls existing_file non-existing_file 1>A 2>&1
cat A

Now A contains both the error message and the normal output. The >& is called a redirection operator. x>&y tells the shell to make file descriptor x a copy of file descriptor y, so that anything written to x goes wherever y currently goes. Redirections are processed from left to right on the command-line. Hence, the above command means to redirect stdout to the file A and then to send stderr to the same place as stdout, which by then is A.

Now try

ls existing_file non-existing_file 2>A 1>&2
cat A

We notice that this has the same effect, except that here we are doing the reverse: redirecting stderr into a file A and then redirecting stdout to wherever stderr goes.

To see what happens if we redirect in reverse order, we can try,

ls existing_file non-existing_file 2>&1 1>A
cat A

which means to redirect stderr to wherever stdout currently goes (the screen), and only then to redirect stdout into a file A. This command will therefore not mix stderr and stdout, because stderr was pointed at the screen before stdout was redirected to A.
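
The effect of the order of redirections can be checked without relying on ls error messages. The sketch below assumes a hypothetical shell function, both_streams, that writes one line to each stream. The file names mixed.txt, only_out.txt and screen.txt are likewise made up, with screen.txt standing in for the screen so that the result can be inspected afterwards.

```shell
# both_streams is a hypothetical helper that writes one line to each stream.
both_streams() {
    echo "to stdout"
    echo "to stderr" 1>&2
}

# stdout is sent to the file first, then stderr is pointed at the same place.
both_streams 1>mixed.txt 2>&1

# Here stderr is pointed at the current stdout first (screen.txt, set by the
# outer redirect), and only afterwards is stdout sent to only_out.txt.
{ both_streams 2>&1 1>only_out.txt; } > screen.txt

cat mixed.txt
cat only_out.txt
cat screen.txt
```

mixed.txt ends up with both lines, whereas only_out.txt holds just the stdout line and the stderr line lands in screen.txt.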

8.6 Using sed to Edit Streams

ed used to be the standard text editor for UNIX. It is cryptic to use but is compact and programmable. sed stands for stream editor and is the only incarnation of ed that is commonly used today. sed allows editing of files non-interactively. In the way that grep can search for words and filter lines of text, sed can do search-replace operations, insert lines into text files, and delete lines from them. sed is one of those programs with no man page to speak of. Do info sed to see sed's comprehensive info pages with examples.

The most common usage of sed is to replace words in a stream with alternative words. sed reads from stdin and writes to stdout. Like grep, it is line buffered, which means that it reads one line in at a time and then writes that line out again after performing whatever editing operations were requested. Replacements are typically done with

cat <file> | sed -e 's/<search-regexp>/<replace-text>/<option>' \
> <resultfile>

where <search-regexp> is a regular expression, <replace-text> is the text you would like to replace each occurrence with, and <option> is either nothing or g, which means to replace every occurrence in the same line (without it, sed replaces only the first occurrence of the regular expression in each line). (There are other <option> values; see the sed info page.) For demonstration, type

sed -e 's/e/E/g'

and type out a few lines of English text.
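
To compare the effect of the g option without typing at the keyboard, a sketch along these lines can be used. The sample sentence is made up.

```shell
line='the tree fell'
first_only=$(echo "$line" | sed -e 's/e/E/')    # no option: first match per line
every=$(echo "$line" | sed -e 's/e/E/g')        # g: every match in the line
echo "$first_only"
echo "$every"
```

The first command should print thE tree fell and the second thE trEE fEll.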

8.7 Regular Expression Subexpressions

This section explains how to do the apparently complex task of moving text around within lines. Consider, for example, the output of ls: say you want to automatically strip out only the size column. sed can do this sort of editing if you use the special \( \) notation to group parts of the regular expression together. Consider the following example:

sed -e 's/\(\<[^ ]*\>\)\([ ]*\)\(\<[^ ]*\>\)/\3\2\1/g'

Here sed is searching for the expression \<[^ ]*\>[ ]*\<[^ ]*\>. From the chapter on regular expressions, we can see that it matches a whole word, an arbitrary amount of whitespace, and then another whole word. The \( \) groups these three so that they can be referred to in <replace-text>. Each part of the regular expression inside \( \) is called a subexpression of the regular expression. Each subexpression is numbered: \1, \2, etc. Hence, \1 in <replace-text> is the first \<[^ ]*\>, \2 is [ ]*, and \3 is the second \<[^ ]*\>.

Now test to see what happens when you run this:

sed -e 's/\(\<[^ ]*\>\)\([ ]*\)\(\<[^ ]*\>\)/\3\2\1/g'
GNU Linux is cool
Linux GNU cool is
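
Subexpressions are not limited to swapping words. As a further sketch, the command below reorders the three numeric fields of a made-up date; the date and the choice of . as the output separator are assumptions for illustration.

```shell
# Reorder a made-up date from year-month-day to day.month.year.
# \1, \2 and \3 refer to the three \( \) groups in the order they open.
swapped=$(echo '2002-03-17' |
    sed -e 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/\3.\2.\1/')
echo "$swapped"
```

This should print 17.03.2002.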

To return to our ls example (note that this is just an example, to count file sizes you should instead use the du command), think about how we could sum the bytes sizes of all the files in a directory:

expr 0 `ls -l | grep '^-' | \
    sed 's/^\([^ ]*[ ]*\)\{4,4\}\([0-9]*\).*$/ + \2/'`

We know that ls -l output lines start with - for ordinary files. So we use grep to strip lines not starting with -. If we do an ls -l, we see that the output is divided into four columns of stuff we are not interested in, and then a number indicating the size of the file. A column (or field) can be described by the regular expression [^ ]*[ ]*, that is, a length of text with no whitespace, followed by a length of whitespace. There are four of these, so we bracket it with \( \) and then use the \{ \} notation to specify that we want exactly 4. After that comes our number [0-9]*, and then any trailing characters, which we are not interested in, .*$. Notice here that we have neglected to use \< \> notation to indicate whole words. The reason is that sed tries to match the maximum number of characters legally allowed, which, in the situation we have here, has exactly the same effect.

If you haven't yet figured it out, we are trying to get that column of byte sizes into a format like

+ 438
+ 1525
+ 76
+ 92146

so that expr can understand it. Hence, we replace each line with subexpression \2 and a leading + sign. Backquotes give the output of this to expr, which studiously sums them, ignoring any newline characters as though the summation were typed in on a single line. There is one minor problem here: the first line contains a + with nothing before it, which will cause expr to complain. To get around this, we can just add a 0 to the expression, so that it becomes 0 + ....
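
Because real ls -l output differs from directory to directory, the sketch below runs the same grep and sed stages on a fixed, made-up listing so that the sum can be checked by hand. It uses $( ) in place of backquotes only so that the two substitutions can nest; the file names, owners and sizes are assumptions.

```shell
# Made-up ls -l output; the directory line (starting with d) must be skipped.
listing='-rw-r--r-- 1 root root 438 Jan  1 12:00 a.txt
drwxr-xr-x 2 root root 4096 Jan  1 12:00 subdir
-rw-r--r-- 1 root root 1525 Jan  1 12:00 b.txt'

# The inner substitution is left unquoted on purpose, so that its output is
# split into the separate words  + 438 + 1525  that expr expects.
total=$(expr 0 $(printf '%s\n' "$listing" | grep '^-' |
    sed 's/^\([^ ]*[ ]*\)\{4,4\}\([0-9]*\).*$/ + \2/'))
echo "$total"
```

Here expr evaluates 0 + 438 + 1525, so the total printed should be 1963.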

8.8 Inserting and Deleting Lines

sed can perform a few operations that make it easy to write scripts that edit configuration files for you. For instance,

sed -e '7a\
an extra line.\
another one.\
one more.'

appends three lines after line 7, whereas

sed -e '7i\
an extra line.\
another one.\
one more.'

inserts three lines before line 7. Then

sed -e '3,5D'

deletes lines 3 through 5.

In sed terminology, the numbers here are called addresses, which can also be regular expression matches. To demonstrate:

sed -e '/Dear Henry/,/Love Jane/D'

deletes all the lines starting from a line matching the regular expression Dear Henry up to a line matching Love Jane (or the end of the file if one does not exist).

This behavior applies just as well to insertions:

sed -e '/Love Jane/i\
Love Carol\
Love Beth'

Note that the $ symbol indicates the last line:

sed -e '$a\
The new second last line\
The new last line.'

and finally, the negation symbol, !, is used to match all lines not specified; for instance,

sed -e '7,11!D'

deletes all lines except lines 7 through 11.
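
These operations can be tried on a short, made-up input where the result is easy to verify. The sketch below uses the lowercase d command, the ordinary delete, which behaves the same as D in these examples because sed is holding only one line at a time.

```shell
# Five made-up input lines, named so the effect is easy to see.
input=$(printf '%s\n' one two three four five)

deleted=$(echo "$input" | sed -e '2,4d')     # drop lines 2 through 4
kept=$(echo "$input" | sed -e '2,4!d')       # drop everything except 2 through 4
appended=$(echo "$input" | sed -e '$a\
six')                                        # add a line after the last one

echo "$deleted"
echo "$kept"
echo "$appended"
```

The first result should contain only one and five, the second only two, three and four, and the third all five original lines followed by six.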
