6. Command-line Options and Typed Variables

Contents:
Command-line Options
Integer Variables and Arithmetic
Arrays

You should have a healthy grasp of shell programming techniques now that you have gone through the previous chapters. What you have learned up to this point enables you to write many nontrivial, useful shell scripts and functions.

Still, you may have noticed some remaining gaps in the knowledge you need to write shell code that behaves like the UNIX commands you are used to. In particular, if you are an experienced UNIX user, it might have occurred to you that none of the example scripts shown so far have the ability to handle options (preceded by a dash (-)) on the command line. And if you program in a conventional language like C or Pascal, you will have noticed that the only type of data that we have seen in shell variables is character strings; we haven't seen how to do arithmetic, for example.

These capabilities are certainly crucial to the shell's ability to function as a useful UNIX programming language. In this chapter, we will show how the Korn shell supports these and related features.

6.1 Command-line Options

We have already seen many examples of the positional parameters (variables called 1, 2, 3, etc.) that the shell uses to store the command-line arguments to a shell script or function when it runs. We have also seen related variables like * (for the string of all arguments) and # (for the number of arguments).

Indeed, these variables hold all of the information on the user's command line. But consider what happens when options are involved. Typical UNIX commands have the form command [-options] args, meaning that there can be 0 or more options. If a shell script processes the command fred bob pete, then $1 is "bob" and $2 is "pete". But if the command is fred -o bob pete, then $1 is -o, $2 is "bob", and $3 is "pete".

You might think you could write code like this to handle it:

if [[ $1 = -o ]]; then
    code that processes the -o option
    1=$2
    2=$3
fi

normal processing of $1 and $2...

But this code has several problems. First, assignments like 1=$2 are illegal because positional parameters are read-only. Even if they were legal, another problem is that this kind of code imposes limitations on how many arguments the script can handle, which is very unwise. Furthermore, if this command had several possible options, the code to handle all of them would get very messy very quickly.

6.1.1 shift

Luckily, the shell provides a way around this problem. The command shift performs the function of:

1=$2
2=$3
...

for every argument, regardless of how many there are. If you supply a numeric argument to shift, it will shift the arguments that many times over; for example, shift 3 has this effect:

1=$4
2=$5
...

This leads immediately to some code that handles a single option (call it -o) and arbitrarily many arguments:

if [[ $1 = -o ]]; then
    process the -o option
    shift
fi

normal processing of arguments...

After the if construct, $1, $2, etc., are set to the correct arguments.
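
To see the effect of shift for yourself, a small experiment helps. The following is only a sketch: show_after_shift is a hypothetical helper name of our own, and it uses printf rather than print so that it also runs in other POSIX-style shells.

```shell
# Hypothetical helper: report what the positional parameters look like
# after discarding the first n of them.
show_after_shift() (
    n=$1                  # how many arguments to discard
    shift                 # drop n itself, leaving only the sample arguments
    shift "$n"            # the operation being demonstrated
    printf '%s:' "$#"     # number of arguments left
    printf ' %s' "$@"     # the arguments themselves
    printf '\n'
)

show_after_shift 1 -o bob pete     # prints "2: bob pete"
show_after_shift 3 a b c d e       # prints "2: d e"
```

The function body is enclosed in parentheses so that it runs in a subshell and leaves the caller's own positional parameters untouched.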

We can use shift together with the programming features we have seen so far to implement simple option schemes. However, we will need additional help when things get more complex. The getopts built-in command, which we will introduce later, provides this help.

shift by itself gives us enough power to implement the -N option to the highest script we saw in Chapter 4, Basic Shell Programming (Task 4-1). Recall that this script takes an input file that lists artists and the number of albums you have by them. It sorts the list and prints out the N highest numbers, in descending order. The code that does the actual data processing is:

filename=$1
howmany=${2:-10}
sort -nr $filename | head -$howmany

Our original syntax for calling this script was highest filename [-N], where N defaults to 10 if omitted. Let's change this to a more conventional UNIX syntax, in which options are given before arguments: highest [-N] filename. Here is how we would write the script with this syntax:

if [[ $1 = -+([0-9]) ]]; then
    howmany=$1
    shift
elif [[ $1 = -* ]]; then
    print 'usage: highest [-N] filename'
    return 1
else
    howmany="-10"
fi
filename=$1
sort -nr $filename | head $howmany

In this code, the option is considered to be supplied if $1 matches the pattern -+([0-9]). This uses one of the Korn shell's regular expression operators, which we saw in Chapter 4. Notice that we didn't surround the pattern with quotes (even double quotes); if we did, the shell would interpret it literally, not as a pattern. This pattern means "A dash followed by one or more digits." If $1 matches, then we assign it to the variable howmany.

If $1 doesn't match, we test to see if it's an option at all, i.e., if it matches the pattern -*. If it does, then it's invalid; we print an error message and exit with error status. If we reach the final (else) case, we assume that $1 is a filename and treat it as such in the ensuing code. The rest of the script processes the data as before.
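
If you want to try this logic outside the Korn shell, the same three-way test can be written with a case statement. This is a sketch under our own assumptions, not the script above: it replaces the Korn-specific pattern -+([0-9]) with two portable case patterns, wraps the logic in a function, strips the dash so that it can call head -n, and uses made-up sample data.

```shell
# Sketch of "highest [-N] filename" using only POSIX constructs.
highest() (
    case $1 in
        -*[!0-9]* | - )              # a dash followed by something other than digits
            printf '%s\n' 'usage: highest [-N] filename'
            return 1 ;;
        -* )                         # a dash followed by digits only
            howmany=${1#-}           # strip the dash: -3 becomes 3
            shift ;;
        * )
            howmany=10 ;;            # no option given: use the default
    esac
    filename=$1
    sort -nr "$filename" | head -n "$howmany"
)

# Sample input (hypothetical data): album count, then artist.
datafile=${TMPDIR:-/tmp}/highest_demo.$$
printf '%s\n' '5 artist_a' '12 artist_b' '3 artist_c' '9 artist_d' > "$datafile"
highest -2 "$datafile"     # prints the lines for 12 and 9 albums
rm -f "$datafile"
```

The order of the case patterns matters: the invalid forms are tested first, so anything that reaches the second pattern is known to consist of a dash and digits.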

We can extend what we have learned so far to a general technique for handling multiple options. For the sake of concreteness, assume that our script is called bob and we want to handle the options -a, -b, and -c:

while [[ $1 = -* ]]; do
    case $1 in 
	-a ) process option -a ;;
	-b ) process option -b ;;
	-c ) process option -c ;;
	*  ) print 'usage: bob [-a] [-b] [-c] args...'
	     return 1
    esac
    shift
done

normal processing of arguments...

This code checks $1 repeatedly as long as it starts with a dash (-). Then the case construct runs the appropriate code depending on which option $1 is. If the option is invalid (i.e., if it starts with a dash but isn't -a, -b, or -c), then the script prints a usage message and returns with an error exit status. After each option is processed, the arguments are shifted over. The result is that the positional parameters are set to the actual arguments when the while loop finishes.

Notice that this code is capable of handling options of arbitrary length, not just one letter (e.g., -fred instead of -a).
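
Here is a runnable sketch of the same idea, with a multi-letter option thrown in. The option name -verbose and the flag variables are our own inventions for illustration. To stay portable, the loop tests $# and leaves via break at the first non-option argument instead of testing [[ $1 = -* ]].

```shell
# Hypothetical version of bob that records which options were seen.
bob() (
    all=false brief=false verbose=false
    while [ $# -gt 0 ]; do
        case $1 in
            -a )       all=true ;;
            -b )       brief=true ;;
            -verbose ) verbose=true ;;    # a multi-letter option works the same way
            -* )       printf '%s\n' 'usage: bob [-a] [-b] [-verbose] args...'
                       return 1 ;;
            * )        break ;;           # first non-option argument: stop looking
        esac
        shift
    done
    printf 'all=%s brief=%s verbose=%s args=%s\n' "$all" "$brief" "$verbose" "$*"
)

bob -a -verbose pete dave    # prints "all=true brief=false verbose=true args=pete dave"
```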

6.1.2 Options with Arguments

We need to add one more ingredient to make option processing really useful. Recall that many commands have options that take their own arguments. For example, the cut command, on which we relied heavily in Chapter 4, accepts the option -d with an argument that determines the field delimiter (if it is not the default TAB). To handle this type of option, we just use another shift when we are processing the option.

Assume that, in our bob script, the option -b requires its own argument. Here is the modified code that will process it:

while [[ $1 = -* ]]; do
    case $1 in 
	-a ) process option -a ;;
	-b ) process option -b 
	     $2 is the option's argument
	     shift ;;
	-c ) process option -c ;;
	*  ) print 'usage: bob [-a] [-b barg] [-c] args...'
	     return 1
    esac
    shift
done

normal processing of arguments...
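
A filled-in sketch of this skeleton follows. The variable names are hypothetical, and we have added one check that the skeleton leaves out: if -b is the last word on the command line, there is no argument to pick up, so the sketch prints the usage message instead of shifting past the end of the argument list.

```shell
# Hypothetical version of bob in which -b takes its own argument.
bob() (
    usage='usage: bob [-a] [-b barg] [-c] args...'
    a_flag=false c_flag=false b_arg=""
    while [ $# -gt 0 ]; do
        case $1 in
            -a ) a_flag=true ;;
            -b ) if [ $# -lt 2 ]; then      # -b was given without an argument
                     printf '%s\n' "$usage"
                     return 1
                 fi
                 b_arg=$2                   # $2 is the option's argument
                 shift ;;                   # extra shift to skip over it
            -c ) c_flag=true ;;
            -* ) printf '%s\n' "$usage"
                 return 1 ;;
            * )  break ;;
        esac
        shift
    done
    printf 'a=%s b=%s c=%s args=%s\n' "$a_flag" "$b_arg" "$c_flag" "$*"
)

bob -a -b red pete dave    # prints "a=true b=red c=false args=pete dave"
```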

6.1.3 getopts

So far, we have a complete, though still constrained, way of handling command-line options. The above code does not allow a user to combine options with a single dash, e.g., -abc instead of -a -b -c. It also doesn't allow one to specify arguments to options without a space in between, e.g., -barg in addition to -b arg. [1]

[1] Although most UNIX commands allow this, it is actually contrary to the Command Syntax Standard Rules in intro(1) of the User's Manual.

The shell provides a built-in way to deal with multiple complex options without these constraints. The built-in command getopts [2] can be used as the condition of the while in an option-processing loop. Given a specification of which options are valid and which require their own arguments, it sets up the body of the loop to process each option in turn.

[2] getopts replaces the external command getopt(1), used in Bourne shell programming; getopts is better integrated into the shell's syntax and runs more efficiently. C programmers will recognize getopts as very similar to the standard library routine getopt(3).

getopts takes two arguments. The first is a string that can contain letters and colons. Each letter is a valid option; if a letter is followed by a colon, the option requires an argument. getopts picks options off the command line and assigns each one (without the leading dash) to a variable whose name is getopts' second argument. As long as there are options left to process, getopts will return exit status 0; when the options are exhausted, it returns exit status 1, causing the while loop to exit.

getopts does a few other things that make option processing easier; we'll encounter them as we examine how to use getopts in the preceding example:

while getopts ":ab:c" opt; do
    case $opt in 
	a  ) process option -a ;;
	b  ) process option -b 
	     $OPTARG is the option's argument ;;
	c  ) process option -c ;;
	\? ) print 'usage: bob [-a] [-b barg] [-c] args...'
	     return 1
    esac
done
shift $(($OPTIND - 1))

normal processing of arguments...

The call to getopts in the while condition sets up the loop to accept the options -a, -b, and -c, and specifies that -b takes an argument. (We will explain the : that starts the option string in a moment.) Each time the loop body is executed, it will have the latest option available, without a dash (-), in the variable opt.

If the user types an invalid option, getopts normally prints an unfortunate error message (of the form cmd: getopts: o bad option(s)) and sets opt to ?. However (now here's an obscure kludge), if you begin the option letter string with a colon, getopts won't print the message. [3] We recommend that you specify the colon and provide your own error message in a case that handles ?, as above.

[3] Evidently this was deemed necessary because you can't redirect getopts' standard error output to /dev/null; the result is (usually) a core dump.

We have modified the code in the case construct to reflect what getopts does. But notice that there are no more shift statements inside the while loop: getopts does not rely on shifts to keep track of where it is. It is unnecessary to shift arguments over until getopts is finished, i.e., until the while loop exits.

If an option has an argument, getopts stores it in the variable OPTARG, which can be used in the code that processes the option.

The one shift statement left is after the while loop. getopts stores in the variable OPTIND the number of the next argument to be processed; when the loop exits, that's the number of the first (non-option) command-line argument. For example, if the command line were bob -ab pete, then $OPTIND would be "3", because pete is consumed as the argument to -b. If it were bob -a -b pete, then $OPTIND would be "4".

The expression $(($OPTIND - 1)) is an arithmetic expression (as we'll see later in this chapter) equal to $OPTIND minus 1. This value is used as the argument to shift. The result is that the correct number of arguments are shifted out of the way, leaving the "real" arguments as $1, $2, etc.

Before we continue, now is a good time to summarize everything that getopts does:

Its first argument is a string containing all valid option letters. If an option requires an argument, a colon follows its letter in the string. An initial colon causes getopts not to print an error message when the user gives an invalid option.
Its second argument is the name of a variable that will hold each option letter (without any leading dash) as it is processed.
If an option takes an argument, the argument is stored in the variable OPTARG.
The variable OPTIND contains the number (i.e., the position) of the next command-line argument to be processed. After getopts is done, it equals the number of the first "real" argument.

The advantages of getopts are that it minimizes extra code necessary to process options and fully supports the standard UNIX option syntax (as specified in intro(1) of the User's Manual).
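
The following sketch puts these pieces together in a form you can run. The flag variables and the printf report are our own additions for illustration. We also assume the POSIX behavior in which, when the option string starts with a colon, a missing option argument is reported by setting the variable to a colon; the sketch treats that case as a usage error too.

```shell
# Hypothetical version of bob built on getopts.
bob() (
    OPTIND=1                     # start scanning at the first argument
    a_flag=false c_flag=false b_arg=""
    while getopts ":ab:c" opt; do
        case $opt in
            a )      a_flag=true ;;
            b )      b_arg=$OPTARG ;;      # the argument that followed -b
            c )      c_flag=true ;;
            : | \? ) printf '%s\n' 'usage: bob [-a] [-b barg] [-c] args...'
                     return 1 ;;
        esac
    done
    shift $((OPTIND - 1))        # discard everything getopts has consumed
    printf 'a=%s b=%s c=%s args=%s\n' "$a_flag" "$b_arg" "$c_flag" "$*"
)

bob -a -b red -c pete dave    # prints "a=true b=red c=true args=pete dave"
bob -acbred pete dave         # prints the same line
```

The second call shows both conveniences at once: several options share one dash, and the argument to -b follows it without a space.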

As a more concrete example, let's return to our C compiler front end (Task 4-2). So far, we have given our script the ability to process C source files (ending in .c), assembly code files (.s), and object code files (.o). Here is the latest version of the script:

objfiles=""
for filename in "$@"; do
    case $filename in 
        *.c ) 
            objname=${filename%.c}.o
            compile $filename $objname ;;
        *.s )
            objname=${filename%.s}.o
            assemble $filename $objname ;;
        *.o ) 
            objname=$filename ;;
        *   ) 
            print "error: $filename is not a source or object file."
            return 1 ;;
    esac
    objfiles="$objfiles $objname"
done
ld $objfiles

Now we can give the script the ability to handle options. To know what options we'll need, we'll have to discuss further what compilers do.

6.1.3.1 More About C Compilers

The C compiler on a typical modern UNIX system (ANSI C on System V Release 4) has roughly 30 different command-line options, but we'll limit ourselves to the most widely used ones.

Here's what we'll implement. All compilers provide the ability to eliminate the final linking step, i.e., the call to the linker ld. This is useful for compiling C code into object code files that will be linked later, and for taking advantage of the compiler's error checking separately before trying to link. The -c option suppresses the link step, producing only the compiled object code files.

C compilers are also capable of including lots of extra information in an object code file that can be used by a debugger (though it is ignored by the linker and the running program). If you don't know what a debugger is, see Chapter 9, Debugging Shell Programs. The debugger needs lots of information about the original C code to be able to do its job; the option -g directs the compiler to include this information in its object-code output.

If you aren't already familiar with UNIX C compilers, you may have thought it strange when you saw in the last chapter that the linker puts its output (the executable program) in a file called a.out. This convention is a historical relic that no one has bothered to change. Although it's certainly possible to change the executable's name with the mv command, the C compiler provides the option -o filename, which uses filename instead of a.out.

Another option we will support here has to do with libraries. A library is a collection of object code, some of which is to be included in the executable at link time. (This is in contrast to a precompiled object code file, all of which is linked in.) Each library includes a large amount of object code that supports a certain type of interface or activity; typical UNIX systems have libraries for things like networking, math functions, and graphics.

Libraries are extremely useful as building blocks that help programmers write complex programs without having to "reinvent the wheel" every time. The C compiler option -l name tells the linker to include whatever code is necessary from the library name [4] in the executable it builds. One particular library called c (the file libc.a) is always included. This is known as the C runtime library; it contains code for C's standard input and output capability, among other things.

[4] This is actually a file called libname.a in a standard library directory such as /lib.

Finally, it is possible for a good C compiler to do certain things that make its output object code smaller and more efficient. Collectively, these things are called optimization. You can think of an optimizer as an extra step in the compilation process that looks back at the object-code output and changes it for the better. The option -O invokes the optimizer.

Table 6.1 summarizes the options we will build into our C compiler front end.

Table 6.1: Popular C Compiler Options
Option	Meaning
-c	Produce object code only; do not invoke the linker
-g	Include debugging information in object code files
-l lib	Include the library lib when linking
-o exefile	Produce the executable file exefile instead of the default a.out
-O	Invoke the optimizer

You should also bear in mind this information about the options:

The options -o and -l lib are merely passed on to the linker (ld), which processes them on its own.
The -l lib option can be used multiple times to link in multiple libraries.
The -g option is passed to the ccom command (the program that does the actual C compilation).
We will assume that the optimizer is a separate program called optimize that accepts an object file as argument and optimizes it "in place," i.e., without producing a separate output file.

Here is the code for the script occ that includes option processing:

# initialize option-related variables
do_link=true
debug=""
link_libs="-l c"
exefile=""
opt=false

# process command-line options
while getopts ":cgl:o:O" option; do
    case $option in 
        c )    do_link=false ;;
        g )    debug="-g" ;;
        l )    link_libs="$link_libs -l $OPTARG" ;;
        o )    exefile="-o $OPTARG" ;;
        O )    opt=true ;;
        \? )    print 'usage: occ [-cgO] [-l lib] [-o file] files...'
               return 1 ;;
    esac
done
shift $(($OPTIND - 1))

# process the input files
objfiles=""
for filename in "$@"; do
    case $filename in 
        *.c ) 
            objname=${filename%.c}.o
            ccom $debug $filename $objname 
            if [[ $opt = true ]]; then
                optimize $objname 
            fi ;;
        *.s )
            objname=${filename%.s}.o
            as $filename $objname ;;
        *.o ) 
            objname=$filename ;;
        *   ) 
            print "error: $filename is not a source or object file."
            return 1 ;;
    esac
    objfiles="$objfiles $objname"
done

if [[ $do_link = true ]]; then
    ld $exefile $link_libs $objfiles
fi

Let's examine the option-processing part of this code. The first several lines initialize variables that we will use later to store the status of each of the options. We use "true" and "false" for truth values for readability; they are just strings and otherwise have no special meaning. The initializations reflect these assumptions:

We will want to link.
We will not want the compiler to generate space-consuming debugger information.
The only object-code library we will need is c, the standard C runtime library that is automatically linked in.
The executable file that the linker creates will be the linker's default file, a.out.
We will not want to invoke the optimizer.

The while, getopts, and case constructs process the options in the same way as the previous example. Here is what the code that handles each option does:

If the -c option is given, the do_link flag is set to "false," which will cause the if condition at the end of the script to be false, meaning that the linker will not run.
If -g is given, the debug variable is set to "-g". This is passed on the command line to the compiler.
Each -l lib that is given is appended to the variable link_libs, so that when the while loop exits, $link_libs is the entire string of -l options. This string is passed to the linker.
If -o file is given, the exefile variable is set to "-o file". This string is passed to the linker.
If -O is specified, the opt flag will be set. This specification causes the conditional if [[ $opt = true ]] to be true, which means that the optimizer will run.

The remainder of the code is a modification of the for loop we have already seen; the modifications are direct results of the above option processing and should be self-explanatory.
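
If you don't have programs called ccom and optimize on your system, you can still watch the option processing at work with a dry-run variant. This is a sketch under our own assumptions: run is a hypothetical helper that prints each command instead of executing it, the script body is wrapped in a function, and getopts writes to a variable named option so that it can never overwrite the opt flag.

```shell
# Hypothetical dry-run helper: print the command line instead of running it.
run() { printf '%s\n' "$*"; }

occ() (
    OPTIND=1
    do_link=true debug="" link_libs="-l c" exefile="" opt=false

    # getopts stores each letter in "option", leaving the "opt" flag alone
    while getopts ":cgl:o:O" option; do
        case $option in
            c  ) do_link=false ;;
            g  ) debug="-g" ;;
            l  ) link_libs="$link_libs -l $OPTARG" ;;
            o  ) exefile="-o $OPTARG" ;;
            O  ) opt=true ;;
            \? ) printf '%s\n' 'usage: occ [-cgO] [-l lib] [-o file] files...'
                 return 1 ;;
        esac
    done
    shift $((OPTIND - 1))

    objfiles=""
    for filename in "$@"; do
        case $filename in
            *.c ) objname=${filename%.c}.o
                  run ccom $debug "$filename" "$objname"
                  if [ "$opt" = true ]; then
                      run optimize "$objname"
                  fi ;;
            *.s ) objname=${filename%.s}.o
                  run as "$filename" "$objname" ;;
            *.o ) objname=$filename ;;
            *   ) printf '%s\n' "error: $filename is not a source or object file."
                  return 1 ;;
        esac
        objfiles="$objfiles $objname"
    done

    # $exefile, $link_libs and $objfiles are left unquoted on purpose,
    # so that each one splits into separate words
    if [ "$do_link" = true ]; then
        run ld $exefile $link_libs $objfiles
    fi
)

occ -g -O -l m -o prog main.c util.s
# prints:
#   ccom -g main.c main.o
#   optimize main.o
#   as util.s util.o
#   ld -o prog -l c -l m main.o util.o
```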