28.2 Tokenizing Rules

The sendmail program views the text that makes up rules and addresses as being composed of individual tokens. Rules are tokenized - divided up into individual parts - while the configuration file is being read and while they are being normalized. Addresses are tokenized at another time (as we'll show later), but the process is the same for both.

The text our.domain, for example, is composed of three tokens: our, a dot, and domain. These 10 characters are divided into tokens by the list of separation characters defined by the OperatorChars (pre-V8.7 $o) option (see Section 34.8.45, OperatorChars or $o):

Do.:%@!^=/[]                          prior to V8.7
O OperatorChars=.:%@!^/[]             V8.7 and above

When any of these separation characters is recognized in text, it is considered an individual token. Any leftover text between separation characters is then gathered into the remaining tokens:

xxx@yyy;zzz    becomes   xxx  @   yyy;zzz

@ is defined to be a token, but ; is not. Therefore, the text is divided into three tokens. However, in addition to the characters in the OperatorChars (pre-V8.7 $o) option, sendmail defines 10 tokenizing characters internally (in parseaddr.c):

()<>,;\"\r\n

These two lists are combined into one master list that is used for all tokenizing. The above example, when divided by using this master list, becomes five tokens instead of just three:

xxx@yyy;zzz    becomes   xxx  @   yyy  ;  zzz
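
The effect of the two lists can be approximated with a short Python sketch. It is not how parseaddr.c is written; the function name tokenize and the two constants are ours, chosen only to illustrate the splitting rule described above.

```python
# Illustrative sketch only; not sendmail's actual tokenizing code.
OPERATOR_CHARS = ".:%@!^/[]"        # V8.7 default for OperatorChars
INTERNAL_CHARS = "()<>,;\\\"\r\n"   # the 10 characters sendmail adds internally


def tokenize(text, separators):
    """Make each separator its own token; gather other text into tokens."""
    tokens, current = [], ""
    for ch in text:
        if ch in separators:
            if current:                 # flush any text gathered so far
                tokens.append(current)
                current = ""
            tokens.append(ch)           # the separator is a token by itself
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


# OperatorChars alone: ; is not a separator, so yyy;zzz stays together.
print(tokenize("xxx@yyy;zzz", OPERATOR_CHARS))
# Master list: ; now splits the text as well.
print(tokenize("xxx@yyy;zzz", OPERATOR_CHARS + INTERNAL_CHARS))
```

The first call yields three tokens and the second yields five, matching the two examples above.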

In rules, quotation marks can be used to override the meaning of tokenizing characters defined in the master list. For example,

"xxx@yyy";zzz    becomes   "xxx@yyy"  ;  zzz

Here, three tokens are produced, because the @ appears inside quotation marks. Note that the quotation marks are retained.
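
A minimal sketch of the quoting behavior follows, again in Python and again only an approximation. The name tokenize_quoted is hypothetical, and we assume that a quoted run always begins at the start of a token; backslash escapes inside quotation marks are not modeled.

```python
# Illustrative sketch only; quoting in sendmail itself has more cases.
# The master list without the quotation mark, which is handled separately.
SEPARATORS = ".:%@!^/[]()<>,;\\\r\n"


def tokenize_quoted(text):
    """Tokenize text, treating anything inside quotation marks as one token."""
    tokens, current, in_quotes = [], "", False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
            current += ch               # the quotation marks are retained
            if not in_quotes:           # a closing quote ends the token
                tokens.append(current)
                current = ""
        elif in_quotes:
            current += ch               # separators lose their meaning here
        elif ch in SEPARATORS:
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


print(tokenize_quoted('"xxx@yyy";zzz'))
```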

Because the configuration file is read sequentially from start to finish, the OperatorChars (pre-V8.7 $o) option should be defined before any rules are declared. But note that beginning with V8.7 sendmail, omission of this option causes the separation characters to default to:

. : % @ ! ^ / [ ]

28.2.1 $ Operators Are Tokens

As we progress into the details of rules, you will see that certain characters become operators when prefixed with a $ character. Operators cause sendmail to perform actions, such as looking for a match ($* is a wildcard operator) or replacing tokens with others by position ($1 is a replacement operator).

For tokenizing purposes, operators always divide one token from another, just as the characters in the master list did. For example:

xxx$*zzz    becomes   xxx  $*  zzz
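
The same idea can be sketched in Python. The function name tokenize_ops is ours, and the sketch makes a simplifying assumption: every operator is a $ followed by exactly one character. Operators that take a further argument are not handled, and the separation characters are left out to keep the example short.

```python
# Illustrative sketch only; assumes every operator is "$" plus one character.
def tokenize_ops(text):
    """Split text so that each $-prefixed operator is a token by itself."""
    tokens, current, i = [], "", 0
    while i < len(text):
        if text[i] == "$" and i + 1 < len(text):
            if current:                     # the operator ends the prior token
                tokens.append(current)
                current = ""
            tokens.append(text[i:i + 2])    # "$" and the character after it
            i += 2
        else:
            current += text[i]
            i += 1
    if current:
        tokens.append(current)
    return tokens


print(tokenize_ops("xxx$*zzz"))
```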

28.2.2 The Space Character Is Special

The space character is special for two reasons. First, although the space character is not in the master list, it always separates one token from another:

xxx zzz    becomes   xxx  zzz

Second, although the space character separates tokens, it is not itself a token. That is, in the above example the seven characters on the left (the fourth is the space in the middle) become two tokens of three letters each, not three tokens. Therefore the space character can be used inside the LHS or RHS of rules for improved clarity but does not itself become a token or change the meaning of the rule.
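
To make the distinction concrete, here is a small Python sketch in which a space ends the current token but is then discarded. The name tokenize_with_spaces is hypothetical, and only the default OperatorChars list is used as the set of separation characters.

```python
# Illustrative sketch only; a space separates tokens but never becomes one.
def tokenize_with_spaces(text, separators=".:%@!^/[]"):
    tokens, current = [], ""
    for ch in text:
        if ch == " ":
            if current:                 # the space ends the current token...
                tokens.append(current)
                current = ""
            # ...and is then thrown away
        elif ch in separators:
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


print(tokenize_with_spaces("xxx zzz"))
# Spaces added for readability around @ do not change the token list.
print(tokenize_with_spaces("xxx @ yyy") == tokenize_with_spaces("xxx@yyy"))
```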

28.2.3 Pasting Addresses Back Together

After an address has passed through all the rules (and has been modified by rewriting), the tokens that form it are pasted back together to form a single string. The pasting process is very straightforward in that it mirrors the tokenizing process:

xxx  @  yyy   becomes    xxx@yyy

The only exception to this straightforward pasting process occurs when two adjoining tokens are both simple text. Simple text is anything other than the separation characters (those defined by the OperatorChars (pre-V8.7 $o) option, see Section 34.8.45, and those defined internally by sendmail) or the operators (characters prefixed by a $ character). The xxx and yyy above are both simple text.

When two tokens of simple text are pasted together, the character defined by the BlankSub (B) option (see Section 34.8.5, BlankSub (B)) is inserted between them. [4] Usually, that option is defined as a dot, so two tokens of simple text would have a dot inserted between them when they are joined:

xxx  yyy   becomes    xxx.yyy

[4] In the old days (RFC733), usernames to the left of the @ could contain spaces. But UNIX also uses spaces as command-line argument separators, so option B was introduced.

Note that the improper use of a space character in the LHS or RHS of rules can lead to addresses that have a dot (or other character) inserted where one was not intended.
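
The pasting rule, including the BlankSub exception, can be sketched as follows. The names paste and is_simple_text are ours, and the blank_sub parameter simply stands in for whatever character the BlankSub (B) option defines; a dot is assumed as the default because that is the usual setting.

```python
# Illustrative sketch only; not sendmail's actual pasting code.
SEPARATORS = set(".:%@!^/[]()<>,;\\\"\r\n")   # the master list


def is_simple_text(token):
    """Simple text is neither a separation character nor a $ operator."""
    return token not in SEPARATORS and not token.startswith("$")


def paste(tokens, blank_sub="."):
    """Join tokens, inserting blank_sub between adjoining simple-text tokens."""
    out = []
    for prev, tok in zip([None] + tokens, tokens):
        if prev is not None and is_simple_text(prev) and is_simple_text(tok):
            out.append(blank_sub)
        out.append(tok)
    return "".join(out)


print(paste(["xxx", "@", "yyy"]))   # a separator lies between the text tokens
print(paste(["xxx", "yyy"]))        # two simple-text tokens adjoin
```

The second call shows how a stray space in a rule, which leaves two simple-text tokens side by side, ends up as a dot in the pasted address.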

