Syntax
Thanks to lexical analysis, we can split up input streams into more
structured units: lexical units. We still need to know how to assemble
these units so that they amount to syntactically correct sentences in
a given language. The syntactic assembly rules are defined by
grammar rules. This formalism was originally developed in the
field of linguistics, and has proven immensely useful to
mathematicians and computer scientists working on the theory of
languages. We have already seen on page ?? an
instance of a grammar for the Basic language. We return to this
example to introduce the basic concepts of grammars.
Grammar
Formally, a grammar is made up of four elements:
- A set of symbols called terminals. Such symbols
represent the lexical units of the language. In Basic,
the lexical units (terminals) are: the arithmetic and logical
operator and relation symbols (+, &, <, <=, ..), the keywords of
the language (GOTO, PRINT, IF, THEN, ..), integers (integer
units) and variables (variable units).
- A set of symbols called non-terminals. Such symbols
stand for syntactic terms of the language. For example, a
Basic program is composed of lines (hence the term Line); a line
may contain an Expression; etc.
- A set of so-called production rules. These describe how
terminal and non-terminal symbols may be combined to produce a syntactic
term. A Basic line is made up of a number followed by an
instruction. This is expressed by the following rule:

Line ::= integer Instruction
For any given term, there may be several alternative ways to form that
term. We separate the alternatives with the symbol | as in:

Instruction ::= LET variable = Expression
             |  GOTO integer
             |  PRINT Expression
etc.
- Finally, we designate a particular non-terminal as the
start symbol. The start symbol identifies a complete
translation unit (program) in our language, and the corresponding
production rule is used as the starting point for parsing. (These
notions are rendered concretely in the sketch following this list.)
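To make the correspondence concrete, here is one way the grammar
fragment above might be mirrored by Objective CAML types for abstract
syntax. This is only a sketch with hypothetical constructor names (the
Basic interpreter seen earlier in the book uses richer definitions):
one type per non-terminal, one constructor per production alternative.

(* A hypothetical rendering of the grammar fragment above as types. *)
type expression =
    EInt of int                                (* integer unit *)
  | EVar of string                             (* variable unit *)
  | EBin of string * expression * expression   (* operator and operands *)

type instruction =
    Let of string * expression      (* LET variable = Expression *)
  | Goto of int                     (* GOTO integer *)
  | Print of expression             (* PRINT Expression *)

(* Line ::= integer Instruction *)
type line = { number : int; instr : instruction }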
Production and Recognition
Production rules allow us to recognize that a sequence of
lexemes belongs to a particular language.
Let's consider, for instance, a simple language for arithmetic
expressions:
Exp ::= integer       (R1)
     |  Exp + Exp     (R2)
     |  Exp * Exp     (R3)
     |  ( Exp )       (R4)
where (R1), (R2), (R3) and (R4) are the names given to our rules. After
lexical analysis, the expression 1*(2+3) becomes the following sequence
of lexemes:

integer * ( integer + integer )

To analyze this sentence and recognize that it really belongs to the
language of arithmetic expressions, we are going to use the rules
from right to left: if a subexpression matches the right-hand side
of a rule, we replace it with the corresponding left-hand side and
we re-run the process until the expression is reduced to the
start non-terminal (here Exp). Here are the stages of such an
analysis:
integer * ( integer + integer )
  ←(R1)  Exp * ( integer + integer )
  ←(R1)  Exp * ( Exp + integer )
  ←(R1)  Exp * ( Exp + Exp )
  ←(R2)  Exp * ( Exp )
  ←(R4)  Exp * Exp
  ←(R3)  Exp
Starting from the last line, which contains only Exp, and following
the arrows upwards, we can read how our expression is produced from
the start symbol Exp: it is therefore a well-formed sentence of
the language defined by the grammar.
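For a grammar this small, the reduction process just described can
even be played out as a program. The following sketch is our own
illustration, not a general parsing algorithm: it rewrites greedily at
the leftmost matching position and never backtracks, which happens to
suffice for this example. Symbols are plain strings, with "int"
standing for the integer lexical unit.

(* One reduction step: replace a right-hand side by its left-hand side. *)
let reduce_once = function
  | "int" :: rest -> Some ("Exp" :: rest)                    (* R1 *)
  | "Exp" :: "+" :: "Exp" :: rest -> Some ("Exp" :: rest)    (* R2 *)
  | "Exp" :: "*" :: "Exp" :: rest -> Some ("Exp" :: rest)    (* R3 *)
  | "(" :: "Exp" :: ")" :: rest -> Some ("Exp" :: rest)      (* R4 *)
  | _ -> None

(* Try to reduce at the leftmost possible position. *)
let rec reduce_leftmost symbols =
  match reduce_once symbols with
  | Some s -> Some s
  | None ->
      match symbols with
      | [] -> None
      | x :: rest ->
          (match reduce_leftmost rest with
           | Some s -> Some (x :: s)
           | None -> None)

(* Reduce until only the start symbol remains. *)
let rec recognize symbols =
  symbols = ["Exp"] ||
  (match reduce_leftmost symbols with
   | Some s -> recognize s
   | None -> false)

(* recognize ["int"; "*"; "("; "int"; "+"; "int"; ")"] evaluates to true *)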
The translation of grammars into programs capable of recognizing that
a sequence of lexemes belongs to the language defined by a grammar is
a much more complex problem than that of using regular
expressions. Indeed, a mathematical result tells us that all sets (of
words) defined by means of a regular expression formalism can also be
defined by another formalism: deterministic finite
automata, and the latter are easily expressed as programs taking a
sequence of characters as input. We do not have a similar result for
generic grammars. However, we have weaker results
establishing the equivalence between certain classes of grammars and
somewhat richer automata: pushdown automata. We do not want
to enter into the details of such results, nor give an exact
definition of what an automaton is. Still, we need to identify
a class of grammars that may be used with parser-generating tools or
parsed directly.
Top-down Parsing
The analysis of the expression 1*(2+3) introduced in the
previous paragraph is not unique: it could also have started by
reducing integers from right to left, which would have permitted
rule (R2) to reduce 2+3 from the beginning instead. These two
ways to proceed constitute two types of analysis: top-down parsing
(right-to-left) and bottom-up parsing (left-to-right). The former is
easily realizable with lexeme streams using module
Stream; bottom-up parsing is the one carried out by the
ocamlyacc tool, which uses an explicit stack mechanism like the
one already described for the parsing of Basic programs. The choice
of parsing type is significant, as top-down analysis may or
may not be possible given the form of the grammar used to specify the
language.
A Simple Case
The canonical example for top-down parsing is the prefix notation of
arithmetic expressions defined by:
Expr ::= integer
      |  + Expr Expr
      |  * Expr Expr
In this case, knowing the first lexeme is enough to decide which
production rule applies. This immediate predictability makes an
explicit parse stack unnecessary: the stack of the parser's recursive
calls plays that role. Therefore, it is
very easy to write a program implementing top-down analysis using the
features in modules Genlex and
Stream. Function infix_of is an example; it takes a
prefix expression and returns its equivalent infix expression.
# let lexer s =
    let ll = Genlex.make_lexer ["+";"*"] in
    ll (Stream.of_string s) ;;
val lexer : string -> Genlex.token Stream.t = <fun>
# let rec stream_parse s = match s with parser
      [< 'Genlex.Ident x >] -> x
    | [< 'Genlex.Int n >] -> string_of_int n
    | [< 'Genlex.Kwd "+"; e1 = stream_parse; e2 = stream_parse >] ->
        "(" ^ e1 ^ "+" ^ e2 ^ ")"
    | [< 'Genlex.Kwd "*"; e1 = stream_parse; e2 = stream_parse >] ->
        "(" ^ e1 ^ "*" ^ e2 ^ ")"
    | [<>] -> failwith "Parse error" ;;
val stream_parse : Genlex.token Stream.t -> string = <fun>
# let infix_of s = stream_parse (lexer s) ;;
val infix_of : string -> string = <fun>
# infix_of "* +3 11 22" ;;
- : string = "((3+11)*22)"
One has to be careful: this parser is rather unforgiving. It is
advisable to systematically introduce a blank between lexical units
in the input string.
# infix_of "*+3 11 22" ;;
- : string = "*+"
A Less Simple Case
Parsing using streams is predictive. It imposes two conditions on
grammars.
- There must be no left-recursive rules in the grammar. A
rule is left-recursive when its right-hand side starts with the
non-terminal being defined by the rule, as in
Exp ::= Exp + Exp;
- No two rules may start with the same expression.
The usual grammar for arithmetical expressions on page
?? is not directly suitable for top-down analysis:
it satisfies neither of the above criteria. To be able to
use top-down parsing, we must reformulate the grammar so as to
suppress left-recursion and non-determinism in the rules. For
arithmetic expressions, we may use, for instance:
Expr     ::= Atom NextExpr
NextExpr ::= + Atom
          |  - Atom
          |  * Atom
          |  / Atom
          |  ε
Atom     ::= integer
          |  ( Expr )
Note that the use of the empty word ε in the definition of
NextExpr is compulsory if we want a single integer to be an
expression.
Our grammar allows the implementation of the following parser which is
a simple translation of the production rules. This parser produces the
abstract syntax tree of arithmetic expressions.
# let rec rest = parser
      [< 'Lsymbol "+"; e2 = atom >] -> Some (PLUS, e2)
    | [< 'Lsymbol "-"; e2 = atom >] -> Some (MINUS, e2)
    | [< 'Lsymbol "*"; e2 = atom >] -> Some (MULT, e2)
    | [< 'Lsymbol "/"; e2 = atom >] -> Some (DIV, e2)
    | [< >] -> None
  and atom = parser
      [< 'Lint i >] -> ExpInt i
    | [< 'Lsymbol "("; e = expr; 'Lsymbol ")" >] -> e
  and expr s = match s with parser
      [< e1 = atom >] ->
        match rest s with
          None -> e1
        | Some (op, e2) -> ExpBin (e1, op, e2) ;;
val rest : lexeme Stream.t -> (bin_op * expression) option = <fun>
val atom : lexeme Stream.t -> expression = <fun>
val expr : lexeme Stream.t -> expression = <fun>
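These three functions assume the lexeme type and the abstract syntax
of expressions (ExpInt, ExpBin, PLUS, ...) introduced with the Basic
interpreter on page ??. As a reminder, and as a minimal assumption
sufficient to experiment with the parser above, stand-in definitions
could be:

(* Minimal stand-ins for the types assumed by rest, atom and expr;
   the definitions actually used appear with the Basic interpreter. *)
type lexeme = Lint of int | Lident of string | Lsymbol of string
type bin_op = PLUS | MINUS | MULT | DIV
type expression =
    ExpInt of int
  | ExpBin of expression * bin_op * expression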
The problem with using top-down parsing is that it forces us to
use a grammar which is very restricted in its form. Moreover, when the
object language is naturally described with a left-recursive grammar
(as in the case of infix expressions) it is not always trivial to find
an equivalent grammar (i.e. one defining the same language) that
satisfies the requirements of top-down parsing. This is the reason why
tools such as yacc and ocamlyacc use a bottom-up
parsing mechanism which allows the definition of more natural-looking
grammars. We will see, however, that not everything is possible
with them, either.
Bottom-up Parsing
On page ??, we introduced intuitively the
actions of bottom-up parsing: shift and
reduce. Each of these actions modifies the state of the
stack. From the sequence of actions we can deduce which grammar
rules were used, provided the grammar allows it, as in the case of
top-down parsing. Here also the difficulty lies in the
non-determinism of the rules, which prevents choosing between
shifting and reducing. We are going to illustrate the inner workings
of bottom-up parsing and its failures by considering those pervasive
arithmetic expressions, first in postfix and then in infix notation.
The Good News
The simplified grammar for postfix arithmetic expressions is:
Expr ::= integer        (R1)
      |  Expr Expr +    (R2)
      |  Expr Expr *    (R3)
This grammar is dual to that of prefix expressions: it is necessary
to wait until the end of each analysis to know which rule
has been used, but then one knows exactly what to
do. In fact, the bottom-up analysis of such expressions resembles
quite closely a stack-based evaluation mechanism. Instead of pushing
the results of each calculation, we simply push the grammar
symbols. The idea is to start with an empty stack, then obtain a stack
which contains only the start symbol once the input is used
up. The modifications to the stack are the following: when we shift,
we push the lexical unit just read; if we may reduce, it is because the
first elements in the stack match the right-hand side of a rule (in
reverse order), in which case we replace these elements by the
corresponding left-hand non-terminal.
Figure 11.2 illustrates how bottom-up parsing processes the
expression 1 2 + 3 * 4 +. The end of input is noted with a
$ sign; each row shows the configuration reached after the action(s)
listed in the first column.

Action               Input              Stack
                     1 2 + 3 * 4 + $    []
Shift                2 + 3 * 4 + $      [1]
Reduce (R1)          2 + 3 * 4 + $      [Expr]
Shift                + 3 * 4 + $        [2 Expr]
Reduce (R1)          + 3 * 4 + $        [Expr Expr]
Shift, Reduce (R2)   3 * 4 + $          [Expr]
Shift, Reduce (R1)   * 4 + $            [Expr Expr]
Shift, Reduce (R3)   4 + $              [Expr]
Shift, Reduce (R1)   + $                [Expr Expr]
Shift, Reduce (R2)   $                  [Expr]

Figure 11.2: Bottom-up parsing example
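The stack discipline of figure 11.2 is easy to reproduce in a few
lines. The sketch below is our own illustration with an ad hoc token
type, not the automaton that ocamlyacc would generate: it reduces
whenever the top of the stack matches the right-hand side of a rule
and shifts otherwise, a strategy that works for this postfix grammar
precisely because the operator arrives last.

type token = Int of int | Plus | Times

(* Stack symbols: the non-terminal Expr, or a shifted lexical unit. *)
type symbol = Expr | Term of token

let recognize_postfix tokens =
  let rec step stack input =
    match stack, input with
    | Term (Int _) :: rest, _ ->                  (* reduce (R1) *)
        step (Expr :: rest) input
    | Term Plus :: Expr :: Expr :: rest, _ ->     (* reduce (R2) *)
        step (Expr :: rest) input
    | Term Times :: Expr :: Expr :: rest, _ ->    (* reduce (R3) *)
        step (Expr :: rest) input
    | _, tok :: rest ->                           (* shift *)
        step (Term tok :: stack) rest
    | _, [] ->                  (* accept iff only the start symbol remains *)
        stack = [Expr]
  in
  step [] tokens

(* recognize_postfix [Int 1; Int 2; Plus; Int 3; Times; Int 4; Plus]
   evaluates to true, following exactly the trace of figure 11.2. *)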
The Bad News
The difficulty in translating a grammar into a recognition program
lies in determining which type of action to apply. We will illustrate
this difficulty with three examples which generate three types of
indeterminacy.
The first example is a grammar for expressions using only addition:
E0 ::= integer      (R1)
    |  E0 + E0      (R2)
The indeterminacy in this grammar stems from rule (R2). Let's suppose
the following situation:
Action    Input    Stack
  ...
          + ...    [E0 + E0 ...]
  ...
In such a case, it is impossible to determine whether we have to shift
(push the +) or to reduce, using (R2), the two E0's and the
+ already in the stack. We are in the presence of a shift-reduce
conflict. This is because the expression integer +
integer + integer can be produced in two ways by
rightmost derivation.
First way:

E0 →(R2) E0 + E0
   →(R1) E0 + integer
   →(R2) E0 + E0 + integer
   etc.

Second way:

E0 →(R2) E0 + E0
   →(R2) E0 + E0 + E0
   →(R1) E0 + E0 + integer
   etc.
The expressions obtained by each derivation may look similar from the
point of view of expression evaluation:
(integer + integer) + integer and
integer + (integer + integer),
but they differ when building a syntax tree (see figure
6.3 on page ??).
The second instance of a grammar generating a conflict between
shifting and reducing has the same type of ambiguity: an implicit
parenthesizing. But contrary to the previous case, the choice between
shifting and reducing modifies the meaning of the parsed
expression. Let's consider the following grammar:
E1 ::= integer      (R1)
    |  E1 + E1      (R2)
    |  E1 * E1      (R3)
We find in this grammar the above-mentioned conflict both for +
and for *, but there is an added conflict between + and
*. Here again, an expression may be produced in two ways: there
are two rightmost derivations of
integer + integer * integer.
First way:

E1 →(R3) E1 * E1
   →(R1) E1 * integer
   →(R2) E1 + E1 * integer
   etc.

Second way:

E1 →(R2) E1 + E1
   →(R3) E1 + E1 * E1
   →(R1) E1 + E1 * integer
   etc.
Here the two implicit parenthesizations are not equivalent:

(integer + integer) * integer  ≠  integer + (integer * integer)
This problem has already been cited for Basic expressions (see page
??). It was solved by attributing different
precedence to each operator: we reduce (R3) before (R2), which
is equivalent to parenthesizing products.
We can also solve the problem of choosing between + and *
by modifying the grammar. We introduce two new non-terminals: T
(for terms) and F (for factors), which gives:
E ::= E + T        (R1)
   |  T            (R2)
T ::= T * F        (R3)
   |  F            (R4)
F ::= integer      (R5)
There is now but a single way to produce the sequence
integer + integer * integer: using rule (R1).
The third example concerns conditional instructions in programming
languages. A language such as Pascal offers two conditionals:
if .. then and if .. then .. else. Let's imagine the
following grammar:

Instr ::= if Exp then Instr                 (R1)
       |  if Exp then Instr else Instr      (R2)
       |  etc.
In the following situation:

Action    Input       Stack
  ...
          else ...    [Instr then Exp if ...]
  ...
We cannot decide whether the first elements in the stack correspond to
the conditional of rule (R1), in which case they must be reduced, or
to the first Instr in rule (R2), in which case we must shift.
Besides shift-reduce conflicts, bottom-up parsing may also generate
reduce-reduce conflicts.
We now introduce the ocamlyacc tool, which uses the bottom-up
parsing technique and detects these conflicts.
The ocamlyacc Tool
The ocamlyacc tool is built on the same principles as
ocamllex: it takes as input a file describing a grammar whose rules
have semantic actions attached, and returns two Objective CAML files with
extensions .ml and .mli containing a parsing function
and its interface.
General format
The syntax description files for ocamlyacc use the extension
.mly by convention and have the following structure:

%{
  header
%}
  declarations
%%
  rules
%%
  trailer-and-end
The rule format is:

non-terminal : symbol ... symbol    { semantic action }
             | ...
             | symbol ... symbol    { semantic action }
             ;
A symbol is either a terminal or a non-terminal. Sections ``header''
and ``trailer-and-end'' play the same role as in ocamllex, with
the one exception that the header is visible only to the rules and
not to the declarations. In particular, this implies that module openings
(open) are not taken into account in the declaration
part, and types there must be fully qualified.
Semantic actions
Semantic actions are pieces of Objective CAML code executed when the parser
reduces the rule they are associated with. The body of a semantic
action may reference the components of the right-hand term of the
rule. These are numbered from left to right starting with 1. The first
component is referenced by $1, the second by $2, etc.
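For instance, in a hypothetical grammar with tokens INT (carrying an
int value) and PLUS, a rule for addition could compute the value of
an expression directly in its semantic actions; $1 and $3 below refer
to the values of the two sub-expressions:

expr : expr PLUS expr   { $1 + $3 }   /* sum of the two components */
     | INT              { $1 }        /* value carried by the token */
     ;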
Start Symbols
We may declare several start symbols in the grammar, by writing
in the declaration section:
%start non-terminal .. non-terminal
For each of them a parsing function will be generated. We must also
declare, again in the declaration section, the output type of these
functions:
%type <output-type> non-terminal
The output-type
must be qualified.
Warning
Non-terminal symbols become the names of the parsing functions.
Therefore, they must not start with a capital letter, which is
reserved for constructors.
Lexical units
Grammar rules make reference to lexical units, the terminal
symbols of the rules.
One (or several) lexemes are declared in the following fashion:
%token PLUS MINUS MULT DIV MOD
Certain lexical units, like identifiers, represent a set of
(character) strings. When we find an identifier we may be interested
in recovering its character string. We specify in the parser that
these lexemes have an associated value by enclosing the type of this
value between < and >:

%token <string> IDENT
Warning
After being processed by ocamlyacc all these declarations are
transformed into constructors of type token. Therefore,
they must start with a capital letter.
We may use character strings as implicit terminals as in:
expr : expr "+" expr { ... }
| expr "*" expr { ... }
| ...
;
in which case it is pointless to declare a symbol which represents
them: they are directly processed by the parser without passing
through the lexer. In the interest of uniformity, we do not advise
this procedure.
Precedence, associativity
We have seen that many bottom-up parsing conflicts arise from implicit
operator associativity rules or precedence conflicts between operators. To
handle these conflicts, we may declare the default associativity of
operators (left, right, or non-associative) as well as
precedence rules. The following declaration states that the operators
+ (lexeme PLUS) and * (lexeme MULT) associate to
the left, and that * has a higher precedence than +
because MULT is declared after PLUS.

%left PLUS
%left MULT
Two operators declared on the same line have the same precedence.
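Putting these declarations together, a small self-contained grammar
might look as follows (a sketch, with token and rule names of our own
choosing). Thanks to the %left declarations, the ambiguous rules for
+ and * no longer need to be rewritten in the stratified style shown
earlier:

%token <int> INT
%token PLUS MULT EOF
%left PLUS              /* left-associative, lowest precedence */
%left MULT              /* left-associative, binds tighter than PLUS */
%start main
%type <int> main
%%
main : expr EOF         { $1 } ;
expr : expr PLUS expr   { $1 + $3 }
     | expr MULT expr   { $1 * $3 }
     | INT              { $1 }
     ;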
Command options
ocamlyacc has two options:
-
-b name: the generated Objective CAML files are name.ml
and name.mli;
- -v: create a file with extension .output containing the
numbering of the rules, the states of the automaton recognizing the
grammar, and the sources of conflicts.
Joint usage with ocamllex
We may compose both tools ocamllex and ocamlyacc so that
the transformation of a character stream into a lexeme stream is the
input to the parser. To do this, type lexeme should be known
to both. This type is defined in the files with extensions .mli
and .ml generated by ocamlyacc from the declaration of
the tokens in the matching file with extension .mly. The .mll file imports this type; ocamllex
translates this file into an Objective CAML function of type
Lexing.lexbuf -> lexeme. The example on page
?? illustrates this interaction and describes the
different phases of compilation.
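In practice the wiring is short. Assuming (hypothetically) a grammar
file parser.mly whose start symbol is main and a lexer file lexer.mll
defining a rule token, the modules Parser and Lexer generated by the
two tools compose as follows:

(* Hypothetical driver combining the generated modules: the parsing
   function takes the lexing function and a lexer buffer. *)
let () =
  let lexbuf = Lexing.from_channel stdin in
  let result = Parser.main Lexer.token lexbuf in
  print_int result; print_newline ()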
Contextual Grammars
The parsers generated by ocamlyacc process languages produced by
so-called context-free grammars. A parser for such a grammar
does not depend on previously processed syntactic values to process
the next lexeme. This is not the case for the language L described by
the following formula:

L ::= wCw | w   with w ∈ (A|B)*

where A, B and C are terminal symbols. We have written wCw
(with w ∈ (A|B)*) and not simply (A|B)*C(A|B)* because we
want the same word to the left and right of the middle C.
To parse the words in L, we must remember what has already
been found before letter C to verify that we find exactly the same
thing afterwards. Here is a solution for this problem based on
``visiting'' a stream. The general idea of the algorithm is to build a
stream parsing function which will recognize exactly the subword
before the possible occurrence of C.
We use the type:
# type token = A | B | C ;;
Function parse_w1 builds the function memorizing the first w,
in the guise of a list of atomic stream parsers (each recognizing a
single token):
# let rec parse_w1 s = match s with parser
      [< 'A; l = parse_w1 >] -> (parser [< 'A >] -> "a") :: l
    | [< 'B; l = parse_w1 >] -> (parser [< 'B >] -> "b") :: l
    | [< >] -> [] ;;
val parse_w1 : token Stream.t -> (token Stream.t -> string) list = <fun>
The result of each of the functions returned by parse_w1 is simply
the character string containing the parsed lexical unit.
Function parse_w2 takes as argument a list built by
parse_w1 to compose each of its elements into a single
parsing function:
# let rec parse_w2 l = match l with
      p :: pl -> (parser [< x = p; l = (parse_w2 pl) >] -> x ^ l)
    | [] -> parser [<>] -> "" ;;
val parse_w2 : ('a Stream.t -> string) list -> 'a Stream.t -> string = <fun>
The result of applying parse_w2 will be the string
representing subword w. By construction, function parse_w2
will not be able to recognize anything but the subword visited by
parse_w1.
Using the ability to name intermediate results in streams, we
write the recognition function for the words in the language L:
# let parse_L = parser
      [< l = parse_w1; 'C; r = (parse_w2 l) >] -> r ;;
val parse_L : token Stream.t -> string = <fun>
Here are two small examples. The first results in the string
surrounding C, the second fails because the words surrounding C
are different:
# parse_L [< 'A; 'B; 'B; 'C; 'A; 'B; 'B >] ;;
- : string = "abb"
# parse_L [< 'A; 'B; 'C; 'B; 'A >] ;;
Uncaught exception: Stream.Error("")