Syntax
Thanks to lexical analysis, we can split up input streams into more
structured units: lexical units. We still need to know how to assemble
these units so that they amount to syntactically correct sentences in
a given language. The syntactic assembly rules are defined by
grammar rules. This formalism was originally developed in the
field of linguistics, and has proven immensely useful to
mathematicians and computer scientists working on the theory of
languages. We have already seen on page ?? an
instance of a grammar for the Basic language. We return to this
example to introduce the basic concepts of grammars.
Grammar
Formally, a grammar is made up of four elements:
- A set of symbols called terminals. Such symbols
represent the lexical units of the language. In Basic,
the lexical units (terminals) are: the arithmetic and logical
operator and relation symbols (+, &, <, <=, ..), the keywords of
the language (GOTO, PRINT, IF, THEN, ..), integers (integer
units) and variables (variable units).
- A set of symbols called non-terminals. Such symbols
stand for syntactic terms of the language. For example, a
Basic program is composed of lines (hence the term Line); a line
may contain an Expression; etc.
- A set of so-called production rules. These describe how
terminal and non-terminal symbols may be combined to produce a syntactic
term. A Basic line is made up of a number followed by an
instruction. This is expressed by the following rule:

Line ::= integer Instruction
For any given term, there may be several alternative ways to form that
term. We separate the alternatives with the symbol | as in:

Instruction ::= LET variable = Expression
             |  GOTO integer
             |  PRINT Expression
etc.
- Finally, we designate a particular non-terminal as the
start symbol. The start symbol identifies a complete
translation unit (program) in our language, and the corresponding
production rule is used as the starting point for parsing. (These
notions are rendered concretely in the sketch following this list.)
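To make the correspondence concrete, here is one way the grammar
fragment above might be mirrored by Objective CAML types for abstract
syntax. This is only a sketch with hypothetical constructor names (the
Basic interpreter seen earlier in the book uses richer definitions):
one type per non-terminal, one constructor per production alternative.

(* A hypothetical rendering of the grammar fragment above as types. *)
type expression =
    EInt of int                                (* integer unit *)
  | EVar of string                             (* variable unit *)
  | EBin of string * expression * expression   (* operator and operands *)

type instruction =
    Let of string * expression      (* LET variable = Expression *)
  | Goto of int                     (* GOTO integer *)
  | Print of expression             (* PRINT Expression *)

(* Line ::= integer Instruction *)
type line = { number : int; instr : instruction }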
Production and Recognition
Production rules allow us to recognize that a sequence of
lexemes belongs to a particular language.
Let's consider, for instance, a simple language for arithmetic
expressions:
Exp ::= integer       (R1)
     |  Exp + Exp     (R2)
     |  Exp * Exp     (R3)
     |  ( Exp )       (R4)
where (R1), (R2), (R3) and (R4) are the names given to our rules. After
lexical analysis, the expression 1*(2+3) becomes the following sequence
of lexemes:

integer * ( integer + integer )

To analyze this sentence and recognize that it really belongs to the
language of arithmetic expressions, we are going to use the rules
from right to left: if a subexpression matches the right-hand side
of a rule, we replace it with the corresponding left-hand side and
we re-run the process until the expression is reduced to the
start non-terminal (here Exp). Here are the stages of such an
analysis:
integer * ( integer + integer )
  ←(R1)  Exp * ( integer + integer )
  ←(R1)  Exp * ( Exp + integer )
  ←(R1)  Exp * ( Exp + Exp )
  ←(R2)  Exp * ( Exp )
  ←(R4)  Exp * Exp
  ←(R3)  Exp
Starting from the last line, which contains only Exp, and following
the arrows upwards, we can read how our expression is produced from
the start symbol Exp: it is therefore a well-formed sentence of
the language defined by the grammar.
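For a grammar this small, the reduction process just described can
even be played out as a program. The following sketch is our own
illustration, not a general parsing algorithm: it rewrites greedily at
the leftmost matching position and never backtracks, which happens to
suffice for this example. Symbols are plain strings, with "int"
standing for the integer lexical unit.

(* One reduction step: replace a right-hand side by its left-hand side. *)
let reduce_once = function
  | "int" :: rest -> Some ("Exp" :: rest)                    (* R1 *)
  | "Exp" :: "+" :: "Exp" :: rest -> Some ("Exp" :: rest)    (* R2 *)
  | "Exp" :: "*" :: "Exp" :: rest -> Some ("Exp" :: rest)    (* R3 *)
  | "(" :: "Exp" :: ")" :: rest -> Some ("Exp" :: rest)      (* R4 *)
  | _ -> None

(* Try to reduce at the leftmost possible position. *)
let rec reduce_leftmost symbols =
  match reduce_once symbols with
  | Some s -> Some s
  | None ->
      match symbols with
      | [] -> None
      | x :: rest ->
          (match reduce_leftmost rest with
           | Some s -> Some (x :: s)
           | None -> None)

(* Reduce until only the start symbol remains. *)
let rec recognize symbols =
  symbols = ["Exp"] ||
  (match reduce_leftmost symbols with
   | Some s -> recognize s
   | None -> false)

(* recognize ["int"; "*"; "("; "int"; "+"; "int"; ")"] evaluates to true *)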
The translation of grammars into programs capable of recognizing that
a sequence of lexemes belongs to the language defined by a grammar is
a much more complex problem than that of using regular
expressions. Indeed, a mathematical result tells us that all sets (of
words) defined by means of a regular expression formalism can also be
defined by another formalism: deterministic finite
automata, and the latter are easily expressed as programs taking a
sequence of characters as input. We do not have a similar result for
generic grammars. However, we have weaker results
establishing the equivalence between certain classes of grammars and
somewhat richer automata: pushdown automata. We do not want
to enter into the details of such results, nor give an exact
definition of what an automaton is. Still, we need to identify
a class of grammars that may be used with parser-generating tools or
parsed directly.
Top-down Parsing
The analysis of the expression 1*(2+3) introduced in the
previous paragraph is not unique: it could also have started by
reducing integers from right to left, which would have permitted
rule (R2) to reduce 2+3 from the beginning instead. These two
ways to proceed constitute two types of analysis: top-down parsing
(right-to-left) and bottom-up parsing (left-to-right). The former is
easily realizable with lexeme streams using module
Stream; bottom-up parsing is the one carried out by the
ocamlyacc tool, which uses an explicit stack mechanism like the
one already described for the parsing of Basic programs. The choice
of parsing type is significant, as top-down analysis may or
may not be possible given the form of the grammar used to specify the
language.
A Simple Case
The canonical example for top-down parsing is the prefix notation of
arithmetic expressions defined by:
Expr ::= integer
      |  + Expr Expr
      |  * Expr Expr
In this case, knowing the first lexeme is enough to decide which
production rule applies. This immediate predictability makes an
explicit parse stack unnecessary: the stack of the parser's recursive
calls plays that role. Therefore, it is
very easy to write a program implementing top-down analysis using the
features in modules Genlex and
Stream. Function infix_of is an example; it takes a
prefix expression and returns its equivalent infix expression.
# let lexer s =
    let ll = Genlex.make_lexer ["+";"*"] in
    ll (Stream.of_string s) ;;
val lexer : string -> Genlex.token Stream.t = <fun>
# let rec stream_parse s = match s with parser
      [< 'Genlex.Ident x >] -> x
    | [< 'Genlex.Int n >] -> string_of_int n
    | [< 'Genlex.Kwd "+"; e1 = stream_parse; e2 = stream_parse >] ->
        "(" ^ e1 ^ "+" ^ e2 ^ ")"
    | [< 'Genlex.Kwd "*"; e1 = stream_parse; e2 = stream_parse >] ->
        "(" ^ e1 ^ "*" ^ e2 ^ ")"
    | [<>] -> failwith "Parse error" ;;
val stream_parse : Genlex.token Stream.t -> string = <fun>
# let infix_of s = stream_parse (lexer s) ;;
val infix_of : string -> string = <fun>
# infix_of "* +3 11 22" ;;
- : string = "((3+11)*22)"
One has to be careful: this parser is rather unforgiving. It is
advisable to systematically introduce a blank between lexical units
in the input string.
# infix_of "*+3 11 22" ;;
- : string = "*+"
A Less Simple Case
Parsing using streams is predictive. It imposes two conditions on
grammars.
- There must be no left-recursive rules in the grammar. A
rule is left-recursive when its right-hand side starts with the
non-terminal being defined by the rule, as in
Exp ::= Exp + Exp;
- No two rules may start with the same expression.
The usual grammar for arithmetical expressions on page
?? is not directly suitable for top-down analysis:
it satisfies neither of the above criteria. To be able to
use top-down parsing, we must reformulate the grammar so as to
suppress left-recursion and non-determinism in the rules. For
arithmetic expressions, we may use, for instance:
Expr     ::= Atom NextExpr
NextExpr ::= + Atom
          |  - Atom
          |  * Atom
          |  / Atom
          |  ε
Atom     ::= integer
          |  ( Expr )
Note that the use of the empty word ε in the definition of
NextExpr is compulsory if we want a single integer to be an
expression.
Our grammar allows the implementation of the following parser which is
a simple translation of the production rules. This parser produces the
abstract syntax tree of arithmetic expressions.
# let rec rest = parser
      [< 'Lsymbol "+"; e2 = atom >] -> Some (PLUS, e2)
    | [< 'Lsymbol "-"; e2 = atom >] -> Some (MINUS, e2)
    | [< 'Lsymbol "*"; e2 = atom >] -> Some (MULT, e2)
    | [< 'Lsymbol "/"; e2 = atom >] -> Some (DIV, e2)
    | [< >] -> None
  and atom = parser
      [< 'Lint i >] -> ExpInt i
    | [< 'Lsymbol "("; e = expr; 'Lsymbol ")" >] -> e
  and expr s = match s with parser
      [< e1 = atom >] ->
        match rest s with
          None -> e1
        | Some (op, e2) -> ExpBin (e1, op, e2) ;;
val rest : lexeme Stream.t -> (bin_op * expression) option = <fun>
val atom : lexeme Stream.t -> expression = <fun>
val expr : lexeme Stream.t -> expression = <fun>
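These three functions assume the lexeme type and the abstract syntax
of expressions (ExpInt, ExpBin, PLUS, ...) introduced with the Basic
interpreter on page ??. As a reminder, and as a minimal assumption
sufficient to experiment with the parser above, stand-in definitions
could be:

(* Minimal stand-ins for the types assumed by rest, atom and expr;
   the definitions actually used appear with the Basic interpreter. *)
type lexeme = Lint of int | Lident of string | Lsymbol of string
type bin_op = PLUS | MINUS | MULT | DIV
type expression =
    ExpInt of int
  | ExpBin of expression * bin_op * expression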
The problem with using top-down parsing is that it forces us to
use a grammar which is very restricted in its form. Moreover, when the
object language is naturally described with a left-recursive grammar
(as in the case of infix expressions) it is not always trivial to find
an equivalent grammar (i.e. one defining the same language) that
satisfies the requirements of top-down parsing. This is the reason why
tools such as yacc and ocamlyacc use a bottom-up
parsing mechanism which allows the definition of more natural-looking
grammars. We will see, however, that not everything is possible
with them, either.
Bottom-up Parsing
On page ??, we introduced intuitively the
actions of bottom-up parsing: shift and
reduce. Each of these actions modifies the state of the
stack. From the sequence of actions we can deduce which grammar
rules were used, provided the grammar allows it, as in the case of
top-down parsing. Here also the difficulty lies in the
non-determinism of the rules, which prevents choosing between
shifting and reducing. We are going to illustrate the inner workings
of bottom-up parsing and its failures by considering those pervasive
arithmetic expressions, first in postfix and then in infix notation.
The Good News
The simplified grammar for postfix arithmetic expressions is:
Expr ::= integer        (R1)
      |  Expr Expr +    (R2)
      |  Expr Expr *    (R3)
This grammar is dual to that of prefix expressions: it is necessary
to wait until the end of each analysis to know which rule
has been used, but then one knows exactly what to
do. In fact, the bottom-up analysis of such expressions resembles
quite closely a stack-based evaluation mechanism. Instead of pushing
the results of each calculation, we simply push the grammar
symbols. The idea is to start with an empty stack, then obtain a stack
which contains only the start symbol once the input is used
up. The modifications to the stack are the following: when we shift,
we push the lexical unit just read; if we may reduce, it is because the
first elements in the stack match the right-hand side of a rule (in
reverse order), in which case we replace these elements by the
corresponding left-hand non-terminal.
Figure 11.2 illustrates how bottom-up parsing processes the
expression 1 2 + 3 * 4 +. The end of input is noted with a
$ sign; each row shows the configuration reached after the action(s)
listed in the first column.

Action               Input              Stack
                     1 2 + 3 * 4 + $    []
Shift                2 + 3 * 4 + $      [1]
Reduce (R1)          2 + 3 * 4 + $      [Expr]
Shift                + 3 * 4 + $        [2 Expr]
Reduce (R1)          + 3 * 4 + $        [Expr Expr]
Shift, Reduce (R2)   3 * 4 + $          [Expr]
Shift, Reduce (R1)   * 4 + $            [Expr Expr]
Shift, Reduce (R3)   4 + $              [Expr]
Shift, Reduce (R1)   + $                [Expr Expr]
Shift, Reduce (R2)   $                  [Expr]

Figure 11.2: Bottom-up parsing example
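The stack discipline of figure 11.2 is easy to reproduce in a few
lines. The sketch below is our own illustration with an ad hoc token
type, not the automaton that ocamlyacc would generate: it reduces
whenever the top of the stack matches the right-hand side of a rule
and shifts otherwise, a strategy that works for this postfix grammar
precisely because the operator arrives last.

type token = Int of int | Plus | Times

(* Stack symbols: the non-terminal Expr, or a shifted lexical unit. *)
type symbol = Expr | Term of token

let recognize_postfix tokens =
  let rec step stack input =
    match stack, input with
    | Term (Int _) :: rest, _ ->                  (* reduce (R1) *)
        step (Expr :: rest) input
    | Term Plus :: Expr :: Expr :: rest, _ ->     (* reduce (R2) *)
        step (Expr :: rest) input
    | Term Times :: Expr :: Expr :: rest, _ ->    (* reduce (R3) *)
        step (Expr :: rest) input
    | _, tok :: rest ->                           (* shift *)
        step (Term tok :: stack) rest
    | _, [] ->                  (* accept iff only the start symbol remains *)
        stack = [Expr]
  in
  step [] tokens

(* recognize_postfix [Int 1; Int 2; Plus; Int 3; Times; Int 4; Plus]
   evaluates to true, following exactly the trace of figure 11.2. *)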
The Bad News
The difficulty in translating a grammar into a recognition program
lies in determining which type of action to apply. We will illustrate
this difficulty with three examples which generate three types of
indeterminacy.
The first example is a grammar for expressions using only addition:
E0 ::= integer      (R1)
    |  E0 + E0      (R2)
The indeterminacy in this grammar stems from rule (R2). Let's suppose
the following situation:
Action    Input    Stack
  ...
          + ...    [E0 + E0 ...]
  ...
In such a case, it is impossible to determine whether we have to shift
(push the +) or to reduce, using (R2), the two E0's and the
+ already in the stack. We are in the presence of a shift-reduce
conflict. This is because the expression integer +
integer + integer can be produced in two ways by
rightmost derivation.
First way:

E0 →(R2) E0 + E0
   →(R1) E0 + integer
   →(R2) E0 + E0 + integer
   etc.

Second way:

E0 →(R2) E0 + E0
   →(R2) E0 + E0 + E0
   →(R1) E0 + E0 + integer
   etc.
The expressions obtained by each derivation may look similar from the
point of view of expression evaluation:
(integer + integer) + integer and
integer + (integer + integer),
but they differ when building a syntax tree (see figure
6.3 on page ??).
The second instance of a grammar generating a conflict between
shifting and reducing has the same type of ambiguity: an implicit
parenthesizing. But contrary to the previous case, the choice between
shifting and reducing modifies the meaning of the parsed
expression. Let's consider the following grammar:
E1 ::= integer      (R1)
    |  E1 + E1      (R2)
    |  E1 * E1      (R3)
We find in this grammar the above-mentioned conflict both for +
and for *, but there is an added conflict between + and
*. Here again, an expression may be produced in two ways: there
are two rightmost derivations of
integer + integer * integer.
First way:

E1 →(R3) E1 * E1
   →(R1) E1 * integer
   →(R2) E1 + E1 * integer
   etc.

Second way:

E1 →(R2) E1 + E1
   →(R3) E1 + E1 * E1
   →(R1) E1 + E1 * integer
   etc.
Here the two implicit parenthesizations are not equivalent:

(integer + integer) * integer  ≠  integer + (integer * integer)
This problem has already been cited for Basic expressions (see page
??). It was solved by attributing different
precedence to each operator: we reduce (R3) before (R2), which
is equivalent to parenthesizing products.
We can also solve the problem of choosing between + and *
by modifying the grammar. We introduce two new non-terminals: T
(for terms) and F (for factors), which gives:
E ::= E + T        (R1)
   |  T            (R2)
T ::= T * F        (R3)
   |  F            (R4)
F ::= integer      (R5)
There is now but a single way to produce the sequence
integer + integer * integer: using rule (R1).
The third example concerns conditional instructions in programming
languages. A language such as Pascal offers two conditionals:
if .. then and if .. then .. else. Let's imagine the
following grammar:

Instr ::= if Exp then Instr                 (R1)
       |  if Exp then Instr else Instr      (R2)
       |  etc.
In the following situation:

Action    Input       Stack
  ...
          else ...    [Instr then Exp if ...]
  ...
We cannot decide whether the first elements in the stack correspond to
the conditional of rule (R1), in which case they must be reduced, or
to the first Instr in rule (R2), in which case we must shift.
Besides shift-reduce conflicts, bottom-up parsing may also generate
reduce-reduce conflicts.
We now introduce the ocamlyacc tool, which uses the bottom-up
parsing technique and detects these conflicts.
The ocamlyacc Tool
The ocamlyacc tool is built on the same principles as
ocamllex: it takes as input a file describing a grammar whose rules
have semantic actions attached, and returns two Objective CAML files with
extensions .ml and .mli containing a parsing function
and its interface.
General format
The syntax description files for ocamlyacc use the extension
.mly by convention and have the following structure:

%{
  header
%}
  declarations
%%
  rules
%%
  trailer-and-end
The rule format is:

non-terminal : symbol ... symbol    { semantic action }
             | ...
             | symbol ... symbol    { semantic action }
             ;
A symbol is either a terminal or a non-terminal. Sections ``header''
and ``trailer-and-end'' play the same role as in ocamllex, with
the one exception that the header is visible only to the rules and
not to the declarations. In particular, this implies that module openings
(open) are not taken into account in the declaration
part, and types there must be fully qualified.
Semantic actions
Semantic actions are pieces of Objective CAML code executed when the parser
reduces the rule they are associated with. The body of a semantic
action may reference the components of the right-hand term of the
rule. These are numbered from left to right starting with 1. The first
component is referenced by $1, the second by $2, etc.
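For instance, in a hypothetical grammar with tokens INT (carrying an
int value) and PLUS, a rule for addition could compute the value of
an expression directly in its semantic actions; $1 and $3 below refer
to the values of the two sub-expressions:

expr : expr PLUS expr   { $1 + $3 }   /* sum of the two components */
     | INT              { $1 }        /* value carried by the token */
     ;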
Start Symbols
We may declare several start symbols in the grammar, by writing
in the declaration section:
%start non-terminal .. non-terminal
For each of them a parsing function will be generated. We must also
declare, again in the declaration section, the output type of these
functions:
%type <output-type> non-terminal
The output-type
must be qualified.
Warning
Non-terminal symbols become the names of the parsing functions.
Therefore, they must not start with a capital letter, which is
reserved for constructors.
Lexical units
Grammar rules make reference to lexical units, the terminal
symbols of the rules.
One (or several) lexemes are declared in the following fashion:
%token PLUS MINUS MULT DIV MOD
Certain lexical units, like identifiers, represent a set of
(character) strings. When we find an identifier we may be interested
in recovering its character string. We specify in the parser that
these lexemes have an associated value by enclosing the type of this
value between < and >:

%token <string> IDENT
Warning
After being processed by ocamlyacc all these declarations are
transformed into constructors of type token. Therefore,
they must start with a capital letter.
We may use character strings as implicit terminals as in:
expr : expr "+" expr { ... }
| expr "*" expr { ... }
| ...
;
in which case it is pointless to declare a symbol which represents
them: they are directly processed by the parser without passing
through the lexer. In the interest of uniformity, we do not advise
this procedure.
Precedence, associativity
We have seen that many bottom-up parsing conflicts arise from implicit
operator associativity rules or precedence conflicts between operators. To
handle these conflicts, we may declare the default associativity of
operators (left, right, or non-associative) as well as
precedence rules. The following declaration states that the operators
+ (lexeme PLUS) and * (lexeme MULT) associate to
the left, and that * has a higher precedence than +
because MULT is declared after PLUS.

%left PLUS
%left MULT
Two operators declared on the same line have the same precedence.
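Putting these declarations together, a small self-contained grammar
might look as follows (a sketch, with token and rule names of our own
choosing). Thanks to the %left declarations, the ambiguous rules for
+ and * no longer need to be rewritten in the stratified style shown
earlier:

%token <int> INT
%token PLUS MULT EOF
%left PLUS              /* left-associative, lowest precedence */
%left MULT              /* left-associative, binds tighter than PLUS */
%start main
%type <int> main
%%
main : expr EOF         { $1 } ;
expr : expr PLUS expr   { $1 + $3 }
     | expr MULT expr   { $1 * $3 }
     | INT              { $1 }
     ;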
Command options
ocamlyacc has two options:
-
-b name: the generated Objective CAML files are name.ml
and name.mli;
- -v: create a file with extension .output containing the
numbering of the rules, the states of the automaton recognizing the
grammar, and the sources of conflicts.
Joint usage with ocamllex
We may compose both tools ocamllex and ocamlyacc so that
the transformation of a character stream into a lexeme stream is the
input to the parser. To do this, type lexeme should be known
to both. This type is defined in the files with extensions .mli
and .ml generated by ocamlyacc from the declaration of
the tokens in the matching file with extension .mly. The .mll file imports this type; ocamllex
translates this file into an Objective CAML function of type
Lexing.lexbuf -> lexeme. The example on page
?? illustrates this interaction and describes the
different phases of compilation.
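In practice the wiring is short. Assuming (hypothetically) a grammar
file parser.mly whose start symbol is main and a lexer file lexer.mll
defining a rule token, the modules Parser and Lexer generated by the
two tools compose as follows:

(* Hypothetical driver combining the generated modules: the parsing
   function takes the lexing function and a lexer buffer. *)
let () =
  let lexbuf = Lexing.from_channel stdin in
  let result = Parser.main Lexer.token lexbuf in
  print_int result; print_newline ()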
Contextual Grammars
The parsers generated by ocamlyacc process languages produced by
so-called context-free grammars. A parser for such a grammar
does not depend on previously processed syntactic values to process
the next lexeme. This is not the case for the language L described by
the following formula:

L ::= wCw | w   with w ∈ (A|B)*

where A, B and C are terminal symbols. We have written wCw
(with w ∈ (A|B)*) and not simply (A|B)*C(A|B)* because we
want the same word to the left and right of the middle C.
To parse the words in L, we must remember what has already
been found before letter C to verify that we find exactly the same
thing afterwards. Here is a solution for this problem based on
``visiting'' a stream. The general idea of the algorithm is to build a
stream parsing function which will recognize exactly the subword
before the possible occurrence of C.
We use the type:
# type token = A | B | C ;;
Function parse_w1 builds the function memorizing the first w,
in the guise of a list of atomic stream parsers (each recognizing a
single token):
# let rec parse_w1 s = match s with parser
      [< 'A; l = parse_w1 >] -> (parser [< 'A >] -> "a") :: l
    | [< 'B; l = parse_w1 >] -> (parser [< 'B >] -> "b") :: l
    | [< >] -> [] ;;
val parse_w1 : token Stream.t -> (token Stream.t -> string) list = <fun>
The result of each of the functions returned by parse_w1 is simply
the character string containing the parsed lexical unit.
Function parse_w2 takes as argument a list built by
parse_w1 to compose each of its elements into a single
parsing function:
# let rec parse_w2 l = match l with
      p :: pl -> (parser [< x = p; l = (parse_w2 pl) >] -> x ^ l)
    | [] -> parser [<>] -> "" ;;
val parse_w2 : ('a Stream.t -> string) list -> 'a Stream.t -> string = <fun>
The result of applying parse_w2 will be the string
representing subword w. By construction, function parse_w2
will not be able to recognize anything but the subword visited by
parse_w1.
Using the ability to name intermediate results in streams, we
write the recognition function for the words in the language L:
# let parse_L = parser
      [< l = parse_w1; 'C; r = (parse_w2 l) >] -> r ;;
val parse_L : token Stream.t -> string = <fun>
Here are two small examples. The first results in the string
surrounding C, the second fails because the words surrounding C
are different:
# parse_L [< 'A; 'B; 'B; 'C; 'A; 'B; 'B >] ;;
- : string = "abb"
# parse_L [< 'A; 'B; 'C; 'B; 'A >] ;;
Uncaught exception: Stream.Error("")