Basic Revisited

Basic Revisited

We now want to use ocamllex and ocamlyacc to replace function parse on page ?? for Basic by some functions generated from files specifying the lexicon and syntax of the language.

To do this, we may not re-use as-is the type of lexical units that we have defined. We will be forced to define a more precise type which permits us to distinguish between operators, commands and keywords.

We will also need to isolate the type declarations describing abstract syntax within a file basic_types.mli. This will contain the declaration of type sentences and all types needed by it.

File `basic_parser.mly`

Header

The file header imports the types needed for the abstract syntax as well as two auxiliary functions to convert from character strings to their equivalent in the abstract syntax.


%{
open Basic_types ;;

let phrase_of_cmd c =
 match c with
   "RUN" -> Run
 | "LIST" -> List
 | "END" -> End
 | _ -> failwith "line : unexpected command"
;;

let bin_op_of_rel r =
 match r with
   "=" -> EQUAL
 | "<" -> INF
 | "<=" -> INFEQ
 | ">" -> SUP
 | ">=" -> SUPEQ
 | "<>" -> DIFF
 | _ -> failwith "line : unexpected relation symbol"
;;

%}

Declarations

contains three sections: lexeme declarations, their rule associativity and precedence declarations, and the declaration of the start symbol line which stands for the parsing of a command or program line.

Lexical units are the following:


 %token <int> Lint
 %token <string> Lident
 %token <string> Lstring
 %token <string> Lcmd
 %token Lplus Lminus Lmult Ldiv Lmod
 %token <string> Lrel
 %token Land Lor Lneg
 %token Lpar Rpar
 %token <string> Lrem 
 %token Lrem Llet Lprint Linput Lif Lthen Lgoto
 %token Lequal
 %token Leol

Their names are self-explanatory and they are described in file basic_lexer.mll (see page ??).

Precedence rules between operators once again take the values assigned by functions priority_uop and priority_binop defined when first giving the grammar for our Basic (see page ??).


 %right Lneg
 %left Land Lor
 %left Lequal Lrel
 %left Lmod
 %left Lplus Lminus
 %left Lmult Ldiv
 %nonassoc Lop

Symbol Lop will be used to process unary minus. It is not a terminal in the grammar, but a ``pseudo non-terminal'' which allows overloading of operators when two uses of an operator should not receive the same precedence depending on context. This is the case with the minus symbol (-). We will reconsider this point once we have specified the rules in the grammar.

Since the start symbol is line, the function generated will return the syntax tree for the parsed line.

 %start line
 %type <Basic_types.phrase> line

Grammar rules

are decomposed into three non-terminals: line for a line; inst for an instruction in the language; exp for expressions. The action associated with each rule simply builds the corresponding abstract syntax tree.


 %%
line : 
   Lint inst Leol               { Line {num=$1; inst=$2} }
 | Lcmd Leol                    { phrase_of_cmd $1 }
 ;

inst :
   Lrem                         { Rem $1 }
 | Lgoto Lint                   { Goto $2 }
 | Lprint exp                   { Print $2 }
 | Linput Lident                { Input $2 }
 | Lif exp Lthen Lint           { If ($2, $4) }
 | Llet Lident Lequal exp        { Let ($2, $4) }
 ;

exp :
   Lint                         { ExpInt $1 }
 | Lident                       { ExpVar $1 }
 | Lstring                      { ExpStr $1 }
 | Lneg exp                     { ExpUnr (NOT, $2) }
 | exp Lplus exp                { ExpBin ($1, PLUS, $3) }
 | exp Lminus exp               { ExpBin ($1, MINUS, $3) }
 | exp Lmult exp                { ExpBin ($1, MULT, $3) }
 | exp Ldiv exp                 { ExpBin ($1, DIV, $3) }
 | exp Lmod exp                 { ExpBin ($1, MOD, $3) }
 | exp Lequal exp                { ExpBin ($1, EQUAL, $3) }
 | exp Lrel exp                 { ExpBin ($1, (bin_op_of_rel $2), $3) }
 | exp Land exp                 { ExpBin ($1, AND, $3) }
 | exp Lor exp                  { ExpBin ($1, OR, $3) }
 | Lminus exp %prec Lop        { ExpUnr(OPPOSITE, $2) }
 | Lpar exp Rpar                { $2 }
 ;
 %%

These rules do not call for particular remarks except:

exp :
 ...
 | Lminus exp %prec Lop { ExpUnr(OPPOSITE, $2) }

It concerns the use of unary -. Keyword %prec that we find in it declares that this rule should receive the precedence of Lop (here the highest precedence).

File `basic_lexer.mll`

Lexical analysis only contains one rule, lexer, which corresponds closely to the old function lexer (see page ??). The semantic action associated with the recognition of each lexical unit is simply the emission of the related constructor. As the type of lexical units is declared in the syntax rule file, we have to include the file here. We add a simple auxiliary function that strips double quotation marks from character strings.


{
 open Basic_parser ;;

 let string_chars s = String.sub s 1 ((String.length s)-2) ;;
}

rule lexer = parse
   [' ' '\t']            { lexer lexbuf }

 | '\n'                  { Leol }

 | '!'                   { Lneg }
 | '&'                   { Land }
 | '|'                   { Lor }
 | '='                   { Lequal }
 | '%'                   { Lmod }
 | '+'                   { Lplus }
 | '-'                   { Lminus }
 | '*'                   { Lmult }
 | '/'                   { Ldiv }

 | ['<' '>']             { Lrel (Lexing.lexeme lexbuf) }
 | "<="                  { Lrel (Lexing.lexeme lexbuf) }
 | ">="                  { Lrel (Lexing.lexeme lexbuf) }

 | "REM" [^ '\n']*       { Lrem (Lexing.lexeme lexbuf) }
 | "LET"                 { Llet }
 | "PRINT"               { Lprint }
 | "INPUT"               { Linput }
 | "IF"                  { Lif }
 | "THEN"                { Lthen }
 | "GOTO"                { Lgoto }

 | "RUN"                { Lcmd (Lexing.lexeme lexbuf) }
 | "LIST"               { Lcmd (Lexing.lexeme lexbuf) }
 | "END"                { Lcmd (Lexing.lexeme lexbuf) }
 
 | ['0'-'9']+           { Lint (int_of_string (Lexing.lexeme lexbuf)) }
 | ['A'-'z']+           { Lident (Lexing.lexeme lexbuf) }
 | '"' [^ '"']* '"'     { Lstring (string_chars (Lexing.lexeme lexbuf)) }

Note that we isolated symbol = which is used in both expressions and assignments.

Only two of these regular expressions need further remarks. The first concerns comment lines ("REM" [^ '\n']*). This rule recognizes keyword REM followed by an arbitrary number of characters other than '\n'. The second remark concerns character strings ('"' [^ '"']* '"') considered as sequences of characters different from " and contained between two ".

Compiling, Linking

The compilation of the lexer and parser must be carried out in a definite order. This is due to the mutual dependency between the declaration of lexemes. To compile our example, we must enter the following sequence of commands:

ocamlc -c basic_types.mli
ocamlyacc basic_parser.mly
ocamllex basic_lexer.mll
ocamlc -c basic_parser.mli
ocamlc -c basic_lexer.ml
ocamlc -c basic_parser.ml

Which will generate files basic_lexer.cmo and basic_parser.cmo which may be linked into an application.

We now have at our disposal all the material needed to reimplement the application.

We suppress all types and all functions in paragraphs ``lexical analysis'' (on page ??) and ``parsing'' ( on page ??) of our Basic application; in function one_command (on page ??), we replace expression


    match parse (input_line stdin) with

with


    match line lexer (Lexing.from_string ((input_line stdin)^"\n")) with

We need to remark that we must put back at the end of the line the character '\n' which function input_line had filtered out. This is necessary because the '\n' character indicates the end of a command line (Leol).

Basic Revisited

File basic_parser.mly

Header

Declarations

Grammar rules

File basic_lexer.mll

Compiling, Linking

File `basic_parser.mly`

File `basic_lexer.mll`