Previous Contents Next

Database queries

The implementation of a database, its interface, and its query language is a project far too ambitious for the scope of this book and for the Objective CAML knowledge of the reader at this point. However, restricting the problem and using the functional programming style at its best allows us to create an interesting tool for query processing. For instance, we show how to use iterators as well as partial application to formulate and execute queries. We also show the use of a data type encapsulating functional values.

For this application, we use as an example a database on the members of an association. It is presumed to be stored in the file association.dat.

Data format

Most database programs use a ``proprietary'' format to store the data they manipulate. However, it is usually possible to store the data as some text that has the following structure: The association data file starts with:
Num|Lastname|Firstname|Address|Tel|Email|Pref|Date|Amount
0:Chailloux:Emmanuel:Université P6:0144274427:[email protected]:email:25.12.1998:100.00
1:Manoury:Pascal:Laboratoire PPS::[email protected]:mail:03.03.1997:150.00
2:Pagano:Bruno:Cristal:0139633963::mail:25.12.1998:150.00
3:Baro:Sylvain::0144274427:[email protected]:email:01.03.1999:50.00
The meaning of the fields is the following: We need to decide what represention the program should use internally for a database. We could use either a list of cards or an array of cards. On the one hand, a list has the nice property of being easily modified: adding and removing a card are simple operations. On the other hand, an array allows constant access time to any card. Since our goal is to work on all the cards and not on some of them, each query accesses all the cards. Thus a list is a good choice. The same issue arises concerning the cards themselves: should they be lists or arrays of strings? This time an array is a good choice, since the format of a card is fixed for the whole database. It not possible to add a new field. Since a query might access only a few fields, it is important for this access to be fast.

The most natural solution for a card would be to use an array indexed by the names of the fields. Since such a type is not available in Objective CAML, we can use an array (indexed by integers) and a function associating a field name with the array index corresponding to the field.

# type data_card = string array ;;
# type data_base = { card_index : string -> int ; data : data_card list } ;;


Access to the field named n of a card dc of the database db is implemented by the function:

# let field db n (dc : data_card) = dc.(db.card_index n) ;;
val field : data_base -> string -> data_card -> string = <fun>
The type of dc has been set to data_card to constrain the function field to only accept string arrays and not arrays of other types.

Here is a small example:

# let base_ex =
{ data = [ [|"Chailloux"; "Emmanuel"|] ; [|"Manoury"; "Pascal"|] ] ;
card_index = function "Lastname"->0 | "Firstname"->1
| _->raise Not_found } ;;
val base_ex : data_base =
{card_index=<fun>;
data=[[|"Chailloux"; "Emmanuel"|]; [|"Manoury"; "Pascal"|]]}
# List.map (field base_ex "Lastname") base_ex.data ;;
- : string list = ["Chailloux"; "Manoury"]


The expression field base_ex "Lastname" evaluates to a function which takes a card and returns the value of its "Lastname" field. The library function List.map applies the function to each card of the database base_ex, and returns the list of the results: a list of the "Lastname" fields of the database.

This example shows how we wish to use the functional style in our program. Here, the partial application of field allows us to define an access function for a given field, which we can use on any number of cards. This also shows us that the implementation of the field function is not very efficient, since although we are always accessing the same field, its index is computed for each access. The following implementation is better:

# let field base name =
let i = base.card_index name in fun (card : data_card) -> card.(i) ;;
val field : data_base -> string -> data_card -> string = <fun>
Here, after applying the function to two arguments, the index of the field is computed and is used for any subsequent application.

Reading a database from a file

As seen from Objective CAML, a file containing a database is just a list of lines. The first work that needs to be done is to read each line as a string, split it into smaller parts according to the separating character, and then extract the corresponding data as well as the field indexing function.

Tools for processing a line

We need a function split that splits a string at every occurrence of some separating character. This function uses the function suffix which returns the suffix of a string s after some position i. To do this, we use three predefined functions:

# let suffix s i = try String.sub s i ((String.length s)-i)
with Invalid_argument("String.sub") -> "" ;;
val suffix : string -> int -> string = <fun>
# let split c s =
let rec split_from n =
try let p = String.index_from s n c
in (String.sub s n (p-n)) :: (split_from (p+1))
with Not_found -> [ suffix s n ]
in if s="" then [] else split_from 0 ;;
val split : char -> string -> string list = <fun>


The only remarkable characteristic in this implementation is the use of exceptions, specifically the exception Not_found.

Computing the data_base structure
There is no difficulty in creating an array of strings from a list of strings, since this is what the of_list function in the Array module does. It might seem more complicated to compute the index function from a list of field names, but the List module provides all the needed tools.

Starting from a list of strings, we need to code a function that associates each string with an index corresponding to its position in the list.

# let mk_index list_names =
let rec make_enum a b = if a > b then [] else a::(make_enum (a+1) b) in
let list_index = (make_enum 0 ((List.length list_names) - 1)) in
let assoc_index_name = List.combine list_names list_index in
function name -> List.assoc name assoc_index_name ;;
val mk_index : 'a list -> 'a -> int = <fun>
To create the association function between field names and indexes, we combine the list of indexes and the list of names to obtain a list of associations of the type string * int list. To look up the index associated with a name, we use the function assoc from the List library. The function mk_index returns a function that takes a name and calls assoc on this name and the previously built association list.

It is now possible to create a function that reads a file of the given format.

# let read_base filename =
let channel = open_in filename in
let split_line = split ':' in
let list_names = split '|' (input_line channel) in
let rec read_file () =
try
let data = Array.of_list (split_line (input_line channel )) in
data :: (read_file ())
with End_of_file -> close_in channel ; []
in
{ card_index = mk_index list_names ; data = read_file () } ;;
val read_base : string -> data_base = <fun>
The auxiliary function read_file reads records from the file, and works recursively on the input channel. The base case of the recursion corresponds to the end of the file, signaled by the End_of_file exception. In this case, the empty list is returned after closing the channel.

The association's file can now be loaded:

# let base_ex = read_base "association.dat" ;;
val base_ex : data_base =
{card_index=<fun>;
data=
[[|"0"; "Chailloux"; "Emmanuel"; "Universit\233 P6"; "0144274427";
"[email protected]"; "email"; "25.12.1998"; "100.00"|];
[|"1"; "Manoury"; "Pascal"; "Laboratoire PPS"; ...|]; ...]}


General principles for database processing

The effectiveness and difficulty of processing the data in a database is proportional to the power and complexity of the query language. Since we want to use Objective CAML as query language, there is no limit a priori on the requests we can express! However, we also want to provide some simple tools to manipulate cards and their data. This desire for simplicity requires us to limit the power of the Objective CAML language, through the use of general goals and principles for database processing.

The goal of database processing is to obtain a state of the database. Building such a state may be decomposed into three steps:
  1. selecting, according to some given criterion, a set of cards;
  2. processing each of the selected cards;
  3. processing all the data collected on the cards.
Figure 6.1 illustrates this decomposition.


Figure 6.1: Processing a request.


According to this decomposition, we need three functions of the following types:
  1. (data_card -> bool) -> data_card list -> data_card list
  2. (data_card -> 'a) -> data_card list -> 'a list
  3. ('a -> 'b -> 'b) -> 'a list -> 'b -> 'b
Objective CAML provides us with three higher-order function, also known as iterators, introduced page ??, that satisfy our specification:

# List.find_all ;;
- : ('a -> bool) -> 'a list -> 'a list = <fun>
# List.map ;;
- : ('a -> 'b) -> 'a list -> 'b list = <fun>
# List.fold_right ;;
- : ('a -> 'b -> 'b) -> 'a list -> 'b -> 'b = <fun>
We will be able to use them to implement the three steps of building a state by choosing the functions they take as an argument.

For some special requests, we will also use:

# List.iter ;;
- : ('a -> unit) -> 'a list -> unit = <fun>
Indeed, if the required processing consists only of displaying some data, there is nothing to compute.

In the next paragraphs, we are going to see how to define functions expressing simple selection criteria, as well as simple queries. We conclude this section with a short example using these functions according to the principles stated above.

Selection criteria

Concretely, the boolean function corresponding to the selection criterion of a card is a boolean combination of properties of some or all of the fields of the card. Each field of a card, even though it is a string, can contain some information of another type: a float, a date, etc.

Selection criteria on a field

Selecting on some field is usually done using a function of the type data_base -> 'a -> string -> data_card -> bool. The 'a type parameter corresponds to the type of the information contained in the field. The string argument corresponds to the name of the field.

String fields
We define two simple tests on strings: equality with another string, and non-emptiness.

# let eq_sfield db s n dc = (s = (field db n dc)) ;;
val eq_sfield : data_base -> string -> string -> data_card -> bool = <fun>
# let nonempty_sfield db n dc = ("" <> (field db n dc)) ;;
val nonempty_sfield : data_base -> string -> data_card -> bool = <fun>


Float fields
To implement tests on data of type float, it is enough to translate the string representation of a decimal number into its float value. Here are some examples obtained from a generic function tst_ffield:

# let tst_ffield r db v n dc = r v (float_of_string (field db n dc)) ;;
val tst_ffield :
('a -> float -> 'b) -> data_base -> 'a -> string -> data_card -> 'b = <fun>
# let eq_ffield = tst_ffield (=) ;;
# let lt_ffield = tst_ffield (<) ;;
# let le_ffield = tst_ffield (<=) ;;
(* etc. *)
These three functions have type:

data_base -> float -> string -> data_card -> bool.

Dates
This kind of information is a little more complex to deal with, as it depends on the representation format of dates, and requires that we define date comparison.

We decide to represent dates in a card as a string with format dd.mm.yyyy. In order to be able to define additional comparisons, we also allow the replacement of the day, month or year part with the underscore character ('_'). Dates are compared according to the lexicographic order of lists of integers of the form [year; month; day]. To express queries such as: ``is before July 1998'', we use the date pattern: "_.07.1998". Comparing a date with a pattern is accomplished with the function tst_dfield which analyses the pattern to create the ad hoc comparison function. To define this generic test function on dates, we need a few auxiliary functions.

We first code two conversion functions from dates (ints_of_string) and date patterns (ints_of_dpat) to lists of ints. The character '_' of a pattern will be replaced by the integer 0:

# let split_date = split '.' ;;
val split_date : string -> string list = <fun>
# let ints_of_string d =
try match split_date d with
[d;m;y] -> [int_of_string y; int_of_string m; int_of_string d]
| _ -> failwith "Bad date format"
with Failure("int_of_string") -> failwith "Bad date format" ;;
val ints_of_string : string -> int list = <fun>

# let ints_of_dpat d =
let int_of_stringpat = function "_" -> 0 | s -> int_of_string s
in try match split_date d with
[d;m;y] -> [ int_of_stringpat y; int_of_stringpat m;
int_of_stringpat d ]
| _ -> failwith "Bad date format"
with Failure("int_of_string") -> failwith "Bad date pattern" ;;
val ints_of_dpat : string -> int list = <fun>


Given a relation r on integers, we now code the test function. It simply consists of implementing the lexicographic order, taking into account the particular case of 0:

# let rec app_dtst r d1 d2 = match d1, d2 with
[] , [] -> false
| (0::d1) , (_::d2) -> app_dtst r d1 d2
| (n1::d1) , (n2::d2) -> (r n1 n2) || ((n1 = n2) && (app_dtst r d1 d2))
| _, _ -> failwith "Bad date pattern or format" ;;
val app_dtst : (int -> int -> bool) -> int list -> int list -> bool = <fun>


We finally define the generic function tst_dfield which takes as arguments a relation r, a database db, a pattern dp, a field name nm, and a card dc. This function checks that the pattern and the field from the card satisfy the relation.

# let tst_dfield r db dp nm dc =
r (ints_of_dpat dp) (ints_of_string (field db nm dc)) ;;
val tst_dfield :
(int list -> int list -> 'a) ->
data_base -> string -> string -> data_card -> 'a = <fun>


We now apply it to three relations.

# let eq_dfield = tst_dfield (=) ;;
# let le_dfield = tst_dfield (<=) ;;
# let ge_dfield = tst_dfield (>=) ;;
These three functions have type:
data_base -> string -> string -> data_card -> bool.

Composing criteria

The tests we have defined above all take as first arguments a database, a value, and the name of a field. When we write a query, the value of these three arguments are known. For instance, when we work on the database base_ex, the test ``is before July 1998'' is written

# ge_dfield base_ex "_.07.1998" "Date" ;;
- : data_card -> bool = <fun>


Thus, we can consider a test as a function of type data_card -> bool. We want to obtain boolean combinations of the results of such functions applied to a given card. To this end, we implement the iterator:

# let fold_funs b c fs dc =
List.fold_right (fun f -> fun r -> c (f dc) r) fs b ;;
val fold_funs : 'a -> ('b -> 'a -> 'a) -> ('c -> 'b) list -> 'c -> 'a = <fun>
Where b is the base value, the function c is the boolean operator, fs is the list of test functions on a field, and dc is a card.

We can obtain the conjunction and the disjunction of a list of tests with:

# let and_fold fs = fold_funs true (&) fs ;;
val and_fold : ('a -> bool) list -> 'a -> bool = <fun>
# let or_fold fs = fold_funs false (or) fs ;;
val or_fold : ('a -> bool) list -> 'a -> bool = <fun>


We easily define the negation of a test:

# let not_fun f dc = not (f dc) ;;
val not_fun : ('a -> bool) -> 'a -> bool = <fun>


For instance, we can use these combinators to define a selection function for cards whose date field is included in a given range:

# let date_interval db d1 d2 =
and_fold [(le_dfield db d1 "Date"); (ge_dfield db d2 "Date")] ;;
val date_interval : data_base -> string -> string -> data_card -> bool =
<fun>


Processing and computation

It is difficult to guess how a card might be processed, or the data that would result from that processing. Nevertheless, we can consider two common cases: numerical computation and data formatting for printing. Let's take an example for each of these two cases.

Data formatting

In order to print, we wish to create a string containing the name of a member of the association, followed by some information.

We start with a function that reverses the splitting of a line using a given separating character:

# let format_list c =
let s = String.make 1 c in
List.fold_left (fun x y -> if x="" then y else x^s^y) "" ;;
val format_list : char -> string list -> string = <fun>


In order to build the list of fields we are interested in, we code the function extract that returns the fields associated with a given list of names in a given card:

# let extract db ns dc =
List.map (fun n -> field db n dc) ns ;;
val extract : data_base -> string list -> data_card -> string list = <fun>


We can now write the line formatting function:

# let format_line db ns dc =
(String.uppercase (field db "Lastname" dc))
^" "^(field db "Firstname" dc)
^"\t"^(format_list '\t' (extract db ns dc))
^"\n" ;;
val format_line : data_base -> string list -> data_card -> string = <fun>
The argument ns is the list of requested fields. In the resulting string, fields are separated by a tab ('\t') and the string is terminated with a newline character.

We display the list of last and first names of all members with:

# List.iter print_string (List.map (format_line base_ex []) base_ex.data) ;;
CHAILLOUX Emmanuel
MANOURY Pascal
PAGANO Bruno
BARO Sylvain
- : unit = ()


Numerical computation

We want to compute the total amount of received fees for a given set of cards. This is easily done by composing the extraction and conversion of the correct field with the addition. To get nicer code, we define an infix composition operator:

# let (++) f g x = g (f x) ;;
val ++ : ('a -> 'b) -> ('b -> 'c) -> 'a -> 'c = <fun>
We use this operator in the following definition:

# let total db dcs =
List.fold_right ((field db "Amount") ++ float_of_string ++ (+.)) dcs 0.0 ;;
val total : data_base -> data_card list -> float = <fun>
We can now apply it to the whole database:

# total base_ex base_ex.data ;;
- : float = 450


An example

To conclude, here is a small example of an application that uses the principles described in the paragraphs above.

We expect two kinds of queries on our database:

List of addresses

To create these lists, we first select the relevant cards according to the field "Pref", then we use the formatting function format_line:

# let mail_addresses db =
let dcs = List.find_all (eq_sfield db "mail" "Pref") db.data in
List.map (format_line db ["Mail"]) dcs ;;
val mail_addresses : data_base -> string list = <fun>

# let email_addresses db =
let dcs = List.find_all (eq_sfield db "email" "Pref") db.data in
List.map (format_line db ["Email"]) dcs ;;
val email_addresses : data_base -> string list = <fun>


State of received fees

Computing the state of the received fees uses the same technique: selection then processing. In this case however the processing part is twofold: line formatting followed by the computation of the total amount.

# let fees_state db d1 d2 =
let dcs = List.find_all (date_interval db d1 d2) db.data in
let ls = List.map (format_line db ["Date";"Amount"]) dcs in
let t = total db dcs in
ls, t ;;
val fees_state : data_base -> string -> string -> string list * float = <fun>
The result of this query is a tuple containing a list of strings with member information, and the total amount of received fees.

Main program

The main program is essentially an interactive loop that displays the result of queries asked by the user through a menu. We use here an imperative style, except for the display of the results which uses an iterator.

# let main() =
let db = read_base "association.dat" in
let finished = ref false in
while not !finished do
print_string" 1: List of mail addresses\n";
print_string" 2: List of email addresses\n";
print_string" 3: Received fees\n";
print_string" 0: Exit\n";
print_string"Your choice: ";
match read_int() with
0 -> finished := true
| 1 -> (List.iter print_string (mail_addresses db))
| 2 -> (List.iter print_string (email_addresses db))
| 3
-> (let d1 = print_string"Start date: "; read_line() in
let d2 = print_string"End date: "; read_line() in
let ls, t = fees_state db d1 d2 in
List.iter print_string ls;
print_string"Total: "; print_float t; print_newline())
| _ -> ()
done;
print_string"bye\n" ;;
val main : unit -> unit = <fun>


This example will be extended in chapter 21 with an interface using a web browser.

Further work

A natural extension of this example would consist of adding type information to every field of the database. This information would be used to define generic comparison operators with type data_base -> 'a -> string -> data_card -> bool where the name of the field (the third argument) would trigger the correct conversion and test functions.


Previous Contents Next