Previous Contents Next

Exploring Objective CAML values from C

The machine representation of Objective CAML values differs from that of C values, even for fundamental types such as integers. This is because the Objective CAML garbage collector needs to record additional information in values. Since Objective CAML values are represented uniformly, their representations all belong to the same C type, named (unsurprisingly) value.

When Objective CAML calls a C function, passing it one or several arguments, those arguments must be decoded before using them in the C function. Similarly, the result of this C function must be encoded before being returned to Objective CAML.

These conversions (decoding and encoding) are performed by a number of macros and C functions provided by the Objective CAML runtime system. These macros and functions are declared in the include files listed in figure 12.3. These include files are part of the Objective CAML installation, and can be found in the directory where Objective CAML libraries are installed4


   
caml/mlvalues.h definition of the value type and basic value conversion macros.
caml/alloc.h functions for allocating Objective CAML values.
caml/memory.h macros for interfacing with the Objective CAML garbage collector.

Figure 12.3: Include files for the C interface.


Classification of Objective CAML representations

An Objective CAML representation, that is, a C datum of type value, is one of: The Objective CAML heap is the memory area that is managed by the Objective CAML garbage collector. C code can also allocate and manipulate data structures in its own memory space, and communicate pointers to these data structures to Objective CAML.

Figure 12.4 shows the macros for classifying representations and converting between C integers and their Objective CAML representation.

   
Is_long(v) is v an Objective CAML integer?
Is_block(v) is v an Objective CAML pointer?
   
Long_val(v) extract the integer contained in v, as a C "long"
Int_val(v) extract the integer contained in v, as a C "int"
Bool_val(v) extract the boolean contained in v (0 if false, non-zero if true)

Figure 12.4: Classification of representations and conversion of immediate values.


Note that C offers several integer types of varying sizes (short, int, long, etc), while Objective CAML has only one integer type, int.

Accessing immediate values

All Objective CAML immediate values are represented as integers: The following program defines a C function inspect that inspects the representation of its argument:
#include <stdio.h>
#include <caml/mlvalues.h>
value inspect (value v)
{
if (Is_long(v))
printf ("v is an integer (%ld) : %ld", (long) v, Long_val(v));
else if (Is_block(v))
printf ("v is a pointer");
else
printf ("v is neither an integer nor a pointer (???)");
printf(" ");
fflush(stdout) ;
return v ;
}
The function inspect tests whether its argument is an Objective CAML integer. If so, it prints the integer twice, first viewed as a C long integer (without conversion), then converted by the Long_val macro, which extracts the actual integer represented in the argument.

On the following example, we see that the machine representation of integers in Objective CAML differs from that of C:

# external inspect : 'a -> 'a = "inspect" ;;
external inspect : 'a -> 'a = "inspect"
# inspect 123 ;;
v is an integer (247) : 123 - : int = 123
# inspect max_int;;
v is an integer (2147483647) : 1073741823 - : int = 1073741823
We can also inspect values of other predefined types, such as char and bool:

# inspect 'A' ;;
v is an integer (131) : 65 - : char = 'A'
# inspect true ;;
v is an integer (3) : 1 - : bool = true
# inspect false ;;
v is an integer (1) : 0 - : bool = false
# inspect [] ;;
v is an integer (1) : 0 - : '_a list = []


Consider the Objective CAML type foo defined thus:

# type foo = C1 | C2 of int | C3 | C4 ;;


The inspect function shows that constant constructors and non-constant constructors of this type are represented differently:

# inspect C1 ;;
v is an integer (1) : 0 - : foo = C1
# inspect C4 ;;
v is an integer (5) : 2 - : foo = C4
# inspect (C2 1) ;;
v is a pointer - : foo = C2 1


When the function inspect detects an immediate value, it prints first the ``physical'' representation of this value (i.e. the representation viewed as a word-sized C integer of C type long); then it prints the ``logical'' contents of this value (i.e. the Objective CAML integer it represents, as returned by the decoding macro Long_val). The examples above show that the ``physical'' and the ``logical'' contents differ. This difference is due to the tag bit6 used by the garbage collector to distinguish immediate values from pointers (see chapter 9, page ??).

Representation of structured values

Non-immediate Objective CAML values are said to be structured values. Those values are allocated in the Objective CAML heap and represented as a pointer to the corresponding memory block. All memory blocks contain a header word indicating the kind of the block as well as its size expressed in machine words. Figure 12.5 shows the structure of a block for a 32-bit machine.


Figure 12.5: Structure of an Objective CAML heap block.


The two ``color'' bits are used by the garbage collector for walking the memory graph (see chapter 9, page ??). The ``tag'' field, or ``tag'' for short, contains the kind of the block. The ``size'' field contains the size of the block, in words, excluding the header. The macros listed in figure 12.6 return the tag and size of a block.

   
Wosize_val(v) return the size of the block v (header excluded)
Tag_val(v) return the tag of the block v

Figure 12.6: Accessing header information in memory blocks.


The tag of a memory block can take the values listed in figure 12.7.

from 0 to No_scan_tag-1 an array of Objective CAML value representations
Closure_tag a function closure
String_tag a character string
Double_tag a double-precision float
Double_array_tag an array of float
Abstract_tag an abstract data type
Final_tag an abstract data type equipped with a finalization function

Figure 12.7: Tags of memory blocks.


Depending on the block tag, different macros are used to access the contents of the blocks. These macros are described in figure 12.8. When the tag is less than No_scan_tag, the heap block is structured as an array of Objective CAML value representations. Each element of the array is called a ``field'' of the memory block. In accordance with C and Objective CAML conventions, the first field is at index 0, and the last field is at index Wosize_val(v) - 1.

   
Field(v,n) return the nth field of v.
Code_val(v) return the code pointer for a closure.
string_length(v) return the length of a string.
Byte(v,n) return the n th character of a string, with C type char.
Byte_u(v,n) same, but result has C type unsigned char.
String_val(v) return the contents of a string with C type (char *).
Double_val(v) return the float contained in v.
Double_field(v,n) return the n th float contained in the float array v.

Figure 12.8: Accessing the content of a memory block.


As we did earlier for immediate values, we now define a function to inspect memory blocks. The C function print_block takes an Objective CAML value representation, tests whether it is an immediate value or a memory block, and in the latter case prints the kind and contents of the block. It is called from the wrapper function inspect_block, which can be called from Objective CAML.

#include <stdio.h>
#include <caml/mlvalues.h>

void margin (int n)
{ while (n-- > 0) printf("."); return; }

void print_block (value v,int m)
{
int size, i;
margin(m);
if (Is_long(v))
{ printf("immediate value (%d)\n", Long_val(v)); return; };
printf ("memory block: size=%d - ", size=Wosize_val(v));
switch (Tag_val(v))
{
case Closure_tag :
printf("closure with %d free variables\n", size-1);
margin(m+4); printf("code pointer: %p\n",Code_val(v)) ;
for (i=1;i<size;i++) print_block(Field(v,i), m+4);
break;
case String_tag :
printf("string: %s (%s)\n", String_val(v),(char *) v);
break;
case Double_tag:
printf("float: %g\n", Double_val(v));
break;
case Double_array_tag :
printf ("float array: ");
for (i=0;i<size/Double_wosize;i++) printf(" %g", Double_field(v,i));
printf("\n");
break;
case Abstract_tag : printf("abstract type\n"); break;
case Final_tag : printf("abstract finalized type\n"); break;
default:
if (Tag_val(v)>=No_scan_tag) { printf("unknown tag"); break; };
printf("structured block (tag=%d):\n",Tag_val(v));
for (i=0;i<size;i++) print_block(Field(v,i),m+4);
}
return ;
}

value inspect_block (value v)
{ print_block(v,4); fflush(stdout); return v; }

Each possible tag for a block corresponds to a case of the switch construct. In the case of a block containing an array of Objective CAML values, we recursively call print_block on each field of the array. We then redefine the inspect function:

# external inspect : 'a -> 'a = "inspect_block" ;;
external inspect : 'a -> 'a = "inspect_block"
We can now explore the representations of Objective CAML structured values. We must be careful not to apply inspect_block to a cyclic value, since the recursive traversal of the value would then loop indefinitely.

Arrays, tuples, and records

Arrays and tuples are represented by structured blocks. The nth field of the block contains the representation of the nth element of the array or tuple.

# inspect [| 1; 2; 3 |] ;;
....memory block: size=3 - structured block (tag=0):
........immediate value (1)
........immediate value (2)
........immediate value (3)
- : int array = [|1; 2; 3|]
# inspect ( 10 , true , () ) ;;
....memory block: size=3 - structured block (tag=0):
........immediate value (10)
........immediate value (1)
........immediate value (0)
- : int * bool * unit = 10, true, ()


Records are also represented as structured blocks. The values of the record fields appear in the order given at record declaration time. Mutable fields and immutable fields are represented identically.

# type foo = { fld1: int ; mutable fld2: int } ;;
type foo = { fld1: int; mutable fld2: int }
# inspect { fld1=10 ; fld2=20 } ;;
....memory block: size=2 - structured block (tag=0):
........immediate value (10)
........immediate value (20)
- : foo = {fld1=10; fld2=20}


Warning


Nothing prevents a C function from physically modifying an immutable record field. It is the programmers' responsibility to make sure that their C functions do not introduce inconsistencies in Objective CAML data structures.


Sum types

We previously saw that constant constructors are represented like integers. A non-constant constructor is represented by a block containing the constructor's arguments, with a tag identifying the constructor. The tag associated with a non-constant constructor represents its position in the type declaration: the first non-constant constructor has tag 0, the second one has tag 1, and so on.

# type foo = C1 of int * int * int | C2 of int | C3 | C4 of int * int ;;
type foo = | C1 of int * int * int | C2 of int | C3 | C4 of int * int
# inspect (C1 (1,2,3)) ;;
....memory block: size=3 - structured block (tag=0):
........immediate value (1)
........immediate value (2)
........immediate value (3)
- : foo = C1 (1, 2, 3)
# inspect (C4 (1,2)) ;;
....memory block: size=2 - structured block (tag=2):
........immediate value (1)
........immediate value (2)
- : foo = C4 (1, 2)


Note


The type list is a sum type whose declaration is:
type 'a list = [] | :: of 'a * 'a list. This type has only one non-constant constructor (::). Thus, a non-empty list is represented by a memory block with tag 0.


Character strings

Characters inside strings occupy one byte each. Thus, the memory block representing a string uses one word per group of four characters (on a 32-bit machine) or eight characters (on a 64-bit machine).

Warning


Objective CAML strings can contain the null character whose ASCII code is 0. In C, the null character represents the end of a string, and cannot appear inside a string.


#include <stdio.h>
#include <caml/mlvalues.h>

value explore_string (value v)
{
char *s;
int i,size;
s = (char *) v;
size = Wosize_val(v) * sizeof(value);
for (i=0;i<size;i++)
{
int p = (unsigned int) s[i] ;
if ((p>31) && (p<128)) printf("%c",s[i]); else printf("(#%u)",p);
}
printf("\n");
fflush(stdout);
return v;
}

The length and position of last character of an Objective CAML string are determined not by looking for a terminating null character, as in C, but by combining the size of the memory block that contains the string with the last byte of the last word of this block, which indicates the number of unused bytes in the last word. The following examples clarify the role played by this last byte.

# external explore : string -> string = "explore_string" ;;
external explore : string -> string = "explore_string"
# ignore(explore "");
ignore(explore "a");
ignore(explore "ab");
ignore(explore "abc");
ignore(explore "abcd");
ignore(explore "abcd\000") ;;
(#0)(#0)(#0)(#3)
a(#0)(#0)(#2)
ab(#0)(#1)
abc(#0)
abcd(#0)(#0)(#0)(#3)
abcd(#0)(#0)(#0)(#2)
- : unit = ()
In the last two examples ("abcd" and "abcd\000"), the strings are of length 4 and 5 respectively. This explains why the last byte takes two different values, although the other bytes of the string representations are identical.

Floats and float arrays

Objective CAML offers only one type (float) of floating-point numbers. This type corresponds to 64-bit, double-precision floating point numbers in C (type double). Values of type float are heap-allocated and represented by a memory block of size 2 words (on a 32-bit machine) or 1 word (on a 64-bit machine).

# inspect 1.5 ;;
....memory block: size=2 - float: 1.5
- : float = 1.5
# inspect 0.0;;
....memory block: size=2 - float: 0
- : float = 0


Arrays of floats are represented specially to reduce their memory occupancy: the floats contained in the array are stored consecutively in the memory block, rather than having each float heap-allocated separately. Therefore, float arrays possess a specific tag and specific access macros.

# inspect [| 1.5 ; 2.5 ; 3.5 |] ;;
....memory block: size=6 - float array: 1.5 2.5 3.5
- : float array = [|1.5; 2.5; 3.5|]
This optimized representation encourages the use of Objective CAML for numerical computations that manipulate many float arrays: operations on array elements are much more efficient than if each float was heap-allocated separately.

Warning


When allocating an Objective CAML float array from C, the size of the block should be the number of array elements multiplied by Double_wosize. The Double_wosize macro represents the number of words occupied by a double-precision float (2 words on a 32-bit machine, but only 1 word on a 64-bit machine).


With the exception of float arrays, floating-point numbers contained in other data structures are always treated as a structured, heap-allocated value. The following example shows the representation of a list of floats.

# inspect [ 3.14; 1.2; 7.6];;
....memory block: size=2 - structured block (tag=0):
........memory block: size=2 - float: 3.14
........memory block: size=2 - structured block (tag=0):
............memory block: size=2 - float: 1.2
............memory block: size=2 - structured block (tag=0):
................memory block: size=2 - float: 7.6
................immediate value (0)
- : float list = [3.14; 1.2; 7.6]
The list is viewed as a block with size 2, containing its head and its tail. The head of the list is a float, which is also a block of size 2.

Closures

A function value is represented by the code to be executed when the function is applied, and by its environment (see chapter 2, page ??). There are two ways to build a function value: either by explicit abstraction (as in fun x -> x+1) or by partial application of a curried function (as in (fun x -> fun y -> x+y) 1).

The environment of a closure can contain three kinds of variables: those declared globally, those declared locally, and the function parameters already instantiated by a partial application. The implementation treats those three kinds differently. Global variables are stored in a global environment that is not explicitly part of any closure. Local variables and instantiated parameters can appear in closures, as we now illustrate.

A closure with an empty environment is simply a memory block containing a pointer to the code of the function:

# let f = fun x y z -> x+y+z ;;
val f : int -> int -> int -> int = <fun>
# inspect f ;;
....memory block: size=1 - closure with 0 free variables
........code pointer: 0x807308c
- : int -> int -> int -> int = <fun>
Functions with free local variables are represented by closures with non-empty environments. Here, the closure contains both a pointer to the code of the function, and the values of its free local variables.

# let g = let x = 1 and y = 2 in fun z -> x+y+z ;;
val g : int -> int = <fun>
# inspect g ;;
....memory block: size=3 - closure with 2 free variables
........code pointer: 0x8086450
........immediate value (1)
........immediate value (2)
- : int -> int = <fun>


The Objective CAML virtual machine treats partial applications of functions specially for better performance. A partial application of an abstraction is represented by a closure containing a value for each of the instantiated parameters, plus a pointer to the closure for the initial abstraction.

# let a1 = f 1 ;;
val a1 : int -> int -> int = <fun>
# inspect (a1) ;;
....memory block: size=3 - closure with 2 free variables
........code pointer: 0x8073088
........memory block: size=1 - closure with 0 free variables
............code pointer: 0x807308c
........immediate value (1)
- : int -> int -> int = <fun>
# let a2 = a1 2 ;;
val a2 : int -> int = <fun>
# inspect (a2) ;;
....memory block: size=4 - closure with 3 free variables
........code pointer: 0x8073088
........memory block: size=1 - closure with 0 free variables
............code pointer: 0x807308c
........immediate value (1)
........immediate value (2)
- : int -> int = <fun>
Figure 12.9 depicts the result of the inspection above.


Figure 12.9: Closure representation.


The function f has no free variables, hence the environment part of its closure is empty. The code pointer for a function with several arguments points to the code that should be called when all arguments are provided. In the case of f, this is the code corresponding to x+y+z. Partial applications of this function result in intermediate closures that point to a shared code (it is the same code pointer for a1 and a2). The role of this code is to accumulate the arguments and detect when all arguments have been provided. If so, it pushes all arguments and calls the actual code for the function body; if not, it creates a new closure. For instance, the application of a1 to 2 fails to provide all arguments to the function f (the last argument is still missing), hence a closure is created containing the first two arguments, 1 and 2. Notice that the closures resulting from partial applications always contain, in the first environment slot, a pointer to the original closure. The original closure will be called when all arguments have been gathered.

Mixing local declarations and partial applications results in the following representation:

# let g x = let y=2 in fun z -> x+y+z ;;
val g : int -> int -> int = <fun>
# let a1 = g 1 ;;
val a1 : int -> int = <fun>
# inspect a1 ;;
....memory block: size=3 - closure with 2 free variables
........code pointer: 0x8086548
........immediate value (1)
........immediate value (2)
- : int -> int = <fun>


Abstract types

Values of an abstract type are represented like those of its implementation type. Actually, type information is used only during type-checking and compilation. During execution, the types are not needed -- only the memory representation (tag bits on values, size and tag fields on memory blocks) needs to be communicated to the garbage collector.

For instance, a value of the abstract type 'a Stack.t is represented as a reference to a list, since the type 'a Stack.t is implemented as 'a list ref.

# let p = Stack.create();;
val p : '_a Stack.t = <abstr>
# Stack.push 3 p;;
- : unit = ()
# inspect p;;
....memory block: size=1 - structured block (tag=0):
........memory block: size=2 - structured block (tag=0):
............immediate value (3)
............immediate value (0)
- : int Stack.t = <abstr>
On the other hand, some abstract types are implemented by representations that cannot be expressed in Objective CAML. Typical examples include arrays of weak pointers and input-output channels. Often, values of those abstract types are represented as memory blocks with tag Abstract_tag.

# let w = Weak.create 10;;
val w : '_a Weak.t = <abstr>
# Weak.set w 0 (Some p);;
- : unit = ()
# inspect w;;
....memory block: size=11 - abstract type
- : int Stack.t Weak.t = <abstr>
Sometimes, a finalization function is attached to those values. Finalization functions are C functions which are called by the garbage collector just before the value is collected. They are very useful to free external resources, such as an input-output buffer, just before the memory block referring to those resources disappears. For instance, inspection of the ``standard output'' channel reveals that the type out_channel is represented by abstract memory blocks with a finalization function:

# inspect (stdout) ;;
....memory block: size=2 - abstract finalized type
- : out_channel = <abstr>



Previous Contents Next