Exploring Objective CAML values from C
The machine representation of Objective CAML values differs from that of C
values, even for fundamental types such as integers. This is because the
Objective CAML garbage collector needs to record additional information in
values. Since Objective CAML values are represented uniformly, their
representations all belong to the same C type, named (unsurprisingly) value.
When Objective CAML calls a C function, passing it one or several arguments,
those arguments must be decoded before using them in the C function.
Similarly, the result of this C function must be encoded before being
returned to Objective CAML.
These conversions (decoding and encoding) are performed by a
number of macros and C functions provided by the Objective CAML runtime
system. These macros and functions are declared in the include files
listed in figure 12.3. These include files are part of
the Objective CAML installation, and can be found in the directory
where Objective CAML libraries are installed4
|
|
caml/mlvalues.h |
definition of the value type and basic value conversion macros. |
caml/alloc.h |
functions for allocating Objective CAML values. |
caml/memory.h |
macros for interfacing with the Objective CAML garbage collector. |
Figure 12.3: Include files for the C interface.
Classification of Objective CAML representations
An Objective CAML representation, that is, a C datum of type value, is
one of:
-
an immediate value (represented as an integer);
- a pointer into the Objective CAML heap;
- a pointer pointing outside the Objective CAML heap.
The Objective CAML heap is the memory area that is managed by the Objective CAML
garbage collector. C code can also allocate and manipulate data
structures in its own memory space, and communicate pointers to these data
structures to Objective CAML.
Figure 12.4 shows the macros for classifying
representations and converting between C integers and their Objective CAML
representation.
|
|
Is_long(v) |
is v an Objective CAML integer? |
Is_block(v) |
is v an Objective CAML pointer? |
|
|
Long_val(v) |
extract the integer contained in v, as
a C "long" |
Int_val(v) |
extract the integer contained in v, as
a C "int" |
Bool_val(v) |
extract the boolean contained in v (0
if false, non-zero if true) |
Figure 12.4: Classification of representations and conversion of immediate
values.
Note that C offers several integer types of varying sizes
(short, int, long, etc), while Objective CAML
has only one integer type, int.
Accessing immediate values
All Objective CAML immediate values are represented as integers:
-
integers are represented by their value;
- characters are represented by their ASCII code5;
- constant constructors are represented by an integer
corresponding to their position in the datatype declaration:
the nth constant constructor of a datatype is
represented by the integer n-1.
The following program defines a C function inspect that
inspects the representation of its argument:
#include <stdio.h>
#include <caml/mlvalues.h>
value inspect (value v)
{
if (Is_long(v))
printf ("v is an integer (%ld) : %ld", (long) v, Long_val(v));
else if (Is_block(v))
printf ("v is a pointer");
else
printf ("v is neither an integer nor a pointer (???)");
printf(" ");
fflush(stdout) ;
return v ;
}
The function inspect tests whether its argument is an Objective CAML
integer. If so, it prints the integer twice, first viewed as a C long
integer (without conversion), then converted by the Long_val
macro, which extracts the actual integer represented in the argument.
On the following example, we see that the machine representation of
integers in Objective CAML differs from that of C:
# external
inspect
:
'a
->
'a
=
"inspect"
;;
external inspect : 'a -> 'a = "inspect"
# inspect
1
2
3
;;
v is an integer (247) : 123 - : int = 123
# inspect
max_int;;
v is an integer (2147483647) : 1073741823 - : int = 1073741823
We can also inspect values of other predefined types, such as
char and bool:
# inspect
'A'
;;
v is an integer (131) : 65 - : char = 'A'
# inspect
true
;;
v is an integer (3) : 1 - : bool = true
# inspect
false
;;
v is an integer (1) : 0 - : bool = false
# inspect
[]
;;
v is an integer (1) : 0 - : '_a list = []
Consider the Objective CAML type foo defined thus:
# type
foo
=
C1
|
C2
of
int
|
C3
|
C4
;;
The inspect function shows that constant constructors and
non-constant constructors of this type are represented differently:
# inspect
C1
;;
v is an integer (1) : 0 - : foo = C1
# inspect
C4
;;
v is an integer (5) : 2 - : foo = C4
# inspect
(C2
1
)
;;
v is a pointer - : foo = C2 1
When the function inspect detects an immediate value, it prints
first the ``physical'' representation of this value
(i.e. the representation viewed as a word-sized C integer of C
type long); then it prints the ``logical'' contents of this
value (i.e. the Objective CAML integer it represents, as returned by
the decoding macro Long_val). The examples above show that the
``physical'' and the ``logical'' contents differ. This difference is
due to the tag bit6 used by the garbage collector to distinguish immediate values
from pointers (see chapter 9, page ??).
Representation of structured values
Non-immediate Objective CAML values are said to be structured values.
Those values are allocated in the Objective CAML heap and represented as a
pointer to the corresponding memory block. All memory blocks contain
a header word indicating the kind of the block as well as its size
expressed in machine words. Figure 12.5 shows the
structure of a block for a 32-bit machine.
Figure 12.5: Structure of an Objective CAML heap block.
The two ``color'' bits are used by the garbage collector for walking
the memory graph (see chapter 9,
page ??). The ``tag'' field, or ``tag'' for short,
contains the kind of the block. The ``size'' field contains the size
of the block, in words, excluding the header.
The macros listed in figure 12.6 return the tag and size
of a block.
|
|
Wosize_val(v) |
return the size of the block v (header excluded) |
Tag_val(v) |
return the tag of the block v |
Figure 12.6: Accessing header information in memory blocks.
The tag of a memory block can take the values listed in figure
12.7.
from 0 to No_scan_tag-1 |
an array of Objective CAML value representations |
Closure_tag |
a function closure |
String_tag |
a character string |
Double_tag |
a double-precision float |
Double_array_tag |
an array of float |
Abstract_tag |
an abstract data type |
Final_tag |
an abstract data type equipped with a finalization function |
Figure 12.7: Tags of memory blocks.
Depending on the block tag, different macros are used to access the
contents of the blocks. These macros are described in figure
12.8. When the tag is less than
No_scan_tag, the heap block is structured as an array of
Objective CAML value representations. Each element of the array is called a
``field'' of the memory block. In accordance with C and Objective CAML
conventions, the first field is at index 0, and the last field is at
index Wosize_val(v) - 1.
|
|
Field(v,n) |
return the nth field of
v. |
Code_val(v) |
return the code pointer for a closure. |
string_length(v) |
return the length of a string. |
Byte(v,n) |
return the n th character
of a string, with C type char. |
Byte_u(v,n) |
same, but result has C type unsigned char. |
String_val(v) |
return the contents of a string with C type
(char *). |
Double_val(v) |
return the float contained in v. |
Double_field(v,n) |
return the n th
float contained in the float array v. |
Figure 12.8: Accessing the content of a memory block.
As we did earlier for immediate values, we now define a function to
inspect memory blocks. The C function print_block takes an
Objective CAML value representation, tests whether it is an immediate value
or a memory block, and in the latter case prints the kind and contents
of the block. It is called from the wrapper function
inspect_block, which can be called from Objective CAML.
#include <stdio.h>
#include <caml/mlvalues.h>
void margin (int n)
{ while (n-- > 0) printf("."); return; }
void print_block (value v,int m)
{
int size, i;
margin(m);
if (Is_long(v))
{ printf("immediate value (%d)\n", Long_val(v)); return; };
printf ("memory block: size=%d - ", size=Wosize_val(v));
switch (Tag_val(v))
{
case Closure_tag :
printf("closure with %d free variables\n", size-1);
margin(m+4); printf("code pointer: %p\n",Code_val(v)) ;
for (i=1;i<size;i++) print_block(Field(v,i), m+4);
break;
case String_tag :
printf("string: %s (%s)\n", String_val(v),(char *) v);
break;
case Double_tag:
printf("float: %g\n", Double_val(v));
break;
case Double_array_tag :
printf ("float array: ");
for (i=0;i<size/Double_wosize;i++) printf(" %g", Double_field(v,i));
printf("\n");
break;
case Abstract_tag : printf("abstract type\n"); break;
case Final_tag : printf("abstract finalized type\n"); break;
default:
if (Tag_val(v)>=No_scan_tag) { printf("unknown tag"); break; };
printf("structured block (tag=%d):\n",Tag_val(v));
for (i=0;i<size;i++) print_block(Field(v,i),m+4);
}
return ;
}
value inspect_block (value v)
{ print_block(v,4); fflush(stdout); return v; }
Each possible tag for a block corresponds to a case of the
switch construct. In the case of a block containing an array
of Objective CAML values, we recursively call print_block on each
field of the array. We then redefine the inspect function:
# external
inspect
:
'a
->
'a
=
"inspect_block"
;;
external inspect : 'a -> 'a = "inspect_block"
We can now explore the representations of Objective CAML structured values.
We must be careful not to apply inspect_block to a cyclic value,
since the recursive traversal of the value would then loop indefinitely.
Arrays, tuples, and records
Arrays and tuples are represented by structured blocks. The
nth field of the block contains the
representation of the nth element of the
array or tuple.
# inspect
[|
1
;
2
;
3
|]
;;
....memory block: size=3 - structured block (tag=0):
........immediate value (1)
........immediate value (2)
........immediate value (3)
- : int array = [|1; 2; 3|]
# inspect
(
1
0
,
true
,
()
)
;;
....memory block: size=3 - structured block (tag=0):
........immediate value (10)
........immediate value (1)
........immediate value (0)
- : int * bool * unit = 10, true, ()
Records are also represented as structured blocks. The values of the
record fields appear in the order given at record declaration time.
Mutable fields and immutable fields are represented identically.
# type
foo
=
{
fld1:
int
;
mutable
fld2:
int
}
;;
type foo = { fld1: int; mutable fld2: int }
# inspect
{
fld1=
1
0
;
fld2=
2
0
}
;;
....memory block: size=2 - structured block (tag=0):
........immediate value (10)
........immediate value (20)
- : foo = {fld1=10; fld2=20}
Warning
Nothing prevents a C function from physically modifying an immutable
record field. It is the programmers' responsibility to make sure that
their C functions do not introduce inconsistencies in Objective CAML data
structures.
Sum types
We previously saw that constant constructors are represented like
integers. A non-constant constructor is represented by a block
containing the constructor's arguments, with a tag identifying the
constructor. The tag associated with a non-constant constructor
represents its position in the type declaration: the first
non-constant constructor has tag 0, the second one has tag 1, and so on.
# type
foo
=
C1
of
int
*
int
*
int
|
C2
of
int
|
C3
|
C4
of
int
*
int
;;
type foo = | C1 of int * int * int | C2 of int | C3 | C4 of int * int
# inspect
(C1
(1
,
2
,
3
))
;;
....memory block: size=3 - structured block (tag=0):
........immediate value (1)
........immediate value (2)
........immediate value (3)
- : foo = C1 (1, 2, 3)
# inspect
(C4
(1
,
2
))
;;
....memory block: size=2 - structured block (tag=2):
........immediate value (1)
........immediate value (2)
- : foo = C4 (1, 2)
Note
The type list is a sum type whose declaration is:
type
'a
list
=
[]
|
::
of
'a
*
'a
list.
This type has only one non-constant constructor (::).
Thus, a non-empty list is represented by a memory block with tag 0.
Character strings
Characters inside strings occupy one byte each. Thus, the memory
block representing a string uses one word per group of four characters
(on a 32-bit machine) or eight characters (on a 64-bit machine).
Warning
Objective CAML strings can contain the null character whose ASCII
code is 0. In C, the null character represents the end of a
string, and cannot appear inside a string.
#include <stdio.h>
#include <caml/mlvalues.h>
value explore_string (value v)
{
char *s;
int i,size;
s = (char *) v;
size = Wosize_val(v) * sizeof(value);
for (i=0;i<size;i++)
{
int p = (unsigned int) s[i] ;
if ((p>31) && (p<128)) printf("%c",s[i]); else printf("(#%u)",p);
}
printf("\n");
fflush(stdout);
return v;
}
The length and position of last character of an Objective CAML string
are determined not by looking for a terminating null character, as
in C, but by combining the size of the memory block that contains the
string with the last byte of the last word of this block, which
indicates the number of unused bytes in the last word. The
following examples clarify the role played by this last byte.
# external
explore
:
string
->
string
=
"explore_string"
;;
external explore : string -> string = "explore_string"
# ignore(explore
""
);
ignore(explore
"a"
);
ignore(explore
"ab"
);
ignore(explore
"abc"
);
ignore(explore
"abcd"
);
ignore(explore
"abcd\000"
)
;;
(#0)(#0)(#0)(#3)
a(#0)(#0)(#2)
ab(#0)(#1)
abc(#0)
abcd(#0)(#0)(#0)(#3)
abcd(#0)(#0)(#0)(#2)
- : unit = ()
In the last two examples ("abcd" and
"abcd\000"), the strings are of length
4 and 5 respectively.
This explains why the last byte takes two different values,
although the other bytes of the string representations are identical.
Floats and float arrays
Objective CAML offers only one type (float) of floating-point
numbers. This type corresponds to 64-bit, double-precision floating
point numbers in C (type double). Values of type float
are heap-allocated and represented by a memory block of size 2 words
(on a 32-bit machine) or 1 word (on a 64-bit machine).
# inspect
1
.
5
;;
....memory block: size=2 - float: 1.5
- : float = 1.5
# inspect
0
.
0
;;
....memory block: size=2 - float: 0
- : float = 0
Arrays of floats are represented specially to reduce their memory
occupancy: the floats contained in the array are stored consecutively
in the memory block, rather than having each float heap-allocated separately.
Therefore, float arrays possess a specific tag and specific access macros.
# inspect
[|
1
.
5
;
2
.
5
;
3
.
5
|]
;;
....memory block: size=6 - float array: 1.5 2.5 3.5
- : float array = [|1.5; 2.5; 3.5|]
This optimized representation encourages the use of Objective CAML for numerical
computations that manipulate many float arrays: operations on array
elements are much more efficient than if each float was heap-allocated
separately.
Warning
When allocating an Objective CAML float array from C, the size of the block
should be the number of array elements multiplied by
Double_wosize. The Double_wosize macro represents the
number of words occupied by a double-precision float
(2 words on a 32-bit machine, but only 1 word on a 64-bit machine).
With the exception of float arrays, floating-point numbers contained
in other data structures are always treated as a structured,
heap-allocated value. The following example shows the representation
of a list of floats.
# inspect
[
3
.
1
4
;
1
.
2
;
7
.
6
]
;;
....memory block: size=2 - structured block (tag=0):
........memory block: size=2 - float: 3.14
........memory block: size=2 - structured block (tag=0):
............memory block: size=2 - float: 1.2
............memory block: size=2 - structured block (tag=0):
................memory block: size=2 - float: 7.6
................immediate value (0)
- : float list = [3.14; 1.2; 7.6]
The list is viewed as a block with size 2, containing its head
and its tail. The head of the list is a float, which is also a block
of size 2.
Closures
A function value is represented by the code to be executed when the
function is applied, and by its environment (see chapter 2,
page ??). There are two ways to build a function
value: either by explicit abstraction
(as in fun x -> x+1) or by partial application of a curried function
(as in (fun x -> fun y -> x+y) 1).
The environment of a closure can contain three kinds of variables:
those declared globally, those declared locally, and the function
parameters already instantiated by a partial application.
The implementation treats those three kinds differently.
Global variables are stored in a global environment that is not
explicitly part of any closure. Local variables and instantiated
parameters can appear in closures, as we now illustrate.
A closure with an empty environment is simply a memory block
containing a pointer to the code of the function:
# let
f
=
fun
x
y
z
->
x+
y+
z
;;
val f : int -> int -> int -> int = <fun>
# inspect
f
;;
....memory block: size=1 - closure with 0 free variables
........code pointer: 0x807308c
- : int -> int -> int -> int = <fun>
Functions with free local variables are represented by closures with
non-empty environments. Here, the closure contains both a pointer to
the code of the function, and the values of its free local variables.
# let
g
=
let
x
=
1
and
y
=
2
in
fun
z
->
x+
y+
z
;;
val g : int -> int = <fun>
# inspect
g
;;
....memory block: size=3 - closure with 2 free variables
........code pointer: 0x8086450
........immediate value (1)
........immediate value (2)
- : int -> int = <fun>
The Objective CAML virtual machine treats partial applications of functions
specially for better performance. A partial application of an
abstraction is represented by a closure containing a value for each of
the instantiated parameters, plus a pointer to the closure for the
initial abstraction.
# let
a1
=
f
1
;;
val a1 : int -> int -> int = <fun>
# inspect
(a1)
;;
....memory block: size=3 - closure with 2 free variables
........code pointer: 0x8073088
........memory block: size=1 - closure with 0 free variables
............code pointer: 0x807308c
........immediate value (1)
- : int -> int -> int = <fun>
# let
a2
=
a1
2
;;
val a2 : int -> int = <fun>
# inspect
(a2)
;;
....memory block: size=4 - closure with 3 free variables
........code pointer: 0x8073088
........memory block: size=1 - closure with 0 free variables
............code pointer: 0x807308c
........immediate value (1)
........immediate value (2)
- : int -> int = <fun>
Figure 12.9 depicts the result of the inspection above.
Figure 12.9: Closure representation.
The function f has no free variables, hence the environment
part of its closure is empty. The code pointer for a function with
several arguments points to the code that should be called when all
arguments are provided. In the case of f, this is the code
corresponding to x+y+z. Partial applications of this
function result in intermediate closures that point to a shared code
(it is the same code pointer for a1 and a2). The role
of this code is to accumulate the arguments and detect when all
arguments have been provided. If so, it pushes all arguments and
calls the actual code for the function body; if not, it creates a new
closure. For instance, the application of a1 to 2 fails to
provide all arguments to the function f (the last argument is
still missing), hence a closure is created containing the first two
arguments, 1 and 2. Notice that the closures resulting from partial
applications always contain, in the first environment slot, a pointer
to the original closure. The original closure will be called when all
arguments have been gathered.
Mixing local declarations and partial applications results in the
following representation:
# let
g
x
=
let
y=
2
in
fun
z
->
x+
y+
z
;;
val g : int -> int -> int = <fun>
# let
a1
=
g
1
;;
val a1 : int -> int = <fun>
# inspect
a1
;;
....memory block: size=3 - closure with 2 free variables
........code pointer: 0x8086548
........immediate value (1)
........immediate value (2)
- : int -> int = <fun>
Abstract types
Values of an abstract type are represented like those of its
implementation type. Actually, type information is used only during
type-checking and compilation. During execution, the types are not
needed -- only the memory representation (tag bits on values, size and
tag fields on memory blocks) needs to be communicated to the garbage
collector.
For instance, a value of the abstract type 'a Stack.t
is represented as a reference to a list, since the type 'a Stack.t
is implemented as 'a list ref.
# let
p
=
Stack.create();;
val p : '_a Stack.t = <abstr>
# Stack.push
3
p;;
- : unit = ()
# inspect
p;;
....memory block: size=1 - structured block (tag=0):
........memory block: size=2 - structured block (tag=0):
............immediate value (3)
............immediate value (0)
- : int Stack.t = <abstr>
On the other hand, some abstract types are implemented by
representations that cannot be expressed in Objective CAML. Typical
examples include arrays of weak pointers and input-output channels.
Often, values of those abstract types are represented as memory
blocks with tag Abstract_tag.
# let
w
=
Weak.create
1
0
;;
val w : '_a Weak.t = <abstr>
# Weak.set
w
0
(Some
p);;
- : unit = ()
# inspect
w;;
....memory block: size=11 - abstract type
- : int Stack.t Weak.t = <abstr>
Sometimes, a finalization function is attached to those values.
Finalization functions are C functions which are called by the garbage
collector just before the value is collected. They are very useful to
free external resources, such as an input-output buffer, just before
the memory block referring to those resources disappears. For
instance, inspection of the ``standard output'' channel reveals that
the type out_channel is represented by abstract memory
blocks with a finalization function:
# inspect
(stdout)
;;
....memory block: size=2 - abstract finalized type
- : out_channel = <abstr>