The adaptor interface is undoubtedly simple; in this section, we will ask ourselves whether it is too simplistic. The Adaptor implementation is still at the level of a prototype but, as we shall see in the next few pages, is significant enough to challenge us on all the issues that the people working with object persistence are trying to grapple with.
I wanted the Adaptor API to be transparent; that is, I wanted to be able to change the type of persistent store at will. The idea was to write small prototypes without messing around with databases and then migrate to a database for the real thing by simply changing the adaptor. Further, I wanted to retain the flexibility of an object living in multiple persistent stores concurrently, because that is the only way to copy objects from one store to another.
I wanted to retain the best features of memory-based data structures (navigability, speed, ease of use) and those of databases (transactions, concurrency, queries), where available. Finally, I did not want the adaptor to break object encapsulation, which means that the implementation could not assume anything about how a module stores instance-specific information and, more subtly, how it constructs its objects.
One important stricture that we easily forget is that an object is not just data. The three serialization modules we saw in the last chapter - FreezeThaw, Data::Dumper, and Storable - all make this assumption. They look past an object reference at the underlying structure and serialize whatever is reachable from there. This assumes that all instance-specific data is reachable from the reference: a false assumption. For example, an object reference of type ObjectTemplate is merely a reference to a scalar. By studying that reference, you have no idea of the object's attributes.
There is a worse problem with the above modules: when restoring objects from a byte stream, they simply recreate the original data structure in memory and bless it under the target module, without the module's involvement. This has the possibility of missing a few key initializations.
To avoid these problems, Adaptor requires each class that wants persistence to support three methods: a constructor, new(), and two attribute accessor methods, get_attributes() and set_attributes(), as follows:
new()
: The module must provide this constructor (a "default constructor," in C++ parlance), capable of creating an object without any input parameters. The simplest default constructor for creating hash-table-based objects looks like this:
    sub new {
        bless {};   # bless a hash-table reference and return it.
    }
Of course, an even simpler alternative is to use ObjectTemplate, which provides an inheritable default constructor. As it happens, it also provides the other two methods listed next.
get_attributes(LIST)
: Given a list of attribute names, this method should return a list of corresponding values. For now, the restriction is that these values must be scalars (a big limitation; we will have more to say about this shortly). Because this method can be coded efficiently, it is preferable to Adaptor calling individual accessor functions. For example, if you use a hash table for your objects, you can implement this method as a hash slice:
    sub get_attributes {
        my $obj = shift;    # @_ now contains names of attributes
        @{$obj}{@_};        # hash slice returns corresponding values
    }
Adaptor uses the configuration file to specify the list of persistent attributes.
set_attributes(LIST)
: Given a list of attribute name and value pairs, this method updates the appropriate attributes. Both this function and get_attributes above must allow an attribute called _id, for reasons to be outlined shortly.
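As an illustration of the three required methods taken together, here is a minimal sketch of a hash-based class. The class name Employee and its attribute names are assumptions made for the example; only new(), get_attributes(), and set_attributes() come from the text, and the constructor here takes the class name so that it can be inherited.

```perl
use strict;
use warnings;

package Employee;    # hypothetical class, used only for illustration

# Default constructor: needs no input parameters.
sub new {
    my $class = shift;
    return bless {}, $class;
}

# Given attribute names, return the corresponding values (hash slice).
sub get_attributes {
    my $obj = shift;
    return @{$obj}{@_};
}

# Given name/value pairs, update the corresponding attributes.
sub set_attributes {
    my ($obj, %pairs) = @_;
    @{$obj}{ keys %pairs } = values %pairs;
    return $obj;
}

package main;

my $emp = Employee->new;
$emp->set_attributes(_id => 1001, name => 'Ada', age => 36);

my ($name, $age) = $emp->get_attributes(qw(name age));
print "$name is $age\n";    # prints "Ada is 36"
```

Nothing in this class refers to persistence; the same two accessors would serve a debugger or a user interface equally well.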
These methods are perfectly general functions; they are not tied to persistence in any way. In contrast, some libraries, especially in the C++ world (Microsoft Foundation Classes and the NIH library), require the object to support a streaming interface. Since a streamed object is of no use to a database, I chose to keep the attributes distinct. Besides, if we wanted to send these attributes to a file, we know we can always rely on other modules to stream them, without having to ask the object to do it for us.
When storing the object, the adaptor consults the configuration information for the list of persistent attributes for that class. It gives this list to get_attributes to retrieve the corresponding values and, depending on the type of the adaptor, either serializes them to a file or updates the database with an SQL query.
When retrieving an object from the database, the adaptor calls new() on the appropriate class and calls set_attributes to prime the newly constructed object with data from the persistent store.
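The store and retrieve sequence just described can be sketched with a toy adaptor that keeps its "rows" in an in-memory hash. ToyAdaptor, its retrieve() method, and the way the attribute list is passed to its constructor are all assumptions for this sketch; the real Adaptor reads its attribute list from a configuration file and writes to a file or a database.

```perl
use strict;
use warnings;

package Employee;    # hypothetical persistent class

sub new            { my $class = shift; return bless {}, $class }
sub get_attributes { my $obj = shift; return @{$obj}{@_} }

sub set_attributes {
    my ($obj, %pairs) = @_;
    @{$obj}{ keys %pairs } = values %pairs;
    return $obj;
}

package ToyAdaptor;  # hypothetical stand-in for a real adaptor

sub new {
    my ($class, %config) = @_;   # class name => list of persistent attributes
    return bless { config => {%config}, rows => {} }, $class;
}

# Ask the object for its persistent attributes and save them as one row.
sub store {
    my ($self, $obj) = @_;
    my $class = ref $obj;
    my @names = @{ $self->{config}{$class} };
    my %row;
    @row{@names} = $obj->get_attributes(@names);
    $self->{rows}{$class}{ $row{_id} } = \%row;
    return $row{_id};
}

# Construct an empty object with new() and prime it with set_attributes().
sub retrieve {
    my ($self, $class, $id) = @_;
    my $row = $self->{rows}{$class}{$id} or return;
    my $obj = $class->new;
    $obj->set_attributes(%$row);
    return $obj;
}

package main;

my $db  = ToyAdaptor->new(Employee => [qw(_id name age)]);
my $emp = Employee->new;
$emp->set_attributes(_id => 7, name => 'Grace', age => 41,
                     scratch => 'not persistent');
$db->store($emp);

my $copy = $db->retrieve('Employee', 7);
print join(', ', $copy->get_attributes(qw(name age))), "\n";   # Grace, 41
```

Because only the configured attributes are requested, the scratch attribute never reaches the store, and the class itself decides how the remaining values are produced and consumed.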
Adaptor::DBI simply translates an object to a single row in an RDBMS table. For this reason, it requires each value returned by get_attributes to be a simple scalar (number or string, not a reference). My hope is to eventually ease this restriction with the help of typemaps - pieces of code that can perform customized translations of data types.[7]
[7] In Section 18.1, "Writing an Extension: Overview", we will see how the concept of typemaps is used in creating extensions.
Here are the currently available choices for how to handle an object with one or more non-simple-scalar attributes:
Customized {get,set}_attributes: Adaptor::DBI allows multivalued attributes in memory. All it requires is that get_attributes translate such attributes to a simple scalar in a way that set_attributes will be able to convert back to the original structure, when the data is read back from disk. It can do this translation using FreezeThaw, Data::Dumper, sprintf, or pack; the last two are probably the best, because you can control the length of the resulting scalar (it matters because database columns have predeclared maximum sizes). The scalar can then be mapped to a database column capable of accommodating a variable number of characters (such as VARCHAR) or a binary string (such as Oracle's RAW or LONG RAW). Incidentally, there are still a lot of problems associated with BLOB (Binary Large OBjects) columns: some databases only allow one BLOB column, and others sport an API that is completely different from that of the conventional data types.
Use file storage: Adaptor::File doesn't care whether the attributes are references or ordinary scalars, because it simply hands over the attributes to Storable. In other words, get/set_attributes doesn't have to worry about multivalued attributes if you use Adaptor::File. Of course, the solution won't work if you decide to use a database adaptor tomorrow. There is also the danger that you might inadvertently store unrelated objects this way, just because they happen to be reachable from some attribute.
Separate object class: If an attribute is a reference to a sequence of homogeneous records (an employee has multiple records of educational qualifications, for example), that attribute can be modeled as a separate class that gets its own table. More on this when we study object associations later in this section.
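To make the first option concrete, here is a minimal sketch of customized accessors that flatten a list-valued attribute into one fixed-width scalar with pack and rebuild it with unpack. The phones attribute and the 12-character field width are assumptions chosen for the example; a real class would pick a width that matches its database column.

```perl
use strict;
use warnings;

package Employee;    # hypothetical class with one multivalued attribute

use constant WIDTH => 12;            # assumed width of each packed entry
use constant FIELD => 'A' . WIDTH;   # space-padded, fixed-width pack field

sub new { my $class = shift; return bless {}, $class }

# Return simple scalars only: 'phones' is packed into one string.
sub get_attributes {
    my $obj = shift;
    my @values;
    for my $name (@_) {
        if ($name eq 'phones') {
            my @phones = @{ $obj->{phones} || [] };
            push @values, pack(FIELD() x scalar(@phones), @phones);
        }
        else {
            push @values, $obj->{$name};
        }
    }
    return @values;
}

# Accept either an array reference or the packed form for 'phones'.
sub set_attributes {
    my ($obj, %pairs) = @_;
    for my $name (keys %pairs) {
        my $value = $pairs{$name};
        if ($name eq 'phones' && defined $value && !ref $value) {
            my $count = length($value) / WIDTH();
            $value = [ unpack(FIELD() x $count, $value) ];
        }
        $obj->{$name} = $value;
    }
    return $obj;
}

package main;

my $emp = Employee->new;
$emp->set_attributes(_id => 1, phones => ['555-0100', '555-0199']);

my ($packed) = $emp->get_attributes('phones');   # one 24-character scalar

my $copy = Employee->new;
$copy->set_attributes(phones => $packed);        # as if read back from disk
print join(' | ', @{ $copy->{phones} }), "\n";
```

Entries longer than the field width are silently truncated by pack, which is the price of a predictable column size.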
Since {get,set}_attributes are general methods, how do they know whether or not to serialize complex attributes? Well, they don't. If you want to make this distinction, you could have a different set of attribute names for persistence purposes (db_address, for example) and have these methods recognize these special cases. This strategy conflicts with our original intention of not embedding db-specific code within an object. Oh, well. As Jiri Soukup notes in his book Taming C++: Pattern Classes and Persistence for Large Projects [11], "It is popular to show elegant C++ programs, and elegance is not a feature of programs providing persistent data."
The common strategy for mapping an inheritance relationship to a database is to have the superclass and derived class each map to its own table. The table representing the derived class contains all the attributes of all its superclasses; in other words, the inheritance hierarchy is flattened. Another strategy - less commonly used - is to create one table with the union of all attributes of an inheritance hierarchy and have all objects of all classes in that hierarchy use that one table. You can have an extra column identify the specific class of object. Adaptor does not have a problem with either strategy, because it puts the burden of interpreting the attribute names and values on the get/set methods.
One key notion in OO circles is that an object has properties separate from its identity. Two objects may have identical properties but still occupy different locations in memory; they will be considered equivalent, not identical.
In memory, an object's address provides its identity, and in a database, the primary key does the same. Adaptor requires each object to support an attribute called _id, so a future implementation can automatically convert relationship attributes (those that point to other objects) to the _ids of the objects on the other end. For example, if you ask an Employee object for its dept attribute, it will ask the department object it is pointing to for its _id and return that. Note that the object doesn't necessarily have to allocate memory for its _id; the get/set_attributes methods can compute it on the fly based on some other attribute. For example, an employee object can return the Social Security number or employee number when asked for its _id.
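A minimal sketch of both ideas follows: an _id that is computed from another attribute rather than stored, and a dept attribute that reports the _id of the object it points to. The attribute names emp_no and dept, and the Department class, are assumptions for the example.

```perl
use strict;
use warnings;

package Department;  # hypothetical class with a stored _id

sub new            { my $class = shift; return bless {}, $class }
sub get_attributes { my $obj = shift; return @{$obj}{@_} }

sub set_attributes {
    my ($obj, %pairs) = @_;
    @{$obj}{ keys %pairs } = values %pairs;
    return $obj;
}

package Employee;    # hypothetical class whose _id is its employee number

sub new { my $class = shift; return bless {}, $class }

sub get_attributes {
    my $obj = shift;
    my @values;
    for my $name (@_) {
        if ($name eq '_id') {
            push @values, $obj->{emp_no};          # computed, not stored
        }
        elsif ($name eq 'dept' && ref $obj->{dept}) {
            # Report the identity of the object on the other end.
            push @values, ($obj->{dept}->get_attributes('_id'))[0];
        }
        else {
            push @values, $obj->{$name};
        }
    }
    return @values;
}

sub set_attributes {
    my ($obj, %pairs) = @_;
    # Setting _id really sets the employee number.
    $pairs{emp_no} = delete $pairs{_id} if exists $pairs{_id};
    @{$obj}{ keys %pairs } = values %pairs;
    return $obj;
}

package main;

my $dept = Department->new;
$dept->set_attributes(_id => 'D42', name => 'Research');

my $emp = Employee->new;
$emp->set_attributes(emp_no => 1001, name => 'Ada', dept => $dept);

print join(', ', $emp->get_attributes(qw(_id name dept))), "\n";
# prints "1001, Ada, D42"
```

Note that this sketch covers only the outgoing direction; turning a stored dept identifier back into an object on retrieval is the harder problem.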
When store() is called, Adaptor supplies the object with a unique identity, if it doesn't already have one. The identity cannot be a simple global counter, because when the program restarts, it will get reset to 0, and the adaptor will start handing out numbers that might have been given to persistent objects in an earlier incarnation. Storing the counter's last value in a file is slow because you have to make sure you flush this value to the file every single time you store an object. (You never know when the program might crash.) The current implementation experiments with an alternate approach. When the program starts, it notes down the time (using time, which returns the seconds elapsed since January 1, 1970), and appends to it a five-digit counter; the combined number can be used as an object identifier. When the counter overflows, the time is again noted. If the program crashes and comes back again, the identifier is unique, unless it crashes and comes back up within one second. The trouble with this scheme is that it generates long identifiers (eight bytes, using pack()). It also does not work in a distributed setup, because there is the real possibility that two programs call time() within the same second, thus generating the same identifier. To avoid this, you have to create an even bigger identifier that incorporates the IP address of the machine.
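The time-plus-counter scheme can be sketched as follows. IdGenerator and next_id are hypothetical names, and the identifier is returned as a decimal string rather than packed into eight bytes. One detail is an addition of this sketch, not something the text specifies: if the counter overflows within the same second as the last timestamp, the timestamp is bumped by one so that identifiers within a single run never repeat.

```perl
use strict;
use warnings;

package IdGenerator;  # hypothetical helper, not part of Adaptor

use constant MAX_COUNT => 99_999;    # largest five-digit counter value

sub new {
    my $class = shift;
    # Note the start time: seconds since January 1, 1970.
    return bless { stamp => time(), count => 0 }, $class;
}

sub next_id {
    my $self = shift;
    if ($self->{count} > MAX_COUNT) {
        # Counter overflowed: note the time again and restart the counter.
        my $now = time();
        # Assumption of this sketch: never reuse the previous timestamp.
        $now = $self->{stamp} + 1 if $now <= $self->{stamp};
        $self->{stamp} = $now;
        $self->{count} = 0;
    }
    # Timestamp followed by a zero-padded five-digit counter.
    return sprintf '%d%05d', $self->{stamp}, $self->{count}++;
}

package main;

my $gen = IdGenerator->new;
my $id1 = $gen->next_id;
my $id2 = $gen->next_id;
print "$id1\n$id2\n";
```

As the text points out, this still offers no protection against a restart within the same second or against two machines generating identifiers at the same moment.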
An attribute that is a reference to some other object can be translated to the other object's _id value (a foreign key, in database-speak) when storing it in a database or file. As currently implemented, Adaptor does not automatically do this translation, because I don't have a good solution to handle the following problem.
Assume that an employee object's dept attribute points to a department object. When storing dept, we can simply store the department object's _id. No problems so far. Now, when we retrieve the employee record back from disk, what do we do with the encoded dept attribute? Do we immediately create a department object so that the in-memory dept attribute can refer to it? If so, what data should it contain? Should we read the database to correctly populate the department object? That has the problem that an innocuous query on an employee ends up loading all kinds of objects from the database. Alternatively, should we keep the department in an uninitialized state, and only populate it the first time it is used? Further, we must ensure that when the department data is read from disk, it doesn't create a fresh new object, because one with the same identity already exists in memory. We will have more to say on this subject in the following section. For now, it eases my life a little to leave it to the objects to implement foreign key attributes.
Now let us look at how associations of varying cardinalities can be implemented in a database regardless of how they appear in memory.
One-to-many associations such as a department containing a list of employees can be implemented as a foreign-key attribute on the many side. That is, in the database, the employee object points back to its containing department object, instead of the department maintaining a multivalued attribute.
Many-to-many associations can be modeled as a separate class; this way, each association becomes a single record in the database. For example, an employee can work on many projects; a project has many employees working on it; we can model this relationship in a separate class called ProjectEmployee. This scheme has the additional advantage that the relationships can be queried and updated, independent of the objects they are supposed to connect. Associations of degree higher than two (ternary associations, for example) likewise map to distinct tables. Rumbaugh et al. [6] give an excellent treatment of database-mapping approaches.
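As a rough illustration of the many-to-many case, the sketch below models each link as its own ProjectEmployee object holding two foreign keys. The attribute names project_id and employee_id are assumptions, and the "query" is an in-memory grep standing in for what a database would do.

```perl
use strict;
use warnings;

package ProjectEmployee;  # association class: one object per link

sub new            { my $class = shift; return bless {}, $class }
sub get_attributes { my $obj = shift; return @{$obj}{@_} }

sub set_attributes {
    my ($obj, %pairs) = @_;
    @{$obj}{ keys %pairs } = values %pairs;
    return $obj;
}

package main;

# (project _id, employee _id) pairs; the values are made up for the example.
my @pairs = ([1, 100], [1, 101], [2, 100]);

my @links;
my $next_id = 1;
for my $pair (@pairs) {
    my $link = ProjectEmployee->new;
    $link->set_attributes(
        _id         => $next_id++,
        project_id  => $pair->[0],
        employee_id => $pair->[1],
    );
    push @links, $link;
}

# Which employees work on project 1? Only the links are consulted.
my @on_project_1 =
    map  { ($_->get_attributes('employee_id'))[0] }
    grep { ($_->get_attributes('project_id'))[0] == 1 } @links;

print "@on_project_1\n";    # prints "100 101"
```

Every attribute of a link is a simple scalar, so each link fits in a single database row without any special translation.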
All these strategies (or limitations) will change dramatically once object-relational extensions become widely available.
Close on the heels of object identity issues comes a very thorny problem. Consider the following query:
@emps = $db->retrieve_where ('Employee', 'age < 40');
This returns a list of object references that match the query criteria. Now if you re-issue this query, it is not too much to expect it to return an identical list of objects (the same object references, that is). This means that Adaptor has to keep an in-memory cache of objects that have been retrieved from disk in previous queries, so that if a database row is reread, the corresponding object is reused. The problem with this scheme is that if this cache is in script space, it increments the reference count of all its constituent objects, which means that once an object is in this cache, it will never be freed, even if no one else is interested in it. In other words, the cache can never shrink, and in the worst case, it has a copy of all the objects present in the database.
One solution to this problem is to implement the cache in C and not update the reference count at all.[8] If all persistent objects were to inherit from a module called Persistent, say, then this module's DESTROY method can be used to remove unwanted entries from this cache.
[8] You will know how to do this once you have read Chapter 20, Perl Internals.
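A different technique from the C-level cache described above, available in later versions of Perl, is to keep the cache in script space but hold its entries as weak references using weaken from Scalar::Util. A weak reference does not count toward the reference count, so an object is freed as soon as the application drops it. The sketch below assumes a hypothetical WeakCache class and a loader callback that stands in for a database read.

```perl
use strict;
use warnings;
use Scalar::Util ();

package WeakCache;   # hypothetical identity cache keyed by _id

sub new { my $class = shift; return bless { objs => {} }, $class }

# Return the cached object for $id, loading it only if it is not in memory.
sub fetch {
    my ($self, $id, $loader) = @_;
    my $obj = $self->{objs}{$id};        # copying yields a strong reference
    return $obj if defined $obj;

    $obj = $loader->($id);
    $self->{objs}{$id} = $obj;
    Scalar::Util::weaken($self->{objs}{$id});   # cache no longer pins it
    return $obj;
}

# Number of cache entries whose objects are still alive.
sub live_count {
    my $self = shift;
    return scalar(grep { defined $_ } values %{ $self->{objs} });
}

package main;

my $cache  = WeakCache->new;
my $loader = sub { return +{ _id => $_[0] } };  # stands in for a database read

my $first  = $cache->fetch(7, $loader);
my $second = $cache->fetch(7, $loader);
print "same object\n" if $first == $second;

undef $first;
undef $second;
print "entries still alive: ", $cache->live_count, "\n";   # 0
```

The stale hash keys themselves remain until they are deleted, which is the job a DESTROY method in a common base class could take on.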
The Adaptor::DBI module, as currently implemented, takes the easy way out and creates a fresh set of objects for each query, leaving it to Perl to automatically deallocate them when no other object refers to them. This means that the applications developer has to be careful when modifying an object returned from a query. This is a clumsy solution, I know. In addition, there is currently no provision for cache inconsistency - where the cache is out-of-date if someone else modifies the database.
The Adaptor::File module does not have this problem because it maintains a list of all objects given to its store() method (for reasons to be explained in the next section); hence successive identical queries return identical lists.
One big reason why object-oriented databases haven't caught on is the lack of a query language (or at least a standard query language). When you have a million objects in the database, it would be a terrible thing to load every single object in memory to see whether it matches your criteria; this is a job best left to the database. Adaptor::DBI simply translates queries to equivalent SQL queries, while Adaptor::File implements a simple-minded scheme for file-based objects: it converts the query expression to an evalable Perl expression and cycles through all objects, matching them against the query specification.
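The file-based approach can be approximated as follows. This is a sketch of the general idea, not Adaptor::File's actual code: compile_query is a hypothetical helper that rewrites each known attribute name into a get_attributes call and compiles the result with a string eval. Since it evaluates arbitrary text as Perl, it should only ever see trusted query strings.

```perl
use strict;
use warnings;

package Employee;    # hypothetical class

sub new            { my $class = shift; return bless {}, $class }
sub get_attributes { my $obj = shift; return @{$obj}{@_} }

sub set_attributes {
    my ($obj, %pairs) = @_;
    @{$obj}{ keys %pairs } = values %pairs;
    return $obj;
}

package main;

# Turn a query such as 'age < 40' into a code reference that tests one object.
sub compile_query {
    my ($query, @attrs) = @_;
    my $names = join '|', map { quotemeta($_) } @attrs;

    # Replace each attribute name with a call that fetches its value.
    (my $expr = $query) =~ s/\b($names)\b/(\$o->get_attributes('$1'))[0]/g;

    my $code = eval "sub { my \$o = shift; $expr }";
    die "cannot compile query '$query': $@" unless $code;
    return $code;
}

my @objects;
for my $row (['Ada', 36], ['Grace', 41], ['Linus', 28]) {
    my $emp = Employee->new;
    $emp->set_attributes(name => $row->[0], age => $row->[1]);
    push @objects, $emp;
}

# Cycle through all objects, keeping those that satisfy the expression.
my $matches = compile_query('age < 40', qw(name age));
my @young   = grep { $matches->($_) } @objects;

print join(', ', map { ($_->get_attributes('name'))[0] } @young), "\n";
# prints "Ada, Linus"
```

Every object is visited on every query, which is acceptable for prototypes and exactly the cost a database index avoids.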
Let us say you have sent your objects' data to a file, and tomorrow, some more attributes are added to the object implementation. The schema is said to have evolved. The framework has to be able to reconcile old data with newer object implementations.
Copyright © 2001 O'Reilly & Associates. All rights reserved.