Chapter 13
Repositories and archives

Repositories

This chapter looks at methods for storing digital materials in repositories and archiving them for the long term. It also examines the protocols that provide access to materials stored in repositories. It may seem strange that such important topics should be so late in the book, since long-term storage is central to digital libraries, but there is a reason. Throughout the book, the emphasis has been on what actually exists today. Research topics have been introduced where appropriate, but most of the discussion has been of systems that are used in libraries today. The topics in this chapter are less well established. Beyond the ubiquitous web server, there is little consensus about repositories for digital libraries and the field of digital archiving is new. The problems are beginning to be understood, but, particularly in the field of archiving, the methods are still embryonic.

A repository is any computer system whose primary function is to store digital material for use in a library. Repositories are the book shelves of digital libraries. They can be huge or tiny, storing millions of digital objects or just a single object. In some contexts a mobile agent that contains a few digital objects can be considered a repository, but most repositories are straightforward computer systems that store information in a file system or database and present it to the world through a well-defined interface.

Web servers

Currently, by far the most common form of repository is the web server. Panel 13.1 describes how web servers function. Several companies provide excellent web servers. The main differences between them lie in the associated programs that are linked to the web servers, such as electronic mail, indexing programs, security systems, electronic payment mechanisms, and other network services.

Panel 13.1
Web servers

A web server is a computer program whose task is to store files and respond to requests in HTTP and associated protocols. It runs on a computer connected to the Internet. This computer can be a dedicated web server, a shared computer that also runs other applications, or a personal computer that provides a small web site.

At the heart of a web server is a process called httpd. The letter "d" stands for "demon". A demon is a program that runs continuously, but spends most of its time idling until a message arrives for it to process. The HTTP protocol runs on top of TCP, the Internet transport protocol. TCP provides several addresses for every computer, known as ports. The web server is associated with one of these ports, usually port 80 but others can be specified. When a message arrives at this port, it is passed to the demon. The demon starts up a process to handle this particular message, and continues to listen for more messages to arrive. In this way, several messages can be processed at the same time, without tying up the demon in the details of their processing.
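
The pattern can be sketched in a few lines of Python (a toy illustration only, using the standard socketserver module, an arbitrary port, and a trivial handler; a real web server is far more elaborate). One process listens on a TCP port and hands each incoming connection to its own thread, so the listener is never tied up in the details of processing.

    import socketserver

    class ToyHandler(socketserver.StreamRequestHandler):
        """Handle one connection while the listening demon accepts others."""
        def handle(self):
            request_line = self.rfile.readline().decode("ascii", "replace").strip()
            # A real server would parse and act on the HTTP request; this sketch
            # simply acknowledges it.
            self.wfile.write(b"HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\n")
            self.wfile.write(("Received: " + request_line + "\n").encode("ascii"))

    if __name__ == "__main__":
        # ThreadingTCPServer dispatches each connection to a new thread.
        with socketserver.ThreadingTCPServer(("", 8080), ToyHandler) as server:
            server.serve_forever()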

The actual processing that a web server carries out is tightly controlled by the HTTP protocol. Early web servers did little more than implement the get command. With this command, the client sends a message containing a URL; the URL specifies a file stored on the server. The server retrieves this file and returns it to the client, together with its data type. The HTTP connection for this specific message then terminates.

As HTTP has added features and the size of web sites has grown, web servers have become more complicated than this simple description suggests. They have to support the full set of HTTP commands and extensions, such as CGI scripts. One of the requirements of web servers (and also of web browsers) is to continue to support older versions of the HTTP protocol. They have to be prepared for messages in any version of the protocol and to handle them appropriately. Web servers have steadily added extra security features, which add complexity. Version 1.1 of the protocol also includes persistent connections, which permit several HTTP commands to be processed over a single TCP connection.

High-volume web servers

The biggest web sites are so busy that they need more than one computer. Several methods are used to share the load. One straightforward method is simply to replicate the data on several identical servers. This is convenient when the number of requests is high but the volume of data is moderate, so that replication is feasible. A technique called "DNS round robin" is used to balance the load. It uses an extension of the domain name system that allows a domain name to refer to a group of computers with different IP addresses. For example, the domain name "www.cnn.com" refers to a set of computers, each of which has a copy of the CNN web site. When a user accesses this site, the domain name system chooses one of the computers to service the request.
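
The effect can be observed with a short Python sketch that asks the domain name system for all the addresses behind a single name (the host name is only an example, and the addresses returned depend entirely on how that site's DNS is configured at the moment of the lookup).

    import socket

    def addresses_for(host, port=80):
        """Return the distinct IP addresses that DNS currently offers for a host."""
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    # For a site using DNS round robin, several addresses are usually returned,
    # and successive lookups may list them in a different order.
    for address in addresses_for("www.cnn.com"):
        print(address)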

Replication of a web site is inconvenient if the volume of data is huge or if it is changing rapidly. Web search services provide an example. One possible strategy is to divide the processing across several computers. Some web search systems use separate computers to carry out the search, assemble the page that will be returned to the user, and insert the advertisements.

For digital libraries, web servers provide moderate functionality with low costs. These attributes have led to broad acceptance and a basic level of interoperability. The web owes much of its success to its simplicity, and web servers are part of that success, but some of their simplifying assumptions cause problems for the implementers of digital libraries. Web servers support only one object model, a hierarchical file system where information is organized into separate files. Their processing is inherently stateless; each message is received, processed, and forgotten.

Advanced repositories

Although web servers are widely used, other types of storage systems are used as repositories in digital libraries. In business data processing, relational databases are the standard way to manage large volumes of data. Relational databases are based on an object model that consists of data tables and relations between them. These relations allow data from different tables to be joined or viewed in various ways. The tables and the data fields within a relational database are defined by a schema and a data dictionary. Relational databases are excellent at managing large amounts of data with a well-defined structure. Many of the large publishers mount collections on relational databases, with a web server providing the interface between the collections and the user.
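
The flavor of the relational model can be conveyed with a small sketch (Python's built-in sqlite3 module; the table and field names are invented for illustration): a schema defines two tables, and a join combines data from both through the relation between them.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- The schema defines the tables and the fields within them.
        CREATE TABLE collections (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE items (
            id INTEGER PRIMARY KEY,
            collection_id INTEGER REFERENCES collections(id),
            title TEXT,
            format TEXT
        );
        INSERT INTO collections VALUES (1, 'Civil War photographs');
        INSERT INTO items VALUES (1, 1, 'Gettysburg, 1863', 'image/tiff');
        INSERT INTO items VALUES (2, 1, 'Antietam, 1862', 'image/tiff');
    """)

    # The join views data from the two tables together.
    query = """
        SELECT collections.title, items.title, items.format
        FROM items JOIN collections ON items.collection_id = collections.id
    """
    for row in conn.execute(query):
        print(row)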

Catalogs and indexes for digital libraries are usually mounted on commercial search systems. These systems have a set of indexes that refer to the digital objects. Typically, they have a flexible and sophisticated model for indexing information, but only a primitive model for the actual content. Many began as full-text systems and their greatest strength lies in providing information retrieval for large bodies of text. Some systems have added relevance feedback, fielded searching, and other features that they hope will increase their functionality and hence their sales.

Relational databases and commercial search systems both provide good tools for loading data, validating it, manipulating it, and protecting it over the long term. Access control is precise, and they provide services, such as audit trails, that are important in business applications. There is an industry-wide trend for database systems to add full-text searching, and for search systems to provide some parts of the relational database model. These extra features can be useful, but no company has yet created a system that combines the best of both approaches.

Although some digital libraries have used relational databases with success, the relational model of data, while working well with simple data structures, lacks flexibility for the richness of object models that are emerging. The consensus among the leading digital libraries appears to be that more advanced repositories are needed. The following sections outline some of the requirements that such repositories would have to meet.

Metadata in repositories

Repositories store both data and metadata. The metadata can be considered as falling into the general classes of descriptive, structural, and administrative metadata. Identifiers may need to distinguish elements of digital objects as well as the objects themselves. Storage of metadata in a repository requires flexibility, since there is a range of storage possibilities: metadata may be embedded within a digital object, stored alongside it in the repository, stored elsewhere and linked by a reference, or generated on demand.

One of the uses of metadata is for interoperability, yet every digital library has its own ideas about the selection and specification of metadata. The Warwick Framework, described in Panel 13.2, is a conceptual framework that offers some semblance of order to this potentially chaotic situation.

Panel 13.2
The Warwick Framework

The Warwick Framework is an attempt at a general model that can represent the various parts of a complex object in a digital library. The genesis of the framework was some ideas that came out of a 1996 workshop at the University of Warwick in England.

The basic concept is a way of organizing metadata. A vast array of metadata can apply to a single digital object, including descriptive metadata such as MARC cataloguing, access management metadata, structural metadata, and identifiers. The members of the workshop suggested that the metadata might be organized into packages. A typical package might be a Dublin Core metadata package or a package for geospatial data. This separation has obvious advantages for simplifying interoperability. If a client and a repository are both able to process a specific package type, they can reach some level of interoperation, even if the other metadata packages that they support are not shared.

Subsequently, Carl Lagoze of Cornell University and Ron Daniel, then at the Los Alamos National Laboratory, took this simple idea and developed an elegant way of looking at all the components of a digital object. Their first observation is that the distinction between data and metadata is frequently far from clear. To use a familiar analogy, is the contents page of a book part of the content, or is it metadata about the content of the book? In the Warwick Framework, such distinctions are unimportant. Everything is divided into packages and no distinction is made between data and metadata packages.

The next observation is that not all packages need to be stored explicitly as part of a digital object. Descriptive metadata is often stored in a different repository as a record in a catalog or index. Terms and conditions that apply to many digital objects are often best stored in separate policy records, not embedded in each individual digital object. This separation can be achieved by allowing indirect packages. The package is stored wherever is most convenient, with a reference stored in the repository. The reference may be a simple pointer to a location, or it may invoke a computer program whose execution creates the package on demand.
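
One way to picture the arrangement is sketched below (the Warwick Framework does not prescribe any particular implementation; the Python class and field names here are invented for illustration): a digital object holds a list of typed packages, some stored directly and some held as references to be resolved on demand.

    from dataclasses import dataclass, field
    from typing import Callable, List, Union

    @dataclass
    class Package:
        """A package stored explicitly within the digital object."""
        package_type: str        # e.g. "Dublin Core" or "terms and conditions"
        content: bytes

    @dataclass
    class IndirectPackage:
        """A package stored elsewhere; resolve() fetches or computes it on demand."""
        package_type: str
        resolve: Callable[[], bytes]

    @dataclass
    class DigitalObject:
        identifier: str
        packages: List[Union[Package, IndirectPackage]] = field(default_factory=list)

        def packages_of_type(self, package_type):
            # A client that understands one package type can ignore all the others.
            return [p for p in self.packages if p.package_type == package_type]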

Digital objects of the same structural type will usually be composed of a specific group of packages. This forms an object model for interoperability between clients and repositories for that type of digital object.

The Warwick Framework has not been implemented explicitly in any large-scale system, but the ideas are appealing. The approach of dividing information into well-defined packages simplifies the specification of digital objects and provides flexibility for interoperability.

Protocols for interoperability

Interoperability requires protocols that clients use to send messages to repositories and that repositories use to return information to clients. At the most basic level, functions are needed to deposit information in a repository and to provide access to it. The implementation of effective systems must also take into account that the client needs to discover the structure of digital objects, that different types of objects require different access methods, and that access management may require authentication or negotiation between client and repository. In addition, clients may wish to search indexes within the repository.

Currently, the most commonly used protocol in digital libraries is HTTP, the access protocol of the web, which is discussed in Panel 13.3. Another widely used protocol is Z39.50; because of its importance in information retrieval, it was described in Chapter 11.

Panel 13.3
HTTP

Chapter 2 introduced the HTTP protocol and described the get message type. A get message is an instruction from the client to the server to return whatever information is identified by the URL included in the message. If the URL refers to a process that generates data, it is the data produced by the process that is returned.

The response to a get command has several parts. It begins with a status, which is a three-digit code. Some of these codes are familiar to users of the web because they are error conditions, such as 404, the error code returned when the resource addressed by the URL is not found. Successful status codes are followed by technical information, which is used primarily to support proxies and caches. This is followed by metadata about the body of the response. The metadata provides information to the client about the data type, its length, language and encoding, a hash, and date information. The client uses this metadata to process the final part of the message, the response body, which is usually the file referenced by the URL.
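
The parts of the response can be seen with Python's standard http.client module (the host name is an example only, and the exact headers returned depend on the server).

    import http.client

    conn = http.client.HTTPConnection("www.example.com", 80)
    conn.request("GET", "/index.html")
    response = conn.getresponse()

    print(response.status, response.reason)    # the three-digit status code, e.g. 200 OK
    for name, value in response.getheaders():  # metadata: type, length, date, and so on
        print(name + ":", value)

    body = response.read()                     # the response body, usually the file itself
    print(len(body), "bytes of", response.getheader("Content-Type"))
    conn.close()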

Two other HTTP message types are closely related to get. A head message asks the server for the same response as a get message, except that the message body itself is not sent. This is useful for testing hypertext links for validity, accessibility, or recent modification, without the need to transfer large files. The post message is used to extend the amount of information that a client sends to the server. A common use is to provide a block of data, such as when a client submits an HTML form. This can then be processed by a CGI script or other application at the server.

The primary use of HTTP is to retrieve information from a server, but the protocol can also be used to change information on a server. A put message is used to store specified information at a given URL and a delete message is used to delete information. These are rarely used. The normal way to add information to a web server is by separate programs that manipulate data on the server directly, not by HTTP messages sent from outside.

Many of the changes that have been made to HTTP since its inception are to allow different versions to coexist and to enhance performance over the Internet. HTTP recognizes that many messages are processed by proxies or by caches. Later versions include a variety of data and services to support such intermediaries. There are also special message types: options, which allows a client to request information about the communication options that are available, and trace, which is used for diagnostics and testing.

Over the years, HTTP has become more elaborate, but it is still a simple protocol. The designers have done a good job in resisting pressures to add more and more features, while making some practical enhancements to improve its performance. No two people will agree exactly what services a protocol should provide, but HTTP is clearly one of the Internet's success stories.

Object-oriented programming and distributed objects

One line of research is to develop the simplest possible repository protocol that supports the necessary functions. If the repository protocol is simple, information about complex object types must be contained in the digital objects. (This has been called "SODA" for "smart object, dumb archives".)

Several advanced projects are developing architectures that use the computing concept of distributed objects. The word "object" in this context has a precise technical meaning, which is different from the terms "digital object" and "library object" used in this book. In modern computing, an object is an independent piece of computer code, together with its data, that can be used and reused in many contexts. The information within an object is encapsulated, so that the internals of the object are hidden. All that the outside world knows about a class of objects is a public interface, consisting of methods, which are operations on the object, and instance data. The effect of a particular method may vary from class to class; in a digital library, a "render" method might have different interpretations for different classes of object.
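
A short Python sketch (the class names are invented) shows the idea: the public interface promises a render method, the instance data is hidden behind it, and the effect of the method differs from class to class.

    from abc import ABC, abstractmethod

    class LibraryObject(ABC):
        """The public interface: every class of object supports render."""
        @abstractmethod
        def render(self) -> str: ...

    class TextObject(LibraryObject):
        def __init__(self, text):
            self._text = text                  # encapsulated instance data

        def render(self) -> str:
            return self._text                  # rendering text returns the characters

    class ImageObject(LibraryObject):
        def __init__(self, pixels):
            self._pixels = pixels

        def render(self) -> str:
            # For an image, "render" means something quite different.
            return "<image, %d pixels>" % len(self._pixels)

    for obj in (TextObject("Chapter 13"), ImageObject(pixels=[0] * 1024)):
        print(obj.render())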

After decades of development, object-oriented programming languages, such as C++ and Java, have become accepted as the most productive way to build computer systems. The driving force behind object-oriented programming is the complexity of modern computing. Object-oriented programming allows components to be developed and tested independently, without needing to be revised for subsequent versions of a system. Microsoft is a heavy user of object-oriented programming to develop its own software. Versions of its object-oriented environment are known variously as OLE, COM, DCOM, or ActiveX. They are all variants of the same key concepts.

Distributed objects generalize the idea of objects to a networked environment. The basic concept is that an object executing on one computer should be able to interact with an object on another, through its published interface, defined in terms of methods and instance data. The leading computer software companies - with the notable exception of Microsoft - have developed a standard for distributed objects known as CORBA. CORBA provides the developers of distributed computing systems with many of the same programming amenities that object-oriented programming provides within a single computer.

The key notion in CORBA is the Object Request Broker (ORB). When an ORB is added to an application program, it establishes client-server relationships between objects. Using an ORB, a client can transparently invoke a method on a server object, which might be on the same machine or across a network. The ORB intercepts the call; it finds an object that can implement the request, passes it the parameters, invokes its method, and returns the results. The client does not have to be aware of where the object is located, its programming language, its operating system, or any other system aspects that are not part of the object's interface. Thus, the ORB provides interoperability between applications on different machines in heterogeneous distributed environments.
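
CORBA itself is a large specification. As a rough analogy only, the sketch below uses Python's built-in XML-RPC machinery (which is not CORBA, but illustrates the same flavor of remote invocation): a server publishes an object's methods, and a client calls them as though the object were local, without knowing where it runs or how it is implemented.

    # Server side: publish an object's methods for remote clients to invoke.
    from xmlrpc.server import SimpleXMLRPCServer

    class Repository:
        def get_title(self, identifier):
            # A real repository would look this up; the value here is invented.
            return "Title of " + identifier

    server = SimpleXMLRPCServer(("localhost", 9000), allow_none=True)
    server.register_instance(Repository())
    # server.serve_forever()                  # run this in one process

    # Client side, in another process or on another machine:
    # import xmlrpc.client
    # repo = xmlrpc.client.ServerProxy("http://localhost:9000/")
    # print(repo.get_title("object-123"))     # looks just like a local method call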

Data hiding

Objects in the computing sense and digital objects in the library sense are different concepts, but they have features in common. The term data hiding comes from object-oriented programming, but applies equally to digital objects in libraries. When a client accesses information in a repository, it needs to know the interface that the repository presents to the outside world, but it does not need to know how the information is stored in the repository. With a web server, the interface is a protocol (HTTP), an address scheme (URL), and a set of formats and data types. With other repositories, it can be expressed in the terminology of object-oriented programming. What the user perceives as a single digital object might be stored in the repository as a complex set of files, as records in tables, or as active objects that are executed on demand.

Hiding the internal structure and providing all access through a clearly defined interface greatly simplifies interoperability. Clients benefit because they do not need to know the internal organization of repositories. Two repositories may choose to organize similar information in different ways. One may store the sound track and pictures of a digitized film as two separate digital objects, the other as a single object. A client program should be able to send a request to begin playback, unaware of these internal differences. Repositories benefit because internal reorganization is entirely a local concern. What the user sees as a single digital object may in fact be a page in HTML format with linked images and Java applets. With data hiding, it is possible to move the images to a different location or change to a new version of Java, invisibly to the outside.

Chapter 3 introduced two major projects that both offer users thumbnail images of larger images: JSTOR and the National Digital Library Program at the Library of Congress. The Library of Congress has decided to derive the thumbnails in advance and to store them as separate data. JSTOR does not store thumbnails; they are computed on demand from the stored form of the larger images. Each approach is reasonable. These are internal decisions that should be hidden from the user interface and could be changed later. External systems need to know only that the repository can supply a thumbnail. They do not need to know how it is created.
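
The point can be expressed as a common interface (a sketch only; the method names and paths are invented) behind which one repository returns a thumbnail stored in advance while another computes one on demand, and a client cannot tell the difference.

    from abc import ABC, abstractmethod

    class Repository(ABC):
        @abstractmethod
        def get_thumbnail(self, identifier) -> bytes:
            """Return a small image for the object; how it is produced is hidden."""

    class PrecomputedThumbnails(Repository):
        """Thumbnails derived in advance and stored as separate data."""
        def get_thumbnail(self, identifier) -> bytes:
            with open("/thumbnails/%s.gif" % identifier, "rb") as f:   # path is illustrative
                return f.read()

    class OnDemandThumbnails(Repository):
        """Thumbnails computed from the stored full-size image when requested."""
        def get_thumbnail(self, identifier) -> bytes:
            full_image = self._read_full_image(identifier)
            return self._scale_down(full_image)     # internal details hidden from clients

        def _read_full_image(self, identifier) -> bytes: ...
        def _scale_down(self, image) -> bytes: ...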

Legacy systems

Conventional attempts at interoperability have followed a path based on agreed standards. The strategy is to persuade all the different organizations to decide on a set of technical and organizational standards. For a complex system, such as a digital library, this is an enormous undertaking. Standards are needed for networking, data types, methods of identification, security, searching and retrieval, reporting errors, and exchanging payment. Each of these standards will have several parts describing syntax, semantics, error procedures, extensions, and so forth. If a complete set of standards could be agreed upon and every organization implemented them in full, then a wonderful level of interoperability might be achieved. In practice, the pace of standardization is slower than the rate of change of technology, and no organization ever completely integrates all the standards into its systems before it begins to change those systems to take advantage of new opportunities or to avoid new difficulties.

Hence, a fundamental challenge of interoperability is how different generations of systems can work together. Older systems are sometimes given the disparaging name legacy systems, but many older systems do a fine job. Future plans always need to accommodate existing systems and commitments. Panel 13.4 describes research at Stanford University that accepts legacy systems for what they are and builds simple programs, known as proxies, to translate between the external view that each system presents and a neutral view. They call this approach the InfoBus.

Panel 13.4
The Stanford InfoBus

One of the projects funded by the Digital Libraries Initiative was at Stanford University, under the leadership of Hector Garcia-Molina, Terry Winograd, and Andreas Paepcke. They tackled the extremely challenging problem of interoperability between existing systems. Rather than define new standards and attempt to modify older systems, they accept the systems as they find them.

The basic approach is to construct Library Service Proxies, which are CORBA objects representing online services. At the back end, these proxies communicate with the services via whatever communication channel they provide. At the front end, the proxies provide interfaces defined in terms of CORBA methods. For example, a client with a Z39.50 search interface might wish to search an online search service, such as Dialog. This requires two proxies. One translates between the Z39.50 search protocol and the InfoBus model. The second translates between the Dialog interface and the InfoBus model. By using this pair of proxies, the client can search Dialog despite their different interfaces.

Perhaps the most interesting InfoBus tools are those that support Z39.50. Stanford has developed a proxy that allows Z39.50 clients to interact with search services that do not support Z39.50. Users can submit searches to this proxy through any user interface that was designed to communicate with a Z39.50 server. The proxy forwards the search requests through the InfoBus to any of the InfoBus-accessible sources, even sources that do not support Z39.50. The proxy then converts the results into a format that Z39.50 clients can understand. In a parallel activity, researchers at the University of Michigan implemented another proxy that makes all Z39.50 servers accessible on the InfoBus. The Stanford project also constructed proxies for HTTP and for web search systems, including Lycos, WebCrawler, and AltaVista, and for other web services, such as ConText, Oracle's document summarization tool.

Archiving

Archives are the raw material of history. In the United States, the National Archives and Records Administration has the task of keeping records "until the end of the republic". Hopefully, this will be a long time. At the very least, archives must be prepared to keep library information longer than any computer system that exists today, and longer than any electronic or magnetic medium has ever been tested. Digital archiving is difficult. It is easier to state the issues than to resolve them, and there are few good examples to use as exemplars. The foundation of modern work on digital archiving was the report of the Task Force on Archiving of Digital Information, described in Panel 13.5.

Panel 13.5
Task Force on Archiving of Digital Information

The Task Force on Archiving of Digital Information was established in late 1994 by the Commission on Preservation and Access and the Research Libraries Group to study the problems of digital preservation. It was chaired by John Garrett, then at CNRI, and Donald Waters, of Yale University. The report was the first comprehensive account to look at these issues from a legal, economic, organizational, and technical viewpoint.

The report illustrates the dangers of technological obsolescence with some frightening examples. For example, the report noted that computer programs no longer exist to analyze the data collected by the New York Land Use and Natural Resources Inventory Project in the 1960s.

The task force explored the possibility of establishing a national system of archives as a basis for long-term preservation. While it considered such a system desirable, perhaps even essential, it recognized that it is not a panacea. The fundamental responsibility for maintaining information lies with the people and organizations who manage it. They need to consider themselves part of a larger system.

Technically, the report is important in stressing that archiving is more than simply copying bits from one aging storage medium to another; the methods to interpret and manipulate the bits must also be preserved. For this reason, the report considers that migration of information from one format to another and from one type of computer system to another is likely to be the dominant form of digital archiving.

The report is particularly interesting in its discussions of the legal issues associated with archiving. Much of the information that should be archived is owned by organizations who value the intellectual property, for financial or other reasons. They are naturally hesitant about providing copies for archives. Archives, on the other hand, may have to be assertive. When information is in danger of being lost, perhaps because a company is going out of business, they need to be able to acquire information for the public good.

The task force carried out its work in 1995. Its report remains the most complete study of the field. Although its financial calculations now appear dated, the analysis of the options and choices available has not been seriously questioned.

Conventional archiving distinguishes between conservation, which looks after individual artifacts, and preservation, which retains the content even if the original artifact decays or is destroyed. The corresponding techniques in digital archiving are refreshing, which aims to preserve precise sequences of bits, and migration, which preserves the content at a semantic level, but not the specific sequences of bits. This distinction was first articulated by the Task Force on Archiving of Digital Information, which recommended a focus on migration as the basic technique of digital archiving.

Both refreshing and migration require periodic effort. Business records are maintained over long periods of time because a team of people is paid to maintain that specific data. It is their job, and they pay attention to the inter-related issues of security, back-up, and long-term availability of data. Publishers have also come to realize that their digital information is an asset that can generate revenue for decades and are looking after it carefully, but in many digital collections, nobody is responsible for preserving the information beyond its current usefulness. Some of the data may be prized centuries from now, but today it looks of no consequence. Archiving the data is low on everybody's list of priorities and is the first thing to be cut when budgets are tight.

Storage

In the past, whether a physical artifact survived depended primarily on the longevity of its materials. Whether the artifacts were simple records from churches and governments, or treasures such as the Rosetta Stone, the Dead Sea Scrolls, the Domesday Book, and the Gutenberg Bibles, the survivors have been those that were created with materials that did not perish, notably high-quality paper.

Of today's digital media, none can be guaranteed to last for long periods. Some, such as magnetic tape, have a frighteningly short life span before they deteriorate. Others, such as CDs, are more stable, but nobody will predict their ultimate life. Therefore, unless somebody pays attention, all digital information will be lost within a few decades. Panel 13.6 describes some of the methods that are used to store digital materials today. Notice that the emphasis is on minimizing the cost of equipment and providing fast retrieval times, not on longevity.

Panel 13.6
Storing digital information

Storage media

The ideal storage medium for digital libraries would allow vast amounts of data to be stored at low cost, would be fast to store and read information, and would be exceptionally reliable and long lasting.

Rotating magnetic disks are the standard storage medium in modern computer systems. Sizes range from a fraction of a gigabyte to arrays of thousands of gigabytes. (A gigabyte is a thousand million bytes.) Disks are fast enough for most digital library applications, since data can be read from disks faster than it can be transmitted over networks. When data is read from a disk, there is a slight delay (about 15 milliseconds) while the disk heads are aligned to begin reading; then data is read in large blocks (typically about 25 megabytes per second). These performance characteristics suit digital library applications, which typically read large blocks of data at a time.
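
A short calculation with the figures quoted above (a sketch; real disks vary) shows why these characteristics suit digital libraries: for a reasonably large object the transfer time dominates and the seek delay hardly matters.

    SEEK_TIME = 0.015            # seconds: about 15 milliseconds to align the heads
    TRANSFER_RATE = 25_000_000   # bytes per second: about 25 megabytes per second

    def read_time(size_in_bytes):
        """Approximate time to read one object of the given size from disk."""
        return SEEK_TIME + size_in_bytes / TRANSFER_RATE

    for size in (5_000, 1_000_000, 50_000_000):   # a text page, an image, a video clip
        print("%10d bytes: %.3f seconds" % (size, read_time(size)))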

The decline in the cost of disks is one of the marvels of technology. Magnetic disks are coming down in price even faster than semiconductors. In 1998, the price of disks was a few hundred dollars per gigabyte. The technology is advancing so rapidly that digital libraries can confidently plan that in ten years' time the cost will be no more than five percent, and very likely less than one percent, of today's.

Disks have a weakness: reliability. The data on them is easily lost, either because of a hardware failure or because a program over-writes it. To guard against such failures, standard practice is for the data to be regularly copied onto other media, usually magnetic tape. It is also common to have some redundancy in the disk stores so that simple errors can be corrected automatically. Neither disks nor magnetic tape can be relied on for long term storage. The data is encoded on a thin magnetic film deposited on some surface. Sooner or later this film decays. Disks are excellent for current operations, but not for archiving.

Hierarchical stores

Large digital library collections are sometimes stored in hierarchical stores. A typical store has three levels: magnetic disks, optical disks, and magnetic tapes. The magnetic disks are permanently online. Information can be read in a fraction of a second. The optical disks provide a cheaper way of storing huge amounts of data, but the disk platters are stored in an automated silo. Before an optical disk can be used, a robot must move it from the silo to a disk reader, which is a slow process. The magnetic tapes are also stored in a silo with a robot to load them.

To the computers that use it, a hierarchical store appears to be a single coherent file system. Less frequently used data migrates from the higher-speed but more expensive magnetic disks to the slower but cheaper media. As the cost and capacity of magnetic disks continue to fall dramatically, the need for an intermediate level of storage becomes questionable. The magnetic disks and tapes serve distinctive functions and are both necessary, but the intermediate storage level may not be needed in the future.

Compression

Digital libraries use huge amounts of storage. A page of ASCII text may be only a few thousand characters, but a scanned color image, only one inch square, requires more than a megabyte (one million bytes). An hour of digitized sound, as stored on a compact disk, is over 600 megabytes, and a minute of video can have more than one gigabyte of data before compression.
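
These figures can be checked with simple arithmetic. The sketch below assumes a scan of 600 dots per inch with 24-bit color (the resolution is an assumption, not stated above) and uses the standard parameters of compact disk audio.

    # A one-inch-square color scan at 600 dots per inch, 3 bytes (24 bits) per pixel:
    image_bytes = 600 * 600 * 3
    print("one square inch scanned: %.1f megabytes" % (image_bytes / 1_000_000))   # about 1.1

    # Compact disk audio: 44,100 samples per second, 2 bytes per sample, 2 channels:
    audio_bytes = 44_100 * 2 * 2 * 60 * 60
    print("one hour of CD audio: %.0f megabytes" % (audio_bytes / 1_000_000))      # about 635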

To reduce these storage requirements, large items are compressed, including almost all images, sound, and video. The basic idea of compression is simple, though the mathematics are complex. Digitized information contains redundancy. A page image has areas of white space; it is not necessary to encode every single pixel separately. Successive frames of video differ only slightly from each other; rather than encode each frame separately, it is simpler to record the differences between them.

Compression methods can be divided into two main categories. Lossless compression removes redundant information in a manner that is completely reversible; the original data can be reconstructed exactly as it was. Lossy compression cannot be reversed; approximations are made during the compression that lose some of the information. In some applications, compression must be lossless. In a physics experiment, a single dot on an image may be crucial evidence; any modification of the image might undermine the validity of the experiment. In most applications, however, some losses are acceptable. The JPEG compression used for images and the MPEG compression used for video are lossy methods, calibrated to produce results that are very satisfactory to the human eye.
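
The reversibility of lossless compression can be demonstrated with Python's standard zlib module: after compression and decompression the data is identical, byte for byte, to the original.

    import zlib

    # A crude stand-in for a scanned page: long runs of identical bytes (white space)
    # are redundant and compress extremely well.
    page = b"\xff" * 90_000 + b"\x00" * 10_000

    compressed = zlib.compress(page)
    restored = zlib.decompress(compressed)

    print("original:  ", len(page), "bytes")
    print("compressed:", len(compressed), "bytes")
    print("identical after decompression:", restored == page)   # True: lossless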

Compression methods reduce the size of data considerably, but the files are still large. After compression, a monochrome page of scanned text is more than 50,000 bytes. MPEG compression reduces digitized video from 20 or 30 megabytes per second to 10 megabytes per minute. Since digital libraries may store millions of these items, storage capacity is an important factor.

Replication and refreshing

Replication is a basic technique of data processing. Important data that exists only as a single copy on one computer is highly vulnerable. The hardware can fail; the data can be obliterated by faulty software; an incompetent or dishonest employee can remove the data; the computer building may be destroyed by fire, flooding, or other disaster. For these reasons, computer centers make routine copies of all data for back-up and store this data in safe locations. Good organizations go one step further and periodically consolidate important records for long term storage. One approach is to retain copies of financial and legal records on microform, since archival quality microform is exceptionally durable.

Because all types of physical media on which digital information is stored have short lives, methods of preservation require that the data be copied periodically onto new media. Digital libraries must plan to refresh their collections in the same manner. Every few years the data must be moved onto new storage media. From a financial viewpoint, this is not a vast challenge. For the next few decades, computing equipment will continue to tumble in price while increasing in capacity. The equipment needed to migrate today's data ten years from now will cost a few percent of what it costs today, and robots can minimize the labor involved.
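
A minimal sketch of the refreshing step is given below (the file paths are examples only; a production system would also preserve metadata and keep an audit trail). Each file is copied onto the new media and a checksum confirms that the bits are unchanged.

    import hashlib
    import shutil

    def checksum(path):
        """Return the SHA-256 digest of a file, read in modest chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def refresh(old_path, new_path):
        """Copy a file onto new storage media and confirm that every bit survived."""
        shutil.copyfile(old_path, new_path)
        if checksum(old_path) != checksum(new_path):
            raise IOError("refresh failed: copies differ for " + old_path)

    # refresh("/old_media/object-123", "/new_media/object-123")   # paths are illustrative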

Preserving content by migration

Assume (a big assumption) that the bits are systematically refreshed from medium to medium, so that the technical problem of preserving the raw data is resolved. The problems are just beginning. Digital information is useless unless the formats, protocols, and metadata can be recognized and processed. Ancient manuscripts can still be read because languages and writing have changed slowly over the years. Considerable expertise is needed to interpret old documents, but expertise has been passed down through generations and scholars can decipher old materials through persistence and inspiration.

Computing formats change continually. File formats of ten years ago may be hard to read. Programs written for some computers that were widespread only a short time ago can no longer be run on any computer in the world. Some formats are fairly simple; if, at some future date, an archeologist were to stumble onto a file of ASCII text, even if all knowledge of ASCII had been lost, the code is sufficiently simple that the text could probably be interpreted. But ASCII is an exception. Other formats are highly complex; it is hard to believe that anybody could ever decipher MPEG compression without a record of the underlying mathematics, or understand a large computer program from its machine code.

Therefore, in addition to the raw data, digital archiving must preserve ways to interpret the data: to understand its type, its structure, and its formats. If a computer program is needed to interpret the data, then the program must be preserved, along with some device that can execute it, or the data must be migrated to a different form. In the near term, it is possible to keep old computer systems for these purposes, but computers have a short life span. Sooner or later the computer will break down, spare parts will no longer be available, and any program that depends upon the computer will be useless. Migration of the content then becomes necessary.

Migration has been standard practice in data processing for decades. Businesses, such as pension funds, maintain records of financial transactions over many years. In the United States, the Social Security Administration keeps a record of payroll taxes paid on behalf of all workers throughout their careers. These records are kept on computers, but the computer systems are changed periodically. Hardware is replaced and software systems are revised. When these changes take place, the data is migrated from computer to computer, and from database to database. The basic principle of migration is that the formats and structure of the data may be changed, but the semantics of the underlying content is preserved.
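
As a toy example of the principle (the record layout is invented, not the Social Security Administration's), the sketch below migrates a fixed-width record into a labeled, self-describing form; the format changes, but the meaning of the data does not.

    import json

    # Old format: fixed-width fields -- worker id (9), year (4), tax paid in cents (10).
    OLD_RECORD = "123456789" + "1998" + "0000123456"

    def migrate(record):
        """Convert a fixed-width record to a labeled form with the same semantics."""
        return {
            "worker_id": record[0:9],
            "year": int(record[9:13]),
            "tax_paid_dollars": int(record[13:23]) / 100,
        }

    print(json.dumps(migrate(OLD_RECORD)))
    # {"worker_id": "123456789", "year": 1998, "tax_paid_dollars": 1234.56}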

Another method that is sometimes suggested by people with limited knowledge of computing is emulation; the idea is to specify in complete detail the computing environment that is required to execute a program. Then, at any time in the future, an emulator can be built that will behave just like the original computing environment. In a few, specialized circumstances this is a sensible suggestion. For example, it is possible to provide such a specification for a program that renders a simple image format, such as JPEG. In all other circumstances, emulation is a chimera. Even simple computing environments are much too complex to specify exactly. The exact combination of syntax, semantics, and special rules is beyond comprehension, yet subtle, esoteric aspects of a system are often crucial to correct execution.

Digital archeology

Societies go through periods of stress, including recessions, wars, political upheavals, and other times when migrating archival material is a low priority. Physical artifacts can lie forgotten for centuries in attics and storerooms, yet still be recovered. Digital information is less forgiving. Panel 13.7 describes how, during one such period of stress, the collapse of East Germany, the digital archives of the state came close to being lost. It is an example of digital archeology, the process of retrieving information from damaged, fragmentary, and archaic data sources.

Panel 13.7
Digital archeology in Germany

An article in the New York Times in March 1998 provided an illustration of the challenges faced by archivists in a digital world unless data is maintained continuously from its initial creation.

In 1989, when the Berlin Wall was torn down and Germany was reunited, the digital records of East Germany were in disarray. The German Federal Archives acquired a mass of punched cards, magnetic disks, and computer tapes that represented the records of the Communist state. Much of the media was in poor condition, the data was in undocumented formats, and the computer centers that had maintained it were hastily shut down or privatized. Since then, a small team of German archivists has been attempting to reconstruct the records of East Germany. They call themselves "digital archeologists" and the term is appropriate.

The first problem faced by the digital archeologists was to retrieve the data from the storage media. Data on even the best-quality magnetic tape has a short life, and this tape was in poor condition, so that the archeologists could read it only once. In many cases the data was stored on tape following a Russian system that is not supported by other computers. Over the years, the archivists have obtained several of these computers, but some 30 percent of the data is unreadable.

Once the data had been copied onto other media, the problems were far from solved. To save space, much of the data had been compressed in obscure and undocumented ways. An important database of Communist officials illustrates some of the difficulties. Since the computers on which the data had been written and the database programs were clones of IBM systems, recovering the database itself was not too difficult, but without documentation, interpreting the data was extremely difficult. The archivists had one advantage: they were able to interview some of the people who built these databases, and have used their expertise to interpret much of the information and preserve it for history.

Dr. Michael Wettengel, the head of the German group of archivists, summarizes the situation clearly, "Computer technology is made for information processing, not for long-term storage."

Creating digital libraries with archiving in mind

Since digital archiving has so many risks, what can we do today to enhance the likelihood that the digital archeologist will be able to unscramble the bits? Some simple steps are likely to make a big difference. The first is to create the information in formats that are widely adopted today. This increases the chance that, when a format becomes obsolete, conversion programs to new formats will be available. For example, HTML and PDF are so widely used in industry that viewers will surely be available many years from now.

One interesting suggestion is to create an archive that contains the definitions of formats, metadata standards, protocols, and the other building blocks of digital libraries. This archive should be on the most persistent media known, such as paper or microfilm, and everything should be described in simple text. If the formats and encoding schemes are preserved, the information can still be recovered. Future digital archeologists may have a tough job creating an interpreter that can resolve long-obsolete formats or instruction sets, but it can be done. Modern formats are complex. Whereas a digital archeologist might reverse engineer the entire architecture of an early IBM computer from a memory dump, the archeologist will be helpless with more complex materials unless the underlying specification is preserved.

Perhaps the most important way that digital libraries can support archiving is through selection. Not everything needs to be preserved. Most information is intended to have a short life; much is ephemeral, or valueless. Publishers have always made decisions about what to publish and what to reject; even the biggest libraries acquire only a fraction of the world's output. Digital libraries are managed collections of information. A crucial part of that management is deciding what to collect, what to store, what to preserve for the future, and what to discard.


