Innovation and research
The research process
Innovation is a theme that runs throughout digital libraries and electronic publishing. Some of this innovation comes from the marketplace's demand for information, some from systematic research in universities or corporations. This chapter looks at the process of research and provides an overview of areas of current research. Most of the topics that are introduced in this chapter are discussed in greater detail later in the book.
Innovation in libraries and publishing
Until recently, innovation by libraries and publishers was far from systematic. Most had no budget for research and development. The National Library of Medicine, which has supported digital libraries since the 1960s, and OCLC's Office of Research were notable exceptions; both carry out research in the true sense of exploring new ideas without preconceptions of how those ideas will eventually be deployed, but they have been most unusual. Other libraries and publishers take an approach to innovation that is motivated by more practical considerations: how to provide enhanced services and products in the near term. The rate of innovation is controlled by the state of technology and the availability of funds, but also by an interpretation of the current demand for services.
Although they are well-organized to bring new ideas to the market, publishers carry out little research. Advanced projects are seen as business development, and the large publishers have the resources to undertake substantial investments in new ventures. Chapter 3 looked at the movement to make scientific journals available online, illustrated by the Association for Computing Machinery's digital library. This cannot be called research, but it is a complex task to change the entire publishing process, so that journals are created in computer formats, such as SGML, which can be read online or used as the source of printed journals.
Libraries tend to be more innovative than publishers, though they often appear poorly organized in their approach to change. Typically, they spend almost their entire budget on current activities. Innovation is often treated as an extra, not as the key to the future with its own resources. Large libraries have large budgets. Many employ several hundred people. Yet the budget systems are so inflexible that research and innovation are grossly under-staffed.
Thus, at the Library of Congress, the National Digital Library Program is a vital part of the library's vision and may be the most important library project in the United States. The program is a showcase of how the library can expand beyond its traditional role, by managing its collections differently, handling new types of material, and providing wider access to its collections. Yet the Library of Congress provides little financial support to the project. The program raises most of its funds from private foundations and other gifts, and is staffed by people on short-term contracts. The Library of Congress does support technical areas, such as Z39.50 and MARC. However, the library's most visionary project has no long-term funding.
Despite such apparent conservatism, libraries are moving ahead on a broad front, particularly in universities. The structure of university libraries inhibits radical change, but university librarians know that computing is fundamental to the future of scholarly communication. Computer scientists can be seen as the hares of digital libraries, leaping ahead, trying out new fields, then jumping somewhere else. The large libraries are the tortoises. They move more slowly, but at each step they lay a foundation that can be built on for the next. The steps often take the form of focused projects, sometimes in partnership with other organizations, often with grants from foundations. The individual projects may not have the visibility of the big government-funded research initiatives, but the collective impact may well have an equal importance in the long term.
Many notable projects have their origins in university libraries. Some are converting materials to digital formats; others are working with publishers to make materials accessible online. Several projects that were mentioned in Chapter 3 are of this type, including HighWire Press at Stanford University, which is putting scientific journals online, the collaboration of university libraries and Elsevier Science in the Tulip project to explore digitized versions of scientific journals, and the contribution of the University of Michigan and Princeton University to the JSTOR project to convert historic back-runs of important journals. Another partnership, between Rutgers University and Princeton, created the Center for Electronic Texts in the Humanities. Each of these projects came out of the university libraries.
Most projects are made possible by funds from foundations, industry, or the government. For example, our Mercury project at Carnegie Mellon received major grants from the Pew Charitable Trusts, Digital Equipment Corporation, and DARPA, with smaller, but welcome, support from other donors. Several private foundations have been strong supporters of digital libraries for the humanities, notably the Andrew W. Mellon Foundation, and the J. Paul Getty Trust, which specializes in art and art history information. Finally, although its budget is small compared with the NSF or DARPA, the National Endowment for the Humanities devotes a significant proportion of its funds to digital libraries.
A focal point for innovation amongst libraries and publishers in the United States is the Coalition for Networked Information, known as CNI. Most of the people at the twice yearly meetings are from university libraries or computing centers, but the meetings also attract key individuals from publishers, computer companies, national libraries, and the government. In recent years, many people from outside the United States have attended.
The coalition is a partnership of the Association of Research Libraries and Educause. It was founded in March 1990 with a mission to help realize the promise of high-performance networks and computers for the advancement of scholarship and the enrichment of intellectual productivity. More than two hundred institutions are members of the Coalition Task Force. They include higher education institutions, publishers, network service providers, computer hardware, software, and systems companies, library networks and organizations, and public and state libraries.
In 1991, several years before web browsers were introduced, CNI set an example by being one of the first organizations to create a well-managed information service on the Internet. The following list is taken from its web site. It is a sample of the coalition's projects over the years since its founding. Some of these projects had many participants and lasted for several years, while others were a single event, such as a conference. All of them made their impact by bringing together people from various fields to work side-by-side on shared problems.
This list illustrates CNI's broad ranging interests, which emphasize practical applications of digital libraries, collections, relationships between libraries and publishers, and policy issues of access to intellectual property. What the list does not show is the human side of CNI, established by its founding director, the late Paul Evan Peters, and continued by his successor, Clifford Lynch. CNI meetings are noted for the warmth with which people from different fields meet together. People whose professional interests may sometimes appear in conflict have learned to respect each other and come to work together. Progress over the past few years has been so rapid that it is easy to forget the vacuum that CNI filled by bringing people together to discuss their mutual interests in networked information.
Computer science research
Federal funding of research
Computer scientists take research seriously. They are accustomed to long-term projects, where the end result is not products or services, but new concepts or a deeper understanding of the field. In the United States, much of the money for research comes from federal agencies. The world's biggest sponsor of computer science research is DARPA, the Defense Advanced Research Projects Agency. (DARPA keeps changing its name. It was originally ARPA, changed to DARPA, back to ARPA, and back again to DARPA.) The next largest is the National Science Foundation (NSF). DARPA is a branch of the Department of Defense. Although ultimately its mission is to support the military, DARPA has always taken a broad view and encourages fundamental research in almost every aspect of computer science. It is particularly noted for its emphasis on research projects that build large, experimental systems. The NSF has a general responsibility for promoting science and engineering in the United States. Its programs invest over $3.3 billion per year in almost 20,000 research and education projects. Thus it supports research both in computer science and in applications of computing to almost every scientific discipline.
Many engineering and computer companies have large budgets for computing research and development, often as much as ten percent of total operations. Most industrial research is short term, aimed at developing new products, but fundamental advances in computing have come from industrial laboratories, such as Xerox PARC, Bell Laboratories, and IBM. More recently, Microsoft has established an impressive research team.
Much of the underlying technology that makes digital libraries feasible was created by people whose primary interests were in other fields. The Internet, without which digital libraries would be very different, was originally developed by DARPA (then known as ARPA) and the NSF. The web was developed at CERN, a European physics laboratory with substantial NSF funding. The first web browser, Mosaic, was developed at the NSF supercomputing center at the University of Illinois. Several areas of computer science research are important to information management, usually with roots going back far beyond the current interest in digital libraries; they include networking, distributed computer systems, multimedia, natural language processing, databases, information retrieval, and human-computer interactions. In addition to funding specific research, the federal agencies assist the development of research communities and often coordinate deployment of new technology. The NSF supported the early years of the Internet Engineering Task Force (IETF). The World Wide Web Consortium based at MIT is primarily funded by industrial associates, and also has received money from DARPA.
Digital libraries were not an explicit subject of federal research until the 1990s. In 1992, DARPA funded a project, coordinated by the Corporation for National Research Initiatives (CNRI), involving five universities: Carnegie Mellon, Cornell, M.I.T., Stanford, and the University of California at Berkeley. The project had the innocent name of the Computer Science Technical Reports project (CSTR), but its true impact was to encourage these strong computer science departments to develop research programs in digital libraries.
The initiative that really established digital libraries as a distinct field of research came in 1994, when the NSF, DARPA, and the National Aeronautics and Space Administration (NASA) created the Digital Libraries Initiative. The research program of the Digital Libraries Initiative is summarized in Panel 4.2. Research carried out under this program is mentioned throughout this book.
Panel 4.2. The Digital Libraries Initiative
In 1994, the computer science divisions of NSF, DARPA, and NASA provided funds for six research projects in digital libraries. They were four year projects. The total government funding was $24 million, but numerous external partners provided further resources which, in aggregate, exceeded the government funding. Each project was expected to implement a digital library testbed and carry out associated research.
The impact of these projects indicates that the money has been well-spent. An exceptionally talented group of researchers has been drawn to library research, with wisdom and insights that go far beyond the purely technical. This new wave of computer scientists who are carrying out research in digital libraries brings new experiences to the table, new thinking, and fresh ideas to make creative use of technology.
Here is a list of the six projects and highlights of some of the research. Many of these research topics are described in later chapters.
The University of California at Berkeley built a large collection of documents about the California environment. They included maps, pictures, and government reports scanned into the computer. Notable research included: multivalent documents, a conceptual method for expressing documents as layers of information; Cheshire II, a search system that combines the strengths of SGML formats with the information in MARC records; and research into image recognition, so that features such as dams or animals can be recognized in pictures.
The University of California at Santa Barbara concentrated on maps and other geospatial information. Their collection is called the Alexandria Digital Library. Research topics included: metadata for geospatial information, user interfaces for overlapping maps, wavelets for compressing and transmitting images, and novel methods for analyzing how people use libraries.
Carnegie Mellon University built a library of segments of video, called Informedia. The research emphasized automatic processing for information discovery and display. It included: multi-modal searching in which information gleaned from many sources is combined, speech recognition, image recognition, and video skimming to provide a brief summary of a longer video segment.
The University of Illinois worked with scientific journal publishers to build a federated library of journals for science and engineering. Much of the effort concentrated on manipulation of documents in SGML. This project also used supercomputing to study the problems of semantic information in very large collections of documents.
The University of Michigan has built upon the digital library collections being developed by the university libraries. The project has investigated applications in education, and experimented with economic models and an agent-based approach to interoperability.
Stanford University concentrated on computer science literature. At the center of the project was the InfoBus, a method to combine digital library services from many sources to provide a coherent set of services. Other research topics included novel ways to address user interfaces, and modeling of the economic processes in digital libraries.
The Digital Libraries Initiative focused international attention on digital libraries research. Beyond the specific work that it funded, the program gave a shape to an emerging discipline. Research in digital libraries was not new, but previously it had been fragmented. Even the name "digital libraries" was uncertain. The Digital Libraries Initiative highlighted the area as a challenging and rewarding field of research. It led to the growth of conferences, publications, and the appointment of people in academic departments who describe their research interests as digital libraries. This establishment of a new field is important because it creates the confidence that is needed to commit to long-term research.
Another impact of the Digital Libraries Initiative has been to clarify the distinction between research and the implementation of digital libraries. When the initiative was announced, some people thought that the federal government was providing a new source of money to build digital libraries, but these are true research projects. Some of the work has already migrated into practical applications; some anticipates hardware and software developments; some is truly experimental.
The emergence of digital libraries as a research discipline does have dangers. There is a danger that researchers will concentrate on fascinating theoretical problems in computer science, economics, sociology, or law, forgetting that this is a practical field where research should be justified by its utility. The field is fortunate that the federal funding agencies are aware of these dangers and appear determined to keep a focus on the real needs.
Agencies, such as the NSF, DARPA, and NASA, are granted budgets by Congress to carry out specific objectives, and each agency operates within well-defined boundaries. None of the three agencies has libraries as a primary mission; the money for the Digital Libraries Initiative came from existing budgets that Congress had voted for computer science research. As a result, the first phase of research emphasized the computer science aspects of the field, but the staff of the agencies know that digital libraries are more than a branch of computer science. When, in 1998, they created a second phase of the initiative, they sought partners whose mission is to support a broader range of activities. The new partners include the National Library of Medicine, the National Endowment for the Humanities, the Library of Congress, and the NSF's Division of Undergraduate Education. This book was written before these new grants were announced, but everybody expects to see funding for a broad range of projects that reflect the missions of all these agencies.
To shape the agenda, the funding agencies have sponsored a series of workshops. Some of these workshops have been on specific research topics, such as managing access to information. Others have had a broader objective, to develop a unified view of the field and identify key research topics. The small amounts of federal funding used for coordination of digital libraries research have been crucial in shaping the field. The key word is "coordination", not standardization. Although the output sometimes includes standards, the fundamental role played by these efforts is to maximize the impact of research, and build on it.
Areas of research
The central part of this chapter is a quick overview of the principal areas of research in digital libraries. As digital libraries have established themselves as a field for serious research, certain problems have emerged as central research topics and a body of people now work on them.
Object models

One important research topic is to understand the objects that are in digital libraries. Digital libraries store and disseminate any information that can be represented in digital form. As a result, the research problems in representing and manipulating information are varied and subtle.
The issue is an important one. What users see as a single work may be represented in a computer as an assembly of files and data structures in many formats. The relationship between these components and the user's view of the object is sometimes called an object model. For example, in a digital library, a single image may be stored several times, as a high-quality archival image, a medium resolution version for normal use, and a small thumbnail that gives an impression of the image but omits many details. Externally, this image is referenced by a single bibliographic identifier but to the computer it is a group of distinct files. A journal article stored on a web server appears to a user as a single continuous text with a few graphics. Internally it is stored as several text files, several images, and perhaps some executable programs. Many versions of the same object might exist. Digital libraries often have private versions of materials that are being prepared for release to the public. After release, new versions may be required to correct errors, the materials may be reorganized or moved to different computers, or new formats may be added as technology advances.
The ability of user interfaces and other computer programs to present the work to a user depends upon being able to understand how these various components relate to form a single library object. Structural metadata is used to describe the relationships. Mark-up languages are one method to represent structure in text. For example, in an HTML page, the <img> tag is structural metadata that indicates the location of an image. The simplicity of HTML makes it a poor tool for many tasks, whereas the full generality of SGML is overly complex for most purposes. A new hybrid, XML, has recently emerged that may succeed in bridging the gap between HTML and the full power of SGML.
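As a small sketch of the idea, the following Python fragment records structural metadata for the image example above: one library object, referenced by a single identifier, stored as three distinct files. The element names, roles, and file names are invented for illustration, not taken from any particular digital library system.

```python
import xml.etree.ElementTree as ET

# Hypothetical structural metadata for one library object: a single image
# referenced by one identifier but stored as three separate files.
record = """
<object id="image-0042">
  <version role="archival"  format="TIFF" file="0042-arch.tif"/>
  <version role="reference" format="JPEG" file="0042-ref.jpg"/>
  <version role="thumbnail" format="GIF"  file="0042-thumb.gif"/>
</object>
"""

def file_for(xml_text, role):
    """Return the stored file that fills a given role for this object."""
    root = ET.fromstring(xml_text)
    for version in root.findall("version"):
        if version.get("role") == role:
            return version.get("file")
    return None

# A user interface asks for the object by role, not by file name.
print(file_for(record, "thumbnail"))   # -> 0042-thumb.gif
```

A program that understands this structure can choose the thumbnail for a results list and the archival file for preservation, while the user sees only one object.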
Much of the early work on structural metadata has been carried out with libraries of digitized pictures, music, video clips, and other objects converted from physical media. Maps provide an interesting field of their own. Looking beyond conventional library materials, content created in digital form is not constrained by the linearity of print documents. There is research interest in real-time information, such as that obtained from remote sensors, in mobile agents that travel on networks, and in other categories of digital objects that have no physical counterpart. Each of these types raises its own problems: how to capture the information, how to store it, how to describe it, how to find the information that it contains, and how to deliver it. These are tough questions individually and even tougher in combination. The problem is to devise object models that will support library materials that combine many formats, and will enable independent digital libraries to interoperate.
User interfaces and human-computer interaction
Improving how users interact with information on computers is clearly a worthwhile subject. This is such a complex topic that it might be thought of as an art, rather than a field where progress is made through systematic research. Fortunately, such pessimism has proved unfounded. The development of web browsers is one example; others include creative research in areas such as visualization of complex sets of information, layering of the information contained in documents, and automatic skimming to extract a summary or to generate links.
To a user at a personal computer, digital libraries are just part of the working environment. Some user interface research looks at the whole environment, which is likely to include electronic mail, word processing, and applications specific to the individual's field of work. In addition, the environment will probably include a wide range of information that is not in digital form, such as books, papers, video tapes, maps, or photographs. The ability of users to interact with digital objects, through annotations, to manipulate them, and to add them to their own personal collections is proving to be a fertile area of research.
Information discovery

Finding and retrieving information is a central aspect of libraries. Searching for specific information in large collections of text, the field known as information retrieval, has long been of interest to computer scientists. Browsing has received less research effort, despite its importance. Digital libraries bring these two areas together in the general problem of information discovery: how to find information. An enormous amount of research is being carried out in this area and only a few topics are mentioned here.
Descriptive metadata. Most of the best systems for information discovery use cataloguing or indexing metadata that has been produced by somebody with expert knowledge. This includes the data in library catalogs, and abstracting and indexing services, such as Medline or Inspec. Unfortunately, human indexing is slow and expensive. Different approaches are required for the huge volumes of fast-changing material to be expected in digital libraries. One approach is for the creator to provide small amounts of descriptive metadata for each digital object. Some of this metadata will be generated automatically; some by trained professionals; some by less experienced people. The metadata can then be fed into an automatic indexing program.
Automatic indexing. The array of information on the networks is too vast and changes too frequently for it all to be catalogued by skilled cataloguers. Research in automatic indexing uses computer programs to scan digital objects, extract indexing information, and build searchable indexes. Web search programs, such as AltaVista, Lycos, and Infoseek, are the products of such research, much of which was carried out long before digital libraries became an established field.
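The core of such an indexing program can be sketched in a few lines of Python. This is a minimal inverted index, with invented document names; real systems add ranking, stop lists, and far more robust parsing.

```python
import re
from collections import defaultdict

# Two toy "documents" standing in for pages gathered from the network.
documents = {
    "doc1": "Digital libraries store information in digital form.",
    "doc2": "Automatic indexing builds searchable indexes of information.",
}

def build_index(docs):
    """Scan each document, extract words, and build a searchable index
    mapping every word to the set of documents that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(name)
    return index

index = build_index(documents)
print(sorted(index["information"]))   # -> ['doc1', 'doc2']
```

A search is then simply a lookup in the index, which is why such programs can answer queries over millions of pages without re-reading them.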
Natural language processing. Searching of text is greatly enhanced if the search program understands some of the structure of language. Relevant research in computational linguistics includes automatic parsing to identify grammatical constructs, morphology to associate variants of the same word, lexicons, and thesauruses. Some research goes even further, attempting to bring knowledge of the subject matter to bear on information retrieval.
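A toy illustration of why morphology matters to searching: the crude suffix-stripping rule below (an invented simplification, not a real stemmer such as Porter's algorithm) lets several variants of a word match the same index term.

```python
# Strip a few common English suffixes so word variants share one stem.
# This is deliberately naive; real morphological analysis uses full
# stemming algorithms or lexicons.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# "indexing", "indexed", and "indexes" all reduce to the same term,
# so a search for any one of them can find documents containing the others.
print({crude_stem(w) for w in ["indexing", "indexed", "indexes"]})
```

Even this crude rule shows the benefit: a query for "indexing" can retrieve a document that mentions only "indexes".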
Non-textual material. Most methods of information discovery use text, but researchers are slowly making progress in searching for specific content in other formats. Speech recognition is just beginning to be usable for indexing radio programs and the audio track of videos. Image recognition, the automatic extraction of features from pictures, is an active area of research, but not yet ready for deployment.
Collection management and preservation
Collection management is a research topic that is just beginning to receive the attention that it deserves. Over the years, traditional libraries have developed methods that allow relatively small teams of people to manage vast collections of material, but early digital libraries have often been highly labor intensive. In the excitement of creating digital collections, the needs of organizing and preserving the materials over long periods of time were neglected. These topics are now being recognized as difficult and vitally important.
Organization of collections. The organization of large collections of online materials is complex. Many of the issues are the same whether the materials are an electronic journal, a large web site, a software library, an online map collection, or a large information service. They include how to load information in varying formats, and how to organize it for storage and retrieval. For access around the world, several copies are needed, using various techniques of replication. The problems are amplified by the fact that digital information changes. In the early days of printing, proof corrections were made continually, so that every copy of a book might be slightly different. Online information can also change continually. Keeping track of minor variations is never easy and whole collections are reorganized at unpleasantly frequent intervals. Many of the research topics that are important for interoperability between collections are equally important for organizing large collections. In particular, current research on identifiers, metadata, and authentication applies to both collection management and interoperation among collections.
Archiving and preservation. The long-term preservation of digital materials has recently emerged as a key research topic in collection management. Physical materials, such as printed books, have the useful property that they can be neglected for decades but still be readable. Digital materials are the opposite. The media on which data is stored have quite short life expectancies, often frighteningly short. Explicit action must be taken to refresh the data by copying the bits periodically onto new media. Even if the bits are preserved, problems remain. The formats in which information is stored are frequently replaced by new versions. Formats for word processor and image storage that were in common use ten years ago are already obsolete and hard to use. To interpret archived information, future users will need to be able to recognize the formats and display them successfully.
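The refreshing step can be sketched simply: copy the bits to new storage and verify, with a checksum, that nothing was lost in the copy. The file names below are invented; a real archive would also record the checksums themselves as part of its administrative metadata.

```python
import hashlib
import os
import shutil
import tempfile

def checksum(path):
    """Return a SHA-256 digest of a file's bits."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def refresh(old_path, new_path):
    """Copy a file to new media and confirm the bits are identical."""
    shutil.copyfile(old_path, new_path)
    if checksum(old_path) != checksum(new_path):
        raise IOError("refresh failed: copy does not match original")
    return new_path

# Simulate old and new media with a temporary directory.
workdir = tempfile.mkdtemp()
old = os.path.join(workdir, "master.bits")
with open(old, "wb") as f:
    f.write(b"archived content")

new = refresh(old, os.path.join(workdir, "master-copy.bits"))
print(checksum(old) == checksum(new))   # -> True
```

Preserving the bits in this way is the easy part; interpreting an obsolete format decades later remains the harder research problem.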
Conversion. Conversion of physical materials into digital formats illustrates the difficulties of collection management. What is the best way to convert huge collections to digital format? What is the trade off between cost and quality? How can today's efforts be useful in the long term?
This area illustrates the differences between small-scale and large-scale efforts, which is an important research topic in its own right. A small project may convert a few thousand items to use as a testbed. The conversion is perceived as a temporary annoyance that is necessary before beginning the real research. A group of students will pass the materials through a digital scanner, check the results for obvious mistakes, and create the metadata required for a specific project. In contrast, libraries and publishers convert millions of items. The staff carrying out the work is unlikely to be as motivated as members of a research team; metadata must be generated without knowing the long-term uses of the information; quality control is paramount. The current state of the art is that a number of organizations have developed effective processes for converting large volumes of material. Often, part of the work is shipped to countries where labor costs are low. However, each of these organizations has its own private method of working. There is duplication of tools and little sharing of experience.
Conversion of text is an especially interesting example. Optical character recognition, which uses a computer to identify the characters and words on a page, has reached a tantalizing level of being almost good enough, but not quite. Several teams have developed considerable expertise in deciding how to incorporate optical character recognition into conversion, but little of this expertise is systematic or shared.
Interoperability

From a computing viewpoint, many of the most difficult problems in digital libraries are aspects of a single challenge, interoperability: how to get a wide variety of computing systems to work together. This embraces a range of topics, from syntactic interoperability, which provides a superficial uniformity for navigation and access but relies almost entirely on human intelligence for coherence, to a deeper level of interoperability, where separate computer systems share an understanding of the information itself.
Around the world, many independently managed digital libraries are being created. These libraries have different management policies and different computing systems. Some are modern, state-of-the-art computer systems; others are elderly, long past retirement age. The term legacy system is often used to describe old systems, but this term is unnecessarily disparaging. As soon as the commitment is made to build a computer system, or create a new service or product, that commitment is a factor in all future decisions. Thus, every computer system is a legacy system, even before it is fully deployed.
Interoperability and standardization are interrelated. Unfortunately, the formal process of creating international standards is often the opposite of what is required for interoperability in digital libraries. Not only is the official process of standardization much too slow for the fast-moving world of digital libraries; it also encourages standards that are unduly complex, and many international standards have never been tested in real life. In practice, the only standards that matter are those that are widely used. Sometimes a de facto standard emerges because a prominent group of researchers uses it; the use of TCP/IP for the embryonic Internet is an example. Some standards become accepted because the leaders of the community decide to follow certain conventions; the MARC format for catalog records is an example. Sometimes, generally accepted standards are created from a formal standards process; MPEG, the compression format used for video, is a good example. Other de facto standards are proprietary products from prominent corporations; Adobe's Portable Document Format (PDF) is a recent example. TCP/IP and MARC are typical of the standards that were initially created by the communities that use them, then became official standards that have been enhanced through a formal process.
The following list gives an idea of the many aspects of interoperability:
User interfaces. A user will typically use many different digital library collections. Interoperability aims at presenting the materials from those collections in a coherent manner, though it is not necessary to hide all the differences. A collection of maps is not the same as a music collection, but the user should be able to move smoothly between them, search across them, and be protected from idiosyncrasies of computer systems or peculiarities of how the collections are managed.
Naming and identification. Some means is needed to identify the materials in a digital library. The Internet provides a numeric identifier, the IP address, for every computer, and the domain name system gives each computer on the Internet a name. The web's Uniform Resource Locator (URL) extends these names to individual files. However, neither domain names nor URLs are fully satisfactory. Library materials need identifiers that identify the material itself, not the location where an instance of the material happens to be stored at a given moment. Location-independent identifiers are sometimes called Uniform Resource Names (URNs).
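The idea behind location-independent naming can be sketched in a few lines of code. The sketch below is a hypothetical resolver, not any deployed URN system: it maps a persistent identifier to the URL where a copy of the item currently lives, so that the name outlives any one location. All identifiers and URLs are invented for illustration.

```python
# A minimal sketch of location-independent naming. A resolver maps a
# persistent, URN-like identifier to the current location of the item.
# Everything here (names, hosts, paths) is hypothetical.

resolver = {
    "urn:example:report-42": "http://repository.example.edu/pub/report42.pdf",
}

def resolve(urn):
    """Return the current location of the item named by `urn`, or None."""
    return resolver.get(urn)

def relocate(urn, new_url):
    """When an item moves, only the resolver entry changes; every
    stored reference to the URN remains valid."""
    resolver[urn] = new_url
```

The design choice is the indirection itself: documents cite the stable name, and only the resolver needs updating when material moves.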
Formats. Materials in every known digital format are stored in digital libraries. The web has created de facto standards for a few formats, notably HTML for simple text, and GIF and JPEG for images. Beyond these basic formats, there is little agreement. Text provides a particular challenge for interoperability. During the 1980s, ASCII emerged as the standard character set for computers, but it has few characters beyond those used in English. Currently, Unicode appears to be emerging as an extended character set that supports a very wide range of scripts, but it is not yet supported by many computer systems. Although SGML has been widely advocated, and is used in some digital library systems, it is so complex and has so much flexibility that full interoperability is hard to achieve.
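The character-set problem can be made concrete with a short example. This is a modern illustration, assuming a language with built-in Unicode support; it shows how ASCII fails on a single accented letter while a Unicode encoding handles it without loss.

```python
# Why ASCII is inadequate beyond English, and how Unicode copes.
text = "Bibliothèque"          # French: 'è' is outside the ASCII range

# Encoding to ASCII fails for the accented character...
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ...while UTF-8, a Unicode encoding, represents it and round-trips exactly.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```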
Metadata. Metadata plays an important role in many aspects of digital libraries, but is especially important for interoperability. As discussed earlier, metadata is often divided into three categories: descriptive metadata is used for bibliographic purposes and for searching and retrieval; structural metadata relates different objects and parts of objects to each other; administrative metadata is used to manage collections, including access controls. For interoperability, some of this metadata must be exchanged between computers. This requires agreement on the names given to the metadata fields, the format used to encode them, and at least some agreement on semantics. As a trivial example of the importance of semantics, there is little value in having a metadata field called "date" if one collection uses the field for the date when an object was created and another uses it for the date when it was added to the collection.
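The "date" example above can be sketched as a metadata crosswalk. The collection names and field names below are invented; the point is that exchanging records between computers requires mapping each collection's fields onto shared, explicitly defined ones.

```python
# A sketch of a metadata crosswalk between two hypothetical collections.
# Collection A uses "date" for the creation date;
# Collection B uses "date" for the date an item was added.
CROSSWALK = {
    "collection_a": {"title": "title", "date": "date_created"},
    "collection_b": {"title": "title", "date": "date_added"},
}

def to_shared(record, collection):
    """Translate one collection's record into the shared scheme,
    making the semantics of each field explicit in its name."""
    mapping = CROSSWALK[collection]
    return {mapping[k]: v for k, v in record.items() if k in mapping}
```

Once both collections' records are expressed in the shared scheme, a "date" ambiguity cannot arise, because the shared field names encode the semantics.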
Distributed searching. Users often wish to find information that is scattered across many independent collections. Each may be organized in a coherent way, but the descriptive metadata will vary, as will the capabilities provided for searching. The distributed search problem is how to find information by searching across collections. The traditional approach is to insist that all collections agree on a standard set of metadata and support the same search protocols. Increasingly, digital library researchers are recognizing that this is a dream world. It must be possible to search sensibly across collections despite differing organization of their materials.
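Searching sensibly across differently organized collections can be sketched as a fan-out with per-collection adapters. The two collections and their record layouts below are hypothetical; the sketch shows one pragmatic alternative to insisting on uniform metadata.

```python
# Distributed search without uniform metadata: each collection keeps its
# own search function and record layout; a thin adapter normalizes the
# results just enough to merge them. Collections here are hypothetical.

def search_catalog(term):
    records = [{"ti": "Maps of Africa", "yr": 1967}]
    return [r for r in records if term.lower() in r["ti"].lower()]

def search_archive(term):
    records = [{"title": "African maps collection", "created": "1890"}]
    return [r for r in records if term.lower() in r["title"].lower()]

ADAPTERS = [
    (search_catalog, lambda r: {"title": r["ti"], "source": "catalog"}),
    (search_archive, lambda r: {"title": r["title"], "source": "archive"}),
]

def distributed_search(term):
    """Fan the query out to every collection and merge normalized hits."""
    hits = []
    for search, normalize in ADAPTERS:
        hits.extend(normalize(r) for r in search(term))
    return hits
```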
Network protocols. Moving information from one computer to another requires interoperability at the network level. The almost universal adoption of the Internet family of protocols has largely solved this problem, but there are gaps. For example, the Internet protocols are not good at delivering continuous streams of data, such as audio or video materials, which must arrive in a steady stream at predictable time intervals.
Retrieval protocols. One of the fundamental operations of digital libraries is for a computer to send a message to another to retrieve certain items. This message must be transmitted in some protocol. The protocol can be simple, such as HTTP, or much more complex. Ideally, the protocol would support secure authentication of both computers, high-level queries to discover what resources each provides, a variety of search and retrieval capabilities, methods to store and modify intermediate results, and interfaces to many formats and procedures. The most ambitious attempt to achieve these goals is the Z39.50 protocol, but Z39.50 is in danger of collapsing under its own complexity, while still not meeting all the needs.
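The contrast between HTTP's simplicity and Z39.50's ambition is easy to see from the request itself. The sketch below builds the text of a minimal HTTP/1.0 GET request; the host and path are placeholders.

```python
# HTTP illustrates how simple a retrieval protocol can be: a short
# textual message names the item, and the response carries it back.
# There is no authentication, query negotiation, or result management.

def http_get_request(host, path):
    """Build the text of a minimal HTTP/1.0 GET request."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
```

Everything a Z39.50-style protocol adds, from authentication to stored intermediate result sets, is additional machinery on top of this basic name-and-retrieve exchange, which is where the complexity comes from.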
Authentication and security. Several of the biggest problems in interoperability among digital libraries involve authentication. Various categories of authentication are needed. The first is authentication of users. Who is the person using the library? Since few methods of authentication have been widely adopted, digital libraries are often forced to provide every user with an ID and password. The next category is authentication of computers. Systems that handle valuable information, especially financial transactions or confidential information, need to know which computer they are connecting to. A crude approach is to rely on the Internet IP address of each computer, but this is open to abuse. The final need is authentication of library materials. People need to be confident that they have received the authentic version of an item, not one that has been modified, either accidentally or deliberately. For some of these needs, good methods of authentication exist, but they are not deployed widely enough to permit full interoperability.
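One of the well-understood methods for authenticating library materials is the cryptographic digest: publish a fingerprint of each item, and let readers verify that the copy they receive matches it. The sketch below uses SHA-256; the choice of hash function and the publication mechanism are assumptions, not part of any particular digital library system.

```python
# Verifying that a received item matches its published fingerprint.
import hashlib

def fingerprint(data):
    """Compute a cryptographic digest of an item's bytes."""
    return hashlib.sha256(data).hexdigest()

def is_authentic(data, published_digest):
    """True only if the received bytes match the published digest,
    i.e. the item has not been modified in transit or in storage."""
    return fingerprint(data) == published_digest
```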
Semantic interoperability. Semantic interoperability is a broad term for the general problem that, when computers pass messages, they need to share the same semantic interpretation of the information in the messages. Semantic interoperability deals with the ability of a user to access similar classes of digital objects, distributed across heterogeneous collections, with compensation for site-by-site variations. Full semantic interoperability embraces a family of deep research problems. Some are extraordinarily difficult.
The web provides a base level of interoperability, but the simplicity of the underlying technology that has led to its wide acceptance also brings weaknesses. URLs make poor names for the long term; HTML is restricted in the variety of information that it can represent; MIME, which identifies the type of each item, is good as far as it goes, but library information is far richer than the MIME view of data types; user interfaces are constrained by the simplicity of the HTTP protocol. Developing extensions to the web technology has become big business. Some extensions are driven by genuine needs, but others by competition between the companies involved. A notable success has been the introduction of the Java programming language, which has made a great contribution to user interfaces, overcoming many of the constraints of HTTP.
Paradoxically, the web's success is also a barrier to the next generation of digital libraries. It has become a legacy system. The practical need to support this installed base creates a tension in carrying out research. If researchers wish their work to gain acceptance, they must provide a migration path from the web of today. As an example, the fact that the leading web browsers do not support URNs has been a barrier to using URNs to identify materials within digital libraries.
The secret of interoperability is easy to state but hard to achieve. Researchers need to develop new concepts that offer great improvements, yet are easy to introduce. This requires that new methods have high functionality and low cost of adoption. A high-level of functionality is needed to overcome the inertia of the installed base. Careful design of extensibility in digital library systems allows continued research progress with the least disruption to the installed base.
Interoperability and collection management are two examples of problems that grow rapidly as the scale of the library increases. A user may have more difficulty using a monograph catalog in a very large library, such as the Library of Congress or Harvard University, than in a small college library where there are only a few entries under each main heading. As the size of the web has grown, many people would agree that the indexing programs, such as Infoseek, have become less useful. The programs often respond to simple queries with hundreds of similar hits; valuable results are hard to find among the duplicates and rubbish. Scaling is a difficult topic to research without building large-scale digital libraries. Currently the focus is on the technical problems, particularly reliability and performance.
Questions of reliability and robustness of service pervade digital libraries. The complexity of large computer systems exceeds the ability to understand fully how all the parts interact. In a sufficiently large system, inevitably, some components are out of service at any given moment. The general approach to this problem is to duplicate data. Mirror sites are often used. Unfortunately, because of the necessary delays in replicating data from one place to another, mirror sites are rarely exact duplicates. What are the implications for the user of the library? What are the consequences for distributed retrieval if some part of the collections can not be searched today, or if a back-up version is used with slightly out-of-date information?
Research in performance is a branch of computer networking research, not a topic peculiar to digital libraries. The Internet now reaches almost every country of the world, but it is far from providing uniformly high performance everywhere, at all times. One basic technique is caching, storing temporary copies of recently used information, either on the user's computer or on a nearby server. Caching helps achieve decent performance across the world-wide Internet, but brings some problems. What happens if the temporary copies are out of date? Every aspect of security and control of access is made more complex by the knowledge that information is likely to be stored in insecure caches around the world. Interesting research on performance has been based on the concept of locality. Selected information is replicated and stored at a location that has been chosen because it has good Internet connections. For example, the Networked Computer Science Technical Reference Library (NCSTRL), described in Chapter 11, uses a series of zones. Everything needed to search and identify information is stored within a zone. The only messages sent outside the zone are to retrieve the actual digital objects from their home repositories.
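The staleness problem with caching can be sketched with a time-to-live policy: each cached copy records when it was stored and is discarded after a fixed period. The class below is a simplified illustration, not a description of any actual cache; times are passed in explicitly to keep the sketch deterministic.

```python
# A sketch of a cache with expiry. A copy served from the cache may be
# out of date, so each entry carries the time it was stored and is
# discarded once it is older than the time-to-live (ttl).

class Cache:
    def __init__(self, ttl):
        self.ttl = ttl
        self.store = {}        # url -> (stored_at, content)

    def put(self, url, content, now):
        self.store[url] = (now, content)

    def get(self, url, now):
        """Return cached content, or None if absent or expired."""
        entry = self.store.get(url)
        if entry is None:
            return None
        stored_at, content = entry
        if now - stored_at > self.ttl:
            del self.store[url]   # stale copy: discard it
            return None
        return content
```

The trade-off is visible in the `ttl` parameter: a long time-to-live improves performance but increases the chance of serving out-of-date information.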
Economic, social, and legal issues
Digital libraries exist within a complex social, economic, and legal framework, and succeed only to the extent that they meet these broader needs. The legal issues are both national and international. They range across several branches of law: copyright, communications, privacy, obscenity, libel, national security, and even taxation. The social context includes authorship, ownership, the act of publication, authenticity, and integrity. These are not easy areas for research.
Some of the most difficult problems are economic. If digital libraries are managed collections of information, skilled professionals are needed to manage the collections. Who pays these people? The conventional wisdom assumes that the users of the collections, or their institutions, will pay subscriptions or make a payment for each use of the collections. Therefore, there is research into payment methods, authentication, and methods to control the use made of collections. Meanwhile, the high quality of many open-access web sites has shown that there are other financial models. Researchers have developed some interesting economic theories, but the real advances in understanding of the economic forces come from the people who are actually creating, managing, or using digital information. Pricing models cease to be academic when the consequence of a mistake is for individuals to lose their jobs or an organization to go out of business.
Access management is a related topic. Libraries and publishers sometimes wish to control access to their materials. This may be to ensure payment, requirements from the copyright owners, conditions laid down by donors, or a response to concerns of privacy, libel, or obscenity. Such methods are sometimes called "rights management", but issues of access are much broader than simply copyright control or the generation of revenue. Some of the methods of access management involve encryption, a highly complex field where issues of technology, law, and public policy have become hopelessly entangled.
A related research topic is evaluation of the impact made by digital libraries. Can the value of digital libraries and research into digital libraries be measured? Unfortunately, despite some noble efforts, it is not clear how much useful information can be acquired. Systematic results are few and far between. The problem is well-known in market research. Standard market research techniques, such as focus groups and surveys, are quite effective in predicting the effect of incremental changes to existing products. The techniques are much less effective in anticipating the impact of fundamental changes. This does not imply that measurements are not needed. It is impossible to develop any large scale system without good management data. How many people use each part of the service? How satisfied are they? What is the unit cost of the services? What is the cost of adding material to the collections? What are the delays? Ensuring that systems provide such data is essential, but this is good computing practice, not a topic for research.
Economic, social and legal issues were left to the end of this survey of digital libraries research, not because they are unimportant but because they are so difficult. In selecting research topics, two criteria are important. The topic must be worthwhile and the research must be feasible. The value of libraries to education, scholarship, and the public good is almost impossible to measure quantitatively. Attempts to measure the early impact of digital libraries are heavily constrained by incomplete collections, rapidly changing technology, and users who are adapting to new opportunities. Measurements of dramatically new computer systems are inevitably a record of history, interesting in hindsight, but of limited utility in planning for the future. Determining the value of digital libraries may continue to be a matter for informed judgment, not research.
Research around the world
This overview of digital library research and innovation is written from an American perspective, but digital libraries are a worldwide phenomenon. The Internet allows researchers from around the world to collaborate on a day-to-day basis. Researchers from Australia and New Zealand have especially benefited from these improved communications, and are important contributors. The web itself was developed in Switzerland. The British e-lib project provided added stimuli to a variety of library initiatives around the theme of electronic publication and distribution of material in digital forms. Recently, the European Union and the National Science Foundation have sponsored a series of joint planning meetings. A notable international effort has been the series of Dublin Core metadata workshops, which are described in Chapter 10. The stories on digital libraries research published in D-Lib Magazine each month illustrate that this is indeed a worldwide field; during the first three years of publication, articles came from authors in more than ten countries. The big, well-funded American projects are important, but they are far from being the whole story.
Last revision of content: January 1999
Formatted for the Web: December 2002
(c) Copyright The MIT Press 2000