Distributed information discovery
Distributed computing and interoperability
This chapter is about distributed information retrieval, seeking information that is spread across many computer systems. This is part of the broad challenge of interoperability, which is an underlying theme of this entire book. Therefore, the chapter begins with a discussion of the general issues of interoperability.
The digital libraries of the world are managed by many organizations, with different management styles, and different attitudes to collections and technology. Few libraries have more than a small percentage of the materials that users might want. Hence, users need to draw from collections and services provided by many different sources. How does a user discover and have access to information when it could be drawn from so many sources?
The technical task of coordinating separate computers so that they provide a coherent service is called distributed computing. Distributed computing requires that the various computers share some technical standards. With distributed searching, for example, a user might want to search many independent collections with a single query, compare the results, choose the most promising, and retrieve selected materials from the collections. Beyond the underlying networking standards, this requires some method of identifying the collections, conventions for formulating the query, techniques for submitting it, means to return results so that they can be compared, and methods for obtaining the items that are discovered. The standards may be formal standards, blessed by official standards bodies, or they may be local standards developed by a small group of collaborators, or agreements to use specific commercial products. However, distributed computing is not possible without some shared standards.
An ideal approach would be to develop a comprehensive set of standards that all digital libraries would adopt. This concept fails to recognize the costs of adopting standards, especially during times of rapid change. Digital libraries are in a state of flux. Every library is striving to improve its collections, services, and systems, but no two libraries are the same. Altering part of a system to support a new standard is time-consuming. By the time the alteration is completed there may be a new version of the standard, or the community may be pursuing some other direction. Comprehensive standardization is a mirage.
The interoperability problem of digital libraries is to develop distributed computing systems in a world where the various computers are independently operated and technically dissimilar. Technically, this requires formats, protocols, and security systems so that messages can be exchanged. It also requires semantic agreements on the interpretation of the messages. These technical aspects are hard, but the central challenge is to find approaches that independent digital libraries have incentives to incorporate. Adoption of shared methods provides digital libraries with extra functionality, but shared methods also bring costs. Sometimes the costs are directly financial: the purchase of equipment and software, hiring and training staff. More often the major costs are organizational. Rarely can one aspect of a digital library be changed in isolation. Introducing a new standard requires inter-related changes to existing systems, altered work flow, changed relationships with suppliers, and so on.
Figure 11.1 shows a conceptual model that is useful in thinking about interoperability; in this instance it is used to compare three methods of distributed searching. The horizontal axis of the figure indicates the functionality provided by various methods. The vertical axis indicates the costs of adopting them. The ideal methods would be at the bottom right of the graph, high functionality at low cost. The figure shows three particular methods of distributed searching, each of which is discussed later in this chapter. The web search programs have moderate functionality; they are widely used because they have low costs of adoption. Online catalogs based on MARC cataloguing and the Z39.50 protocol have much more function, but, because the standards are complex, they are less widely adopted. The NCSTRL system lies between them in both function and cost of adoption.
More generally, it is possible to distinguish three broad classes of methods that might be used for interoperability.
Most of the methods that are in widespread use for interoperability today have moderate function and low cost of acceptance. The main web standards, HTML, HTTP, and URLs, have these characteristics. Their simplicity has led to wide adoption, but limits the functions that they can provide.
Some high-end services provide great functionality, but are costly to adopt. Z39.50 and SGML are examples. Such methods are popular in restricted communities, where the functionality is valued, but have difficulty in penetrating into broader communities, where the cost of adoption becomes a barrier.
Many current developments in digital libraries are attempts to find the middle ground, substantial functionality with moderate costs of adoption. Examples include the Dublin Core, XML, and Unicode. In each instance, the designers have paid attention to providing a moderate cost route for adoption. Dublin Core allows every field to be optional. Unicode provides UTF-8, which accepts existing ASCII data. XML reduces the cost of adoption by its close relationships with both HTML and SGML.
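The UTF-8 point can be seen directly: every ASCII byte sequence is already valid UTF-8, so existing ASCII data can be adopted without any conversion. A minimal illustration in Python:

```python
# Every ASCII string encodes to identical bytes in ASCII and in UTF-8,
# so legacy ASCII data is valid UTF-8 exactly as it stands.
text = "Digital libraries"
assert text.encode("ascii") == text.encode("utf-8")

# Beyond ASCII, UTF-8 switches to multi-byte sequences
# (here, two bytes for the accented e).
accented = "caf\u00e9"
assert accented.encode("utf-8") == b"caf\xc3\xa9"
```

This backward compatibility is precisely what lowers the cost of adoption: an ASCII-only system needs no changes at all to interoperate with UTF-8 data.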
The figure has no scale and the dimensions are only conceptual, but it helps to understand the fundamental principle that the costs of adopting new technology are a factor in every aspect of interoperability. Technology can never be considered by itself, without studying the organizational impact. When the objective is to interoperate with others, the creators of a digital library are often faced with the choice between the methods that are best for their particular community and adopting more generally accepted standards, even though they offer lesser functionality.
New versions of software illustrate this tension. A new version will often provide the digital library with more function but fewer users will have access to it. The creator of a web site can use the most basic HTML tags, a few well-established formats, and the services provided by every version of HTTP; this results in a simple site that can be accessed by every browser in the world. Alternatively, the creator can choose the latest version of web technology, with Java applets, HTML frames, built-in security, style sheets, with audio and video inserts. This will provide superior service to those users who have high-speed networks and the latest browsers, but may be unusable by others.
Web search programs
The most widely used systems for distributed searching are the web search programs, such as Infoseek, Lycos, Altavista, and Excite. These are automated systems that provide an index to materials on the Internet. On the graph in Figure 11.1, they provide moderate function with low barriers to use: web sites need take no special action to be indexed by the search programs, and the only cost to the user is the tedium of looking at the advertisements. The combination of respectable function with almost no barriers to use makes the web search programs extremely popular.
Most web search programs have the same basic architecture, though with many differences in their details. The notable exception is Yahoo, which has its roots in a classification system. The other systems have two major parts: a web crawler which builds an index of material on the Internet, and a retrieval engine which allows users on the Internet to search the index.
The basic way to discover information on the web is to follow hyperlinks from page to page. A web indexing program follows hyperlinks continuously and assembles a list of the pages that it finds. Because of the manner in which the indexing programs traverse the Internet, they are often called web crawlers.
A web crawler builds an ever-increasing index of web pages by repeating a few basic steps. Internally, the program maintains a list of the URLs known to the system, whether or not the corresponding pages have yet been indexed. From this list, the crawler selects the URL of an HTML page that has not been indexed. The program retrieves this page and brings it back to a central computer system for analysis. An automatic indexing program examines the page and creates an index record for it which is added to the overall index. Hyperlinks from the page to other pages are extracted; those that are new are added to the list of URLs for future exploration.
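The cycle just described can be sketched in outline. This is a simplified sketch, not any production crawler: fetching and indexing are passed in as stubs, and the choice of the next URL is a naive first-in, first-out queue.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of anchor tags found on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed_urls, fetch, index, limit=100):
    """Basic crawl loop.

    fetch(url) returns the HTML of a page; index(url, html) adds an
    index record for it. The crawler keeps a frontier of unvisited
    URLs and a set of all URLs it has ever seen.
    """
    to_visit = list(seed_urls)
    known = set(seed_urls)
    while to_visit and limit > 0:
        url = to_visit.pop(0)        # simplistic FIFO choice of next URL
        limit -= 1
        html = fetch(url)
        index(url, html)             # build the index record for this page
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:    # newly discovered URLs join the frontier
            if link not in known:
                known.add(link)
                to_visit.append(link)
```

A real crawler replaces the FIFO queue with the selection criteria discussed below, fetches pages over HTTP, and respects conventions such as robots.txt.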
Behind this simple framework lie many variations and some deep technical problems. One problem is deciding which URL to visit next. At any given moment, the web crawler has millions of unexplored URLs, but has little information to know which to select. Possible criteria for choice might include currency, how many other URLs link to the page, whether it is a home page or a page deep within a hierarchy, whether it references a CGI script, and so on.
The biggest challenges concern indexing. Web crawlers rely on automatic indexing methods to build their indexes and create records to present to users. This was a topic discussed in Chapter 10. The programs are faced with automatic indexing at its most basic: millions of pages, created by thousands of people, with different concepts of how information should be structured. Typical web pages provide meager clues for automatic indexing. Some creators and publishers are even deliberately misleading; they fill their pages with terms that are likely to be requested by users, hoping that their pages will be highly ranked against common search queries. Without better structured pages or systematic metadata, the quality of the indexing records will never be high, but they are adequate for simple retrieval.
Searching the index
The web search programs allow users to search the index, using information retrieval methods of the kind described in Chapter 10. The indexes are organized for efficient searching by large numbers of simultaneous users. Since the index records themselves are of low quality and the users likely to be untrained, the search programs follow the strategy of identifying all records that vaguely match the query and supplying them to the user in some ranked order.
Most users of web search programs would agree that they are remarkable programs, but have several significant difficulties. The ranking algorithms have little information to base their decisions on. As a result, the programs may give high ranks to pages of marginal value; important materials may be far down the list and trivial items at the top. The index programs have difficulty recognizing items that are duplicates, though they attempt to group similar items; since similar items tend to rank together, the programs often return long lists of almost identical items. One interesting approach to ranking is to use link counts. Panel 11.1 describes Google, a search system that has used this approach. It is particularly effective in finding introductory or overview material on a topic.
Citation analysis is a commonly used tool in scientific literature. Groups of articles that cite each other are probably about related topics; articles that are heavily cited are likely to be more important than articles that are never cited. Lawrence Page, Sergey Brin, and colleagues at Stanford University have applied this concept to the web. They have used the patterns of hyperlinks among web pages as a basis for ranking pages and incorporated their ideas into an experimental web search program known as Google.
As an example, consider a search for the query "Stanford University" using various web search programs. There are more than 200,000 web pages at Stanford; most search programs have difficulty in separating those of purely local or ephemeral interest from those with broader readership. All web search programs find enormous numbers of web pages that match this query, but, in most cases, the order in which they are ranked fails to identify the sites that most people would consider to be most important. When this search was submitted to Google, the top ten results were the following. Most people would agree that this is a good list of high-ranking pages that refer to Stanford University.
Stanford University Homepage
Stanford University Medical Center
Stanford University Libraries & Information Resources
Stanford Law School
Stanford Graduate School of Business
Stanford University School of Earth Sciences
SUL: Copyright & Fair Use
Computer Graphics at Stanford University
SUMMIT (Stanford University) Home Page
Stanford Medical Informatics
The basic method used by Google is simple. A web page to which many other pages provide links is given a higher rank than a page with fewer links. Moreover, links from high-ranking pages are given greater weight than links from other pages. Since web pages around the world have links to the home page of the Stanford Law School, this page has a high rank. In turn, it links to about a dozen other pages, such as the university's home page, which gain rank from being referenced by a high-ranking page.
Calculating the page ranks is an elegant computational challenge. To understand the basic concept, imagine a huge matrix listing every page on the web and identifying every page that links to it. Initially, every page is ranked equally. New ranks are then calculated, based on the number of links to each page, weighted according to the rank of the linking pages and proportional to the number of links from each. These ranks are used for another iteration and the process continued until the calculation converges.
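The iteration described above can be sketched on a toy graph. This follows the basic idea only, with the usual damping factor added so that rank does not drain away; it assumes every page links to at least one other page, and the real computation had many further refinements.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Iterative link-based ranking.

    links maps each page to the list of pages it links to.
    Each round, every page shares its current rank equally among
    the pages it links to; links from high-ranking pages therefore
    carry more weight than links from low-ranking ones.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}              # start with equal ranks
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            share = damping * rank[p] / len(links[p])
            for target in links[p]:                 # divided among outgoing links
                new_rank[target] += share
        rank = new_rank                             # iterate until convergence
    return rank

# Toy graph: many pages point at "home", so it ends up ranked highest.
graph = {
    "home": ["law"],
    "law":  ["home", "lib"],
    "lib":  ["home"],
}
ranks = page_rank(graph)
```

Because each round merely redistributes the existing rank, the total stays constant and the values settle toward a fixed point, which is the convergence the panel describes.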
The actual computation is a refinement of this approach. In 1998, Google had a set of about 25 million pages, selected by a process derived from the ranks of pages that link to them. The program has weighting factors to account for pages with no links, or groups of pages that link only to each other. It rejects pages that are generated dynamically by CGI scripts. A sidelight on the power of modern computers is that the system was able to gather, index, and rank these pages in five days using only standard workstation computers.
The use of links to generate page ranks is clearly a powerful tool. It helps solve two problems that bedevil web search programs: since they cannot index every page on the web simultaneously, which pages should they index first, and how should they rank the pages returned by simple queries so that the most useful come first?
Web search programs have other weaknesses. Currency is one. The crawlers are continually exploring the web. Eventually, almost everything will be found, but important materials may not be indexed until months after they are published on the web. Conversely, the programs do an indifferent job at going back to see if materials have been withdrawn, so that many of the index entries refer to items that no longer exist or have moved.
Another threat to the effectiveness of web indexing is that a web crawler can index material only if it can access it directly. If a web page is protected by some form of authentication, or if a web page is an interface to a database or a digital library collection, the indexer will know nothing about the resources behind the interface. As more and more web pages are interfaces controlled by Java programs or other scripts, this results in high-quality information being missed by the indexes.
These problems are significant but should not be over-emphasized. The proof lies in the practice. Experienced users of the web can usually find the information that they want. They use a combination of tools, guided by experience, often trying several web search services. The programs are far from perfect but they are remarkably good - and use of them is free.
A fascinating aspect of the web search services is their business model. Most of the programs had roots in research groups, but they rapidly became commercial companies. Chapter 6 noted that, initially, some of these organizations tried to require users to pay a subscription, but Lycos, which was developed by a researcher at Carnegie Mellon University, was determined to provide public, no-cost searching. The others were forced to follow. Not charging for the basic service has had a profound impact on the Internet and on the companies. Their search for revenue has led to aggressive attempts to build advertising. They have rapidly moved into related markets, such as licensing their software to other organizations so that they can build indexes to their own web sites.
A less desirable aspect of this business model is that the companies have limited incentive to have a comprehensive index. At first, the indexing programs aimed to index the entire web. As the web has grown larger and the management of the search programs has become a commercial venture, comprehensiveness has become secondary to improvements in interfaces and ancillary services. To build a really high quality index of the Internet, and to keep it up to date, requires a considerable investment. Most of the companies are content to do a reasonable job, but with more incentives, their indexes would be better.
Federated digital libraries
The tension that Figure 11.1 illustrates between functionality and cost of adoption has no single correct answer. Sometimes the appropriate decision for digital libraries is to select simple technology and strive for broad but shallow interoperability. At other times, the wise decision is to select technology from the top right of the figure, with great functionality but associated costs; since the costs are high, only highly motivated libraries will adopt the methods, but they will see higher functionality.
The term federated digital library describes a group of organizations working together, formally or informally, who agree to support some set of common services and standards, thus providing interoperability among their members. In a federation, the partners may have very different systems, so long as they support an agreed set of services. They will need to agree both on technical standards and on policies, including financial agreements, intellectual property, security, and privacy.
Research at the University of Illinois, Urbana-Champaign provides a revealing example of the difficulties of interoperability. During 1994-98, as part of the Digital Libraries Initiative, a team based at the Grainger Engineering Library set out to build a federated library of journal articles from several leading science publishers. Since each publisher planned to make its journals available with SGML mark-up, this appeared to be an opportunity to build a federation; the university would provide central services, such as searching, while the collections would be maintained by the publishers. This turned out to be difficult. A basic problem was incompatibility in the way that the publishers use SGML. Each has its own Document Type Definition (DTD). The university was forced to go to enormous lengths to reconcile the semantics of the DTDs, both to extract indexing information and to build a coherent user interface. This problem proved to be so complex that the university resorted to copying the information from the publishers' computers onto a single system and converting it to a common DTD. If a respected university research group encountered such difficulties with a relatively coherent body of information, it is not surprising that others face the same problems. Panel 11.2 describes this work in more detail.
The Grainger Engineering Library at the University of Illinois is the center of a prototype library that is a federation of collections of scientific journal articles. The work began as part of the Digital Libraries Initiative under the leadership of Bruce Schatz and William Mischo. By 1998, the testbed collection had 50,000 journal articles from the following publishers. Each of the publishers provides journal articles with SGML mark-up at the same time as their printed journals are published.
The IEEE Computer Society
The Institute of Electrical and Electronic Engineers (IEEE)
The American Physical Society
The American Society of Civil Engineers
The American Institute of Aeronautics and Astronautics
The prototype has put into practice concepts of information retrieval from marked-up text that have been frequently discussed, but little used in practice. The first phase, which proved to be extremely tough, was to reconcile the DTDs (Document Type Definitions) used by the different publishers. Each publisher has its own DTD to represent the structural elements of its documents. Some of the differences are syntactic; <author>, <aut>, or <au> are alternatives for an author tag. Other variations, however, reflect significant semantic differences. For indexing and retrieval, the project has written software that maps each DTD's tags into a canonical set. The interfaces to the digital library use these tags, so that users can search for text in a given context, such as looking for a phrase within the captions on figures.
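The tag reconciliation can be illustrated in outline. The mapping tables below are hypothetical, modeled on the <author>/<aut>/<au> example above; they are not the project's actual canonical tag set or its publishers' real DTDs.

```python
# Hypothetical per-publisher mappings from each local DTD's tags
# to a shared canonical tag set.
TAG_MAPS = {
    "publisher_a": {"author": "author", "ti": "title", "figcap": "caption"},
    "publisher_b": {"aut": "author", "title": "title", "caption": "caption"},
    "publisher_c": {"au": "author", "arttitle": "title", "fc": "caption"},
}

def canonical_tag(publisher, tag):
    """Translate one publisher's structural tag into the canonical set.

    Once every collection's mark-up is expressed in the canonical tags,
    a contextual search such as 'phrase within figure captions' can run
    across all collections regardless of the source DTD. Returns None
    for tags with no canonical equivalent.
    """
    return TAG_MAPS[publisher].get(tag)
```

The syntactic mappings are easy; the hard part, as the panel notes, is the semantic differences, where one publisher's element has no exact equivalent in another's DTD and no table lookup suffices.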
The use of this set of tags lies to the top right of Figure 11.1. Converting the mark-up for all the collections to this set imposes a high cost whenever a new collection is added to the federation, but provides powerful functionality.
Because of technical difficulties, the first implementation loaded all the documents into a single repository at the University of Illinois. Future plans call for the federation to use repositories maintained by individual publishers. There is also interest in expanding the collections to embrace bibliographic databases, catalogs, and other indexes.
Even the first implementation proved to be a fertile ground for studying users and their wishes. Giving users more powerful methods of searching was welcomed, but also stimulated requests. Users pointed out that figures or mathematical expressions are often more revealing of content than abstracts or conclusions. The experiments have demonstrated, once again, that users have great difficulty finding the right words to include in search queries when there is no control over the vocabulary used in the papers, their abstracts, and the search system.
Online catalogs and Z39.50
Many libraries have online catalogs of their holdings that are openly accessible over the Internet. These catalogs can be considered to form a federation. As described in Chapter 3, the catalog records follow the Anglo-American Cataloguing Rules, using the MARC format, and libraries share records to reduce the costs. The library community developed the Z39.50 protocol to meet its needs for sharing records and distributed searching; Z39.50 is described in Panel 11.3. In the United States, the Library of Congress, OCLC, and the Research Libraries Group have been active in developing and promulgating these standards; there have been numerous independent implementations at academic sites and by commercial vendors. The costs of belonging to this federation are high, but they have been absorbed over decades, and are balanced by the cost savings from shared cataloguing.
Z39.50 is a protocol, developed by the library community, that permits one computer, the client, to search and retrieve information on another, the database server. Z39.50 is important both technically and for its wide use in library systems. In concept, Z39.50 is not tied to any particular category of information or type of database, but much of the development has concentrated on bibliographic data. Most implementations emphasize searches that use a bibliographic set of attributes to search databases of MARC catalog records and present them to the client.
Z39.50 is built around an abstract view of database searching. It assumes that the server stores a set of databases with searchable indexes. Interactions are based on the concept of a session. The client opens a connection with the server, carries out a sequence of interactions and then closes the connection. During the course of the session, both the server and the client remember the state of their interaction. It is important to understand that the client is a computer. End-user applications of Z39.50 need a user interface for communication with the user. The protocol makes no statements about the form of that user interface or how it connects to the Z39.50 client.
A typical session begins with the client connecting to the server and exchanging initial information, using the init facility. This initial exchange establishes agreement on basics, such as the preferred message size; it can include authentication, but the actual form of the authentication is outside the scope of the standard. The client might then use the explain service to inquire of the server what databases are available for searching, the fields that are available, the syntax and formats supported, and other options.
The search service allows a client to present a query to a database.
The standard provides several choices of syntax for specifying searches, but only Boolean queries are widely implemented. The server carries out the search and builds a results set. A distinctive feature of Z39.50 is that the server saves the results set. A subsequent message from the client can reference the result set. Thus the client can modify a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching the entire database.
Depending on parameters of the search request, one or more records may be returned to the client. The standard provides a variety of ways that clients can manipulate results sets, including services to sort or delete them. When the searching is complete, the next step is likely to be that the client sends a present request. This requests the server to send specified records from the results set to the client in a specified format. The present service has a wide range of options for controlling content and formats, and for managing large records or large results sets.
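The flow of a session can be sketched in outline. The classes below are hypothetical stand-ins written to show the order of operations and the protocol's distinctive stateful result sets; they are not an actual Z39.50 implementation, and real messages are binary-encoded protocol data units, not method calls.

```python
class StatefulServer:
    """Toy stand-in for a Z39.50 server.

    The key property modeled here is that the server keeps result
    sets between requests: a search names a result set, and later
    present requests pull records from it without re-searching.
    """
    def __init__(self, database):
        self.database = database      # list of (title, record) pairs
        self.result_sets = {}

    def search(self, set_name, term):
        """Run a search and save the hits under a client-chosen name.
        Initially only the hit count is reported back to the client."""
        hits = [record for title, record in self.database if term in title]
        self.result_sets[set_name] = hits
        return len(hits)

    def present(self, set_name, start, count):
        """Return selected records from a saved result set."""
        return self.result_sets[set_name][start:start + count]

# A session: search builds a named result set, present retrieves from it.
server = StatefulServer([("Digital Libraries", "rec1"),
                         ("Library Catalogs", "rec2"),
                         ("Computer Networks", "rec3")])
hit_count = server.search("default", "Librar")   # server saves the set
records = server.present("default", 0, 2)        # no second search needed
```

Because the server retains the result set, the client can refine a large set with further requests, or fetch any record from it, without the cost of searching the whole database again.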
This is a large and flexible standard. In addition to these basic services, Z39.50 has facilities for browsing indexes, for access control and resource management, and supports extended services that allow a wide range of extensions.
One of the principal applications of Z39.50 is for communication between servers. A catalog system at a large library can use the protocol to search a group of peers to see if they have either a copy of a work or a catalog record. End users can use a single Z39.50 client to search several catalogs, sequentially, or in parallel. Libraries and their patrons gain considerable benefits from sharing catalogs in these ways, yet interoperability among public access catalogs is still patchy. Some Z39.50 implementations have features that others lack, but the underlying cause is that the individual catalogs are maintained by people whose first loyalty is to their local communities. Support for other institutions is never the first priority. Even though they share compatible versions of Z39.50, differences in how the catalogs are organized and presented to the outside world remain.
NCSTRL and Dienst
A union catalog is a single catalog that contains records about the materials in several libraries. Union catalogs were used by libraries long before computers. They solve the problem of distributed searching by consolidating the information to be searched into a single catalog. Web search services can be considered to be union catalogs for the web, albeit with crude catalog records. An alternative method of distributed searching is not to build a union catalog, but for each collection to have its own searchable index. A search program sends queries to these separate indexes and combines the results for presentation to the user.
Panel 11.4 describes an interesting example. The Networked Computer Science Technical Reference Library (NCSTRL) is a federation of digital library collections that are important to computer science researchers. It uses a protocol called Dienst. To minimize the costs of acceptance, Dienst builds on a variety of technical standards that are familiar to computer scientists, who are typically heavy users of Unix, the Internet, and the web. The first version of Dienst sent search requests to all servers. As the number of servers grew this approach broke down; Dienst now uses a search strategy that makes use of a master index, which is a type of union catalog. For reasons of performance and reliability, this master index is replicated at regional centers.
The Networked Computer Science Technical Reference Library (NCSTRL) is a distributed library of research materials in computer science, notably technical reports. Cooperating organizations mount their collections on locally maintained servers. Access to these servers uses either FTP or Dienst, a distributed library protocol developed by Jim Davis of Xerox and Carl Lagoze of Cornell University. Dienst was developed as part of the Computer Science Technical Reports project, mentioned in Chapter 4. Initially there were five cooperating universities, but, by 1998, the total number was more than 100 organizations around the world, of which forty-three operated Dienst servers.
Dienst is an architecture that divides digital library services into four basic categories: repositories, indexes, collections, and user interfaces. It provides an open protocol that defines these services. The protocol supports distributed searching of independently managed collections. Each server has an index of the materials that it stores. In early versions of Dienst, to search the collections, a user interface program sent a query to all the Dienst sites, seeking objects that match the query. The user interface then waited until it received replies from all the servers. This is distributed searching at its most basic and it ran into problems when the number of servers grew large. The fundamental problem was that the quality of service seen by a user was determined by the service level provided by the worst of the Dienst sites. At one period, the server at Carnegie Mellon University was undependable. If it failed to respond, the user interface waited until an automatic time-out was triggered. Even when all servers were operational, since the performance of the Internet is highly variable, there were often long delays caused by slow connections to a few servers.
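The early fan-out strategy, and the time-out that mitigates a dead server, can be sketched as follows. This is an illustrative sketch, not Dienst code: the servers are represented as plain functions, and a real implementation would speak the Dienst protocol over HTTP.

```python
from concurrent.futures import ThreadPoolExecutor

def fanout_search(servers, query, timeout=2.0):
    """Send the query to every server in parallel and merge the replies.

    servers maps a site name to a callable that runs the query there.
    A site that fails or exceeds the timeout is skipped, so one slow
    or dead server no longer stalls the whole search -- but its results
    are silently missing, which is the serious weakness noted below.
    """
    merged, missing = [], []
    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        futures = {name: pool.submit(run, query)
                   for name, run in servers.items()}
        for name, future in futures.items():
            try:
                merged.extend(future.result(timeout=timeout))
            except Exception:            # time-out or server error
                missing.append(name)
    return merged, missing
```

Without the timeout, the user waits for the slowest site on every search; with it, the search is fast but possibly incomplete. The regional master-index design described next removes this dilemma by taking the individual sites out of the search path entirely.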
A slow search is tedious for the user; for a search to miss some of the collections is more serious. When a user carries out research, failure to search all the indexes means that the researcher might miss important information, simply because of a server problem.
To address these problems, Dienst was redesigned. NCSTRL is now divided into regions. Initially there were two regional centers in the United States and four in Europe. With this regional model, a master index is maintained at a central location, Cornell University, and searchable copies of this index are stored at the regional centers. Everything that a user needs to search and locate information is at each regional site. The user contacts the individual collection sites only to retrieve materials stored there. The decision of which regional site to use is left to the individual user; because of the vagaries of the Internet, a user may get the best performance from a regional site that is not the closest geographically.
NCSTRL and Dienst combine day-to-day services with a testbed for distributed information management. It is one of only a small number of digital libraries where a research group operates an operational service.
Research on alternative approaches to distributed searching
In a federation, success or failure is at least as much an organizational challenge as a technical one. Inevitably, some members provide better service than others and their levels of support differ widely, but the quality of service must not be tied to the worst performing organization. Panel 11.4 describes how the Dienst system, used by NCSTRL, was redesigned to address this problem.
Every information service makes implicit assumptions about the scenarios that it supports, the queries that it accepts, and the kinds of answers that it provides. These assumptions are implemented as facilities such as search, browse, filter, and extract. The user desires coherent, personalized information, but, in the distributed world, the information sources may individually be coherent yet collectively differ. How can a diverse set of organizations provide access to their information in a manner that is effective for the users? How can services that are designed around different implied scenarios provide effective resource discovery without draconian standardization? These are tough technical problems even for a single, centrally managed service. They become really difficult when the information sources are controlled by independent organizations.
The organizational challenges are so great that they constrain the technical options. Except within tight federations, the only hope for progress is to establish technical frameworks that organizations can adopt incrementally, each at its own pace. For each method there must be a low-level alternative, usually the status quo, so that a service is not prevented from doing anything worthwhile by a few lagging systems. Thus, in NCSTRL, although Dienst is the preferred protocol, more than half the sites mount their collections on simple servers using the FTP protocol.
Much of the research in distributed searching sets out to build union catalogs from metadata provided by the creator or publisher. This was one of the motivations behind the Dublin Core and the Resource Description Framework (RDF), which were described in Chapter 10. Computer systems are needed to assemble this metadata and consolidate it into a searchable index. Panel 11.5 describes the Harvest architecture for building indexes of many collections.
Harvest was a research project in distributed searching, led by Michael Schwartz who was then at the University of Colorado. Although the project ended in 1996, the architectural ideas that it developed remain highly relevant. The underlying concept is to take the principal functions that are found in a centralized search system and divide them into separate subsystems. The project defined formats and protocols for communication among these subsystems, and implemented software to demonstrate their use.
A central concept of Harvest is a gatherer. This is a program that collects indexing information from digital library collections. Gatherers are most effective when they are installed on the same system as the collections. Each gatherer extracts indexing information from the collections and transmits it in a standard format and protocol to programs called brokers. A broker builds a combined index with information about many collections.
The Harvest architecture is much more efficient in its use of network resources than indexing methods that rely on web crawlers, and the team developed caches and methods of replication for added efficiency, but the real benefit is better searching and information discovery. All gatherers transmit information in a specified protocol, called the Summary Object Interchange Format (SOIF), but how they gather the information can be tailored to the individual collections. While web crawlers operate only on open access information, gatherers can be given access privileges to index restricted collections. They can be configured for specific databases and need not be restricted to web pages or any specific format. They can incorporate dictionaries or lexicons for specialized topic areas. In combination, these are major advantages.
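To make the gatherer-to-broker transmission concrete, the sketch below builds a record in the general style of SOIF: a template type and URL wrapping a set of attribute-value pairs, with each value's byte length given in braces. This is an illustration of the format's shape rather than a faithful implementation of the specification; the attribute names and document details are invented.

```python
# Sketch of the kind of record a Harvest gatherer might emit to a
# broker. SOIF-style layout: "@TEMPLATE { url" followed by
# "Attribute{byte-length}: value" lines and a closing brace.
# Attribute names and the example document are illustrative.
def soif_record(template, url, attributes):
    lines = [f"@{template} {{ {url}"]
    for name, value in attributes.items():
        length = len(value.encode("utf-8"))   # length is in bytes, not chars
        lines.append(f"{name}{{{length}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines)

record = soif_record("FILE", "http://example.org/tr/93-1501.ps", {
    "Title": "Distributed Indexing",
    "Author": "A. Researcher",
})
```

Because the gatherer runs beside the collection, it can fill these attributes from local knowledge, such as restricted files, specialized databases, or topic lexicons, that no external crawler could reach.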
Many benefits of the Harvest architecture are lost if the gatherer is not installed locally, with the digital library collections. For this reason, the Harvest architecture is particularly effective for federated digital libraries. In a federation, each library can run its own gatherer and transmit indexing information to brokers that build consolidated indexes for the entire library, combining the benefits of local indexing with a central index for users.
Another area of research is to develop methods for restricting searches to the most promising collections. Users rarely want to search every source of information on the Internet. They want to search specific categories, such as monograph catalogs, or indexes to medical research. Therefore, some means is needed for collections to provide summaries of their contents. This is particularly important where access is limited by authentication or payment mechanisms. If open access is provided to a source, an external program can, at least in theory, generate a statistical profile of the types of material and the vocabulary used. When an external user has access only through a search interface, such analysis is not possible.
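A minimal version of such a statistical profile is easy to sketch. Given open access to a collection's documents, a program can count how many documents each term appears in; a client can then compare profiles to decide which collections are worth querying. The documents and the profile's exact contents below are illustrative assumptions, not a standardized summary format.

```python
# Sketch of a collection content summary: document frequency of each
# term, plus the collection size. The sample documents are invented.
from collections import Counter

def collection_profile(documents):
    """Count, for each term, the number of documents containing it."""
    df = Counter()
    for text in documents:
        df.update(set(text.lower().split()))   # set(): count each doc once
    return {"size": len(documents), "term_df": dict(df)}

profile = collection_profile([
    "clinical trial of aspirin",
    "randomized clinical study",
    "monograph catalog records",
])
```

A client seeing that "clinical" occurs in most documents might route medical queries to this collection and skip it for other topics, which is exactly the kind of routing decision that a search-only interface makes impossible.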
Luis Gravano, while at Stanford University, studied how a client can combine results from separate search services. He developed a protocol, known as STARTS, for this purpose. This was a joint project between Stanford University and several leading Internet companies. The willingness with which the companies joined in the effort shows that they see the area as fundamentally important to broad searching across the Internet. A small amount of standardization would lead to greatly improved searching.
In his analysis, Gravano viewed the information on the Internet as a large number of collections of materials, each organized differently and each with its own search engine. The fundamental concept is to enable clients to discover broad characteristics of the search engines and the collections that they maintain. The challenge is that the search engines are different and the collections have different characteristics. The difficulty is not simply that the interfaces have varying syntaxes, so that a query must be re-formulated before it can be submitted to different systems. The underlying algorithms are fundamentally different. Some use Boolean methods; others rank their results. Search engines that return a ranked list give little indication of how the ranks were calculated; indeed, the ranking algorithm is often a trade secret. As a result, it is impossible to merge ranked lists from several sources into a single, overall list with a sensible ranking. The rankings are also strongly affected by the words used in a collection, so that, even when two sources use the same ranking algorithm, merging their results is fraught with difficulty. The STARTS protocol enables the search engines to report characteristics of their collections and the ranks that they generate, so that a client program can attempt to combine results from many sources.
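The merging difficulty can be seen in a small sketch. One naive approach a client might try, when each engine at least reports its scores, is to rescale every source's scores to a common range before interleaving. This is not the STARTS protocol itself, and the engines and scores are invented; the sketch mainly shows why the problem is hard, since identical normalized scores from different engines do not imply equal relevance.

```python
# Naive merging of ranked lists from engines with incompatible score
# scales: min-max normalize each source to [0, 1], then sort the
# combined list. Engines and scores are illustrative.
def normalize(results):
    """Rescale one engine's (document, score) list to the range [0, 1]."""
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0            # avoid dividing by zero
    return [(doc, (score - lo) / span) for doc, score in results]

def merge(*ranked_lists):
    combined = [item for rl in ranked_lists for item in normalize(rl)]
    return sorted(combined, key=lambda item: item[1], reverse=True)

merged = merge(
    [("a1", 12.0), ("a2", 3.0)],               # engine A: unbounded scores
    [("b1", 0.9), ("b2", 0.5), ("b3", 0.1)],   # engine B: probabilities
)
```

Both top documents end up with a normalized score of 1.0 even though one engine may be far more discriminating than the other, which is precisely why STARTS asks engines to report how their scores were produced rather than leaving clients to guess.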
Information discovery is more than searching. Most individuals use some combination of browsing and systematic searching. Chapter 10 discussed the range of requirements that users have when looking for information and the difficulty of evaluating the effectiveness of information retrieval in an interactive session with the user in the loop. All these problems are aggravated with distributed digital libraries.
Browsing has always been an important way to discover information in libraries. It can be as simple as going to the library shelves to see what books are stored together. A more systematic approach is to begin with one item and then move to the items that it refers to. Most journal articles and some other materials include lists of references to other materials. Following these citations is an essential part of research, but it is a tedious task when the materials are physical objects that must be retrieved one at a time. With hyperlinks, following references becomes straightforward. A gross generalization is that following links and references is easier in digital libraries, but the quality of catalogs and indexes is higher in traditional libraries. Therefore, browsing is likely to be relatively more important in digital libraries.
If people follow a heuristic combination of browsing and searching, using a variety of sources and search engines, what confidence can they have in the results? This chapter has already described the difficulties of comparing results obtained from searching different sets of information and of deciding whether two items found in different sources are duplicates of the same information. For the serious user of a digital library there is a more subtle but potentially more serious problem: it is often difficult to know how comprehensive a search actually is. A user who searches a central database, such as the National Library of Medicine's Medline system, can be confident of searching every record indexed in that system. Contrast this with a distributed search of a large number of datasets. What is the chance of missing important information because one dataset is behind the others in supplying indexing information, or fails to reply to a search request?
Overall, distributed searching epitomizes the current state of digital libraries. From one viewpoint, every technique has serious weaknesses, the technical standards have not emerged, the understanding of user needs is embryonic, and organizational difficulties are pervasive. Yet, at the same time, enormous volumes of material are accessible on the Internet, web search programs are freely available, federations and commercial services are expanding rapidly. By intelligent combination of searching and browsing, motivated users can usually find the information they seek.
Last revision of content: January 1999
Formatted for the Web: December 2002
(c) Copyright The MIT Press 2000