Chapter 3
Libraries and publishers

Computing in libraries

Faced with the management of huge collections over long periods of time, libraries have a proud tradition as early adopters of new technology, from microfilm, through online information services and CD-ROMs, to digital libraries. (The second typewriter imported into Brazil was for the National Library of Brazil to type library catalog cards.) The Internet and the web are the latest examples of technology that has been embraced by libraries and is being adapted to their needs, but far from the first.

Mainstream commercial products do not always fit the needs of libraries. The computing industry aims its product developments at the large markets provided by commercial businesses, science, medicine, and the military. In the United States, much of the basic research behind these products has been funded by the Department of Defense or companies such as IBM. Until recently, these large markets had little interest in managing the kinds of information found in digital libraries. Consequently, the products sold by the computing industry did not address important needs in information management, and libraries are accustomed to taking core technology from other fields and tailoring it to their needs. This chapter introduces several such areas, most of which are described more fully in later chapters.

Resource sharing and online catalogs

In libraries, the word "networking" has two meanings. Before modern computer networks, libraries had a wide variety of arrangements to share resources. These included inter-library lending, reciprocal reading privileges between institutions, and the exchange of photocopies. The term given to these arrangements was "networking" or "resource sharing". Resource sharing services, such as the photocopying and document delivery provided by the British Library, enhance the availability of information around the world. However, nothing in resource sharing has had an impact to rival computer networking.

Shared cataloguing was mentioned in Chapter 1 as an early use of computers to support libraries. Almost every library has a catalog with records of the materials in the collections. The catalog has several uses. It helps users find material in the library, it provides bibliographic information about the items, and it is an important tool in managing the collections.

Cataloguing is an area in which librarians use precise terminology, some of which may be unfamiliar to outsiders. Whereas a non-specialist may use the term catalog as a generic term, in a library it has a specific meaning as a collection of bibliographic records created according to strict rules. Rather than the everyday word "book", librarians use the more exact term monograph. Over the years, the information that is included in a monograph catalog record has been codified into cataloguing rules. English-speaking countries use the Anglo-American Cataloguing Rules (AACR). The current version is known as AACR2.

The task of cataloguing each monograph is time-consuming and requires considerable expertise. To save costs, libraries share catalog records. Major libraries that catalog large numbers of monographs, such as the Library of Congress and the major university libraries, make their catalog records available to others, free of charge. Rather than create duplicate records, most libraries look for an existing catalog record and create their own records only when they cannot find another to copy.

Since the late 1960s, this sharing of catalog records has been computer based. Technically, the fundamental tool for distributing catalog records is the MARC format (an abbreviation for Machine-Readable Cataloging). MARC was developed by Henriette Avram and colleagues at the Library of Congress, initially as a format to distribute monograph catalog records on magnetic tape. To be precise, it is necessary to distinguish between AACR2, which provides the cataloguing rules, and MARC, which is a format for representing the resulting catalog records. In practice, the term "MARC cataloguing" is often used in a general sense to cover both the catalog records and the computer format in which they are stored. MARC cataloguing has been expanded far beyond monographs and is now used for most categories of library materials, including serials, archives, manuscripts, and many more. Panel 3.1 gives an example of a MARC record.

Panel 3.1. A MARC record

Consider a monograph, for which the conventional bibliographic citation is:

Caroline R. Arms, editor, Campus strategies for libraries and electronic information. Bedford, MA: Digital Press, 1990.

A search of the catalog at the Library of Congress, using one of the standard terminal-based interfaces, displays the record in a form that shows the information in the underlying MARC format.

&001 89-16879 r93
&050 Z675.U5C16 1990
&082 027.7/0973 20
&245 Campus strategies for libraries and electronic information/Caroline Arms, editor.
&260 {Bedford, Mass.} : Digital Press, c1990.
&300 xi, 404 p. : ill. ; 24 cm.
&440 EDUCOM strategies series on information technology
&504 Includes bibliographical references (p. {373}-381).
&020 ISBN 1-55558-036-X : $34.95
&650 Academic libraries--United States--Automation.
&650 Libraries and electronic publishing--United States.
&650 Library information networks--United States.
&650 Information technology--United States.
&700 Arms, Caroline R. (Caroline Ruth)
&040 DLC DLC DLC &043 n-us---
&955 CIP ver. br02 to SL 02-26-90
&985 APIF/MIG

The information is divided into fields, each with a three-digit code. For example, the 440 field is the title of a monograph series, and the 650 fields are Library of Congress subject headings. Complex rules specify to the cataloguer which fields should be used and how relationships between elements should be interpreted.

The actual coding is more complex than shown. The full MARC format consists of a pre-defined set of fields each identified by a tag. Within each field, subfields are permitted. Fields are identified by three-digit numeric tags and subfields by single letters. To get a glimpse of how information is encoded in this format, consider the 260 field, which begins "&260". In an actual MARC record, this is encoded as:

     &2600#abc#{Bedford, Mass.} :#Digital Press,#c1990.%

This field contains information about publication, divided into three subfields. The string "abc" indicates that there are three subfields: the first, with tag "a", is the place of publication; the next, with tag "b", is the publisher; and the third, with tag "c", is the date.
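
As an informal illustration, the short Python program below parses the simplified notation used in this panel, in which "&" introduces a field, "#" separates subfields, and "%" marks the end of the field. A real MARC record uses the ISO 2709 exchange structure, with a leader, a directory, and non-printing delimiter characters, so this sketch shows only the logical structure, not the actual encoding.

     # Parse the simplified field notation shown above. Illustrative only;
     # real MARC records are encoded in the ISO 2709 exchange structure.
     def parse_simplified_field(encoded):
         """Split a field such as '&2600#abc#...#...%' into its parts."""
         body = encoded.strip().lstrip("&").rstrip("%")
         parts = body.split("#")
         tag = parts[0][:3]             # three-digit field tag, e.g. "260"
         indicators = parts[0][3:]      # indicator characters, if any
         subfield_codes = parts[1]      # e.g. "abc"
         subfields = dict(zip(subfield_codes, parts[2:]))
         return tag, indicators, subfields

     field = "&2600#abc#{Bedford, Mass.} :#Digital Press,#c1990.%"
     tag, indicators, subfields = parse_simplified_field(field)
     # subfields == {"a": "{Bedford, Mass.} :",
     #               "b": "Digital Press,",
     #               "c": "c1990."}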

The development of MARC led to two important types of computer-based system. The first was shared cataloguing; the pioneer was OCLC, created by Fred Kilgour in 1967. OCLC has a large computer system which has grown to more than 35 million catalog records in MARC format, including records received from the Library of Congress. When an OCLC member library acquires a book that it wishes to catalog, it begins by searching the OCLC database. If it finds a MARC record, it downloads the record to its own computer system and records the holding in the OCLC database; this is called "copy cataloguing". In the past, the library could also order printed catalog cards. If the OCLC database does not contain the record, the library is encouraged to create a record and contribute it to OCLC. With copy cataloguing, each item is catalogued once and the intellectual effort is shared among all libraries. MARC cataloguing and OCLC's success in sharing catalog records have been emulated by similar services around the world.

The availability of MARC records stimulated a second development. Individual libraries were able to create online catalogs of their holdings. In most cases, the bulk of the records were obtained from copy cataloguing. Today, almost every substantial library in the United States has an online catalog. Library jargon calls such a catalog an "OPAC", for "online public access catalog". Many libraries have gone to great efforts to convert their old card catalogs to MARC format, so that the online catalog is the record of their entire holdings, rather than having an online catalog for recent acquisitions but traditional card catalogs for older materials. The retrospective conversion of Harvard University's card catalog to MARC format has recently been completed. Five million cards were converted at a cost approaching $15 million.

A full discussion of MARC cataloguing and online public access catalogs is outside the scope of this book. MARC was an innovative format at a time when most computer systems represented text as fixed length fields with capital letters only. It remains a vital format for libraries, but it is showing its age. Speculation on the future of MARC is complicated by the enormous investment that libraries have made in it. Whatever its future, MARC was a pioneering achievement in the history of both computing and libraries. It is a key format that must be accommodated by digital libraries.

Linking online catalogs and Z39.50

During the 1980s, university libraries began to connect their online catalogs to networks. As an early example, by 1984 there was a comprehensive campus network at Dartmouth College. Since the computer that held the library catalog was connected to the network, anybody with a terminal or personal computer on campus could search the catalog. Subsequently, when the campus network was connected to the Internet, the catalog became available to the whole world. People outside the university could search the catalog and discover what items were held in the libraries at Dartmouth. Members of the university could use their own computers to search the catalogs of other universities. This sharing of library catalogs was one of the first large-scale examples of cooperative sharing of information over the Internet.

In the late 1970s, several bibliographic organizations, including the Library of Congress, the Research Libraries Group, and the Washington Libraries Information Network, began a project known as the Linked Systems Project, which developed the protocol now known as Z39.50. This protocol allows one computer to search for information on another. It is primarily used for searching records in MARC format, but the protocol is flexible and is not restricted to MARC. Technically, Z39.50 specifies rules that allow one computer to search a database on another and retrieve the records that are found by the search. Z39.50 and its role in fostering interoperability among digital libraries are discussed in Chapter 11. It is one of the few protocols to be widely used for interoperation among diverse computer systems.
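
The flavor of the protocol can be conveyed by a conceptual sketch. The Python fragment below mimics the three basic steps of a Z39.50 session: initialization, a search that creates a result set held on the server, and a present request that retrieves records from that result set. The message and field names are simplified assumptions for illustration; the actual protocol defines its messages in ASN.1 and sends them in a binary encoding over a network connection.

     # A conceptual sketch of the Z39.50 interaction pattern, written as
     # plain Python dictionaries. The names below are simplifications.
     init_request = {"operation": "init", "preferred-message-size": 32768}

     search_request = {
         "operation": "search",
         "database": "library-catalog",
         "query": 'title = "campus strategies"',   # hypothetical query form
         "result-set-name": "rs1",                 # server keeps the result set
     }

     # The server replies with a count of matching records; the client then
     # asks for some of them, in a record syntax it understands (e.g. MARC).
     present_request = {
         "operation": "present",
         "result-set-name": "rs1",
         "start": 1,
         "count": 10,
         "preferred-record-syntax": "MARC",
     }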

Abstracts and indexes

Library catalogs are the primary source of information about monographs, but they are less useful for journals. Catalogs provide a single, brief record for an entire run of a journal, which is of little value to somebody who wants to discover individual articles in academic journals. Abstracting and indexing services developed to help researchers find such information. Typical services are Medline for the biomedical literature, Chemical Abstracts for chemistry, and Inspec for the physical sciences including computing. The services differ in many details, but the basic structures are similar. Professionals who are knowledgeable about a subject area read each article from a large number of journals and assign index terms or write abstracts. Some services use index terms that are drawn from a carefully controlled vocabulary, such as the MeSH headings that the National Library of Medicine uses for its Medline service. Other services are less strict. Some generate all their own abstracts. Others, such as Inspec, will use an abstract supplied by the publisher.

Most of these services began as printed volumes that were sold to libraries, but computer searching of these indexes goes back to the days of batch processing and magnetic tape. Today, almost all searching is by computer. Some indexing services run computer systems on which users can search for a fee; others license their data to third parties who provide online services. Many large libraries license the data and mount it on their own computers. In addition, much of the data is available on CD-ROM.

Once the catalog was online, libraries began to mount other data, such as abstracts of articles, indexes, and reference works. These sources of information can be stored in a central computer and the retrieved records displayed on terminals or personal computers. Reference works consisting of short entries are particularly suited for this form of distribution, since users move rapidly from one entry to another and will accept a display that has text characters with simple formatting. Quick retrieval and flexible searching are more important than the aesthetics of the output on the computer screen.

As a typical example, here are some of the many information sources that the library at Carnegie Mellon University provided online during 1997/8.

     Carnegie Mellon library catalog
     Carnegie Mellon journal list
     Bibliographic records of architectural pictures and drawings
     Who's who at CMU
     American Heritage Dictionary
     Periodical Abstracts
     ABI/Inform (business periodicals)
     Inspec (physics, electronics, and computer science)
     Research directory (Carnegie Mellon University)

Several of these online collections provide local information, such as Who's who at CMU, which is the university directory of faculty, students, and staff. Libraries do not provide their patrons only with formally published or academic materials. Public libraries, in particular, are a broad source of information, from tax forms to bus timetables. Full-text indexes and web browsers allow traditional and non-traditional library materials to be combined in a single system with a single user interface. This approach has become so standard that it is hard to realize that only a few years ago merging information from such diverse sources was rare.

Mounting large amounts of information online and keeping it current is expensive. Although hardware costs fall continually, they are still noticeable, but the big costs are in licensing the data and in the people who handle both the business aspects and the large data files. To reduce these costs, libraries have formed consortia in which one set of online data serves many libraries. The MELVYL system, which serves the campuses of the University of California, was one of the first. It is described in Chapter 5.

Information retrieval

Information retrieval is a central topic for libraries. A user, perhaps a scientist, doctor, or lawyer, is interested in information on some topic and wishes to find the objects in a collection that cover the topic. This requires specialized software. During the mid-1980s, libraries began to install computers with software that allowed full-text searching of large collections. Usually, the MARC records of a library's holdings were the first data to be loaded onto this computer, followed by standard reference works. Full-text searching meant that a user could search using any words that appeared in the record and did not need to be knowledgeable about the structure of the records or the rules used to create them.

Research in this field is at least thirty years old, but the basic approach has changed little. A user expresses a request as a query. This may be a single word, such as "cauliflower", a phrase, such as "digital libraries", or a longer query, such as, "In what year did Darwin travel on the Beagle?" The task of information retrieval is to find objects in the collection that match the query. Since a computer cannot afford to read through the entire collection for each search, examining every object separately, it must maintain indexes of some kind and answer queries by looking up entries in them.
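
A minimal sketch illustrates the role of such an index. The Python fragment below builds an inverted index, which records, for every word, the documents that contain it, and then answers a one-word query with a single lookup. Production retrieval systems add stemming, stop lists, ranking, and compressed index structures, none of which are shown here.

     # Build a tiny inverted index: for each word, the set of documents
     # that contain it. The sample documents are invented for illustration.
     from collections import defaultdict

     documents = {
         1: "digital libraries and electronic information",
         2: "campus strategies for libraries",
         3: "the voyage of the Beagle",
     }

     index = defaultdict(set)
     for doc_id, text in documents.items():
         for word in text.lower().split():
             index[word].add(doc_id)

     # A one-word query is answered by a single lookup rather than by
     # scanning every document in the collection.
     print(sorted(index["libraries"]))   # -> [1, 2]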

As computers have grown more powerful and the price of storage has declined, methods of information retrieval have moved from carefully controlled searching of short records, such as catalog records or those used by abstracting and indexing services, to searching the full text of every word in large collections. In the early days, expensive computer storage and processing power stimulated the development of compact methods of storage and efficient algorithms. More recently, web search programs have intensified research into methods for searching large amounts of information distributed across many computers. Information retrieval is a topic of Chapters 10 and 11.

Representations of text and SGML

Libraries and publishers share an interest in using computers to represent the full richness of textual materials. Textual documents are more than simple sequences of letters. They can contain special symbols, such as mathematics or musical notation, characters from any language in the world, embedded figures and tables, various fonts, and structural elements such as headings, footnotes, and indexes. A desirable way to store a document in a computer is to encode these features and store them with the text, figures, tables, and other content. Such an encoding is called a mark-up language. For several years, organizations with a serious interest in text have been developing a mark-up scheme known as SGML. (The name is an abbreviation for Standard Generalized Markup Language.) HTML, the format for text that is used by the web, is a simple derivative of SGML.

Since the representation of a document in SGML is independent of how it will be used, the same text, defined by its SGML mark-up, can be displayed in many forms and formats: paper, CD-ROM, online text, hypertext, and so on. This makes SGML attractive for publishers who may wish to produce several versions of the same underlying work. A pioneer application in using SGML in this way was the new Oxford English Dictionary. SGML has also been heavily used by scholars in the humanities who find in SGML a method to encode the structure of text that is independent of any specific computer system or method of display. SGML is one of the topics in Chapter 9.
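
The separation of structure from presentation can be suggested by a small sketch. The Python fragment below stores a dictionary entry as a structure, comparable in spirit to an SGML-encoded entry although the element names are invented for this example, and renders the same entry in two different forms.

     # One structural representation, two renderings. The element names
     # ("headword", "definition") are invented; an actual SGML application
     # would define its elements in a document type definition.
     entry = {"headword": "monograph",
              "definition": "a book that is a single, complete work"}

     def render_plain(e):
         return f"{e['headword']}: {e['definition']}"

     def render_html(e):
         return f"<p><b>{e['headword']}</b>: <i>{e['definition']}</i></p>"

     print(render_plain(entry))
     print(render_html(entry))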

Digital libraries of scientific journals

The early experiments

During the late 1980s several publishers and libraries became interested in building online collections of scientific journals. The technical barriers that had made such projects impossible earlier were disappearing, though still present to some extent. The cost of online storage was coming down, personal computers and networks were being deployed, and good database software was available. The major obstacles to building digital libraries were that academic literature was on paper, not in electronic formats, and that institutions were organized around physical media, not computer networks.

One of the first attempts to create a campus digital library was the Mercury Electronic Library, a project that we undertook at Carnegie Mellon University between 1987 and 1993. Mercury was able to build upon the advanced computing infrastructure at Carnegie Mellon, which included a high-performance network, a fine computer science department, and a tradition of innovation by the university libraries. A slightly later effort was the CORE project at Cornell University to mount images of chemistry journals. Both projects worked with scientific publishers to scan journals and establish collections of online page images. Whereas Mercury set out to build a production system, CORE also emphasized research into user interfaces and into how the system was used by chemists. The two projects are described in Panel 3.2.

Panel 3.2. Mercury and CORE

Mercury

Mercury was a five-year project to build a prototype digital library at Carnegie Mellon University. It began in 1988 and went live in 1991 with a dozen textual databases and a small number of page images of journal articles in computer science. It provides a good example of the state of the art before the arrival of the web.

One of the objectives was to mount page images of journal articles, using materials licensed from publishers. Four publishers were identified as publishing sixteen of the twenty computer science journals most heavily used on campus. They were ACM, IEEE, Elsevier, and Pergamon. During the project, Pergamon was taken over by Elsevier. None of the publishers had machine-readable versions of their journals, but they gave permission to convert printed materials for use in the library. Thus, an important part of the work was the conversion, storage, and delivery of page images over the campus network.

The fundamental paradigm of Mercury was searching a text database to identify the information to be displayed. An early version of Z39.50 was chosen as the protocol to send queries between the clients and the server computers on which the indexes were stored. Mercury introduced the concept of a reference server, which keeps track of the information stored on the servers, the fields that can be searched, the indexes, and access restrictions. To display bit-mapped images, the project developed a new algorithm to take page images stored in compressed format, transmit them across the network, decompress them, and display them, with an overall response time of one to two seconds per page.

Since Mercury was attached to the Internet and most materials were licensed from publishers, security was important. Carnegie Mellon already had a mature set of network services, known as Andrew. Mercury was able to use standard Andrew services for authentication and printing. Electronic mail was used to dispatch information to other computers.

CORE

CORE was a joint project by Bellcore, Cornell University, OCLC, and the American Chemical Society that ran from 1991 to 1995. The project converted about 400,000 pages, representing four years of articles from twenty journals published by the American Chemical Society.

The project used a number of ideas that have since become popular in conversion projects. CORE included two versions of every article, a scanned image and a text version marked up in SGML. The scanned images ensured that when a page was displayed or printed it had the same design and layout as the original paper version. The SGML text was used to build a full-text index for information retrieval and for rapid display on computer screens. Two scanned images were stored for each page, one for printing and the other for screen display. The printing version was black and white, 300 dots per inch; the display version was 100 dots per inch, grayscale.

CORE was one of the first projects to articulate several issues that have since been confirmed in numerous studies. The technical problems of representing, storing, and displaying complex scientific materials are substantial, particularly if they were originally designed to be printed and digitization was an afterthought. CORE also highlighted the problems of scale. The CORE collections, though they covered only twenty journals, occupied some 80 gigabytes of storage. All the early digitization projects found that the operational problems of managing large collections were greater than predicted.

Despite such problems, the user interface studies showed considerable promise. Chemists liked the collections. Although they may prefer paper for reading an article in detail, they found the online versions easier to search and the screen displays of articles more than adequate. CORE was one of the first studies to emphasize the importance of browsing in addition to searching, an experience which the web has amply demonstrated.

Although both the Mercury and CORE projects converted existing journal articles from print to bit-mapped images, conversion was not seen as the long-term future of scientific libraries. It simply reflected the fact that none of the journal publishers were in a position to provide other formats. Printers had used computer typesetting for many years, but their systems were organized entirely to produce printed materials. The printers' files were in a wide variety of formats. Frequently, proof corrections were held separately and not merged with the master files, so that they were not usable in a digital library without enormous effort.

Mercury and CORE were followed by a number of other projects that explored the use of scanned images of journal articles. One of the best known was Elsevier Science Publishing's Tulip project. For three years, Elsevier provided a group of universities, which included Carnegie Mellon and Cornell, with images from forty-three journals in materials science. Each university individually mounted these images on its own computers and made them available locally.

Projects such as Mercury, CORE, and Tulip were not long-term production systems. Each had rough edges technically and suffered from the small size of the collection provided to researchers, but they were followed by systems that have become central parts of scientific journal publishing. They demonstrated that the potential benefits of a digital library could indeed be achieved in practice. The next generation of developments in electronic publishing was able to take advantage of much cheaper computer storage, allowing large collections of images to be held online. The emergence of the web and the widespread availability of web browsers went a long way towards solving the problem of user interface development. Web browsers are not ideal for a digital library, though they are a good start, but they have the great advantage that they exist for all standard computers and operating systems. No longer does every project have to develop its own user interface programs for every type of computer.

Scientific journal publishers

Until the mid-1990s, established publishers of scientific journals hesitated about online publishing. While commercial publishing on CD-ROM had developed into an important part of the industry, few journals were available online. Publishers could not make a business case for online publishing and feared that, if materials were available online, sales of the same materials in print would suffer. By about 1995, it became clear that far-reaching changes were occurring in how people were using information. Libraries and individuals were expanding their uses of online information services and other forms of electronic information much more rapidly than their use of printed materials. Print publications were competing for a declining portion of essentially static library budgets.

Panel 3.3. HighWire Press

HighWire Press is a venture of the Stanford University Libraries. It has brought some of the best scientific and medical journals online by building partnerships with the scientific and professional societies that publish the journals. Its success can be credited to attention to the interests of the senior researchers, a focus on the most important journals, and technical excellence.

HighWire Press began in 1995 with an online version of the Journal of Biological Chemistry, taking advantage of the fact that several members of the editorial board were Stanford faculty. This is a huge journal, the second largest in the world. Since each weekly edition is about 800 pages, nobody reads it from cover to cover. Therefore the HighWire Press interface treats articles as separate documents. It provides two ways to find them: by browsing the contents, issue by issue, or by searching. The search options include author, words in the title, and full-text searching of the abstract or the entire article. The search screen and the display of the articles are designed to emphasize that this is indeed the Journal of Biological Chemistry, not a publication of HighWire Press. Great efforts have been made to display Greek letters, mathematical symbols, and other special characters.

In three years, HighWire Press went from an experiment to a major operation with almost one hundred journals online, including the most prominent of all, Science, published by the American Association for the Advancement of Science (AAAS). Despite its resources, the AAAS had a problem that is faced by every society that publishes only a few journals. The society feared that an in-house effort to bring Science online might over-extend its staff and lower the overall quality of its work. A partnership with HighWire Press enabled the AAAS to share the development costs with other society journals and to collaborate with specialists at Stanford University. The AAAS has been delighted by the number of people who visit its site every week, most of whom were not regular readers of Science.

Chapter 6 explains that journal publishing is in the grip of a price spiral with flat or declining circulation. In the past, publishers have used new technology to reduce production costs, thus mitigating some of the problems of declining circulation, but these savings are becoming exhausted. They realize that, if they want to grow their business or even sustain its current level, they need to have products for the expanding online market. Electronic information is seen as a promising growth market.

Most of the major publishers of scientific journals have moved rapidly to electronic publishing of their journals. The approach taken by the Association for Computing Machinery (ACM) is described in Panel 3.4, but, with only minor differences, the same strategy is being followed by the large commercial publishers, including Elsevier, John Wiley, and Academic Press, by societies such as the American Chemical Society, and by university presses, including M.I.T. Press.

Panel 3.4. The Association for Computing Machinery's Digital Library

The Association for Computing Machinery (ACM) is a professional society that publishes seventeen research journals in computer science. In addition, its thirty-eight special interest groups run a wide variety of conferences, many of which publish proceedings. ACM's members are practicing computer scientists, including many of the people who built the Internet and the web. These members were some of the first people to become accustomed to communicating online, and they expected their society to be a leader in the movement to online journal publication.

Traditionally, the definitive version of a journal has been a printed volume. In 1993, the ACM decided that its future production process would use a computer system that creates a database of journal articles, conference proceedings, magazines and newsletters, all marked up in SGML. Subsequently, ACM also decided to convert large numbers of its older journals to build a digital library covering its publications from 1985.

One use of the SGML files is as a source for printed publications. However, the plan was much more ambitious. The ACM planned for the day when members would retrieve articles directly from the online database, sometimes reading them on the screen of a computer, sometimes downloading them to a local printer. Libraries would be able to license parts of the database or take out a general subscription for their patrons.

The collection came online during 1997. It uses a web interface that offers readers the opportunity to browse through the contents pages of the journals, and to search by author and keyword. When an article has been identified, a subscriber can read the full text of the article. Other readers pay a fee for access to the full text, but can read abstracts without payment.

Business issues

ACM was faced with the dilemma of how to pay for this digital library. The society needed revenue to cover the substantial costs, but did not want to restrain authors and readers unduly. The business arrangements fall into two parts: the relationships with authors and with readers. Initially, both were experimental.

In 1994, ACM published an interim copyright policy, which describes the relationship between the society as publisher and the authors. It attempts to balance the interests of the authors against the need for the association to generate revenue from its publications. It was a sign of the times that ACM first published the new policy on its web server. One of the key features of this policy is the explicit acknowledgment that many of the journal articles are first distributed via the web.

To generate revenue, ACM charges for access to the full text of articles. Members of the ACM can subscribe to journals or libraries can subscribe on behalf of their users. The electronic versions of journals are priced about 20 percent below the prices of printed versions. Alternatively, individuals can pay for single articles. The price structure aims to encourage subscribers to sign up for the full set of publications, not just individual journals.

Electronic journals

The term electronic journal is commonly used to describe a publication that maintains many of the characteristics of printed journals, but is produced and distributed online. Rather confusingly, the same term is used for a journal that is purely digital, in that it exists only online, and for the digital version of a journal that is primarily a print publication, e.g., the ACM journals described in Panel 3.4.

Many established publishers have introduced a small number of purely online periodicals, and there have been numerous efforts by other groups. Some of these online periodicals set out to mimic the processes and procedures of traditional journals. Perhaps the most ambitious publication using this approach was the On-line Journal of Current Clinical Trials, which was developed by the American Association for the Advancement of Science (AAAS) in conjunction with OCLC. Unlike other publications for which the electronic publication was secondary to a printed publication, this new journal was planned as a high-quality, refereed journal for which the definitive version would be the electronic version. Since the publisher had complete control over the journal, it was possible to design it and store it in a form that was tailored to electronic delivery and display, but the journal was never accepted by researchers or physicians. It came out in 1992 and never achieved its goals, because it failed to attract the number of good papers that had been planned for. Such is the fate of many pioneers.

More recent electronic periodicals retain some characteristics of traditional journals but experiment with formats or services that take advantage of online publishing. In 1995, we created D-Lib Magazine at CNRI as an online magazine with articles and news about digital libraries research and implementation. The design of D-Lib Magazine illustrates a combination of ideas drawn from conventional publishing and the Internet community. Conventional journals appear in issues, each containing several articles. Some electronic journals publish each article as soon as it is ready, but D-Lib Magazine publishes a monthly issue, with strong emphasis on punctuality and rapid publication. The graphic design is deliberately flexible: whereas the constraints of print force strict design standards, online publication allows authors to be creative in their use of technology.

Research libraries and conversion projects

One of the fundamental tasks of research libraries is to save today's materials to be the long-term memory for tomorrow. The great libraries have wonderful collections that form the raw material of history and of the humanities. These collections consist primarily of printed material or physical artifacts. The development of digital libraries has created great enthusiasm for converting some of these collections to digital forms. Older materials are often in poor physical condition. Making a digital copy preserves the content and provides the library with a version that it can make available to the whole world. This section looks at two of these major conversion efforts.

Many of these digital library projects convert existing paper documents into bit-mapped images. Printed documents are scanned one page at a time. The scanned image is essentially a picture of the page. The page is covered by an imaginary grid. In early experiments this was often at 300 dots per inch, and the page was recorded as an array of black and white dots. More recently, higher resolutions and full-color scanning have become common. Bit-mapped images of this kind are crisp enough that they can be displayed on large computer screens or printed on paper with good legibility. Since this process generates a huge number of dots for each page, various methods are used to compress the images, which reduces the number of bits to be stored and the size of the files to be transmitted across the network, but even the simplest images are about 50,000 bytes per page.
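
The arithmetic behind these figures is straightforward. The short calculation below, for a letter-size page scanned in black and white at 300 dots per inch, shows why compression matters; the 20:1 compression ratio used at the end is a rough, representative assumption, not a measured value.

     # Storage for one scanned page: a rough estimate, not a measurement.
     width_inches, height_inches = 8.5, 11.0
     dots_per_inch = 300

     dots = (width_inches * dots_per_inch) * (height_inches * dots_per_inch)
     uncompressed_bytes = dots / 8           # one bit per dot, black or white
     print(int(uncompressed_bytes))          # about 1,050,000 bytes

     compressed_bytes = uncompressed_bytes / 20   # assumed 20:1 compression
     print(int(compressed_bytes))                 # about 50,000 bytes per page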

Panel 3.5. American Memory and the National Digital Library Program

Background

The Library of Congress, which is the world's biggest library, has magnificent special collections of unique or unpublished materials. Among the Library's treasures are the papers of twenty three presidents. Rare books, pamphlets, and papers provide valuable material for the study of historical events, periods, and movements. Millions of photographs, prints, maps, musical scores, sound recordings, and moving images in various formats reflect trends and represent people and places. Until recently, anybody who wanted to use these materials had to come to the library buildings on Capitol Hill in Washington, DC.

American Memory was a pilot program that, from 1989 to 1994, reproduced selected collections for national dissemination in computerized form. Collections were selected for their value for the study of American history and culture, and to explore the problems of working with materials of various types, such as prints, negatives, early motion pictures, recorded sound, and textual documents. Initially, American Memory used a combination of digitized representations on CD-ROM and analog forms on videodisk, but, in June 1994, three collections of photographs were made available on the web.

The National Digital Library Program (NDLP) builds on the success of American Memory. Its objective is to convert millions of items to digital form and make them available over the Internet. The focus is on Americana, materials that are important in American history. Typical themes are Walt Whitman's notebooks, or documents from the Continental Congress and the Constitutional Convention.

Some of the collections that are being converted are coherent archives, such as the papers or the photograph collection of a particular individual or organization. Some are collections of items in a special original form, such as daguerreotypes or paper prints of early films. Others are thematic compilations by curators or scholars, either from within an archival collection or selected across the library's resources.

Access to the collections

American Memory discovered the enthusiasm of school teachers for access to these primary source materials, and the new program emphasizes education as its primary, but certainly not its only, audience. The most comprehensive method of access to the collections is by searching an index. However, many of the archival collections being converted do not have catalog records for each individual item. The collections have finding aids. These are structured documents that describe the collection, and groups of items within it, without providing a description for each individual item.

Hence, access to material in American Memory is a combination of methods, including searching bibliographic records for individual items where such records exist, browsing subject terms, searching full text, and, in the future, searching the finding aids.

Technical considerations

This is an important project technically, because of its scale and visibility. Conscious of the long-term problems of maintaining large collections, the library has placed great emphasis on how it organizes the items within its collections.

The program takes seriously the process of converting these older materials to digital formats, selecting the most appropriate format to represent the content, and putting great emphasis on quality control. Textual material is usually converted twice: to a scanned page image, and an SGML marked-up text. Several images are made from each photograph, ranging from a small thumbnail to a high-resolution image for archival purposes.

Many of the materials selected for conversion are free from copyright or other restrictions on distribution, but others have restrictions. In addition to copyright, other reasons for restrictions include conditions required by donors of the original materials to the library. Characteristic of older materials, especially unpublished items, is that it is frequently impossible to discover all the restrictions that might conceivably apply, and prohibitively expensive to make an exhaustive search for every single item. Therefore the library's legal staff has to develop policies and procedures that balance the value to the nation of making materials available, against the risk of inadvertently infringing some right.

Outreach

A final, but important, aspect of American Memory is that people look to the Library of Congress for leadership. The expertise that the library has developed and its partnerships with leading technical groups permit it to help the entire library community move forward. Thus the library is an active member of several collaborations, has overseen an important grant program, and is becoming a sponsor of digital library research.

Since the driving force in electronic libraries and information services has come from the scientific and technical fields, quite basic needs of other disciplines, such as character sets beyond English, have often been ignored. The humanities have been in danger of being left behind, but a new generation of humanities scholars is embracing computing. Fortunately, they have friends. Panel 3.6 describes JSTOR, a project of the Andrew W. Mellon Foundation which is both saving costs for libraries and bringing important journal literature to a wider audience than would ever be possible without digital libraries.

Panel 3.6. JSTOR

JSTOR is a project that was initiated by the Andrew W. Mellon Foundation to provide academic libraries with back runs of important journals. It combines both academic and economic objectives. The academic objective is to build a reliable archive of important scholarly journals and provide widespread access to them. The economic objective is to save costs to libraries by eliminating the need for every library to store and preserve the same materials.

The JSTOR collections are developed by fields, such as economics, history, and philosophy. The first phase is expected to have about one hundred journals from some fifteen fields. For each journal, the collection consists of a complete run, usually from the first issue until about five years before the current date.

The economic and organizational model

In August 1995, JSTOR was established as an independent not-for-profit organization with a goal to become self-sustaining financially. It aims to do so by charging fees for access to the database to libraries around the world. These fees are set to be less than the comparable costs to the libraries of storing paper copies of the journals.

The organization has three offices. Its administrative, legal and financial activities are managed and coordinated from the main office in New York, as are relationships with publishers and libraries. In addition, staff in offices at the University of Michigan and Princeton University maintain two synchronized copies of the database, maintain and develop JSTOR's technical infrastructure, provide support to users, and oversee the conversion process from paper to computer formats. Actual scanning and keying services are provided by outside vendors. JSTOR has recently established a third database mirror site at the University of Manchester, which supplies access to higher education institutions in the United Kingdom.

JSTOR has straightforward licenses with publishers and with subscribing institutions. By emphasizing back-runs, JSTOR strives not to compete with publishers, whose principal revenues come from current issues. Access to the collections has initially been provided only to academic libraries who subscribe to the entire collection. They pay a fee based on their size. In the best Internet tradition, the fee schedule and the license are available online for anybody to read.

The technical approach

At the heart of the JSTOR collection are scanned images of every page. These are scanned at a high resolution, 600 dots per inch, with particular emphasis on quality control. Unlike some other projects, only one version of each image is stored. Other versions, such as low-resolution thumbnails, are computed when required, but not stored. Optical character recognition with intensive proofreading is used to convert the text. This text is used only for indexing. In addition, a table of contents file is created for each article. This includes bibliographic citation information, with keywords and abstracts if available.
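
Computing derived versions on demand, such as the thumbnails mentioned above, is straightforward with modern imaging software. The fragment below sketches how a thumbnail might be generated from a stored high-resolution scan using the Pillow library for Python; the file names and sizes are hypothetical and do not describe JSTOR's actual software.

     # A sketch of computing a thumbnail on demand rather than storing it.
     from PIL import Image

     def thumbnail_on_demand(scan_path, max_size=(150, 200)):
         image = Image.open(scan_path)     # the stored high-resolution scan
         image.thumbnail(max_size)         # shrink in place, keeping proportions
         return image

     small = thumbnail_on_demand("page-0001.tif")
     small.save("page-0001-thumb.png")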

These two examples, American Memory and JSTOR, are further examples of a theme that has run throughout this chapter. Libraries, by their nature, are conservative organizations. Collections and their catalogs are developed over decades or centuries. New services are introduced cautiously, because they are expected to last for a long time. However, in technical fields, libraries have frequently been adventurous. MARC, OCLC, the Linked Systems Project, Mercury, CORE, and the recent conversion projects may not have invented the technology they used, but the deployment as large-scale, practical systems was pioneering.

Chapter 2 discussed the community that has grown up around the Internet and the web. Many members of this community discovered online information very recently, and act as though digital libraries began in 1993 with the release of Mosaic. As discussed in this chapter, libraries and publishers developed many of the concepts that are the basis for digital libraries, years before the web. Combining the two communities of expertise provides a powerful basis for digital libraries in the future.


