Digital Libraries: Chapter 7 (1999)

Chapter 7
Access management and security

Why control access?

This chapter looks at two related topics: methods for controlling who has access to materials in digital libraries, and techniques of security in networked computing. This book uses the term access management to describe the control of access to digital libraries, but other words are also used. Some people refer to "terms and conditions." In publishing, where the emphasis is usually on generating revenue, the strange expression "rights management" is common. Each phrase has a different emphasis, but they are essentially synonymous.

An obvious reason for controlling access is economic. When publishers expect revenue from their products, they permit access only to users who have paid. It might be thought that access management would be unnecessary except when revenue is involved, but that is not the case; there are other reasons to control access to materials in a digital library. Materials donated to a library may have conditions attached, perhaps tied to external events such as the lifetime of certain individuals. Organizations may have information in their private collections that they wish to keep confidential, such as commercial secrets, police records, and classified government information. The boundaries of art, obscenity, and the invasion of privacy are never easy to draw. Even when access to the collections is provided openly, controls are needed over the processes of adding, changing, and deleting material, both content and metadata. A well-managed digital library will keep a record of all changes, so that the collections can be restored if mistakes are made or computer files are corrupted.

Uncertainty is a fact of life in access management. People from a computing background sometimes assume that every object can be labeled with metadata that lists all the rights, permissions, and other factors relevant to access management. People who come from libraries, and especially those who manage historic collections or archives, know that assembling such information is always time consuming and frequently impossible. Projects, such as the American Memory project at the Library of Congress, convert millions of items from historic collections. For these older materials, a natural assumption is that copyright has expired and there need be no access restrictions, but this is far from true. For published materials, the expiration of copyright is linked to the death of the creator, a date which is often hard to determine, and libraries frequently do not know whether items have been published.

As explained in Chapter 6, many of the laws that govern digital libraries, such as copyright, have fuzzy boundaries. Access management policies that are based on these laws are subject to this fuzziness. As the boundaries become clarified through new laws, treaties, or legal precedents, policies have to be modified accordingly.

Elements of access management

Figure 7.1 shows a framework that is useful for thinking about access management. At the left of this figure, information managers create policies for access. Policies relate users (at the top) to digital material (at the bottom). Authorization, at the center of the figure, specifies the access, at the right. Each of these sections requires elaboration. Policies that the information managers establish must take into account relevant laws, and agreements made with others, such as licenses from copyright holders. Users need to be authenticated and their role in accessing materials established. Digital material in the collections must be identified and its authenticity established. Access is expressed in terms of permitted operations.

Figure 7.1. A framework for access management

When users request access to the collections, each request passes through an access management process. The users are authenticated; authorization procedures grant or refuse them permission to carry out specified operations.

The responsibility for access lies with whoever manages the digital material. The manager may be a library, a publisher, a webmaster, or the creator of the information. Parts of the responsibility may be delegated. If a library controls the materials and makes them available to users, the library sets the policies and implements them, usually guided by external restraints, such as legal restrictions, licenses from publishers, or agreements with donors. If a publisher mounts materials and licenses access, then the publisher is the manager, but may delegate key activities, such as authorization of users, to others.

Users

Authentication

When a user accesses a computer system, a two-step process of identification usually takes place. The first step is authentication, which establishes the identity of the individual user. The second is authorization, which determines what the user is permitted to do. A wide variety of techniques are used to authenticate users; some are simple but easy to circumvent, while others are more secure but complex. The techniques divide into four main categories:

What does the user know? A standard method of authentication is to provide each user with a login name and a password. This is widely used but has weaknesses. Passwords are easily stolen. Most people like to select their own password and often select words that are easy to remember and hence easy to guess, such as personal names, or everyday words.
What does the user possess? Examples of physical devices that are used for authentication include the magnetically encoded cards used by bank teller machines, and digital smart-cards that execute an authentication program. Smart-cards are one of the best systems of authentication; they are highly secure and quite convenient to use.
What does the user have access to? A common form of authentication is the network address of a computer. Anybody who has access to a computer with an approved IP address is authenticated. Data on many personal computers is unprotected except by physical access; anybody who has access to the computer can read the data.
What are the physical characteristics of the user? Authentication by physical attributes such as voice recognition is used in a few esoteric applications, but has had little impact in digital libraries.
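Authentication by network address, the third category above, can be sketched in a few lines. This is a minimal illustration only, not a description of any particular library's system; the address ranges and the function name is_approved_address are hypothetical.

```python
import ipaddress

# Hypothetical approved networks, using address ranges reserved for documentation.
APPROVED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]

def is_approved_address(address: str) -> bool:
    """Return True if the address lies inside any approved network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in APPROVED_NETWORKS)

# A request from inside the first range is accepted; one from outside is not.
print(is_approved_address("192.0.2.17"))
print(is_approved_address("203.0.113.5"))
```

Note that a check of this kind authenticates the computer, not the person: anybody who can use an approved machine passes.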

Roles

Policies for access management rarely specify users by name. They are usually tied to categories of users or the role of a user. An individual user can have many roles. At different times, the same person may use a digital library for teaching, private reading, or to carry out a part-time business. The digital library may have different policies for the same individual in these various roles. Typical roles that may be important include:

Membership of a group. The user is a member of the Institute of Physics. The user is a student at the U.S. Naval Academy.
Location. The user is using a computer in the Carnegie Library of Pittsburgh. The user is in the USA.
Subscription. The user has a current subscription to Journal of the Association for Computing Machinery. The user belongs to a university that has a site license to all JSTOR collections.
Robotic use. The user is an automatic indexing program, such as a Web crawler.
Payment. The user has a credit account with Lexis. The user has paid $10 to access this material.

Most users of digital libraries are people using personal computers, but the user can be a computer with no person associated, such as a program that is indexing web pages or a mirroring program that replicates an entire collection. Some sites explicitly ban access by automatic programs.
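One common way for a site to state which areas automatic programs may visit is a robots.txt file. The sketch below assumes that convention, which the text above does not name, and uses Python's standard urllib.robotparser module; the file contents, the site address, and the crawler name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary digital library site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /fulltext/
Allow: /abstracts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(user_agent: str, url: str) -> bool:
    """Ask whether the stated rules permit this program to fetch the URL."""
    return parser.can_fetch(user_agent, url)

print(may_crawl("ExampleIndexer", "https://library.example.org/abstracts/a1"))
print(may_crawl("ExampleIndexer", "https://library.example.org/fulltext/a1"))
```

Such a file is only a statement of policy. It relies on the cooperation of the program's author and does not technically prevent access.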

Digital material

Identification and authenticity

For access management, digital materials must be clearly identified. Identification associates some name or identifier with each item of material. This is a major topic in both digital libraries and electronic publishing. It is one of the themes of Chapter 12.

Authentication of digital materials assures both users and managers of collections that materials are unaltered. In some contexts this is vital. In one project, we worked with a U.S. government agency to assemble a collection of documents relevant to foreign affairs, such as trade agreements and treaties. With such documents the exact wording is essential; if a document claims to be the text of the North American Free Trade Agreement, the reader must be confident that the text is accurate. A text with wrong wording, whether created maliciously or by error, could cause international problems.

In most digital libraries, the accuracy of the materials is not verified explicitly. Where the level of trust is high and the cost of mistakes is low, no formal authentication of documents is needed. Deliberate alterations are rare and mistakes are usually obvious. In some fields, however, such as medical records, errors are serious. Digital libraries in these areas should seriously consider using formal methods of authenticating materials.

To ensure the accuracy of an object, a digital signature can be associated with it, using techniques described at the end of this chapter. A digital signature ensures that a file or other set of bits has not changed since the signature was calculated. Panel 7.1 describes the use of digital signatures in the U.S. Copyright Office.
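The change-detection half of this idea can be illustrated with a cryptographic digest. The sketch below is not a full digital signature: a signature additionally binds the digest to the signer's private key, a step omitted here because it needs a public-key library. The function names and the sample text are hypothetical.

```python
import hashlib
import hmac

def fingerprint(data: bytes) -> str:
    """Compute a SHA-256 digest of the bits that make up an object."""
    return hashlib.sha256(data).hexdigest()

def is_unaltered(data: bytes, recorded: str) -> bool:
    """Compare the current digest with the one recorded earlier."""
    return hmac.compare_digest(fingerprint(data), recorded)

original = b"Exact wording of a hypothetical agreement."
recorded = fingerprint(original)

print(is_unaltered(original, recorded))
print(is_unaltered(b"Exact wording of a hypothetical agreement!", recorded))
```

A change to even one character produces a different digest, so the second check fails.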

Panel 7.1
Electronic registration and deposit for copyright

The U.S. Copyright Office is a separate department of the Library of Congress. Under United States law, all works published in the U.S. are subject to mandatory deposit and the library is entitled to receive two copies for its collections. While deposit of published works is mandatory, the copyright law since 1978 has not required registration of copyrighted works, though registration is encouraged by conferring significant benefits.

The method of copyright registration is straightforward. The owner of the copyright sends two copies of the work to the Copyright Office, with an application form and a fee. The copyright claim and the work are examined and, after approval, a registration certificate is produced. Published works are transferred to the Library of Congress, which decides whether to retain the works for its collections or use them in its exchange program.

In 1993, the Copyright Office and CNRI began work on a system, known as CORDS (Copyright Office Electronic Registration, Recordation and Deposit System) to register and deposit electronic works. The system mirrors the traditional procedures. The submission consists of a web form and a digital copy of the work, delivered securely over the Internet. The fee is processed separately.

Digital signatures are used to identify claims that are submitted for copyright registration in CORDS. The submitter signs the claim with the work attached, using a private key. The submission includes the claim, the work, the digital signature, the public key, and associated certificates. The digital signature verifies to the Copyright Office that the submission was received correctly and confirms the identity of the submitter. If, at any future date, there is a copyright dispute over the work, the digital signature can be used to authenticate the claim and the registered work.

A group of methods related to the authentication of materials is described by the term watermarking. They are defensive techniques used by publishers to deter and track unauthorized copying. The basic idea is to embed a code into the material in a subtle manner that is not obtrusive to the user, but can be retrieved to establish ownership. A simple example is for a broadcaster to add a corporate logo to a television picture, to identify the source of the picture if it is copied. Digital watermarks can be completely imperceptible to a user, yet almost impossible to remove without trace.

Attributes of digital material

Access management policies frequently treat different material in varying ways, depending upon properties or attributes of the material. These attributes can be encoded as administrative metadata and stored with the object, or they can be derived from some other source. Some attributes can also be computed. Thus the size of an object can be measured when required. Here are some typical examples:

Division into sub-collections. Collections often divide material into items for public access and items with restricted access. Publishers may separate the full text of articles from indexes, abstracts and promotional materials. Web sites have public areas, and private areas for use within an organization.
Licensing and other external commitments. A digital library may have material that is licensed from a publisher or acquired subject to terms and conditions that govern access, such as materials that the Library of Congress receives through copyright deposit.
Physical, temporal, and similar properties. Digital libraries may have policies that depend upon the time since the date of publication or physical properties, such as the size of the material. Some newspapers provide open access to selected articles when they are published while requiring licenses for the same articles later.
Media types. A digital library may have access policies that depend upon format or media type, for example treating digitized sound differently from textual material, or computer programs from images.

Attributes need to be assigned at varying granularity. If all the materials in a collection have the same property, then it is convenient to assign the attribute to the collection as a whole. At the other extreme, there are times when parts of objects may have specific properties. The rights associated with images are often different from those associated with the text in which they are embedded and will have to be distinguished. A donor may donate a collection of letters to a library for public access, but request that access to certain private correspondence be restricted. Digital libraries need to offer flexibility, so that attributes can be associated with entire collections, sub-collections, individual library objects, or elements of individual objects.
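One way to picture attributes at varying granularity is a lookup that starts at the most specific level and falls back to the enclosing levels. This is a sketch under assumed names; the identifiers, the ATTRIBUTES store, and the resolve function are all hypothetical.

```python
from typing import Optional, Sequence

# Hypothetical attribute store. Each level records only the attributes set at that level.
ATTRIBUTES = {
    "collection:letters": {"access": "public"},
    "object:letters/box1/letter7": {},
    "element:letters/box1/letter7/page2": {"access": "restricted"},
}

def resolve(attribute: str, path: Sequence[str]) -> Optional[str]:
    """Return the first value found, walking from most specific to most general."""
    for identifier in path:
        value = ATTRIBUTES.get(identifier, {}).get(attribute)
        if value is not None:
            return value
    return None

# The restricted page overrides the collection default; the letter itself inherits it.
page = ["element:letters/box1/letter7/page2", "object:letters/box1/letter7", "collection:letters"]
letter = ["object:letters/box1/letter7", "collection:letters"]
print(resolve("access", page))
print(resolve("access", letter))
```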

Operations

Access management policies often specify or restrict the operations and the various actions that a user is authorized to carry out on library materials. Some of the common categories of operation include:

Computing actions. Some operations are defined in computing terms, such as to write data to a computer, execute a program, transmit data across a network, display on a computer screen, print, or copy from one computer to another.
Extent of use. A user may be authorized to extract individual items from a database, but not copy the entire database.

These operations can be controlled by technical means, but many policies that an information manager might state are essentially impossible to enforce technically. They include:

Business or purpose. Authorization of a user might refer to the reason for carrying out an operation. Examples include commercial, educational, or government use.
Intellectual operations. Operations may specify the intellectual use to be made of an item. The most important are the rules that govern the creation of a new work that is derived from the content of another. The criteria may need to consider both the intent and the extent of use.

Subsequent use

Systems for access management have to consider both direct operations and subsequent use of material. Direct operations are actions initiated by a repository, or another computer system that acts as an agent for the managers of the collection. Subsequent use covers all those operations that can occur once material leaves the control of the digital library. It includes all the various ways that a copy can be made, from replicating computer files to photocopying paper documents. Intellectually, it can include everything from extracting short sections, through the creation of derivative works, to outright plagiarism.

When an item, or part of an item, has been transmitted to a personal computer, it is technically difficult to prevent a user from copying what is received, storing it, and distributing it to others. This is comparable to photocopying a document. If the information is for sale, the potential for such subsequent use to reduce revenue is clear. Publishers naturally have concerns about readers distributing unauthorized copies of materials. At an extreme, if a publisher sells one copy of an item that is subsequently widely distributed over the Internet, the publisher might end up selling only that one copy. As a partial response to this fear, digital libraries are often designed to allow readers access to individual records, but do not provide any way to copy complete collections. While this does not prevent a small loss of revenue, it is a barrier against anybody undermining the economic interests of the publisher by wholesale copying.
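A barrier against wholesale copying can be approximated by limiting how much of a collection any one user may retrieve. The sketch below is one possible approach, not a method described in this chapter; the class name, the fraction, and the collection size are hypothetical.

```python
from collections import defaultdict

class ExtentOfUseLimiter:
    """Allow retrieval of individual records but refuse wholesale copying."""

    def __init__(self, max_fraction: float, collection_size: int):
        # Largest number of distinct records any one user may retrieve.
        self.limit = int(max_fraction * collection_size)
        self.seen = defaultdict(set)

    def allow(self, user: str, record_id: str) -> bool:
        records = self.seen[user]
        if record_id in records:
            return True  # re-reading a record already retrieved adds nothing
        if len(records) >= self.limit:
            return False
        records.add(record_id)
        return True

limiter = ExtentOfUseLimiter(max_fraction=0.5, collection_size=4)
for record in ["r1", "r2", "r3"]:
    print(record, limiter.allow("reader", record))
```

A limit of this kind is easy to evade by several users acting together, which is consistent with the point that it is a barrier and not a guarantee.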

Policies

The final branch of Figure 7.1 is the unifying concept of a policy, on the left-hand side of the figure. An informal definition of a policy is that it is a rule, made by information managers, that states who is authorized to do what to which material. Typical policies in digital libraries are:

A publication might have the policy of open access. Anybody may read the material, but only the editorial staff may change it.
A publisher with journals online may have the policy that only subscribers have access to all materials. Other people can read the contents pages and abstracts, but have access to the full content only if they pay a fee per use.
A government organization might classify materials, e.g., "top secret", and have strict policies about who has access to the materials, under what conditions, and what they can do with them.

Policies are rarely as simple as in these examples. For example, while D-Lib Magazine has a policy of open access, the authors of the individual articles own the copyright. The access policy is that everybody is encouraged to read the articles and print copies for private use, but some subsequent use, such as creating a derivative work, or selling copies for profit requires permission from the copyright owner. Simple policies can sometimes be represented as a table in which each row relates a user role, attributes of digital material, and certain operations.
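The table representation mentioned above might look like the following sketch, in which each row relates a user role, an attribute of the material, and a set of permitted operations. The role names, material categories, and operations are hypothetical, chosen to echo the journal publisher example; requests that match no row are refused.

```python
from typing import NamedTuple

class PolicyRow(NamedTuple):
    role: str              # user role; "anyone" matches every user
    material: str          # attribute of the digital material
    operations: frozenset  # operations permitted by this row

# Hypothetical policy table for a journal publisher.
POLICY_TABLE = [
    PolicyRow("anyone", "abstract", frozenset({"read"})),
    PolicyRow("subscriber", "full_text", frozenset({"read", "print"})),
    PolicyRow("editorial_staff", "full_text", frozenset({"read", "print", "change"})),
]

def is_authorized(roles: set, material: str, operation: str) -> bool:
    """Grant the request if any row covers the role, material, and operation."""
    for row in POLICY_TABLE:
        role_matches = row.role == "anyone" or row.role in roles
        if role_matches and row.material == material and operation in row.operations:
            return True
    return False  # no matching row: refuse by default

print(is_authorized({"subscriber"}, "full_text", "print"))
print(is_authorized(set(), "full_text", "read"))
```

A table like this captures only the simple cases. Policies that depend on purpose or intent, such as restrictions on derivative works, do not reduce to rows of this kind.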

Because access management policies can be complex, a formal method is needed to express them, which can be used for exchange of information among computer systems. Perhaps the most comprehensive work in this area has been carried out by Mark Stefik of Xerox. The Digital Property Rights Language, which he developed, is a language for expressing the rights, conditions, and fees for using digital works. The purpose of the language is to specify attributes of material and policies for access, including subsequent use. The manager of a collection can specify terms and conditions for copying, transferring, rendering, printing, and similar operations. The language allows fees to be specified for any operation, and it envisages links to electronic payment mechanisms. The notation used by the language is based on Lisp, a language used for natural language processing. Some people have suggested that, for digital libraries, a more convenient notation would use XML, which would be a straightforward transformation. The real test of this language is how effective it proves to be when used in large-scale applications.

Enforcing access management policies

Access management is not simply a question of developing appropriate policies. Information managers want the policies to be followed, which requires some form of enforcement.

Some policies can be enforced technically. Others are more difficult. There are straightforward technical methods to enforce a policy of who is permitted to change material in a collection, or search a repository. There are no technical means to enforce a policy against plagiarism, or invasion of privacy, or to guarantee that all use is educational. Each of these is a reasonable policy that is extremely difficult to enforce by technical means. Managing such policies is fundamentally social.

There are trade-offs between strictness of enforcement and convenience to users. Technical methods of enforcing policies can be annoying. Few people object to typing in a password when they begin a session, but nobody wants to be asked repeatedly for passwords, or other identification. Information managers will sometimes decide to be relaxed about enforcing policies in the interests of satisfying users. Satisfied customers will help grow the size of the market, even if some revenue is lost from unauthorized users. The publishers who are least aggressive about enforcement keep their customers happy and often generate the most total revenue. As discussed in Panel 7.2, this is the strategy now used for most personal computer software. Data from publishers such as HighWire Press is beginning to suggest the same result with electronic journal publishing.

If technical methods are relaxed, social and legal pressures can be effective. The social objective is to educate users about the policies that apply to the collections, and coax or persuade people to follow them. This requires policies that are simple to understand and easy for users to follow. Users must be informed of the policies and educated as to what constitutes reasonable behavior. One useful tool is to display an access statement when the material is accessed; this is text that states some policy. An example is, "For copyright reasons, this material should not be used for commercial purposes." Other non-technical methods of enforcement are more assertive. If members of an organization repeatedly violate a licensing agreement or abuse policies that they should respect, a publisher can revoke a license. In extreme cases, a single, well-publicized legal action will persuade many others to behave responsibly.

Panel 7.2
Access management policies for computer software

Early experience with software for personal computers provides an example of what happens when attempts to enforce policies are unpleasant for the users.

Software is usually licensed for a single computer. The license fee covers use on one computer only, but software is easy to copy. Manufacturers lose revenue from unlicensed copying, particularly if unlicensed copies become widely distributed.

In the early days of personal computers, software manufacturers attempted to control unlicensed copying by technical means. One approach was to supply their products on disks that could not be copied easily. This was called copy-protection. Every time that a program was launched, the user had to insert the original disk. This had a powerful effect on the market, but not what the manufacturers had hoped. The problem was that it became awkward for legitimate customers to use the software that they had purchased. Hard disk installations were awkward and back-up difficult. Users objected to the inconvenience. Those software suppliers who were most assertive about protection lost sales to competitors who supplied software without copy-protection.

Microsoft was one of the companies that realized that technical enforcement is not the only option. The company has become extremely rich by selling products that are not technically protected against copying. Instead, Microsoft has worked hard to stimulate adherence to its policies by non-technical methods. Marketing incentives, such as customer support and low-cost upgrades, encourage customers to pay for licenses. Social pressures are used to educate people and legal methods are used to frighten the worst offenders.

Unlicensed copying still costs the software manufacturers money, but, by concentrating on satisfying their responsible customers, the companies are able to thrive.

Access management at a repository

Most digital libraries implement policies at the repository or collection level. Although there are variations in the details, the methods all follow the outline in Figure 7.1. Digital libraries are distributed computer systems, in which information is passed from one computer to another. If access management is only at the repository, access is effectively controlled locally, but once material leaves the repository problems multiply.

The issue of subsequent use has already been introduced; once the user's computer receives information it is hard for the original manager of the digital library to retain effective control, without obstructing the legitimate user. With networks, there is a further problem. Numerous copies of the material are made in networked computers, including caches, mirrors, and other servers, beyond the control of the local repository.

To date, most digital libraries have been satisfied to provide access management at the repository, while relying on social and legal pressure to control subsequent use. Usually this is adequate, but some publishers are concerned that the lack of control could damage their revenues. Therefore, there is interest in technical methods that control copying and subsequent use, even after the material has left the repository. The methods fall into two categories: trusted systems and secure containers.

Trusted systems

A repository is an example of a trusted system. The managers of a digital library have confidence that the hardware, software, and administrative procedures provide an adequate level of security to store and provide access to valuable information. There may be other systems, linked to the repository, that are equally trusted. Within such a network of trusted systems, digital libraries can use methods of enforcement that are simple extensions of those used for single repositories. Attributes and policies can be passed among systems, with confidence that they will be processed effectively.

Implementing networks of trusted systems is not easy. The individual system components must support a high level of security and so must the processes by which information is passed among the various computers. For these reasons, trusted systems are typically used in restricted situations only or on special purpose computers. If all the computers are operated by the same team or by teams working under strict rules, many of the administrative problems diminish. An example of a large, trusted system is the network of computers that support automatic teller machines in banks.

No assumptions can be made about users' personal computers and how they are managed. In fact, it is reasonable not to trust them. For this reason, early applications of trusted systems in digital libraries are likely to be restricted to special purpose hardware, such as smart cards or secure printers, or dedicated servers running tightly controlled software.

Secure containers

Since networks are not secure and trusted systems are difficult to implement, several groups are developing secure containers for transmitting information across the Internet. Digital material is delivered to the user in a package that contains data and metadata about access policies. Some or all of the information in the package is encrypted. To access the information requires a digital key, which might be received from an electronic payment system or other method of authentication. An advantage of this approach is that it provides some control over subsequent use. The package can be copied and distributed to third parties, but the contents cannot be accessed without the key. Panel 7.3 describes one such system, IBM's Cryptolopes.
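The general shape of a secure container, with descriptive metadata in clear text and content that cannot be read without a key, can be sketched as follows. This is an illustration of the structure only and is not how any real product works. In particular, the cipher is a toy built from SHA-256 so that the example needs only the Python standard library; a real system would use an established, vetted encryption library. All names are hypothetical.

```python
import hashlib
import hmac
import secrets

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (illustration only): SHA-256 over key, nonce, and a counter."""
    blocks = []
    produced = 0
    counter = 0
    while produced < length:
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        blocks.append(block)
        produced += len(block)
        counter += 1
    return b"".join(blocks)[:length]

def seal(content: bytes, abstract: dict, key: bytes) -> dict:
    """Package clear-text metadata together with encrypted content."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(content))
    ciphertext = bytes(a ^ b for a, b in zip(content, stream))
    tag = hmac.new(key, nonce + ciphertext, hashlib.sha256).hexdigest()
    return {"abstract": abstract, "nonce": nonce.hex(),
            "ciphertext": ciphertext.hex(), "tag": tag}

def open_container(package: dict, key: bytes) -> bytes:
    """Recover the content, refusing if the key is wrong or the package was altered."""
    nonce = bytes.fromhex(package["nonce"])
    ciphertext = bytes.fromhex(package["ciphertext"])
    expected = hmac.new(key, nonce + ciphertext, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, package["tag"]):
        raise PermissionError("wrong key or altered package")
    stream = _keystream(key, nonce, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))

key = secrets.token_bytes(32)
package = seal(b"Full text of the article.", {"title": "Sample", "price": "10.00"}, key)
print(package["abstract"])           # readable by anyone who holds the package
print(open_container(package, key))  # readable only with the key
```

The package can be copied freely, since the abstract is the only part that is intelligible without the key.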

Panel 7.3. Cryptolopes

IBM's Cryptolope system is an example of how secure containers can be used. Cryptolopes are designed to let Internet users buy and sell content securely over the Internet. The figure below gives an idea of the structure of information in a Cryptolope.

Figure 7.2. The structure of a Cryptolope

Information is transmitted in a secure cryptographic envelope, called a Cryptolope container. Information suppliers seal their information in the Cryptolope container. It can be opened by recipients only after they have satisfied any access management requirements, such as paying for use of the information. The content is never separated from the access management and payment information in the envelope. Thus, the envelope can later be passed on to others, who also must pay for usage if they want to open it; each user must obtain the code to open the envelope.

In addition to the encrypted content, Cryptolope containers can include subfiles in clear text to provide users with a description of the product. The abstract might include the source, summary, author, last update, size, price, and terms of sale. Once the user has decided to open the contents of a Cryptolope container, a digital key is issued unlocking the material contained within. To view a free item, the user clicks on the abstract and the information appears on the desktop. To view priced content, the user agrees to the terms of the Cryptolope container as stated in the abstract.

The content in a Cryptolope container can be dynamic. The system has the potential to wrap JavaScripts, Java programs, and other live content into secure containers. In the interest of standardization, IBM has licensed Xerox's Digital Property Rights Language for specifying the rules governing the use and pricing of content.

Secure containers face a barrier to acceptance. They are of no value to a user unless the user can acquire the necessary cryptographic keys to unlock them and make use of the content. This requires widespread deployment of security services and methods of electronic payment. Until recently, the spread of such services has been rather slow, so that publishers have had little market for information delivered via secure containers.

Security of digital libraries

The remainder of this chapter looks at some of the basic methods of security that are used in networked computer systems. These are general purpose methods with applications far beyond digital libraries, but digital libraries bring special problems because of the highly decentralized networks of suppliers and users of information.

Security begins with the system administrators, the people who install and manage the computers and the networks that connect them. Their honesty must be above suspicion, since they have privileges that provide access to the internals of the system. Good systems administrators will organize networks and file systems so that users have access to appropriate information. They will manage passwords, install firewalls to isolate sections of the networks, and run diagnostic programs to search for problems. They will back up information, so that the system can be rebuilt after a major incident, whether it is an equipment breakdown, a fire, or a security violation.

The Internet is basically not secure. People can tap into it and observe the packets of information traveling over the network. This is often done for legitimate purposes, such as trouble-shooting, but it can also be done for less honest reasons. The general security problem can be described as how to build secure applications across this insecure network.

Since the Internet is not secure, security in digital libraries begins with the individual computers that constitute the library and the data on them, paying special attention to the interfaces between computers and local networks. For many personal computers, the only method of security is physical restrictions on who uses the computer. Other computers have some form of software protection, usually a simple login name and password. When computers are shared by many users, controls are needed to determine who may read or write to each file.

The next step of protection is to control the interface between local networks and the broader Internet, and to provide some barrier to intruders from outside. The most complete barrier is isolation, having no external network connections. A more useful approach is to connect the internal network to the Internet through a special purpose computer called a firewall. The purpose of a firewall is to screen every packet that attempts to pass through and to refuse those that might cause problems. Firewalls can refuse attempts from outside to connect to computers within the organization, or reject packets that are not formatted according to a list of approved protocols. Well-managed firewalls can be quite effective in blocking intruders.
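The two kinds of screening just described can be expressed as a small rule function. This is a simplified sketch; real firewalls examine much more than this, and the internal address range, the approved protocol list, and the Packet fields are hypothetical.

```python
import ipaddress
from dataclasses import dataclass

INTERNAL_NETWORK = ipaddress.ip_network("10.0.0.0/8")  # hypothetical internal range
APPROVED_PROTOCOLS = {"http", "smtp"}                   # hypothetical approved list

@dataclass
class Packet:
    source: str
    destination: str
    protocol: str
    opens_connection: bool  # True if the packet tries to start a new connection

def screen(packet: Packet) -> bool:
    """Return True to pass the packet through, False to refuse it."""
    # Reject anything that does not use an approved protocol.
    if packet.protocol not in APPROVED_PROTOCOLS:
        return False
    # Refuse attempts from outside to connect to computers inside.
    from_outside = ipaddress.ip_address(packet.source) not in INTERNAL_NETWORK
    to_inside = ipaddress.ip_address(packet.destination) in INTERNAL_NETWORK
    if from_outside and to_inside and packet.opens_connection:
        return False
    return True

print(screen(Packet("203.0.113.9", "10.1.2.3", "http", True)))
print(screen(Packet("10.1.2.3", "203.0.113.9", "http", True)))
```

A rule set this strict would also block a public web server inside the organization; in practice an exception would be needed for such machines.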

Managers of digital libraries need to have a balanced attitude to security. Absolute security is impossible, but moderate security can be built into networked computer systems, without excessive cost, though it requires thought and attention. Universities have been at the heart of networked computing for many years. Despite their polyglot communities of users, they have succeeded in establishing adequate security for campus networks with thousands of computers. Incidents of abusive, anti-social, or malicious behavior occur on every campus, yet major problems are rare.

With careful administration, computers connected to a network can be made reasonably secure, but that security is not perfect. There are many ways that an ill-natured person can attempt to violate security. In universities, most problems come from insiders: disgruntled employees or students who steal a user's login name and password. More sophisticated methods of intrusion take advantage of the complexity of computer software. Every operating system has built-in security, but design errors or programming bugs may have created gaps. Some of the most useful programs for digital libraries, such as web servers and electronic mail, are some of the most difficult to secure. For these reasons, everybody who builds a digital library must recognize that security can never be guaranteed. With diligence, troubles can be kept rare, but there is always a chance of a flaw.

Encryption

Encryption is the name given to a group of techniques that are used to store and transmit private information, encoding it in a way that the information appears completely random until the procedure is reversed. Even if the encrypted information is read by somebody who is unauthorized, no damage is done. In digital libraries, encryption is used to transmit confidential information over the Internet, and some information is so confidential that it is encrypted wherever it is stored. Passwords are an obvious example of information that should always be encrypted, whether stored on computers or transmitted over networks. In many digital libraries, passwords are the only information that needs to be encrypted.

Figure 7.3. Encryption and decryption

The basic concept of encryption is shown in Figure 7.3. The data that is to be kept secret, X, is input to an encryption process which performs a mathematical transformation and creates an encrypted set of data, Y. The encrypted set of data will have the same number of bits as the original data. It appears to be a random collection of bits, but the process can be reversed, using a reverse process which regenerates the original data, X. These two processes, encryption and decryption, can be implemented as computer programs, in software or using special purpose hardware.
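The round trip from X to Y and back can be illustrated with a toy program. This sketch is an assumption made for illustration and is not a method discussed in this chapter: it stretches a key into a stream of bytes with a hash function and combines that stream with the data using exclusive-or. It uses the same key in both directions and should not be used to protect real data.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Stretch the key into `length` pseudo-random bytes (toy construction)."""
    stream = b""
    counter = 0
    while len(stream) < length:
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return stream[:length]


def encrypt(x: bytes, key: bytes) -> bytes:
    """Transform X into Y by XOR-ing each byte with the keystream."""
    return bytes(a ^ b for a, b in zip(x, keystream(key, len(x))))


def decrypt(y: bytes, key: bytes) -> bytes:
    """XOR with the same keystream reverses the transformation."""
    return encrypt(y, key)


x = b"confidential catalog record"
y = encrypt(x, b"hypothetical key")
print(len(x) == len(y))                        # Y has as many bits as X
print(decrypt(y, b"hypothetical key") == x)    # the process is reversible
```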

The commonly used methods of encryption are controlled by a pair of numbers, known as keys. One key is used for encryption, the other for decryption. The methods of encryption vary in the choice of processes and in the way the keys are selected. The mathematical form of the processes is not secret. The security lies in the keys. A key is a string of bits, typically from 40 to 120 bits or more. Long keys are intrinsically much more secure than short keys, since any attempt to violate security by guessing keys is twice as difficult for every bit added to the key length.
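The effect of key length can be worked out directly. In the sketch below, the guessing rate of one billion keys per second is an arbitrary assumption made for the arithmetic; only the doubling with each extra bit comes from the discussion above.

```python
GUESSES_PER_SECOND = 1_000_000_000   # assumed rate, for illustration only
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def keys_to_try(bits: int) -> int:
    """Number of distinct keys of the given length in bits."""
    return 2 ** bits


def years_to_try_all(bits: int) -> float:
    """Time to try every key at the assumed guessing rate."""
    return keys_to_try(bits) / GUESSES_PER_SECOND / SECONDS_PER_YEAR


for bits in (40, 56, 120):
    print(f"{bits:>3} bits: {keys_to_try(bits):.2e} keys, "
          f"{years_to_try_all(bits):.2e} years to try them all")
```

On average a guesser succeeds after trying half the keys, so the expected effort is half the figure printed, but the doubling with each added bit is unchanged.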

Historically, the use of encryption has been restricted by computer power. The methods all require considerable computation to scramble and unscramble data. Early implementations of DES, the method described in Panel 7.4, required special hardware to be added to every computer. With today's fast computers, this is much less of a problem, but the time to encrypt and decrypt large amounts of data is still noticeable. The methods are excellent for encrypting short messages, such as passwords, or occasional highly confidential messages, but the methods are less suitable for large amounts of data where response times are important.

Private key encryption

Private key encryption is a family of methods in which the key used to encrypt the data and the key used to decrypt the data are the same, and must be kept secret. Private key encryption is also known as single key or secret key encryption. Panel 7.4 describes DES, one of the most commonly used methods.

Panel 7.4
The Data Encryption Standard (DES)

The Data Encryption Standard (DES) is a method of private key encryption originally developed by IBM. It has been a U.S. standard since 1977. The calculations used by DES are fairly slow when implemented in software, but the method is fast enough for many applications. A modern personal computer can encrypt about one million bytes per second.

DES uses keys that are 56 bits long. It divides a set of data into 64 bit blocks and encrypts each of them separately. From the 56 bit key, 16 smaller keys are generated. The heart of the DES algorithm is 16 successive transformations of the 64 bit block, using these smaller keys in succession. Decryption uses the same 16 smaller keys to carry out the reverse transformations in the opposite order. This sounds like a simple algorithm, but there are subtleties. Most importantly, the bit patterns generated by encryption appear to be totally random, with no clues about the data or the key.
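The overall shape of this algorithm, 16 rounds applied to a 64 bit block with the smaller keys reversed for decryption, can be sketched as a toy program. This is not DES: the way the smaller keys are derived and the mixing function used in each round are stand-ins built from a hash function, assumed purely for illustration. Only the round structure, known as a Feistel network, follows the design of DES.

```python
import hashlib

ROUNDS = 16
MASK32 = 0xFFFFFFFF


def subkeys(key: bytes) -> list[bytes]:
    """Derive 16 smaller keys from the main key (not the real DES schedule)."""
    return [hashlib.sha256(key + bytes([i])).digest()[:6] for i in range(ROUNDS)]


def round_function(half: int, subkey: bytes) -> int:
    """Mix a 32-bit half with a round key (stand-in for the DES function)."""
    digest = hashlib.sha256(half.to_bytes(4, "big") + subkey).digest()
    return int.from_bytes(digest[:4], "big")


def feistel(block: int, keys: list[bytes]) -> int:
    """Pass one 64-bit block through the rounds, using `keys` in order."""
    left, right = block >> 32, block & MASK32
    for k in keys:
        left, right = right, left ^ round_function(right, k)
    # Swap the halves at the end so that the same routine also decrypts.
    return (right << 32) | left


def encrypt_block(block: int, key: bytes) -> int:
    return feistel(block, subkeys(key))


def decrypt_block(block: int, key: bytes) -> int:
    # The same smaller keys, used in the opposite order.
    return feistel(block, subkeys(key)[::-1])


key = b"7 bytes"                      # 56 bits, hypothetical value
block = 0x0123456789ABCDEF
print(decrypt_block(encrypt_block(block, key), key) == block)
```

Because each round only exchanges the halves and applies an exclusive-or, the round function itself never has to be inverted, which is why decryption can reuse the same code.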

Fanatics argue that DES with its 56 bit keys can be broken simply by trying every conceivable key, but this is a huge task. For digital library applications it is perfectly adequate.

Private key encryption is only as secure as the procedures that are used to keep the key secret. If one computer wants to send encrypted data to a remote computer it must find a completely secure way to get the key to the remote computer. Thus private key encryption is most widely used in applications where trusted services are exchanging information.

Dual key encryption

When using private key encryption over a network, the sending computer and the destination must both know the key. This poses the problem of how to get started if one computer cannot pass a key secretly to another. Dual key encryption permits all information to be transmitted over a network, including the public keys, which can be transmitted completely openly. For this reason, it has the alternate name of public key encryption. Even if every message is intercepted, the encrypted information is still kept secret.

The RSA method is the best known method of dual key encryption. It requires a pair of keys. The first key is made public; the second is kept secret. If an individual, A, wishes to send encrypted data to a second individual, B, then the data is encrypted using the public key of B. When B receives the data it can be decrypted, using the private key, which only B knows.
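The exchange between A and B can be followed with a toy example of the RSA arithmetic. The primes, the public exponent, and the message below are tiny values assumed for illustration; real keys are hundreds of digits long, and practical systems add padding steps that are omitted here.

```python
# B generates a key pair. Tiny primes, for illustration only.
p, q = 61, 53
n = p * q                    # modulus, shared by both keys
phi = (p - 1) * (q - 1)
e = 17                       # public exponent
d = pow(e, -1, phi)          # private exponent: modular inverse (Python 3.8+)

public_key_b = (e, n)        # published openly
private_key_b = (d, n)       # known only to B


def rsa_encrypt(message: int, public_key: tuple) -> int:
    """A encrypts with B's public key."""
    exponent, modulus = public_key
    return pow(message, exponent, modulus)


def rsa_decrypt(ciphertext: int, private_key: tuple) -> int:
    """B decrypts with the private key."""
    exponent, modulus = private_key
    return pow(ciphertext, exponent, modulus)


message = 65                 # must be smaller than n
ciphertext = rsa_encrypt(message, public_key_b)
recovered = rsa_decrypt(ciphertext, private_key_b)
print(recovered == message)
```

An eavesdropper sees the public key and the ciphertext, but recovering the message requires the private exponent, which for keys of realistic size cannot feasibly be derived from the public key.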

This dual key system of encryption has many advantages and one major problem. The problem is to make sure that a key is genuinely the public key of a specific individual. The normal approach is to have all keys generated and authenticated by a trusted authority, called a certification authority. The certification authority generates certificates, which are signed messages specifying an individual and a public key. This works well, so long as security at the certification authority is never violated.

Digital signatures

Digital signatures are used to check that a computer file has not been altered. Digital signatures are based on the concept of a hash function. A hash is a mathematical function that can be applied to the bytes of a computer file to generate a fixed-length number. One commonly used hash function is called MD5. The MD5 function can be applied to any length computer file. It carries out a special transformation on the bits of the file and ends up with an apparently random 128 bits.

If two files differ by as little as one bit, their MD5 hashes will be completely different. Conversely, if two files have the same hash, there is an infinitesimal probability that they are not identical. Thus a simple test for whether a file has been altered is to calculate the MD5 hash when the file is created; at a later time, to check that no changes have taken place, recalculate the hash and compare it with the original. If the two are the same then the files are almost certainly the same.
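This test can be sketched with the hash functions in Python's standard hashlib module. The sample data is hypothetical, and for brevity the sketch hashes bytes held in memory rather than reading a file.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the 128-bit MD5 hash of the data as 32 hexadecimal digits."""
    return hashlib.md5(data).hexdigest()


# When the file is created: calculate the hash and store it.
original = b"Digital libraries: access management and security"
stored_hash = md5_hex(original)

# Later: recalculate the hash and compare it with the stored value.
unchanged = md5_hex(original) == stored_hash

# Flip a single bit in the first byte and test again.
altered = bytes([original[0] ^ 0x01]) + original[1:]
detected = md5_hex(altered) != stored_hash

print(unchanged, detected)
```

Substituting another hash function, such as hashlib.sha256, changes only the one line that computes the hash.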

The MD5 function has many strengths, including being fast to compute on large files, but, as with any security device, there is always a possibility that some bright person may discover how to reverse engineer the hash function, and find a way to create a file that has a specific hash value. At the time that this book was being written there were hints that MD5 may be vulnerable in this way. If so, other hash functions are available.

A hash value gives no information about who calculated it. A digital signature goes one step further towards guaranteeing the authenticity of a library object. When the hash value is calculated, it is encrypted using the private key of the owner of the material. This encrypted hash, together with the public key and a certificate from the certification authority, creates a digital signature. Before checking the hash value, the digital signature is decrypted using the public key. If the hash results match, then the material is unaltered and it is known that the digital signature was generated using the corresponding private key.
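The signing and checking steps can be sketched by combining a hash with the RSA arithmetic. Everything specific here is an assumption made for illustration: the key pair is built from two well-known Mersenne primes and is far too weak for real use, the sample data is hypothetical, and the certificate that ties the public key to its owner is left out.

```python
import hashlib

# Toy key pair for the owner of the material. Insecure, illustration only.
P = 2 ** 61 - 1
Q = 2 ** 89 - 1
N = P * Q                              # modulus, part of both keys
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (Python 3.8+)


def file_hash(data: bytes) -> int:
    """MD5 hash of the data as an integer (128 bits, smaller than N)."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")


def sign(data: bytes) -> int:
    """Owner: encrypt the hash value with the private key."""
    return pow(file_hash(data), D, N)


def verify(data: bytes, signature: int) -> bool:
    """Reader: decrypt the signature with the public key, compare hashes."""
    return pow(signature, E, N) == file_hash(data)


document = b"library object"
signature = sign(document)
print(verify(document, signature))          # unaltered material
print(verify(document + b"!", signature))   # altered material
```

The second check fails because the altered material hashes to a different value from the one the owner encrypted.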

Digital signatures have a problem. While users of a digital library want to be confident that material is unaltered, they are not concerned with bits; their interest lies in the content. For example, the Copyright Office pays great attention to the intellectual content, such as the words in text, but does not care that a computer system may have attached some control information to a file, or that the font used for the text has been changed. Yet the test of a digital signature fails completely when one bit is changed. As yet, nobody has suggested an effective way to ensure authenticity of content, rather than bits.

Deployment of public key encryption

Since the basic mathematics of public key encryption are now almost twenty years old, it might be expected that products based on the methods would have been widely deployed for many years. Sadly this is not the case.

One reason for delay is that there are significant technical issues. Many of them concern the management of the keys, how they are generated, how private keys are stored, and what precautions can be taken if the agency that is creating the keys has a security break-in. However, the main problems are policy problems.

Patents are part of the difficulty. Chapter 6 discussed the problems that surround software patents. Public key encryption is one of the few areas where most computer scientists would agree that there were real inventions. These methods are not obvious and their inventors deserve the rewards that go with invention. Unfortunately, the patent holders and their agents have followed narrow licensing policies, which have restricted the creative research that typically builds on a breakthrough invention.

A more serious problem has been interference from U.S. government departments. Agencies such as the CIA claim that encryption technology is a vital military secret and that exporting it would jeopardize the security of the United States. Police forces claim that public safety depends upon their ability to intercept and read any messages on the networks, when authorized by an appropriate warrant. The export argument is hard to defend when the methods are widely published overseas, and reputable companies in Europe and Japan are building products that incorporate them. The public safety argument is more complicated, but it is undercut by the simple fact that the American public does not trust the agencies, neither their technical competence nor their administrative procedures. People want the ability to transmit confidential information without being monitored by the police.

The result of these policy problems has been to delay the deployment of the tools that are needed to build secure applications over the Internet. Progress is being made and, in a few years' time, we may be able to report success. At present it is a sorry story.

It is appropriate that this chapter ends with a topic in which the technical solution is held up by policy difficulties. This echoes a theme that recurs throughout digital libraries and is especially important in access management. People, technology, and administrative procedures are intimately linked. Successful digital libraries combine aspects of all three and do not rely solely on technology to solve human problems.

Last revision of content: January 1999
Formatted for the Web: December 2002
(c) Copyright The MIT Press 2000