[Chapter 8] 8.6 World Wide Web (WWW) and HTTP

The World Wide Web (WWW) is a major factor behind the recent explosive growth of the Internet. Since the introduction in 1993 of the NCSA Mosaic package (the first graphical user interface to the WWW to gain widespread acceptance), WWW traffic on the Internet has been growing far faster than any other kind of traffic (e.g., SMTP email, FTP file transfers, and Telnet remote terminal sessions). You will certainly want to let your users use a browser to access WWW sites, and you will very likely want to run a site yourself if you do anything that might benefit from publicity.

Most WWW browsers are capable of using protocols other than HTTP, which is the basic protocol of the Web. For example, these browsers are usually also Gopher and FTP clients, or are capable of using your existing Telnet and FTP clients transparently (that is, the user does not see that an external program is being started). Many of them are also NNTP, SMTP, and Archie clients. They use a single, consistent notation called a Uniform Resource Locator (URL) (see sidebar) to specify connections of various types.

The general form of URLs is service://host/path/to/file/or/page.

There are several possible values for the service field, including "ftp", "http", "gopher", and "telnet".

The host part generally specifies the hostname or IP address of the host to connect to. If the service to be accessed is on a nonstandard port on that host (for example, if it's an HTTP server on something other than port 80), the host field can be optionally specified as "host:port" (e.g., http://www.somewhere.com:8080 would refer to an HTTP server on port 8080 on machine www.somewhere.com).

The path field (what comes after the "/" following the host part) varies by type of URL, but generally that field specifies the particular file or document to access on that server using that service. For example, the URL ftp://ftp.greatcircle.com/pub/firewalls/FAQ refers to the file /pub/firewalls/FAQ to be obtained via anonymous FTP from host ftp.greatcircle.com.
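The service, host, port, and path fields described above can be pulled apart mechanically. The following is a minimal sketch using Python's standard urllib.parse module; the describe_url helper and the DEFAULT_PORTS table are illustrative names, not part of any browser, and the port numbers other than 80 are the usual well-known assignments rather than values given in this section.

```python
from urllib.parse import urlsplit

# Well-known ports for the service names mentioned in the text.
# (Only port 80 for HTTP is stated in this section; the others are
# the standard assignments and are included for illustration.)
DEFAULT_PORTS = {"http": 80, "ftp": 21, "gopher": 70, "telnet": 23}


def describe_url(url):
    """Split a URL of the form service://host[:port]/path into its fields."""
    parts = urlsplit(url)
    # An explicit "host:port" wins; otherwise fall back to the standard port.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "service": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "path": parts.path,
    }


print(describe_url("http://www.somewhere.com:8080"))
print(describe_url("ftp://ftp.greatcircle.com/pub/firewalls/FAQ"))
```

The first URL yields port 8080 because the host field carries an explicit port; the second falls back to the standard FTP port and reports /pub/firewalls/FAQ as the path.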

For more information about URLs, see any good book on the World Wide Web.

8.6.1 Packet Filtering Characteristics of HTTP

HTTP is a TCP-based service. Clients use random ports above 1023. Most servers use port 80, but some don't. To understand why, you need some history.

Many of the modern information access services (notably HTTP, WAIS, and Gopher) were designed so that the servers don't have to run on a fixed well-known port on all machines. A standard well-known port was established for each of these services, but the clients and servers are all capable of using alternate ports as well. When you reference one of these servers, you can include the port number it's running on (assuming that it's not the standard port for that service) in addition to the name of the machine it's running on. For example, an HTTP URL of the form http://host.domain.net/file.html is assumed to refer to a server on the standard HTTP port (port 80); if the server were on an alternate port (port 8000, for example), the URL would be written http://host.domain.net:8000/file.html.

The protocol designers had two good and valid reasons for designing these services this way:

Doing so allows a single machine to run multiple servers for multiple data sets. You could, for example, run one HTTP server that's accessible to the world with data that you wish to make available to the public, and another that has other, nonpublic data on a different port that's restricted (via packet filtering or the authentication available in the HTTP server, for example).
Doing so allows users to run their own servers (which may be a blessing or a curse, depending on your particular security policy). Because the standard well-known ports are all in the "below 1024" range that's reserved for use only by root on UNIX machines, unprivileged users can't run their servers on the standard port numbers.

The ability to provide these services on nonstandard ports has its uses, but it complicates things considerably from a packet filtering point of view. If your users wish to access a server running on a nonstandard port, you have several choices:

You can tell the users they can't do it; this may or may not be acceptable, depending on your environment.
You can add a special exception for that service to your packet filtering setup. This is bad for your users because it means that they first have to recognize the problem and then wait until you've fixed it, and it's bad for you because you'll constantly have to be adding exceptions to the filter list.
You can try to convince the server's owner to move the server to the standard port. While encouraging folks to use the standard ports as much as possible is a good long-term solution, it's not likely to yield immediate results.
You can use some kind of proxied version of the client. This requires setup on your end, and may restrict your choice of clients. On the other hand, both Mosaic and Netscape Navigator support proxying, and they are by far the most popular clients.
If you can filter on the ACK bit, you can allow all outbound connections, regardless of destination port. This opens up a wide variety of services, including passive-mode FTP. It also noticeably increases your vulnerability.

The good news is that the vast majority of these servers (probably much greater than 90%) use the standard port, and that the more widely used and important the server is, the more likely it is to use the standard port. Many servers that use nonstandard ports use one of a few easily recognizable substitutes (81, 800, 8000, and 8080).

Your firewall will probably prevent people on your internal network from setting up their own servers at nonstandard ports (you're not going to want to allow inbound connections to arbitrary ports above 1023). You could set up such servers on a bastion host, but wherever possible, it's kinder to other sites to leave your servers on the standard port.

Direction	Source Addr.	Dest. Addr.	Protocol	Source Port	Dest. Port	ACK Set	Notes
In	Ext	Int	TCP	>1023	80[18]	[19]	Incoming session, client to server
Out	Int	Ext	TCP	80[18]	>1023	Yes	Incoming session, server to client
Out	Int	Ext	TCP	>1023	80[18]	[19]	Outgoing session, client to server
In	Ext	Int	TCP	80[18]	>1023	Yes	Outgoing session, server to client

[18] 80 is the standard port number for HTTP servers, but some servers run on different port numbers.
[19] ACK is not set on the first packet of this type (establishing connection) but will be set on the rest.
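To make the four rules concrete, here is a rough sketch of how a filter might evaluate them. It is not the syntax of any real packet filtering product: the Packet class and http_rule function are hypothetical, the direction field stands in for the source and destination address columns, and only the standard port 80 is handled.

```python
from dataclasses import dataclass

HTTP_PORT = 80  # standard port; servers on other ports are not covered here


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet summary; 'direction' replaces the address columns."""
    direction: str  # "in" (Ext to Int) or "out" (Int to Ext)
    protocol: str   # e.g. "TCP"
    src_port: int
    dst_port: int
    ack: bool       # True if the TCP ACK bit is set


def http_rule(packet):
    """Return the note of the matching table row, or None to deny the packet."""
    if packet.protocol != "TCP":
        return None
    # Client-to-server packets may have ACK clear (first packet) or set.
    client_to_server = packet.src_port > 1023 and packet.dst_port == HTTP_PORT
    # Server-to-client packets are always replies, so ACK must be set.
    server_to_client = (
        packet.src_port == HTTP_PORT and packet.dst_port > 1023 and packet.ack
    )
    if packet.direction == "in" and client_to_server:
        return "Incoming session, client to server"
    if packet.direction == "out" and server_to_client:
        return "Incoming session, server to client"
    if packet.direction == "out" and client_to_server:
        return "Outgoing session, client to server"
    if packet.direction == "in" and server_to_client:
        return "Outgoing session, server to client"
    return None


# An inbound packet from port 80 without ACK would be an attempt to open a
# connection from outside, so it matches no rule and is denied.
print(http_rule(Packet("in", "TCP", 80, 40000, ack=False)))
print(http_rule(Packet("in", "TCP", 80, 40000, ack=True)))
```

The ACK test on the server-to-client rows is what keeps an outside host from using source port 80 to open new connections to internal ports above 1023.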

8.6.2 Proxying Characteristics of HTTP

Various HTTP clients (such as Mosaic and Netscape Navigator) transparently support various proxying schemes. Some clients support SOCKS; others support user-transparent proxying via special HTTP servers, and some support both. (See the discussion of SOCKS and proxying in general in Chapter 7.)

The CERN HTTP server, developed at the European Particle Physics Laboratory in Geneva, Switzerland, has a proxy mode in which the server handles all requests for remote documents from browsers inside the firewall. The server makes the remote connection, passing the information back to the clients transparently. See Appendix B for information about getting the CERN HTTP server.

Using the CERN HTTP server as a proxy server can provide an additional benefit, because the server can locally cache WWW pages obtained from the Internet. This caching can significantly improve client performance and reduce network bandwidth requirements. It does this by ensuring that popular WWW pages are retrieved only once at your site. The second and subsequent requests get the locally cached copy of the page, rather than a new copy each time from the original server out on the Internet.
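The caching idea can be illustrated with a few lines of code. This is only a sketch of the principle, not how the CERN server is implemented: the CachingProxy class is a hypothetical name, the fetch function is supplied by the caller, and real caches must also deal with expiry, page size, and documents that should never be cached.

```python
class CachingProxy:
    """Toy cache: each URL is fetched from the remote server at most once."""

    def __init__(self, fetch):
        self._fetch = fetch        # callable taking a URL, returning the page
        self._cache = {}           # URL -> locally stored copy of the page
        self.remote_fetches = 0    # how many requests actually left the site

    def get(self, url):
        if url not in self._cache:
            # First request for this page: go out to the original server.
            self.remote_fetches += 1
            self._cache[url] = self._fetch(url)
        # Second and subsequent requests are answered from the local copy.
        return self._cache[url]


def fake_fetch(url):
    """Stand-in for a real network fetch, so the sketch runs offline."""
    return "contents of " + url


proxy = CachingProxy(fake_fetch)
for _ in range(3):
    proxy.get("http://host.domain.net/file.html")
print(proxy.remote_fetches)
```

Three internal requests for the same page result in a single remote fetch, which is where the bandwidth saving comes from.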

The TIS FWTK also includes an HTTP proxy server, called http-gw, that can be used with any client program. Clients that support HTTP proxying can use the FWTK HTTP proxy server transparently (all you have to do is configure the client to tell it where the server is), but you must enforce custom user procedures for clients that don't support HTTP proxying. Basically, URLs have to be modified to direct the clients to the proxy server rather than the real server. URLs embedded in HTML documents that pass through the server are modified automatically, but users must know how to do it by hand for URLs they type in from scratch or obtain through other channels. Chapter 7 describes the TIS FWTK in more detail.

8.6.3 HTTP Security Concerns

There are two basic sets of security concerns regarding HTTP:

What can a malicious client do to your HTTP server?
What can a malicious HTTP server do to your clients?

The following sections describe these concerns.

8.6.3.1 What can a malicious client do to your HTTP server?

In most ways, the security concerns we have for an HTTP server are very similar to the security concerns we have for any other server that handles connections from the Internet, e.g., an anonymous FTP server. You want to make sure that the users of those connections can access only what you want them to access, and that they can't trick your server so they get to something they shouldn't.

There are a variety of methods to accomplish these goals, including:

Carefully configure the security and access control features of your server to restrict its capabilities and what users can access with it.[20]
[20] For a more complete discussion of these features and their use, see the chapters on HTTP servers in Managing Internet Information Services.
Run the server as an unprivileged user.
Use the chroot mechanism to restrict the server's operation to a particular section of your filesystem hierarchy. You can use chroot either within the server or through an external wrapper program.
Don't put anything sensitive on the server machine in the first place. In this way, even if somebody does "break out" somehow, there's nothing else of interest on the machine; at least, nothing that they couldn't already get to anyway via the normal access procedures.
Configure the rest of your network security so that even if an attacker manages to totally compromise the server host, they're going to have a hard time getting any further into your network. To start with, don't put the server on an internal net.

HTTP servers themselves provide a limited service and don't pose major security concerns. However, there is one unique feature of HTTP servers that you need to worry about: their use of external programs, particularly ones that interact with the user via the Common Gateway Interface (CGI), which is the piece of HTTP that specifies how user information is communicated to the server and from it to external programs. Many HTTP servers are configured to run other programs to generate HTML pages on the fly. These programs are generically called CGI scripts, even if they don't use CGI and aren't scripts. For example, if someone issues a database query to an HTTP server, the HTTP server runs an external program to perform the query and generate an HTML page with the answers.

There are two things you need to worry about with these external programs:

Can an attacker trick the external programs into doing something they shouldn't?
Can an attacker upload his own external programs and cause them to be executed?

You may want to run your HTTP server on a Macintosh, DOS, or Windows machine. These machines have good HTTP server implementations available, but don't generally have the other capabilities that would make those servers insecure. For example, they are unlikely to be running other servers, they don't have a powerful and easily available scripting facility, and they're less likely to have other data or trusted access to other machines. The downside of this is that it's hard to do interesting things on them; the easier it gets, the less secure they'll be.

Tricking external programs

The external programs run by HTTP servers are often shell scripts written by folks who have information they want to provide access to, but who know little or nothing about writing secure shell scripts (which is by no means trivial, even for an expert).

Because it's difficult to ensure the security of the scripts themselves, about the best you can do is try to provide a secure environment (using chroot and other mechanisms) that the scripts can run in (one which, you hope, they can't get out of). There should be nothing in the environment you'd worry about being revealed to the world. Nothing should trust the machine the server is running on. If you set up the environment in this way, then even if attackers somehow manage to break out of the restricted environment and gain full access to the machine, they're not much further along towards breaking into the really interesting stuff on your internal network.

Alternatively, or in addition, if you have people who you feel sure are capable of writing secure scripts, you can have all the scripts written, or at least reviewed, by these people. Most sites don't have people like this readily available, but if you are going to be seriously involved in providing WWW service, you may want to hire one. It's still a good idea to run the scripts in a restricted environment; nobody's perfect.
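As an illustration of what "writing secure scripts" involves, the sketch below shows two common precautions: validating user-supplied input against a narrow pattern, and running the external program from an argument list so that no shell ever interprets the input. It is written in Python rather than as a shell script, the run_query function and SAFE_TERM pattern are hypothetical, and the Python interpreter itself stands in for whatever lookup program a real site would run.

```python
import re
import subprocess
import sys

# Accept only short terms made of letters, digits, and a few punctuation
# characters, starting with a letter or digit (so the term can never look
# like a command-line option). Everything else is rejected outright.
SAFE_TERM = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]{0,63}")

# Stand-in for a site's real query program (hypothetical): it just echoes
# its single argument back.
LOOKUP_PROGRAM = [sys.executable, "-c", "import sys; print('result for', sys.argv[1])"]


def run_query(term):
    """Run the lookup program on one validated, user-supplied term."""
    if not SAFE_TERM.fullmatch(term):
        raise ValueError("rejected query term")
    # Passing a list (and no shell=True) means the term is handed to the
    # program as a single argument; characters such as ';' or '`' are never
    # interpreted by a shell.
    result = subprocess.run(
        LOOKUP_PROGRAM + [term],
        capture_output=True,
        text=True,
        check=True,
        timeout=10,
    )
    return result.stdout


print(run_query("firewalls"), end="")
```

A term such as "x; rm -rf ." fails the pattern check and raises ValueError before any program is started. Precautions like these reduce the risk but do not remove the need for the restricted environment described above.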

Uploading external programs

The second concern is that attackers might be able to upload their own external programs and cause your server to run them. How could attackers do this? Suppose the following:

Your HTTP server and your anonymous FTP server both run on the same machine.
They can both access the same areas of the filesystem.
There is a writable directory somewhere in those areas, so that customers can upload core dumps from your product via FTP for analysis by your programmers, for example.

In this case, the attacker might be able to upload his own script or binary to that writable directory using anonymous FTP, and then cause the HTTP server to run it.

What is your defense against things like this? Once again, your best bet is to restrict what filesystem areas each server can access (generally using chroot), and to provide a restricted environment in which each server can run.
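One small, application-level complement to chroot is for the server to refuse to run any program that does not resolve to a location inside its dedicated script directory. The sketch below shows such a check; it is much weaker than chroot and is not a substitute for it, and the is_inside helper and the directory names are hypothetical.

```python
import os


def is_inside(path, allowed_dir):
    """Return True if path resolves to a location inside allowed_dir.

    realpath() collapses ".." components and follows symbolic links, so a
    request cannot escape the directory by either trick.
    """
    real = os.path.realpath(path)
    root = os.path.realpath(allowed_dir)
    return os.path.commonpath([real, root]) == root


# Hypothetical layout: scripts live in one directory, FTP uploads in another.
SCRIPT_DIR = "/srv/www/cgi-bin"

print(is_inside("/srv/www/cgi-bin/query", SCRIPT_DIR))
print(is_inside("/srv/www/cgi-bin/../../ftp/incoming/evil", SCRIPT_DIR))
```

With this layout, a request that tries to reach the FTP upload area through ".." components is refused, because the resolved path falls outside the script directory.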

8.6.3.2 What can a malicious server do to your HTTP clients?

The security problems of HTTP clients are far more complex than those of HTTP servers. The basis of these client problems is that HTTP clients (like Mosaic and Netscape Navigator) are generally designed to be extensible and to run particular external programs to deal with particular data types. This extensibility can be abused by an attacker.

HTTP servers can provide data in any number of formats: plain text files, HTML files, PostScript documents, still image files (GIF and JPEG), movie files (MPEG), audio files, and so on. The servers use MIME, discussed briefly above in the section on electronic mail, to format the data and specify its type. HTTP clients generally don't attempt to understand and process all of these different data formats. They understand a few (such as HTML, plain text, and GIF), and they rely on external programs to deal with the rest. These external programs will display, play, preview, print, or do whatever is appropriate for the format.

For example, UNIX Web browsers confronted with a PostScript file will ordinarily invoke the GhostScript program, and UNIX Web browsers confronted with a JPEG file will ordinarily invoke the xv program. The user controls (generally via a configuration file) what data types the HTTP client knows about, which programs to invoke for which data types, and what arguments to pass to those programs. If the user hasn't provided his own configuration file, the HTTP client generally uses a built-in default or a systemwide default.

All of these external programs present two security concerns:

What are the inherent capabilities of the external programs an attacker might take advantage of?
What new programs (or new arguments for existing programs) might an attacker be able to convince the user to add to his configuration?

An example

Let's consider, for example, what an HTTP client is going to do with a PostScript file. PostScript is a language for controlling printers. While primarily intended for that purpose, it is a full programming language, complete with data structures, flow of control operators, and file input/output operators. These operators ("read file", "write file", "create file", "delete file", etc.) are seldom used, except on printers with local disks for font storage, but they're there as part of the language. PostScript previewers (such as GhostScript) generally implement these operators for completeness.

Suppose that a user uses Mosaic to pull down a PostScript document. Mosaic invokes GhostScript, and it turns out that the document has PostScript commands in it that say "delete all files in the current directory." If GhostScript executes the commands, who's to blame? You can't really expect Mosaic to scan the PostScript on the way through to see if it's dangerous; that's an impossible problem. You can't really expect GhostScript not to do what it's told in valid PostScript code. You can't really expect your users not to download PostScript code, or to scan it themselves.

Current versions of GhostScript have a safer mode they run in by default. This mode disables "dangerous" operators such as those for file input/output. But what about all the other PostScript interpreters or previewers? And what about the applications to handle all the other data types? How safe are they? Who knows?

Even if you have safe versions of these auxiliary applications, how do you keep your users from changing their configuration files to add new applications, run different applications, or pass different arguments (for example, to disable the safer mode of GhostScript) to the existing applications?

Why would a user do this? Suppose that the user found something in the WWW that claimed to be something he really wanted - a game demo, a graphics file, a copy of Madonna's new song, whatever. And, suppose that this desirable something came with a note that said "Hey, before you can access this Really Cool Thing, you need to modify your Mosaic configuration, because the standard configuration doesn't know how to deal with this thing; here's what you do..." And, suppose that the instructions were something like "remove the `-dSAFER' flag from the `ghostscript' line of your .mosaicrc file," or "add this line to your .mosaicrc file."

Would your users recognize that they were being instructed to disable the safer mode in GhostScript, or to add some new data type with /bin/sh as its auxiliary program, so that whatever data of that type came down was passed as commands straight to the shell? Even if they recognized it, would they do it anyway (nice, trusting people that they are)?
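An administrator could at least check client configurations for the two dangerous changes just described. The following is a rough sketch of such an audit. It assumes a simplified, mailcap-style file in which each line has the form "type; command", which is not necessarily the syntax of a real .mosaicrc file; the audit_viewers function and the sample entries are hypothetical.

```python
import shlex

# Program names treated as command shells or as GhostScript (illustrative,
# not exhaustive).
SHELLS = {"sh", "bash", "csh", "ksh", "tcsh", "zsh"}
GHOSTSCRIPT = {"gs", "ghostscript"}


def audit_viewers(config_text):
    """Return (data_type, reason) pairs for risky auxiliary-program entries."""
    findings = []
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ";" not in line:
            continue  # skip blanks, comments, and lines without a command
        fields = [field.strip() for field in line.split(";")]
        data_type, command = fields[0], fields[1]
        argv = shlex.split(command)
        if not argv:
            continue
        program = argv[0].rsplit("/", 1)[-1]  # drop any directory prefix
        if program in SHELLS:
            findings.append((data_type, "data is passed to a shell"))
        elif program in GHOSTSCRIPT and "-dSAFER" not in argv:
            findings.append((data_type, "GhostScript runs without -dSAFER"))
    return findings


SAMPLE = """\
# hypothetical user configuration
application/postscript; ghostscript %s
image/jpeg; xv %s
application/x-cool-thing; /bin/sh %s
"""

for data_type, reason in audit_viewers(SAMPLE):
    print(data_type, "-", reason)
```

In the sample, the PostScript entry is flagged because the safer-mode flag is missing, and the made-up "cool thing" type is flagged because its data goes straight to /bin/sh. A check like this catches only the patterns it knows about, so it supplements user education rather than replacing it.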

Some people believe that Macintosh and PC-based versions of WWW browsers are less susceptible to some of these security problems than UNIX-based browsers. On Mac and PC machines, there is usually no shell (or only a shell of limited power, like the MS-DOS command interpreter) that an attacker can break out to, and a limited and highly unpredictable set of programs to access once they're there. Also, if any damage occurs, it can often be more easily isolated to a single machine. On the other hand, "highly unpredictable" does not mean "completely unpredictable". (For example, a very large percentage of Macs and PCs have copies of standard Microsoft applications, like Word and Excel.) Further, if your Macs and PCs are networked with AppleShare, Novell, PC-NFS, or something similar, you can't make any assumptions about damage being limited to a single machine.

What can you do?

There is no simple, foolproof defense against the type of problem we've described. At present, you have to rely on a combination of carefully installed and configured client and auxiliary programs, and a healthy dose of user education and awareness training. This is an area of active research and development, and both the safeguards and the attacks will probably develop significantly over the next couple of years.

Because Mac and PC clients seem less susceptible to some of the client-side problems, some sites take the approach of allowing WWW access only from Macs or PCs. Some go even further and limit access to particular machines (often placed in easily accessible locations like libraries or cafeterias) that have been carefully configured so they have no sensitive information on them, and no access to such information. The idea is this: If anything bad happens, it will affect only this one easily rebuilt machine. The machine can't be used to access company data on other machines.

Some people have experimented, at least in UNIX environments, with running Mosaic and its auxiliary programs under the X Window System in a restricted environment - or on a "sacrificial goat" machine that has nothing else on it - with the displays directed to their workstation. This provides a certain measure of protection, but it also imposes a certain amount of inconvenience. Consider the following problems with this approach:

This approach works only for the UNIX/X version of Mosaic. If you have Mac and PC users, they're going to have to run X on their system, log in to the goat system, set up the restricted environment, and start Mosaic. All in all, this may be more interaction with UNIX than they're willing to put up with.
Any files legitimately retrieved during the session are going to wind up in the restricted environment or on the goat machine. Then, they're going to have to be transferred separately to the machine where they are really wanted.
This approach generally doesn't work for audio files, which will end up being played on the audio system of the goat machine (or wherever the restricted environment is), not on the user's machine.

As discussed above in the section called "Packet Filtering Characteristics of HTTP," there is another complication of WWW clients in environments in which packet filtering is part of the firewall solution: not all HTTP servers run on port 80. To address this, you might consider using proxy servers for HTTP access. If you do this, the internal clients talk on standard ports through the packet filtering system to the proxy server, and the proxy server talks on arbitrary ports (because it's outside the packet filtering system) to the real server.

8.6.4 Secure HTTP

You may hear discussions of Secure HTTP and wonder how it relates to firewalls and the configuring of services. Secure HTTP is not designed to solve the kinds of problems we've been discussing in this section. It's designed to deal with privacy issues by encrypting the information that is being passed around via HTTP. A mechanism like Secure HTTP is necessary to be able to do business using HTTP so that things like credit card numbers can be passed over the Internet without fear of capture by packet sniffers. In order to distinguish between privacy issues, on the one hand, and vulnerability to malicious servers, on the other hand, people working on HTTP and similar extensible protocols usually use the word "safe" to refer to protocols that protect you from hostile servers, and the word "secure" to refer to protocols that protect you from data snooping.

Because it provides authentication as well as encryption, Secure HTTP could eventually provide some assistance with safety. If you are willing to connect only to sites that you know, that run Secure HTTP, and that authenticate themselves, you can be sure that you're not talking to a hostile site. However, even when Secure HTTP is released and in wide usage, this approach (limited connections) is unlikely to be a popular and practical one; part of the glory of the Web is being able to go to new and unexpected places.

Although people are working on HTTP-like protocols that are safe, safe HTTP is probably not a viable concept. It's not HTTP that's unsafe; it's the fact that HTTP is transferring programs in other languages. This is a major design feature of HTTP and one of the things responsible for its rapid spread.

8.6.5 Summary of WWW Recommendations

If you're going to run an HTTP server, use a dedicated bastion host if possible.
If you're going to run an HTTP server, carefully configure the HTTP server to control what it has access to; in particular, watch out for ways that someone could upload a program to the system somehow (via mail or FTP, for example), and then execute it via the HTTP server.
Carefully control the external programs your HTTP server can access.
You can't allow internal hosts to access all HTTP servers without allowing them to access all TCP ports, because some HTTP servers use nonstandard port numbers. If you don't mind allowing your users access to all TCP ports, you can use packet filtering to examine the ACK bit to allow outgoing connections to those ports (but not incoming connections from those ports). If you do mind, then either restrict your users to servers on the standard port (80), or use proxying.
Proxying HTTP is easy, and a caching proxy server offers network bandwidth benefits as well as security benefits.
Configure your HTTP clients carefully and warn your users not to reconfigure them based on external advice.

