The incident response plan is not the only thing that you need to have ready in advance. There are a number of practices and procedures that you need to set up so that you'll be able to respond quickly and effectively when an incident occurs. Most of these procedures are general good practice; some of them are aimed at letting you recover from any kind of disaster; and a few are specific to security incidents.
Your filesystem backups are probably the single most important part of your recovery plan. Before you do anything else (including writing your response plan), make sure that your site's backup plan is a solid one and that it works. Don't assume that it's OK just because you haven't had a problem yet. It is entirely possible to go for months without noticing that you have no backups at all, and it may take you years to notice that they're only partially broken. Unfortunately, when you do notice, it's often when you need the backups most, and the outcome is likely to be disastrous.
Backups are vital for two reasons:
If your site suffers serious damage and you have to restore your systems from scratch, you will need these backups.
If you aren't sure of the extent of the damage, backups will help you to determine what changes were made to a system and when.
Every organization needs a backup plan and not just for security reasons. If you don't have one, that's probably a sign that your current backup system is not OK. When you are doing incident-response planning, however, pay special attention to your backup plan.
For your security-critical systems (e.g., bastion hosts and servers), you might want to consider keeping your monthly or weekly backups indefinitely, rather than recycling them as you would your regular systems. If an incident does occur, you can use this archive of backup tapes to recover a "snapshot" of the system as of any of the dates of the backups. Snapshots of this kind can be helpful in investigating security incidents. For example, if you find that a program has been modified, going back through the snapshots will tell you approximately when the modification took place. That may tell you when the break-in occurred; if the modification happened before the break-in, it may tell you that it was an accident and not part of the incident at all.
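As a rough illustration of the snapshot comparison described above, the following Python sketch narrows down when a file changed. It assumes you have already restored the file from each archived backup and computed a digest for each copy; the function name and the (date, digest) input format are hypothetical choices for this example, not part of any backup product.

```python
from datetime import date


def find_change_windows(snapshots):
    """Given (snapshot_date, digest) pairs for one file, return change windows.

    Each window is (last date with the old contents, first date with the
    new contents). The modification happened somewhere between the two.
    """
    ordered = sorted(snapshots, key=lambda pair: pair[0])
    windows = []
    # Walk through consecutive snapshots and note every digest change.
    for (earlier_date, earlier_digest), (later_date, later_digest) in zip(
        ordered, ordered[1:]
    ):
        if earlier_digest != later_digest:
            windows.append((earlier_date, later_date))
    return windows
```

The result is only as precise as the spacing of your backups: with monthly snapshots, you learn the month in which the change happened, not the day.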
If you're not sure whether or not you should be worried, try testing your backup system. Play around and see what you can restore. Ask these questions:
Can you restore files from all of your tapes?
Can you do a restore of an entire filesystem?
If you pick a specific file, can you figure out how to restore it?
If you have a corrupt file and want a version from before it was corrupted, can you do that?
If all of your disks died (or were trashed by an attacker) simultaneously, would you be able to rebuild your computer facility?
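Some of the questions above can be checked mechanically. The sketch below is one minimal way to do that in Python, assuming your backups are (or can be copied off tape as) uncompressed tar archives; that assumption, and the function names, are illustrative only. It does not replace an actual test restore onto a blank disk.

```python
import tarfile


def archive_is_readable(archive_path):
    """Return True if every regular file in the tar archive reads to the end."""
    try:
        # "r:" means plain, uncompressed tar (an assumption of this sketch).
        with tarfile.open(archive_path, "r:") as archive:
            for member in archive:
                if member.isfile():
                    extracted = archive.extractfile(member)
                    # Read the whole member so media errors surface now.
                    while extracted.read(65536):
                        pass
        return True
    except (tarfile.TarError, OSError):
        return False


def archives_containing(archive_paths, wanted_name):
    """Return the readable archives that hold a copy of wanted_name."""
    found = []
    for archive_path in archive_paths:
        try:
            with tarfile.open(archive_path, "r:") as archive:
                if wanted_name in archive.getnames():
                    found.append(archive_path)
        except (tarfile.TarError, OSError):
            # An unreadable archive cannot supply the file; skip it.
            continue
    return found
```

Running `archive_is_readable` over every archive addresses the first question; `archives_containing` tells you which backups could supply a specific file, including versions that predate a corruption.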
Even the best backup system won't work if the backup images aren't safeguarded. Don't rely on on-line backups; keep your media in a secure place, separate from the data they're backing up.
NOTE: The design of backup systems is outside the scope of this book. This description, along with the description in Chapter 12, Maintaining Firewalls, provides only a summary. If you're uncertain about your backup system, you'll want to look at a general system-administration reference. See Appendix A for complete information on additional resources.
As organizations grow, they acquire hardware; they configure networking in different ways; and they add or change equipment of various kinds. Usually only one or two people really know what a site's systems look like in any detail.
Information about system configuration may be crucial to investigating and controlling a security incident. While you may know exactly how everything works and fits together at your site, you may not be the person who has to respond to the incident. What if you're on vacation? Think about what your managers or coworkers would need to know about each system in order to respond effectively to an incident involving that system.
Labels and diagrams are crucial in an emergency. System labels should indicate what a system is, what it does, what its physical configuration is (how much disk space, how much memory, etc.), and who is responsible for it. They should be attached firmly to the correct systems and easily legible. Use large type sizes, and put at least minimal labels on the back as well as the front (the front of a machine may have more flat space, but you're probably going to be looking at it from behind when you're trying to work on it). Network diagrams should show how the various systems are connected, both physically and logically, as well as things like what kind of packet filtering is done where.
Be sure that labels are kept up to date as you move systems around; wrong labels are worse than no labels at all. It's particularly important to label racked equipment and equipment with widely scattered pieces. There's nothing more frustrating than turning off all the equipment in a rack, only to discover that some of it was actually part of the computer in the next rack over, which you meant to leave running.
Information that's easily available when machines are working normally may be impossible to find if machines are not working. For example, you'll need disk partition tables written down in order to reformat and reinstall disks, and you may need a printed copy of the host table in order to configure machines as they're brought back up.
Once you've had a break-in, you need to know what's been changed on your systems. The standard tools that come with your operating system won't tell you; intruders can fake modification dates and match the trivial checksums most operating systems provide. You will need to install a cryptographic checksumming program (such as Tripwire, which is discussed in Chapter 5), make checksums of important files, and store them where an intruder can't modify them (which generally means somewhere off-line). You may not need to checksum every system separately if they're all running the same release of the same operating system, although you should make sure that the checksum program is available on all your systems.
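To make the idea concrete, here is a minimal Python sketch of cryptographic checksumming. It is not Tripwire and does not reproduce how Tripwire works; it assumes SHA-256 as the digest, and the function names are hypothetical. A real deployment would also need to protect the checksum program itself and keep the baseline manifest off-line, as described above.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its digest."""
    manifest = {}
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(directory, name)
            # Skip symbolic links and special files in this simple sketch.
            if os.path.isfile(full_path) and not os.path.islink(full_path):
                relative = os.path.relpath(full_path, root)
                manifest[relative] = sha256_of_file(full_path)
    return manifest


def compare_manifests(baseline, current):
    """Return sorted lists of added, removed, and modified paths."""
    added = sorted(set(current) - set(baseline))
    removed = sorted(set(baseline) - set(current))
    modified = sorted(
        path
        for path in set(baseline) & set(current)
        if baseline[path] != current[path]
    )
    return added, removed, modified
```

Because the comparison depends only on file contents, a faked modification date does not hide a change. The baseline is trustworthy only if it was made before the break-in and stored where the intruder could not reach it.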
An activity log is a record of any changes that have been made to a system, both before an incident and during the response to an incident. Normally, you'll use an activity log to list programs you've installed, configuration files you've modified, or peripherals you've added. During an incident, you'll be doing a lot more logging.
What is the purpose of an activity log? A log allows you to redo the changes if you have to rebuild the system. It also lets you determine whether any of the changes affect the incident or the response. Without a log, you may find mystery programs; you don't know where they came from and what they were supposed to do, so you can't tell if the intruder installed them or not, if they still work the way they're supposed to, or how to rebuild them. Figure 13.4 shows a sampling of routine log entries and incident log entries.
There are a variety of easy ways to keep activity logs, both electronic and manual; email, notebooks, and tape recorders can all be used. Some are better for routine logs (those that record your activities before an incident occurs). Others may be more appropriate for incident logs (those that keep track of your activities during an incident).
Email to an appropriate staff alias that also keeps a record of all messages is probably the simplest approach to keeping an activity log. Not only will email keep a permanent record of system changes, but it has the side benefit of letting everybody else know what's going on as the changes are made. The email approach is good for routine logs, whereas manual methods are likely to work more reliably during an incident. During an actual security incident, your email system may be down, so any messages generated during the response may be lost. You may also be unable to reach existing on-line logs during an incident, so keep a printed copy of these email messages up to date in a binder somewhere.
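A routine log entry sent by email might be built along the following lines. This is a sketch only: the subject prefix, the body layout, and the function name are assumptions made for the example, and the addresses you pass in would be your own staff alias and sender.

```python
from datetime import datetime, timezone
from email.message import EmailMessage


def build_log_message(author, host, summary, details, log_alias, when=None):
    """Build an email recording one change to one system.

    log_alias is the staff alias that archives every message it receives.
    """
    when = when or datetime.now(timezone.utc)
    message = EmailMessage()
    message["From"] = author
    message["To"] = log_alias
    # A fixed subject prefix (hypothetical) makes entries easy to filter.
    message["Subject"] = f"[changelog] {host}: {summary}"
    message.set_content(
        f"Date: {when.isoformat()}\n"
        f"Host: {host}\n"
        f"Changed by: {author}\n"
        f"\n"
        f"{details}\n"
    )
    return message
```

The message can then be handed to your mail system, for example with `smtplib.SMTP(...).send_message(message)` from the Python standard library. Printing `str(message)` gives a copy suitable for the binder.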
Notebooks make a good incident log, but people must be disciplined enough to use them. For routine logs, notebooks may not be convenient, because they may not be physically accessible when people actually make changes to the system. Some sites use a combination of electronic and paper logs for routine logs, with a paper logbook kept in the machine room for notes. This works as long as it's clear which things should be logged where; having two sets of logs to keep track of can be confusing.
Pocket tape recorders make good incident logs, although they require that somebody transcribe them later on. They're not reasonable for routine logging.
Well before a security incident, collect the tools and supplies that you are likely to need during that incident. You don't want to be running around, begging and borrowing, when the clock is ticking.
Here are some of the things you'll need to have in order to respond well to an incident. (Actually, these are things you ought to have around at all times; they come in handy in all sorts of disasters.)
Blank backup tapes and possibly spare disks as well.
Basic tools; you'll need them if you disconnect your system from the external network, or if you need to rewire the internal network to disconnect compromised hosts. Make sure you have a ladder if your site uses in-ceiling cabling or tall equipment racks.
Spare networking equipment: at least transceivers and cables.
Set aside basic supplies (e.g., a backup's worth of media, a few transceivers and cables, the most critical tools, notebooks or tape recorders for incident logs) in a cache to be used only in case of disaster. This should be separate from your normal stock of spare parts and tools.
If a serious security incident occurs, you may need to restore your system from backups. In this case, you will need to load a minimal operating system before you can load the backups. Are you equipped to do this?
Make sure that you:
Understand your system's operating system installation procedures
Understand the procedures for restoring from backups
Have all the materials (distribution media, manuals, etc.) available to restore the system
Test your reload plans and procedures before you really need them
Testing your ability to reload the operating system is a good idea, and too few organizations ever do it. You can learn a lot by doing this. The middle of reloading a dead system is not a good time to discover that you've got a bad copy of the distribution media. It's also not a good time to discover that the people who have to do the reload can't figure out how to do it. The best way to test is to designate the least experienced people who might have to do the work, and let them try out the reload well ahead of time.
Most organizations find that the first time they try to reinstall the operating system and restore on a completely blank disk, the operation fails. This can happen for a number of reasons, although the usual reason is a failure in the design of the backup system. One site found that they were doing their backups with a program that wasn't distributed with the operating system, so they couldn't restore from a fresh operating system installation. (After that, they made a tape of the restore program using the standard operating system tools; they could then load the standard operating system, recover their custom restore program, and reload their data from backups.)
Don't assume that responding to a security incident will come naturally. Like everything else, such a response benefits from practice. Test your own organization's ability to respond to an incident by running occasional drills.
There are two basic types of drills:
In a paper (or "tabletop") drill, you gather all the relevant people in a conference room (or over pizza at your local hangout), outline a hypothetical problem, and work through the consequences and recovery procedures. It's important to go through all the details, step by step, to expose any missing pieces or misunderstandings.
In a live drill, you actually carry out a response and recovery procedure. A live drill can be performed, with appropriate notice to users, during scheduled system downtimes.
You might also test only parts of your response. For example, before configuring a new machine, use it to test your recovery procedures by recovering an existing machine onto it. If you have downtime scheduled for your facility, you may be able to use it to test what happens when you disconnect from the network. Run your checksum comparison program before and after you install changes to the operating system to see what changes it catches when you think everything's the same, and what it does about the things you know have changed. Coordinate with another site to see what messages are logged when various types of attacks occur (pick someone you know and trust and who'll reliably tell you exactly what they did, or do it yourself). Try taking all of your central machines down at the same time and see whether they'll all come back up in this situation. (Do this when you have a few hours to spare; if it doesn't work, it often takes a while to figure out how to coax the machines past their interdependencies.)
This is all a lot of trouble, but there is a certain amount of perverse amusement to be had by playing around with fictitious disasters, and it's much less stressful than having to improvise in a real disaster.