11.6 Checking Name Service

An "unknown host" error message returned by the user's application indicates a name server problem. Such problems can usually be diagnosed with nslookup or dig. nslookup is discussed in detail in Chapter 8. dig is an alternative tool with similar functionality that is discussed in this chapter. Before looking at dig, let's take another look at nslookup and see how it is used to troubleshoot name service.

Three of the nslookup features covered in Chapter 8 are particularly important for troubleshooting remote name server problems. These features are its ability to:

Locate the authoritative servers for the remote domain using the NS query
Obtain all records about the remote host using the ANY query
Browse all entries in the remote zone using nslookup's ls and view commands

When troubleshooting a remote server problem, directly query the authoritative servers returned by the NS query. Don't rely on information returned by non-authoritative servers. If the problems that have been reported are intermittent, query all of the authoritative servers in turn and compare their answers. Intermittent name server problems are sometimes caused by the remote servers returning different answers to the same query.
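Querying every authoritative server in turn is easy to script. The following is a minimal sketch rather than a finished tool. It assumes a version of dig that supports the +short option (which prints only the record data), and the function name compare_servers is our own invention for illustration.

```shell
# compare_servers ZONE HOST
# Ask every server named in ZONE's NS records for HOST's address
# records and print one line per server, so that servers giving
# different answers stand out.
# Assumption: dig supports +short (prints only the record data).
compare_servers() {
    zone=$1
    host=$2
    for server in $(dig +short "$zone" NS); do
        # Sort and join the addresses so the lines are easy to compare.
        answer=$(dig +short "@$server" "$host" A | sort | paste -s -d ' ' -)
        # An empty answer means the server returned no address record.
        printf '%s: %s\n' "$server" "${answer:-NO ANSWER}"
    done
}
```

For the domain used in the examples in this section, the call would be compare_servers foo.edu. hamster.foo.edu. A server that prints NO ANSWER while the others print an address is the one to investigate.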

The ANY query returns all records about a host, thus giving the broadest range of troubleshooting information. Simply knowing what information is (and isn't) available can solve a lot of problems. For example, if the query returns an MX record but no A record, it is easy to understand why the user couldn't telnet to that host! Many hosts that can receive mail are not reachable by other network services. In this case, the user is confused and is trying to use the remote host in an inappropriate manner.

If you are unable to locate any information about the hostname that the user gave you, perhaps the hostname is incorrect. Given that the hostname you have is wrong, looking for the correct name is like trying to find a needle in a haystack. However, nslookup can help. Use nslookup's ls command to dump the remote zone file, and redirect the listing to a file. Then use nslookup's view command to browse through the file, looking for names similar to the one the user supplied. Many problems are caused by a mistaken hostname.
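The same search can be done non-interactively once you have a zone listing in a file or on a pipe. The sketch below is one possible approach, not the method used in the examples that follow. It assumes the listing has one record per line with the owner name in the first field, and similar_names is a hypothetical helper name. Keep in mind that many servers refuse zone transfers to unknown hosts, so a listing may not be available at all.

```shell
# similar_names PATTERN
# Read a zone listing on standard input (one record per line, owner
# name in the first field) and print each distinct owner name that
# contains PATTERN, ignoring case.
similar_names() {
    awk -v pat="$1" '
        /^;/ || NF == 0 { next }          # skip comments and blank lines
        index(tolower($1), tolower(pat)) && !seen[$1]++ { print $1 }
    '
}

# Usage sketch (server and zone names are examples only):
#   dig @gerbil.foo.edu foo.edu. axfr | similar_names hamst
```

Searching for a short fragment of the name the user supplied, rather than the whole name, makes it more likely that a misspelled hostname will still match.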

All of the nslookup features and commands mentioned here are used in Chapter 8. However, some examples using these commands to solve real name server problems will be helpful. The three examples that follow are based on actual trouble reports. [7]

[7] The host and server names are fictitious, but the problems were real.

11.6.1 Some systems work, others don't

A user reported that she could resolve a certain hostname from her workstation, but could not resolve the same hostname from the central system. However, the central system could resolve other hostnames. We ran several tests and found that we could resolve the hostname on some systems and not on others. There seemed to be no predictable pattern to the failure. So we used nslookup to check the remote servers.

% nslookup
Default Server:  almond.nuts.com
Address:  172.16.12.1

> set type=NS
> foo.edu.
Server:  almond.nuts.com
Address:  172.16.12.1

foo.edu        nameserver = gerbil.foo.edu
foo.edu        nameserver = red.big.com
foo.edu        nameserver = shrew.foo.edu
gerbil.foo.edu   inet address = 198.97.99.2
red.big.com   inet address = 184.6.16.2
shrew.foo.edu    inet address = 198.97.99.1
> set type=ANY
> server gerbil.foo.edu
Default Server:  gerbil.foo.edu
Address:  198.97.99.2

> hamster.foo.edu
Server:  gerbil.foo.edu
Address:  198.97.99.2

hamster.foo.edu        inet address = 198.97.99.8
> server red.big.com
Default Server:  red.big.com
Address:  184.6.16.2
> hamster.foo.edu
Server:  red.big.com
Address:  184.6.16.2

* red.big.com can't find hamster.foo.edu: Non-existent domain

This sample nslookup session contains several steps. The first step is to locate the authoritative servers for the hostname in question (hamster.foo.edu). We set the query type to NS to get the name server records, and query for the domain (foo.edu) in which the hostname is found. This returns the names of three authoritative servers: gerbil.foo.edu, red.big.com, and shrew.foo.edu.

Next, we set the query type to ANY to look for any records related to the hostname in question. Then we set the server to the first server in the list, gerbil.foo.edu, and query for hamster.foo.edu. This returns an address record. So server gerbil.foo.edu works fine. We repeat the test using red.big.com as the server, and it fails. No records are returned.

The next step is to get SOA records from each server and see if they are the same:

> set type=SOA
> foo.edu.
Server:  red.big.com
Address:  184.6.16.2

foo.edu        origin = gerbil.foo.edu
	mail addr = amanda.gerbil.foo.edu
	serial=10164, refresh=43200, retry=3600, expire=3600000,
	min=2592000
> server gerbil.foo.edu
Default Server:  gerbil.foo.edu
Address:  198.97.99.2

> foo.edu.
Server:  gerbil.foo.edu
Address:  198.97.99.2

foo.edu        origin = gerbil.foo.edu
	mail addr = amanda.gerbil.foo.edu
	serial=10164, refresh=43200, retry=3600, expire=3600000,
	min=2592000

> exit

If the SOA records have different serial numbers, perhaps the zone file, and therefore the hostname, has not yet been downloaded to the secondary server. If the serial numbers are the same and the data is different, as in this case, there is a definite problem. Contact the remote domain administrator and notify her of the problem. The administrator's mailing address is shown in the "mail addr" field of the SOA record, with the first dot standing in for the @ sign. In our example, we would send mail to amanda@gerbil.foo.edu reporting the problem.
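The serial number comparison and the mail address conversion can both be sketched in a few lines of shell. This is an illustration under stated assumptions, not a definitive implementation: it assumes the SOA record is available as a single line in the order origin, mail addr, serial, refresh, retry, expire, minimum (the form printed by dig with the +short option), and the function names are hypothetical. The mailbox conversion does not handle the rare case of a dot inside the user name.

```shell
# soa_serial: read one SOA record on standard input as a single line
# (origin, mail addr, serial, refresh, retry, expire, minimum) and
# print the serial number.
soa_serial() {
    awk 'NF >= 7 { print $3; exit }'
}

# soa_mailbox NAME: convert the SOA "mail addr" field to a mail
# address by dropping any trailing dot and replacing the first
# remaining dot with "@".
soa_mailbox() {
    printf '%s\n' "$1" | sed -e 's/\.$//' -e 's/\./@/'
}

# same_serial SERIAL...: succeed only if every serial given is identical.
same_serial() {
    first=$1
    for s in "$@"; do
        [ "$s" = "$first" ] || return 1
    done
    return 0
}
```

With the values from the session above, soa_mailbox amanda.gerbil.foo.edu prints amanda@gerbil.foo.edu, and same_serial 10164 10164 succeeds, which tells us that the two servers claim to hold the same version of the zone even though their answers differ.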

11.6.2 The data is here and the server can't find it!

This problem was reported by the administrator of one of our secondary name servers. The administrator reported that his server could not resolve a certain hostname in a domain for which his server was a secondary server. The primary server was, however, able to resolve the name. The administrator dumped his cache (more on dumping the server cache in the next section), and he could see in the dump that his server had the correct entry for the host. But his server still would not resolve that hostname to an IP address!

The problem was replicated on several other secondary servers. The primary server would resolve the name; the secondary servers wouldn't. All servers had the same SOA serial number, and a dump of the cache on each server showed that they all had the correct address records for the hostname in question. So why wouldn't they resolve the hostname to an address?

Visualizing the difference between the way primary and secondary servers load their data made us suspicious of the zone file transfer. Primary servers load the data directly from local disk files. Secondary servers transfer the data from the primary server via a zone file transfer. Perhaps the zone files were getting corrupted. We displayed the zone file on one of the secondary servers, and it showed the following data:

% cat /usr/etc/sales.nuts.com.hosts
PCpma      IN   A   172.16.64.159
      IN   HINFO   "pc" "n3/800salesnutscom"
PCrkc      IN   A   172.16.64.155
      IN   HINFO   "pc" "n3/800salesnutscom"
PCafc      IN   A   172.16.64.189
      IN   HINFO   "pc" "n3/800salesnutscom"
accu      IN   A   172.16.65.27
cmgds1   IN   A   172.16.130.40
cmg      IN   A   172.16.130.30
PCgns      IN   A   172.16.64.167
      IN   HINFO   "pc" "(3/800salesnutscom"
gw      IN   A   172.16.65.254
zephyr   IN   A   172.16.64.188
      IN   HINFO   "Sun" "sparcstation"
ejw      IN   A   172.16.65.17
PCecp      IN   A   172.16.64.193
      IN   HINFO   "pc" "nLsparcstationstcom"

Notice the odd display in the last field of the HINFO statement for each PC. [8] This data might have been corrupted in the transfer or it might be bad on the primary server. We used nslookup to check that.

[8] See Appendix C, A named Reference, for a detailed description of the HINFO statement.

% nslookup
Default Server:  almond.nuts.com
Address:  172.16.12.1

> server acorn.sales.nuts.com
Default Server:  acorn.sales.nuts.com
Address:  172.16.6.1

> set query=HINFO
> PCwlg.sales.nuts.com
Server:  acorn.sales.nuts.com
Address:  172.16.6.1

PCwlg.sales.nuts.com     CPU=pc  OS=ov
packet size error (0xf7fff590 != 0xf7fff528)
> exit

In this nslookup example, we set the server to acorn.sales.nuts.com, which is the primary server for sales.nuts.com. Next we queried for the HINFO record for one of the hosts that appeared to have a corrupted record. The "packet size error" message clearly indicates that nslookup was even having trouble retrieving the HINFO record directly from the primary server. We contacted the administrator of the primary server and told him about the problem, pointing out the records that appeared to be in error. He discovered that he had forgotten to put an operating system entry on some of the HINFO records. He corrected this, and it fixed the problem.
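A check for this kind of mistake can be run against the zone file on the primary server before the data is ever loaded. The sketch below is our own illustration, not a tool that was used in this incident. It assumes that every HINFO record sits on one line and should carry exactly two strings (the hardware type and the operating system), and check_hinfo is a hypothetical name. It catches missing fields only, and would not notice strings that are present but garbled.

```shell
# check_hinfo: read zone file lines on standard input and report every
# HINFO record that does not carry exactly two strings (CPU and OS).
# Exits with status 1 if any such record is found.
check_hinfo() {
    awk '
        # Count the strings in s; a quoted string may hold spaces.
        function count_strings(s,    n, c) {
            n = 0
            while (s != "") {
                sub(/^[ \t]+/, "", s)                # drop leading blanks
                if (s == "" || substr(s, 1, 1) == ";") break
                if (substr(s, 1, 1) == "\"") {
                    s = substr(s, 2)
                    c = index(s, "\"")               # find the closing quote
                    s = (c > 0) ? substr(s, c + 1) : ""
                } else {
                    sub(/^[^ \t]+/, "", s)           # skip an unquoted token
                }
                n++
            }
            return n
        }
        /^[ \t]*;/ { next }                          # skip comment lines
        {
            line = " " $0 " "
            if (match(line, /[ \t]HINFO[ \t]/)) {
                n = count_strings(substr(line, RSTART + RLENGTH))
                if (n != 2) {
                    printf "line %d: HINFO has %d string(s), expected 2\n", NR, n
                    bad = 1
                }
            }
        }
        END { exit bad + 0 }
    '
}

# Usage sketch, with the zone file name from the example above:
#   check_hinfo < /usr/etc/sales.nuts.com.hosts
```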

11.6.3 Cache corruption

The problem described above was caused by bad data corrupting the name server cache. Cache corruption can occur even if your system is not a secondary server. Sometimes the root server entries in the cache become corrupted. Dumping the cache can help diagnose these types of problems.

For example, a user reported intermittent name server failures. She had no trouble with any hostnames within the local domain, or with some names outside the local domain, but names in several different remote domains would not resolve. nslookup tests produced no solid clues, so the name server cache was dumped and examined for problems. The root server entries were corrupted, so named was reloaded to clear the cache and reread the named.ca file. Here's how it was done.

The SIGINT signal causes named to dump the name server cache to the file /var/tmp/named_dump.db. The following command passes named this signal:

# kill -INT `cat /etc/named.pid`

The process ID of named can be obtained from /etc/named.pid, as in the example above, because named writes its process ID in that file during startup. [9]

[9] On our Linux system the process ID is written to /var/run/named.pid.
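Because the location of the process ID file differs between systems, a script that signals named may want to try more than one path. The following is a small sketch under that assumption. The helper name find_pidfile is hypothetical, and the two paths in the usage comment are simply the ones mentioned in this section.

```shell
# find_pidfile FILE...: print the first readable file among the
# candidates given, or return a non-zero status if none is readable.
find_pidfile() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1
}

# Usage sketch (run as root on the name server):
#   pidfile=$(find_pidfile /etc/named.pid /var/run/named.pid) &&
#       kill -INT "$(cat "$pidfile")"
```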

Once SIGINT causes named to snapshot its cache to the file, we can then examine the first part of the file to see if the names and addresses of the root servers are correct. For example:

# head -10 /var/tmp/named_dump.db
; Dumped at Wed Sep 18 08:45:58 1991
; --- Cache & Data ---
$ORIGIN .
.       80805   IN      SOA     NS.NIC.DDN.MIL. HOSTMASTER.NIC.DDN.MIL.
		( 910909 10800 900 604800 86400 )
        479912  IN      NS      NS.NIC.DDN.MIL.
        479912  IN      NS      AOS.BRL.MIL.
        479912  IN      NS      A.ISI.EDU.
        479912  IN      NS      C.NYSER.NET.
        479912  IN      NS      TERP.UMD.EDU.

The cache shown above is clean. If intermittent name server problems lead you to suspect a cache corruption problem, examine the cache and check the names and addresses of all the root servers. The following symptoms might indicate a problem with the root server cache:

Incorrect root server names. The section on /etc/named.ca in Chapter 8 explains how you can locate the correct root server names. The easiest way to do this is to get the file domain/named.root from the InterNIC.
No address or an incorrect address for any of the servers. Again, the correct addresses are in domain/named.root.
A name other than root (.) in the name field of the first root server NS record, or the wildcard character (*) occurring in the name field of a root or top-level name server. The structure of NS records is described in Appendix C.

A "bad cache" with multiple errors might look like this:

# head -10 /var/tmp/named_dump.db
; Dumped at Wed Sep 18 08:45:58 1991
; --- Cache & Data ---
$ORIGIN .
arpa   80805   IN     SOA    SRI-NIC.ARPA.  HOSTMASTER.SRI-NIC.ARPA.
		( 910909 10800 900 604800 86400 )
       479912  IN     NS     NS.NIC.DDN.MIL.
       479912  IN     NS     AOS.BRL.MIL.
       479912  IN     NS     A.ISI.EDU.
       479912  IN     NS     C.NYSER.NET.
       479912  IN     NS     TERP.UMD.EDU.
*      479912  IN     NS     NS.FOO.MIL.

This contrived example has three glaring errors. First, the "arpa" entry in the first field of the SOA record is invalid; this is the most infamous form of cache corruption. Second, the last NS record names NS.FOO.MIL., which is not a valid root server. Third, that same record has an asterisk (*) in its first field, which should never appear in a root server record.
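Two of these symptoms are mechanical enough to test for with a script. The sketch below is an illustration, not a complete validator. It assumes the dump format shown in the listings above (name, time-to-live, class, type, data), expects to be given only the first few lines of the dump where the root entries appear, and does not check the server names against the list of valid root servers. The name check_root_cache is hypothetical.

```shell
# check_root_cache: read the first lines of a named cache dump on
# standard input and report two symptoms of root cache corruption:
# an SOA record whose name field is not ".", and an NS record whose
# name field is the wildcard "*". Exits with status 1 if either is found.
check_root_cache() {
    awk '
        /^;/ || /^\$/ { next }      # skip comments and $ORIGIN lines
        /^[ \t]/      { next }      # name field omitted: same name as above
        $4 == "SOA" && $1 != "." {
            printf "line %d: SOA owner is \"%s\", expected \".\"\n", NR, $1
            bad = 1
        }
        $4 == "NS" && $1 == "*" {
            printf "line %d: wildcard (*) owner on NS record\n", NR
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# Usage sketch:
#   head -10 /var/tmp/named_dump.db | check_root_cache
```

Run against the clean dump shown earlier, the function prints nothing. Run against the "bad cache" listing, it reports the arpa SOA record and the wildcard NS record.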

If you see problems like these, force named to reload its cache with the SIGHUP signal as shown below:

# kill -HUP `cat /etc/named.pid`

This clears the cache and reloads the valid root server entries from your named.ca file.

If you know which system is corrupting your cache, instruct your system to ignore updates from the culprit by using the bogusns statement in the /etc/named.boot file. The bogusns statement lists the IP addresses of name servers whose information cannot be trusted. For example, in the previous section we described a problem where acorn.sales.nuts.com (172.16.6.1) was causing cache corruption with improperly formatted HINFO records. The following entry in the named.boot file blocks queries to acorn.sales.nuts.com and thus blocks the cache corruption:

bogusns 172.16.6.1

The bogusns entry is only a temporary measure. It is designed to keep things running until the remote domain administrator has had a chance to diagnose and repair the problem. Once the remote system is fixed, remove the bogusns entry from named.boot.

11.6.4 dig: An Alternative to nslookup

An alternative to nslookup for making name service queries is dig. dig queries are usually entered as single-line commands, while nslookup is usually run as an interactive session. But the dig command performs essentially the same function as nslookup. Which you use is mostly a matter of personal choice. They both work well.

As an example, we'll use dig to ask the root server terp.umd.edu for the NS records for the mit.edu domain. To do this, enter the following command:

% dig @terp.umd.edu mit.edu ns

In this example, @terp.umd.edu is the server that is being queried. The server can be identified by name or IP address. If you're troubleshooting a problem in a remote domain, specify an authoritative server for that domain. In this example we're asking for the names of the servers for a second-level domain (mit.edu), so we ask a root server.

If you don't specify a server explicitly, dig uses the local name server, or the name server defined in the /etc/resolv.conf file. (Chapter 8 describes resolv.conf.) Optionally, you can set the environment variable LOCALRES to the name of an alternate resolv.conf file. This alternate file will then be used in place of /etc/resolv.conf for dig queries. Setting the LOCALRES variable will only affect dig. Other programs that use name service will continue to use /etc/resolv.conf.
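An alternate resolver file for this purpose needs little more than a nameserver line. The sketch below shows one way to create such a file. It is an illustration only: write_alt_resolv is a hypothetical helper, the file name in the usage comment is arbitrary, and versions of dig other than the one described here may ignore LOCALRES.

```shell
# write_alt_resolv FILE ADDRESS...: create an alternate resolver
# configuration file that lists the given name server addresses.
write_alt_resolv() {
    file=$1
    shift
    : > "$file"                       # create or truncate the file
    for addr in "$@"; do
        printf 'nameserver %s\n' "$addr" >> "$file"
    done
}

# Usage sketch (file name chosen for illustration):
#   write_alt_resolv /tmp/resolv.test 172.16.12.1
#   LOCALRES=/tmp/resolv.test dig mit.edu ns
```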

The last item on our sample command line is ns. This is the query type. A query type is a value that requests a specific type of DNS information. It is similar to the value used in nslookup's set type command. Table 11.1 shows the possible dig query types and their meanings.

Table 11.1: dig Query Types
Query Type	DNS Record Requested
a	Address records
any	Any type of record
mx	Mail Exchange records
ns	Name Server records
soa	Start of Authority records
hinfo	Host Info records
axfr	All records in the zone
txt	Text records

Notice that the function of nslookup's ls command is performed by the dig query type axfr.

dig also has an option that is useful for locating a hostname when you have only an IP address. Numeric addresses are more prone to typos than names, so finding the hostname can reduce the user's problems. The in-addr.arpa domain maps addresses to hostnames, and dig provides a simple way to enter in-addr.arpa domain queries. Using the -x option, you can query for a number-to-name conversion without having to manually reverse the numbers and add "in-addr.arpa". For example, to query for the hostname of IP address 18.72.0.3, simply enter:

% dig -x 18.72.0.3

; <<>> DiG 2.1 <<>> -x 
;; res options: init recurs defnam dnsrch
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6
;; flags: qr aa rd ra; Ques: 1, Ans: 1, Auth: 0, Addit: 0
;; QUESTIONS:
;;      3.0.72.18.in-addr.arpa, type = ANY, class = IN

;; ANSWERS:
3.0.72.18.in-addr.arpa. 21600   PTR     BITSY.MIT.EDU.

;; Total query time: 74 msec
;; FROM: peanut to SERVER: default -- 172.16.12.1
;; WHEN: Sat Jul 12 11:12:55 1997
;; MSG SIZE  sent: 40  rcvd: 67

The answer to our query is BITSY.MIT.EDU, but dig displays lots of other output. The first five lines and the last four lines provide information and statistics about the query. For our purposes, the only important information is the answer. [10]

[10] To see a single-line answer to this query, pipe dig's output to grep; e.g., dig -x 18.72.0.3 | grep PTR.
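To make the reversal concrete, the following sketch builds the in-addr.arpa name by hand and pulls the hostname out of dig's output. It is an illustration only. It assumes a dotted-quad IPv4 address and answer lines in which the record type is the next-to-last field, as in the output above, and both function names are hypothetical.

```shell
# reverse_name ADDR: print the in-addr.arpa name for a dotted-quad
# IPv4 address. This is the name that the -x option saves you from
# typing by hand.
reverse_name() {
    printf '%s\n' "$1" |
        awk -F. 'NF == 4 { printf "%s.%s.%s.%s.in-addr.arpa.\n", $4, $3, $2, $1 }'
}

# ptr_names: read dig output on standard input and print the hostname
# from each PTR record, skipping the comment lines that begin with ";".
ptr_names() {
    awk 'NF >= 3 && $1 !~ /^;/ && $(NF-1) == "PTR" { print $NF }'
}

# Usage sketch:
#   reverse_name 18.72.0.3
#   dig -x 18.72.0.3 | ptr_names
```

For the address used above, reverse_name prints 3.0.72.18.in-addr.arpa., which matches the name shown in the QUESTIONS section of the dig output.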

