Chapter 19

Table

19.1 Arrays, Vectors and Tables

Arrays are a generally useful data structure, but they suffer from two important limitations:

The size of the array does not depend on the number of items in it. If the array is too big, it wastes space. If it is too small it might cause an error, or we might have to write code to resize it.
Although the array can contain any type of item, the indices of the array have to be integers. We cannot, for example, use a String to specify an element of an array.

In Section 17.10 we saw how the built-in Vector class solves the first problem. As the user adds items it expands automatically. It is also possible to shrink a Vector so that the capacity is the same as the current size.

But Vectors don't help with the second problem. The indices are still integers.

That's where the Table ADT comes in. The Table is a generalization of the Vector that can use any type as an index. These generalized indices are called keys.

Just as you would use an index to access a value in an array, you use a key to access a value in a Table. So each key is associated with a value, which is why Tables are sometimes called associative arrays.

A common example of a table is a dictionary, which is a table that associates words (the keys) with their definitions (the values). Because of this example Tables are also sometimes called Dictionaries. Also, the association of a particular key with a particular value is called an entry.

19.2 The Table ADT

Like the other ADTs we have looked at, Tables are defined by the set of operations they support:

constructor: Make a new, empty table.
put: Create an entry that associates a value with a key.
get: For a given key, find the corresponding value.
containsKey: Return true if there is an entry in the Table with the given Key.
keys: : Return a collection that contains all the keys in the Table.

19.3 The built-in Hashtable

Java provides an implementation of the Table ADT called Hashtable. It is in the java.util package. Later in the chapter we'll see why it is called Hashtable.

To demonstrate the use of the Hashtable we'll write a short program that traverses a String and counts the number of times each word appears.

We'll create a new class called WordCount that will build the Table and then print its contents. Naturally, each WordCount object contains a Hashtable:

public class WordCount {
Hashtable ht;

public WordCount () {
ht = new Hashtable ();
}
}

The only public methods for WordCount are processLine, which takes a String and adds its words to the Table, and print, which prints the results at the end.

processLine breaks the String into words using a StringTokenizer and passes each word to processWord.

public void processLine (String s) {
StringTokenizer st = new StringTokenizer (s, " ,.");
while (st.hasMoreTokens()) {
String word = st.nextToken();
processWord (word.toLowerCase ());
}
}

The interesting work is in processWord.

public void processWord (String word) {
if (ht.containsKey (word)) {
Integer i = (Integer) ht.get (word);
Integer j = new Integer (i.intValue() + 1);
ht.put (word, j);
} else {
ht.put (word, new Integer (1));
}
}

If the word is already in the table, we get its counter, increment it, and put the new value. Otherwise, we just put a new entry in the table with the counter set to 1.

To print the entries in the table, we need to be able to traverse the keys in the table. Fortunately, the Hashtable implementation provides a method, keys, that returns an Enumeration object we can use. Enumerations are very similar to the Iterators we saw in Section 17.11. Both are abstract classes in the java.util package; you should review the documentation of both. Here's how to use keys to print the contents of the Hashtable:

public void print () {
Enumeration enum = ht.keys ();
while (enum.hasMoreElements ()) {
String key = (String) enum.nextElement ();
Integer value = (Integer) ht.get (key);
System.out.println ("{ " + key + ", " + value + " }");
}
}

Each of the elements of the Enumeration is an Object, but since we know they are keys, we typecast them to be Strings. When we get the values from the Table, they are also Objects, but we know they are counters, so we typecast them to be Integers.

Finally, to count the words in a string:

WordCount wc = new WordCount ();
wc.processLine ("da doo ron ron ron, da doo ron ron");
wc.print ();

The output is

{ ron, 5 }
{ doo, 2 }
{ da, 2 }

The elements of the Enumeration are not in any particular order. The only guarantee is that all the keys in the table will appear.

19.4 A Vector implementation

An easy way to implement the Table ADT is to use a Vector of entries, where each entry is an object that contains a key and a value. These objects are called key-value pairs.

A class definition for a KeyValuePair might look like this:

class KeyValuePair {
Object key, value;

public KeyValuePair (Object key, Object value) {
this.key = key;
this.value = value;
}

public String toString () {
return "{ " + key + ", " + value + " }";
}
}

Then the implementation of Table looks like this:

public class Table {
Vector v;

public Table () {
v = new Vector ();
}
}

To put a new entry in the table, we just add a new KeyValuePair to the Vector:

public void put (Object key, Object value) {
KeyValuePair pair = new KeyValuePair (key, value);
v.add (pair);
}

Then to look up a key in the Table we have to traverse the Vector and find a KeyValuePair with a matching key:

public Object get (Object key) {
Iterator iterator = v.iterator ();
while (iterator.hasNext ()) {
KeyValuePair pair = (KeyValuePair) iterator.next ();
if (key.equals (pair.key)) {
return pair.value;
}
}
return null;
}

The idiom to traverse a Vector is the one we saw in Section 17.11. When we compare keys, we use deep equality (the equals method) rather than shallow equality (the == operator). This allows the key class to specify the definition of equality. In our example, the keys are Strings, so it will use the built-in equals method in the String class.

For most of the built-in classes, the equals method implements deep equality. For some classes, though, it is not easy to define what that means. For example, see the documentation of equals for Doubles.

Because equals is an object method, this implementation of get does not work if key is null. We could handle null as a special case, or we could do what the build-in Hashtable does---simply declare that null is not a legal key.

Speaking of the built-in Hashtable, it's implementation of put is a bit different from ours. If there is already an entry in the table with the given key, put updates it (give it a new value), and returns the old value (or null if there was none. Here is an implementation of their version:

public Object put (Object key, Object value) {
Object result = get (key);
if (result == null) {
KeyValuePair pair = new KeyValuePair (key, value);
v.add (pair);
} else {
update (key, value);
}
return result;
}

The update method is not part of the Table ADT, so it is declared private. It traverses the vector until it finds the right KeyValuePair and then it updates the value field. Notice that we don't have to modify the Vector itself, just one of the objects it contains.

private void update (Object key, Object value) {
Iterator iterator = v.iterator ();
while (iterator.hasNext ()) {
KeyValuePair pair = (KeyValuePair) iterator.next ();
if (key.equals (pair.key)) {
pair.value = value;
break;
}
}
}

The only methods we haven't implemented are containsKey and keys. The containsKey method is almost identical to get except that it returns true or false instead of an object reference or null.

As an exercise, implement keys by building a Vector of keys and returning the elements of the vector. See the documentation of elements in the Vector class for more information.

19.5 The List abstract class

The java.util package defines an abstract class called List that specifies the set of operations a class has to implement in order to be considered (very abstractly) a list. This does not mean, of course, that every class that implements List has to be a linked list.

Not surprisingly, the built-in LinkedList class is a member of the List abstract class. Surprisingly, so is Vector.

The methods in the List definition include add, get and iterator. In fact, all the methods from the Vector class that we used to implement Table are defined in the List abstract class.

That means that instead of a Vector, we could have used any List class. In Table.java we can replace Vector with LinkedList, and the program still works!

This kind of type generality can be useful for tuning the performance of a program. You can write the program in terms of an abstract class like List and then test the program with several different implementations to see which yields the best performance.

19.6 Hash table implementation

The reason that the built-in implementation of the Table ADT is called Hashtable is that it uses a particularly efficient implementation of a Table called a hashtable.

Of course, the whole point of defining an ADT is that it allows us to use an implementation without knowing the details. So it is probably a bad thing that the people who wrote the Java library named this class according to its implementation rather than its ADT, but I suppose of all the bad things they did, this one is pretty small.

Anyhoo, you might be wondering what a hashtable is, and why I say it is particularly efficient. We'll start by analyzing the performance of the List implementation we just did.

Looking at the implementation of put, we see that there are two cases. If the key is not already in the table, then we only have to create a new key-value pair and add it to the List. Both of these are constant-time operations.

In the other case, we have to traverse the List to find the existing key-value pair. That's a linear time operation. For the same reason, get and containsKey are also linear.

Although linear operations are often good enough, we can do better. It turns out that there is a way to implement the Table ADT so that both put and get are constant time operations!

The key is to realize that traversing a list takes time proportional to the length of the list. If we can put an upper bound on the length of the list, then we can put an upper bound on the traverse time, and anything with a fixed upper bound is considered constant time.

But how can we limit the length of the lists without limiting the number of items in the table? By increasing the number of lists. Instead of one long list, we'll keep many short lists.

As long as we know which list to search, we can put a bound on the amount of searching.

19.7 Hash Functions

And that's where hash functions come in. We need some way to look at a key and know, without searching, which list it will be in. We'll assume that the lists are in an array (or Vector) so we can refer to them by index.

The solution is to come up with some mapping---almost any mapping---between the key values and the indices of the lists. For every possible key there has to be a single index, but there might be many keys that map to the same index.

For example, imagine an array of 8 lists and a table made up of keys that are Integers and values that are Strings. It might be tempting to use the intValue of the Integers as indices, since they are the right type, but there are a whole lot of integers that do not fall between 0 and 7, which are the only legal indices.

The modulus operator provides a simple (in terms of code) and efficient (in terms of run time) way to map all the integers into the range (0, 7). The expression

key.intValue() % 8

is guaranteed to produce a value in the range from -7 to 7 (including both). If you take its absolute value (using Math.abs) you will get a legal index.

For other types, we can play similar games. For example, to convert a Character to an integer, we can use the built-in method Character.getNumericValue and for Doubles there is intValue.

For Strings we could get the numeric value of each character and add them up, or instead we might use a shifted sum. To calculate a shifted sum, alternate between adding new values to the accumulator and shifting the accumulator to the left. By "shift to the left" I mean "multiply by a constant."

To see how this works, take the list of numbers { 1, 2, 3, 4, 5, 6 } and compute their shifted sum as follows. First, initialize the accumulator to 0. Then,

Multiply the accumulator by 10.
Add the next element of the list to the accumulator.
Repeat until the list is finished.

As an exercise, write a method that calculates the shifted sum of the numeric values of the characters in a String using a multiplier of 32.

For each type, we can come up with a function that takes values of that type and generates a corresponding integer value. These functions are called hash functions, because they often involve making a hash of the components of the object. The integer value for each object is called its hash code.

There is one other way we might generate a hash code for Java objects. Every Java object provides a method called hashCode that returns an integer that corresponds to that object. For the built-in types, the hashCode method is implemented so that if two objects contain the same data, they will have the same hash code (as in deep equality). The documentation of these methods explains what the hash function is. You should check them out.

For user-defined types, it is up to the implementor to provide an appropriate hash function. The default hash function, provided in the Object class, often uses the location of the object to generate a hash code, so its notion of "sameness" is shallow equality. Most often when we are searching a hash table for a key, shallow equality is not what we want.

Regardless of how the hash code is generated, the last step is to use modulus and absolute value to map the hash code into the range of legal indices.

19.8 Resizing a hash table

Let's review. A Hash table consists of an array (or Vector) of Lists, where each List contains a small number of key-value pairs. To add a new entry to a table, we calculate the hash code of the new key and add the entry to the corresponding List.

To look up a key, we hash it again and search the corresponding list. If the lengths of the lists are bounded then the search time is bounded.

So how do we keep the lists short? Well, one goal is to keep them as balanced as possible, so that there are no very long lists at the same time that others are empty. This is not easy to do perfectly---it depends on how well we chose the hash function---but we can usually do a pretty good job.

Even with perfect balance, the average list length grows linearly with the number of entries, and we have to put a stop to that.

The solution is to keep track of the average number of entries per list, which is called the load factor; if the load factor gets too high, we have to resize the table.

To resize, we create a new table, usually twice as big as the original, take all the entries out of the old one, hash them again, and put them in the new table. Usually we can get away with using the same hash function; we just use a different value for the modulus operator.

19.9 Performance of resizing

How long does it take to resize the table? Clearly it is linear with the number of entries. That means that most of the time put takes constant time, but every once in a while ---when we resize---it takes linear time.

At first that sounds bad. Doesn't that undermine my claim that we can perform put in constant time? Well, frankly, yes. But with a little wheedling, I can fix it.

Since some put operations take longer than others, let's figure out the average time for a put operation. The average is going to be c, the constant time for a simple put, plus an additional term of p, the percentage of the time I have to resize, times kn, the cost of resizing.

t(n) = c + p · kn

I don't know what c and k are, but we can figure out what p is. Imagine that we have just resized the hash table by doubling its size. If there are n entries, then we can add an addition n entries before we have to resize again. So the percentage of the time we have to resize is 1/n.

Plugging into the equation, we get

t(n) = c + 1/n · kn = c + k

In other words, t(n) is constant time!

19.10 Glossary

table: An ADT that defines operations on a collection of entries.
entry: An element in a table that contains a key-value pair.
key: An index, of any type, used to look up values in a table.
value: An element, of any type, stored in a table.
dictionary: Another name for a table.
associative array: Another name for a dictionary.
hash table: A particularly efficient implementation of a table.
hash function: A function that maps values of a certain type onto integers.
hash code: The integer value that corresponds to a given value.
shifted sum: A simple hash function often used for compounds objects like Strings.
load factor: The number of entries in a hashtable divided by the number of lists in the hashtable; i.e. the average number of entries per list.