Hash Functions

Operations using hash tables are efficient because the position of a stored value can be calculated using the key. Hash tables are implemented typically as an array of values. A hash function is used to map a key to an index within this array. This index is where the key's corresponding value is stored. Other search algorithms, such as a linear or binary search, cannot map a key to its value's position as quickly as hash tables. These algorithms must perform more comparisons to find the index of the stored value. The time this search takes to complete increases as the number of stored items increases.


Figure 1 A sample hash function

A mapping of keys to indexes is demonstrated in Figure 1. In this figure, the hash function generates an index based on the rightmost digit of the ASCII value of the second letter of the person's last name. For example, in the name "Hanson, Bob", the second letter is an "a". The ASCII value of "a" is 97. The rightmost digit of 97 is 7. Therefore, the record for "Hanson, Bob" is stored at index 7 of the hash table.


Figure 3 A sample hash table

The hash table in Figure 2 shows each name in its mapped position. The advantage of a hash table is that to find an entry, one only has to apply the hash function to the key.

A popular method used in implementing a hash function is the division method. Like all hash functions, the division method maps a key to an index within the hash table. The division method implementation involves the conversion of a key to a variable of type unsigned int. Then, this value is divided by the size of the hash table. The remainder of this division is used as the index.

1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: class hash_function {   public:   unsigned int mm;   hash_function(unsigned int m = 6151) : mm(m) {}   unsigned int operator()(const string& s) const { unsigned int res = 0; for (unsigned int i = 0; i < s.size(); i++) { res = res * mm + s[i]; } return res; } };
Listing 1A hash function based on the division method

A problem arises when we consider that there are typically many more keys than there are locations within a hash table. This means that a hash function can potentially map two or more distinct keys to the same index. Consider if we tried to add the name "Anderson, Jeff" to the hash table pictured in Figure 2. Applying the hash function yields an index of 0. The name "Adderly, Maria", however, already exists at index 0. When a hash function maps more than one key to the same index a collision occurs.

Hash table implementations are complicated by the fact that they must handle potential collisions. These mechanisms to handle collisions are discussed in detail in chapter 20 of the Weiss textbook. Basically, collisions decrease the performance of hash table operations. The best way to reduce the number of collisions, and thus increase the efficiency of a hash table, is to use a good hash function. A good hash function evenly distributes the mapping of keys throughout the positions in the hash table.

It is good software engineering practice to separate the hash function from the implementation of the hash table. This allows a programmer to customize a hash function to the data stored by a particular instance of a hash table. Listing 1 defines a hash table as a class that overloads operator ().

Memoizing: An Application of Hash Tables

When it is computationally expensive to calculate a function value y = f(x), it may be a good idea to store the value for future use. In other words, we calculate f(x) only the first time the function value on this particular input x is required, and then store ( x, f(x) ) in a hash table. Any future requests for f(x) will then result in a table lookup and be much faster than a recalculation. This technique is called memoizing.

Note that we have to modify the function a bit to take advantage of the stored values. Here is the structure of a memoized function.

int f_memo(int x) {

if ( there is an entry (x,y) in hash table ) {

return y;

}

else {

compute y = f(x) directly

store (x,y) in table

return y;

}

}

Example 1 A typical memoized function

Since the hash table has to persist between function calls, it would be natural to organize this as a class that overloads operator ().