Abstract

The SIGEST article in this issue, “Linear Probing with 5-wise Independence” by Anna Pagh, Rasmus Pagh, and Milan Ružić, is about one form of hashing, a method for fast data retrieval. Hashing is used in operating systems and compilers for symbol storage and management of memory pages and buffers. Database systems and routers use hashing to manage their data structures as well.

Hashing is best explained with an example, and we will use the most common one. Some of you may remember searching for a telephone number with a phone book, a large hard-copy volume listing people alphabetically with each person's phone number to the right of the name. Searching for a number meant opening the book, finding the right starting letter, looking for the next letter, and finally figuring out which of several identically named people was the one you wanted. In general, you would eliminate all but one starting letter on the first pass and then cut the search space roughly in half with every step after that, so the time to find a number is approximately logarithmic in the number of names with the same starting letter. Hashing is a better way to find a phone number, especially with a computer. One allocates storage for a hash table and stores names in the table with a hash function $h$. The hash function maps names to integers, and a name together with the phone number is stored in location $h(name)$ in the table. Looking up the number requires only evaluating $h(name)$ and getting the number. If one can avoid conflicts (not likely), then the work is constant rather than logarithmic. If conflicts are possible, so that $h$(“A. E. Newman”) can be the same as $h$(“L. Trotsky”), then the storage/lookup algorithms and the hash function must have good theoretical properties if the ideal constant lookup time is to be preserved. These theoretical properties are expressed in probabilistic terms, and therefore the complexity results describe the expected cost of adding to the table or looking up an entry.

In general a hash table maps keys (the names in the example) to locations where values (the phone numbers) are stored. Linear probing is one way to organize the storage and lookup. Linear probing tries to put a value in location $h(key)$. If that location is already occupied, one tries $h(key)+k$ for $k = 1, 2, \dots$ (wrapping around the end of the table) until an empty location is found, and the data are stored in that first empty location. It is possible that $h(key)$ lands in a long run of occupied positions (called “pileup” in the paper), and then the performance will be poor. A good hash function can eliminate this problem in the sense that the expected cost per operation is constant. A hash function with uniformly distributed and independent function values would be such a good hash function, but it is very difficult to construct such a function. In many cases, however, a hash function which is random with respect to small sets of keys will suffice. The main result of the paper is that, under some technical assumptions, a hash function which is 5-wise independent gives a constant expected cost per operation; 5-wise independence means that for any five distinct keys, the values of $h$ at those keys are independent, uniformly distributed random variables. The paper also discusses how one can construct such hash functions. Since this paper appeared, others have shown that 4-wise independence is not enough, so the result in the paper is sharp in that sense.

The authors have taken pains in their SIGEST paper to make a very technical topic in computer science accessible to the SIREV readership. The introduction gives you enough information to play with hash functions on your own and experience pileup personally.
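As a starting point for such experiments, here is a minimal Python sketch, not code from the paper: a linear-probing table paired with a hash function drawn from the standard family of random degree-4 polynomials over a prime field, which is 5-wise independent. The table size, the choice of prime, and the deliberately bad constant hash used for comparison are illustrative, and composing with Python's built-in hash to map arbitrary keys into the field is only an approximation.

```python
import random

# A minimal sketch for experimenting with pileup; it is not code from the
# paper. The hash family below (random degree-4 polynomials over a prime
# field, reduced modulo the table size) is the standard 5-wise independent
# construction. PRIME, the table size, and the "bad" hash are illustrative.

PRIME = (1 << 61) - 1  # a Mersenne prime, comfortably larger than the keys


def make_5wise_hash(table_size, rng=random):
    """Draw h(key) = ((c0*x^4 + c1*x^3 + ... + c4) mod PRIME) mod table_size."""
    coeffs = [rng.randrange(PRIME) for _ in range(5)]

    def h(key):
        # Mapping arbitrary keys through Python's built-in hash is only an
        # approximation of hashing true field elements.
        x = hash(key) % PRIME
        acc = 0
        for c in coeffs:  # Horner evaluation of the polynomial
            acc = (acc * x + c) % PRIME
        return acc % table_size

    return h


class LinearProbingTable:
    """Open addressing: try h(key), h(key)+1, h(key)+2, ... modulo the size."""

    def __init__(self, size, hash_fn):
        self.size = size
        self.h = hash_fn
        self.slots = [None] * size  # each slot holds (key, value) or None
        self.probes = 0             # total probes, to make pileup visible

    def _probe(self, key):
        i = self.h(key)
        for step in range(self.size):
            j = (i + step) % self.size
            self.probes += 1
            if self.slots[j] is None or self.slots[j][0] == key:
                return j
        raise RuntimeError("table is full")

    def insert(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def lookup(self, key):
        slot = self.slots[self._probe(key)]
        return None if slot is None else slot[1]


if __name__ == "__main__":
    size = 1024
    good = LinearProbingTable(size, make_5wise_hash(size))
    bad = LinearProbingTable(size, lambda key: 0)  # every key collides
    for n in range(512):                           # fill to half capacity
        good.insert(f"name-{n}", n)
        bad.insert(f"name-{n}", n)
    print("5-wise independent hash, total probes:", good.probes)
    print("constant hash (worst pileup), probes: ", bad.probes)
```

Running it, the constant hash produces one long run of occupied slots and a probe count that grows quadratically with the number of insertions, while the polynomial hash keeps the average number of probes per operation small; that contrast is what the paper's result quantifies.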
