This is a quick post and it’s here more or less to serve as a reminder to myself, in case I make the same mistake again.
A while ago, I required a hash table for a LISP interpreter in C that I was working on as part of the book Build Your Own Lisp by Daniel Holden. Rather than go straight for a library, I opted for this tutorial so that I could also learn more about how open-addressed hash-tables work.
For the sake of it, I wrote some tests with randomized inputs just to make sure everything is set. And the tests failed horribly. Two key bugs emerged:
- Some keys kept on being indexed into the same slot over and over again despite that slot being already taken. The program then entered into an infinite loop. This is exactly the kind of problem double-hashing is designed to solve but in my case, it was failing horribly
- Other keys were being hashed into a negative integer which when used to access a slot in the underlying array resulted in a segmentation fault.
The culprits were these two functions below. When given a string, the current
number of slots (or buckets) in the hash-table and our current attempt at
finding an empty slot (starts at 0), ht_get_hash
returns an integer which
serves as the slot index. It uses ht_hash
to calculate this index:
static const int HT_PRIME_1 = 151;
static const int HT_PRIME_2 = 163;
static int ht_hash(const char *str, const int a, const int m) {
long hash = 0;
const int len_str = strlen(str);
for (int i = 0; i < len_str; i++) {
hash += (long)pow(a, len_str - i + 1) * str[i];
hash = hash % m;
}
return (int)hash;
}
int ht_get_hash(const char *str, const int num_buckets, const int attempt) {
const int hash_a = ht_hash(str, HT_PRIME_1, num_buckets);
const int hash_b = ht_hash(str, HT_PRIME_2, num_buckets);
return (hash_a + (attempt * (hash_b + 1))) % num_buckets;
}
I was so convinced that the author had made some mistake somewhere in the
ht_hash
function or even in ht_get_hash
that I was just about to submit an
issue/bug report. Before doing so though, I ended spending a significant amount
of time reading data-structure books and watching lectures on youtube so as to
get a handle of open-addressing and double-hashing. Everything I came across
confirmed that the author was right as ever. The only factor left was that it
was I instead who made the mistake somewhere. Going through Routley’s code line
by line, I identified my error in the following line:
hash += (long)pow(a, len_str - i + 1) * str[i];
It was supposed to be:
hash += (long)pow(a, len_str - (i + 1)) * str[i];
Moral of the story, ignore operator associativity (and even precedence) at your own peril.