Putting the fun back in C with dictionaries
Previously on this blog, we wrote new classes for array
and string
.
What naturally follows is a so-called associative array or dictionary; a
look-up map for items, indexed by string. They are found in scripting
languages like Perl and Python, and they are convenient especially in the
context of shell scripting because shell scripts deal a lot with processing
command output, ie. strings. Dictionaries are hash-maps; they allow for
easy in-memory key-value stores. There are a few gotchas however, as
we will see.
Dictionary types are container classes that can store items of any type. The basic operations are:
- store item by key
- retrieve item by key
- delete item by key
- is there any item present for given key?
How it works is that a one-way hash value is computed for the key. The hash
value is then used as basis for an index into a backing array. So a
dictionary is really an array that is indexed by a string, hence the
name associative array.
The key is of type string
, but languages like C++ and Go have maps that
support any type of key. In my implementation, keys are always strings.
One advantage of having a fixed type for keys is that you never have to write
complicated type declarations for dictionary variables. Another advantage is
never having to write custom hashing functions, as a good hashing function is
already built-in. So while C++’s and Go’s map
types are technically more
flexible, here I clearly chose ease of use.
[Oddly enough, C++’s map
type is a red-black tree under the hood, a data
structure with very different characteristics. An associative array is more
akin to std::unordered_map
].
The first challenge is choosing a hash function that works well, but is
also simple enough so that it doesn’t take ages to compute hash values.
We are not looking for cryptographic strength here, but we want the hash
function to display a good random distribution (or “spread”) with few
collisions as possible. Funnily enough, the CRC-32 checksum function seems
to do the job, although it’s not the fastest algorithm. Initially I used the
sdbm
hash function (which is pretty decent), but later switched to the much
better FNV-1a. This Fowler-Noll-Vo hash function was especially designed for
fast hashing, and it’s incredibly easy to implement as well. It is centered
around XOR and multiply and two magic numbers:
uint hash_fnv1a(const byte *data, size_t len) {
assert(data != nullptr);
uint h = FNV_offset_basis;
while(len) {
h ^= *data;
h *= FNV_prime;
data++;
len--;
}
return h;
}
An implementation issue with dictionaries is how to deal with collisions. A good hashing function will not quickly collide, but because storage space is limited, collisions in the backing store will be common. Suppose we are storing an item in a dict, and it just so happens the slot in the array is already occupied. Now what to do. One easy solution is linear probing; try and occupy the next slot. If that one is already taken, try the next slot. A drawback of this solution is that you start occupying other slots, which could be reached quicker by other hash values. Finding the right item when retrieving will again involve trying the next slots. Linear probing tends to create “clusters” of items as they stumble over each other. Another technique is double hashing; if the slot is already occupied, calculate a hash of the hash and try that slot. At least you’re jumping out of the way and thus preventing clustering, but you’re using more time computing hashes and it’s not guaranteed to work well. If the array fills up, it may take a lot of tries before finding an empty slot. Another technique is storing collisions in a linked list; this means changing the backing store’s data structure from a straightforward array to an array of linked lists. While gently handling collisions, storing items now means allocating linked list entries, quite an expensive operation.
Although (linear or otherwise) probing is a relatively cheap solution, it does require extra bookkeeping to make item deletion work correctly. For example, consider this line of events taking place:
- insert item A. It lands in slot #3
- insert item B, also slot #3. Probing ends it up in slot #4
- delete item A. Slot #3 becomes empty
- lookup item B. Hash code refers to slot #3, which is empty
If there is nothing in place to say that item A was there before, but it is now deleted, the algorithm would incorrectly conclude that item B does not exist — the found slot is empty, so there is nothing there. However if we put a flag in place that it concerns a deleted entry, the algorithm continues the search and finds item B.
As you can see, the performance of a hash map that employs linear probing degrades over time as it fills up. As a rule of thumb, you should grow the backing store when utilization goes over 70%. Deleting items doesn’t help in regaining that loss in performance. While it’s certainly possible to shrink the backing store when the utilization drops, most implementations don’t bother, as memory is cheap nowadays, and there is little to no point in freeing up that bit of space.
Now that we know how dictionary operations work, let’s have a look at the
internal data structures. In these series we are using C++ to put the fun
back in C, so it is built using template classes. The key-value pairs are
grouped into a dictitem
. A dictitem
caches the hash value so it doesn’t
have to be re-computed all the time. The dict
class has a backing store
which is an array of dictitem
. We make good use of the previously written
grow-able array
and immutable string
classes.
template <typename T>
class dict {
class dictitem {
uint hashvalue;
string key_;
friend class dict<T>;
public:
const string& key;
T value;
};
array<dictitem> items;
public:
// store item
T& operator[](const string& key);
// retrieve item
const T& operator[](const string& key) const;
// item exists
bool has_key(const string& key) const;
// get all keys
array<string> keys(void) const;
};
In principle the dictitem
is internal to dict
. We take extra precautions
to protect the key however. The key is forced to be a read-only member of
dictitem
from the library user’s perspective. You really can not have
someone changing stored keys in a dict
because it would break the container.
(This was already covered by the string
being immutable, but hey).
You can’t directly see it here, but access to dictitem
is still possible
via an iterator. Also note the use of operator[]
in both const
and
non-const forms.
An example of using the dictionary class:
dict<int> bounty;
bounty["Joe"] = 150;
bounty["William"] = 100;
bounty["Jack"] = 75;
bounty["Averell"] = 50;
string name = "Emmett";
int b = bounty[name];
The last few lines carry a problem. What if we query the dictionary for a
non-existent key? Then the statement itself should fail; the assignment is
invalid. In fact, it should throw an exception. My vision for putting the fun
back in C does not include exceptions, and therefore I would let it bomb out
with an assertion—if only it were that simple. In practice, querying a
non-existent key will create a new (empty) value. This undesirable behavior
is actually (sadly) also present in C++’s std::map
.
Strangely enough, C++ is incapable of getting the desired behavior right.
The compiler does not see the operator[]
function on the right-hand side as
const
access unless the dictionary variable already was explicitly marked
const
. Hence the compiler does not select the const
version of the
operator function, and therefore a new entry gets created. From within the
operator[]
function you can not distinguish between left-hand side and
right-hand side use. In C++11 you can mark methods as being lvalue or rvalue
referencing, but I had no success in doing so. This is too bad, because
it makes it impossible to distinguish between create, read, or update actions.
Who knows, maybe this will be addressed in a future C++ standard.
Until then, the way around this is to either resort to get()
and set()
member functions, or to use has_key()
checks every single time.
string name = "Emmett";
if (!bounty.has_key(name)) {
printf("no such key: %s\n", name.c_str());
return false;
}
int b = bounty[name];
The result is code that contains a lot more explicit checks than you are used to in, for example, Python, which is geared towards exceptions.
Finally, because C++, we have an iterator that returns a dictitem
. This
allows looping over a dictionary:
for(auto it : my_dict) {
printf("key: %s value: %d\n", it.key.c_str(), it.value);
}
Although pretty simple on the outside, there are a number of interesting tradeoffs to be made when writing a dictionary class. In the end it does leave me craving more, wondering whether different decisions would turn out more advantageous.