Sunday, September 4, 2016

Putting the fun back in C with dictionaries

Previously on this blog, we wrote new classes for array and string. What naturally follows is a so-called associative array or dictionary; a look-up map for items, indexed by string. They are found in scripting languages like Perl and Python, and they are convenient especially in the context of shell scripting because shell scripts deal a lot with processing command output, ie. strings. Dictionaries are hash-maps; they allow for easy in-memory key-value stores. There are a few gotchas however, as we will see.

Dictionary types are container classes that can store items of any type. The basic operations are:

store item by key
retrieve item by key
delete item by key
is there any item present for given key?

How it works is that a one-way hash value is computed for the key. The hash value is then used as basis for an index into a backing array. So a dictionary is really an array that is indexed by a string, hence the name associative array. The key is of type string, but languages like C++ and Go have maps that support any type of key. In my implementation, keys are always strings. One advantage of having a fixed type for keys is that you never have to write complicated type declarations for dictionary variables. Another advantage is never having to write custom hashing functions, as a good hashing function is already built-in. So while C++’s and Go’s map types are technically more flexible, here I clearly chose ease of use. [Oddly enough, C++’s map type is a red-black tree under the hood, a data structure with very different characteristics. An associative array is more akin to std::unordered_map].

The first challenge is choosing a hash function that works well, but is also simple enough so that it doesn’t take ages to compute hash values. We are not looking for cryptographic strength here, but we want the hash function to display a good random distribution (or “spread”) with few collisions as possible. Funnily enough, the CRC-32 checksum function seems to do the job, although it’s not the fastest algorithm. Initially I used the sdbm hash function (which is pretty decent), but later switched to the much better FNV-1a. This Fowler-Noll-Vo hash function was especially designed for fast hashing, and it’s incredibly easy to implement as well. It is centered around XOR and multiply and two magic numbers:

uint hash_fnv1a(const byte *data, size_t len) {
    assert(data != nullptr);
    uint h = FNV_offset_basis;
    while(len) {
        h ^= *data;
        h *= FNV_prime;
        data++;
        len--;
    }
    return h;
}

An implementation issue with dictionaries is how to deal with collisions. A good hashing function will not quickly collide, but because storage space is limited, collisions in the backing store will be common. Suppose we are storing an item in a dict, and it just so happens the slot in the array is already occupied. Now what to do. One easy solution is linear probing; try and occupy the next slot. If that one is already taken, try the next slot. A drawback of this solution is that you start occupying other slots, which could be reached quicker by other hash values. Finding the right item when retrieving will again involve trying the next slots. Linear probing tends to create “clusters” of items as they stumble over each other. Another technique is double hashing; if the slot is already occupied, calculate a hash of the hash and try that slot. At least you’re jumping out of the way and thus preventing clustering, but you’re using more time computing hashes and it’s not guaranteed to work well. If the array fills up, it may take a lot of tries before finding an empty slot. Another technique is storing collisions in a linked list; this means changing the backing store’s data structure from a straightforward array to an array of linked lists. While gently handling collisions, storing items now means allocating linked list entries, quite an expensive operation.

Although (linear or otherwise) probing is a relatively cheap solution, it does require extra bookkeeping to make item deletion work correctly. For example, consider this line of events taking place:

insert item A. It lands in slot #3
insert item B, also slot #3. Probing ends it up in slot #4
delete item A. Slot #3 becomes empty
lookup item B. Hash code refers to slot #3, which is empty

If there is nothing in place to say that item A was there before, but it is now deleted, the algorithm would incorrectly conclude that item B does not exist — the found slot is empty, so there is nothing there. However if we put a flag in place that it concerns a deleted entry, the algorithm continues the search and finds item B.

As you can see, the performance of a hash map that employs linear probing degrades over time as it fills up. As a rule of thumb, you should grow the backing store when utilization goes over 70%. Deleting items doesn’t help in regaining that loss in performance. While it’s certainly possible to shrink the backing store when the utilization drops, most implementations don’t bother, as memory is cheap nowadays, and there is little to no point in freeing up that bit of space.

Now that we know how dictionary operations work, let’s have a look at the internal data structures. In these series we are using C++ to put the fun back in C, so it is built using template classes. The key-value pairs are grouped into a dictitem. A dictitem caches the hash value so it doesn’t have to be re-computed all the time. The dict class has a backing store which is an array of dictitem. We make good use of the previously written grow-able array and immutable string classes.

template <typename T>
class dict {

    class dictitem {
        uint hashvalue;
        string key_;

        friend class dict<T>;

    public:
        const string& key;
        T value;
    };

    array<dictitem> items;

public:
    // store item
    T& operator[](const string& key);

    // retrieve item
    const T& operator[](const string& key) const;

    // item exists
    bool has_key(const string& key) const;

    // get all keys
    array<string> keys(void) const;
};

In principle the dictitem is internal to dict. We take extra precautions to protect the key however. The key is forced to be a read-only member of dictitem from the library user’s perspective. You really can not have someone changing stored keys in a dict because it would break the container. (This was already covered by the string being immutable, but hey). You can’t directly see it here, but access to dictitem is still possible via an iterator. Also note the use of operator[] in both const and non-const forms.

An example of using the dictionary class:

dict<int> bounty;

bounty["Joe"] = 150;
bounty["William"] = 100;
bounty["Jack"] = 75;
bounty["Averell"] = 50;

string name = "Emmett";
int b = bounty[name];

The last few lines carry a problem. What if we query the dictionary for a non-existent key? Then the statement itself should fail; the assignment is invalid. In fact, it should throw an exception. My vision for putting the fun back in C does not include exceptions, and therefore I would let it bomb out with an assertion—if only it were that simple. In practice, querying a non-existent key will create a new (empty) value. This undesirable behavior is actually (sadly) also present in C++’s std::map.

Strangely enough, C++ is incapable of getting the desired behavior right. The compiler does not see the operator[] function on the right-hand side as const access unless the dictionary variable already was explicitly marked const. Hence the compiler does not select the const version of the operator function, and therefore a new entry gets created. From within the operator[] function you can not distinguish between left-hand side and right-hand side use. In C++11 you can mark methods as being lvalue or rvalue referencing, but I had no success in doing so. This is too bad, because it makes it impossible to distinguish between create, read, or update actions. Who knows, maybe this will be addressed in a future C++ standard. Until then, the way around this is to either resort to get() and set() member functions, or to use has_key() checks every single time.

string name = "Emmett";
if (!bounty.has_key(name)) {
    printf("no such key: %s\n", name.c_str());
    return false;
}
int b = bounty[name];

The result is code that contains a lot more explicit checks than you are used to in, for example, Python, which is geared towards exceptions.

Finally, because C++, we have an iterator that returns a dictitem. This allows looping over a dictionary:

for(auto it : my_dict) {
    printf("key: %s  value: %d\n", it.key.c_str(), it.value);
}

Although pretty simple on the outside, there are a number of interesting tradeoffs to be made when writing a dictionary class. In the end it does leave me craving more, wondering whether different decisions would turn out more advantageous.