The Developer’s Cry

Yet another blog by a hobbyist programmer

Putting the fun back in C with threads

On this blog we’ve been (re)implementing some basic data structures to make programming in C fun again. If there is anything a modern language or library needs these days, it is good multi-threading support. By ‘good’ I mean, of course, easy and fun to use. The C11 standard gave us threads, but compiler and library support has been lacking, and even today you’ll be hard-pressed to find useful information on the subject. In the meantime the world has moved on to C++, which offers std::thread. On POSIX systems these threads are typically implemented on top of POSIX threads, which we have full access to via the C API of libpthread.

Bare pthread programming is a chore, so to put the fun back in C we will wrap the usual suspects in C++ classes: thread, mutex, and condition variable. These classes get an interface much like Python’s threading module. In pseudo-code:

class thread {
    void start(func, void *arg);
    void join();
    void exit();
}

class mutex {
    void lock();
    bool trylock();
    void unlock();
}

class condition_variable {
    bool wait(mutex&, ulong timeout=0UL);
    void notify();
    void notify_all();
}
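To give a taste of the implementation, here is a minimal sketch of what the mutex class might look like when backed by pthread_mutex_t. This is my own sketch, not the blog's actual code, and error handling is omitted for brevity:

```cpp
#include <pthread.h>

// minimal sketch: mutex wrapping pthread_mutex_t (error checking omitted)
class mutex {
    pthread_mutex_t mtx_;

public:
    mutex() { pthread_mutex_init(&mtx_, nullptr); }
    ~mutex() { pthread_mutex_destroy(&mtx_); }

    // can not copy
    mutex(const mutex&) = delete;
    mutex& operator=(const mutex&) = delete;

    void lock() { pthread_mutex_lock(&mtx_); }
    // trylock() returns true when the lock was acquired
    bool trylock() { return pthread_mutex_trylock(&mtx_) == 0; }
    void unlock() { pthread_mutex_unlock(&mtx_); }
};
```

The thread and condition_variable classes would wrap pthread_create/pthread_join and pthread_cond_t in the same spirit.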

Because it makes no sense to copy any of these, the copy constructor and assignment operator are explicitly deleted. Like so:

// can not copy construct
thread(const thread&) = delete;

// can not copy
thread& operator=(const thread&) = delete;

Because C++ is a RAII language, we can make a lockguard object that automatically acquires and releases mutex locks in a critical region.

// critical region
do {
    // note: the guard must be a named object; a bare
    // 'lockguard(global_lock);' creates a temporary that
    // unlocks again at the end of that same statement
    lockguard guard(global_lock);
    if (bad) {
        // mutex will be auto-released
        return;
    }
    do_good();
} while(0);

// implementation of the lockguard
class lockguard {
    mutex& m_;

public:
    explicit lockguard(mutex& m) : m_(m) {
        m_.lock();
    }

    ~lockguard() {
        m_.unlock();
    }

    // copying would cause a double unlock
    lockguard(const lockguard&) = delete;
    lockguard& operator=(const lockguard&) = delete;
};

The lockguard relies on the fact that C++ destroys objects whenever they go out of scope.

Next up: channels. There is a language that totally blew me away when it comes to parallel programming, and it’s Go. In Golang you simply write go(func) to launch a function as a separate thread. Moreover, threads communicate over channels. A channel acts much like a pipe; one end listens for messages, while at the other end messages are put onto the channel. It’s producer-consumer made easy. Go has a confusing syntax for this (the arrow notation) that we won’t mimic—I guess we could use operator << and >> for this, but I prefer to stick to methods chan.put() and chan.get().

I wrote about this earlier way back in 2012, but a channel is basically a work queue. There is a condition variable where the producer signals that an item has been put onto the channel, and consumers wake up so that they may consume the item.

template <typename T>
class chan {
    mutex mtx;
    condition_variable not_full, not_empty;
    array<T> items;
    uint max_items;        // zero means unbounded

public:
    ... constructors and what not ...

    void put(const T& item) {
        mtx.lock();
        if (max_items > 0) {
            // loop guards against spurious wakeups
            while(items.len() >= max_items) {
                not_full.wait(mtx);
            }
        }
        items.append(item);
        not_empty.notify();
        mtx.unlock();
    }

    T get() {
        mtx.lock();
        // loop guards against spurious wakeups
        while(!items.len()) {
            not_empty.wait(mtx);
        }

        T item = items.pop(0);

        not_full.notify();
        mtx.unlock();
        return item;
    }
};

As you can see it’s not too hard to implement a channel once you have a simple API for mutexes and condition variables. Channels make parallel programming tons easier.

The go() function launches a detached thread, and guarantees that the thread is running by the time go() returns to the caller. A detached thread is never joined by its parent, but if you do need such a mechanism, it’s easily implemented using a channel:

void test_go(void *arg) {
    debug("#%u go-routine started", thread_id());
    chan<bool> *done = (chan<bool> *)arg;

    do_long_calculation();

    debug("#%u go-routine ending", thread_id());
    done->put(true);
}

chan<bool> done_channel;
for(int i = 0; i < nthreads; i++) {
    go(test_go, &done_channel);
}
// wait for threads to exit
for(int i = 0; i < nthreads; i++) {
    done_channel.get();
    debug("a go-routine is done");
}

Lastly, I’d like to show how threads are assigned a unique id. You can print the value of pthread_self(), but it won’t look pretty: it’s probably just the address of some internal structure. Instead, I like to give each thread a more intuitive rank number, assigned when the thread launches.

// main thread always has thread_id #0
#if 1
thread_local static uint thread_local_rank = 0;
#else
// older Apple clang does not support 'thread_local'
// but it does know about __thread
__thread static uint thread_local_rank = 0;
#endif

static std::atomic<uint> eternal_rank(0);

Because the rank is local to the thread and eternal_rank is atomic, assigning a rank is just:

thread_local_rank = ++eternal_rank;

The C programming language finds itself being caught up by the times. Multi-core CPUs are no longer the future; they are here now, and parallel code is the way to get at that extra power. Do we need that extra power? Well, yes; after all, we are programming in C for a reason. And let’s keep it fun while we are doing it.