The Developer’s Cry

a blog about computer programming

Putting the fun back in C with strings

The C programming language is without doubt the most influential systems programming language ever. It is, however, notoriously bad when it comes to string handling. A string in C is nothing more than a pointer to a sequence of bytes (traditionally ASCII characters stored in 8-bit bytes) ending with a null byte as terminator. It is simple and powerful. Unfortunately, the power of C leaves much room for human error. The nature of C strings combined with less than perfect programming is a security wasps' nest: buffer overflows in C code have been exploited to break system security.
One reason C strings have no internal length attribute is to keep the core language down to a bare minimum; there are no “string types” at the machine level, so C has no complex string type of its own either. In hindsight, C, or at least the C library, ought to have had a complex string type with an internal length attribute.
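
To make the idea of an internal length attribute concrete, here is a minimal sketch of what such a type could look like. The names lstring, lstring_from() and lstring_copy() are made up for illustration; nothing like them exists in the C library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical length-carrying string; not part of any standard library.
struct lstring {
    std::size_t len;     // number of bytes, known up front
    const char *data;    // bytes; no terminator is needed to find the end
};

// Wrap a C string; the length is measured once, here.
inline lstring lstring_from(const char *s) {
    return lstring{std::strlen(s), s};
}

// Bounded copy: writes at most cap bytes and returns how many were copied.
inline std::size_t lstring_copy(char *dst, std::size_t cap, lstring src) {
    std::size_t n = src.len < cap ? src.len : cap;
    std::memcpy(dst, src.data, n);
    return n;
}
```

Because the length travels with the data, the copy can be clamped to the size of the destination instead of running until it happens to hit a null byte.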

It should be clear that to put the fun back in C, we need a decent string type. In C++ there is std::string, but like all other things in the STL, it has a very unsatisfying interface, especially when compared to other fun languages like Python or Go. It may be reinventing the wheel, but let us try to make a string class that is just as pleasant to use, in C. Things to take notice of:

That last point, immutability, is of course very different from C. Why immutable? Because of copy-on-write: assigning a string means sharing it, and changing it creates a new copy. Making a new copy is a relatively expensive operation. While this certainly incurs a performance penalty, consider it the price to pay for a safe string type. On the upside, we can use immutable strings to implement dictionaries: associative arrays, indexed by string. In principle, you cannot have a mutable string as index to such an array (or is that too much computer science philosophical rambling?)
Immutability can be realized rather easily; last time we implemented copy-on-write arrays. We can leverage that array class to implement strings. As the array class has slicing, strings can now be sliced as well.
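
For readers who missed the previous post, copy-on-write can be sketched roughly as follows. This is not the actual array class; cow_bytes and its members are hypothetical stand-ins built on std::shared_ptr, and the use_count() check is only good enough for single-threaded code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a copy-on-write byte array.
class cow_bytes {
    std::shared_ptr<std::vector<unsigned char>> buf;

    // Make a private copy if the storage is shared with another instance.
    void detach() {
        if (buf.use_count() > 1) {
            buf = std::make_shared<std::vector<unsigned char>>(*buf);
        }
    }

public:
    cow_bytes() : buf(std::make_shared<std::vector<unsigned char>>()) {}

    std::size_t len() const { return buf->size(); }

    // Bounds-checked read access.
    unsigned char operator[](std::size_t idx) const { return buf->at(idx); }

    // Mutating operations detach first, so other holders never see the change.
    void append(unsigned char b) {
        detach();
        buf->push_back(b);
    }

    // True while both instances still point at the same storage.
    bool shares_with(const cow_bytes& other) const { return buf == other.buf; }
};
```

Copying a cow_bytes only copies the pointer, which is the cheap "sharing" part. The expensive copy is postponed until someone calls append() on a shared instance.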

Essentially, the code is a lot like this:

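class string {
    array<byte> arr;    // the copy-on-write array from last time

public:
    // ... constructors and such ...

    // number of bytes in the string
    uint len(void) const { return arr.len(); }

    // read-only access; returns the byte by value
    byte operator[](uint idx) const {
        return arr[idx];
    }

    // slicing is delegated to the array class
    // (const here assumes that array::slice() is const as well)
    string slice(int start, int end) const {
        string s;
        s.arr = arr.slice(start, end);
        return s;
    }
};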

Another thing very different from C is that this kind of string does not have a null terminator. As a consequence, none of the string functions in the standard C library work for the new string type. This is fine as long as we have good alternatives. Functions like strcpy(), strcat(), and strcmp() are replaced with more intuitive operators. Adding strings together with operator+ is pure joy. There seems to be no good alternative, however, for printf(), nor for POSIX functions that operate on filenames, like open(), stat(), or even unlink(). Therefore we provide an escape hatch that converts a string back to a const char pointer, just like C++ does. It involves a hack that does append a null byte after all, and returns a raw pointer to the array. Note that appending a null byte triggers copy-on-write. Although a hack, it is a reasonably safe hack. A trick that I like to use works by virtue of an explicit cast operator, and/or a str() conversion function:

const char *c_str(void) const;

explicit operator const char*() const { return c_str(); }

// outside class string
const char *str(const string& s) { return s.c_str(); }

// and in "user" code:
string name = "Ada";
printf("Name: %s\n", (const char *)name);
printf("Also: %s\n", str(name));
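
To show the pieces working together without the array class at hand, here is a self-contained sketch. Note that it swaps in a different technique: instead of appending a null byte to the copy-on-write array, it copies the bytes into a separate scratch buffer and terminates them there. The class mini_string and its cbuf member are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical mini string: stores its bytes without a terminator.
class mini_string {
    std::vector<char> bytes;          // the content, no trailing '\0'
    mutable std::vector<char> cbuf;   // scratch buffer used only by c_str()

public:
    mini_string() = default;
    mini_string(const char *s) : bytes(s, s + std::strlen(s)) {}

    std::size_t len() const { return bytes.size(); }

    // Concatenation returns a new string; neither operand is changed.
    friend mini_string operator+(const mini_string& a, const mini_string& b) {
        mini_string r = a;
        r.bytes.insert(r.bytes.end(), b.bytes.begin(), b.bytes.end());
        return r;
    }

    // Copy the bytes into the scratch buffer and terminate them there.
    // The pointer stays valid until the next c_str() call on this object.
    const char *c_str() const {
        cbuf.assign(bytes.begin(), bytes.end());
        cbuf.push_back('\0');
        return cbuf.data();
    }

    explicit operator const char *() const { return c_str(); }
};
```

With this in place, mini_string("Ada") + " Lovelace" builds a new string, and both name.c_str() and (const char *)name hand printf() a properly terminated pointer while the stored content stays free of null bytes.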

There are a couple of remarks to make about this example. One is that there is supposedly no place for printf() in C++ code because varargs aren’t type-safe. Another is that old-style type-casts supposedly have no place in C++ either. Well, to hell with that. We are putting the fun back in C. It wouldn’t be fun otherwise.