Putting the fun back in C with strings
The C programming language is without doubt the most influential systems
programming language ever. It is notoriously bad, however, when it comes to
string handling. A string in C is nothing more than a pointer to a sequence
of bytes (traditionally ASCII characters put into 8-bit bytes) ending with
a null byte as terminator. It is simple and powerful. Unfortunately, the
power of C leaves much room for human error. The nature of C strings combined
with less than perfect programming is a security wasps' nest. Buffer overflows
in C code have been exploited to break system security.
One reason C strings have no internal length attribute is that the core
language was kept down to a bare minimum; there are no “string types” at
the machine level, so C has no complex string type of its own either.
In hindsight, C (or at least the C library) ought to have had a complex
string type with an internal length attribute.
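To make that concrete, here is a minimal sketch of what carrying the length along buys you. The names `lstring` and `copy_bounded` are made up for illustration; nothing like them exists in the C library. Because the source length is known up front, the copy can be clamped to the destination size instead of trusting a terminator to show up in time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical length-carrying string: a pointer plus a byte count.
// This is an illustration only, not a type from C or its library.
struct lstring {
    const char *data;
    std::size_t len;
};

// Copy at most cap-1 bytes of src into dst and always terminate.
// Returns the number of bytes copied (excluding the terminator).
std::size_t copy_bounded(char *dst, std::size_t cap, lstring src) {
    assert(dst != nullptr);
    if (cap == 0) {
        return 0;               // no room, not even for the terminator
    }
    std::size_t n = (src.len < cap - 1) ? src.len : cap - 1;
    std::memcpy(dst, src.data, n);
    dst[n] = '\0';
    return n;
}
```

Copying the five bytes of "hello" into a four-byte buffer truncates to "hel" rather than running off the end.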
It should be clear that to put the fun back in C, we need a decent string
type. In C++ there is std::string, but like all other things in the STL,
it has a very unsatisfying interface, especially when compared to other fun
languages like Python or Go. It may be reinventing the wheel, but let us
try and make a string class that is as pleasant to use, in C.
Things to take note of:
- in Python and Go strings are byte arrays
- strings are UTF-8 strings
- strings have a length, which is the number of bytes
- character count is not the same as string length, because of UTF-8
- the subscript operator[] returns the byte value at a position
- Go provides an iterator that returns the character’s Unicode value
- Python also has “ustring”s (UTF-32) where operator[] returns the character
- strings are immutable; changing one creates a new string
That last point is of course very different from C. Why immutable? Because of
copy-on-write: assigning a string means sharing it, and changing it creates
a new copy. Making a new copy is a relatively expensive operation. While this
certainly incurs a performance penalty, consider it the price to pay for a
safe string type. On the upside, we can use immutable strings to implement
dictionaries: associative arrays, indexed by string. In principle, you cannot
have a mutable string as index to such an array (or is that too much computer
science philosophical rambling?)
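Going back to the list for a moment: the difference between byte length and character count is easy to show in code. The sketch below uses std::string purely as a byte container; the function names `utf8_count` and `utf8_next` are made up for this illustration, and both assume the input is valid UTF-8 (no error handling at all).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count characters (code points): every byte that is not a
// continuation byte (bit pattern 10xxxxxx) starts a new character.
std::size_t utf8_count(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

// Decode the code point that starts at byte offset i and advance i
// past it, in the spirit of a Go-style iterator over characters.
char32_t utf8_next(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) {
        return c;                               // plain ASCII
    }
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
    char32_t cp = c & (0x3F >> extra);          // payload bits of the lead byte
    while (extra-- > 0 && i < s.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}
```

For the bytes of "naïve" encoded as UTF-8, size() reports 6 while utf8_count() reports 5, and decoding at byte offset 2 yields U+00EF.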
Immutability can be realized rather easily; last time we implemented
copy-on-write arrays. We can leverage that array class to implement strings.
As the array class has slicing, strings can now be sliced as well.
Essentially, the code is a lot like this:
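    class string {
        array<byte> arr;    // the copy-on-write array from last time

    public:
        ... constructors and such ...

        uint len(void) const { return arr.len(); }

        byte operator[](uint idx) const {
            return arr[idx];
        }

        // slicing is delegated to the array class; it does not modify
        // this string, so it can be const
        string slice(int start, int end) const {
            string s;
            s.arr = arr.slice(start, end);
            return s;
        }
    };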
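Since the array class from last time is not repeated here, the following is a rough stand-in to show the idea, not the real thing. The class name `cow_bytes`, the `set()` and `shares_with()` members, and the use of std::shared_ptr are all assumptions made for this sketch. A slice shares the buffer and only moves a window over it; a write detaches first if the buffer is shared.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

typedef unsigned char byte;

// Hypothetical stand-in for a copy-on-write byte array.
class cow_bytes {
    std::shared_ptr<std::vector<byte>> buf;   // possibly shared storage
    std::size_t off = 0, n = 0;               // window into the storage

public:
    cow_bytes() : buf(std::make_shared<std::vector<byte>>()) {}
    explicit cow_bytes(const std::string& s)
        : buf(std::make_shared<std::vector<byte>>(s.begin(), s.end())),
          n(s.size()) {}

    std::size_t len() const { return n; }

    byte operator[](std::size_t i) const {
        assert(i < n);
        return (*buf)[off + i];
    }

    // Slicing copies no bytes: the result shares the buffer.
    cow_bytes slice(std::size_t start, std::size_t end) const {
        assert(start <= end && end <= n);
        cow_bytes s(*this);
        s.off = off + start;
        s.n = end - start;
        return s;
    }

    // Writing is where the copy happens, and only if we are sharing.
    void set(std::size_t i, byte b) {
        assert(i < n);
        if (buf.use_count() > 1) {
            buf = std::make_shared<std::vector<byte>>(
                buf->begin() + off, buf->begin() + off + n);
            off = 0;
        }
        (*buf)[off + i] = b;
    }

    // Debug helper: do both objects point at the same storage?
    bool shares_with(const cow_bytes& o) const { return buf == o.buf; }
};
```

Slicing "hello" from 1 to 4 gives a three-byte view that still shares storage with the original; the first set() on the slice makes a private copy and leaves the original untouched.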
Another thing very different from C is that this kind of string does not have
a null terminator. As a consequence, none of the string functions in the
standard C library work for the new string type. This is fine as long as
we have good alternatives. Functions like strcpy(), strcat(), strcmp()
are replaced with more intuitive operators. Adding strings together with
operator+ is pure joy. There seems to be no good alternative, however, for
printf(), or for POSIX functions that operate on filenames, like
open(), stat(), or even unlink(). Therefore we provide an escape hatch
to convert a string back to a const char pointer, just like C++ does.
It involves a hack that appends a null byte and returns a raw pointer
to the array. Note that appending a null byte triggers copy-on-write.
Although a hack, it is a reasonably safe hack. A trick that I like to use
works by virtue of an explicit cast operator, and/or a str() conversion
function:
    // inside class string
    const char *c_str(void) const;
    explicit operator const char*() const { return c_str(); }

    // outside class string
    const char *str(const string& s) { return s.c_str(); }

    // and in "user" code:
    string name = "Ada";
    printf("Name: %s\n", (const char *)name);
    printf("Also: %s\n", str(name));
There are a couple of remarks to make about this example. One is that there
is no place for printf() in C++ code because varargs isn’t type-safe.
Another is that old-style type-casts have no place in C++. Well, to hell
with that. We are putting the fun back in C. It wouldn’t be fun otherwise.
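For the curious, the terminator hack behind c_str() might look something like the sketch below. It is deliberately simplified: the class `zstring` is a made-up name, it uses a plain std::vector instead of the copy-on-write array (so there is no sharing to detach from), and the mutable buffer is just one way to let a const member function tuck the null byte in behind the logical end of the string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical, simplified string that shows only the terminator trick.
class zstring {
    mutable std::vector<char> buf;  // may hold one extra '\0' past n
    std::size_t n;                  // logical length in bytes

public:
    zstring(const char *s) : n(0) {
        assert(s != nullptr);
        n = std::strlen(s);
        buf.assign(s, s + n);       // stored without a terminator
    }

    std::size_t len() const { return n; }

    // Append the null byte on demand; the logical length is unchanged.
    const char *c_str() const {
        if (buf.size() == n) {
            buf.push_back('\0');
        }
        return buf.data();
    }

    explicit operator const char*() const { return c_str(); }
};

// Free function, so that call sites read str(name).
const char *str(const zstring& s) { return s.c_str(); }
```

With this in place, both printf("%s\n", str(name)) and the old-style cast work, and len() keeps reporting the length without the terminator.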