Putting the fun back in C with strings
The C programming language is without doubt the most influential systems
programming language ever. It is notoriously bad however when it comes to
string handling. A string in C is nothing more than a pointer to a sequence
of bytes (traditionally ASCII characters put into 8-bit bytes) ending with
a null byte as terminator. It is simple and powerful. Unfortunately, the
power of C leaves much room for human error. The nature of C strings combined
with less than perfect programming is a security wasps' nest. Buffer overflows
in C code have been exploited to break system security.
One reason C strings have no internal length attribute is that the core
language was kept down to a bare minimum; there are no “string types” at
the machine level, so C has no complex string type of its own either.
In hindsight, C, or at least the C library, ought to have had a complex
string type with an internal length attribute.
It should be clear that to put the fun back in C, we need a decent string
type. In C++ there is std::string, but like all other things in the STL,
it has a very unsatisfying interface, especially when compared to other fun
languages like Python or Go. It may be reinventing the wheel, but let us
try and make a string class that is as pleasant to use, in C.
Things to take notice of:
- in Python and Go strings are byte arrays
- strings are UTF-8 strings
- strings have length, which is the number of bytes
- character count is not the same as string length, because of UTF-8
- the subscript operator[] returns the byte value at position
- Go provides an iterator that returns the character’s unicode value
- Python also has “ustring”s (UTF-32) where operator[] returns the character
- strings are immutable; changing it creates a new string
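To make the difference between length and character count concrete, here is a small sketch in C++. The helper name utf8_count is made up for this illustration, and it uses std::string only as a convenient byte container; it also assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 characters (code points) in a byte string.
// Continuation bytes look like 10xxxxxx; every other byte starts a new
// character. Assumes the input is valid UTF-8; no validation is done.
std::size_t utf8_count(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the bytes "caf\xc3\xa9" (the word café), the length is 5 bytes while utf8_count() reports 4 characters, because the é takes two bytes.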
That last point is of course very different from C. Why immutable? Because
copy-on-write; assigning a string means sharing it, and changing it creates
a new copy. Making a new copy is a relatively expensive operation. While this
certainly incurs a performance penalty, consider it the price to pay for a
safe string type. On the upside, we can use immutable strings to implement
dictionaries: associative arrays, indexed by string. In principle, you cannot
have a mutable string as index to such an array, because changing a key after
insertion would leave the entry filed under a stale hash or sort position
(or is that too much computer science philosophical rambling?)
Immutability can be realized rather easily; last time we implemented
copy-on-write arrays. We can leverage that array class to implement strings.
As the array class has slicing, strings can now be sliced as well.
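For readers who missed last time, the idea can be sketched roughly as follows. This is not the array class from the earlier post; cow_array, set() and shares_storage_with() are names invented here, and the sketch assumes a std::shared_ptr to a std::vector as storage, with unsigned half-open slice bounds that are clamped to the array length.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a copy-on-write array; a sketch, not the real one.
template <typename T>
class cow_array {
    std::shared_ptr<std::vector<T>> buf_;  // storage, possibly shared
    std::size_t off_ = 0;                  // where this view starts in buf_
    std::size_t len_ = 0;                  // number of elements in this view

    // Make a private copy of the viewed range if anyone else shares buf_.
    void detach() {
        if (buf_.use_count() > 1) {
            const T* first = buf_->data() + off_;
            buf_ = std::make_shared<std::vector<T>>(first, first + len_);
            off_ = 0;
        }
    }

public:
    cow_array() : buf_(std::make_shared<std::vector<T>>()) {}
    cow_array(const T* data, std::size_t n)
        : buf_(std::make_shared<std::vector<T>>(data, data + n)), len_(n) {}

    std::size_t len() const { return len_; }

    // Reading never copies. No bounds check in this sketch.
    const T& operator[](std::size_t idx) const { return (*buf_)[off_ + idx]; }

    // Writing detaches first, so other holders never see the change.
    void set(std::size_t idx, const T& value) {
        detach();
        (*buf_)[off_ + idx] = value;
    }

    // A slice [start, end) shares the storage; only offset and length differ.
    cow_array slice(std::size_t start, std::size_t end) const {
        if (end > len_) end = len_;
        if (start > end) start = end;
        cow_array s(*this);
        s.off_ = off_ + start;
        s.len_ = end - start;
        return s;
    }

    // Only here to observe the sharing in a test or a debugger.
    bool shares_storage_with(const cow_array& other) const {
        return buf_ == other.buf_;
    }
};
```

Assigning or slicing such an array is cheap because it only copies a pointer and two numbers. The expensive copy happens at the first write, and only if the storage is actually shared. The use_count() check is not meant to be thread-safe.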
Essentially, the code is a lot like this:
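    class string {
        array<byte> arr;    // the copy-on-write byte array from last time
    public:
        // ... constructors and such ...
        // length in bytes, not in characters
        uint len(void) const { return arr.len(); }
        // byte value at position idx
        byte operator[](uint idx) const {
            return arr[idx];
        }
        // slicing does not modify this string, so it is const
        // (this assumes the array's slice() is const as well)
        string slice(int start, int end) const {
            string s;
            s.arr = arr.slice(start, end);
            return s;
        }
    };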
Another thing very different from C is that this kind of string does not have
a null terminator. As a consequence, none of the string functions in the
standard C library work for the new string type. This is fine as long as
we have good alternatives. Functions like strcpy(), strcat(), strcmp()
are replaced with more intuitive operators. Adding strings together with
operator+ is pure joy. There seems to be no good alternative however for
printf(), as well as POSIX functions that operate on filenames, like
open(), stat(), or even unlink(). Therefore we provide an escape
to convert a string back to a const char pointer, just like C++ does.
It involves a hack that appends a null byte and returns a raw pointer
to the array. Note that appending a null byte triggers copy-on-write.
Although a hack, it is a reasonably safe hack. A trick that I like to use
works by virtue of an explicit cast operator, and/or a str() conversion
function:
    // inside class string
    const char *c_str(void) const;
    explicit operator const char*() const { return c_str(); }

    // outside class string
    const char *str(const string& s) { return s.c_str(); }

    // and in "user" code:
    string name = "Ada";
    printf("Name: %s\n", (const char *)name);
    printf("Also: %s\n", str(name));
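How the operators and the null byte hack might fit together is sketched below. This is an illustration under assumptions, not the actual implementation: the class is called bstring here to keep it apart from the real string class, it stores its bytes in a shared std::vector<char> instead of the copy-on-write array, and its constructor assumes a non-null, null-terminated argument.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical sketch of a length-carrying string without a stored terminator.
class bstring {
    // mutable, so that the const c_str() may add the terminator on demand
    mutable std::shared_ptr<std::vector<char>> buf_;
    std::size_t len_ = 0;   // logical length in bytes, terminator excluded

public:
    bstring() : buf_(std::make_shared<std::vector<char>>()) {}
    bstring(const char* s)  // assumes s is non-null and null-terminated
        : buf_(std::make_shared<std::vector<char>>(s, s + std::strlen(s))),
          len_(buf_->size()) {}

    std::size_t len() const { return len_; }
    char operator[](std::size_t idx) const { return (*buf_)[idx]; }

    // Stands in for strcat(): builds a new string, operands stay untouched.
    friend bstring operator+(const bstring& a, const bstring& b) {
        bstring r;
        const char* pa = a.buf_->data();
        const char* pb = b.buf_->data();
        r.buf_->reserve(a.len_ + b.len_);
        r.buf_->insert(r.buf_->end(), pa, pa + a.len_);
        r.buf_->insert(r.buf_->end(), pb, pb + b.len_);
        r.len_ = a.len_ + b.len_;
        return r;
    }

    // Stands in for strcmp(a, b) == 0; compares lengths first, then bytes.
    friend bool operator==(const bstring& a, const bstring& b) {
        const char* pa = a.buf_->data();
        return a.len_ == b.len_ && std::equal(pa, pa + a.len_, b.buf_->data());
    }

    // The hack: store a null byte just past the logical end, then hand out
    // a raw pointer. If the buffer is shared, copy it first (copy-on-write).
    const char* c_str() const {
        if (buf_->size() == len_) {            // no terminator stored yet
            if (buf_.use_count() > 1) {
                buf_ = std::make_shared<std::vector<char>>(*buf_);
            }
            buf_->push_back('\0');
        }
        return buf_->data();
    }

    explicit operator const char*() const { return c_str(); }
};
```

With this sketch, bstring("Ada") + " " + "Lovelace" has a length of 12, and its c_str() can be passed to printf() or open() like any C string. The null byte is never counted in len(), and the returned pointer should only be trusted for as long as the string it came from stays alive.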
There are a couple of remarks to make about the printf() example above. One is
that there is no place for printf() in C++ code because varargs isn’t type-safe.
Another is that old-style type-casts have no place in C++. Well, to hell
with that. We are putting the fun back in C. It wouldn’t be fun otherwise.