Putting the fun back in C with array slicing and dicing
Last month I wrote about putting the fun back in C. This brings me back to a
recurring theme: using C/C++ as if it were a scripting language. About two
years ago I developed the so-called oolib
for C++ that does just this.
The library doesn’t get the attention it presumably deserves, but on the other
hand, I did find it exhibits certain rather big performance problems.
So, it’s back to the drawing board and let’s see what we can do about that.
For putting the fun back in C, what we need is:
- a string class that works well with UTF-8
- a dynamic array class, that preferably works much like a Python list
- a dictionary class
- simple threading API with
go()
and channels
As you can see, I really like borrowing feats from other languages. You can do these things with the STL, but I can’t tell you enough how much I dislike the STL for its horrible API and dreadful syntax. It doesn’t make things easy, nor does it make things fun.
The problem with the string class in oolib
is that its UTF-8 strings decode
to unicode all the time, wasting cycles. Moreover, it’s making too many copies
of objects, copies which in turn are destructed and freed again. All of this
adds up to a significant performance hit. While this could be patched, I’m
going to redesign a couple of things. Before we can write a good string class,
we should have a look at arrays. This is because a string is essentially an
array of bytes.
Arrays in C are just a pointer to a sequence of bytes. It is up to the programmer to get the size right; no bounds checking is done by the language. This is great for close-to-the-metal programming, but we expect an array class to be a little more “friendly”;
- have an
append()
method - have a
pop()
method, as well aspop(index)
that removes and returns an element - bounds checking subscript
operator[]
- adding arrays together with
operator+
- automatic, dynamic sizing when appending, extending, even shrinking
- capacity and length
find()
method that returns an indexsort(sort_func)
method for sortingslice(index1, index2)
method that returns an array slice
That slice()
method is very interesting. An array slice is “a part of an
array”. As such, it can be described with just a pointer and a length.
Because of the way slices work, you can make a high performance pop()
operation that simply decreases the slice’s length. Similarly for pop(0)
;
simply increment the pointer. Another example, imagine that splitting a
string is just slicing it. The resulting array of strings are really just
slices that point at places in the original string.
If we allow access via a raw pointer into the array, then the slice
provides merely a view of a section of the array. Changing an element in the
slice would affect the underlying array. We might also choose to provide a
copy of the array section. This is an important design decision to make,
because it determines how arrays and slices interact, and how we would deal
with them.
For example, in Go arrays are very static, and slices are like raw pointers
into the array. Python slices behave like copies, although array assignment
in Python behaves like a reference.
You’d be surprised in how many different ways one might design the internals of arrays and slices. After lots of rewriting, I finally settled on the following:
- copy-on-write behavior;
- assignment
operator=
makes objects share the same backing store - modifying array content creates a new copy (if the backing store is shared)
- slices are arrays; they are of the same type
- but they have an offset into the backing store
- therefore arrays and slices can be used interchangeably
In pseudo-code:
array {
backend {
T *data;
uint cap;
};
shared_ptr<backend> arr;
uint offset, len;
};
The copy-on-write behavior is implemented with help of std::shared_ptr
.
Although the internals are kind of complicated, it is simply a joy to use. Having a decent array class that acts much like a scripting language puts the fun back in C.
Just for the hack of it (and for interoperability with STL algorithms
),
I also added a random access iterator. Writing STL-compatible iterators is
not funny—maybe a good topic for another blog post, some other time.