The Developer’s Cry

Yet another blog by a hobbyist programmer

File packing

Every now and then I like to watch people code on their Twitch channels. One of them is a gamedev guru Zen-master type, another one is a super talented nerdy kid. One day the guru showed how to pack files together into a single large file. You don’t want to ship a game with many separate sound and texture files dumped in a directory, so pack those files together, and then you can load them all at once. Soon thereafter, the kid did the same thing on his Twitch channel.

Making your own custom pack file format is easy and fun. As is so often the case, it’s all in the details. I have coded custom pack formats a fair number of times now, and I still find it an interesting topic.

The WAD file

In December 1993 the world was shocked by the game DOOM. It’s famous for its brutality, addictive gameplay, and technical prowess. The shareware version of DOOM (first episode: Knee Deep In The Dead) shipped compressed on just two 1.44 MB floppy disks. By far the largest part of it was DOOM1.WAD, a single file of about 4 megabytes (!), which was astronomically large at the time. What was in that file? Simply put, game assets: everything but the game engine.

The file extension .WAD stands for “where is all the data?”. The format of the file is basically this: a concatenation of the data files, plus a directory that serves as an index to the data.
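To make that layout concrete, here is a minimal sketch in Python that reads the directory of a WAD image held in memory. It relies on the commonly documented layout: a 12-byte header (a 4-byte tag, the number of entries, the offset of the directory), with each 16-byte directory entry holding an offset, a size and an 8-byte name. The function names `read_wad_directory` and `build_tiny_wad` are made up for this example, and the builder exists only to produce a toy file to read back.

```python
import struct

def read_wad_directory(data: bytes):
    """Return (kind, [(name, offset, size), ...]) for a WAD image in memory."""
    # Header: 4-byte tag, entry count, file offset of the directory.
    kind, count, dir_offset = struct.unpack_from("<4sii", data, 0)
    if kind not in (b"IWAD", b"PWAD"):
        raise ValueError("not a WAD file")
    lumps = []
    for i in range(count):
        # Directory entry: data offset, data size, 8-byte zero-padded name.
        offset, size, raw_name = struct.unpack_from(
            "<ii8s", data, dir_offset + 16 * i)
        name = raw_name.split(b"\0", 1)[0].decode("ascii")
        lumps.append((name, offset, size))
    return kind.decode("ascii"), lumps

def build_tiny_wad(members):
    """Build a toy WAD image from [(name, payload), ...] (demo helper only)."""
    body = b""
    entries = b""
    for name, payload in members:
        # Data starts right after the 12-byte header.
        entries += struct.pack(
            "<ii8s", 12 + len(body), len(payload), name.encode("ascii"))
        body += payload
    # The directory goes last; the header points at it.
    header = struct.pack("<4sii", b"PWAD", len(members), 12 + len(body))
    return header + body + entries

wad = build_tiny_wad([("E1M1", b"abc"), ("PLAYPAL", b"0123456789")])
kind, lumps = read_wad_directory(wad)
```

Note that the reader has to jump to the directory offset stored in the header before it knows anything about the contents.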

The exact WAD format is too simplistic for practical uses nowadays, but it does teach us one thing: Keep It Simple, Stupid.

Our custom PACK file

Personally I like to put the directory first, rather than at the end as WAD does. It’s purely a personal preference; it just looks much nicer to me.

To construct the pack file, first collect the metadata (name and size) of every member. Since the size of the index is then known, you can compute each member’s offset and write out the index entries. Next append the data members, one by one. When loading the pack file, you can immediately load all index entries without having to seek to where the directory is.
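The two steps can be sketched roughly as follows. This is only an illustration under assumptions of my own: a hypothetical `PACK` magic tag, an entry count in the header, and fixed-size index entries with a 56-byte name field to keep the example short. None of these specifics come from a real format.

```python
import io
import struct

HEADER = struct.Struct("<4sI")    # magic tag, number of entries (assumed layout)
ENTRY = struct.Struct("<56sII")   # name, data offset, data size (assumed layout)
NAME_MAX = 56

def write_pack(out, members):
    """Write [(name, payload), ...] as: header, all index entries, all data."""
    names = [name.encode("utf-8") for name, _ in members]
    if any(len(raw) > NAME_MAX for raw in names):
        raise ValueError("member name too long")
    # The index size is known up front, so every data offset can be computed.
    offset = HEADER.size + ENTRY.size * len(members)
    out.write(HEADER.pack(b"PACK", len(members)))
    for raw, (_, payload) in zip(names, members):
        out.write(ENTRY.pack(raw, offset, len(payload)))
        offset += len(payload)
    # Then append the data members one by one, in index order.
    for _, payload in members:
        out.write(payload)

def read_index(inp):
    """Read the index sequentially from the start of the file; no seek needed."""
    magic, count = HEADER.unpack(inp.read(HEADER.size))
    if magic != b"PACK":
        raise ValueError("not a pack file")
    index = {}
    for _ in range(count):
        raw, offset, size = ENTRY.unpack(inp.read(ENTRY.size))
        index[raw.rstrip(b"\0").decode("utf-8")] = (offset, size)
    return index

buf = io.BytesIO()
write_pack(buf, [("a.png", b"AAAA"), ("b.wav", b"BB")])
blob = buf.getvalue()
index = read_index(io.BytesIO(blob))
```

With two members the index occupies 8 + 2 × 64 bytes, so the first payload starts at offset 136.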

For bonus points, let’s use variable length filenames in index entries. It makes sense to store them as (length, string) pairs. The length can be a single byte if you accept (and check!) that all member names are less than 256 bytes long. The problem is now that index entries are no longer of a fixed size. When loading the pack directory, it is really only a matter of doing buffered I/O right: read into a larger buffer, and grab only what you need.

The filenames are padded so that every index entry starts on a 4-byte boundary. This keeps the numeric fields aligned; if you read them straight out of the buffer through a pointer cast, a misaligned access will crash the program on certain CPU architectures. Moreover, be aware of endianness when loading raw binary numbers (such as the offsets and sizes here): pick one byte order for the file format and convert when loading.
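Here is a minimal sketch of such variable-length entries, assuming one particular layout that the text leaves open: a length byte, the name, zero padding up to a multiple of 4 bytes, then a 32-bit offset and a 32-bit size, both little-endian. The function names are invented for the example. In Python the `struct` module with an explicit `<` takes care of byte order and does not care about alignment; in C you would have to handle both yourself.

```python
import struct

def encode_entry(name: str, offset: int, size: int) -> bytes:
    """Encode one index entry: (length, name), padding, offset, size."""
    raw = name.encode("utf-8")
    if not 0 < len(raw) < 256:
        raise ValueError("member name must be 1..255 bytes long")
    # Pad (length byte + name) up to the next multiple of 4 bytes.
    pad = -(1 + len(raw)) % 4
    return (bytes([len(raw)]) + raw + b"\0" * pad
            + struct.pack("<II", offset, size))

def decode_entries(buf: bytes, count: int):
    """Walk `count` variable-length entries in a buffer read from the file.

    Returns the entries and the number of bytes consumed.
    """
    entries, pos = [], 0
    for _ in range(count):
        name_len = buf[pos]
        name = buf[pos + 1:pos + 1 + name_len].decode("utf-8")
        pos += 1 + name_len
        pos += -pos % 4                      # skip the padding bytes
        offset, size = struct.unpack_from("<II", buf, pos)
        pos += 8
        entries.append((name, offset, size))
    return entries, pos

e1 = encode_entry("a", 100, 5)
e2 = encode_entry("sfx/boom.wav", 105, 7)
entries, used = decode_entries(e1 + e2, 2)
```

Because every entry is a multiple of 4 bytes long, the next entry is aligned whenever the first one is.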

An easy way of having sub folders is simply allowing names to contain a slash. This alone is enough to create the illusion of having folders; there is no real need to create any additional structures in the pack file format. Again, keep it simple.
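To show that flat names really are enough, here is a small sketch that lists the contents of a “folder” purely by looking at name prefixes. The function `list_folder` and the example names are hypothetical.

```python
def list_folder(names, folder):
    """Return the immediate children of `folder`, given flat member names."""
    prefix = folder.strip("/")
    prefix = prefix + "/" if prefix else ""
    children = set()
    for name in names:
        if name.startswith(prefix):
            rest = name[len(prefix):]
            head, sep, _ = rest.partition("/")
            # A remaining slash means `head` is itself a sub folder.
            children.add(head + "/" if sep else head)
    return sorted(children)

names = ["readme.txt", "sfx/boom.wav", "sfx/ui/click.wav", "gfx/wall.png"]
top = list_folder(names, "")
sfx = list_folder(names, "sfx")
```

The pack file itself knows nothing about folders; they only exist in how the loader interprets the names.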

The standard TAR file

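On a different note, a well-known packing format is tar (especially popular on UNIX/Linux). Tar is a tape streaming format; the data is structured as a stream of 512-byte blocks. Consequently tar does not have an explicit directory; it is literally a sequential stream of members, each consisting of a header block holding the metadata, followed by the file data padded out to a whole number of blocks.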

Tar simply appends the next member right after. The archive usually ends with two blocks of zeroes, essentially indicating empty metadata structures.
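Walking such a stream can be sketched as below. It is a deliberately simplified reader: it only looks at the name field (the first 100 bytes of the header) and the size field (12 bytes of octal ASCII at offset 124), and it ignores type flags, the ustar name prefix and all extended headers. The function `index_tar` is made up for this example; the test archive is created with Python’s standard `tarfile` module.

```python
import io
import tarfile

BLOCK = 512

def index_tar(data: bytes):
    """Scan a basic tar image and return {name: (data_offset, size)}."""
    index, pos = {}, 0
    while pos + BLOCK <= len(data):
        header = data[pos:pos + BLOCK]
        if header == b"\0" * BLOCK:          # an all-zero block marks the end
            break
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8")
        digits = header[124:136].strip(b"\0 ").decode("ascii")
        size = int(digits, 8) if digits else 0
        index[name] = (pos + BLOCK, size)
        # Data is padded to whole blocks; the next header follows directly.
        pos += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    return index

# Build a small archive in memory to scan.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for name, payload in [("hello.txt", b"hello"), ("big.bin", b"x" * 600)]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
image = buf.getvalue()
index = index_tar(image)
```

The only way to find the second member is to read the first header and skip over its data.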

When reading back a tar archive, there is no easy way of knowing how many members there are or where to find them, because it lacks an explicit directory. A streaming format has no notion of seeking to an offset; you can only read from beginning to end and see what you find.

That said, you can still code your own tar loader that constructs an index in memory. Writing a tar loader is a nice exercise, but it’s not very productive when you are developing a game. The basic tar format is old-fashioned, and modern tar is full of peculiar details. While it’s certainly possible (I do have a working implementation somewhere), I still can’t really recommend using the tar format for game assets. Besides, it’s much more fun to go the DIY route and make a custom format as outlined above.