The Developer’s Cry

a blog about computer programming

Jingle Bell Rock

For about six months now, we’ve been putting the fun back in C by writing lots of library-like code that is shaping up to be a pretty nice collection of classes and functions to enable quick and easy programming. There are a couple more data structures to be discussed, but I decided not to bore you with that at this moment. After all, the end of the year is nearing and I want it to go out with a bang, so let’s have some fun and write sound code. Not just sane code, but sound & music code that’s actually to be used in a game.

For game code it’s important that we can mix multiple sound effects while the background music is playing. One feature that I really like in some old games (eg, Quake) is that you could pop in your own CD and have it play your favorite music. That’s no longer possible nowadays because nobody owns CDs anymore, or CD-ROM drives for that matter. I did write some code that lets you play mp3s from a directory, but I suppose that’s getting old as well. More and more people are now streaming music through Spotify and the like.

Before we get started with actual code, I first want to do a little warming-up regarding sound programming.

Sound primer

Audio consists of analog waves. These waves are digitized using pulse-code modulation (PCM). Basically it’s just bytes describing a wave. You can literally plot a wave into a buffer and play it, and a tone will sound. Each sample is usually one or two bytes (8 or 16 bits; this is known as the bit depth), and the samples may be in signed or unsigned format. Mono is a single channel, stereo is dual channel; more channels may be used. The data bytes of multichannel sound are stored in an interleaved fashion. The sample rate or frequency is the number of samples taken per second, and is measured in Hertz. For example, one second of 22050 Hz stereo sound uses 44.1 kB of data if the bit depth is 8. Double the size if the bit depth is 16.
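To make the primer a bit more tangible, here is a minimal sketch of plotting a tone into a buffer and of the size arithmetic. The function names pcm_size_bytes() and make_tone() are made up for this example; the output format (signed 16-bit, interleaved stereo) is just one common choice.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed for `seconds` of uncompressed PCM audio.
std::size_t pcm_size_bytes(int sample_rate, int num_channels, int bits_per_sample, int seconds) {
    assert(bits_per_sample % 8 == 0);
    return static_cast<std::size_t>(sample_rate) * num_channels * (bits_per_sample / 8) * seconds;
}

// Plot a sine wave of `tone_hz` into an interleaved, signed 16-bit stereo buffer.
// The amplitude is kept at half of full scale to leave some headroom.
std::vector<std::int16_t> make_tone(int sample_rate, double tone_hz, double seconds) {
    const double two_pi = 6.283185307179586;
    const std::size_t num_frames = static_cast<std::size_t>(sample_rate * seconds);
    std::vector<std::int16_t> pcm(num_frames * 2);

    for (std::size_t i = 0; i < num_frames; i++) {
        double t = static_cast<double>(i) / sample_rate;
        double wave = 0.5 * std::sin(two_pi * tone_hz * t);
        std::int16_t sample = static_cast<std::int16_t>(std::lround(32767.0 * wave));
        pcm[2 * i] = sample;        // left channel
        pcm[2 * i + 1] = sample;    // right channel
    }
    return pcm;
}
```

Calling make_tone(22050, 440.0, 1.0) gives one second of a 440 Hz tone; the left and right samples simply alternate in the buffer, which is what interleaved means.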

A wave with a high frequency (this is wave frequency, not to be confused with sample rate) gives a higher tone than a low-frequency wave. Just for a ball-park figure, 100 Hz is a low bass, and 10,000 Hz is already screeching high. The human ear detects sounds between 20 and 20,000 Hz but this range decreases as one grows older. Cats and dogs can hear much higher frequencies than humans; your pets might react if you go onto YouTube and play some ultrasonic video.

You can add up waves to combine tones, mixing sounds. When adding up waves you have to be careful with overflow: if the sum no longer fits in the sample type it wraps around, and that sounds like harsh crackling. You can totally write your own mixer code, but I didn’t because it’s not necessary to do so, as we will see.
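For the curious, this is roughly what the heart of a home-grown mixer could look like. It is only a sketch, assuming signed 16-bit samples; mix_sample() and mix_into() are hypothetical names, and clamping is just one of several ways to deal with overflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Add two signed 16-bit samples in a wider type, then clamp to the valid range
// so that loud passages clip instead of wrapping around.
std::int16_t mix_sample(std::int16_t a, std::int16_t b) {
    int sum = static_cast<int>(a) + static_cast<int>(b);
    sum = std::max(-32768, std::min(32767, sum));
    return static_cast<std::int16_t>(sum);
}

// Mix `src` into `dst`, sample by sample; `src` must not be longer than `dst`.
void mix_into(std::vector<std::int16_t> &dst, const std::vector<std::int16_t> &src) {
    assert(src.size() <= dst.size());
    for (std::size_t i = 0; i < src.size(); i++) {
        dst[i] = mix_sample(dst[i], src[i]);
    }
}
```

Clamping still distorts when it kicks in, but far less audibly than a wrap-around from +32767 to -32768 would.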

Five minutes of uncompressed 44.1 kHz 16-bit stereo music is about 50 MiB worth of data (44100 samples x 2 channels x 2 bytes x 300 seconds). That is not a whole lot of data, but it was 15 years ago, and it is probably still too much today to fit into a cheap audio controller’s internal buffer. Since we don’t need all music data to be resident all of the time, typically music is streamed, while sound effects (that are short, but play often) are loaded and kept in memory.

Sound files come in various formats. Sound effects typically come in file formats that have some kind of header with meta-data (bit depth, signed or unsigned, number of channels, frequency or sample rate) followed by the PCM data bytes. Music typically comes as compressed data. The compressed data stream is usually wrapped, or packaged in some kind of transport layer.

Music may be compressed lossy or lossless. Think of lossless compression as akin to LZW or ZIP; it is bitwise data compression in which the output exactly matches the original input, down to the very last bit. Lossy audio compression means that the output sounds much like the original input, but bitwise the data is not identical. So how does that work? Remember that sound is like a mixture of waves that were all added up together. These waves can be described with math. By using a fast Fourier transform (FFT) we can find the individual wave primitives that add up to form this sound wave. Describing the strongest wave primitives in math, and dropping the ones you would barely hear anyway, is shorter than storing the wave data; hence compression. This process does not produce an identical byte-stream (but one that does sound alike), and is therefore lossy compression.
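To show what “finding the wave primitives” means, here is a toy example. It uses a naive discrete Fourier transform rather than a real FFT, and it is in no way how an actual MP3 or Vorbis encoder is built; the names dft_magnitudes() and dominant_bin() are made up for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Strength of each frequency bin of a real signal, using a naive O(n^2) DFT.
// Only bins 0 .. n/2 are computed; the rest mirror them for real input.
std::vector<double> dft_magnitudes(const std::vector<double> &signal) {
    const double two_pi = 6.283185307179586;
    const std::size_t n = signal.size();
    assert(n > 0);

    std::vector<double> mags(n / 2 + 1);
    for (std::size_t k = 0; k < mags.size(); k++) {
        double re = 0.0, im = 0.0;
        for (std::size_t t = 0; t < n; t++) {
            double angle = two_pi * k * t / n;
            re += signal[t] * std::cos(angle);
            im -= signal[t] * std::sin(angle);
        }
        mags[k] = std::sqrt(re * re + im * im);
    }
    return mags;
}

// Index of the strongest bin in the spectrum.
std::size_t dominant_bin(const std::vector<double> &mags) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < mags.size(); i++) {
        if (mags[i] > mags[best]) {
            best = i;
        }
    }
    return best;
}
```

With n samples taken at a given sample rate, bin k corresponds to a wave of k * sample_rate / n Hz. Feed it a mix of a loud and a soft tone and the two show up as two peaks; a lossy codec can then decide how precisely each one needs to be stored.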

Uncompressing and processing audio takes some amount of CPU time and that’s why on the interwebs you are advised to use only 22 kHz samples in games. That advice is probably outdated by now. My poor ole ‘puter doesn’t break a sweat even when simultaneously streaming multiple 48 kHz high definition audio tracks from its poor old rotating hard drive, all the whilst maintaining a steady 60 frames per second in graphics. It doesn’t seem to care one bit. Even if your game code is so heavily loaded that this does become a problem (time it to prove it..!) you can easily put the sound code in its own thread. My experience so far is that it isn’t needed, but then again, my games have low polygon counts and don’t stream graphics from the disk as well, like some modern triple-A games do.

Lastly, there is volume (or gain), which is the loudness factor. In real life volume is measured in dB (decibel), and note that it features a logarithmic scale. In programming, the volume is usually a percentage of the maximum, meaning a value between 0 and 100. A sensible setting is 50. All modern desktop OSes let the user control the volume, so the proper use for the volume setting is for fading effects.
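Because decibels are logarithmic and most APIs want a plain linear factor, a small conversion helper comes in handy. The sketch below assumes the usual amplitude relation gain = 10^(dB/20); db_to_gain() and fade_out_gain() are hypothetical helpers, and the fade is a simple linear ramp.

```cpp
#include <cassert>
#include <cmath>

// Convert decibels to a linear amplitude factor: 0 dB -> 1.0, -20 dB -> 0.1
float db_to_gain(float db) {
    return std::pow(10.0f, db / 20.0f);
}

// Gain at `elapsed` seconds into a linear fade-out that lasts `duration` seconds.
float fade_out_gain(float start_gain, float elapsed, float duration) {
    assert(duration > 0.0f);
    if (elapsed <= 0.0f) {
        return start_gain;
    }
    if (elapsed >= duration) {
        return 0.0f;
    }
    return start_gain * (1.0f - elapsed / duration);
}
```

Note that about -6 dB already halves the amplitude, which is why a setting of 50 percent is not nearly as quiet as it sounds on paper.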

As is the case with all devices, programming audio output devices is a very platform-specific matter, so we will use a library that hides the goriest details. There are a number of options for choosing a library that produces sound & music. So let’s make some noise.

SDL2

The SDL2 library supplies some audio functions that allow you to play sound. The lack of good documentation with proper examples makes it difficult to use, however. When you google for examples, you are likely to find code that demonstrates SDL_mixer instead, which is a different kind of animal. Moreover, SDL2 by default only accepts WAV files. Since we also want to play music (like, MP3 music), we won’t get very far with just SDL2.

SDL2_mixer

The easiest way of getting sound & music working is probably SDL2_mixer. Mind the number two, as version 1 is really old and you should have moved on to 2 by now. Even though the API and the documentation are still the same, you need that number 2 when compiling and linking. Short code example (no error checking included..!):

SDL_Init(SDL_INIT_AUDIO);
Mix_OpenAudio(44100, MIX_DEFAULT_FORMAT, 2, 4096);
Mix_Music *music = Mix_LoadMUS(filename);
Mix_PlayMusic(music, -1);
while(Mix_PlayingMusic()) {
    twiddle_thumbs();
}

Seems easy enough. There are caveats though:

SDL_sound

SDL_sound is yet another sound library for the SDL family. It loads various formats of sound files. SDL_sound is mostly a convenience wrapper around other sound libraries; eg. it uses both SMPEG and mpg123 to decode MP3 files, and it uses MikMod to play MOD music tracks.

I have little need for it though. I wonder who uses AIFF, AU or VOC nowadays? Maybe I’m wrong, shrug. I suppose it’s great for all-round play-everything players, but I didn’t find it very appealing for this project.

Note, SDL_sound can be hard to google because Google searches for two words: “SDL sound” and will turn up SDL_mixer examples instead.

WAVE

Since we won’t be using SDL_sound, let’s load up some sound files by ourselves. The WAVE or WAV format is actually encoded in a RIFF format, which is a container for several chunks of meta-data, followed by a final chunk of data. What this means for WAVs is that there is a RIFF header chunk, then there is a meta-data chunk that holds things like sample rate, number of sound channels, etcetera, and finally there is a big chunk that holds the PCM sound data. Loading a WAV file boils down to mapping this struct:

struct WAV {
    static const uint32 ASCII_RIFF = 0x52494646;    // "RIFF"
    static const uint32 ASCII_WAVE = 0x57415645;    // "WAVE"
    static const uint32 ASCII_fmt = 0x666d7420;     // "fmt "
    static const uint32 ASCII_data = 0x64617461;    // "data"

    static const uint16 FMT_PCM = 1;
    static const uint16 MONO = 1;
    static const uint16 STEREO = 2;

    uint32 riff_id;         // "RIFF" header chunk
    uint32 riff_size;       // filesize - 8
    uint32 format;          // "WAVE"

    uint32 fmt_id;          // "fmt " chunk
    uint32 fmt_size;        // 16 for PCM; size of rest of this chunk
    uint16 audio_format;    // PCM = 1; other values indicate compression or other format
    uint16 num_channels;    // 1 = MONO, 2 = STEREO
    uint32 sample_rate;     // in Hz; eg. 44100
    uint32 byte_rate;       // == SampleRate * NumChannels * BitsPerSample/8
    uint16 block_align;     // == NumChannels * BitsPerSample/8
    uint16 bits_per_sample; // 8 bits = 8; 16 bits = 16

    uint32 data_id;         // "data" chunk
    uint32 data_size;       // size of data; NumSamples * NumChannels * BitsPerSample/8
};

What’s weird about this format is that there is lots of redundancy in the contained information. You can use this to double check things, and even to repair broken WAV headers.

In theory there may be more meta-data chunks, but these are rarely encountered out in the wild. A cleaner implementation would use separate structs for the various chunks.
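As an illustration of that cleaner approach, here is a sketch that walks the chunks one by one instead of mapping a single fixed struct, so unknown chunks are simply skipped. It assumes the file has already been read into memory; WavInfo, read_le16(), read_le32() and parse_wav() are names invented for this example. It also uses the redundancy in the header as a sanity check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct WavInfo {
    std::uint16_t audio_format = 0;
    std::uint16_t num_channels = 0;
    std::uint32_t sample_rate = 0;
    std::uint32_t byte_rate = 0;
    std::uint16_t block_align = 0;
    std::uint16_t bits_per_sample = 0;
    std::size_t data_offset = 0;    // where the PCM bytes start
    std::uint32_t data_size = 0;    // number of PCM bytes
};

// WAV numbers are little-endian; decode them byte by byte to stay portable
static std::uint16_t read_le16(const unsigned char *p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

static std::uint32_t read_le32(const unsigned char *p) {
    return p[0] | (p[1] << 8) | (p[2] << 16) | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Returns true if a "fmt " and a "data" chunk were found
// and the redundant header fields agree with each other.
bool parse_wav(const unsigned char *buf, std::size_t len, WavInfo &info) {
    assert(buf != nullptr);
    if (len < 12 || std::memcmp(buf, "RIFF", 4) != 0 || std::memcmp(buf + 8, "WAVE", 4) != 0) {
        return false;
    }
    bool have_fmt = false, have_data = false;
    std::size_t pos = 12;
    while (pos + 8 <= len && !have_data) {
        const unsigned char *chunk = buf + pos;
        std::uint32_t size = read_le32(chunk + 4);
        if (size > len - pos - 8) {
            return false;           // chunk runs past the end of the file
        }
        if (std::memcmp(chunk, "fmt ", 4) == 0 && size >= 16) {
            info.audio_format = read_le16(chunk + 8);
            info.num_channels = read_le16(chunk + 10);
            info.sample_rate = read_le32(chunk + 12);
            info.byte_rate = read_le32(chunk + 16);
            info.block_align = read_le16(chunk + 20);
            info.bits_per_sample = read_le16(chunk + 22);
            have_fmt = true;
        } else if (std::memcmp(chunk, "data", 4) == 0) {
            info.data_offset = pos + 8;
            info.data_size = size;
            have_data = true;
        }
        pos += 8 + size + (size & 1);   // chunks are padded to an even size
    }
    if (!have_fmt || !have_data) {
        return false;
    }
    const std::uint32_t frame_bytes =
        static_cast<std::uint32_t>(info.num_channels) * (info.bits_per_sample / 8u);
    return info.block_align == frame_bytes && info.byte_rate == info.sample_rate * frame_bytes;
}
```

A stricter loader would also insist on audio_format being 1 (PCM) before handing the data to the sound device.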

SMPEG

For playing MP3 music we can use SMPEG. There is a libsmpeg2 (note the number two) that apparently integrates better with SDL2. With SMPEG you use two structures to play music: SMPEG and SMPEG_Info. I have no idea why these two have been separated, it seems clumsy, but whatever.

SMPEG_Info music_info;
SMPEG *music = SMPEG_new(filename, &music_info, 1);

SMPEG_enableaudio(music, 1);
SMPEG_setvolume(music, 50);
SMPEG_play(music);

while(SMPEG_status(music) == SMPEG_PLAYING) {
    twiddle_thumbs();
}

SMPEG_delete(music);

As you can see, SMPEG is super-easy to use. It even can do entirely without SDL, as SMPEG includes its own audio output handling code. Oddly enough documentation is hard to find. In fact, at this very moment I’m wondering how I ever got to write this code at all because I can’t seem to locate the docs.

There is one super-annoying issue with SMPEG and it’s that it doesn’t know the sound frequency of the MP3 file. (Or maybe I don’t know how to get it because documentation is lacking.) So in order to get the frequency I actually wrote some code in accordance with the MPEG spec to get the frequency out. MPEG streams consist of frames, where each frame has its own header. The header starts off with sync bits. This is like having a guarantee that the following data is indeed an MPEG frame. Then there are all sorts of bits describing various meta-data; these bits are often just indices into static, predefined tables that hold the actual values. So in order to get the frequency, we must examine the bits and look up the actual frequency in a table.

int mpeg_frequency(const unsigned char *frame, int data_len) {
    assert(data_len >= 4);

    // frame sync
    if (frame[0] != 0xff || ((frame[1] & 0xe0) != 0xe0)) {
        // bad frame data
        debug("MPEG frame sync not found");
        debug("frame sync bytes: %02x %02x", frame[0], frame[1]);
        return -1;      // same marker the table uses for invalid entries
    }

    int version_bits = (frame[1] >> 3) & 3;
    // just a trick to set the index to the frequency table
    int version_freq = (~version_bits) & 3;

    int freq_bits = (frame[2] >> 2) & 3;

    static const int freq_table[4 * 4] = {
        44100, 22050, -1, 11025,
        48000, 24000, -1, 12000,
        32000, 16000, -1, 8000,
        -1, -1, -1, -1
    };

    return freq_table[freq_bits * 4 + version_freq];
}

It makes no sense to have a library that plays MP3 music, but requires the programmer to go back to the spec to extract the frequency from the MPEG frames. Maybe I missed something.

SMPEG doesn’t do ID3 tags either, so it doesn’t know what song/artist/album is playing. In order to get this working, you might use libid3tag or roll your own, which is what I did. Reading ID3v2 tags is such a lengthy and boring mess that I won’t elaborate on it right now.

In hindsight, SMPEG looks deserted and discontinued. Frankly, libid3tag isn’t all that either. Better look further.

mpg123

A perfectly good (or should I say, much better) alternative for playing MP3 files is mpg123. A code example, error handling stripped for readability:

mpg123_handle *handle = mpg123_new(nullptr, &err);
mpg123_open(handle, filename);
mpg123_getformat(handle, &rate, &num_channels, &encoding);
mpg123_read(handle, (unsigned char *)buf, bufsize, &bytes_read);

...

mpg123_close(handle);
mpg123_delete(handle);

libmpg123 by itself does not output any sound (it only decodes the MPEG stream), but you might use libout123 to output the music.

This library also includes easy to use functions for reading ID3 tags.

I noticed that when you feed a damaged MPEG stream into mpg123 it may burp some messages onto stderr but it will frame sync properly again and just continue playing as if nothing happened. I like this feature, it’s like having a scratched disc, but you can still play it.

libvorbisfile

The MP3 format is surrounded by patents and some people feel it’s dangerous grounds to tread on. There are indeed patents on MP3 content creation. The last patent on MP3 supposedly expires by the end of 2017. Moreover, there are no issues with writing player code. Still, there is plenty of reason to have a look at the open and free Ogg Vorbis format. There is a libvorbis library offering a rather low-level API, and there is the easier libvorbisfile library. I opted for the easy road here. Example code (mind ye, error checking not included):

OggVorbis_File ovfile;

ov_fopen(filename, &ovfile);
rate = ovfile.vi->rate;
num_channels = ovfile.vi->channels;
int current_section;    // we don't actually use this number
bytes_read = ov_read(&ovfile, buf, bufsize, 0, 2, 1, &current_section);

...

ov_clear(&ovfile);

Caution: ov_read() accepts a buffer size; in practice it refuses to decode more than 4096 bytes at a time. So to illustrate: You might use a 64 kiB buffer but ov_read() will return only 4 kiB of data.
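The usual way around this is to keep reading until the buffer is full. The sketch below is deliberately written against a generic callback so it works for any decoder that hands out data in small portions; fill_buffer() and the callback’s shape are assumptions made for this example, not part of libvorbisfile.

```cpp
#include <cassert>
#include <cstddef>

// Keep calling `read_some` until `buf` is full or the stream ends.
// `read_some(dst, max_bytes)` is expected to return the number of bytes written,
// 0 at the end of the stream, or a negative value on error.
template <typename ReadFn>
long fill_buffer(char *buf, std::size_t bufsize, ReadFn read_some) {
    assert(buf != nullptr);
    std::size_t total = 0;
    while (total < bufsize) {
        long n = read_some(buf + total, bufsize - total);
        if (n < 0) {
            return n;       // decoder error; pass it on to the caller
        }
        if (n == 0) {
            break;          // end of stream
        }
        total += static_cast<std::size_t>(n);
    }
    return static_cast<long>(total);
}
```

With libvorbisfile the callback could be a small lambda that forwards to ov_read(&ovfile, dst, (int)max_bytes, 0, 2, 1, &current_section), using the same arguments as in the example above.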

The libvorbisfile API feels a little quirky, but it totally works. Like mpg123, it is just a decoder and it’s up to you to send the buffer to the audio device. So let’s get down to it.

OpenAL

At the beginning of this post, I said there were some caveats to SDL2_mixer. Therefore I went back to OpenAL. It feels a little dated by now, and I have no idea if it’s still being used a lot. In any case, in my opinion OpenAL is professional grade stuff. It can mix as many sounds and music tracks as your computer can handle. It can also do 7.1 surround sound, 3D sound, Doppler effects and you can play with the speed of sound. It’s advanced to the point where it makes SDL2_mixer look like a toy. Don’t get me wrong, I have great respect for the authors of SDL, but OpenAL comes across as truly professional. After all, it started out at Loki Software and was later taken over by Creative Labs, of the well-known SoundBlaster hardware.

OpenAL maybe deserves a post on its own (and I have written about it before) but since this is the great devcry music blog post of the year, I will cover it here.

OpenAL uses sound “channels” (not mono, stereo in this context) which are just like the input channels on a physical mixing console. You can program these with settings to configure how the sound comes out. A sound channel by itself doesn’t do anything; sounds are produced by sound “sources”. The sound data is loaded into “buffers”. So in OpenAL you have to create a sound source, create a buffer, load the data into the buffer, and then you can play the sound on a channel. But not before opening the device, and creating a context object.

const ALCchar *default_device = alcGetString(nullptr, ALC_DEFAULT_DEVICE_SPECIFIER);
ALCdevice *openal_device = alcOpenDevice(default_device);
ALCcontext *openal_context = alcCreateContext(openal_device, nullptr);
alcMakeContextCurrent(openal_context);
alcProcessContext(openal_context);

ALuint openal_src;
alGenSources(1, &openal_src);
alSourcef(openal_src, AL_GAIN, 0.5f);    // volume
alSourcei(openal_src, AL_LOOPING, looping ? AL_TRUE : AL_FALSE);

ALuint openal_buf;
alGenBuffers(1, &openal_buf);
alBufferData(openal_buf, AL_FORMAT_STEREO16, data, data_size, sample_rate);
alSourcei(openal_src, AL_BUFFER, openal_buf);

alSourcePlay(openal_src);

This is just for playing sound. Much like OpenGL, we should call alGetError() now and then to check whether there were any errors.

If we want to play music, we will have to stream it by continuously queuing up buffers. OpenAL will play any buffers that are still queued; it is up to you to unqueue finished buffers and to queue up new buffers to continue playing music. Meanwhile decoding the MPEG or Vorbis stream, of course!

Let’s do this in two parts; first queue up a number of buffers and start playing.

// disable looping over a single buffer
alSourcei(openal_src, AL_LOOPING, AL_FALSE);

// queue up a couple of buffers and start playing

ALuint buffers[num_buffers];
alGenBuffers(num_buffers, buffers);
for(int i = 0; i < num_buffers; i++) {
    // read data with mpg123_read() or ov_read()
    num_bytes = read_data(data);
    alBufferData(buffers[i], AL_FORMAT_STEREO16, data, num_bytes, rate);
}

alSourceQueueBuffers(openal_src, num_buffers, buffers);
alSourcePlay(openal_src);

Next, ask OpenAL how many buffers have finished playing. In a tight 60 fps game loop, it should report that either zero or one buffer has finished, but it’s quite possible that multiple buffers finished playing if we ask less frequently. We must unqueue those buffers, refill them with data, and queue them again so OpenAL keeps playing.

alGetSourcei(openal_src, AL_BUFFERS_PROCESSED, &num_finished);
alSourceUnqueueBuffers(openal_src, num_finished, finished_buffers);

int num_buffered = 0;
for(int i = 0; i < num_finished; i++) {
    // read data with mpg123_read() or ov_read()
    num_bytes = read_data(data);
    alBufferData(finished_buffers[i], AL_FORMAT_STEREO16, data, num_bytes, rate);

    // queue up buffer descriptors
    buffers[num_buffered++] = finished_buffers[i];
}

if (num_buffered) {
    alSourceQueueBuffers(openal_src, num_buffered, buffers);
}

You might be inclined to think that working with just two buffers would suffice, and that you can swap between them, much like programming graphics and swapping the back buffer to front. Not true; when you code it like that, OpenAL will under-run internally and become lost, and you will hear nothing. Remember that one second of audio quickly expands to dozens of kilobytes. An example: highdef 48 kHz audio x 2 channels x 16 bits per sample equals 192 kB. If we use 4 kB buffers, that amounts to about 47 buffers (!) for just one second of music. Using fewer buffers is risky because it may end up in the user hearing nothing at all.
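If you want to play with these numbers, the arithmetic fits in a tiny helper. This is only a back-of-the-envelope sketch; buffers_needed() is a made-up name, a “4 kB” buffer is taken to be 4096 bytes, and a partially filled last buffer is rounded up to a whole one.

```cpp
#include <cassert>
#include <cstddef>

// Number of buffers of `buffer_bytes` each that are needed
// to hold `seconds` of uncompressed PCM audio.
std::size_t buffers_needed(int sample_rate, int num_channels, int bits_per_sample,
                           std::size_t buffer_bytes, double seconds) {
    assert(buffer_bytes > 0);
    const std::size_t bytes_per_second =
        static_cast<std::size_t>(sample_rate) * num_channels * (bits_per_sample / 8);
    const std::size_t total = static_cast<std::size_t>(bytes_per_second * seconds);
    // round up, a partially filled buffer still counts as a buffer
    return (total + buffer_bytes - 1) / buffer_bytes;
}
```

For example, buffers_needed(48000, 2, 16, 4096, 0.25) tells you how many 4 kB buffers it takes to keep a quarter of a second of music queued up.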

Closing curtains

What an adventure, we covered a lot of ground here. After exploring all these libs, I finally settled on an OpenAL based sound system. It is combined with a custom WAV loader, a libmpg123 powered MP3 loader, and it also supports Ogg by virtue of libvorbisfile. And it rocks your socks off.

There are more choices out there, so if you haven’t had enough yet, check out the following:

References