Sunday, August 30, 2009

The end of the loop

Multi-core processors have been on the market for several years now. In commodity hardware we’ve seen dual processor systems (SMPs), four-way SMP systems, dual cores, quad cores, dual quad cores, quad dual cores, quad quad cores, and even the seemingly odd triple cores (XBox 360) and AMD’s 6-core processors. IBM stepped into this this game by putting 8 low power SPEs and a controlling PowerPC into a single chip: the Cell processor that is the engine under the hood in the Playstation 3.

Consider a simple for-loop, running over an array to do some simple operation:

FOR i = 0 TO 100
DO
    arr[i] = arr[i] + 1
DONE

The execution time of this loop is determined as follows: it’s a memory fetch and store, an increment, a loop counter increment, and conditional branch (jump back if the loop counter is less than 100), so it’s about 5 operations times 100, which adds up to 500 ticks. This is a very rough estimate because a modern CPU has some tricks up its sleeve like combining certain operations, but the 500 ticks will be good to illustrate what I’m talking about.

Now imagine you have a dual core system. On a dual core system, you could run this loop on one core, and play a CPU-consuming game on the second core. However, if you wanted to be clever rather than waste your time playing games, you could split the loop and be done with it in half the wallclock time.

LOCAL i                         LOCAL i
FOR i = 0 TO 50                 FOR i = 51 TO 100
DO                              DO
    arr[i] = arr[i] + 1             arr[i] = arr[i] + 1
DONE                            DONE

or you could split it like this, where the first core processes the even elements, and the second core processes the odd elements in the array:

LOCAL i                         LOCAL i
FOR i = 0 TO 100 STEP 2         FOR i = 1 TO 100 STEP 2
DO                              DO
    arr[i] = arr[i] + 1              arr[i] = arr[i] + 1
DONE                            DONE

This code finishes in 250 ticks.
So, by dividing the array up we are enabling it for processing in parallel. This is called data partitioning, and often called a divide and conquer method. Running it in parallel made it twice as fast.

Now imagine you have a hundred cores in your CPU. It doesn’t, it’s insane, but what if technology had advanced so that you’d have a hundred cores in your CPU. If this was the case, then there would be no need to write the loop statement at all. The loop over the array would look something like this:

LOCAL i
arr[i] = arr[i] + 1

This one statement would be processed by a hundred cores in parallel, and the outcome of the program would be identical to outcome we obtained previously. This code would finish in just 3 ticks (fetch, increment, store) and no loop statement. (NB. Maybe the architecture would be such that it would be even less than 3 ticks, but it wouldn’t be much more than just 3 ticks). Compare this to the 500 ticks that we started out with.
For a long time, many believed that this could only be achieved through quantum computing.

The funny thing is, this technique is possible today. A modern graphics card has a hundred cores (and even more than that). Technology has indeed advanced that much. The GPU is a massively parallel processing unit that was made for one purpose only: processing as many pixels (array elements) in parallel as possible. The graphics cards are programmable so that the graphics gurus can do cool things with shaders, which are essentially little programs that compute a value to put into each pixel/array element.
The computing power of graphics cards has been recognized before, but now the graphics card vendors have stepped into the world of high performance computing, literally delivering supercomputer power to your home. Moreover, Apple’s latest release of OSX includes OpenCL, a programming API to harness the power of the NVIDIA chipset that is in the Powerbook.

The API (at the moment, either OpenCL or CUDA) allows you to create and destroy thousands of threads on the fly so you don’t even have to care much how your array is going to be processed by the GPU cores. This is nice because often, you don’t even know how many cores are in the machine that the code is running on.
It changes the way of programming because you must no longer think in terms of loops, but instead you should think only about what it is you want to do with each array element (1).

Of course, it is not all too good to be true; some problems simply can not be parallelized in this way, e.g. due to interdependencies or relations between array elements.
Still, it’s pretty exciting to see supercomputing power coming to your laptop, and I can’t wait until NVIDIA comes up with its next big thing.

This is, of course, already possible with pthreads. The only problem is that when you create a hundred threads on a dual core system, the CPU overhead of creating the threads is much higher than simply looping the loop, and thus slowing the program down. If you had the cores ready to run at your disposal, you wouldn’t have this problem and you could run the code in a massively parallel manner without any overhead.