One of the really cool features of Mac OS X Snow Leopard is a new set of
technologies to make developing for multicore systems easier, named Grand
Central Dispatch. There has been a lot of hype about what they are, and until
recently you needed an NDA to discuss them. With the release of Snow
Leopard it is now possible to discuss those technologies publicly.
Not so new...
The first thing I need to say is that the assorted technologies involved with
GCD are not new. It is a number of existing technologies that have been
combined in a fairly innovative way and with a killer implementation.
I think the best way to explain this is to look at Apple's approach to devices
like the iPod and iPhone. Apple has a reputation for innovative devices, but
usually they don't innovate by inventing a category, they reinvent it. In
other words they find some market that exists but is poorly served, and they
throw their resources at it, blowing away the existing competition. Sometimes
they own the category, sometimes new competitors come in to try to compete on
the raised playing field.
The simplistic explanation is that GCD is a combination of thread pooling
and closures. The honest truth is that if thread pooling is a Motorola
ROKR then GCD is an iPhone 3GS... while they are both phones and nominally do
the same thing, no one would ever claim they are equal, or even in the same
league. The guys at Apple who came up with it deserve a ton of credit.
First off clos^H^H^H^H^H Blocks
The first piece of GCD is Apple's new C extension, Blocks. In short, blocks
are closures. For those of you are into the all the hot new scripting
languages, closures should not be anything new. For all the old LISP heads
reading this, closures are in fact a very old thing. For everyone who spends
most of their time writing C and C++ code, the basic idea is that a closure is
a bundle of a function pointer and a context pointer. So, everywhere you setup
a callback function and pass in a context pointer that gets passed into that
function you are manually forming a closure. The big difference with language
level support is the syntax:
//Without blocks
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
void myCallback(void **context) {
uintptr_t number = (uintptr_t)(*context);
printf("Out: %" PRIuPTR "\n", number);
*context = (void *)(number + 1);
}
void call_100_times(void (*callback)(void **context)) {
void *context = NULL;
uint32_t i;
for (i = 0; i < 100; i++) {
callback(&context);
}
}
int main(void) {
call_100_times(myCallback);
return 0;
}
vs
//With blocks
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
void call_100_times(void (^callback)()) {
uint32_t i;
for (i = 0; i < 100; i++) {
callback();
}
}
int main(void) {
__block uint32_t context = 0;
call_100_times(^{
printf("Out: %" PRIu32 "\n", context);
i++;
});
return 0;
}
So, lets do a quick analysis of the changes. First, rather than writing a
function and passing a pointer to it into call_100_times() I was able to write
the block right at the call site of the function I am passing it into, which
makes things easier to read.
Also, the function takes an explicit context pointer which I need to cast
around into what I know it is. In more complex case you may need to build
structs to pass into function. Our block doesn't take any parameters (they
can, but often they do not need to). Instead it captures the context variable
from the scope it is declared in. The reason we declare context as a "__block
uint32_t" is to tell the compiler that when blocks read and write to it they
should share the one from that scope, instead of getting their own copies. The
result is a very logical construction where I declare a variable with in an
interior scope that I pass into another function.
At the end of the day it is basically syntactic sugar with a small amount of
runtime code to deal with copying blocks to and from the heap, etc. But it
makes a huge difference when you are trying to write code.
Apple actually publicly disclosed Blocks a long time ago, and several good
blog posts about them exist. They have been proposed to WG14 for
inclusion in C1X.
libdispatch
The other part of GCD is libdispatch. As I said earlier, GCD is basically a
combination of closures and thread pooling. Since Blocks are Apple's closure
implementation, it is reasonable to assume libdispatch is their thread pooling
implementation, and that is exactly what it is.
For those unfamiliar, the basic concept of a thread pool is that rather than
having specific threads for specific tasks, you instead have a pool of generic
worker threads. The program issues chunks of work (traditionally as a function
pointer and a context pointer, but in this case generally a Block) to a queue,
the worker threads pull the work units off of the queue and runs them.
This approach has a number of advantages over fixed partitioning. First, it is
really hard to estimate how much work each thread should do. With a thread
pool the system handles that dynamically, it can spawn as many threads as it
wants and pull the work units off the queue, which tends to result in much
better CPU utilization. It can also shutdown threads if the queue is drained,
which lets the system reclaim dormant resources. All the dispatch logic runs
very quickly with no contention, since wait-free queues are well understood.
libdispatch is a new system library provided with Snow Leopard. It provides a
system wide set of thread pool APIs. The default use for most users will be to
create sequential queues, which issue blocks in order and wait until one block
has completed before the next one begins. That means that you can create a
queue and issue your critical sections to it, guaranteeing mutual exclusion
without taking a lock (though the code now runs asynchronously). Lets revisit
our example above, but with 2 additional constraints. First, lets run our
block on a queue. Second, lets assume that printf is only safe to run from the
main thread (printf's thread safety is actually unspecified in the C standard
since C does not actually specify anything about threading or memory models,
but as a practical matter most implementations are thread safe).
//With blocks and libdispatch
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
#include
//We add a completion handler that gets called when all the work is done,
//it is just another block
void call_100_times(void (^callback)(void), void (^completion)(void)) {
uint32_t i;
//Create a queue to issue blocks against
dispatch_queue_t queue = dispatch_queue_create("com.blogspot.devwhy", NULL);
for (i = 0; i < 100; i++) {
dispatch_async(queue, callback);
}
dispatch_async(queue, completion);
dispatch_release(queue);
}
int main(void) {
__block uint32_t i = 0;
dispatch_queue_t main_queue = dispatch_get_main_queue();
call_100_times(^{
//Issue an asynch printf from whatever queue we are on to the main queue
dispatch_async(main_queue, ^{
printf("Out: %" PRIu32 "\n", i);
i++;
});
},
^{
//Need to explicitly call exit now that the end of main is never reached
//We do it here in the completion block
dispatch_async(main_queue, ^{
exit(0);
});
});
//Start the dispatcher
dispatch_main();
//NOT REACHED
}
Okay, so that is a lot more code than before, but it has the benefit of
allowing some of the code run other CPUs, while still keeping everything in
order. I also never created a thread or a lock! In short, we made the whole
thing run asynchronously, sort of how jQuery works. First we call
call_100_times(), which creates a dispatch queue and adds the 100 blocks to
it. Those blocks will be handled in a completely serial manner (when one is
finished the next is started) by libdispatch, so they will only use one CPU at
a time, though the queue may migrate between threads and CPUs as appropriate.
Obviously this is a toy example that might not generate enough work for GCD to
bother spinning up a second thread, but it can if it deems that to be
advantageous, and it requires no programmer intervention.
We then call dispatch_main(), which starts the queues. As the blocks are
dequeued and run they start queuing the printf()'s onto the main queue (which
is a special queue that runs exclusively on the main thread), followed by the
exit() from the completion block. As those blocks are running the main queue
starts dequeuing its blocks and printing stuff out, so now our program is
doing the adding on one thread and the writing to the screen on another.
Once the completion block is dequeued it sends a block containing exit() to
the main queue. Once the main queue finishes all the proceeding printf blocks
it calls the exit() block and the app terminates. So, what we have done here
shows how we can implement critical sections without using locks, and queue
things to run on the main thread in an appropriate order. Another big benefit
of doing everything asynchronously is that it greatly cuts down on spinning
beachballs, since those are generally caused by something synchronously
waiting on the main thread.
Besides this, there are several other features of libdispatch, such as
concurrent queues (which do not guarantee blocks they issue complete before
the next one is run), semaphores, event sources, finalizers, and all of the
other frills necessary to push everything through GCD. The manpages for GCD
are very good, I recommend reading them: man 3 dispatch
Now, looking at the above, you might be thinking that it is very cool, and
while thinking up all of the bits must have been difficult, it seems fairly
simple to implement. That where Apple's integration expertise comes in. Not
only have they tuned the hell out of the library and made tons of
optimizations, but they also have implemented kernel hooks, and made their
dispatcher system aware so it can manage threads based on the overall system
state. That means instead of every program checking how many CPUs the system
has and trying to consume them all (causing contention between programs and
wasting time in the scheduler) the system selects how many threads should be
running based on the global system state and divys them up between all the
programs on the system.
Apple has released a PDF that provides a high level overview of GCD,
including a number of nice pictures.
Downsides
Any major shift has some downsides, though in GCDs case they seem pretty
small. The first downside is that makes programmers change some of their
fundamental assumptions. The most obvious is that when a function returns that
no longer means that the code you declared inside it has necessarily run.
Moving to this sort of asynchronous model takes some getting used to, but
honestly it is far simpler than understanding threads and locks. Of course if
you are already familiar with traditional threading it is something new to
learn.
Another issue is that the interactions with traditional threaded code can be
complicated. The unfortunate truth is that a lot of code that would have been
better implemented using GCD already exists and is using threads, and
converting it might not be worth the effort. That means GCD code may need to
interact with traditional threaded code. In general that should work fine with
some minor effort, especially if they existing code uses a single threaded
facade. If you need to actually directly manipulate threads from inside GCD
queues there are a lot of constraints, but they are well documented in the
manpages.
Finally, the biggest issue for me is that GCD is not cross platform. Apple has
released the Blocks patches for GCC, include Blocks in CLANG, and implemented
a C to C++ source level transformer ("clang-cc -rewrite-blocks") that rewrites
blocks code into traditional C++ code, as well as an open sourced Blocks
runtime. Unfortunately they have not (yet) been so forth coming with
libdispatch implementation. While it might a competitive advantage, I suspect
that getting it freely available so that the various open source packages
Apple uses can consider moving to GCD would be a significant benefit.
Ultimately I think this is a temporary issue, because the actual blocks API is
not that complex to implement if one only cares about compatibility.
Extracting all of the performance Apple has gotten through their system wide
implementation and optimization would probably require quite a bit of work,
which would also work in Apple's favor (GCD is good everywhere, but best on OS
X!).
Conclusion
GCD will improve the utilization of multiple processors, and reduce the number
of spinning beachballs (even in what are nominally single threaded programs).
It is not a panacea, and how effective is depends on how pervasively it used.
I suspect it will end up being quite pervasive, because the paradigm it is
trying to replace have proved difficult to use.