Twiiter

Twitter Updates

    follow me on Twitter
    Loading..
    Loading..
    « Apple has opensourced libdispatch | Main | From write() down to the flash chips »
    Friday
    Aug282009

    Grand Central Dispatch

    One of the really cool features of Mac OS X Snow Leopard is a new set of technologies to make developing for multicore systems easier, named Grand Central Dispatch. There has been a lot of hype about what they are, and until recently you needed an NDA to discuss them. With the release of Snow Leopard it is now possible to discuss those technologies publicly.

    Not so new...

    The first thing I need to say is that the assorted technologies involved with GCD are not new. It is a number of existing technologies that have been combined in a fairly innovative way and with a killer implementation.

    I think the best way to explain this is to look at Apple's approach to devices like the iPod and iPhone. Apple has a reputation for innovative devices, but usually they don't innovate by inventing a category, they reinvent it. In other words they find some market that exists but is poorly served, and they throw their resources at it, blowing away the existing competition. Sometimes they own the category, sometimes new competitors come in to try to compete on the raised playing field.

    The simplistic explanation is that GCD is a combination of thread pooling and closures. The honest truth is that if thread pooling is a Motorola ROKR then GCD is an iPhone 3GS... while they are both phones and nominally do the same thing, no one would ever claim they are equal, or even in the same league. The guys at Apple who came up with it deserve a ton of credit.

    First off clos^H^H^H^H^H Blocks

    The first piece of GCD is Apple's new C extension, Blocks. In short, blocks are closures. For those of you are into the all the hot new scripting languages, closures should not be anything new. For all the old LISP heads reading this, closures are in fact a very old thing. For everyone who spends most of their time writing C and C++ code, the basic idea is that a closure is a bundle of a function pointer and a context pointer. So, everywhere you setup a callback function and pass in a context pointer that gets passed into that function you are manually forming a closure. The big difference with language level support is the syntax:

    //Without blocks
    
    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>
    
    void myCallback(void **context) {
      uintptr_t number = (uintptr_t)(*context);
      printf("Out: %" PRIuPTR "\n", number);
      *context = (void *)(number + 1);
    }
    
    void call_100_times(void (*callback)(void **context)) {
      void *context = NULL;
      uint32_t i;
    
      for (i = 0; i < 100; i++) {
        callback(&context);
      }
    }
    
    int main(void) {
      call_100_times(myCallback);
    
      return 0;
    }
    

    vs

    //With blocks
    
    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>
    
    void call_100_times(void (^callback)()) {
      uint32_t i;
    
      for (i = 0; i < 100; i++) {
        callback();
      }
    }
    
    int main(void) {
      __block uint32_t context = 0;
    
      call_100_times(^{
        printf("Out: %" PRIu32 "\n", context);
        i++;
      });
    
      return 0;
    }
    

    So, lets do a quick analysis of the changes. First, rather than writing a function and passing a pointer to it into call_100_times() I was able to write the block right at the call site of the function I am passing it into, which makes things easier to read.

    Also, the function takes an explicit context pointer which I need to cast around into what I know it is. In more complex case you may need to build structs to pass into function. Our block doesn't take any parameters (they can, but often they do not need to). Instead it captures the context variable from the scope it is declared in. The reason we declare context as a "__block uint32_t" is to tell the compiler that when blocks read and write to it they should share the one from that scope, instead of getting their own copies. The result is a very logical construction where I declare a variable with in an interior scope that I pass into another function.

    At the end of the day it is basically syntactic sugar with a small amount of runtime code to deal with copying blocks to and from the heap, etc. But it makes a huge difference when you are trying to write code.

    Apple actually publicly disclosed Blocks a long time ago, and several good blog posts about them exist. They have been proposed to WG14 for inclusion in C1X.

    libdispatch

    The other part of GCD is libdispatch. As I said earlier, GCD is basically a combination of closures and thread pooling. Since Blocks are Apple's closure implementation, it is reasonable to assume libdispatch is their thread pooling implementation, and that is exactly what it is.

    For those unfamiliar, the basic concept of a thread pool is that rather than having specific threads for specific tasks, you instead have a pool of generic worker threads. The program issues chunks of work (traditionally as a function pointer and a context pointer, but in this case generally a Block) to a queue, the worker threads pull the work units off of the queue and runs them.

    This approach has a number of advantages over fixed partitioning. First, it is really hard to estimate how much work each thread should do. With a thread pool the system handles that dynamically, it can spawn as many threads as it wants and pull the work units off the queue, which tends to result in much better CPU utilization. It can also shutdown threads if the queue is drained, which lets the system reclaim dormant resources. All the dispatch logic runs very quickly with no contention, since wait-free queues are well understood.

    libdispatch is a new system library provided with Snow Leopard. It provides a system wide set of thread pool APIs. The default use for most users will be to create sequential queues, which issue blocks in order and wait until one block has completed before the next one begins. That means that you can create a queue and issue your critical sections to it, guaranteeing mutual exclusion without taking a lock (though the code now runs asynchronously). Lets revisit our example above, but with 2 additional constraints. First, lets run our block on a queue. Second, lets assume that printf is only safe to run from the main thread (printf's thread safety is actually unspecified in the C standard since C does not actually specify anything about threading or memory models, but as a practical matter most implementations are thread safe).

    //With blocks and libdispatch
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <inttypes.h>
    
    #include 
    
    //We add a completion handler that gets called when all the work is done,
    //it is just another block
    
    void call_100_times(void (^callback)(void), void (^completion)(void)) {
      uint32_t i;
    
      //Create a queue to issue blocks against
      dispatch_queue_t queue = dispatch_queue_create("com.blogspot.devwhy", NULL);
    
      for (i = 0; i < 100; i++) {
        dispatch_async(queue, callback);
      }
    
      dispatch_async(queue, completion);
      dispatch_release(queue);
    }
    
    int main(void) {
      __block uint32_t i = 0;
      dispatch_queue_t main_queue = dispatch_get_main_queue();
    
      call_100_times(^{
        //Issue an asynch printf from whatever queue we are on to the main queue
    
        dispatch_async(main_queue, ^{
          printf("Out: %" PRIu32 "\n", i);
          i++;
        });
      },
      ^{
        //Need to explicitly call exit now that the end of main is never reached
        //We do it here in the completion block
    
        dispatch_async(main_queue, ^{
          exit(0);
        });
      });
    
      //Start the dispatcher
      dispatch_main();
    
      //NOT REACHED
    }
    

    Okay, so that is a lot more code than before, but it has the benefit of allowing some of the code run other CPUs, while still keeping everything in order. I also never created a thread or a lock! In short, we made the whole thing run asynchronously, sort of how jQuery works. First we call call_100_times(), which creates a dispatch queue and adds the 100 blocks to it. Those blocks will be handled in a completely serial manner (when one is finished the next is started) by libdispatch, so they will only use one CPU at a time, though the queue may migrate between threads and CPUs as appropriate. Obviously this is a toy example that might not generate enough work for GCD to bother spinning up a second thread, but it can if it deems that to be advantageous, and it requires no programmer intervention.

    We then call dispatch_main(), which starts the queues. As the blocks are dequeued and run they start queuing the printf()'s onto the main queue (which is a special queue that runs exclusively on the main thread), followed by the exit() from the completion block. As those blocks are running the main queue starts dequeuing its blocks and printing stuff out, so now our program is doing the adding on one thread and the writing to the screen on another.

    Once the completion block is dequeued it sends a block containing exit() to the main queue. Once the main queue finishes all the proceeding printf blocks it calls the exit() block and the app terminates. So, what we have done here shows how we can implement critical sections without using locks, and queue things to run on the main thread in an appropriate order. Another big benefit of doing everything asynchronously is that it greatly cuts down on spinning beachballs, since those are generally caused by something synchronously waiting on the main thread.

    Besides this, there are several other features of libdispatch, such as concurrent queues (which do not guarantee blocks they issue complete before the next one is run), semaphores, event sources, finalizers, and all of the other frills necessary to push everything through GCD. The manpages for GCD are very good, I recommend reading them: man 3 dispatch

    Now, looking at the above, you might be thinking that it is very cool, and while thinking up all of the bits must have been difficult, it seems fairly simple to implement. That where Apple's integration expertise comes in. Not only have they tuned the hell out of the library and made tons of optimizations, but they also have implemented kernel hooks, and made their dispatcher system aware so it can manage threads based on the overall system state. That means instead of every program checking how many CPUs the system has and trying to consume them all (causing contention between programs and wasting time in the scheduler) the system selects how many threads should be running based on the global system state and divys them up between all the programs on the system.

    Apple has released a PDF that provides a high level overview of GCD, including a number of nice pictures.

    Downsides

    Any major shift has some downsides, though in GCDs case they seem pretty small. The first downside is that makes programmers change some of their fundamental assumptions. The most obvious is that when a function returns that no longer means that the code you declared inside it has necessarily run. Moving to this sort of asynchronous model takes some getting used to, but honestly it is far simpler than understanding threads and locks. Of course if you are already familiar with traditional threading it is something new to learn.

    Another issue is that the interactions with traditional threaded code can be complicated. The unfortunate truth is that a lot of code that would have been better implemented using GCD already exists and is using threads, and converting it might not be worth the effort. That means GCD code may need to interact with traditional threaded code. In general that should work fine with some minor effort, especially if they existing code uses a single threaded facade. If you need to actually directly manipulate threads from inside GCD queues there are a lot of constraints, but they are well documented in the manpages.

    Finally, the biggest issue for me is that GCD is not cross platform. Apple has released the Blocks patches for GCC, include Blocks in CLANG, and implemented a C to C++ source level transformer ("clang-cc -rewrite-blocks") that rewrites blocks code into traditional C++ code, as well as an open sourced Blocks runtime. Unfortunately they have not (yet) been so forth coming with libdispatch implementation. While it might a competitive advantage, I suspect that getting it freely available so that the various open source packages Apple uses can consider moving to GCD would be a significant benefit. Ultimately I think this is a temporary issue, because the actual blocks API is not that complex to implement if one only cares about compatibility. Extracting all of the performance Apple has gotten through their system wide implementation and optimization would probably require quite a bit of work, which would also work in Apple's favor (GCD is good everywhere, but best on OS X!).

    Conclusion

    GCD will improve the utilization of multiple processors, and reduce the number of spinning beachballs (even in what are nominally single threaded programs). It is not a panacea, and how effective is depends on how pervasively it used. I suspect it will end up being quite pervasive, because the paradigm it is trying to replace have proved difficult to use.