Twiiter

Twitter Updates

    follow me on Twitter
    Loading..
    Loading..

    Entries in Snow Leopard (2)

    Saturday
    Oct242009

    The loss of ZFS

    Well, in case you haven't read any of the myriad stories about it, it appears that Apple has decided not to use ZFS on Mac OS X. Gruber has sources that say it was primarily licensing concerns, which is consistent with what people have implied to me, both recently, and around WWDC (although at that time I think there was probably still hope of resolving the issues).

    Now, some people jump may comment that it couldn't be licensing issues, since ZFS is opensource (under the CDDL), and that Apple already uses CDDL software (DTrace). That may be true, but often in deals that involve large companies there is more to it than that. Apple may have wanted guarantees of indemnification in the NetApp lawsuit. Maybe it wanted guarantees that certain modifications it wanted to make would be accepted upstream, or even to get Sun to make certain changes. It also might have wanted additional distribution rights that were not granted under the CDDL. It is typical for companies to negotiate custom agreements in such cases (and for some money to change hands), so the idea that licensing issues are why it fell through is entirely reasonable, even though it is an opensource product. Obviously Sun's steady decline in the market place, and the uncertainty caused by the Oracle acquisition may have greatly complicated any such negotiations.

    Why not do a new filesystem?

    Apple has a lot of talented filesystem engineers. They are certainly capable of doing something comparable to ZFS, at least for their target market. The problem with developing a new modern filesystem is that it generally takes longer than a single OS release cycle. Most companies are really bad at having large teams focused on projects that will not ship in the next version of the project they are working on.

    This is a particularly acute problem at Apple, which traditionally has done things with very few engineers. I don't want to get into exact numbers, but I recall having a discussion with the head of a university FS team who was discussing the FS he was working on. He was pitching it to a group of Apple engineers. It was some interesting work, but there were some unsolved problems. When he was asked about them he commented that they didn't have enough people to deal with them, but he had some ideas and it shouldn't be an issue for a company with a real FS team. It turned out his research team had about the same number of people working on their FS as Apple had working on HFS, HFS+, UFS, NFS, WebDAV, FAT, and NTFS combined. I think people don't appreciate how productive Apple is on a per-engineer basis. The downside of that is that sometimes it is hard to find the resources to do something large and time consuming, particularly when it is not something that most users will notice in a direct sense. That is especially true if senior management is not excited about the idea.

    Because of that, I was fairly convinced ZFS was a credible future primary FS for Apple. Not because it was an optimal design for them (it isn't), but because it was a lot less work than doing a new design from scratch. The fact its fundamental architecture is 20 years newer than HFS meant it would still be better than HFS+ in almost all respects even if it was not designed for Apple's exact needs. Clearly I was wrong, since Apple has stopped the ZFS project.

    What changed?

    Well, a couple of things have happened. The first is that Mac OS X has gotten more mature. They no longer need to port all of those FSes, they already have them working, and in most cases they work fairly well. That frees up some engineers. Apple has also greatly expanded the number of people working on their kernel since it is amortized over many different products (Mac OS X, iPhone, AppleTV, etc).

    Suddenly the notion of doing a new filesystem seems doable, so long as it is a real priority and the FS team doesn't get pulled to keep adding features or doing major work to legacy FSes. That is still a lot of work when Apple had ZFS approaching production quality on OS X.

    Apple can do better than ZFS

    Sun calls ZFS "The Last Word in Filesystems", but that is hyperbole. ZFS is one of the first widely deployed copy on write FSes. That certainly makes it a tremendous improvement over existing FSes, but pioneers are the ones with arrows in their back. By looking at ZFS's development it is certainly possible to identify mistakes that they made, and ways to do things better if one were to start from scratch. From where I sit, there are 3 obvious ways doing a new FS will be better for Apple than ZFS:

    1. There have been new fundamental research since ZFS was designed that simplifies many of the issues involved with it. In particular the "B-trees, Shadowing, and Clones" (PDF). That paper is the basis for the design of BtrFS, which has a very similar feature set to ZFS, but internally is entirely different. LWN has an article about BtrFS that explains the significance in some detail (it is written Valerie Aurora, who worked on ZFS at Sun).

    2. ZFS was designed for the storage interfaces available a decade ago. Spinning disks are going to be with us for a long time, especially for bulk storage in data centers and on backup devices. The future is all about solid state. Flash SSDs have significantly different performance characteristics than spinning media, and there may be FS design decisions one could make that would benefit from that. Now, any FS Apple designs will have to work acceptably on traditional drives, but if they are designing for the future then flash is what to target.

      ZFS has had some optimization work for flash, but it is all in terms of using flash as part of a storage hierarchy. That makes complete sense, since ZFS's primary deployment targets are high-end systems and data center storage. Those systems have multiple drives, so the idea of separate flash drives for a ZIL and L2ARC are completely reasonable. Most consumers have one drive in their system, and maybe an external drive for bulk data, data exchange, and backup.

    3. That brings up the last point. ZFS is designed for big systems. It works on small systems, but most of the tradeoffs favor very large computers, with lots of drives. This shows up in a number of ways. The first is that ZFS is not currently capable of adding single drives to an existing vdev or migrating vdevs between various types (mirror, raidz, raidz2). This is a major feature for smaller users who might want to add a single drive, but is a non-issue for data center users who tend to add large number of drives all at once, since they will add whole vdevs. Another issue is that ZFS assumes you have a lot of ram. NEC has been doing a port of OpenSolaris to ARM, and they determined they could not get ZFS to use less than 8 megabytes of ram without making incompatible format changes (Compacted ZFS). With those changes they could squeeze it into a more reasonable 2 megabytes. On a desktop that doesn't seem like a big deal, but on an iPhone 3G or a Time Capsule 8MB of wired memory is an enormous issue.

    The only major downside is that if Apple is just starting on a next generation FS now it could be a long time before we get our hands on it.

    But now we are going to have another incompatible next generation filesystem

    Wolf brought this point up during some of the ZFS talk on twitter yesterday. My general opinion is that it doesn't matter. People use drives for two largely unrelated tasks. One is running their computers. This is fixed storage. The other is for data exchange. In the old days people used floppies for their sneakernet media, which made the situation much simpler to understand. In recent years the market realities have caused people to move to using SD cards, thumbdrives, and hard drives as the exchange medium of sneakernet.

    The important point is that understand is that while the physical devices may be the same, the use model is different, just as the using a floppy disk and an internal hard drive were different. Nobody would balk at the notion that floppies should use different FSes than internal drives. Likewise, most people shouldn't care that their external drives are formatted differently than their internal drives.

    There are complicated features you want for your boot drives and system disks. Ideally you could have them on your interchange disks, but there are other features that are more important, particularly interoperability, and simplicity. ZFS didn't bring either of those. There might have been a few people who were psyched to be able to use ZFS to share disks between a Mac and a Solaris or FreeBSD box, but honestly those people are few and far between. Whether Apple used ZFS or something else it is just as interoperable with Linux and Windows (which is to say, not at all). So that fact that Apple looks to be doing a new FS does not impact interoperability in any real sense.

    The other feature you really want for an interchange FS is simplicity. There are a lot of devices out there that use an FS to communicate with a computer. The simplest example is a digital camera via its media cards, but there are many others. Something like ZFS is way too complex for those devices, and honestly most of the features of ZFS like multiple drive support and snapshots are useless since the devices don't have the physical interconnects or user interfaces to expose those features. There is certainly an argument to be made that we could use something a bit better than FAT32 or exFAT as that format, but ZFS was not the right solution for that.

    In other words, for that disk you want to use as an external drive to drag between computers you don't want something like ZFS, you want something that is simple enough that a firmware engineer can write a read-only implementation from the specs in less than a week. For the disk embedded in your computer (operationally or literally) you want something like ZFS, but it doesn't matter if it is interoperable with anything else because you won't be moving it between systems.

    This is basically how Windows works. Microsoft generally uses NTFS for internal drives, but FAT for external drives. Ultimately somebody should design a filesystem explicitly for use as an interchange format and license it for free, then everyone can deal with their internal FSes and do what makes the most sense for their OSes and markets.

    Friday
    Aug282009

    Grand Central Dispatch

    One of the really cool features of Mac OS X Snow Leopard is a new set of technologies to make developing for multicore systems easier, named Grand Central Dispatch. There has been a lot of hype about what they are, and until recently you needed an NDA to discuss them. With the release of Snow Leopard it is now possible to discuss those technologies publicly.

    Not so new...

    The first thing I need to say is that the assorted technologies involved with GCD are not new. It is a number of existing technologies that have been combined in a fairly innovative way and with a killer implementation.

    I think the best way to explain this is to look at Apple's approach to devices like the iPod and iPhone. Apple has a reputation for innovative devices, but usually they don't innovate by inventing a category, they reinvent it. In other words they find some market that exists but is poorly served, and they throw their resources at it, blowing away the existing competition. Sometimes they own the category, sometimes new competitors come in to try to compete on the raised playing field.

    The simplistic explanation is that GCD is a combination of thread pooling and closures. The honest truth is that if thread pooling is a Motorola ROKR then GCD is an iPhone 3GS... while they are both phones and nominally do the same thing, no one would ever claim they are equal, or even in the same league. The guys at Apple who came up with it deserve a ton of credit.

    First off clos^H^H^H^H^H Blocks

    The first piece of GCD is Apple's new C extension, Blocks. In short, blocks are closures. For those of you are into the all the hot new scripting languages, closures should not be anything new. For all the old LISP heads reading this, closures are in fact a very old thing. For everyone who spends most of their time writing C and C++ code, the basic idea is that a closure is a bundle of a function pointer and a context pointer. So, everywhere you setup a callback function and pass in a context pointer that gets passed into that function you are manually forming a closure. The big difference with language level support is the syntax:

    //Without blocks
    
    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>
    
    void myCallback(void **context) {
      uintptr_t number = (uintptr_t)(*context);
      printf("Out: %" PRIuPTR "\n", number);
      *context = (void *)(number + 1);
    }
    
    void call_100_times(void (*callback)(void **context)) {
      void *context = NULL;
      uint32_t i;
    
      for (i = 0; i < 100; i++) {
        callback(&context);
      }
    }
    
    int main(void) {
      call_100_times(myCallback);
    
      return 0;
    }
    

    vs

    //With blocks
    
    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>
    
    void call_100_times(void (^callback)()) {
      uint32_t i;
    
      for (i = 0; i < 100; i++) {
        callback();
      }
    }
    
    int main(void) {
      __block uint32_t context = 0;
    
      call_100_times(^{
        printf("Out: %" PRIu32 "\n", context);
        i++;
      });
    
      return 0;
    }
    

    So, lets do a quick analysis of the changes. First, rather than writing a function and passing a pointer to it into call_100_times() I was able to write the block right at the call site of the function I am passing it into, which makes things easier to read.

    Also, the function takes an explicit context pointer which I need to cast around into what I know it is. In more complex case you may need to build structs to pass into function. Our block doesn't take any parameters (they can, but often they do not need to). Instead it captures the context variable from the scope it is declared in. The reason we declare context as a "__block uint32_t" is to tell the compiler that when blocks read and write to it they should share the one from that scope, instead of getting their own copies. The result is a very logical construction where I declare a variable with in an interior scope that I pass into another function.

    At the end of the day it is basically syntactic sugar with a small amount of runtime code to deal with copying blocks to and from the heap, etc. But it makes a huge difference when you are trying to write code.

    Apple actually publicly disclosed Blocks a long time ago, and several good blog posts about them exist. They have been proposed to WG14 for inclusion in C1X.

    libdispatch

    The other part of GCD is libdispatch. As I said earlier, GCD is basically a combination of closures and thread pooling. Since Blocks are Apple's closure implementation, it is reasonable to assume libdispatch is their thread pooling implementation, and that is exactly what it is.

    For those unfamiliar, the basic concept of a thread pool is that rather than having specific threads for specific tasks, you instead have a pool of generic worker threads. The program issues chunks of work (traditionally as a function pointer and a context pointer, but in this case generally a Block) to a queue, the worker threads pull the work units off of the queue and runs them.

    This approach has a number of advantages over fixed partitioning. First, it is really hard to estimate how much work each thread should do. With a thread pool the system handles that dynamically, it can spawn as many threads as it wants and pull the work units off the queue, which tends to result in much better CPU utilization. It can also shutdown threads if the queue is drained, which lets the system reclaim dormant resources. All the dispatch logic runs very quickly with no contention, since wait-free queues are well understood.

    libdispatch is a new system library provided with Snow Leopard. It provides a system wide set of thread pool APIs. The default use for most users will be to create sequential queues, which issue blocks in order and wait until one block has completed before the next one begins. That means that you can create a queue and issue your critical sections to it, guaranteeing mutual exclusion without taking a lock (though the code now runs asynchronously). Lets revisit our example above, but with 2 additional constraints. First, lets run our block on a queue. Second, lets assume that printf is only safe to run from the main thread (printf's thread safety is actually unspecified in the C standard since C does not actually specify anything about threading or memory models, but as a practical matter most implementations are thread safe).

    //With blocks and libdispatch
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <inttypes.h>
    
    #include 
    
    //We add a completion handler that gets called when all the work is done,
    //it is just another block
    
    void call_100_times(void (^callback)(void), void (^completion)(void)) {
      uint32_t i;
    
      //Create a queue to issue blocks against
      dispatch_queue_t queue = dispatch_queue_create("com.blogspot.devwhy", NULL);
    
      for (i = 0; i < 100; i++) {
        dispatch_async(queue, callback);
      }
    
      dispatch_async(queue, completion);
      dispatch_release(queue);
    }
    
    int main(void) {
      __block uint32_t i = 0;
      dispatch_queue_t main_queue = dispatch_get_main_queue();
    
      call_100_times(^{
        //Issue an asynch printf from whatever queue we are on to the main queue
    
        dispatch_async(main_queue, ^{
          printf("Out: %" PRIu32 "\n", i);
          i++;
        });
      },
      ^{
        //Need to explicitly call exit now that the end of main is never reached
        //We do it here in the completion block
    
        dispatch_async(main_queue, ^{
          exit(0);
        });
      });
    
      //Start the dispatcher
      dispatch_main();
    
      //NOT REACHED
    }
    

    Okay, so that is a lot more code than before, but it has the benefit of allowing some of the code run other CPUs, while still keeping everything in order. I also never created a thread or a lock! In short, we made the whole thing run asynchronously, sort of how jQuery works. First we call call_100_times(), which creates a dispatch queue and adds the 100 blocks to it. Those blocks will be handled in a completely serial manner (when one is finished the next is started) by libdispatch, so they will only use one CPU at a time, though the queue may migrate between threads and CPUs as appropriate. Obviously this is a toy example that might not generate enough work for GCD to bother spinning up a second thread, but it can if it deems that to be advantageous, and it requires no programmer intervention.

    We then call dispatch_main(), which starts the queues. As the blocks are dequeued and run they start queuing the printf()'s onto the main queue (which is a special queue that runs exclusively on the main thread), followed by the exit() from the completion block. As those blocks are running the main queue starts dequeuing its blocks and printing stuff out, so now our program is doing the adding on one thread and the writing to the screen on another.

    Once the completion block is dequeued it sends a block containing exit() to the main queue. Once the main queue finishes all the proceeding printf blocks it calls the exit() block and the app terminates. So, what we have done here shows how we can implement critical sections without using locks, and queue things to run on the main thread in an appropriate order. Another big benefit of doing everything asynchronously is that it greatly cuts down on spinning beachballs, since those are generally caused by something synchronously waiting on the main thread.

    Besides this, there are several other features of libdispatch, such as concurrent queues (which do not guarantee blocks they issue complete before the next one is run), semaphores, event sources, finalizers, and all of the other frills necessary to push everything through GCD. The manpages for GCD are very good, I recommend reading them: man 3 dispatch

    Now, looking at the above, you might be thinking that it is very cool, and while thinking up all of the bits must have been difficult, it seems fairly simple to implement. That where Apple's integration expertise comes in. Not only have they tuned the hell out of the library and made tons of optimizations, but they also have implemented kernel hooks, and made their dispatcher system aware so it can manage threads based on the overall system state. That means instead of every program checking how many CPUs the system has and trying to consume them all (causing contention between programs and wasting time in the scheduler) the system selects how many threads should be running based on the global system state and divys them up between all the programs on the system.

    Apple has released a PDF that provides a high level overview of GCD, including a number of nice pictures.

    Downsides

    Any major shift has some downsides, though in GCDs case they seem pretty small. The first downside is that makes programmers change some of their fundamental assumptions. The most obvious is that when a function returns that no longer means that the code you declared inside it has necessarily run. Moving to this sort of asynchronous model takes some getting used to, but honestly it is far simpler than understanding threads and locks. Of course if you are already familiar with traditional threading it is something new to learn.

    Another issue is that the interactions with traditional threaded code can be complicated. The unfortunate truth is that a lot of code that would have been better implemented using GCD already exists and is using threads, and converting it might not be worth the effort. That means GCD code may need to interact with traditional threaded code. In general that should work fine with some minor effort, especially if they existing code uses a single threaded facade. If you need to actually directly manipulate threads from inside GCD queues there are a lot of constraints, but they are well documented in the manpages.

    Finally, the biggest issue for me is that GCD is not cross platform. Apple has released the Blocks patches for GCC, include Blocks in CLANG, and implemented a C to C++ source level transformer ("clang-cc -rewrite-blocks") that rewrites blocks code into traditional C++ code, as well as an open sourced Blocks runtime. Unfortunately they have not (yet) been so forth coming with libdispatch implementation. While it might a competitive advantage, I suspect that getting it freely available so that the various open source packages Apple uses can consider moving to GCD would be a significant benefit. Ultimately I think this is a temporary issue, because the actual blocks API is not that complex to implement if one only cares about compatibility. Extracting all of the performance Apple has gotten through their system wide implementation and optimization would probably require quite a bit of work, which would also work in Apple's favor (GCD is good everywhere, but best on OS X!).

    Conclusion

    GCD will improve the utilization of multiple processors, and reduce the number of spinning beachballs (even in what are nominally single threaded programs). It is not a panacea, and how effective is depends on how pervasively it used. I suspect it will end up being quite pervasive, because the paradigm it is trying to replace have proved difficult to use.