Twiiter

Twitter Updates

    follow me on Twitter
    Loading..
    Loading..
    Thursday
    Oct082009

    Flash on the iPhone

    Update: One of the developers of Trading Stuff posted a comment below, and has a great blog post about his experience getting it running on the iPhone.

    There has been a lot of discussion about running Flash apps on the iPhone over
    the last few days. It was precipitated by Adobe's announcement that Flash
    Professional CS5
    would have support for publishing apps as iPhone native
    executables. They went into a little more detail, saying that they were going to
    use an Ahead of Time (AOT) compiler backend based on LLVM, and that
    there are already several apps on the store using it. This generated a large
    number of responses from various people, some knee-jerk, some well
    reasoned out. Of course, the fact that there are samples we can dissect means
    that it is possible to make some informed analysis about them.

     

    Personal views

     

    Before we get into the technical details of this, let me go into my background a
    bit, lest I be accused of being biased or having an agenda. I have around 10
    years of Objective C development experience, and almost no experience writing
    anything substantial with Flash or ActionScript. I am also primarily a Macintosh
    user, where the Flash experience (even in the browser) is often less than ideal.
    I am told it is a better experience on Windows. I do use the Hulu Desktop app, which is written in Flash, and think it is pretty nice (though
    they should make cut and paste work in their text fields, grrr). Of course,
    something like Hulu is an immersive app that has no need to integrate with the
    native OS experience, but neither do most games.

    There are a lot of neat little web games and what not that are written in Flash,
    that I would like to run. If that was the tool the author felt it best to
    express their ideas and it worked for them then great. In particular, for
    software that I don't feel needs interface with OS and which can use a
    completely custom UI (in other words, games) I think there is no difference to
    the user. So long as the environment generates good code that can run at full
    frame rate without killing the battery I am in favor of getting Flash apps to
    run on the iPhone.

     

    Warm up

     

    Okay, so first off lets look at what we know. Adobe is using the LLVM arm
    backend to generate code. Right off the bat that gets me bit worried. Don't get
    me wrong, I love LLVM. Hell, I was one of the people who wrote
    the LLVM ppc backend. Having said that, I wouldn't use the LLVM arm backend at
    this time. The reason I wouldn't is because every time I have asked the
    people who are actively working on it they tell me it is not ready for prime
    time. That is the reason why LLVM-gcc and clang are not supported compiler
    targets for iPhone, despite them being supported (and encouraged) for development
    on Snow Leopard. Apple has a lot of compiler engineers and has basically stated
    that LLVM is their future compiler direction, so if those guys are telling us to
    hold back, then how is it good enough for AOT Flash?

    Now, Adobe could potentially have another arm backend they developed, or maybe
    they branched off a particular build and have fixed whatever bugs would impact
    them while ignoring stuff they didn't need. It is entirely possible that they
    could have reasonable code coming out of this thing, and I don't have access to
    the toolchain itself to inspect it, but it definitely gets me nervous. This may
    also be one of the reasons why they are not ready to widely release the tools
    yet.

     

    Lets get to it

     

    Okay, so I downloaded one of the games that was available, Trading
    Stuff
    , decompressed its IPA and had a look inside. At first
    glance it looked like a pretty normal iPhone app. Then I noticed there were no
    resources besides a basic MainWindow.nib. No images, no sounds, no
    localizations. The next thing I noticed was that the binary was ~13 megabytes,
    or approximately ~95% the size of the entire app. That is enormous for a binary.
    For reference, compare that to a normal iPhone game, like The Oregon
    Trail
    , which is ~106 megabyte game has ~1 megabyte executable, or
    about 1% the size of the app.

    What is going on is that the Flash build environment is not using any of the
    standard Mac OS X/iPhone OS bundling or localization mechanisms. Instead they
    are transforming all their assets into embeddable objects and shoving them
    directly into their application's TEXT section. At first glance that might not
    seem so bad, but it has a bunch of consequences. It defeats almost any sort of
    caching or prefetch logic the OS has for specific data types (like images), and
    instead places all of the pressure directly on the VM and paging subsystems.

    To be clear this is no way violation of the SDK agreement, and embedding objects
    into an app in this way is occasionally appropriate, but the degree to which it
    is happening with these cross compiled apps is different, and likely will have a
    number of significant (negative) performance implications.

    Now, moving on from the obvious, its time to actually start poking at the
    generated code. If we just take a cursory look at the linkage, we can see some
    bad stuff going on here. Why are they calling dlopen, dlclose, and dlsym? The
    only reason to use those is load in frameworks and resolve symbols after launch,
    something that is strictly verboten. In the best case that might be some dead
    code they use from debugging that should have been stripped that should have
    out. In the worst case, they depend on them to get access to symbols they are
    not supposed to use. I want to be clear about this: There is no legitimate
    reason why any app that follows the terms of the SDK agreement should use these
    functions
    , and I find it shocking that Apple lets apps that link against them
    on the store at all. I should also note this is not exhaustive, those are just the most obvious things, but in my cursory inspection I saw a dozen or so objective C selectors that I believe are private.

    Actually that brings up another point. Despite the app developer doing nothing
    wrong, one of their toolchain or middleware vendors is doing something that could
    be an issue. When I write apps for the store I might choose to play fast and
    loose with something if there is a compelling reason, but if I am providing a
    library for someone else I never do. Using private APIs is putting your
    customer's apps at risk. Not only do I find that unacceptable,
    but I think any vendor who does that is generally irresponsible and makes it me
    hesitate to use any of their other products because I feel it shows a certain
    casualness about how you treat their customers. If there is a legitimate reason
    you need to do something that risks your customer's products, then as a company
    you need to disclose it so your customers can make an informed decision.

    I should note this is not an intrinsic issue with Flash, I know for a fact
    certain major vendors ship iPhone libraries that call APIs that can get your app
    rejected from the store without informing developers. For instance, various analytics companies
    really shouldn't be poking at private APIs to try to find cached location
    framework data. It isn't just a privacy breach, it places your clients apps at risk.

     

    So what. How does it run?

     

    On my iPhone 3G it runs really choppy, on my 3GS it runs acceptably, but it
    still isn't smooth. Given the OpenGL performance people have seen on the 3GS
    that is still pretty bad. I have not done any invasive tests by instrumenting
    the binary, that is just what I can get via basic usage. The sad thing is that
    there is no reason it has to have performance like this. This is not an inherent
    issue with the ActionScript used in this app (though that may have issues), it
    is that what is coming out of the toolchain is a huge, monstrous binary that stresses
    the runtime and has performance characteristics completely different than
    anything the iPhone is currently setup for.

    Also, remember, the slower the frame rate the more work the phone is doing per frame, and the
    more battery the app is using. When you see an app that can do 120 FPS in its
    demo loop, that means that when it is running at 30 FPS it is using ~25% of the
    CPU/GPU assets. When you see one that can only get 20 FPS that means it cannot
    hit 30 FPS to clamp at despite maxing some or the costly (in terms of battery)
    system assets.

     

    Punchline

     

    Technically speaking, these do appear to basically be within letter of the SDK
    agreement, modulo the fact that Adobe appears to making private API calls. They
    should be able to do what they need to without making those calls, so ultimately
    that should be a non-issue.

    Now, the notion that what this thing emits is indistinguishable from
    something Xcode emits is laughable. They are very different, and not in a good
    way. While the apps may get acceptable frame rates on an iPhone 3GS, they don't
    on earlier hardware, and they almost certainly use substantially more
    battery power than native games.

    I want to be excited about this thing, both because it is a seriously cool
    piece of tech, and because there are Flash games I would like to run on my
    phone, but looking at what this thing is spitting out I think the apps it will
    generate will perpetuate the stereotypes about Flash (especially on cell
    phones), and give Objective C programmers a (somewhat misplaced) sense of
    vindication about their views on Flash.

    This is all still in beta, it could end up a lot better than it currently is. It
    could be something that can make some great games available on the iPhone.
    Unfortunately looking it right now I am very skeptical, and I think that is the
    right position to have given Flash's performance elsewhere. Yes, this is
    entirely new technology, but it comes from the same company with the same
    priorities. Given the product they have delivered to me on my desktop for the
    last 5 years they don't get benefit the doubt, they have to pull themselves out
    of the doghouse as far as I am concerned. Come on Adobe, prove me wrong!

    Sunday
    Sep272009

    C4[3]

    C4[3] has just ended, and I am sitting in the hotel lobby right now. It was a great show this year, a lot of interesting talks, I hope everyone had as much fun as I did. I am posting the slides from my Blitz Talk (How to become a compiler engineer in 5 minutes) here (PDF).

    Thursday
    Sep102009

    Apple has opensourced libdispatch

    Apple has opensourced the libdispatch Mac OS X implementation. This is excellent news, and hopefully we will quickly see ports to other platforms. This is a the last core piece of GCD that was proprietary, so it is a very exciting piece of news.

    Friday
    Aug282009

    Grand Central Dispatch

    One of the really cool features of Mac OS X Snow Leopard is a new set of technologies to make developing for multicore systems easier, named Grand Central Dispatch. There has been a lot of hype about what they are, and until recently you needed an NDA to discuss them. With the release of Snow Leopard it is now possible to discuss those technologies publicly.

    Not so new...

    The first thing I need to say is that the assorted technologies involved with GCD are not new. It is a number of existing technologies that have been combined in a fairly innovative way and with a killer implementation.

    I think the best way to explain this is to look at Apple's approach to devices like the iPod and iPhone. Apple has a reputation for innovative devices, but usually they don't innovate by inventing a category, they reinvent it. In other words they find some market that exists but is poorly served, and they throw their resources at it, blowing away the existing competition. Sometimes they own the category, sometimes new competitors come in to try to compete on the raised playing field.

    The simplistic explanation is that GCD is a combination of thread pooling and closures. The honest truth is that if thread pooling is a Motorola ROKR then GCD is an iPhone 3GS... while they are both phones and nominally do the same thing, no one would ever claim they are equal, or even in the same league. The guys at Apple who came up with it deserve a ton of credit.

    First off clos^H^H^H^H^H Blocks

    The first piece of GCD is Apple's new C extension, Blocks. In short, blocks are closures. For those of you are into the all the hot new scripting languages, closures should not be anything new. For all the old LISP heads reading this, closures are in fact a very old thing. For everyone who spends most of their time writing C and C++ code, the basic idea is that a closure is a bundle of a function pointer and a context pointer. So, everywhere you setup a callback function and pass in a context pointer that gets passed into that function you are manually forming a closure. The big difference with language level support is the syntax:

    //Without blocks
    
    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>
    
    void myCallback(void **context) {
      uintptr_t number = (uintptr_t)(*context);
      printf("Out: %" PRIuPTR "\n", number);
      *context = (void *)(number + 1);
    }
    
    void call_100_times(void (*callback)(void **context)) {
      void *context = NULL;
      uint32_t i;
    
      for (i = 0; i < 100; i++) {
        callback(&context);
      }
    }
    
    int main(void) {
      call_100_times(myCallback);
    
      return 0;
    }
    

    vs

    //With blocks
    
    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>
    
    void call_100_times(void (^callback)()) {
      uint32_t i;
    
      for (i = 0; i < 100; i++) {
        callback();
      }
    }
    
    int main(void) {
      __block uint32_t context = 0;
    
      call_100_times(^{
        printf("Out: %" PRIu32 "\n", context);
        i++;
      });
    
      return 0;
    }
    

    So, lets do a quick analysis of the changes. First, rather than writing a function and passing a pointer to it into call_100_times() I was able to write the block right at the call site of the function I am passing it into, which makes things easier to read.

    Also, the function takes an explicit context pointer which I need to cast around into what I know it is. In more complex case you may need to build structs to pass into function. Our block doesn't take any parameters (they can, but often they do not need to). Instead it captures the context variable from the scope it is declared in. The reason we declare context as a "__block uint32_t" is to tell the compiler that when blocks read and write to it they should share the one from that scope, instead of getting their own copies. The result is a very logical construction where I declare a variable with in an interior scope that I pass into another function.

    At the end of the day it is basically syntactic sugar with a small amount of runtime code to deal with copying blocks to and from the heap, etc. But it makes a huge difference when you are trying to write code.

    Apple actually publicly disclosed Blocks a long time ago, and several good blog posts about them exist. They have been proposed to WG14 for inclusion in C1X.

    libdispatch

    The other part of GCD is libdispatch. As I said earlier, GCD is basically a combination of closures and thread pooling. Since Blocks are Apple's closure implementation, it is reasonable to assume libdispatch is their thread pooling implementation, and that is exactly what it is.

    For those unfamiliar, the basic concept of a thread pool is that rather than having specific threads for specific tasks, you instead have a pool of generic worker threads. The program issues chunks of work (traditionally as a function pointer and a context pointer, but in this case generally a Block) to a queue, the worker threads pull the work units off of the queue and runs them.

    This approach has a number of advantages over fixed partitioning. First, it is really hard to estimate how much work each thread should do. With a thread pool the system handles that dynamically, it can spawn as many threads as it wants and pull the work units off the queue, which tends to result in much better CPU utilization. It can also shutdown threads if the queue is drained, which lets the system reclaim dormant resources. All the dispatch logic runs very quickly with no contention, since wait-free queues are well understood.

    libdispatch is a new system library provided with Snow Leopard. It provides a system wide set of thread pool APIs. The default use for most users will be to create sequential queues, which issue blocks in order and wait until one block has completed before the next one begins. That means that you can create a queue and issue your critical sections to it, guaranteeing mutual exclusion without taking a lock (though the code now runs asynchronously). Lets revisit our example above, but with 2 additional constraints. First, lets run our block on a queue. Second, lets assume that printf is only safe to run from the main thread (printf's thread safety is actually unspecified in the C standard since C does not actually specify anything about threading or memory models, but as a practical matter most implementations are thread safe).

    //With blocks and libdispatch
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <inttypes.h>
    
    #include 
    
    //We add a completion handler that gets called when all the work is done,
    //it is just another block
    
    void call_100_times(void (^callback)(void), void (^completion)(void)) {
      uint32_t i;
    
      //Create a queue to issue blocks against
      dispatch_queue_t queue = dispatch_queue_create("com.blogspot.devwhy", NULL);
    
      for (i = 0; i < 100; i++) {
        dispatch_async(queue, callback);
      }
    
      dispatch_async(queue, completion);
      dispatch_release(queue);
    }
    
    int main(void) {
      __block uint32_t i = 0;
      dispatch_queue_t main_queue = dispatch_get_main_queue();
    
      call_100_times(^{
        //Issue an asynch printf from whatever queue we are on to the main queue
    
        dispatch_async(main_queue, ^{
          printf("Out: %" PRIu32 "\n", i);
          i++;
        });
      },
      ^{
        //Need to explicitly call exit now that the end of main is never reached
        //We do it here in the completion block
    
        dispatch_async(main_queue, ^{
          exit(0);
        });
      });
    
      //Start the dispatcher
      dispatch_main();
    
      //NOT REACHED
    }
    

    Okay, so that is a lot more code than before, but it has the benefit of allowing some of the code run other CPUs, while still keeping everything in order. I also never created a thread or a lock! In short, we made the whole thing run asynchronously, sort of how jQuery works. First we call call_100_times(), which creates a dispatch queue and adds the 100 blocks to it. Those blocks will be handled in a completely serial manner (when one is finished the next is started) by libdispatch, so they will only use one CPU at a time, though the queue may migrate between threads and CPUs as appropriate. Obviously this is a toy example that might not generate enough work for GCD to bother spinning up a second thread, but it can if it deems that to be advantageous, and it requires no programmer intervention.

    We then call dispatch_main(), which starts the queues. As the blocks are dequeued and run they start queuing the printf()'s onto the main queue (which is a special queue that runs exclusively on the main thread), followed by the exit() from the completion block. As those blocks are running the main queue starts dequeuing its blocks and printing stuff out, so now our program is doing the adding on one thread and the writing to the screen on another.

    Once the completion block is dequeued it sends a block containing exit() to the main queue. Once the main queue finishes all the proceeding printf blocks it calls the exit() block and the app terminates. So, what we have done here shows how we can implement critical sections without using locks, and queue things to run on the main thread in an appropriate order. Another big benefit of doing everything asynchronously is that it greatly cuts down on spinning beachballs, since those are generally caused by something synchronously waiting on the main thread.

    Besides this, there are several other features of libdispatch, such as concurrent queues (which do not guarantee blocks they issue complete before the next one is run), semaphores, event sources, finalizers, and all of the other frills necessary to push everything through GCD. The manpages for GCD are very good, I recommend reading them: man 3 dispatch

    Now, looking at the above, you might be thinking that it is very cool, and while thinking up all of the bits must have been difficult, it seems fairly simple to implement. That where Apple's integration expertise comes in. Not only have they tuned the hell out of the library and made tons of optimizations, but they also have implemented kernel hooks, and made their dispatcher system aware so it can manage threads based on the overall system state. That means instead of every program checking how many CPUs the system has and trying to consume them all (causing contention between programs and wasting time in the scheduler) the system selects how many threads should be running based on the global system state and divys them up between all the programs on the system.

    Apple has released a PDF that provides a high level overview of GCD, including a number of nice pictures.

    Downsides

    Any major shift has some downsides, though in GCDs case they seem pretty small. The first downside is that makes programmers change some of their fundamental assumptions. The most obvious is that when a function returns that no longer means that the code you declared inside it has necessarily run. Moving to this sort of asynchronous model takes some getting used to, but honestly it is far simpler than understanding threads and locks. Of course if you are already familiar with traditional threading it is something new to learn.

    Another issue is that the interactions with traditional threaded code can be complicated. The unfortunate truth is that a lot of code that would have been better implemented using GCD already exists and is using threads, and converting it might not be worth the effort. That means GCD code may need to interact with traditional threaded code. In general that should work fine with some minor effort, especially if they existing code uses a single threaded facade. If you need to actually directly manipulate threads from inside GCD queues there are a lot of constraints, but they are well documented in the manpages.

    Finally, the biggest issue for me is that GCD is not cross platform. Apple has released the Blocks patches for GCC, include Blocks in CLANG, and implemented a C to C++ source level transformer ("clang-cc -rewrite-blocks") that rewrites blocks code into traditional C++ code, as well as an open sourced Blocks runtime. Unfortunately they have not (yet) been so forth coming with libdispatch implementation. While it might a competitive advantage, I suspect that getting it freely available so that the various open source packages Apple uses can consider moving to GCD would be a significant benefit. Ultimately I think this is a temporary issue, because the actual blocks API is not that complex to implement if one only cares about compatibility. Extracting all of the performance Apple has gotten through their system wide implementation and optimization would probably require quite a bit of work, which would also work in Apple's favor (GCD is good everywhere, but best on OS X!).

    Conclusion

    GCD will improve the utilization of multiple processors, and reduce the number of spinning beachballs (even in what are nominally single threaded programs). It is not a panacea, and how effective is depends on how pervasively it used. I suspect it will end up being quite pervasive, because the paradigm it is trying to replace have proved difficult to use.

    Tuesday
    Aug042009

    From write() down to the flash chips

    Like a lot of people, I have made the move to an SSD, and I love it. While there is quite a bit of variability between the different vendors of high end drives, all of them far exceed the performance of any conventional HDs. Moving to an SSD has easily been the best improvement in terms of performance and experience of any HW upgrade I have purchased this decade.

    Anyway, the catch with these new SSDs is that they are really new. They all have had some bumpy firmware issues, so unless you are an enthusiast who pays attention to those sorts of things and keeps your firmware up to date I can't quite recommend without reservations. While discussing an upcoming firmware upgrade for my drive (OCZ Technology Vertex 250GB) it became clear that a number of people did not understand how the entire storage stack above the SSD worked, and that was causing some serious misunderstandings about what a firmware up date could and could not do.

    Because of that, I decided it might be worthwhile to write up an explanation of how a modern storage stack (FS through flash chips) worked. For the sake of this discussion we will assume an overwrite filesystem (such as NTFS, HFS+, or ext4), an ATA bus, and what is generally referred to as a 2nd Generation FTL (Flash Translation Layer). Obviously the exact details may change quite a bit if any of these parameters are changed. But this is by far the most common setup people use today.

    The filesystem's view of the world

    The filesystem is responsible for a number of things, like permissions and name lookup, but none of those has an impact on where it puts blocks on a drive. Once we remove all of that, we can consider the file system to be an "object to block" mapping layer. In general a storage object just means a separate distinct entity, so each file is an object.

    It is the file systems job to take these various object requests (create, delete, read write) and service them using the chunk of block storage it is provided by some underlying device.

    ATA's view of the world

    ATA is a command based transport that allows a lot of things, but for example we don't care about ATAPI and all that stuff, we just care about the block storage services it provides. ATA doesn't have any object commands, it just has block commands. Basically your entire drive divided into a series of LBAs (logical block addresses). The drive has commands to read and write data to a particular LBA. Note that it does not have a command to delete an LBA, that will become an enormous issue later.

    Flash's view of the world

    Flash is a relatively complicated storage medium, and has its own view of the world. It works in terms of pages and blocks. Usually a page is the smallest amount of space you can reasonably read or write to a a flash chip (for our discussion, 4K), and a block is the smallest chunk of space you can erase at a time (for our discussion 128 pages). With a fresh (unwritten block) all the bits are set to "1", and during a write they can only be transition to "0." That means in order to rewrite a page you must erase it first. This is a very important point, you can't just go and erase a page of the flash, you need to erase the 128 contiguous pages contained in a whole block at a the same time.

    The SSD controller's view of the world

    Okay, the SSD controller has to deal with ATAs view of the world on one side (LBA), but it also has to deal with flash chips view of the world on the other. This is a hugely complicated task, and it is the reason that the quality of drives varies widely. Because the SSD only deals with those points of view, and not the higher level parts of the OS view (objects) we can characterize its behaviors in those terms, though later we will look at how the stack operates as a whole.

    Okay, so lets assume for a second we have a 1MB flash device with 2 512KB blocks. This would be sold to the consumer as a 512KB flash drive, because some amount of the internal storage needs to be used for bookkeeping as we shall see. When the controller starts up what the flash looks like (from the point of view of the controller) is this:

     

    empty.png

     

    In the above picture the blue page represents the internal tracking data the drive controller is using for remembering things like wear counts and page indirections. The OS cannot see that page, it is completely internal to the drive.

    The controller can see all of that flash, but it tells the ATA chip on your motherboard there is a 1 512KB ATA device there. Now, lets imagine you go to write a page to the drive, what happens:

     

    first-write.png

     

    Okay, the SSD was handed a chunk of data from the ATA bus (the green page). It found somewhere to put it. It also needs to update its tables so it can find it in the future. But if you recall in order to erase the original table it would have to erase the whole block. That would be super wasteful, since most of the block is empty and flash has a limited number of erase cycles, so instead it just writes a new copy to a free spot in the block, and the old copy of the drives control data is marked as invalid. Lets imagine we write another page to the drive, we will see something similiar occur:

     

    second-write.png

     

    Now, lets write over the first page. This is where things get interesting. We do basically the same thing is as before, except that when the write occurs the SSD looks in its tables and sees that the address we are writing to is in use, so it goes and marks the page as invalid in its tables.

     

    overwrite.png

     

    Going on in this manner eventually a flash block will have more invalid pages than valid. At that time the drive will sweep through and gather all the valid data into a new page:

     

    gc1.png

     

    Then erase the old page:

     

    gc2.png

     

    At the moment the only way a user generated piece of data can be marked as invalid is if it is overwritten (though that is changing). This has some serious repercussions for the drive GC, as we will see.

    Throwing in the filesystem

    Okay, so we have a basic understanding of how a modern SSD works. Now lets throw a file system into the mix. Lets imagine we have a filesystem with a single file in it:

     

    fs1.png

     

    In the above picture the blue page is the SSDs internal tables (which the OS cannot see), the orange is the FSes internal tables (which the SSD can see, but cannot understand), and the green is the actual file data. Since the FS cannot see the SSDs tables and the SSD cannot understand the FSes tables we are now in a situation where no part of the stack has a complete understanding of what is going on. The repercussions of this are most apparent when you delete a file.

     

    fs2.png

     

    So in the above case the filesystem updated its internal tables and wrote it out to the SSD, the SSD then found somewhere to put the new page and modified its tables. But notice how nothing happened to the actual file data (now colored dark green), since a deletion is just a matter of marking a few bits in the FSes tables. The SSD does not know those pages no longer needed by the FS since it doesn't understand the FSes internal structures, so when the SSD runs its GC algorithms it must preserve them:

     

    fs2-gc.png

     

    Note that drive preserved all those pages that will never be read again, since the OS considers them free and will use them to write out new files. This is where the new ATA TRIM command makes a difference. What TRIM does is let the OS tell the drive "I have not yet overwritten this data, but I never need it again, so you can throw it out." Lets redo the last scenario with a TRIM aware drive:

     

    fs1.png

     

    Now we delete the file and the OS sends TRIMs for the file pages to the drive:

     

    fs2-trim.png

     

    Notice how the file pages are not marked invalid. At this point when GC runs it can throw them out:

     

    fs2-trim-gc.png

     

    Which results in the drive needing to preserve fewer pages during its GC process. That has several impacts, including:

     


    1. Reducing the time GC takes

    2. Increasing the amount of freespace available after a GC (which increases the time it takes for performance to degrade after a GC)

    3. It lets the FTL have a wider selection of pages to choose from when it when it need a new page to write to, which means it has a better chance of finding low write count pages, increasing the lifespan of the drive

     

    Now, I want to be clear, a sufficiently clever GC on a drive that has enough reserved space might be able to do very well on its own, but ultimately what TRIM does is give a drive GC algorithm better information to work with, which of course makes the GC more effective. What I showed above was a super simple GC, real drive GCes take a lot more information into account. First off they have to deal with more than two blocks, and their data takes up more than a single page. They track data locality, they only run against blocks have hit certain threshold of invalid pages or have really bad data locality. There are a ton of research papers and patents on the various techniques they use. But they all have to follow certain rules based on on the environment they work in, hopefully this post makes some of those clear.