    Entries in SSD (3)

    Saturday
    Oct 24, 2009

    The loss of ZFS

    Well, in case you haven't read any of the myriad stories about it, it appears that Apple has decided not to use ZFS on Mac OS X. Gruber has sources that say it was primarily licensing concerns, which is consistent with what people have implied to me, both recently, and around WWDC (although at that time I think there was probably still hope of resolving the issues).

    Now, some people may comment that it couldn't be licensing issues, since ZFS is open source (under the CDDL), and that Apple already uses CDDL software (DTrace). That may be true, but often in deals that involve large companies there is more to it than that. Apple may have wanted guarantees of indemnification in the NetApp lawsuit. Maybe it wanted guarantees that certain modifications it wanted to make would be accepted upstream, or even to get Sun to make certain changes. It also might have wanted additional distribution rights that were not granted under the CDDL. It is typical for companies to negotiate custom agreements in such cases (and for some money to change hands), so the idea that licensing issues are why it fell through is entirely reasonable, even though it is an open source product. Obviously Sun's steady decline in the marketplace, and the uncertainty caused by the Oracle acquisition, may have greatly complicated any such negotiations.

    Why not do a new filesystem?

    Apple has a lot of talented filesystem engineers. They are certainly capable of doing something comparable to ZFS, at least for their target market. The problem with developing a new modern filesystem is that it generally takes longer than a single OS release cycle. Most companies are really bad at having large teams focused on projects that will not ship in the next version of the project they are working on.

    This is a particularly acute problem at Apple, which traditionally has done things with very few engineers. I don't want to get into exact numbers, but I recall having a discussion with the head of a university FS team who was discussing the FS he was working on. He was pitching it to a group of Apple engineers. It was some interesting work, but there were some unsolved problems. When he was asked about them he commented that they didn't have enough people to deal with them, but he had some ideas and it shouldn't be an issue for a company with a real FS team. It turned out his research team had about the same number of people working on their FS as Apple had working on HFS, HFS+, UFS, NFS, WebDAV, FAT, and NTFS combined. I think people don't appreciate how productive Apple is on a per-engineer basis. The downside of that is that sometimes it is hard to find the resources to do something large and time consuming, particularly when it is not something that most users will notice in a direct sense. That is especially true if senior management is not excited about the idea.

    Because of that, I was fairly convinced ZFS was a credible future primary FS for Apple. Not because it was an optimal design for them (it isn't), but because it was a lot less work than doing a new design from scratch. The fact its fundamental architecture is 20 years newer than HFS meant it would still be better than HFS+ in almost all respects even if it was not designed for Apple's exact needs. Clearly I was wrong, since Apple has stopped the ZFS project.

    What changed?

    Well, a couple of things have happened. The first is that Mac OS X has gotten more mature. They no longer need to port all of those FSes, they already have them working, and in most cases they work fairly well. That frees up some engineers. Apple has also greatly expanded the number of people working on their kernel since it is amortized over many different products (Mac OS X, iPhone, AppleTV, etc).

    Suddenly the notion of doing a new filesystem seems doable, so long as it is a real priority and the FS team doesn't get pulled to keep adding features or doing major work to legacy FSes. That is still a lot of work when Apple had ZFS approaching production quality on OS X.

    Apple can do better than ZFS

    Sun calls ZFS "The Last Word in Filesystems", but that is hyperbole. ZFS is one of the first widely deployed copy on write FSes. That certainly makes it a tremendous improvement over existing FSes, but pioneers are the ones with arrows in their back. By looking at ZFS's development it is certainly possible to identify mistakes that they made, and ways to do things better if one were to start from scratch. From where I sit, there are 3 obvious ways doing a new FS will be better for Apple than ZFS:

    1. There has been new fundamental research since ZFS was designed that simplifies many of the issues involved with it. In particular there is the paper "B-trees, Shadowing, and Clones" (PDF). That paper is the basis for the design of BtrFS, which has a very similar feature set to ZFS, but internally is entirely different. LWN has an article about BtrFS that explains the significance in some detail (it is written by Valerie Aurora, who worked on ZFS at Sun).

    2. ZFS was designed for the storage interfaces available a decade ago. Spinning disks are going to be with us for a long time, especially for bulk storage in data centers and on backup devices. The future is all about solid state. Flash SSDs have significantly different performance characteristics than spinning media, and there may be FS design decisions one could make that would benefit from that. Now, any FS Apple designs will have to work acceptably on traditional drives, but if they are designing for the future then flash is what to target.

      ZFS has had some optimization work for flash, but it is all in terms of using flash as part of a storage hierarchy. That makes complete sense, since ZFS's primary deployment targets are high-end systems and data center storage. Those systems have multiple drives, so the idea of separate flash drives for a ZIL and L2ARC is completely reasonable. Most consumers have one drive in their system, and maybe an external drive for bulk data, data exchange, and backup.

    3. That brings up the last point. ZFS is designed for big systems. It works on small systems, but most of the tradeoffs favor very large computers, with lots of drives. This shows up in a number of ways. The first is that ZFS is not currently capable of adding single drives to an existing vdev or migrating vdevs between various types (mirror, raidz, raidz2). This is a major feature for smaller users who might want to add a single drive, but is a non-issue for data center users who tend to add large numbers of drives all at once, since they will add whole vdevs. Another issue is that ZFS assumes you have a lot of RAM. NEC has been doing a port of OpenSolaris to ARM, and they determined they could not get ZFS to use less than 8 megabytes of RAM without making incompatible format changes (Compacted ZFS). With those changes they could squeeze it into a more reasonable 2 megabytes. On a desktop that doesn't seem like a big deal, but on an iPhone 3G or a Time Capsule 8MB of wired memory is an enormous issue.

    The only major downside is that if Apple is just starting on a next generation FS now it could be a long time before we get our hands on it.

    But now we are going to have another incompatible next generation filesystem

    Wolf brought this point up during some of the ZFS talk on twitter yesterday. My general opinion is that it doesn't matter. People use drives for two largely unrelated tasks. One is running their computers. This is fixed storage. The other is for data exchange. In the old days people used floppies for their sneakernet media, which made the situation much simpler to understand. In recent years the market realities have caused people to move to using SD cards, thumbdrives, and hard drives as the exchange medium of sneakernet.

    The important point to understand is that while the physical devices may be the same, the use model is different, just as using a floppy disk and an internal hard drive were different. Nobody would balk at the notion that floppies should use different FSes than internal drives. Likewise, most people shouldn't care that their external drives are formatted differently than their internal drives.

    There are complicated features you want for your boot drives and system disks. Ideally you could have them on your interchange disks, but there are other features that are more important, particularly interoperability and simplicity. ZFS didn't bring either of those. There might have been a few people who were psyched to be able to use ZFS to share disks between a Mac and a Solaris or FreeBSD box, but honestly those people are few and far between. Whether Apple used ZFS or something else it is just as interoperable with Linux and Windows (which is to say, not at all). So the fact that Apple looks to be doing a new FS does not impact interoperability in any real sense.

    The other feature you really want for an interchange FS is simplicity. There are a lot of devices out there that use an FS to communicate with a computer. The simplest example is a digital camera via its media cards, but there are many others. Something like ZFS is way too complex for those devices, and honestly most of the features of ZFS like multiple drive support and snapshots are useless since the devices don't have the physical interconnects or user interfaces to expose those features. There is certainly an argument to be made that we could use something a bit better than FAT32 or exFAT as that format, but ZFS was not the right solution for that.

    In other words, for that disk you want to use as an external drive to drag between computers you don't want something like ZFS, you want something that is simple enough that a firmware engineer can write a read-only implementation from the specs in less than a week. For the disk embedded in your computer (operationally or literally) you want something like ZFS, but it doesn't matter if it is interoperable with anything else because you won't be moving it between systems.

    This is basically how Windows works. Microsoft generally uses NTFS for internal drives, but FAT for external drives. Ultimately somebody should design a filesystem explicitly for use as an interchange format and license it for free, then everyone can deal with their internal FSes and do what makes the most sense for their OSes and markets.

    Tuesday
    Aug 4, 2009

    From write() down to the flash chips

    Like a lot of people, I have made the move to an SSD, and I love it. While there is quite a bit of variability between the different vendors of high end drives, all of them far exceed the performance of any conventional HDs. Moving to an SSD has easily been the best improvement in terms of performance and experience of any HW upgrade I have purchased this decade.

    Anyway, the catch with these new SSDs is that they are really new. They all have had some bumpy firmware issues, so unless you are an enthusiast who pays attention to those sorts of things and keeps your firmware up to date I can't quite recommend them without reservations. While discussing an upcoming firmware upgrade for my drive (OCZ Technology Vertex 250GB) it became clear that a number of people did not understand how the entire storage stack above the SSD worked, and that was causing some serious misunderstandings about what a firmware update could and could not do.

    Because of that, I decided it might be worthwhile to write up an explanation of how a modern storage stack (FS through flash chips) worked. For the sake of this discussion we will assume an overwrite filesystem (such as NTFS, HFS+, or ext4), an ATA bus, and what is generally referred to as a 2nd Generation FTL (Flash Translation Layer). Obviously the exact details may change quite a bit if any of these parameters are changed. But this is by far the most common setup people use today.

    The filesystem's view of the world

    The filesystem is responsible for a number of things, like permissions and name lookup, but none of those has an impact on where it puts blocks on a drive. Once we remove all of that, we can consider the file system to be an "object to block" mapping layer. In general a storage object just means a separate distinct entity, so each file is an object.

    It is the file system's job to take these various object requests (create, delete, read, write) and service them using the chunk of block storage it is provided by some underlying device.

    ATA's view of the world

    ATA is a command based transport that allows a lot of things, but for our purposes we don't care about ATAPI and all that stuff, we just care about the block storage services it provides. ATA doesn't have any object commands, it just has block commands. Basically your entire drive is divided into a series of LBAs (logical block addresses). The drive has commands to read and write data to a particular LBA. Note that it does not have a command to delete an LBA; that will become an enormous issue later.
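
    To make the two views concrete, here is a minimal sketch in Python. Everything in it is made up for illustration (the names BlockDevice and ToyFS, the 512 byte block size, and keeping the FS tables in memory rather than on the device); it is not how any real filesystem or ATA driver is written. It just shows an object-to-block mapping sitting on top of a device that only knows "read LBA" and "write LBA".

```python
BLOCK_SIZE = 512  # bytes per LBA (an assumption for this sketch)


class BlockDevice:
    """A device with only two data commands: read an LBA, write an LBA."""

    def __init__(self, num_lbas):
        self.data = [bytes(BLOCK_SIZE) for _ in range(num_lbas)]

    def read(self, lba):
        return self.data[lba]

    def write(self, lba, block):
        if len(block) != BLOCK_SIZE:
            raise ValueError("a write must be exactly one block")
        self.data[lba] = block


class ToyFS:
    """Object-to-block mapping: file name -> (list of LBAs, file size)."""

    def __init__(self, dev):
        self.dev = dev
        self.files = {}
        self.free = list(range(len(dev.data)))

    def create(self, name, payload):
        lbas = []
        for off in range(0, len(payload), BLOCK_SIZE):
            lba = self.free.pop(0)
            chunk = payload[off:off + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
            self.dev.write(lba, chunk)
            lbas.append(lba)
        self.files[name] = (lbas, len(payload))

    def read(self, name):
        lbas, size = self.files[name]
        return b"".join(self.dev.read(lba) for lba in lbas)[:size]

    def delete(self, name):
        lbas, _ = self.files.pop(name)
        # Only the FS's own bookkeeping changes. There is no block command
        # to tell the device that these LBAs no longer matter.
        self.free.extend(lbas)


dev = BlockDevice(8)
fs = ToyFS(dev)
fs.create("a.txt", b"hello")
contents = fs.read("a.txt")
fs.delete("a.txt")
leftover = dev.read(0)[:5]  # the device still holds the old bytes
```

    After the delete the file is gone as far as the FS is concerned, but the device was never told anything, so LBA 0 still contains the old bytes.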

    Flash's view of the world

    Flash is a relatively complicated storage medium, and has its own view of the world. It works in terms of pages and blocks. Usually a page is the smallest amount of space you can reasonably read or write to a flash chip (for our discussion, 4K), and a block is the smallest chunk of space you can erase at a time (for our discussion, 128 pages). With a fresh (unwritten) block all the bits are set to "1", and during a write they can only be transitioned to "0". That means in order to rewrite a page you must erase it first. This is a very important point: you can't just go and erase a page of the flash, you need to erase the 128 contiguous pages contained in a whole block at the same time.
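
    Those rules are easy to model. The following is a rough Python sketch, not a description of any particular chip: the FlashBlock class and its method names are hypothetical, and the page is shrunk to 4 bytes so the values are easy to read. Programming a page is modeled as a bitwise AND, which can clear bits but never set them, and the only way to get 1s back is to erase the whole block.

```python
PAGE_SIZE = 4          # bytes per page (tiny for illustration; the post assumes 4K)
PAGES_PER_BLOCK = 128  # pages per erase block, as in the post


class FlashBlock:
    def __init__(self):
        # A fresh block has every bit set to 1.
        self.pages = [bytes([0xFF] * PAGE_SIZE) for _ in range(PAGES_PER_BLOCK)]

    def program(self, page_no, data):
        """Program one page. Bits can go 1 -> 0 but never 0 -> 1."""
        if len(data) != PAGE_SIZE:
            raise ValueError("data must be exactly one page")
        old = self.pages[page_no]
        # ANDing models the medium: a bit that is already 0 stays 0.
        self.pages[page_no] = bytes(o & d for o, d in zip(old, data))
        # Report whether the page now holds what the caller asked for.
        return self.pages[page_no] == data

    def erase(self):
        """Reset all pages in the block at once; there is no per-page erase."""
        self.pages = [bytes([0xFF] * PAGE_SIZE) for _ in range(PAGES_PER_BLOCK)]


block = FlashBlock()
first_ok = block.program(0, b"\x0f\x0f\x0f\x0f")    # fresh page: works
rewrite_ok = block.program(0, b"\xf0\xf0\xf0\xf0")  # would need 0 -> 1: fails
```

    The second program call leaves the page as all zeros instead of the requested data, which is why a rewrite has to go somewhere else or wait for an erase of the entire block.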

    The SSD controller's view of the world

    Okay, the SSD controller has to deal with ATA's view of the world on one side (LBA), but it also has to deal with the flash chips' view of the world on the other. This is a hugely complicated task, and it is the reason that the quality of drives varies widely. Because the SSD only deals with those points of view, and not the higher level parts of the OS view (objects), we can characterize its behaviors in those terms, though later we will look at how the stack operates as a whole.

    Okay, so let's assume for a second we have a 1MB flash device with two 512KB blocks. This would be sold to the consumer as a 512KB flash drive, because some amount of the internal storage needs to be used for bookkeeping as we shall see. When the controller starts up what the flash looks like (from the point of view of the controller) is this:

     

    empty.png

     

    In the above picture the blue page represents the internal tracking data the drive controller is using for remembering things like wear counts and page indirections. The OS cannot see that page, it is completely internal to the drive.

    The controller can see all of that flash, but it tells the ATA chip on your motherboard there is a single 512KB ATA device there. Now, let's imagine you go to write a page to the drive. What happens?

     

    first-write.png

     

    Okay, the SSD was handed a chunk of data from the ATA bus (the green page). It found somewhere to put it. It also needs to update its tables so it can find it in the future. But if you recall, in order to erase the original table it would have to erase the whole block. That would be super wasteful, since most of the block is empty and flash has a limited number of erase cycles, so instead it just writes a new copy to a free spot in the block, and the old copy of the drive's control data is marked as invalid. Let's imagine we write another page to the drive; we will see something similar occur:

     

    second-write.png

     

    Now, let's write over the first page. This is where things get interesting. We do basically the same thing as before, except that when the write occurs the SSD looks in its tables and sees that the address we are writing to is in use, so it goes and marks the old page as invalid in its tables.

     

    overwrite.png

     

    Going on in this manner, eventually a flash block will have more invalid pages than valid. At that time the drive will sweep through and gather all the valid data into a new block:

     

    gc1.png

     

    Then erase the old block:

     

    gc2.png

     

    At the moment the only way a user generated piece of data can be marked as invalid is if it is overwritten (though that is changing). This has some serious repercussions for the drive GC, as we will see.
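
    The whole sequence (write, overwrite, GC, erase) can be sketched in a few lines of Python. This is a toy under loud assumptions: ToyFTL and all of its fields are hypothetical names, a block is only 4 pages, there are exactly two blocks with one kept as a spare, and the drive's own table pages (the blue pages in the pictures) are held in a Python dict instead of being written to flash. Real FTLs are far more involved.

```python
PAGES_PER_BLOCK = 4  # tiny block so the example is easy to follow


class ToyFTL:
    def __init__(self):
        # Two physical blocks; each slot holds (lba, data) or None.
        self.blocks = [[None] * PAGES_PER_BLOCK for _ in range(2)]
        self.valid = [[False] * PAGES_PER_BLOCK for _ in range(2)]
        self.active = 0     # block currently receiving writes
        self.next_page = 0  # next unwritten page in the active block
        self.table = {}     # LBA -> (block, page) indirection table
        self.erase_count = [0, 0]
        self.pages_copied = 0

    def write(self, lba, data):
        # An overwrite never touches the old page; it is only marked invalid.
        if lba in self.table:
            b, p = self.table[lba]
            self.valid[b][p] = False
        if self.next_page == PAGES_PER_BLOCK:
            self._gc()
        if self.next_page == PAGES_PER_BLOCK:
            raise RuntimeError("drive full: no invalid pages to reclaim")
        b, p = self.active, self.next_page
        self.blocks[b][p] = (lba, data)
        self.valid[b][p] = True
        self.table[lba] = (b, p)
        self.next_page += 1

    def read(self, lba):
        b, p = self.table[lba]
        return self.blocks[b][p][1]

    def _gc(self):
        old, new = self.active, 1 - self.active
        dest = 0
        # Gather every still-valid page into the spare block...
        for p in range(PAGES_PER_BLOCK):
            if self.valid[old][p]:
                lba, data = self.blocks[old][p]
                self.blocks[new][dest] = (lba, data)
                self.valid[new][dest] = True
                self.table[lba] = (new, dest)
                dest += 1
                self.pages_copied += 1
        # ...then erase the old block in one go.
        self.blocks[old] = [None] * PAGES_PER_BLOCK
        self.valid[old] = [False] * PAGES_PER_BLOCK
        self.erase_count[old] += 1
        self.active, self.next_page = new, dest


ftl = ToyFTL()
ftl.write(0, "A")
ftl.write(1, "B")
ftl.write(0, "A2")  # overwrite: the page holding "A" becomes invalid
ftl.write(2, "C")   # the active block is now full
ftl.write(1, "B2")  # forces a GC: only "A2" and "C" need to be copied
```

    In this run the GC copies two pages and erases one block. Note that the only reason it could skip "A" and "B" is that both had been overwritten.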

    Throwing in the filesystem

    Okay, so we have a basic understanding of how a modern SSD works. Now let's throw a file system into the mix. Let's imagine we have a filesystem with a single file in it:

     

    fs1.png

     

    In the above picture the blue page is the SSD's internal tables (which the OS cannot see), the orange is the FS's internal tables (which the SSD can see, but cannot understand), and the green is the actual file data. Since the FS cannot see the SSD's tables and the SSD cannot understand the FS's tables we are now in a situation where no part of the stack has a complete understanding of what is going on. The repercussions of this are most apparent when you delete a file.

     

    fs2.png

     

    So in the above case the filesystem updated its internal tables and wrote them out to the SSD; the SSD then found somewhere to put the new page and modified its tables. But notice how nothing happened to the actual file data (now colored dark green), since a deletion is just a matter of marking a few bits in the FS's tables. The SSD does not know those pages are no longer needed by the FS since it doesn't understand the FS's internal structures, so when the SSD runs its GC algorithms it must preserve them:

     

    fs2-gc.png

     

    Note that the drive preserved all those pages that will never be read again, since the OS considers them free and will use them to write out new files. This is where the new ATA TRIM command makes a difference. What TRIM does is let the OS tell the drive "I have not yet overwritten this data, but I never need it again, so you can throw it out." Let's redo the last scenario with a TRIM aware drive:

     

    fs1.png

     

    Now we delete the file and the OS sends TRIMs for the file pages to the drive:

     

    fs2-trim.png

     

    Notice how the file pages are now marked invalid. At this point when GC runs it can throw them out:

     

    fs2-trim-gc.png

     

    Which results in the drive needing to preserve fewer pages during its GC process. That has several impacts, including:

     


    1. Reducing the time GC takes

    2. Increasing the amount of free space available after a GC (which increases the time it takes for performance to degrade after a GC)

    3. Giving the FTL a wider selection of pages to choose from when it needs a new page to write to, which means it has a better chance of finding low write count pages, increasing the lifespan of the drive

     

    Now, I want to be clear, a sufficiently clever GC on a drive that has enough reserved space might be able to do very well on its own, but ultimately what TRIM does is give a drive's GC algorithm better information to work with, which of course makes the GC more effective. What I showed above was a super simple GC; real drive GCs take a lot more information into account. First off they have to deal with more than two blocks, and their data takes up more than a single page. They track data locality, and they only run against blocks that have hit a certain threshold of invalid pages or have really bad data locality. There are a ton of research papers and patents on the various techniques they use. But they all have to follow certain rules based on the environment they work in; hopefully this post makes some of those clear.
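
    As a back-of-the-envelope illustration of the delete scenario, here is a small Python sketch. It is deliberately simplistic and the function name and parameters are made up: it only tracks which LBAs the drive believes are live after the FS deletes a file, and counts how many pages a GC pass would have to copy.

```python
def pages_gc_must_copy(file_lbas, fs_table_lba, send_trim):
    """Count the pages GC has to preserve after the FS deletes a file."""
    # Before the delete, the drive considers the FS table and the file live.
    live = {fs_table_lba} | set(file_lbas)
    # Deleting rewrites the FS table. That is an overwrite of a known LBA, so
    # the drive invalidates the old table page and keeps only the new copy.
    # The set of live LBAs is unchanged by that rewrite.
    if send_trim:
        # TRIM tells the drive these LBAs will never be read again.
        live -= set(file_lbas)
    return len(live)


# One FS table page at LBA 0 and a five page file at LBAs 1 through 5.
without_trim = pages_gc_must_copy(range(1, 6), 0, send_trim=False)
with_trim = pages_gc_must_copy(range(1, 6), 0, send_trim=True)
print(without_trim, with_trim)  # 6 1
```

    Without TRIM the drive carries five dead pages through every GC until the OS happens to overwrite those LBAs; with TRIM only the FS table page survives.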

    Sunday
    Apr 19, 2009

    Why you can't just unplug a drive (a brief history of filesystem consistency)

    After my first post Andrew Wooster made a few really good tweets in response. 140 characters is enough space for two people who understand the issues involved to say a few things, but for anyone who is not familiar with the details those tweets probably don't make a great deal of sense. What Andrew brought up really deserves a good explanation; unfortunately you are going to have to settle for mine.

    How do filesystems work?

    Well, filesystems are very complicated things, and they differ quite a bit, but the basic notion is that they track two things. The first thing they track is data, which is organized into files. That is what users tend to care about. The other thing they track is metadata, which is "data about data." Some metadata has meaning to the user (the time the file was written, the name of the file, the location of the file in the directory structure), and some of it has no direct meaning to the user (the blocks on the disk the file is stored in, checksums about the data, etc). What kind of metadata a filesystem stores is an interesting topic, but it is irrelevant to the issue of disk consistency. Ars Technica discusses metadata in detail in their history of filesystems article.

    So let's imagine we have a file, and in the middle of it we want to change one letter. You simply read that block off the disk, modify it, and write it back out. It's simple, and there is not much opportunity for something to go wrong. Now let's imagine a slightly more complicated scenario, where we want to add some more data, which extends the length of the file. To do that we will need to find some extra free space, update the metadata to point the file to that space, write the new data to that space, and indicate the blocks holding the old data are free, in whatever structures the filesystem uses to track free space.

    Careful observers probably already see what is coming bu....

    System Crash!!!

    Okay, so I just rebooted, and when I came back this file was totally garbled. Let's figure out what happened. I hit save, and the file system found some free space and updated the metadata, but before it had written the new data out to that spot the system crashed. When it rebooted it read the disk metadata which pointed to the new blocks, which still contained whatever garbage they used to contain. That is totally unacceptable, but there is a solution. If the filesystem does things in a specific order everything is much safer:

     


    1. Find free space

    2. Write new data to free space

    3. Change metadata to point to new data

    4. Mark old data as free

     

    If we always do it in that order the filesystem will always point to valid data. It might be the old data or the new data (depending on when the crash happens). There is also a chance we might accidentally leave unused blocks marked as used, but that is completely acceptable since it just means that we lose a little bit of space that we can recover through a normal disk scan later.
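
    The difference between the two orderings can be shown with a small Python simulation. This is only a sketch: the disk is a dict, the function and step names are invented for the example, and "find free space" is treated as a read with nothing to persist. It replays the update, "crashes" after a given number of steps, and reports what the file's metadata points at after the reboot.

```python
def run_update(order, crash_after):
    """Apply the steps in `order`, stopping after `crash_after` of them.

    Returns the data the file's metadata points at after the 'reboot'.
    """
    disk = {0: "old data", 1: "garbage"}  # block 1 is free and holds stale junk
    meta = {"file_block": 0, "used": {0}}
    new_block = 1  # result of "find free space"

    def write_data():
        disk[new_block] = "new data"

    def point_metadata():
        meta["file_block"] = new_block
        meta["used"].add(new_block)

    def free_old():
        meta["used"].discard(0)

    steps = {
        "write_data": write_data,
        "point_metadata": point_metadata,
        "free_old": free_old,
    }
    for name in order[:crash_after]:
        steps[name]()
    return disk[meta["file_block"]]


SAFE = ["write_data", "point_metadata", "free_old"]
UNSAFE = ["point_metadata", "write_data", "free_old"]

safe_results = [run_update(SAFE, n) for n in range(4)]
unsafe_results = [run_update(UNSAFE, n) for n in range(4)]
```

    With the safe ordering every crash point yields either the old data or the new data. With the metadata written first, a crash after one step leaves the file pointing at garbage, which is exactly the garbled file described above.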

    The above process is simpler than what is done on real filesystems for various reasons, but is essentially the way that traditional UNIX filesystems worked. If you want a good explanation of the gory details Marshall Kirk McKusick goes into it in great detail in his books and lectures.

    Okay, so if that is how things work, why can't I disconnect my USB drive whenever I want?

    That is not how it works now, that is how it worked 25 years ago. It did not even work quite like that back then, but it was close enough that it is not worth getting into the minutiae that are historical footnotes. So why doesn't it work that way any more? In short, it doesn't work that way because hard disks are slow.

    But my HD does 90MB/s!

    Here is a dirty little secret. Those hard drives only do 90MB/s in scenarios that are not particularly relevant to most users. Specifically, that number is attained by doing sequential 128KB IO requests on the exterior track of the drive. So let's do some math:

     


    • 90 MB/s * 1024 KB/1MB = 92160 KB/s

    • 92160 KB/s / (128KB/1IO) == 720 IO/s

     

    So your drive is doing 720 IOs per second (IOPS). The thing is, a 128K IO takes the same amount of time as a 512 byte IO. So if we look at those above steps, 1 is at least a single metadata read, 2 is our data getting written, and 3 and 4 are each at least a single metadata write. So for each one of those 128KB writes the user is doing, there are at a minimum 3 other IOs going on behind their back to keep track of information. If we do the 4 step commit process described above that means your 90MB/s drive is going to top out at 22.5MB/s.
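
    For anyone who wants to play with the numbers, the arithmetic is reproduced below in Python. The variable names are mine; the inputs (90MB/s, 128KB IOs, 4 IOs per user write) are the ones used above.

```python
throughput_mb_s = 90     # advertised sequential throughput
io_size_kb = 128         # size of each sequential IO
ios_per_user_write = 4   # 1 metadata read + 1 data write + 2 metadata writes

# 90 MB/s * 1024 KB/MB / 128 KB per IO = IOs per second
iops = throughput_mb_s * 1024 / io_size_kb

# Only one IO in four carries user data, so user-visible throughput drops.
user_writes_per_s = iops / ios_per_user_write
effective_mb_s = user_writes_per_s * io_size_kb / 1024

print(iops, effective_mb_s)  # 720.0 22.5
```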

    It gets even worse than that. Most people are not writing sequential streamed data, so their IO sizes will be below 128K. The other thing is the metadata and the data are not sequential. Since they are not sequential the drive head has to seek, which is very slow. While it is seeking you can't perform an IO, so the IOPS of the drive will reduce greatly if you are doing random IOs (regardless of size). The result is that 90MB/s HD of yours probably only gets ~1-2MB/s when doing 4KB random IOs.

    That sucks

    Yeah it does, and it is why SSDs are so awesome. I recently bought an OCZ Vertex and it is the single biggest performance upgrade I have had in years. That is a completely separate issue for another blog post.

    So, how does it being slow make it unsafe to unplug?

    Well, the thing is that if software had to wait that long for the drive, computers would be unusable. Modern systems are large, they are graphically intensive, and people do things like playing movies and listening to audio. Hard drives have not increased their IOPS performance in the way RAM and processor performance has increased, so to mitigate the drive performance issues (which are generally the limiting factor for most users; 9 times out of 10 that spinning beach ball is caused because you are waiting for something from a disk somewhere) people decided to get clever, and whenever anyone gets clever there is a downside. In this case the two main improvements are caching and asynchronous IO.

    We'll go through caching first, because it is simpler. The basic notion of caching is that you store something that is slow to get somewhere it is fast to get. Generally the fast location is more expensive, which makes it limited. In particular RAM is more expensive per MB than disk space, but it is also tremendously faster. Modern OSes all use a portion of their RAM to cache disk contents. While they do cache frequently read data, the more important issue is that they write cache. What that means is that when a program tells the OS to write something to the disk it doesn't get written right then. Instead the OS stores the data for a while. That lets it try to perform several optimizations. First off it can attempt to cluster together multiple small writes into a single big write. Second it can attempt to schedule the writes to minimize the amount of seeking going on. Finally, often a program rewrites the same file multiple times. By delaying writing the file it is possible to discard some of the intermediary writes if you see they have been overwritten. It also lets the OS merge write requests from completely unrelated processes.
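
    Here is a rough Python sketch of those three optimizations. The WriteCache class is hypothetical and much simpler than a real OS buffer cache (no size limit, no timers, no read caching). It keeps only the latest data for each LBA, sorts the pending writes by address, and merges adjacent LBAs into a single larger IO when it finally flushes.

```python
class WriteCache:
    def __init__(self):
        self.pending = {}  # LBA -> latest data; later writes replace earlier ones
        self.writes_accepted = 0

    def write(self, lba, data):
        # The program is told "done" right away; nothing has hit the disk yet.
        self.pending[lba] = data
        self.writes_accepted += 1

    def flush(self):
        """Return the IOs to send to the disk as (start_lba, [blocks...])."""
        ios = []
        for lba in sorted(self.pending):  # ascending order minimizes seeking
            if ios and ios[-1][0] + len(ios[-1][1]) == lba:
                ios[-1][1].append(self.pending[lba])  # extend the previous IO
            else:
                ios.append((lba, [self.pending[lba]]))
        self.pending.clear()
        return ios


cache = WriteCache()
cache.write(7, "g")
cache.write(3, "c")
cache.write(4, "d")
cache.write(3, "c2")  # rewrites LBA 3; the earlier "c" is never sent to disk
cache.write(5, "e")
ios = cache.flush()
print(ios)  # [(3, ['c2', 'd', 'e']), (7, ['g'])]
```

    Five writes from programs turn into two IOs to the drive. Until flush runs, though, all of that data exists only in memory.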

    The second issue is asynchronous IO. Earlier when I described the 4 step file update it was implicit that we would wait for step 1 to finish before we started step 2. That is referred to as synchronous IO. It would also be possible to implement a system where you send commands to the drive, and don't wait for them to complete, you just keep sending more commands. That is asynchronous IO. Asynchronous IO is much faster, because you don't have to wait for IOs to complete. If you send too many IOs to the drive it will eventually tell you to back off, but just like the OS is doing caching, so is the drive. The drive's cache allows it to queue up many IOs even if it is busy doing other things.

    These two enhancements greatly improve performance, but at a significant cost. Caching means that even if you have told the OS to write files to the disk, they might not be there. If all the commands were issued from the cache to the drive in the same order they were issued from the programs, unplugging the drive might still be safe, but if they were issued in that order with no changes there would be almost no speed up, since all the optimizations that write caching allows reorder writes. Of course that is a non-issue because of the asynchronous writes.

    Asynchronous writes allow the drive to perform optimizations similar to the OS's within the drive itself. That means that even if the OS tells the drive to write block 1 then block 2 there is no guarantee the drive will choose to write them in that order. Since the drive doesn't guarantee the ordering any more, the fact that the OS doesn't order them is irrelevant.

    That is just insane!

    Yeah, it is. Fortunately that is not what happens. At least not any more. This problem has been obvious for years, and people have been working on it for years. There have been many clever ideas, including designing filesystems that are laid out in fundamentally different ways to better cope with the performance constraints of hard drives. The details of how all of those work and their evolution are not worth getting into in depth here, but the thing to understand is that the changes necessary are complicated. Despite the first such filesystems appearing ~26 years ago (Sprite), and a constant stream of research designs over the years (LFS, LinLogFS) and embedded filesystems (WAFL), we are just finally starting to see general purpose designs that are usable by large user bases (ZFS, btrfs, Tux3). ZFS is just now starting to be used in production environments, and it has been under active development for ~8 years now. It just takes a long time to develop and stabilize a new filesystem design.

    So we have had a problem that has been clear for ~30 years, and we have had people trying to solve it for almost as long. In the software world that is an eternity, so we did what we always do: we came up with a clever hack that sort of dealt with it for a while.

    Enter Journaling

    Journaling is a hack, get over it ;-) People often talk about journaling FSes, like journaling is a big deal in the design of the FS. The important thing about journaling is that it is really easy to retrofit into an existing filesystem that was not designed with asynchronous consistency in mind. Not only is it easy to add to an existing filesystem, it is easy to add in a way that does not change the volume format in a way that is incompatible with existing implementations. I know someone who added journaling to a FAT16 implementation. That is why all existing filesystems are journaled, and almost no next generation ones are. It's not optimal, but it is a very simple extension to filesystems that are otherwise based on research from the 1970s and don't take into account how drive sync speeds have not scaled with other aspects of computers.

    So, how does journaling work? It's simple. The idea is to dedicate a portion of the disk to be the journal. Then what you do is you look at the safe ordering of operations you need to perform. You write out the list of things you need to do to the journal all at once, as a single synchronous write. You then write them where they are supposed to go asynchronously. If the system crashes you simply read through the journal and write out the list of things it said needed to be done. If they already happened before the crash you are overwriting what was written with identical data, so there is no problem. If they did not make it to the disk you are writing out consistent data that you knew you needed to write out before the crash, so no problem.
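
    That recipe is short enough to sketch in Python. Treat this as a toy with invented names (JournaledDisk, commit, apply, replay); it journals whole block contents, keeps everything in dicts, and fakes a crash with a crash_after argument. Real journals have to deal with checksums, wraparound, and a great deal more.

```python
class JournaledDisk:
    def __init__(self):
        self.blocks = {}   # main area: block number -> data
        self.journal = []  # committed transactions not yet checkpointed

    def commit(self, updates):
        """Record the whole transaction in the journal (the synchronous write)."""
        self.journal.append(dict(updates))

    def apply(self, updates, crash_after=None):
        """Write the updates in place; optionally 'crash' part way through."""
        for i, (blk, data) in enumerate(updates.items()):
            if crash_after is not None and i >= crash_after:
                return False  # power lost: remaining blocks never hit the disk
            self.blocks[blk] = data
        return True

    def checkpoint(self):
        """Everything is safely in place, so the journal can be emptied."""
        self.journal.clear()

    def replay(self):
        """After a crash, redo every journaled write. Redoing is harmless."""
        for txn in self.journal:
            for blk, data in txn.items():
                self.blocks[blk] = data
        self.journal.clear()


disk = JournaledDisk()
disk.blocks = {10: "old meta", 20: "old data"}
txn = {20: "new data", 10: "new meta"}

disk.commit(txn)
disk.apply(txn, crash_after=1)    # only block 20 is written before the crash
state_after_crash = dict(disk.blocks)
disk.replay()                     # reboot: redo the journaled writes
state_after_replay = dict(disk.blocks)
```

    After the crash the disk holds new data with old metadata. Replaying the journal rewrites block 20 with identical contents and finishes block 10, so the volume ends up consistent either way.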

    To retrofit that into an existing filesystem you just create a special reserved journal file. Older implementations will ignore the journal file, so they will have to run synchronously, but they will be able to read the drive. Newer implementations will get to make most of their writes asynchronously. There is some potential extra risk moving drives back and forth between systems that do and don't support it, but in practice that tends to be insignificant.

    There are of course many variations on this that subtly change the semantics of what is going on. You can journal every write to the drive, or just the metadata. If you journal just the metadata there are other dependency issues that may occur. The recent debate about ext3's data=ordered behavior is an example of that. If you want a more detailed explanation of journaling (and many other details on filesystem design), I recommend Practical File System Design with the Be File System (free pdf) by Dominic Giampaolo.

    So, wait, if journaling does that, I should be able to just pull out my drive

    Sorry, no. There are two big gotchas.

    The first is that journaling made it so the drive was never going to be in a corrupt state, but the fact that the OS is caching data means you still don't know whether any particular file has been written to the drive unless you force it (from a programmer's standpoint by synching the drive, from a user's standpoint by ejecting the drive). Also, if you remove and reinsert the drive the OS has no way to know what is on the drive, so it can't just start writing out the cache. It is entirely possible you plugged it into another computer in the interim and modified it. Before anyone claims only idiots would do that, consider how often the average user puts their laptop to sleep, then unplugs their thumb drive, then puts it into another computer. It is worth noting that some of the newer FSes (like ZFS and btrfs) are designed in such a way that doing that is safe, but that is one of the consequences of being designed from the ground up to deal with that sort of thing. There is no obvious way to retrofit that sort of functionality onto existing filesystems.

    The second is a little more disturbing, and that is that hard drives just flat out lie to the OS.

    WHAT!?!?!

    It is a little known fact that hard drives pretty much straight up lie. Let me explain. Remember when I discussed the idea of asynchronous IO, and I mentioned that the HD had an internal cache? Well, I didn't exactly explain what is going on. From the OS's standpoint a write is synchronous if it waits until the HD says that it has the information, not until the HD has written the information down on magnetic media. Once the data is in the drive's cache it is scheduled to be written out, and it tends to get written out fairly quickly, but high-end consumer drives have ~32MB of cache, which can take multiple seconds to write out. If they lose power then the data in that cache is lost. To guarantee that data hits the disk you need to instruct the drive to flush its internal cache. Some OSes provide an additional hook to let applications force them to flush the disk (F_FULLFSYNC on Mac OS X). Internally, when a journal is written the OS needs to make sure the journal hit the disk, so even if the OS does not expose that functionality to applications it needs to use it internally.
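
    For the programmers, here is a rough sketch of what asking for that flush can look like from an application, in Python. The helper names (flush_to_media, write_durably) and the demo file are my own inventions for illustration. The sketch assumes that Python's fcntl module only defines F_FULLFSYNC on Mac OS X, so everywhere else it falls back to a plain fsync, and what that does about the drive's cache depends on the OS. And of course none of this helps if the drive or a bridge chip ignores the command.

```python
import fcntl
import os
import tempfile


def flush_to_media(fd):
    """Push pending writes for fd out of the OS cache, then ask the drive
    to flush its own cache if the platform gives us a way to ask.

    Returns True only if the full flush request was issued and accepted.
    """
    os.fsync(fd)  # OS cache -> drive (the drive may still hold it in its cache)
    full_sync = getattr(fcntl, "F_FULLFSYNC", None)  # assumed Mac OS X only
    if full_sync is None:
        return False
    try:
        fcntl.fcntl(fd, full_sync)  # request that the drive flush its cache
    except OSError:
        return False  # some filesystems reject the request
    return True


def write_durably(path, data):
    """Write data to path and request that it reach the physical media."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # Python's buffer -> OS cache
        return flush_to_media(f.fileno())


# Small demo against a throwaway file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "journal.bin")
reached_media = write_durably(demo_path, b"committed transaction")
```

    The return value only tells you whether the request was made and accepted, not whether the hardware actually did the work.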

    The problem is that there is almost no incentive for HD manufacturers to honor a synchronize command. After all, if they just do nothing when they get the command everything runs faster. It also doesn't make a difference unless something else goes wrong, and the majority of the time it works out okay even then. When it doesn't work out the consumer has no way to know that the problem was their drive. They tend to blame the power company for an outage, and MS or Apple for their OSes, but most people never think that the drive isn't writing out the data the OS told it to.

    Don't get me wrong, people blame their drives when a drive physically dies or reports a bad sector, but most people just aren't equipped to determine that a drive is lying to them. Hell, I know they probably lie to me, but determining whether a particular drive is honoring synchs or not takes a tremendous amount of effort, and it is impossible for me to 100% confirm a drive is compliant without dissecting its firmware. Fortunately big computer companies care about this. For instance, Apple cares that they don't get blamed for data loss, so they force their HD providers to honor the command. The same is true for many other big name computer vendors, though I don't have first hand knowledge of who else does it. Don't get me wrong, Apple labelled HDs still carry a huge price premium for a feature that should be turned on for all HDs, but you should be blaming the HD manufacturers for gaming the benchmarking results at the expense of their users' data.

    Okay, so if I buy an Apple HD and stick it in a USB-SATA dock and pull the plug I might not have all the data I was just working on written out to disk, but at least the FS won't be corrupt, right...

    I keep giving people bad news. The drive supports synch, and the OS is sending the command, but the thing is that there is a chip in between the USB and SATA ports that is converting between them, and almost all bridge chips just ignore that command instead of converting it. The only firewire or USB drives that I know of that honor synch commands are iPods and Macs in target disk mode. See, the USB chip guys are just trying to implement these things cheaply. Just like the HD guys, they don't get blamed and it is mostly undetectable, so they skimp.

    I should point out it is not just the cheap guys though. I once was testing a high end embedded HW RAID controller that used fibrechannel to the host and ATA to the drives, and after seeing some data corruption we hooked it up to an analyzer and saw that it completely dropped all the synch commands. I guess most of that company's customers did not have ATA analyzers handy....

    My data is doomed 8-(

    No, it isn't doomed, but it is not as safe as it should be. You will be okay if you do three simple things:

     


    1. Always unmount your drives and wait for their activity lights to stop before you disconnect them.
    2. Use SATA, eSATA, SAS, or Fibrechannel for primary storage. Those are unbridged, so the OS's synchronize commands will get to the drive.
    3. Use server class drives, or OEM drives from big vendors. Those drives will honor the synchs correctly. That should include all SAS and Fibrechannel drives, and most high-end SATA drives.

     

     

    Limit your use of USB and firewire drives to bulk storage, or disconnected backups. So long as you fully unmount them before you disconnect them, the dropped synch issue only has the potential to cause trouble if you lose power or kernel panic mid write. While that is still a problem, it is a fairly small window, and you are unlikely to have issues in practice. Of course, my system kernel panicked coming out of sleep while I was writing this post (I lost 3 pages, but it is a better post the second time around), but I think that was just the universe trying to remind me that sometimes Mac OS X does crash...