After my first post Andrew Wooster made a few really good tweets in response. 140 characters is enough space for two people who understand the issues involved to say a few things, but for anyone who is not familiar with the details those tweets probably don't make a great deal of sense. What Andrew brought up really deserves a good explanation; unfortunately you are going to have to settle for mine.
How do filesystems work?
Well, filesystems are very complicated things, and they differ quite a bit, but the basic notion is that they track two things. The first thing they track is data, which is organized into files. That is what users tend to care about. The other thing they track is metadata, which is "data about data." Some metadata has meaning to the user (the time the file was written, the name of the file, the location of the file in the directory structure), and some of it has no direct meaning to the user (the blocks on the disk the file is stored in, checksums about the data, etc). What kind of metadata a filesystem stores is an interesting topic, but it is irrelevant to the issue of disk consistency. Ars Technica discusses metadata in detail in their history of filesystems article.
So let's imagine we have a file, and in the middle of it we want to change one letter. You simply read that block off the disk, modify it, and write it back out. It's simple, and there is not much opportunity for something to go wrong. Now let's imagine a slightly more complicated scenario, where we want to add some more data, which extends the length of the file. To do that we will need to find some extra free space, update the metadata to point the file to that space, write the new data to that space, and indicate the blocks holding the old data are free, in whatever structures the filesystem uses to track free space.
Careful observers probably already see what is coming bu....
System Crash!!!
Okay, so I just rebooted, and when I came back this file was totally garbled. Let's figure out what happened. I hit save, and the filesystem found some free space and updated the metadata, but before it had written the new data out to that spot the system crashed. When it rebooted it read the disk metadata, which pointed to the new blocks, which still contained whatever garbage they used to contain. That is totally unacceptable, but there is a solution. If the filesystem does things in a specific order everything is much safer:
- Find free space
- Write new data to free space
- Change metadata to point to new data
- Mark old data as free
If we always do it in that order the filesystem will always point to valid data. It might be the old data or the new data (depending on when the crash happens). There is also a chance we might accidentally leave unused blocks marked as used, but that is completely acceptable since it just means that we lose a little bit of space that we can recover through a normal disk scan later.
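To make the ordering argument concrete, here is a minimal sketch of the four steps with a simulated crash after each one. It is a toy model, not how any real filesystem is implemented: the "disk" is just a Python dict, and every name in it (`new_disk`, `safe_update`, `contents_after_crash`) is made up for illustration.

```python
class Crash(Exception):
    """Raised to simulate the system dying partway through an update."""


def new_disk():
    # Hypothetical toy disk: two blocks, one file, one free block.
    return {
        "blocks": {0: "old data", 1: "garbage"},
        "file_block": 0,   # metadata: which block holds the file
        "free": {1},       # free-space tracking
    }


def safe_update(disk, new_data, crash_after=None):
    """Run the four ordered steps, optionally crashing after step N."""
    def step_done(n):
        if crash_after == n:
            raise Crash(f"crashed after step {n}")

    target = min(disk["free"])           # 1. find free space
    step_done(1)
    disk["blocks"][target] = new_data    # 2. write new data to free space
    step_done(2)
    old = disk["file_block"]
    disk["file_block"] = target          # 3. change metadata to point to new data
    disk["free"].discard(target)
    step_done(3)
    disk["free"].add(old)                # 4. mark old data as free
    step_done(4)


def read_file(disk):
    return disk["blocks"][disk["file_block"]]


def contents_after_crash(crash_after):
    """What the file contains after rebooting from a crash at that step."""
    disk = new_disk()
    try:
        safe_update(disk, "new data", crash_after)
    except Crash:
        pass
    return read_file(disk)


if __name__ == "__main__":
    for step in (1, 2, 3, 4, None):
        print(step, "->", contents_after_crash(step))
```

Whichever step the crash lands on, the file reads back as either the old data or the new data, never the garbage. A crash after step 3 does leave block 0 marked as used even though nothing points to it, which is the leaked space mentioned above.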
The above process is simpler than what is done on real filesystems for various reasons, but it is essentially the way that traditional UNIX filesystems worked. If you want a good explanation of the gory details, Marshall Kirk McKusick goes into them in great detail in his books and lectures.
Okay, so if that is how things work, why can't I disconnect my USB drive whenever I want?
That is not how it works now, that is how it worked 25 years ago. It did not even work quite like that back then, but it was close enough that it is not worth getting into the minutiae that are historical footnotes. So why doesn't it work that way any more? In short, it doesn't work that way because hard disks are slow.
But my HD does 90MB/s!
Here is a dirty little secret. Those hard drives only do 90MB/s in scenarios that are not particularly relevant to most users. Specifically, that number is attained by doing sequential 128KB IO requests on the exterior track of the drive. So let's do some math:
- 90 MB/s * (1024 KB / 1 MB) = 92160 KB/s
- 92160 KB/s / (128 KB / 1 IO) = 720 IO/s
So your drive is doing 720 IOs per second (IOPS). The thing is, a 128K IO takes about the same amount of time as a 512 byte IO. So if we look at those above steps, 1 is at least a single metadata read, 2 is our data getting written, and 3 and 4 are each at least a single metadata write. So for each one of those 128KB writes the user is doing, there are at a minimum 3 other IOs going on behind their back to keep track of information. If we do the 4 step commit process described above, that means your 90MB/s drive is going to top out at 22.5MB/s.
It gets even worse than that. Most people are not writing sequential streamed data, so their IO sizes will be below 128K. The other thing is the metadata and the data are not sequential. Since they are not sequential the drive head has to seek, which is very slow. While it is seeking you can't perform an IO, so the IOPS of the drive will drop greatly if you are doing random IOs (regardless of size). The result is that 90MB/s HD of yours probably only gets ~1-2MB/s when doing 4KB random IOs.
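If you want to play with those numbers, the arithmetic can be sketched as below. The function names are mine, and the model rests on the same simplifying assumption as the text: every IO costs the same regardless of size.

```python
def sequential_iops(throughput_mb_s, io_size_kb):
    """IOs per second implied by a sequential throughput figure."""
    return throughput_mb_s * 1024 / io_size_kb


def effective_throughput_mb_s(throughput_mb_s, io_size_kb, ios_per_write):
    """User-visible MB/s when each data write costs `ios_per_write` IOs in total."""
    data_writes_per_s = sequential_iops(throughput_mb_s, io_size_kb) / ios_per_write
    return data_writes_per_s * io_size_kb / 1024


if __name__ == "__main__":
    print(sequential_iops(90, 128))               # 720.0
    print(effective_throughput_mb_s(90, 128, 4))  # 22.5
```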
That sucks
Yeah it does, and it is why SSDs are so awesome. I recently bought an OCZ Vertex and it is the single biggest performance upgrade I have had in years. That is a completely separate issue for another blog post.
So, how does it being slow make it unsafe to unplug?
Well, the thing is that if software had to wait that long for the drive, computers would be unusable. Modern systems are large, they are graphically intensive, and people do things like playing movies and listening to audio. Hard drives have not increased their IOPS performance in the way RAM and processor performance has increased, so to mitigate the drive performance issues (which is generally the limiting factor for most users, 9 times out of 10 that spinning beach ball is caused because you are waiting for something from a disk somewhere) people decided to get clever, and whenever anyone gets clever there is a downside. In this case the two main improvements are caching and asynchronous IO.
We'll go through caching first, because it is simpler. The basic notion of caching is that you store something that is slow to get somewhere it is fast to get. Generally the fast location is more expensive, which makes it limited. In particular RAM is more expensive per MB than disk space, but it is also tremendously faster. Modern OSes all use a portion of their RAM to cache disk contents. While they do cache frequently read data, the more important issue is that they write cache. What that means is that when a program tells the OS to write something to the disk it doesn't get written right then. Instead the OS stores the data for a while. That lets it try to perform several optimizations. First off it can attempt to cluster together multiple small writes into a single big write. Second it can attempt to schedule the writes to minimize the amount of seeking going on. Finally, often a program rewrites the same file multiple times. By delaying writing the file it is possible to discard some of the intermediary writes if you see they have been overwritten. It also lets the OS merge write requests from completely unrelated processes.
The second issue is asynchronous IO. Earlier when I described the 4 step file update it was implicit that we would wait for step 1 to finish before we started step 2. That is referred to as synchronous IO. It would also be possible to implement a system where you send commands to the drive, and don't wait for them to complete, you just keep sending more commands. That is asynchronous IO. Asynchronous IO is much faster, because you don't have to wait for IOs to complete. If you send too many IOs to the drive it will eventually tell you to back off, but just like the OS is doing caching, so is the drive. The drive's cache allows it to queue up many IOs even if it is busy doing other things.
These two enhancements greatly improve performance, but at a significant cost. Caching means that even if you have told the OS to write files to the disk, they might not be there. If all the commands were issued from the cache to the drive in the same order they were issued from the programs, unplugging the drive might still be safe, but if they were issued in that order with no changes there would be almost no speedup, since all the optimizations that write caching allows reorder writes. Of course that is a non-issue because of the asynchronous writes.
Asynchronous writes allow the drive to perform optimizations similar to the OS's within the drive itself. That means that even if the OS tells the drive to write block 1 then block 2 there is no guarantee the drive will choose to write them in that order. Since the drive doesn't guarantee the ordering any more, the fact that the OS doesn't order them is irrelevant.
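Here is a rough sketch of what a write cache buys you. It is a toy, assuming a "drive" that is just a dict and a hypothetical `WriteCache` class of my own invention; real OS caches are far more involved, but the two effects shown (dropping overwritten intermediate writes, and issuing writes in block order) are the ones described above.

```python
class WriteCache:
    """Toy write-back cache: holds dirty blocks in memory until flushed."""

    def __init__(self):
        self.dirty = {}        # block number -> latest data written to it
        self.disk = {}         # stands in for the real drive
        self.disk_writes = 0   # how many IOs actually reached the "drive"

    def write(self, block, data):
        # A later write to the same block replaces the earlier one,
        # so intermediate versions never reach the disk.
        self.dirty[block] = data

    def flush(self):
        # Issue writes in block order to mimic seek-minimizing scheduling.
        for block in sorted(self.dirty):
            self.disk[block] = self.dirty[block]
            self.disk_writes += 1
        self.dirty.clear()


def demo():
    cache = WriteCache()
    for version in range(5):
        cache.write(7, f"draft {version}")   # same block rewritten 5 times
    cache.write(3, "other file")
    on_disk_before_flush = dict(cache.disk)  # still empty at this point
    cache.flush()
    return on_disk_before_flush, cache.disk, cache.disk_writes


if __name__ == "__main__":
    print(demo())
```

Six writes from the program turn into two writes to the drive. The catch is also visible: until `flush()` runs, the drive has none of it, which is exactly the window in which pulling the plug loses data.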
That is just insane!
Yeah, it is. Fortunately that is not what happens. At least not any more. This problem has been obvious for years, and people have been working on it for years. There have been many clever ideas, including designing filesystems that are laid out in fundamentally different ways to better cope with the performance constraints of hard drives. The details of how all of those work and the evolution of them are not worth getting into in depth here, but the thing to understand is that the changes necessary are complicated. Despite the first such filesystems appearing ~26 years ago (Sprite), and a constant stream of research designs over the years (LFS, LinLogFS) and embedded filesystems (WAFL), we are just finally starting to see general purpose designs that are usable by large user bases (ZFS, btrfs, Tux3). ZFS is just now starting to be used in production environments, and it has been under active development for ~8 years now. It just takes a long time to develop and stabilize a new filesystem design.
So we have had a problem that has been clear for ~30 years, and we have had people trying to solve it for almost as long. In the software world that is an eternity, so we did what we always do: we came up with a clever hack that sort of dealt with it for a while.
Enter Journaling
Journaling is a hack, get over it ;-) People often talk about journaling FSes, like journaling is a big deal in the design of the FS. The important thing about journaling is that it is really easy to retrofit into an existing filesystem that was not designed with asynchronous consistency in mind. Not only is it easy to add to an existing filesystem, it is easy to add in a way that does not change the volume format in a way that is incompatible with existing implementations. I know someone who added journaling to a FAT16 implementation. That is why all existing filesystems are journaled, and almost no next generation ones are. It's not optimal, but it is a very simple extension to filesystems that are otherwise based on research from the 1970s and don't take into account how drive synch speeds have not scaled with other aspects of computers.
So, how does journaling work? It's simple. The idea is to dedicate a portion of the disk to be the journal. Then what you do is you look at the safe ordering of operations you need to perform. You write out the list of things you need to do to the journal all at once, as a single synchronous write. You then write them where they are supposed to go asynchronously. If the system crashes you simply read through the journal and write out the list of things it said needed to be done. If they already happened before the crash you are overwriting what was written with identical data, so there is no problem. If they did not make it to the disk you are writing out consistent data that you knew you needed to write out before the crash, so again no problem.
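The log-then-apply-then-replay idea can be sketched in a few lines. This is a toy under loud assumptions: `JournaledDisk` is a hypothetical class, the journal is a Python list rather than a reserved region of a disk, and the "crash" is just an early return. It only illustrates why replaying is harmless whether or not the writes had already landed.

```python
class JournaledDisk:
    """Toy journal: a transaction is logged in full before it is applied."""

    def __init__(self):
        self.blocks = {}    # the main area of the disk
        self.journal = []   # the reserved journal area

    def commit(self, updates, crash_after=None):
        """Log `updates` (a list of (block, data) pairs), then apply them.

        `crash_after` simulates a crash once that many updates have reached
        the main area; None means the transaction runs to completion.
        """
        self.journal.append(list(updates))      # the single synchronous write
        for i, (block, data) in enumerate(updates):
            if crash_after is not None and i >= crash_after:
                return False                    # crashed mid-transaction
            self.blocks[block] = data
        self.journal.pop()                      # fully applied; retire the entry
        return True

    def replay(self):
        """After a crash, redo everything the journal says was pending."""
        for updates in self.journal:
            for block, data in updates:
                self.blocks[block] = data       # rewriting identical data is harmless
        self.journal.clear()


def demo():
    disk = JournaledDisk()
    disk.commit([(1, "data v1"), (2, "meta v1")])
    disk.commit([(1, "data v2"), (2, "meta v2")], crash_after=1)
    torn = dict(disk.blocks)   # new data next to stale metadata
    disk.replay()
    return torn, disk.blocks


if __name__ == "__main__":
    print(demo())
```

After the simulated crash the main area is inconsistent (new data, old metadata), and after the replay it matches what the journal promised.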
To retrofit that into an existing filesystem you just create a special reserved journal file. Older implementations will ignore the journal file, so they will have to run synchronously, but they will be able to read the drive. Newer implementations will get to make most of their writes asynchronously. There is some potential extra risk moving drives back and forth between systems that do and don't support it, but in practice that tends to be insignificant.
There are of course many variations on this that subtly change the semantics of what is going on. You can journal every write to the drive, or just the metadata. If you journal just the metadata there are other dependency issues that may occur. The recent debate about ext3's data=ordered behavior is an example of that. If you want a more detailed explanation of journaling (and many other details on filesystem design), I recommend Practical File System Design with the Be File System (free pdf) by Dominic Giampaolo.
So, wait, if journaling does that, I should be able to just pull out my drive
Sorry, no. There are two big gotchas.
The first is that journaling made it so the drive was never going to be in a corrupt state, but the fact that the OS is caching data means you still don't know whether any particular file has been written to the drive unless you force it (from a programmer's standpoint by synching the drive, from a user's standpoint by ejecting the drive). Also, if you remove and reinsert the drive the OS has no way to know what is on the drive, so it can't just start writing out the cache. It is entirely possible you plugged it into another computer in the interim and modified it. Before anyone claims only idiots would do that, consider how often the average user puts their laptop to sleep, then unplugs their thumb drive, then puts it into another computer. It is worth noting that some of the newer FSes (like ZFS and btrfs) are designed in such a way that doing that is safe, but that is one of the consequences of being designed from the ground up to deal with that sort of thing. There is no obvious way to retrofit that sort of functionality onto existing filesystems.
The second is a little more disturbing, and that is that hard drives just flat out lie to the OS.
WHAT!?!?!
It is a little known fact that hard drives pretty much straight up lie. Let me explain. Remember when I discussed the idea of asynchronous IO, and I mentioned that the HD had an internal cache? Well, I didn't exactly explain what is going on. From the OS's standpoint a write is synchronous if it waits until the HD says that it has the information, not that it has written the information down on magnetic media. Once the drive has the data it is scheduled to be written out, and it tends to get written out fairly quickly, but high-end consumer drives have ~32MB of cache, which can take multiple seconds to write out. If they lose power then the data in that cache is lost. To guarantee that data hits the disk you need to instruct the drive to flush its internal cache. Some OSes provide an additional hook to let applications force them to flush the disk (F_FULLFSYNC on Mac OS X). Internally when a journal is written the OS needs to make sure the journal hit the disk, so even if the OS does not expose that functionality to applications it needs to use it internally.
The problem is that there is almost no incentive for HD manufacturers to honor a synchronize command. After all, if they just do nothing when they get the command everything runs faster. It also doesn't make a difference unless something else goes wrong, and the majority of the time it works out okay even then. When it doesn't work out the consumer has no way to know that the problem was their drive. They tend to blame the power company for an outage, and MS or Apple for their OSes, but most people never think that the drive isn't writing out the data the OS told it to.
Don't get me wrong, people blame their drives when the drives physically die or report a bad sector, but most people just aren't equipped to determine that a drive is lying to them. Hell, I know they probably lie to me, but to determine whether a particular drive is honoring synchs or not takes a tremendous amount of effort, and it is impossible for me to 100% confirm they are compliant without dissecting their firmware. Fortunately big computer companies care about this. For instance, Apple cares that they don't get blamed for data loss, so they force their HD providers to honor the command. The same is true for many other big name computer vendors, though I don't have first hand knowledge of who else does it. Don't get me wrong, Apple labelled HDs are still a huge price premium for a feature that should be turned on for all HDs, but you should be blaming the HD manufacturers for gaming the benchmarking results at the expense of their users' data.
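From a program's point of view, asking for both levels of flush might look roughly like the sketch below. It assumes a POSIX system (Python's `fcntl` module does not exist on Windows), and `durable_write` and `roundtrip_demo` are names I made up. `os.fsync` asks the OS to hand its cached copy to the drive; the `F_FULLFSYNC` fcntl, where the platform defines it, additionally asks the drive to flush its own cache. Whether the drive actually honors that request is, as discussed, another matter.

```python
import fcntl
import os
import tempfile


def durable_write(path, data):
    """Write bytes to `path` and push them as far toward the media as we can."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()                 # empty Python's own buffer into the OS cache
        os.fsync(f.fileno())      # ask the OS to send its cached copy to the drive
        # F_FULLFSYNC only exists on some platforms (notably Mac OS X),
        # so look it up defensively instead of assuming it is there.
        full_sync = getattr(fcntl, "F_FULLFSYNC", None)
        if full_sync is not None:
            try:
                fcntl.fcntl(f.fileno(), full_sync)
            except OSError:
                pass              # not supported on this volume; fsync is all we get


def roundtrip_demo():
    """Write a small file durably and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt")
        durable_write(path, b"hello")
        with open(path, "rb") as f:
            return f.read()


if __name__ == "__main__":
    print(roundtrip_demo())
```

Reading the file back proves nothing about durability, of course, since the read can be served from the cache. It only shows the calls are plumbed correctly.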
Okay, so if I buy an Apple HD and stick it in a USB-SATA dock and pull the plug I might not have all the data I was just working on written out to disk, but at least the FS won't be corrupt, right...
I keep giving people bad news. The drive supports synch, and the OS is sending the synch commands, but the thing is that there is a chip in between the USB and SATA ports that is converting between them, and almost all bridge chips just ignore that command instead of converting it. The only firewire or USB drives that I know of that honor synch commands are iPods and Macs in target disk mode. See, the USB chip guys are just trying to implement these things cheaply. Just like the HD guys, they don't get blamed and it is mostly undetectable, so they skimp.
I should point out it is not just the cheap guys though. I once was testing a high end embedded HW raid controller that uses fibrechannel to the host and ATA to the drives, and after seeing some data corruption we hooked it up to an analyzer and saw that it completely dropped all the synch commands. I guess most of that company's customers did not have ATA analyzers handy....
My data is doomed 8-(
No, it isn't doomed, but it is not as safe as it should be. You will be okay if you do three simple things:
- Always unmount your drives and wait for their activity lights to stop before you disconnect them
- Use SATA, eSATA, SAS, or Fibrechannel for primary storage. Those are unbridged, so the OS's synchronize commands will get to the drive
- Use server class drives, or OEM drives from big vendors. Those drives will honor the synchs correctly. That should include all SAS and Fibrechannel drives, and most high-end SATA drives
Limit your use of USB and firewire drives to bulk storage, or disconnected backups. So long as you fully unmount them before you disconnect them, the dropped synch issue will only have the potential to cause trouble if you lose power or kernel panic mid write. While that is still a problem, it is a fairly small window, and you are unlikely to have issues in practice. Of course, my system kernel panicked coming out of sleep while I was writing this post (I lost 3 pages, but it is a better post the second time around), but I think that was just the universe trying to remind me that sometimes Mac OS X does crash...