Why Time Capsule is doomed to suck
Saturday, April 18, 2009 at 7:08PM
Update: There is some new information and some good news at the bottom of the article.
One of the features that was introduced with Mac OS X Leopard was Time Machine. I use Time Machine constantly. I make sure my laptop (which is my primary machine) is backed up before I do anything particularly risky, like running tools that modify my drive, or taking my machine out of the house. That way I know that no matter what happens there is a safe copy of my data waiting at home.
The problem is that Time Machine is not automatic if you are a laptop user. I need to walk over, plug my laptop into a drive, and then wait while it runs. On my system it usually runs quickly, but it still requires me to get involved with the backup. It would be better if it could automatically back up across my wifi network. Apple supports network Time Machine backups between Leopard machines and also sells a backup NAS product, Time Capsule. I have a Time Capsule and a Leopard desktop machine that I use as an AFP server, but I have given up on using either of them for Time Machine backups, since they have corrupted my backups multiple times. Unfortunately the current Time Machine over network implementation is fundamentally flawed and will never work correctly.
How Time Machine Works
Time Machine works by literally cloning your drive into a subdirectory of another drive. If you search for it on Google you will find references to HFS+ hardlinks and metadata, but those are all internal implementation details to make it run with acceptable performance. If you drill down into a .backupdb bundle you will see several folders, and each one of them is a complete clone of your system at a specific point in time, minus any folders you have chosen to omit.
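To make the "complete clone, cheap on disk" idea concrete, here is a minimal Python sketch of hard-link snapshots. It is only an illustration of the general technique, not Apple's implementation: the function name `snapshot` and its parameters are made up for this example, it links individual files only (Time Machine also hard-links whole directories, which most filesystems do not allow), and it ignores extended attributes, ACLs and forks entirely.

```python
import filecmp
import os
import shutil


def snapshot(source, backup_root, name, previous=None):
    """Create backup_root/name as a complete tree mirroring `source`.

    Files unchanged since the `previous` snapshot are hard-linked instead of
    copied, so every snapshot still looks like a full clone of the source.
    (Hypothetical helper for illustration only.)
    """
    dest_root = os.path.join(backup_root, name)
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        dest_dir = os.path.normpath(os.path.join(dest_root, rel))
        os.makedirs(dest_dir, exist_ok=True)
        for filename in filenames:
            src = os.path.join(dirpath, filename)
            dst = os.path.join(dest_dir, filename)
            old = None
            if previous is not None:
                old = os.path.normpath(
                    os.path.join(backup_root, previous, rel, filename))
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, dst)       # unchanged: share the existing copy
            else:
                shutil.copy2(src, dst)  # new or modified: store fresh data
    return dest_root
```

Calling `snapshot(src, backups, "monday")` and later `snapshot(src, backups, "tuesday", previous="monday")` gives two folders that can each be browsed as a full copy, while unchanged files occupy disk space only once.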
This is great in many ways. In particular, it means that all the applications that use Time Machine don't have to pull the files out of some archive format to work on them. That lets Finder navigate through them quickly, and hand them to third party QuickLook filters. It also means that any filesystem that those files are stored on must support all of HFS+/HFSX's features or there will be a loss of fidelity. By fidelity I mean precise accuracy of all details of the file data and metadata, including full name (in whatever encoding your volume was using), extended attributes, permissions, ACLs, forks, etc.
Historically most filesystems have not been able to store a file originated on an HFS volume with full fidelity (that is why Apple used to tuck data in ._ files; they were used to stash all the data that would be lost), though that has been getting better in recent years. While losing some info might be okay when transferring a file to a foreign computer, it is never okay for a backup system to lose that kind of information. Because fidelity is such an issue, and Apple has to use a filesystem that supports all of HFS+ and HFSX's semantics, Apple generally creates HFSX volumes for Time Machine, since they can store content of both HFS+ and other HFSX volumes with no loss.
Backing up to a network
Okay, so in the local case Apple copies files between two drives, and it works great. Once you move to networks things get a lot more complicated. Aside from the reduced speed, most people are using laptops via wireless. Between the increased length of the backups and the transient nature of the connections, it is much more likely that you will have an interrupted backup (though that can also happen with a local disk based backup, people love to just unplug drives...). Also, unless you are using something like iSCSI you can't directly use HFS+ on a remote disk, so something has to change. There are a couple of obvious solutions, all of which have drawbacks.
1) Use a network filesystem
This would be an ideal solution, if not for the filesystem fidelity issue. There are currently no network filesystems in wide usage that preserve all HFS+/HFSX semantics (particularly if you include the directory hardlink "implementation detail" of Time Machine). Of course Apple has its own network filesystem, AFP, which it could rev to support features it needs. There are two major problems with that. The first is that most network filesystems leak the semantics of the underlying filesystem. For instance, some SMB volumes preserve case and some don't, and that is a side effect of whether or not the filesystem of the server preserves case.
So even if Apple revved AFP, the best they could do is guarantee that AFP served from HFSX using their server software would have HFSX semantics. Second, a large number of devices use embedded AFP servers on completely different OSes and FSes. There is no way Apple can know how netatalk on a consumer NAS serving files off of ext3 will handle things, but it is a good bet it will not match the semantics they depend on. So Apple would need to either block all 3rd party devices, or implement some sort of mangling in Time Machine to try to preserve all attributes in a way that would be durable. Since everyone hated ._ files the first time, that seems like a bad idea.
2) Use iSCSI/nbd/AoE
Time Machine already works with an HFSX backup disk connected via USB, so why not just connect the disk over the network? That would certainly solve any potential fidelity issues. The problem is that it introduces a completely separate set of issues. When you lose a network connection while doing a file transfer via a network filesystem the behavior is deterministic. The last files you sent over got there, the next ones you were planning to send didn't, and the one you were in the middle of might be there or not depending on exactly what happened, but you can pick up where you left off once you check that one file.
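As a rough illustration of why a file-level transfer is easy to resume, here is a small Python sketch. It assumes the files are sent strictly in order into a single flat directory, and the name `resume_copy` is invented for this example. The only file that ever needs verifying is the last one that shows up at the destination.

```python
import filecmp
import os
import shutil


def resume_copy(files, source_dir, dest_dir):
    """Copy `files` in order, picking up after an interrupted earlier run.

    Returns the names that were actually (re)copied this time.
    (Hypothetical helper; assumes an in-order, flat-directory transfer.)
    """
    os.makedirs(dest_dir, exist_ok=True)
    # Files go over in order, so everything before the last file present
    # at the destination is known to have arrived completely.
    present = [i for i, name in enumerate(files)
               if os.path.exists(os.path.join(dest_dir, name))]
    start = present[-1] if present else 0
    copied = []
    for name in files[start:]:
        src = os.path.join(source_dir, name)
        dst = os.path.join(dest_dir, name)
        # Only the file that was in flight can be partial, so check it.
        if os.path.exists(dst) and filecmp.cmp(src, dst, shallow=False):
            continue
        shutil.copy2(src, dst)
        copied.append(name)
    return copied
```

Nothing comparable exists for a raw block device: there is no single "file in flight" to check, because the client was in the middle of rewriting allocation structures that describe the whole volume.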
Disk drives aren't that simple. Since your machine is directly responsible for the block allocation it goes through the entire driver stack, just like it was a disk. It does I/O scheduling, block layout, etc. When you cut a network connection it is the equivalent of pulling out a USB cable without unmounting the drive. Mac OS X complains when you do that, because it can lead to data corruption. Most of the time it doesn't, but it is much more likely to if you are in the middle of writing stuff. Now take a situation where the cable is ethereal, it gets cut every time your computer is put to sleep, and it is only connected when it is actively backing up files (doing lots of writes). It is a recipe for unrecoverable filesystem corruption on your backup drive.
The fact that Apple does not include support for any of these technologies in OS X or its embedded storage products certainly does not improve the case for using them.
3) Use a custom protocol
This is what commercial network backup systems do. It lets them deal with disconnects in a sensible way, and they don't care about filesystem fidelity because instead of storing files 1 to 1 they store the backed-up file as a blob in a database somewhere, and can store all of the attributes about it in their database. This is a lot more work to implement because now everything in Time Machine is no longer accessible through the normal filesystem interface. Depending on exactly how they implemented this they might be able to do it on a network filesystem, a raw network block store, or they might need a custom server.
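For a feel of what "a blob plus a database row" looks like, here is a minimal Python sketch using SQLite and content-addressed blobs. Everything in it (the table layout, the function names, the choice of SHA-256 and JSON) is an assumption made for illustration and does not describe how any particular backup product works. It records only a few POSIX attributes; a real system would also need extended attributes, ACLs and forks.

```python
import hashlib
import json
import os
import sqlite3


def open_catalog(db_path):
    """Open (or create) the catalog that records every backed-up file."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "snapshot TEXT, path TEXT, digest TEXT, attrs TEXT, "
        "PRIMARY KEY (snapshot, path))")
    return db


def store_file(db, blob_dir, snapshot, path):
    """Store the contents of `path` as a blob and its attributes as a row."""
    with open(path, "rb") as f:
        data = f.read()
    digest = hashlib.sha256(data).hexdigest()
    os.makedirs(blob_dir, exist_ok=True)
    blob_path = os.path.join(blob_dir, digest)
    if not os.path.exists(blob_path):
        tmp = blob_path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, blob_path)  # the blob appears only once fully written
    st = os.stat(path)
    attrs = json.dumps({"mode": st.st_mode, "uid": st.st_uid,
                        "gid": st.st_gid, "mtime": st.st_mtime})
    with db:  # commits on success, rolls back if an exception interrupts it
        db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                   (snapshot, path, digest, attrs))
    return digest


def restore_file(db, blob_dir, snapshot, path, dest):
    """Write the stored blob to `dest` and reapply the recorded attributes."""
    row = db.execute(
        "SELECT digest, attrs FROM files WHERE snapshot = ? AND path = ?",
        (snapshot, path)).fetchone()
    if row is None:
        raise KeyError(path)
    digest, attrs = row
    with open(os.path.join(blob_dir, digest), "rb") as src, open(dest, "wb") as out:
        out.write(src.read())
    meta = json.loads(attrs)
    os.chmod(dest, meta["mode"] & 0o7777)
    os.utime(dest, (meta["mtime"], meta["mtime"]))
    return meta
```

The trade-off described above is visible here: the blob directory can live on any dumb filesystem because nothing depends on its metadata, but nothing in it can be browsed with Finder without going through the catalog.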
What Apple actually did…
Okay, so those are the 3 obvious options. I left out things like "Design a whole new local and new network filesystem from scratch" as pie in the sky and not doable in the short term, though those are certainly options. Apple did not take any of the 3 obvious choices. Instead it did something that allowed them to approximate solution number 2 using their existing technology stack. In short they used HFSX disk images stored on AFP volumes.
The problem is that doing that has all the downsides of solution number 2. Every time you put your computer to sleep mid-backup it is like pulling the plug of a HD mid-backup. Except that the drive is connected over a slow connection, and is thin provisioned (which makes it seem larger than it is), which makes actually performing fscks on it completely impractical, so they have to be omitted or reduced. And disconnects happen quite frequently, so the OS does not pester you about disconnecting the drive. It is even worse because it is doing it over a network filesystem, which adds a whole extra layer of indirection and other issues.
If there was some way to make this solution work it would also mean there is a way to make it safe to randomly unplug hard drives. Trust me, if Apple knew how to do that it would be done, and the OS would not chastise you for doing something stupid when you unplug your USB pendrive without telling it first. Since they haven't figured out how to let you safely unplug USB drives unannounced it seems like a bad idea to base a backup solution on what is in essence a wireless USB cable that is phasing in and out of existence.
There have been a bunch of great comments, but I want to call attention to one from Dominic. While my recent lost backup occurred even with all the newest updates, the backup was created before the latest software update or Time Capsule firmware. It is entirely possible the original corruption happened a while ago, but only led to data loss recently. It sounds like if everything you are using is up to date and your backups are not already corrupted then everything should work. I am creating a fresh backup right now in order to test it out.
If you have not updated you should make sure you are using at least:
Time Capsule 7.4.1 (thanks to gerritvanaaken for pointing out I had the wrong version listed) AND Mac OS X 10.5.6 (10.5.0-10.5.6 Combo Update)
Louis Gerbarg | Leopard, Mac, Time Capsule, Time Machine
Reader Comments (36)
Unless I'm mistaken, it seems as if they're relying upon HFS journaling to make the disconnection safe.
On the one hand - this should work, since HFS journaling *does* work. It should protect against unexpected power loss (i.e. disconnection), writes going out-of-order, etc. So the filesystem should not need to be fsck'ed.
On the other hand, it moves journaling from the category of "disaster prevention" to "relying on it every day". Which is not really where you want to be - when your backup system becomes your only system, you're screwed as soon as there's a problem with your backup system.
HFS+ is metadata journaling only, so the journal will not prevent corruption of the underlying backup data on the volume. Since the underlying sparse images are only used for backups, corrupting the volume metadata or the backup's index (which is data from the FS's view, but data about the backed-up data) is effectively the same thing to the user.
Even if the metadata never becomes corrupt, a cleanly unmounted HFS+ volume is not the same as an uncleanly unmounted one. Since the FSEvent data can no longer be guaranteed, all sorts of things behave in perceptibly different ways. For instance, if you rip out the cord to your HFS+ drive and reinsert it, Spotlight will have to revalidate its entire index (which may mean stating everything on the drive).
Of course, all of that is moot because my volumes are being corrupted ;-) Admittedly it could be a bug in the Time Capsule's AFP implementation, its HFS+ implementation, disk driver, VFS, or anything else. But the fact that there is a really tall stack makes bugs much more likely and makes transient bugs much harder to find and fix.
My TM backups are working perfectly for the last two months for several laptops. The TM volume is advertised from an OS X Server and all backups are performed over a wireless AirPort connection.
I attempted a restore two weeks ago, and except for a few glitches, the restore worked great.
I haven't seen the corruption you experienced, but I'll let you know if I do.
How can you tell if your volumes are being corrupted? I just finished setting up Time Machine to use a remote sparsebundle on a FreeNAS server through AFP. I've closed the lid once during a backup, gotten an error message the next time I opened my laptop, but I haven't seen any signs of corruption. Would they be obvious, or only visible upon close inspection?
I've been using Time Capsule since last November, and haven't experienced any problems at all on either of the 2 machines I back up to it. Both machines back up via AirPort, and while I did have a little trouble doing the initial backup (ended up connecting via ethernet for the initial setup, as Apple suggests), I've not had any corrupt routines since.
TM has some other shortcomings, as I detail at the URL below. I've not had any problems restoring, except that the files I wanted weren't stored in the first place.
Kinda' makes it hard to restore files that aren't backed up...
http://www.bill.eccles.net/bills_words/2008/08/designed-to-fail-apple-time-ma.html
How come I can unplug my ipod touch without getting any warnings whereas before with older version of other ipods I did?
I disagree. The backup record is a set of subdirectories inside a disk image, each of which is written and then never changed again until it's deleted. Apple is free to use/implement atomic synchronization operations at the AFP level to guarantee that a set of writes either all complete or none of them do. Any partial update left after a network disconnect is confined to the latest backup subdirectory which can have a special check/cleanup process applied to it each time the disk image is mounted. This is such a specialized case that it should be possible to make this very reliable.
Hrrrmmm, a smarter storage server might be helpful.
4. Use a smart storage server and dedicated client. It seems like they are trying to avoid a client-server architecture and put all the onus on the system doing the backup. Cheap and fast, which precludes good.
Use something like rsync to update an instance of your current disk on Time Capsule, and then have Time Capsule running XFS and the smarts to periodically snapshot that instance.
Interestingly, I've seen drastically better reliability with a TM volume served from a mini via AFP than the identical hardware plugged into an AEBS (which I assume is relatively similar to what a Time Capsule does). The AEBS version seemed to regularly corrupt itself within a half-dozen backups, even via wired connections and with no apparent dropouts.
The AFP-served volume, however, has been running reliably for almost a month via a hardwired connection (one that's sleep-ed regularly, at that), and while I'm not tempting fate and only running wireless backups manually now, that hasn't had any issues thus far, either, nor did it in my first two weeks of every-hour testing with lots of sleeping.
Though the details discussed here are still all very relevant, it does seem that a "real" server feeding AFP does a much more reliable job of it than an embedded one in an AEBS. It's also two or three times faster, interestingly, from the same physical USB2 drive. Probably the embedded one is CPU limited (maybe also why it corrupts more?), while that's not even close to an issue with a C2D server.
> How come I can unplug my ipod touch without getting any warnings whereas before with older version of other ipods I did?
Your iPod touch is not mounted as a device like other ipods. You can't just drop into the terminal and cd over to it. Thus, unless iTunes is busy talking to it, it can be disconnected at any time.
I believe you, Louis, as I'm inherently skeptical of over-the-air backups, as well. But how do you know that your backups are corrupted? Just curious.
My experience with TM has been mostly quite good. I've got three machines that I originally backed up to an external disc connected to an Airport Extreme router via USB (yes, the "unauthorized" hack.) One of the machines ended up needing a full restore, and the restore mostly worked, except for some very ugly issues with a corrupted keychain that took many, many hours to fix. But I can't necessarily attribute that to the backup, nor should I blame Time Machine, since I wasn't following the rules. Either way, the personal data that I really cared about was intact.
The experience did make me switch to a Time Capsule, and I've been using it for about six months. I've done one full restore out of curiosity, and it worked fine. I'm happy. However, I have noticed several things as potential "gotchas" - none of which have actually occurred, and which may never occur, but which I've guarded against just in case.
First: I started all my backups via direct Ethernet, and did them incrementally. That means I excluded my biggest user folders - music, pictures, and movies - from the first go-round. When the backup was complete, I added movies; then music; then pictures. On my desktop, I actually go one step further: I don't back up the music or picture folders (which are almost 100 gigs) to the Time Capsule at all. Since I never delete files from them, it is a lot simpler and safer to simply mirror them to an external drive every night and exclude them from TC. But I think the policy of incrementally building your initial Time Capsule backups via a direct Ethernet connection before "untethering" them for their subsequent backups is a good one.
Second: I don't let the notebooks do continuous backup. Instead, I use a freeware utility called "Time Machine Editor" to schedule the backups at night (you can set specific times or specific intervals.) This reduces network traffic and also lets me control things like whether or not the machines are sleeping. Manual backups can still be done during the day if I feel there's a particular need.
Third: I do a second kind of data backup with a sync utility (I use Chronosync.) I want to make sure that the data I work with daily - text files, coding, etc. is preserved separately, so I make sure that every night I make a mirror to my separate "manual" external backup drive. I also use Chronosync to duplicate my master data directory between my laptops and my desktop, so that I always have the most current file on each machine (Chronosync will keep copies of deleted files, just in case.)
Fourth: There is one genuine data corruption gotcha for TM/TC - and that's for programs that bundle everything together in a single package. This is a by-product of the issue that a bunch of well-known blog-thinker types debated a while back about the danger/utility of the so-called "Everything Bucket" programs, which conveniently gather and organize all kinds of data into a single file, presumably so that the user can then take advantage of that organization. My spouse and I are both writers; she organizes her work with a program called Scrivener, and I use DevonThink. Both are "bucket" style offerings - and both have easily identifiable conditions under which their TM/TC backups can become unreliable, which are:
- A large data file
- Which is open at the time of the backup
- And which is being backed up wirelessly (slowly)
I've verified this personally several times. Integrity can be -almost- assured by making sure that the program is quit prior to running at least one Time Machine backup. For the laptops, which are on a backup schedule, I simply have an applescript that runs just before the scheduled backup time. But I also make sure that my Devon (and the Scrivener files on the other machine) are copied during the Chronosync backup. (The issue isn't entirely unique to Time Machine, btw; these huge, "packaged up" files are prone to such corruption with other methods, as well, but Time Machine more so, it seems, by my tests, and also, I think the drop-dead simplicity of TM makes it somehow more likely that the user will discover this corruption the hard way.)
But my bottom line feelings about TM/TC are good. Very good. Reason one is that my spouse NEVER could adhere to a backup regimen in the past. Never. Never. Never. And it made me nuts, and cost me tons of time. Now, for the most part, she doesn't have to. My earlier solution of running ChronoSync on her machine wasn't nearly as transparent as TM. I love that.
TM does have flaws - notably with the Data Bucket programs I've mentioned. I've noticed that the current version of iPhoto has also become a bucket, no longer storing images as discrete files, but I'd assume Apple has thought this through and made it work with TM. The forums at Devon and Scrivener are filled with mixed reports on whether or not those programs function with TM. I can say for sure that in certain circumstances, they don't - which, IMHO of the backup world, equals a major fail.
My recommendation to friends?
Desktop external drives are cheap. If you have a single iMac or a Pro, don't bother with a Time Capsule. If you don't travel much with your notebook, ditto. If you have a multiple machine household, then consider a Time Capsule - a genuine Time Capsule, not the hacked version. Go for the 1TB. I'd still recommend getting a cheap USB or FW external for "static" folders - meaning ones that only get bigger and don't experience a lot of deletions - and do a daily, sync-style backup, excluding them from TM, if at all possible. It will speed the TM backups, which I think makes them more reliable. Finally, always make a manual copy of your most important self-created data, whether that's to an FTP site, or a thumb drive, or just between your two machines. That's not just because TM may be reliable or not, depending on your situation/point of view, but because you never know when Godzilla is going to stamp on your house. Redundant redundancy is always a good thing in such circumstances.
Sorry for the long post, but this is a bit of an obsession for me...I haven't read it over for grammar or spelling mistakes. I really can write in the real world...
- Dan
Louis... there's way too much to read here but suffice it to say that I have fought tooth and nail to make sure that TimeMachine backups are reliable both for data and metadata. As of 10.5.6 and the latest firmware for the TimeCapsule, we have not been able to induce corruption in any scenario aside from the drive itself dying. And that includes power-failures of the TimeCapsule, the client, flaky wireless connections, etc. At the lowest levels flush-track-cache requests get propagated across the network to the TimeCapsule which honors them, backupd is smart enough to write check points and rewind to a known stable point when resuming a backup, and the AFP protocol has been slightly reworked to ensure that when a connection is flaky, we properly re-establish state between the client and server.
Email me if you want additional details.
Journaling can protect regular data too, as long as you properly sync. That's up to the app, so it's not normally guaranteed for every single app.
But in this case Apple controls everything from the app level on down... and I know every layer underneath TM is capable of proper flushing and synchronization, including (as you pointed out in an earlier post) the Apple-approved hard drives. It seems likely given the capabilities of the rest of the system that TM would at least try to do the right thing. But then again, maybe not. Apple's 1.0 implementations often suck. :-)
> Of course, all of that is moot because my volumes are being corrupted ;-)
Fair enough! Totally agree about the problems of a tall stack. The subtle interactions can get ugly fast, and Apple doesn't do enough automated stress testing IMHO.
Time Machine mounts the disk image before a backup and unmounts it when it is finished. It's easy to tell when a backup is occurring because I turned the menu bar indicator on. So as long as I don't put my MacBook to sleep during a backup, I should be fine. This seems the same as not yanking the USB cable during a backup on a directly connected drive.
So, "doomed" seems a little dramatic.
Norm:
The problem is that is not how anything works on HFS+. All the file information and directory structures are stored in B*Trees. Depending on what you add and what needs to be moved to keep the tree balanced, writing new files to the disk may perturb the metadata about existing folders. In other words if there is a journal failure while I am writing in directory A the damage is in no way limited to directory A, it could very well wreck one of those directories that you have already written to and are never going to write to again.
In other words, what you think of as logically immutable is physically mutating as the disk is written to, so it is still at risk.
bluedude:
I can tell my backups are corrupted because the sparse volumes they are on refuse to mount and fscking them fails with unrecoverable errors.
Apple added an undocumented extension to AFP (http://www.readynas.com/forum/viewtopic.php?p=72402#p72402) to push an F_FULLFSYNC out to the remote volume in an attempt to guarantee that the network filesystem will flush the track cache on the volume. Without that extension the journaling in a remote disk image does not actually have any atomic semantics and doesn't work. As far as I know none of the third-party AFP servers support this extension. The sad truth is that almost everyone uses netatalk (http://netatalk.sourceforge.net/), which has not had a new release in ~4 years.
Without that extension F_FULLFSYNCs will fail, and if they fail the journal doesn't protect your data, and you NEED the journal to protect you since that is the only thing stopping metadata corruption from sudden disconnects that happen constantly with networked Time Machine backups. While it is possible my issues are bugs on the Time Capsule that cause an F_FULLFSYNC not to work, you are pretty much guaranteed to see corruption eventually unless you are meticulously careful about making sure never to interrupt a backup. It may go quite a while without detection because TM presumes the journal works and the volume never needs an fsck, which means by the time you notice it is probably quite severe. Personally I would not use Time Machine with any third-party NASes unless and until netatalk is revved to support the relevant extensions.
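For readers who have not met this call, the flush request in question is the macOS F_FULLFSYNC fcntl, and a small Python sketch shows the shape of the problem. The helper name `full_sync` is made up for illustration, and the fallback behaviour is only an assumption about what a cautious caller might do; it says nothing about what Time Machine or the HFS+ journal actually do when the request fails.

```python
import fcntl
import os


def full_sync(fd):
    """Ask for written data to reach stable storage, not just a drive cache.

    Returns the name of the mechanism that was used. (Hypothetical helper.)
    """
    # Python exposes F_FULLFSYNC only on platforms that define it (macOS).
    full = getattr(fcntl, "F_FULLFSYNC", None)
    if full is not None:
        try:
            fcntl.fcntl(fd, full)
            return "F_FULLFSYNC"
        except OSError:
            # The volume may refuse the request, for example a network
            # share whose server does not understand it.
            pass
    # Weaker guarantee: the data is handed to the device, which may still
    # be holding it in a volatile cache.
    os.fsync(fd)
    return "fsync"
```

A journal is only as good as the ordering guarantee underneath it: if the caller ends up on the weaker path without knowing it, a "committed" journal entry may not be on the platters when the connection drops.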
Marc:
I absolutely agree, and the volume hosted on my desktop did last a lot longer. It is entirely possible there are underlying bugs in the Time Capsule that really exacerbate the situation. People who have hacked up a Time Capsule have claimed it is running NetBSD. Since it uses HFS+ disks and serves AFP it seems pretty likely they ported Apple's kernel HFS+ implementation and userspace AFP server. Moving filesystems between kernels is pretty tricky, there are a lot of subtle differences between VFS layers that can cause weird edge cases and failure modes. AFP and HFS+ on OS X are tested a lot more thoroughly and used in a lot more places. In the future it seems likely Apple will move their Airport product line off of NetBSD onto an embedded version of OS X, now that they have done all the work to get OS X running on small low memory devices. That would likely provide a tangible improvement just by virtue of the fact the code would have a lot more testing.
My query:
1) Is this a problem with Time Machine?
2) Is this a problem with Time Capsule?
3) Is this a problem with network backups?
In short, is there a workaround?
If I backup laptops wirelessly via an incremental backup through something like Carbon Copy Cloner to a drive connected to an OS X serving machine, am I likely to see the same problems?
I wonder if zfs snapshots would help in this situation (when, of course, OS X fully supports zfs).
Is this really equivalent to unplugging a USB drive mid-write? I think the difference is that Apple can treat the TM backup volume as something that will remain totally untouched until the next time that it connects to your machine. If Apple could assume that your USB drive would never be written to by any other machine before the next time that you plugged it in (to the original machine), would it necessarily be impossible for them to make it safe for you to randomly unplug hard drives?
Dominic:
You have an email waiting
benzado:
I concede it might have been a bit sensationalist...
peteypowderblue:
There is no reason why network backups can't be reliable, this post is about the particulars of how TM/TC work. Given what Dominic said those problems might actually be resolved at this point.
Quite a bit of care has to be taken in order to make this kind of thing work over a network. Unless CCC has a lot of special code to handle it I seriously doubt it is safe. On the other hand plenty of people use solutions like rsync or Retrospect all the time.
luis fernandes:
ZFS solves a lot of the problems here very well, but it introduces its own set of headaches. If ZFS was the default volume format of OS X a lot of these issues would be totally trivialized, but I don't think it is quite mature enough for use as a general volume format. ZFS (and similar filesystems) have a bright future, Sun/Oracle notwithstanding.
Michael:
I actually was going to mention that, though it was a side rant about things you could do with iSCSI so I dropped it. Yes, if you are 100% sure the drive will never be written to by another source you can make certain assumptions that make things a lot better. That would work for something like Time Machine, where the app owns the disk and has a grasp of the semantics, but in general it would still not be safe because an application's view of what is on the disk reflects both the contents of the disk and the unflushed dirty data in the cache.
I suppose one could imagine a setup where when the drive was pulled all the UBC entries backed by that disk had all their data serialized out to the main drive, and when the disk came back they were restored so that there was no data loss, but implementing that seems very problematic.
I've had problems with Time Machine backups to Time Capsule but not recently. I have three Macs using TC for backup and I've had to blow away the .bundle a few times.
"How come I can unplug my ipod touch without getting any warnings whereas before with older version of other ipods I did?"
Because the iPod Touch and the iPhone don't allow disk-mode access. Note that normal iPods can be removed without ejecting when you have disabled disk-mode.
It's really amazing that TM works as well as it does. As my father discovered recently, you can back up successfully and *still* have silent corruption under 10.5.6.
The sad fact is that it works, but with no way to validate a backup you are literally at the mercy of luck and that is exactly what backups are meant to protect against.
Apple really needs to scrap this system and start over, possibly with something like zfs where the snapshots can be replicated and verified cleanly.
In the course of backing up three computers (one iMac, two MacBooks) continuously to a Time Capsule over 12 months, I've had a half a dozen occasions where the Time Machine backup fails because the disk image on the Time Capsule can't be mounted. Fortunately, in all of the cases, through a combination of fsck, chanting, and voodoo, I've been able to repair the disk images (which takes a _long_ time, even while connected by Ethernet), and proceed.
This hasn't happened (knock on wood) in probably 4-5 months, so perhaps Dominic's comment about the latest firmware and 10.5.6 will prove true.
Anecdotes, data, all that. All I can say is I've been backing up a network of machines to Time Capsule since December '07, and have had no problems with any of them. I did have some minor issues with backups taking too long in early '08, but that was fixed by reindexing Spotlight. I am fairly careful, however, NOT to put the computer to sleep when running a backup. My wife on the other hand has no such compunction, and has had no problems.
MacBook Pro, Mini, and MacBook, all backing up fine. Even an Eee "hackintosh" with no problems. And 99% of the time, it's over 802.11g or 802.11n wireless.