    Monday, April 27, 2009

    Maybe doomed was too strong a word

    When I wrote my first post about Time Capsule it was mostly a rant so I could point friends to it instead of going over it again. I am naive; I know that once you put something out on the Internet it is there for everyone to see. I didn't mind people reading it, and I tweeted that I had written it, but I didn't expect to get fireballed. The honest truth is I was still fiddling around with blog templates and making changes that were temporarily rendering the blog unreadable when a friend texted me asking "Are you the one writing /dev/why!?!" My response was "Yes, and how did you hear about it?"

    In my first post I might have been overly harsh, and I probably focused on one particular issue too much. To be fair, I have had multiple corrupted backups, and I feel that does entitle me to be a bit harsh. On the other hand, some very talented people have spent a lot of time trying to make Time Capsule into a product that finally makes it pleasant for users to back up their data. Given that most users are unwilling to expend any effort backing things up, it is probably worth cutting Apple some slack, even if it has been a bit of a bumpy ride getting there.

    So what about the issues?

    I spent a lot of time focused on the issue of data reliability. As Drew and Dominic pointed out, that is a bit of a red herring. I acknowledged in the comments that it was more an issue of stacked complexity: the deeper things get, the more complicated and harder to trace they are, and Apple chose a relatively deep stack (HFS+ on Mac OS X -> a disk image on Mac OS X -> AFP client on Mac OS X -> AFP server on NetBSD -> HFS+ filesystem on NetBSD). In general, doubling the number of components in a stack like that more than doubles the complexity, at least in terms of catching all the weird edge cases you need to handle to make it reliable. After the number of corrupted systems I had over the course of a year, I just assumed it was too complicated a setup to make work. This was reinforced by several Software Updates that had the note "Improves Time Machine reliability with Time Capsule," but didn't solve my problems.

    As it turns out there were some fairly significant issues, but they were not insurmountable, just difficult to track down. After talking with several contacts I have a fairly good understanding of what was going on, and there was a bug that was fixed in 10.5.6 and the Time Capsule 7.4.1 firmware. If you are using a Time Capsule you should make sure you update to those if you have not already.

    So problem solved?

    Well, problem mostly solved. First off, until I have been running it for a while I can't be 100% confident about the fix, but some very smart people have told me it is fixed, and I am confident enough in their judgement to use my Time Capsule as my primary backup device. I have also been doing some particularly evil things (purposefully cutting the networking, pulling power on the Time Capsule and/or the Mac, etc.). So far the backup integrity has withstood all of this, so I am mostly satisfied.

    Mostly?

    While the integrity of the data (which is what I focused on in the last post) seems to be assured now, there were some other issues with Time Capsule that I had neglected to mention in my previous post, because I got sidetracked on that one issue and the post was really long. Compared to the integrity issues they are minor, but they are worth mentioning, if only to help out other people.

    The first one is that while interrupting backups no longer has the potential to corrupt them, it can greatly increase the time the next backup takes. A full explanation of when and why that happens would take an entire blog post, but the short version is that if the disk image is properly unmounted it can correctly track which directories have had changes made to them, and if it is not unmounted it needs to scan substantial portions of the backup to figure out where it was when the volume was unmounted. For the most part that is not a big deal, but it does suck a bit when a backup takes 30 minutes to scan things instead of 5. Strictly speaking this is a limitation of HFS+, not Time Machine or Time Capsule.

    Scanning large directories is just painful. You will notice in the above paragraph I said that when HFS+ is correctly unmounted the system maintains a list of directories that were modified, not files. For various reasons it is way too expensive to keep a list of all files that were modified on HFS+, so it tracks the directories that are modified, and Time Machine takes that list of directories and scans the files in those directories for changes. It is a very reasonable compromise that works very well unless you have directories with thousands of files that frequently change. Scanning those folders can be very slow, and if you have to do it every backup it becomes an issue.
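
    To make the cost concrete, here is a minimal sketch of directory-level change tracking. It is not how Time Machine is actually implemented; the function name, the list of modified directories, and the use of modification times are all assumptions made for illustration. The point is only that every file in a flagged directory has to be examined, even when a single file changed.

```python
import os


def changed_files(modified_dirs, last_backup_time):
    """Return files in the flagged directories that changed since the last backup.

    modified_dirs: iterable of directory paths flagged as modified (a
        hypothetical stand-in for the list the filesystem provides).
    last_backup_time: POSIX timestamp of the previous backup.
    """
    changed = []
    for directory in modified_dirs:
        # Every entry in a flagged directory is examined, so the cost grows
        # with the size of the directory, not with the number of changes.
        with os.scandir(directory) as entries:
            for entry in entries:
                if not entry.is_file(follow_symlinks=False):
                    continue
                if entry.stat(follow_symlinks=False).st_mtime > last_backup_time:
                    changed.append(entry.path)
    return sorted(changed)
```

    With a directory holding tens of thousands of files, one new file still triggers a check of every entry, which is the slow path described above.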

    My problem is that I use Gmail. Gmail's IMAP bridge is kind of weird, but the big issue is that it doesn't really have mailboxes; it emulates them by putting everything with a particular tag into an IMAP mailbox. That means your inbox contains every mail you have ever received, and the folder changes every time you receive a mail. It also means that most of your mails are duplicated at least once if you tag them. Since Mail stores an ".emlx" file for every email in a directory that corresponds to the IMAP folder, that means I have several folders with thousands to tens of thousands of emails, and an INBOX that is immense, all of which frequently have new files added.
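
    If you want to check whether you are in the same situation, a rough sketch like the following counts ".emlx" files per directory. The function name is made up for this example, and the ~/Library/Mail location is simply the path mentioned in this post; adjust it if your mail lives elsewhere.

```python
import os
from collections import Counter


def count_emlx_by_directory(root):
    """Map each directory under root to the number of .emlx files it holds."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        n = sum(1 for name in filenames if name.endswith(".emlx"))
        if n:
            counts[dirpath] = n
    return counts


if __name__ == "__main__":
    # Print the ten directories holding the most message files.
    mail_root = os.path.expanduser("~/Library/Mail")
    for path, n in count_emlx_by_directory(mail_root).most_common(10):
        print(f"{n:8d}  {path}")
```

    Directories that show up with thousands of files and also receive new mail regularly are the ones likely to make every backup scan slow.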

    As a result Time Machine and Time Capsule take 3+ hours scanning for every backup, more if the previous backup was interrupted. Once the backups get that long the odds of interrupting them are pretty high, so I got into a state of perpetually scanning and never completing a backup. To be fair, this is a combination of limitations HFS+ imposes on Time Machine, Gmail making some poor choices in their IMAP bridge, and Mail making some poor choices in their file storage, all three of which conspire to make for a bad experience. Fortunately it was simple enough to fix by excluding ~/Library/Mail/ from the backup. Since my mail is stored on a server, backing it up locally is not strictly necessary.
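
    The exclusion can be added in the Time Machine preferences. For anyone who would rather script it, here is a hedged sketch that assumes a later Mac OS X release that ships the tmutil command-line tool (it is not part of the 10.5 system this post describes). The helper names are hypothetical, and the default is a dry run that only returns the command it would execute.

```python
import os
import subprocess


def exclusion_command(path):
    """Build the tmutil invocation that excludes path from backups."""
    return ["tmutil", "addexclusion", os.path.expanduser(path)]


def exclude_from_time_machine(path, dry_run=True):
    """Return the exclusion command, running it only when dry_run is False.

    Assumes tmutil is installed; on systems without it the run will fail.
    """
    cmd = exclusion_command(path)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run: show the command without changing any backup settings.
    print(" ".join(exclude_from_time_machine("~/Library/Mail")))
```

    Keeping the dry run as the default means nothing is changed until you have looked at the command and decided you really want that directory left out.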

    So, I have gone from doom and gloom to functional but with some problems. My initial backup took ~35 hours, and my Time Capsule's AFP server has wedged 3 times since I started using it again (but the Time Capsule itself kept running, and it was possible to reset just the AFP server by hitting "Disconnect All Users" in the admin tool).

    I am told my blog posts get kind of long-winded, so I am also posting a Networked Time Machine Best Practices post that just gives simple advice without all the wandering analysis and speculation. I plan to post another update after about a month of using it. Additionally, I imagine I will post another update sometime after Snow Leopard ships just to look at what has changed. While no new features have been disclosed, Snow Leopard is supposed to be all about cleanup, reliability, and performance, so I am eager to see if there are any improvements over the current experience.