Originally Posted By: tacit
I make clones of my booted volumes all the time, but I make a point to ensure that no user-level applications are running when I do it.

That's what I used to do, back in the days when I was backing up my MacOS 8.6 and earlier systems using Retrospect. Retrospect had the nice feature that, after copying all the files that had changed, it would go back and compare them with the copies. (They didn't do that right for a while: they'd just read back what they'd just written to see if they wrote it right, but that doesn't detect read errors. I got some very bad backups because of that. They eventually fixed it so that they copied all the files, then made a second pass comparing all the files, making sure they were reading everything from disk again and not just from cache.)
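
(If you want that kind of paranoia in your own verify pass, the cache-bypass part is one fcntl away. A minimal sketch, all mine rather than anything Retrospect shipped, using the real F_NOCACHE flag so the compare reads come from the disk and not the buffer cache; the chunk size is arbitrary:)

```swift
import Foundation

// Compare two files byte for byte, with F_NOCACHE (a real macOS fcntl flag)
// set on both descriptors so the reads come from the disk, not the unified
// buffer cache.
func filesMatch(_ a: String, _ b: String) throws -> Bool {
    let fa = try FileHandle(forReadingFrom: URL(fileURLWithPath: a))
    let fb = try FileHandle(forReadingFrom: URL(fileURLWithPath: b))
    defer { try? fa.close(); try? fb.close() }

    _ = fcntl(fa.fileDescriptor, F_NOCACHE, 1)
    _ = fcntl(fb.fileDescriptor, F_NOCACHE, 1)

    while true {
        let ca = fa.readData(ofLength: 1 << 20)   // 1 MB at a time
        let cb = fb.readData(ofLength: 1 << 20)
        if ca != cb { return false }              // mismatch (or unequal lengths)
        if ca.isEmpty { return true }             // both hit EOF together
    }
}
```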

But when I went to OS X, I discovered to my dismay that "no user-level applications" wasn't nearly good enough. I was astonished at how much OS X does in the background, a lot of it with important files that really needed to be backed up. For a long time I made vain attempts to shut down more and more stuff, but it became unwieldy to remember everything that needed to be turned off, and harder still to remember to turn it all back on again after the backup. (At the very least, you had to disconnect from the internet during the backup, because so many background processes are constantly updating files with information arriving over the network.)

I finally conceded that the only safe way to do it (and in practice the easiest way also) was to boot from a different volume. I never back up a running boot volume anymore.

Except with Time Machine. TM isn't perfect, but it knows it's backing up a running system, and goes to extraordinary lengths to mitigate the risks.

One thing it does is almost childish in its simplicity, but an effective stratagem nevertheless: it keeps all the backups. If one of them is bad, you can restore from another. This still isn't perfect (you need to know the backup is bad, you need to know which files to restore, etc.), but it was enough to justify giving it a chance to prove itself.
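
(You can see that "keep everything" policy for yourself: Apple's tmutil will enumerate every backup you could restore from. A quick sketch; `tmutil listbackups` is a real verb, the plumbing around it is just mine:)

```swift
import Foundation

// List every completed Time Machine backup, oldest first.
// `tmutil listbackups` prints one snapshot path per line.
let tmutil = Process()
tmutil.executableURL = URL(fileURLWithPath: "/usr/bin/tmutil")
tmutil.arguments = ["listbackups"]

let pipe = Pipe()
tmutil.standardOutput = pipe
try! tmutil.run()          // needs a Mac with Time Machine configured
tmutil.waitUntilExit()

let text = String(data: pipe.fileHandleForReading.readDataToEndOfFile(),
                  encoding: .utf8) ?? ""
let snapshots = text.split(separator: "\n")
print("You have \(snapshots.count) backups to fall back on.")
```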

Another thing it does: every backup is actually two backups, run back to back. Each pass copies only the files that have changed since the last good backup, using the FSEvents log to find those changes very fast, much faster than it would take to examine every file to see whether it had changed. (Other backup programs spend far more time figuring out which files have changed than they do actually copying them.) When the second backup runs, the "last good backup" is the first one, so the only files that have changed are the ones that changed during the copy phase of the first backup. (The first backup already caught everything that changed while it was evaluating what to copy.) It takes only a few seconds to find and copy those, so the second backup is a consistent copy of the whole filesystem to within a few seconds. (This guards not only against single files changing mid-copy, but also against related files that need to be updated together being captured in inconsistent states.) It's still not perfect, but the window for failure is getting narrower.
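
Apple hasn't published the engine, but the two-pass logic is easy to model. Here's a sketch under stated assumptions: the event log, currentEventId(), changedPaths(since:), and copy(_:) are hypothetical stand-ins for the real FSEvents log and the real copier, just enough to show why the second pass is so fast:

```swift
import Foundation

// --- Hypothetical stand-ins for the real machinery (names are mine) ---
var fakeLog: [(id: UInt64, path: String)] = []           // pretend FSEvents log
func currentEventId() -> UInt64 { fakeLog.last?.id ?? 0 }
func changedPaths(since id: UInt64) -> [String] {
    fakeLog.filter { $0.id > id }.map { $0.path }
}
func copy(_ path: String) { print("copying \(path)") }   // stand-in for the copier
// -----------------------------------------------------------------------

// One incremental pass: copy everything changed since `lastGoodId`, and
// report the event ID at which this pass *started*, so the next pass only
// has to sweep up files that changed while we were busy copying.
func backupPass(since lastGoodId: UInt64) -> UInt64 {
    let startId = currentEventId()
    for path in changedPaths(since: lastGoodId) {
        copy(path)              // files may still be changing behind us...
    }
    return startId
}

var lastGood: UInt64 = 0
lastGood = backupPass(since: lastGood)  // pass 1: the bulk of the copying
lastGood = backupPass(since: lastGood)  // pass 2: just the stragglers; seconds
```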

Some files it knows cannot be backed up. If it finds a disk image file that's currently opened read/write, it knows there's no way that file can be backed up. Instead, it punts, saying that the previous backup of that file is "good enough". (Stale data is better than bad data.) One way to look at that is, as far as TM is concerned, a disk image file doesn't actually change until it's closed, and it'll get backed up then.
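
Time Machine reuses unchanged files by hard-linking them into each new backup set (on HFS+, anyway), so "punting" is cheap: link the previous copy instead of copying live, inconsistent bytes. A sketch of that decision; isOpenForWriting(_:) is a hypothetical stand-in for however you'd detect an open file (lsof, libproc, etc.), and the paths are examples:

```swift
import Foundation

// Hypothetical check; in real life you might ask lsof or libproc whether
// anything holds the file open for writing.
func isOpenForWriting(_ path: String) -> Bool { false }

// "Stale data is better than bad data": if a disk image is open read/write,
// don't copy the live (inconsistent) bytes. Hard-link the previous backup's
// copy into the new backup set instead.
func backUp(_ source: String, previousCopy: String, newCopy: String) throws {
    let fm = FileManager.default
    if source.hasSuffix(".dmg") && isOpenForWriting(source) {
        try fm.linkItem(atPath: previousCopy, toPath: newCopy)  // punt: reuse old copy
    } else {
        try fm.copyItem(atPath: source, toPath: newCopy)        // normal copy
    }
}
```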

Some it knows how to deal with. It knows what an SQLite database is, and backs it up in a consistent state even if it's being updated at the time. (How? Talk to the database gurus. Database designers have to deal with multiple simultaneous updates all the time, and many a doctoral dissertation has been written on each tiny aspect of that problem. TM just plays by the rules that SQLite sets up.) That means it gets a consistent view of your Mail data even if Mail is retrieving mail at the time.
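
Whether TM calls it directly is Apple's business, but SQLite itself publishes exactly such rules: its online backup API copies a live database under the database's own locking, so the destination is always a coherent snapshot. A minimal sketch using the real sqlite3_backup_init / sqlite3_backup_step / sqlite3_backup_finish calls:

```swift
import Foundation
import SQLite3

// Copy a live SQLite database into a consistent snapshot using SQLite's
// own online backup API. SQLite takes whatever locks it needs, so the
// destination is always a coherent point-in-time copy.
func snapshotDatabase(from srcPath: String, to dstPath: String) -> Bool {
    var src: OpaquePointer?
    var dst: OpaquePointer?
    guard sqlite3_open(srcPath, &src) == SQLITE_OK,
          sqlite3_open(dstPath, &dst) == SQLITE_OK else { return false }
    defer { sqlite3_close(src); sqlite3_close(dst) }

    // "main" names the primary database attached to each connection.
    guard let backup = sqlite3_backup_init(dst, "main", src, "main") else {
        return false
    }
    sqlite3_backup_step(backup, -1)   // -1 = copy all remaining pages
    return sqlite3_backup_finish(backup) == SQLITE_OK
}
```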

During a backup, it uses an FSEvents stream to watch what files are being changed. It can detect a file that changes while it's being backed up, and knows that copy is no good. If that happens in the first pass, it puts the file on the list of things to copy in the second pass. If it happens in the second pass, it will either ignore the change (just as it does for disk images mounted read/write), or will sometimes mark the entire backup as bad, putting up an error alert to that effect. (I don't know how it decides which action to take; I've seen it do both.) No matter: the next backup, an hour later, will probably be good. A worst case of two hours between good backups is still better than any other backup method can hope for.
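
That watching is just the public FSEvents API, the same one any program can use. A bare-bones sketch; the watched path is an example, and a real backup engine would compare arriving paths against the file it's copying instead of printing them:

```swift
import Foundation
import CoreServices

// Print every path that changes under the watched tree. A backup engine
// would instead check each path against the file it is copying right now
// and mark that copy as suspect.
let callback: FSEventStreamCallback = { _, _, _, eventPaths, _, _ in
    // With kFSEventStreamCreateFlagUseCFTypes, eventPaths is a CFArray of CFStrings.
    let cfPaths = Unmanaged<CFArray>.fromOpaque(eventPaths).takeUnretainedValue()
    let paths = cfPaths as NSArray as? [String] ?? []
    for path in paths { print("changed during backup:", path) }
}

let stream = FSEventStreamCreate(
    kCFAllocatorDefault, callback, nil,
    ["/Users/me/Documents"] as CFArray,               // example path, not TM's
    FSEventStreamEventId(kFSEventStreamEventIdSinceNow),
    0.5,                                              // latency, in seconds
    FSEventStreamCreateFlags(kFSEventStreamCreateFlagUseCFTypes)
)!
FSEventStreamSetDispatchQueue(stream, DispatchQueue(label: "fsevents"))
FSEventStreamStart(stream)
RunLoop.main.run()   // keep the process alive so events keep arriving
```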

Is it perfect? Not if you believe in Murphy, but it's not shabby.