The two main reasons I can think of for taking backups seem to be taken care of when I use both snapshots and RAID together with btrfs. (By RAID here, I mean RAID 1 or RAID 10.)

  • Accidental deletion of data: Snapshots cover this case
  • Failure of a drive and bit rot
    • Complete failure: RAID covers this case
    • Drive returning bad data: RAID + btrfs's error-correction feature covers this case

So as an on-site backup solution, this seems to work fine, and it doesn't even need a separate data storage device for it!
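
For concreteness, here's roughly the setup I have in mind - a minimal sketch; device names and paths are just placeholders:

    # Mirror both data and metadata across two drives
    mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb

    # Take a read-only snapshot to guard against accidental deletion
    mkdir -p /data/.snapshots
    btrfs subvolume snapshot -r /data /data/.snapshots/2015-01-02

    # Scrub: verify checksums and repair bad blocks from the good mirror
    btrfs scrub start /data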

However, I have heard that neither RAID nor snapshots are considered proper backups, so I'm wondering if I have missed anything.

Aside from btrfs not being a mature technology yet, can you think of anything I've missed? Or is my thinking correct and this is a valid on-site backup solution?

Basil
小太郎

8 Answers

No, it's not.

What happens when your filesystem or RAID volume gets corrupted? Or your server gets set on fire? Or someone accidentally formats the wrong array?

You lose all your data and the not-real-backups you thought you had. That's why real backups are kept on a completely different system from the data you're backing up - because backups protect against something happening to the system in question that would cause data loss. Keep your backups on the same system you're backing up, and data loss on that system can take out your "backups" as well.

HopelessN00b

For on-site backup, snapshots might be good enough, provided that you regularly 'export' your snapshots somewhere else, where they exist as passive data.

And regularly test whether your 'shipped' snapshots can actually be restored.

This is how I implemented a quick backup of some of my servers: store the data on ZFS, take a ZFS snapshot, send the delta to another server, where the whole filesystem is re-created (minus the actual service running).
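
A minimal sketch of that workflow, with hypothetical pool, dataset, and host names:

    # Take today's snapshot of the dataset
    zfs snapshot tank/data@2015-01-02

    # Send only the delta since yesterday's snapshot to the backup server,
    # where the filesystem is re-created under backup/data
    zfs send -i tank/data@2015-01-01 tank/data@2015-01-02 \
        | ssh backuphost zfs receive backup/data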

Of course, the best backup is always off-site. Thus, after 'shipping' the snapshot(s) to a separate system, do a 'tape-out' of the snapshots regularly.

So, in my system, the server that receives the snapshot deltas regularly dumps all its ZFS pools (including earlier snapshots) to tape.

And of course, test your tape-outs to ensure they can be restored.

Note: You will want the snapshot to take place while disk activity is quiesced, preferably in coordination with the database (if any) to ensure consistency; otherwise, the cure might be worse than the disease. That's why the 'live snapshot' feature of NetApp and EMC arrays is very useful: it postpones a LUN's snapshot until the database using the LUN has indicated that it's safe to carry out the snapshot.
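
If your storage lacks such a feature, here is a crude do-it-yourself sketch (PostgreSQL and the paths are my assumptions, not part of the question): ask the database for a checkpoint, then take an atomic filesystem snapshot, relying on the database's crash recovery at restore time:

    # Flush as much as possible to the data files first
    psql -c "CHECKPOINT;"

    # An atomic, read-only btrfs snapshot of the (hypothetical) data
    # directory, assuming it is a btrfs subvolume
    btrfs subvolume snapshot -r /srv/pgdata /srv/snapshots/pgdata-2015-01-02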

pepoluan

What HopelessN00b said. No.

Proper backups live on a separate device from the device being backed up. What happens when you lose two or more drives? What happens when your server room burns down? What happens when someone accidentally destroys your array?

(Anecdote alert: I once heard of someone who had PXE set to auto-install the latest Fedora. His UPS failed. After a power outage, his server rebooted and was set to PXE boot and... installed Fedora over his data. My point? Freakish things happen. Fortunately, he had proper backups.)

Preferably, you have at least three copies of your data, one stored completely offsite in case the data center burns down.

Hennes

Properly implemented snapshots MUST be supported by your storage, as decent backup products use them as the very first stage of creating a backup job. It is, however, a bad idea to use snapshots as your primary backup. Reasons:

1) Snapshots and the backend storage CAN fail. Real backups must therefore live on a separate spindle set, or there's a good chance of losing both the primary working set and the backup data at the same time.

2) Snapshots "chew away" usable space. It makes sense to use expensive, fast storage for the current hot data and off-load snapshots and backups, which are ice-cold data, to cheaper and slower storage. That works very well with 1), BTW.

3) Snapshots usually slow down the whole process. Most systems use Copy-on-Write, and this approach creates fragmentation. Redirect-on-Write is faster but eats A LOT of space. Very few vendors have properly implemented snapshots: NetApp with WAFL and Nimble Storage with CASL (I'm not affiliated with either of them). Pretty much everybody else has issues. For example, Dell EqualLogic triggers a 15 MB page update (and waste) for every single byte changed. That's EXPENSIVE.

Yes, it is. It is a perfect way to store backups. Nothing else is needed; heck, even doing integrity checks is just wasted time.

Just to confirm - before I give more advice... you work for a competitor of mine, right? You really do, sure? No? Oh.

Sorry, NUTS. No, not at all. Sorry, dude.

The problem is that you are totally open to any error that happens at (a) the system level and (b) the operating-system level. You basically only protect against someone deleting some data. Nice. That IS a frequently occurring error.

What you are not protecting from is:

  • A power spike wiping out the machine. Been there, seen that.
  • A defective RAID controller or defective memory writing sh** to the disk - there goes everything.

And a long list of other things.

This is why - naturally, unless you work for a competitor of mine - you should always make a backup:

  • On another computer
  • That is isolated from at least power spikes (even if you have a UPS).

This is why tapes rock - they are not connected, and anything short of a fire or flood will not hurt them. A power spike - there goes the tape drive and maybe the robot, but the tapes not in the drive are not going to be affected.

BEST would be backups offsite (did I mention stuff like fire and flooding already?). (Again, if you work for a competitor: there is no such thing as a building fire, offsite backup is totally not needed, and neither is fire insurance - please, save that money.)

Now, you may think "oh, flooding never happens". Make sure you are sure. See, here is a video of a 09.09.09 flooding of a Vodafone datacenter. I am sure you will understand where the issue is for an on-site / in-computer backup:

http://www.youtube.com/watch?v=ttcQy3bCiiU

TomTom

Lesson learned from two RAID-1 drives failing within half an hour of each other: RAID is not a backup mechanism, not in any way, shape or form.

RAID is an availability mechanism that reduces downtime in case of hardware failure, but it won't help you at all against, e.g., viruses, data deletion/modification, or plain catastrophic hardware failure.

Michael Stum

On its own it is not a backup solution at all. It will reduce or remove downtime in certain failure scenarios, but it doesn't protect you at all from many others.

It can of course be a very valuable part of a more rounded availability+backup solution:

  • RAID plus snapshots on the same hardware
  • On-site copies on other hardware (remember: there are failure modes that would take out the whole box, controller, drives, and all in one go)
  • Semi-disconnected remote copies
  • and of course proper offline+offsite copies for true disasters

Also: make sure you regularly test your backups. The worst time to discover your backups are not working is when you need to retrieve something from them...
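
For example, a bare-bones restore test might look like this (a tar archive and these paths are just assumptions for illustration):

    # Restore into a scratch directory...
    mkdir -p /tmp/restore-test
    tar -xf /backups/data-2015-01-02.tar -C /tmp/restore-test

    # ...and compare against the live tree (recent changes will show up
    # as differences; anything else means the backup is suspect)
    diff -r /srv/data /tmp/restore-test/srv/data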

Many experienced administrators go with what is known as the 3-2-1 rule of backups:

  • You should have at least three copies of your data, including the primary source. I.e. a single backup is not enough and copies within the same physical system do not count.

  • You should be using at least two different backup methods.

  • You should have at least one off-site copy of your data.

Snapshots violate all three parts:

  • You only use a single physical machine. Anything affecting the whole machine, such as a PSU failure, could take with it all your data.

  • You are only using a single method for your backups. If anything is wrong with it, you will only find out when restoring the backup in a crisis situation.

  • You have no backups off-site. Floods and fires happen only to others, until they happen to you...

Therefore:

  • You need to have at least one backup on a separate machine on your LAN.

  • You need to have at least one backup that is not generated using snapshots. Perhaps a good old incremental tar archive might be in order? Or an rsync-based copy? (See the sketches after this list.)

  • You need to have at least one remote backup, as far as possible from your current location and definitely not in the same building.
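
Minimal sketches of both suggestions (paths and host names are hypothetical):

    # Incremental tar: the .snar snapshot file records what changed since
    # the previous run, so only new and modified files are archived
    tar --create --listed-incremental=/backups/data.snar \
        --file=/backups/data-$(date +%F).tar /srv/data

    # rsync-based copy to a separate machine on the LAN
    rsync -a --delete /srv/data/ backuphost:/backups/data/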

It should also be pointed out that block-level snapshots have about the same consistency guarantees as pulling the plug on your machine and then copying over the disks. In general, you would need to run fsck after a restore or hope that the journal is enough.

Filesystem-level snapshots should be better, but they still would not guarantee the consistency of your files. For many applications (database servers come to mind) copying the files of a live instance can be completely useless, since they could be in an inconsistent state. You would need to use their own application-level backup mechanism to ensure the existence of a clean copy - for which the 3-2-1 rule would apply as well.
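
For instance, with PostgreSQL (database and host names here are hypothetical), a logical dump is consistent by construction and can then be shipped off the machine like any other file:

    # Application-level backup: a consistent logical dump of the database
    pg_dump --format=custom --file=/backups/mydb-2015-01-02.dump mydb

    # Copy it to a separate machine, per the 3-2-1 rule
    scp /backups/mydb-2015-01-02.dump backuphost:/backups/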

Finally, keep in mind that so far we have only been talking about copies of your current data. To guard against failures (or security breaches, for that matter) that go undetected for some time, you also need to keep several past copies of your data going back quite a while.

thkala