65

I am trying to improve the backup situation for my application. I have a Django application and a MySQL database. I read an article that suggested backing up the database in Git.

On the one hand I like it, as it will keep a copy of the data and the code in sync.

But Git is designed for code, not for data, so it will do a lot of unnecessary extra work diffing the MySQL dump on every commit. If I compress the file before storing it, will Git still try to diff the files?

(The dump file is currently 100MB uncompressed, 5.7MB when bzipped.)
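
If it matters, I understand Git can be told to treat particular paths as opaque binaries via .gitattributes, so something like this is what I had in mind (the paths are just my guess):

    # .gitattributes in the repository root
    # Treat compressed dumps as binary: no textual diff/merge attempts
    backups/*.sql.bz2 binary
    # Also skip delta compression when packing these files
    backups/*.sql.bz2 -delta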

Edit: the code and database schema definitions are already in Git, it is really the data I am concerned about backing up now.

wobbily_col

5 Answers

107

Before you lose any data, let me try to introduce a sysadmin perspective to this question.

There is only one reason we create backups: to make it possible to restore when something goes wrong, as it invariably will. As such, a proper backup system has requirements that go far beyond what git can reasonably handle.

Here are some of the issues I can foresee with trying to back up your database in git:

  • The repository will grow dramatically with every "backup". Git stores whole objects (compressed, and delta-packed only later, e.g. when you run git gc) and keeps history forever, so you will end up storing a very large amount of data that you don't actually need or even want. You might need to limit the number or retention period of backups to save disk space or for legal reasons, but it is difficult to remove old revisions from a git repo without a lot of collateral damage. (The sketch after this list shows how to watch this happen.)
  • Restoring is limited to points in time that you have stored in the repository, and since the data is so large, going back more than a trivial amount of time may be slow. A backup system designed for the purpose limits the amount of data stored while potentially providing more granularity, and provides faster restores, reducing downtime in the event of a disaster. Database-aware backup solutions (example) can also provide continuous backup, ensuring that not a single transaction is lost.
  • Commits are likely to be slow as well, and to get slower as the database grows. Remember that git is essentially a key-value data store mapped onto a filesystem, and is thus subject to the performance characteristics of the underlying filesystem. It is possible for the commit time to eventually exceed the backup interval, and at that point you can no longer meet your SLA. Proper backup systems also take longer as the data grows, but not nearly so dramatically, since they automatically manage their own size based on the retention policy you have configured.
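
To make that growth concrete, here is a rough sketch of the naive approach, with every path and name invented for illustration:

    # Hypothetical cron job: commit the nightly dump into a repository.
    cd /srv/db-backup-repo
    mysqldump --single-transaction myapp > dump.sql
    git add dump.sql
    git commit -m "backup $(date -I)"

    # A few weeks later, check what the history really costs on disk:
    git gc                   # force packing so the numbers are honest
    git count-objects -vH    # 'size-pack' is the repository's true size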

So, while there are apparently several interesting things you can do with a database dump if you put it into git, overall I can't recommend it for keeping backups, especially since purpose-built backup systems are widely available (many of them even open source) and do a far better job of keeping your data safe and recoverable as quickly as possible.

39

My two cents: I do not think it is a good idea. GIT does something like "storing snapshots of a set of files at different points in time", so you certainly can use GIT for this, but that doesn't mean you should. GIT is designed to store source code, so you would be missing most of its functionality, and you would be trading a lot of performance for just a little convenience.

Let me assume that the main reason you are considering this is to "keep a copy of the data and the code in sync", i.e. you are worried that version 2.0 of your code needs a different database schema than version 1.0. A simpler solution is to store the database schema, as a set of SQL scripts with CREATE statements, alongside the source code in your Git repository. Part of your installation procedure is then to execute those scripts on a previously installed database server.
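
A minimal sketch of that installation step, assuming a directory of numbered schema scripts (all names here are hypothetical):

    # Hypothetical layout, versioned alongside the code:
    #   schema/001_create_users.sql
    #   schema/002_create_orders.sql
    # Apply the CREATE scripts, in order, to a freshly created database.
    for script in schema/*.sql; do
        mysql myapp_db < "$script"
    done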

The actual contents of those freshly CREATE-d tables have nothing to do with the version of your source code. Imagine you install your software, version 1.0, on server A and on server B, used in different companies by different teams. After some weeks, the contents of the tables will be very different, even though the schemas are exactly the same.

Since you want to back up the contents of the database, I suggest using a backup script that tags each dump with the version of the software it belongs to. The script should live in the GIT repository (so that it has access to the source code version string), but the dumps themselves do not belong in a version control system.
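
A minimal sketch of such a script, assuming the dump location and naming scheme are yours to choose (all names are hypothetical):

    # Hypothetical backup script; it lives in the repository so it can
    # read the code version, but the dump goes outside version control.
    version=$(git describe --tags --always)    # current software version
    stamp=$(date +%Y%m%dT%H%M%S)
    mysqldump --single-transaction myapp \
        | bzip2 > "/var/backups/myapp-${version}-${stamp}.sql.bz2"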

EDIT:

After reading the original post that motivated the question, I find this an even more dubious idea. The article's key point is that mysqldump transforms the current state of a DB into a series of SQL INSERT statements, and that GIT can diff them to store only the updated table rows.

The mysqldump part is sound, since it is one of the backup methods listed in MySQL's documentation. The GIT part is where the author fails to notice that database servers, including MySQL, already keep a transaction log in order to recover from crashes. It is this log, not GIT, that you should use to create incremental backups of your database. This has, first and foremost, the advantage that you can rotate or flush the logs after recovery, instead of bloating a GIT repository into infinity and beyond ...
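
In MySQL's case that log is the binary log. A rough sketch of an incremental backup built on it, assuming log-bin is already enabled and with illustrative paths:

    # Hypothetical incremental backup from MySQL's binary log
    # (file names depend on your log-bin configuration).
    mysql -e "FLUSH LOGS"    # rotate to a fresh binlog file
    rsync -a /var/lib/mysql/mysql-bin.[0-9]* /backups/binlogs/

    # Once a verified full backup exists, old logs can be dropped
    # instead of accumulating forever:
    mysql -e "PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 7 DAY)"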

logc
7

Personally, I don't think it's a good idea to use a version control system to store backup files: GIT is designed for source text, not for binaries or dump files like a MySQL backup dump. The fact that you can do it doesn't automatically mean that you should. Moreover, with a new database backup in every commit, your repository will grow dramatically, using a lot of disk space, and GIT's performance will suffer until you have a slow source control system. It's fine to run a backup strategy so that a backup file is always ready for restoring the database when something in your code goes wrong, but source control tools aren't made to store binary data.

For these reasons, I don't see any utility in storing the backup files for day 1 and day 2 and then looking at the differences between them; it requires a lot of extra, useless work. Instead of using GIT to store database backups when you commit new code, store the backups in a separate path, organized by date and time, and record in your code a reference to the backup created for each version, using tags, as someone already suggested.
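
A hedged sketch of recording that link with a tag, using invented names:

    # Hypothetical: annotate the release tag with where its backup lives,
    # instead of committing the dump itself.
    git tag -a v2.0 -m "db backup: /var/backups/myapp-20150601T0300.sql.bz2"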

My final note about database backups and GIT: a database administrator who needs to restore a database because some data has been lost does not check the differences between the backup files for day 1 and day 2; they just need to know the most recent backup file that will restore the database without errors or data loss, minimizing downtime. Indeed, the task of a database administrator is to make the data available for recovery as soon as possible when the system, for whatever reason, fails. If you store database backups in GIT, linked to your commits, you make a quick restore impossible: the restore points are limited to the moments you stored in the repository, and the repository's performance will degrade dramatically under that much data.

So, I don't recommend storing the backups in GIT. Use a good backup software solution instead (there are some of them here); it will provide more granularity, keep your data safe and secure, and make data recovery simple and fast in case of disaster.

1

You should not store binary data in Git, and especially not a database.
Code changes and database DML changes are totally different things.

MySQL and Oracle can write archive logs (the binary log, in MySQL's case) so that the database can be restored to any point in time. Just back up those logs somewhere safe and you will be okay.
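
A rough sketch of a point-in-time restore built on those logs, assuming you have a full dump plus the saved binary logs (all paths and timestamps hypothetical):

    # Hypothetical point-in-time recovery: load the last full dump,
    # then replay the binary log up to just before the bad transaction.
    mysql myapp < /backups/full-20150601.sql
    mysqlbinlog --stop-datetime="2015-06-02 14:59:00" \
        /backups/binlogs/mysql-bin.000042 | mysql myapp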

Using Git to back up these "archive logs" does not make sense. Archive logs in production environments are rather heavy and should be removed after each regular full backup. It is also pointless to put them in Git: they are already a repository of sorts.

Jehy
0

You SHOULD store a database dump.

In fact, I'd say you may even have to. How else can developers play with the real data without destroying something?

But that is a set of data from snapshots for development, not a backup (unless it is your last line of defense). There is also an interesting privacy problem here.
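
A sketch of what that might look like, with the privacy caveat made explicit (database, table, and column names are invented):

    # Hypothetical: seed a developer database from a production snapshot,
    # then scrub personal data on the copy, never on production.
    mysqldump --single-transaction prod_db | bzip2 > /snapshots/dev-seed.sql.bz2
    bunzip2 -c /snapshots/dev-seed.sql.bz2 | mysql dev_db
    mysql dev_db -e "UPDATE users SET email = CONCAT('user', id, '@example.com')"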

Lothar