Versioning data in a database

Question

I have a database to hold data about projects.

Projects have lots of data associated with them that is segmented into other tables such as project_contacts and project_images. Those are one-to-many relations where 1 project can have many contacts and many images.

However, I need to be able to keep "versions" of projects. So, if a project is updated, the previous data about the project is still present. This is a large issue with many-to-many relationship. For example, there is a categories table, and a project_categories table - many projects can have many categories.

When a project updates, not every piece of data will be updated. For example, if a project gets a new contact, it won't always have a new image, so the images table won't be updated.

One solution I have considered is to have a versions table with an auto-incrementing ID; a project_id column and a time column. Each table will also get a version_id column. Whenever a project is updated, a new row is created in versions. Then, when the tables are updated, the version_id gets entered too. This means that the latest row about a project in project_images might be 3 whereas the latest version in project_contacts might be 27.

Because the version_id is incrementing, it's really simple to get the latest version of a single project, as you can join all the tables by the highest version_id.

Where this idea seems to fail for me is if I want to get the version of a project from a specific date.

Could I use my current solution to address that problem?

score 1 · Answer 1 · answered Jan 05 '16 at 15:42

One potential method is to create a date_time column and a version_id column on an indexing table, essentially what you have described. The difference that you should make though is a new data insert onto all project-related tables whenever version_id is incremented. Something like this psuedoSQL...

insert <current date_time>, (select MAX(version_id) from versions)+1 into versions
insert (select (* - version_id - <data that is being updated>) from project),updating_column=new_data, (select MAX(version_id) from versions) into <project related tables>

This is a rudimentary method of thinking about this, but this may lead you down the right track. The major downside with this concept is that you have a lot of data running around that is essentially redundant but is logically simple to implement.

The other, more storage-smart way to think about this relationship is to keep track of what data gets updated. Your versions table would now have a foreign-key column for all data that can potentially be updated as part of your projects as well as a version_id primary key. Instead of inserting version_id into your project tables, you would insert column_version_id.

The idea would conceptually work like this. When you are querying for all data at a point in time you will first hit the versions table. From there it will join all your project tables in based on column_version_id.

While this concept has better storage there are more moving parts to the logic and is also less flexible with changes in how your project data is managed.

Versioning data in a database

1 Answers1

Linked