Design of portable tables with validity interval (historization, temporal databases)

Question

I'm designing a data model for an application that must keep track of changes to its data.

In a first step, my application must support PostgreSQL, but I'd like to add support for other RDBMSs (especially Oracle and MS SQL Server) in a second step. Therefore, I'd like to choose a portable data model that relies as little as possible on proprietary features. (The DDL for the tables may differ from one RDBMS vendor to another, but the SQL queries / statements in the application should be the same for all supported vendors, as far as possible.)

For example, let's say there is a users and a users_versions table. users_versions has a foreign key on users.

An example of the tables could look like:

users
----------------
id | username
---------------- 
 1 | johndoe
 2 | sally

users_versions --> references id of user (userid)
---------------------------------------------------------------------------
id | userid | name     | street      | place     | validfrom  | validuntil
---------------------------------------------------------------------------
 1 |      1 | John Doe | 2nd Fake St | Faketown  | 2018-01-04 | 2018-01-05
 2 |      1 | John Doe | Real St 23  | Faketown  | 2018-01-05 | null
 3 |      2 | Sally Wu | Main St 1   | Lake Fake | 2018-04-02 | 2018-04-20
 4 |      2 | Sally Wu | Other St 99 | Chicago   | 2018-04-20 | null

Most SQL queries will ask for the entries that are currently valid. In the concept example above this would look like:

SELECT *
  FROM users_versions uv 
  INNER JOIN users u ON u.id = uv.userid
  WHERE uv.userid = 123 AND uv.validuntil IS NULL;

Some use cases (reporting etc.) will also require SELECTing a historic version of the data (e.g. what data were valid on 2017-12-31?). But these won't be performance critical in my application.

In the example above, I might create a filtered unique index on validuntil to ensure that there is only 1 entry with unlimited validity at a time:

CREATE UNIQUE INDEX foo
  ON users_versions ( userid ) 
  WHERE validuntil IS NULL;

As far as I know, filtered indexes can only be used for query optimization in PostgreSQL and MS SQL but not in Oracle. Moreover, indexing null might be a tricky thing, as well (possible / only in multi-column indexes / not-possible).
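To make the intent of the filtered unique index concrete, here is a minimal Python sketch. It uses the standard-library sqlite3 module purely as a stand-in, on the assumption that SQLite's partial indexes behave like the PostgreSQL index above for this case; the index name one_open_version_per_user and the helper names are made up for the illustration, and Oracle or MS SQL Server would still need their own DDL.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory users_versions table with a partial unique index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users_versions (
            id         INTEGER PRIMARY KEY,
            userid     INTEGER NOT NULL,
            name       TEXT    NOT NULL,
            validfrom  TEXT    NOT NULL,
            validuntil TEXT               -- NULL marks the currently valid row
        );
        -- Same shape as the filtered index in the question (hypothetical name).
        CREATE UNIQUE INDEX one_open_version_per_user
            ON users_versions (userid)
            WHERE validuntil IS NULL;
    """)
    return conn


def try_insert(conn, userid, name, validfrom, validuntil=None) -> bool:
    """Return True if the row was accepted, False if the unique index rejected it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO users_versions (userid, name, validfrom, validuntil) "
                "VALUES (?, ?, ?, ?)",
                (userid, name, validfrom, validuntil),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Closed (historic) rows are not covered by the index, so any number of them can exist per user; only a second open-ended row for the same userid is refused.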

Therefore, a different approach for users_versions might be the structure above plus an explicit valid column managed by the application. The most recent entry would get a 1, all historic entries would get a 0. Then I could create two indices, one for query optimization and one for integrity enforcement (only 1 valid entry at a time):

CREATE INDEX optimization
  ON users_versions ( userid, valid );

For queries like:

SELECT *
  FROM users_versions uv 
  INNER JOIN users u ON u.id = uv.userid
  WHERE uv.userid = 123 AND uv.valid = 1;

And one more index to enforce the current version integrity (e.g. ORACLE version):

-- ORACLE: Entry with null-only columns ignored in indexing:
CREATE UNIQUE INDEX only_one_valid_version_per_user
  ON users_versions ( 
    CASE WHEN valid = 1 THEN userid ELSE null END,
    CASE WHEN valid = 1 THEN valid  ELSE null END
  );

Probably this index cannot be used for query optimization, but it should ensure that there can only be 1 valid entry per userid but an unlimited amount of invalid entries (valid=0) for the same userid.
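The following Python sketch tries the same idea against SQLite (standard-library sqlite3), only as a stand-in for experimenting. It assumes that historic rows can be mapped to NULL and that NULLs never collide in a unique index, which holds for SQLite and, as far as I know, for PostgreSQL, but not for MS SQL Server, where a unique index admits only one NULL. A single CASE expression is used instead of the two in the Oracle version; the helper names are hypothetical.

```python
import sqlite3


def make_flag_db() -> sqlite3.Connection:
    """In-memory table with an application-managed valid flag (1 = current)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users_versions (
            id     INTEGER PRIMARY KEY,
            userid INTEGER NOT NULL,
            name   TEXT    NOT NULL,
            valid  INTEGER NOT NULL CHECK (valid IN (0, 1))
        );
        -- Historic rows map to NULL; SQLite never treats two NULLs as duplicates.
        CREATE UNIQUE INDEX only_one_valid_version_per_user
            ON users_versions (CASE WHEN valid = 1 THEN userid ELSE NULL END);
    """)
    return conn


def insert_version(conn, userid, name, valid) -> bool:
    """Return True if the row was stored, False if the unique index refused it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO users_versions (userid, name, valid) VALUES (?, ?, ?)",
                (userid, name, valid),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```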

What's your suggestion for a portable design of such history tables which allows performance in usage?

validfrom + validuntil, with validuntil (nullable) set to null in the currently valid entry
validfrom + validuntil, with validuntil (not nullable) set to far future date like 2999-12-31 in the currently valid entry
validfrom + validuntil + valid flag, with valid flag managed by the application and used in queries for the currently valid entry
...?

When INSERTing new versions, my application will always perform two steps:

Invalidate current version (set validuntil to current date (plus, optionally, set valid flag to 0))
Insert new version (validfrom current date, plus, optionally, with valid flag 1)
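The two steps could be wrapped in a single transaction roughly as follows. This is a sketch only, using Python's sqlite3 as a stand-in database with a reduced column set; open_db and add_version are hypothetical names, and a real implementation would also need whatever locking the target RDBMS requires.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """In-memory stand-in for users_versions (only the columns needed here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE users_versions (
            id         INTEGER PRIMARY KEY,
            userid     INTEGER NOT NULL,
            street     TEXT    NOT NULL,
            validfrom  TEXT    NOT NULL,
            validuntil TEXT
        )
    """)
    return conn


def add_version(conn, userid, street, today):
    """Invalidate the open version (if any) and insert the new one atomically."""
    with conn:  # both statements commit together or are rolled back together
        # Step 1: invalidate the current version.
        conn.execute(
            "UPDATE users_versions SET validuntil = ? "
            "WHERE userid = ? AND validuntil IS NULL",
            (today, userid),
        )
        # Step 2: insert the new version with unlimited validity.
        conn.execute(
            "INSERT INTO users_versions (userid, street, validfrom, validuntil) "
            "VALUES (?, ?, ?, NULL)",
            (userid, street, today),
        )
```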

I don't have the requirement that the database enforces overlap-free time intervals for historic entries. I only must make sure that there is only 1 entry with unlimited validity.

For some very large tables, it might be worth splitting the data into a current and a history table: one table contains only the currently valid versions (users_versions_current), the other contains all the historic versions (users_versions_history). Whenever a new version is inserted, the previous version is inserted with validfrom/validuntil into the ..._history table.

What aspects should I consider? Do you know literature, best practice recommendations etc.?

MDCCL · Answer 1 · 2018-05-04T14:18:36.180

I must say that I agree with the spirit of other answers, and I think that you should first focus on building an optimal database with one specific database management system (DBMS) in mind; the portability aspect, although important, should be secondary.

According to the content of your question, you appear to be very familiar with the subject. Anyway, I have shared my take on two scenarios involving temporal capabilities in this post and also in this other post (containing sample diagrams, expository DDL code, etc.), in case you want to take a look and establish some analogies.

Conceptual examination

Starting the analysis at the conceptual level, the business rules under consideration can be formulated as follows:

There can be one-to-many Users
A User holds exactly-one CurrentVersion
A User holds zero-or-many PastVersions

As demonstrated, the entity types CurrentVersion and PastVersion are involved in a one-to-zero-or-many (or zero-or-many-to-one) association. Apart from the cardinalities, it can be inferred that we are dealing with two distinct entity types because, in this case, a CurrentVersion instance does not have a ValidUntil property, while all instances of PastVersion must have it.

Logical-level arrangement

So, I suggest (a) one base table for the “current version” rows and (b) one base table for the “past version” ones. In this way, the assertions (i.e., rows) retained in each table represent what is a clearly different —although associated— kind of fact (as per the relational model theory), avoiding the ad hoc introduction of ambiguities in a single table.

Considering the user example you brought up —and in agreement with the conceptual definitions above—, the structure of the two tables would be almost the same, but user_version (i.e., the one for the “past” versions) includes an additional valid_until column, which along with user_id must make up the composite PRIMARY KEY of said table. The user_version.user_id column must be constrained as a FOREIGN KEY referencing user.user_id.

Manipulation

When a more “up-to-date” version has to be “saved”, the whole row of the “previous” version undergoes an INSERT operation INTO the user_version table, attaching the corresponding valid_until value indicating the exact instant when the operation is carried out. In turn, the values of the “preceding” row in the user (i.e., “current”) table are replaced with the “most recent” ones by means of an UPDATE.

Each row in the user table would cover the need for unlimited validity that you have to ensure (not having a valid_until column, the values remain valid up to the moment when they are UPDATEd, which may never arrive).
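As a rough sketch of that manipulation, the snippet below performs the INSERT into user_version and the UPDATE of user inside one transaction. It runs on SQLite through Python's sqlite3 only for convenience; the reduced column set (name, street) and the function names are assumptions made for the illustration, and the table name user is quoted because it is a reserved word in some DBMSs.

```python
import sqlite3

SCHEMA = """
CREATE TABLE "user" (                      -- current versions only
    user_id    INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    street     TEXT NOT NULL,
    valid_from TEXT NOT NULL
);
CREATE TABLE user_version (                -- past versions only
    user_id     INTEGER NOT NULL REFERENCES "user" (user_id),
    name        TEXT NOT NULL,
    street      TEXT NOT NULL,
    valid_from  TEXT NOT NULL,
    valid_until TEXT NOT NULL,
    PRIMARY KEY (user_id, valid_until)
);
"""


def make_two_table_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
    conn.executescript(SCHEMA)
    return conn


def save_new_version(conn, user_id, name, street, now):
    """Copy the current row into user_version, then overwrite it in "user"."""
    with conn:  # single unit of work: commit both statements or neither
        conn.execute(
            "INSERT INTO user_version (user_id, name, street, valid_from, valid_until) "
            'SELECT user_id, name, street, valid_from, ? FROM "user" WHERE user_id = ?',
            (now, user_id),
        )
        conn.execute(
            'UPDATE "user" SET name = ?, street = ?, valid_from = ? WHERE user_id = ?',
            (name, street, now, user_id),
        )
```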

Integrity

Of course, the sequentiality of the associated values has to be taken care of (e.g., preventing overlaps, rejecting invalid dates, etc.), just like overall integrity. I would make use of ACID Transactions to guarantee that the pertinent operations are treated as a single Unit of Work within the DBMS itself. Stored procedures (or functions in Postgres) with appropriate permissions would as well be very helpful.

There is no need for NULLable columns —a table with NULL marks does not portray a mathematical relation, so one cannot expect that it behaves as such, it can be normalized, etc.—, nor for a valid column managed by one (or more) application program(s) —which would endanger data quality by violating the principle of self-protection of a database—.

Derivability

The period comprehended between the values in user_version.valid_from and user_version.valid_until stands for the entire validity_interval during which a certain “past” row was “current” or “effective” (it can be calculated in days, minutes, etc. and may be incorporated into a view or computed in application program [app] code as convenient). This and other relevant aspects imply deriving data by virtue of data manipulation operations, mostly SELECTs and a few subqueries.
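For instance, deriving the length of a validity interval in app code could look like the small Python sketch below. It assumes that valid_from and valid_until reach the app as ISO-8601 date strings; validity_days is a hypothetical helper name.

```python
from datetime import date


def validity_days(valid_from: str, valid_until: str) -> int:
    """Number of days a past version was in effect (ISO dates, e.g. '2018-04-02')."""
    return (date.fromisoformat(valid_until) - date.fromisoformat(valid_from)).days
```

With the sample data from the question, validity_days("2018-04-02", "2018-04-20") yields 18.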

Accessing the database from the external level by way of one or more apps

Constructing, let us say, an object-oriented programming “intermediate tier” consumed, in turn, by a “higher tier” of one or more apps (or another kind of software component) would as well help in database portability, allowing code reuse and considerable isolation from migrations to other DBMSs. This resource about the Repository Pattern in .NET (C#) can bring about some ideas in this respect.
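A very small sketch of such an intermediate tier, written in Python rather than C#, might look like this. The interface and class names are invented for the illustration, the SQLite-backed class is just one possible implementation, and the query assumes the single-table layout from the question (open-ended row has validuntil NULL).

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Optional


class UserVersionRepository(ABC):
    """What the higher tiers see; no SQL dialect leaks through this interface."""

    @abstractmethod
    def current_street(self, user_id: int) -> Optional[str]:
        """Return the street of the currently valid version, or None."""


class SqliteUserVersionRepository(UserVersionRepository):
    """One DBMS-specific implementation; others would sit next to it."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def current_street(self, user_id: int) -> Optional[str]:
        row = self._conn.execute(
            "SELECT street FROM users_versions "
            "WHERE userid = ? AND validuntil IS NULL",
            (user_id,),
        ).fetchone()
        return row[0] if row else None
```

Porting to another DBMS would then mean adding another subclass, while callers keep depending on UserVersionRepository only.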

Portability considerations

It is important to draw a distinction between two of the different levels of abstraction of a database built on a SQL DBMS. The (1) structure and (2) constraints of the tables, along with (3) the data manipulation operations (INSERT, SELECT, UPDATE, DELETE, and combinations thereof) carried out on the tables, are elements of the logical level. The (4) underlying indexes supporting a table and/or constraints are (“lower”) physical-level components.

In this manner, the same logical design principle is applicable in all the major SQL platforms but, as you mention in the question, the differences between the tools provided by the various SQL DBMSs for creating logical elements are mostly syntactic, so the portability would be affected by some dialect-specific SQL (DDL and DML) features (and perhaps by DBMS-specific data type characteristics and names as well), thus the convenient approach would be to write SQL code that complies with the ISO/IEC/ANSI standard syntax whenever feasible.

Another problem that you will face is that the same (logical) query may be executed differently (at the physical level) depending on the particular DBMS in use, so the response time can vary greatly and, hence, you will have to make some rewrites to improve speed.

With regard to the physical-level mechanisms, yes, each of the SQL platforms offers different kinds of indexes, although you should make sure that the platform-specific indexing settings do not affect the logical-level layout of a database, be it (i) when creating/changing indexes on the same DBMS or (ii) when porting a certain database to another DBMS (this point is related to the subject known as physical data independence).

In this respect, it would be very handy to have a powerful Storage Definition Language (SDL) completely independent from the Data Definition Language (DDL), which would facilitate achieving a clear separation of concerns between the logical and physical tiers, but that is a different story, so you should try to separate the code of the logical declarations from the code of the physical settings as much as possible to assist portability —hard to accomplish with current DDL mixture of characteristics, I know—.

Speed

Furthermore, it is at the physical level of abstraction where you should optimize the performance of a database (via single- or multi-column indexes, upgrading network bandwidth, improving operating system and/or DBMS and/or hardware configurations, etc.), without damaging the quality of the logical structure and constraints and, therefore, putting (1) the integrity of the data and (2) result set reliability at risk. The logical coherence is a paramount factor in general performance, and a piece of software providing incoherent information can hardly be deemed a database, it does not matter if it “works” particularly fast. Database reliability and speed go, decidedly, hand in hand.

Observation

As for the expository data provided in your question, it looks like the users_versions.id column is excessive, since it appears to be an extra column meant to retain system-controlled surrogate keys (e.g., a column with an IDENTITY property in a SQL Server table). It makes that table logically wider than necessary, which implies “heavier” structures (in terms of bytes) at the physical level (e.g., a supplementary index) and slows down the execution of data manipulation operations.

In addition, since a surrogate key value is meaningless, its enclosing column is unlikely to be specified as a condition in WHERE clauses of SELECT operations (in contrast, most of the queries will probably include users_versions.user_id and/or valid_until, which together comprise the “natural” PRIMARY KEY, and/or valid_from), so users_versions.id would add practically no benefit at all; it would in fact be a burden demanding needless management. In light of all of the above, I consider that this is another factor that you should take into account to optimize overall system functioning and administration.

My more detailed take on columns for system-controlled surrogate keys is contained in this answer and in this answer too, in case you are interested.

Enabling temporal capabilities exclusively for one column

There are situations where you have to enable temporal capabilities for only one column, so this post and this post may serve as references too.

score 3 · Answer 2 · answered May 02 '18 at 17:27

Quick Answer

Database Agnostic code is a Myth.

Overall Design Concept

My recommendation:

Never assume that these tables will be used by one and only one application
Keep as much of the data logic in the database as possible
Use VIEWS to hide the underlying table(s).
Use database side Transactional APIs [XAPI]
Hide the Transactional APIs behind INSTEAD OF triggers on the VIEW

VIEWs

Instead of having the SELECT statement in the application do the JOINs

  ResultSet get_current_users() {
     // the JOIN lives in application code
     String sql = "select * "
                + "from users a "
                + "     join user_information b "
                + "       on a.id = b.user_id "
                + "where ....";
     ...
   }

Have the database hide the multiple tables with a VIEW

  ResultSet get_current_users() {
     // the JOIN is hidden inside the view
     String sql = "select * from current_user_information_view";
     ...
   }

Primary reason: you only have one place to modify code instead of one place per application.
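Here is a minimal, runnable sketch of that idea. It uses Python's sqlite3 as a stand-in database and borrows the users / users_versions tables from the question instead of the user_information table in the pseudocode above; the view definition and the helper names are assumptions for the illustration.

```python
import sqlite3


def make_view_db() -> sqlite3.Connection:
    """In-memory tables plus a view that hides the JOIN and the 'current' filter."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
        CREATE TABLE users_versions (
            userid     INTEGER NOT NULL REFERENCES users (id),
            name       TEXT    NOT NULL,
            street     TEXT    NOT NULL,
            validfrom  TEXT    NOT NULL,
            validuntil TEXT
        );
        CREATE VIEW current_user_information_view AS
            SELECT u.id, u.username, uv.name, uv.street
              FROM users u
              JOIN users_versions uv ON uv.userid = u.id
             WHERE uv.validuntil IS NULL;

        INSERT INTO users VALUES (1, 'johndoe');
        INSERT INTO users_versions VALUES (1, 'John Doe', '2nd Fake St', '2018-01-04', '2018-01-05');
        INSERT INTO users_versions VALUES (1, 'John Doe', 'Real St 23',  '2018-01-05', NULL);
    """)
    return conn


def get_current_users(conn):
    """The application only knows the view, not the tables behind it."""
    return conn.execute(
        "SELECT id, username, name, street FROM current_user_information_view"
    ).fetchall()
```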

XAPI

One explanation of an XAPI is on StackExchange.

The concept may have been started by Tom Kyte, but I believe it can be applied to any RDBMS that supports procedural code similar to T-SQL and PL/SQL.

Your suggested table layout will require a locking mechanism along with one or more DML statements and one or more SELECT statements. Encapsulating all of that within a database side Transactional API is my recommendation.

Instead of having the application do the required steps

  void update_user(int user_id, String name, ...) {
    // every step is driven from the application
    lock_user(user_id);
    int current_row_id = get_current_active_row(user_id);
    // etc.
   }

You need to have the database side code do the required steps

  void update_user(int user_id, String name, ...) {
    // a single call; the database side code does the rest
    String sql = "{call update_user(?, ?, ?)}";
    ...
  }

INSTEAD OF triggers

By implementing INSTEAD OF triggers on the VIEWs, you have the potential for frameworks, like Hibernate or Oracle APEX, to magically use certain XAPI calls via simple DML statements.

I haven't used this trick myself (I don't do Hibernate). I've only seen it as a recommendation in some other threads. Your Mileage May Vary.
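For what it is worth, here is a minimal sketch of the mechanism. It runs on SQLite through Python's sqlite3, which also supports INSTEAD OF triggers on views; the trigger name, the reduced table layout, and the inlined close-and-insert logic (rather than a call to a stored procedure, which SQLite does not have) are assumptions made for the illustration. Trigger syntax in PostgreSQL, Oracle, and MS SQL Server differs.

```python
import sqlite3


def make_trigger_db() -> sqlite3.Connection:
    """View over the current rows; a plain UPDATE on it creates a new version."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users_versions (
            userid     INTEGER NOT NULL,
            name       TEXT    NOT NULL,
            validfrom  TEXT    NOT NULL,
            validuntil TEXT
        );
        CREATE VIEW current_user_information_view AS
            SELECT userid, name FROM users_versions WHERE validuntil IS NULL;

        -- Hypothetical trigger: replaces the UPDATE with the two versioning steps.
        CREATE TRIGGER current_user_information_update
        INSTEAD OF UPDATE ON current_user_information_view
        BEGIN
            UPDATE users_versions
               SET validuntil = date('now')
             WHERE userid = OLD.userid AND validuntil IS NULL;
            INSERT INTO users_versions (userid, name, validfrom, validuntil)
            VALUES (OLD.userid, NEW.name, date('now'), NULL);
        END;

        INSERT INTO users_versions VALUES (1, 'John Doe', '2018-01-04', NULL);
    """)
    return conn


def rename_user(conn, userid, new_name):
    """Simple DML against the view, as a framework might issue it."""
    with conn:
        conn.execute(
            "UPDATE current_user_information_view SET name = ? WHERE userid = ?",
            (new_name, userid),
        )
```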

Final Thought

Database Agnostic code is a Myth.

Your job is to write performant code. In order to do that, you will need to take advantage of each database's features and work around each database's flaws.

This means:

The DDLs will be different.
The CREATE TABLE statements will be different.
The CREATE INDEX statements will be different.
The locking methods used will be different.
The SELECT statements, for the CREATE VIEWs, could be different.
The code for database side Transactional APIs will be different.
The calls to the database side XAPIs will be different. (I don't think MS-SQL supports PACKAGES).
The support capabilities of INSTEAD OF triggers could be different.

Additionally, you shouldn't believe that code will only be different at the database level. The code can be sub-version specific too.

Example: Assuming your Business Requirement is to maintain an audit trail of data changes, the way you would do it in Oracle 11.2.0.3 SE is completely different from the way you would do it in 11.2.0.4 SE.

This is because the licensing for Flashback Data Archive changed such that this feature is included in all editions of an Oracle database as of 11.2.0.4 and higher (non optimized version).

That is going to be a lot of code that you will have to maintain.

If you don't have the time, resources, staff, support, or Management Approval to implement, unit test, debug, and fix the code, then the development of your database agnostic application will fail.

As such, I recommend that you stick with the overall design concept I've mentioned but implement for PostgreSQL only.

score 2 · Answer 3 · answered May 03 '18 at 01:02

Your initial design has what I call a Row Spanning Dependency. That is where the value of one column of one row is dependent on the value of a different column in a different row. For example, the value of validfrom of the second version is dependent on the value of validuntil of the first version, the third version is dependent on the second version and so on. This creates a chain of dependencies which makes the integrity of the data very fragile.

Taking your sample data, suppose you had the following entries:

users_versions
---------------------------------------------------------------------------
id | userid | name     | street      | place     | validfrom  | validuntil
---------------------------------------------------------------------------
 1 |      1 | John Doe | 2nd Fake St | Faketown  | 2018-01-04 | 2018-01-05
 2 |      1 | John Doe | Real St 23  | Faketown  | 2018-01-07 | null

Notice the validfrom date of version 2 is no longer synched with validuntil of version 1. According to the dates, V1 became invalid two days before V2 became valid. This creates a gap in the chain.

There are many ways this could happen. The validuntil date of V1 could be wrong. Or the validfrom date of V2 could be wrong. Or both could be wrong. Or both could be right, there is just a missing version that should come between them. Just looking at the data, you can see the data is wrong, but there is no way to determine where the data is wrong.

Now consider this scenario:

users_versions
---------------------------------------------------------------------------
id | userid | name     | street      | place     | validfrom  | validuntil
---------------------------------------------------------------------------
 1 |      1 | John Doe | 2nd Fake St | Faketown  | 2018-01-04 | 2018-01-09
 2 |      1 | John Doe | Real St 23  | Faketown  | 2018-01-07 | null

Now V1 remains valid for two days after V2 became valid. This is an overlap. Again, there is no way (just looking at the data) to determine if V1 has the incorrect date or V2 has the incorrect date or if both dates are wrong. More importantly, the existence of either a gap or an overlap breaks the integrity of the data. The data is not just wrong, it is invalid.
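To show how such breaks would have to be hunted down after the fact, here is a small Python sketch that scans the (validfrom, validuntil) pairs of one user. The function name and the tuple layout are made up for the illustration, and the dates are assumed to be ISO-8601 strings so that plain string comparison orders them correctly.

```python
from typing import List, Optional, Tuple

Period = Tuple[str, Optional[str]]  # (validfrom, validuntil); None = still valid


def find_breaks(periods: List[Period]) -> List[Tuple[str, Optional[str], str]]:
    """Report every gap or overlap between consecutive versions of one user."""
    ordered = sorted(periods, key=lambda p: p[0])
    breaks = []
    for (_, prev_until), (next_from, _) in zip(ordered, ordered[1:]):
        if prev_until is None or prev_until > next_from:
            # previous version still (or forever) valid when the next one starts
            breaks.append(("overlap", prev_until, next_from))
        elif prev_until < next_from:
            # nobody was valid between prev_until and next_from
            breaks.append(("gap", prev_until, next_from))
    return breaks
```

Note that the function can only say that the chain is broken and where; it cannot say which of the two dates is the wrong one.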

Now let's consider a design that indicates only when a version takes effect but not when it ends. Once a version takes effect it remains in effect until the next version takes effect.

users_versions
-------------------------------------------------------
userid | validfrom  | name     | street      | place
-------------------------------------------------------
     1 | 2018-01-04 | John Doe | 2nd Fake St | Faketown
     1 | 2018-01-05 | John Doe | Real St 23  | Faketown

I have also removed the incremental id value, since it is cumbersome to maintain and adds no new information to the data. The PK of this table is (userid, validfrom). Now change the dates as I did before:

users_versions
-------------------------------------------------------
userid | validfrom  | name     | street      | place
-------------------------------------------------------
     1 | 2018-01-04 | John Doe | 2nd Fake St | Faketown
     1 | 2018-01-07 | John Doe | Real St 23  | Faketown

The second version had originally taken effect on Jan 5, it now takes effect on Jan 7. However, the data is still valid. Change either date to any value and the data will always remain valid. The only problematic value is for both dates to be the same, but as they form the PK, that is not possible.

Yes, either or both dates could be wrong, but they are always valid. We can only design for integrity, not accuracy. Because this data originates from outside the database, all verification of accuracy must take place from outside the database.

The most critical of all considerations when designing a database is data integrity. You cannot prevent wrong data from getting into the database. But you can, and should, prevent invalid data from getting in whenever possible.

Note that while all invalid data is also wrong, not all wrong data is invalid. A March 20th birthday, for example, may be inserted specifying May 20th. While that would be wrong, it is a perfectly valid birthday.

Now the data changes and a new version is inserted:

users_versions
-------------------------------------------------------
userid | validfrom  | name     | street      | place
-------------------------------------------------------
     1 | 2018-01-04 | John Doe | 2nd Fake St | Faketown
     1 | 2018-01-07 | John Doe | Real St 23  | Faketown
     1 | 2018-04-10 | John Doe | 23 New St   | Faketown

One insert. No updates. Data remains valid. Easy.

"Fine, but how do you query the current version?"

I'm glad you asked. Here is a fairly simple rewrite of your query to work with the new design. The record you're looking for is the one with the most recent validfrom date. So:

SELECT *
  FROM users u 
  INNER JOIN users_versions uv ON uv.userid = u.id
     AND uv.validfrom = (
        SELECT  MAX( uv1.validfrom )
        FROM   users_versions uv1
        WHERE  uv1.userid = uv.userid
           AND uv1.validfrom <= CurrentDateTime()
     )
WHERE u.id = 123;

The cool aspect of this design is that a "look back" query, returning data as it existed at a specified time, is the same query. Just change the line

           AND uv1.validfrom <= CurrentDateTime()

to

           AND uv1.validfrom <= :DateOfInterest

This also allows for inserting future dates. If a change is scheduled for a known time, go ahead and insert that version. It will sit there quietly until such time as it becomes valid and then will appear as current.
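The "as of" lookup can be tried out with the sketch below. It uses Python's sqlite3 as a stand-in database, loads the three sample versions from above, and passes the date of interest as a named parameter; the function names are hypothetical, and the join to users is left out to keep the example short.

```python
import sqlite3

AS_OF_QUERY = """
SELECT uv.street
  FROM users_versions uv
 WHERE uv.userid = :userid
   AND uv.validfrom = (
        SELECT MAX(uv1.validfrom)
          FROM users_versions uv1
         WHERE uv1.userid = uv.userid
           AND uv1.validfrom <= :as_of
       )
"""


def make_history_db() -> sqlite3.Connection:
    """validfrom-only design: one row per version, PK (userid, validfrom)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users_versions (
            userid    INTEGER NOT NULL,
            validfrom TEXT    NOT NULL,
            name      TEXT    NOT NULL,
            street    TEXT    NOT NULL,
            place     TEXT    NOT NULL,
            PRIMARY KEY (userid, validfrom)
        );
        INSERT INTO users_versions VALUES (1, '2018-01-04', 'John Doe', '2nd Fake St', 'Faketown');
        INSERT INTO users_versions VALUES (1, '2018-01-07', 'John Doe', 'Real St 23',  'Faketown');
        INSERT INTO users_versions VALUES (1, '2018-04-10', 'John Doe', '23 New St',   'Faketown');
    """)
    return conn


def street_as_of(conn, userid, as_of):
    """Street in effect on the given ISO date, or None if no version existed yet."""
    row = conn.execute(AS_OF_QUERY, {"userid": userid, "as_of": as_of}).fetchone()
    return row[0] if row else None
```

Passing today's date gives the current version; passing an earlier date gives the historic one, and a version dated in the future stays invisible until its validfrom is reached.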

Here is a list of minimum requirements I set out more than 10 years ago when I started working with versioned data. I considered a design to be unworkable unless it passed all of these requirements:

Each version of the data should be self-contained and independent of other versions. This means no flag or other indicator showing which is the current version and which are "history." It also means updating the entity means inserting a new version only -- no updating of previous versions needed.
Avoid what I call Row Spanning Dependency. That is where one field (End_Date) of a row must remain in synch with another field (Start_Date) of a different row. This makes working with the data more difficult and is an excellent source of anomalies.
The current version and all past versions should be in the same table. This makes it possible to use the same query to view past data "as of" a particular date and to view the current data.
Foreign keys to data that has been versioned should work the same as normal (unversioned) data.
The design should be so simple or universally understood that the learning curve for new developers is minimized. This uses normalization which is a database standard.