12

Background

I write a lot of large reports for, and generally maintain, a large health records DB (writing SPs, functions, jobs, etc.). The original schema, and the software that uses it, is from a different vendor, so I can't change much about it structurally. There are many records that require tracking, such as labs, procedures, vaccines, etc., and they are scattered across dozens of tables, many of which are bloated and poorly indexed (I have been able to fix this somewhat).

The Problem

The issue is that because we have little control over the DB, and because it can change with any given update or patch, writing and maintaining these reports is difficult and tedious - especially when there's a large amount of overlap. All it takes is one patch and I'm stuck rewriting large portions of a dozen reports. Additionally, the queries quickly become obfuscated and slow as the joins, nested selects, and applies pile up.

My "Solution"

My plan was to write all of these records to one "catch-all" table and write triggers on the original tables to maintain the records in this aggregate table. Of course I'd need to ensure my triggers were intact after updates, but this would be much easier from a maintainability standpoint, and referencing the data would be far simpler.

The table would be thin and long, storing only the data required, something like this:

CREATE TABLE dbo.HCM_Event_Log (
    id INT IDENTITY,                          -- surrogate key
    type_id INT NULL,                         -- joins to a separate event-type lookup table
    orig_id VARCHAR(36) NULL,                 -- ID of the record in the original source table
    patient_id UNIQUEIDENTIFIER NOT NULL,
    visit_id UNIQUEIDENTIFIER NULL,
    lookup_id VARCHAR(50) NULL,
    status VARCHAR(15) NULL,
    ordered_datetime DATETIME NULL,
    completed_datetime DATETIME NULL,
    CONSTRAINT PK_HCM_Event_Log PRIMARY KEY CLUSTERED (id)
)

Then I'd have various relational tables for things like the type_id and item groupings.
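
Something like this is what I had in mind for a trigger, just to show the shape - the source table dbo.Lab_Result and its columns below are made-up stand-ins for one of the vendor tables, and the hard-coded type_id is only illustrative:

-- Illustrative sketch only: dbo.Lab_Result and its columns are hypothetical
-- stand-ins for a vendor table; each tracked source table would get a similar trigger.
CREATE TRIGGER dbo.trg_Lab_Result_HCM_Event_Log
ON dbo.Lab_Result
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Update rows already captured in the catch-all table (set-based, multi-row safe)
    UPDATE el
    SET    el.status             = i.result_status,
           el.completed_datetime = i.completed_datetime
    FROM   dbo.HCM_Event_Log AS el
    JOIN   inserted AS i
        ON i.lab_result_id = el.orig_id;

    -- Insert rows not yet captured; 1 here stands for the "lab" row in the type lookup
    INSERT dbo.HCM_Event_Log
        (type_id, orig_id, patient_id, visit_id, lookup_id,
         status, ordered_datetime, completed_datetime)
    SELECT 1, i.lab_result_id, i.patient_id, i.visit_id, i.lab_code,
           i.result_status, i.ordered_datetime, i.completed_datetime
    FROM   inserted AS i
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.HCM_Event_Log AS el
                       WHERE el.orig_id = i.lab_result_id);
END;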

I'm starting to second-guess this idea, as several of these tables are written to quite a bit, and the SPs and reports I'd be writing would reference the data heavily as well. So I'm concerned that this table would become a record-locking and performance nightmare with so much I/O.

My Question

Is this a bad or a good idea? I realize every situation is different in SQL Server (2008 R2 Standard Edition, BTW), and that "it depends" is usually the rule, but I'm really just looking for general advice.

I started to consider using Service Broker, but I'd only be performing simple updates/inserts (see the alternative to the accepted answer). The data in many cases needs to be real-time, so using a backup DB wouldn't really work. Performance is already somewhat of an issue for us, but most of that is hardware-related and will be resolved soon.

jreed121

4 Answers

8

If I understood you correctly,

  • you have a large third-party system,
  • you don't have much control over it,
  • you make complex reports that read data directly from this third-party database,
  • your queries depend on the internal structure of the third-party database.

I would approach it like this:

  • Set up my own separate database, which I have full control of.
  • Set up a sync process that reads data from the relevant tables and columns of the third-party database and inserts/updates it into mine (a rough sketch of this step follows after this list).
  • Develop my complex reports based on the stable structure of my database.
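
As a rough sketch of that sync step - every database, table and column name below is a placeholder, not something from your actual schema - a scheduled job could run one MERGE per source table:

-- Sketch only: VendorDB/ReportDB and all table/column names are placeholders.
-- A SQL Agent job would run one statement like this per source table on a schedule.
MERGE ReportDB.dbo.Lab_Event AS tgt
USING (SELECT lab_result_id, patient_id, visit_id, result_status, completed_datetime
       FROM   VendorDB.dbo.Lab_Result) AS src
    ON tgt.orig_id = src.lab_result_id
WHEN MATCHED THEN
    UPDATE SET status             = src.result_status,
               completed_datetime = src.completed_datetime
WHEN NOT MATCHED BY TARGET THEN
    INSERT (orig_id, patient_id, visit_id, status, completed_datetime)
    VALUES (src.lab_result_id, src.patient_id, src.visit_id,
            src.result_status, src.completed_datetime);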

In this case you can fine-tune the structure and indexes of your database to improve the performance of your reports without affecting the third-party system. Unless the original data structure changes dramatically, the logic of your report queries would not change when the third-party database changes; you would only have to adjust the sync process.
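
For example, you are free to add covering indexes shaped around your reports - something you may not be able to do safely on the vendor's tables (the table and index names here are hypothetical):

-- Hypothetical covering index on your own reporting table, created in your own database.
CREATE NONCLUSTERED INDEX IX_Lab_Event_Patient_Completed
    ON dbo.Lab_Event (patient_id, completed_datetime)
    INCLUDE (visit_id, status);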

The sync process is effectively a conversion process - you convert data from the third-party database into the structure that you need. Part of this conversion could be fixing any normalization problems that the original third-party database may have. Only this part of the system has to know about and depend on the internal structure of the third-party system. Your main reports and main queries would depend only on your database.

So, the main point is - separate and limit the part of your system that depends on the internals of the third-party system.

update

Regarding the real-time requirement. BTW, I always thought the definition of "real-time" was "guaranteed response time", not "some small response time". It depends on your application, of course. In my practice it is enough if I sync two databases within a minute of a detected change. If a user sees a report on screen and some underlying data changes, the report has to be re-run somehow to reflect that change. You can poll for changes or listen to an event/message; either way, the report query has to be executed again to show the latest changes.

You already intend to write triggers to capture changes in the original tables and write these changes to one generic table. So, capture changes as you intended, but write them to properly normalized tables, not a single one.

So, this is the extreme case - the conversion of the third-party data structure into your internal data structure is performed in triggers that fire on INSERT/UPDATE/DELETE of the third-party tables. It can be tricky. The code of the triggers would depend on the internal structure of both systems. If the conversion is non-trivial it may delay the original INSERT/UPDATE/DELETE to the point of failure. If there is a bug in your trigger it may affect the original transaction to the point of failure. If the third-party system changes it may break your trigger, which would cause transactions of the third-party system to fail.

The less extreme case: to make the code of your triggers simpler and less prone to errors, write all captured changes to some staging/audit/diff table(s), set a flag or send a message that there are changes pending, and launch the main conversion process that goes through these intermediary tables and performs the conversion. The main thing here is that the potentially heavy conversion process should happen outside the scope of the original transaction.
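
A minimal sketch of that idea (all names below are placeholders): the trigger only records which rows changed and returns quickly; a scheduled job does the actual conversion later, outside the original transaction.

-- Narrow queue table in your own database (all names are placeholders).
CREATE TABLE ReportDB.dbo.Change_Queue (
    queue_id     BIGINT IDENTITY CONSTRAINT PK_Change_Queue PRIMARY KEY,
    source_table SYSNAME     NOT NULL,
    source_key   VARCHAR(36) NOT NULL,
    queued_at    DATETIME    NOT NULL DEFAULT GETDATE()
);
GO

-- Thin trigger created in the vendor database: it only captures keys of changed rows.
CREATE TRIGGER dbo.trg_Lab_Result_Capture
ON dbo.Lab_Result
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    INSERT ReportDB.dbo.Change_Queue (source_table, source_key)
    SELECT 'Lab_Result', lab_result_id FROM inserted
    UNION
    SELECT 'Lab_Result', lab_result_id FROM deleted;
END;
GO

-- A SQL Agent job then reads Change_Queue, converts those rows into the
-- normalized tables in ReportDB and deletes the processed queue entries.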

At second glance this looks pretty much like your original suggestion in the question. But the difference is: the capture-all tables hold data only temporarily; the amount of data is small - just what has changed; it doesn't have to be a single table; and eventually the data ends up in separate, properly normalized, permanent tables which you have full control of, which are independent of the third-party system, and which you can tune for your queries.

Vladimir Baranov
3

By all means put it into a standardised set of tables so that you can tweak the import stage rather than having to change complex reports and queries. But the data should still be normalised, which will require multiple tables (but with good indexes).
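
For instance (names purely illustrative), a shared type lookup plus a narrow table per kind of record, rather than one catch-all table:

-- Purely illustrative shape: a shared lookup table plus one narrow table per record kind.
CREATE TABLE dbo.Event_Type (
    type_id   INT PRIMARY KEY,
    type_name VARCHAR(50) NOT NULL
);

CREATE TABLE dbo.Lab_Event (
    lab_event_id       INT IDENTITY PRIMARY KEY,
    type_id            INT NOT NULL REFERENCES dbo.Event_Type (type_id),
    orig_id            VARCHAR(36) NOT NULL,     -- ID of the row in the source system
    patient_id         UNIQUEIDENTIFIER NOT NULL,
    status             VARCHAR(15) NULL,
    ordered_datetime   DATETIME NULL,
    completed_datetime DATETIME NULL
);

CREATE NONCLUSTERED INDEX IX_Lab_Event_Patient
    ON dbo.Lab_Event (patient_id, completed_datetime);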

As others have mentioned, don't use triggers; sync in batches.

Don't worry about lots of joins; when data is normalised and indexed properly, they do not add any significant cost or management burden.

The time to denormalise into something like a data warehouse is when you need to be able to run lots of different types of queries on the data that you can't predict. It has its own disadvantages and overheads and should be used where appropriate, not as a go-to thing.

JamesRyan
3

I worked through a very similar situation in the past at a 24x7 manufacturing company and finally decided to use transactional replication. It is possible to configure DDL changes to be replicated, so that whatever the patches change gets pushed out to the subscriber. Obviously there are pros and cons to everything, and you need to weigh them to determine what you can support against what works best for the company.
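
For example (the publication name here is just a placeholder), DDL replication can be switched on for the publication so that supported schema changes made by vendor patches flow through automatically:

-- Placeholder publication name; replicate_ddl = 1 lets supported DDL changes
-- on the publisher be propagated to the subscriber.
EXEC sp_changepublication
    @publication = N'HCM_Reporting',
    @property    = N'replicate_ddl',
    @value       = 1;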

On the positive side:

  1. "Real-time" is limited only to network and transaction commit performance on the subscriber. In my experience with moderately high TPS system, we were replicated to within less than 10 seconds of "real-time" data.
  2. Separation of workloads. You're currently running a mixed workload on one server. If you can separate out these two concerns, then you may be able to get the performance benefits on both systems of having removed one workload from the equation
  3. Control. You'll be able to make indexing/stats/maintenance modifications to suit your reporting workload.

There are cons, though:

  1. Cost. Another license and more hardware (virtual, or otherwise).
  2. Replication. It works great once it's been set up properly, but it can be a hassle to get to that point.
  3. Maintenance. If you make any changes to the replicated structures on the subscriber (e.g. drop an index), they will be undone when the snapshot is reapplied (after the publication has changed or when the articles have changed).
swasheck
2

My plan was to write all of these records to one "catch-all" table, and write triggers on the original tables to maintain the records in this aggregate table.

Triggers have so many problems you should avoid them:

  • An error in a trigger can cause the original transaction to abort
  • Triggers that correctly handle multi-row operations are hard to write
  • Triggers can confuse client applications by modifying the returned results (for example, a trigger can change the number of affected rows that gets reported back)
  • When one trigger triggers another, the results are hard to predict

A better option is a job that periodically copies the data to a new table. Your reports can run off the copy. A job that copies rows is easy to write and maintain, and there is no risk that it will affect the operation of the third-party application.
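
A minimal sketch of such a job step, assuming the source table exposes something usable as a change watermark (a last_modified column here; every name below is a placeholder):

-- Incremental copy driven by a watermark; VendorDB/ReportDB and all
-- table/column names are placeholders, not the real schema.
DECLARE @last_sync DATETIME, @now DATETIME;
SET @now = GETDATE();

SELECT @last_sync = last_sync_datetime
FROM   ReportDB.dbo.Sync_Watermark
WHERE  table_name = 'Lab_Result';

-- Copy only the rows that changed since the previous run.
INSERT ReportDB.dbo.Lab_Event (orig_id, patient_id, status, completed_datetime)
SELECT s.lab_result_id, s.patient_id, s.result_status, s.completed_datetime
FROM   VendorDB.dbo.Lab_Result AS s
WHERE  s.last_modified >  @last_sync
  AND  s.last_modified <= @now;

UPDATE ReportDB.dbo.Sync_Watermark
SET    last_sync_datetime = @now
WHERE  table_name = 'Lab_Result';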

Andomar