
I have a database with a 1:m relationship.

I have to display a list of parents to the user rapidly on a home screen at startup. The parent shows a single piece of information that is a sum of a particular child field for that parent.

I don't want the expense of making large queries against the child table (essentially running calculations over the entire child table) just to show the parents. At the same time, I feel it's important to show the sum to the user on the home screen.

I have therefore denormalised the sum of all children for a particular parent and added it as a field within the parent table. Each time a CRUD operation is done on the child table (with an ID that matches a particular parent), I recalculate the sum and write the new value back into the parent field.
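
Roughly, the setup looks like this (a sketch only; table and column names are invented for illustration):

    -- Parent carries a denormalised copy of the sum of its children.
    CREATE TABLE parent (
        id         INTEGER PRIMARY KEY,
        name       TEXT,
        child_sum  NUMERIC NOT NULL DEFAULT 0   -- denormalised aggregate
    );

    CREATE TABLE child (
        id         INTEGER PRIMARY KEY,
        parent_id  INTEGER NOT NULL REFERENCES parent(id),
        amount     NUMERIC NOT NULL
    );

    -- After any insert/update/delete of child rows for :parent_id,
    -- the application recalculates and writes the sum back:
    UPDATE parent
    SET child_sum = (SELECT COALESCE(SUM(amount), 0)
                     FROM child
                     WHERE parent_id = :parent_id)
    WHERE id = :parent_id;

    -- The home screen then only has to read the parent table:
    SELECT id, name, child_sum FROM parent;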

Is this an anti-pattern and therefore 'bad' practice? I felt I was doing the right thing by prioritising performance for the UI.

6 Answers


Is denormalisation for performance reasons an anti-pattern? Not of itself - if something is required, then you have to find a way to do it, and it may well be better to denormalise your data than spend $$$ on a SuperHugeDatabaseInstance to get you the speed you need. I've certainly done this in the past when we were calculating summary data for billions of data points.

Is denormalisation for performance reasons before you've measured things and found out if you actually need to do it an anti-pattern? Yes, every time.


Performance requirements are legitimate requirements, and it's great that you have found a potential way to meet these requirements. Denormalization is a tool, not an anti-pattern. But this likely involves tradeoffs, and you should consider them before committing to this solution.

For example, I'm worried about maintaining consistency when the same information is expressed through multiple fields:

  1. What is your “source of truth”? What happens if the sum in the parent gets out of sync? Can the cached values be recalculated from scratch?
  2. If there is a consistency problem, what would the impact be?
  3. How will you ensure that parents get updated whenever a child changes? Would that be triggers or materialized views in the DB or is this the responsibility of the applications making queries?
  4. Will you use transactions/locks to ensure that updates are performed atomically? (See the sketch after this list.)
  5. Might transactions/locks lead to lock contention on parents with lots of children, thus reducing performance?
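
To illustrate points 3 and 4: if the application owns the update, one option is to wrap the child write and the parent adjustment in a single transaction (a sketch only, with invented names):

    BEGIN;

    -- The child write and the parent adjustment succeed or fail together.
    INSERT INTO child (parent_id, amount) VALUES (:parent_id, :amount);

    UPDATE parent
    SET child_sum = child_sum + :amount
    WHERE id = :parent_id;

    COMMIT;

The longer that transaction holds the lock on the parent row, the more point 5 becomes relevant.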

Many scenarios can sidestep these problems, either because the performance requirements aren't that stringent or because the business problem can tolerate eventual consistency. Also, workloads that mostly read a value and rarely modify it are typically easier to optimize.

You might also consider whether this is the most appropriate way to achieve your performance:

  1. Is the query just slow because the child table is not properly indexed? Does the query plan (“explain” statement) look reasonable? (See the sketch after this list.)
  2. Is the query really on the critical path, or could it be performed asynchronously?
  3. Is the speed actually acceptable, but the GUI merely feels slow because there are no transition animations?
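
For the first point, this is roughly what "properly indexed" might look like (invented names; the exact EXPLAIN syntax varies by database):

    -- A composite index on the foreign key plus the summed column lets
    -- the aggregate be answered from the index alone.
    CREATE INDEX idx_child_parent_amount ON child (parent_id, amount);

    -- Check that the plan uses the index instead of scanning the whole table.
    EXPLAIN
    SELECT SUM(amount)
    FROM child
    WHERE parent_id = :parent_id;
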
amon

Premature optimization is the root of all evil - most of it, anyway - in computer science. ~Donald Knuth

Denormalizing aggregate data to avoid the aggregate function is not an anti-pattern. The anti-pattern is doing so without first establishing that the aggregate function is infeasibly slow.

First, define "quite large". Hundreds of children? Thousands? Trivial. Modern DBMS software on modern hardware can make short work of even naive implementations of the aggregate at these scales. If by "quite large" you mean hundreds of thousands or millions of rows, then we may be talking about enough raw data for a more custom solution.

A related consideration is multi-level aggregates. If the sum isn't just over children, but over children of children, then the Big-O shape of the aggregate is roughly the number of children per parent raised to the power of the number of "generations" of child data involved. A thousand children of one parent is trivial, but a thousand rows with a thousand children each is a million rows that have to be queried. Add one more layer and you're summing a billion rows for just three referential generations.

That assumes constant access time for each record, which is rarely true in RDBMSes; the best case is usually logarithmic if indexed and linear if not, so table scans effectively add another polynomial degree to the search. If all parents reference all children in each generation, this billion-row calculation can be required from a worst-case table of just 3,000 rows, making a trivial-looking table a computational nightmare.

Last is the shape of the aggregate function itself. Most built-ins like SUM, AVG, STDEV, etc. are linear in the number of inputs, but there are "aggregate" functions (defined loosely as algorithms that digest a series into a scalar) with higher computational complexity. Regression analyses tend to be N^2 or worse, and certain specialized scalar calculations can be higher-order polynomial or NP-time.

I put aggregate function complexity last because you typically have less control over the calculation you need than over the inputs into it. However, a common optimization is to reduce the problem from a complete calculation to an incremental one. Calculating the sum of a million random data points is a million addition operations, and there really isn't any way around that. However, calculating the sum of a million data points, given the sum of 999,999 of them and the millionth value, is basically constant-time.

This is how that aggregate column is going to help you, if it is really needed. A trigger on an insert or update of a child record, to set the sum field of the parent record(s) to (currentSum - deleted.ValueToSum + inserted.ValueToSum), makes maintaining these sums dramatically less complex than calculating the value from scratch. This improves query speed of the aggregate value without sacrificing insert speed (which is impacted by both triggers and indexes).
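
A sketch of such a trigger in SQL Server's T-SQL (since the inserted/deleted pseudo-tables above are T-SQL idioms); the table and column names are invented, and the same idea translates to NEW/OLD row triggers in other databases:

    -- Keeps Parent.ChildSum incrementally up to date on any child change.
    CREATE TRIGGER trg_Child_MaintainSum
    ON Child
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        UPDATE p
        SET ChildSum = p.ChildSum
            + ISNULL((SELECT SUM(i.ValueToSum) FROM inserted i WHERE i.ParentID = p.ID), 0)
            - ISNULL((SELECT SUM(d.ValueToSum) FROM deleted d WHERE d.ParentID = p.ID), 0)
        FROM Parent p
        WHERE p.ID IN (SELECT ParentID FROM inserted
                       UNION
                       SELECT ParentID FROM deleted);
    END;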

Again, all of this is contingent on a simple, intuitive, "naive" solution not being good enough. To determine that, you first have to define "good enough", usually "some insignificant fraction of the total load time of the UI view", and test your candidate implementation to show that it definitely doesn't meet the definition. So, start with a computed column defined as a subquery of the record's children (basically a stored subquery giving you the advantage of a cached execution plan). If that isn't fast enough, make sure you are including the value you're summing in an index of the child table on the parent ID. If that's still not fast enough, then consider replacing the computed column with incremental calculation in a trigger or the create/update SP.
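
A sketch of that progression, again with invented T-SQL names; the "computed column" step is shown here as a view with a correlated subquery, which is one common way to express it:

    -- Step 1: compute the sum on demand from the children.
    CREATE VIEW ParentWithSum AS
    SELECT p.ID, p.Name,
           (SELECT ISNULL(SUM(c.ValueToSum), 0)
            FROM Child c
            WHERE c.ParentID = p.ID) AS ChildSum
    FROM Parent p;
    GO  -- batch separator if run in SSMS/sqlcmd

    -- Step 2: if that is too slow, give the subquery a covering index.
    CREATE INDEX IX_Child_ParentID
    ON Child (ParentID)
    INCLUDE (ValueToSum);

    -- Step 3: only if both are still too slow, fall back to the
    -- incremental trigger shown above.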

KeithS

Most relational databases have something called Materialized Views. A Materialized View has the database precompute a query and keep the result around for quick responses. The database is then responsible for keeping the view consistent with the underlying data, including any complications around atomicity and transactions. It is essentially database-managed caching.
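
For example, in PostgreSQL (syntax and refresh behaviour vary by database; the names here are invented):

    -- Precompute the per-parent sums once and store the result.
    CREATE MATERIALIZED VIEW parent_child_sums AS
    SELECT parent_id, SUM(amount) AS child_sum
    FROM child
    GROUP BY parent_id;

    -- The home screen reads the precomputed rows.
    SELECT parent_id, child_sum FROM parent_child_sums;

    -- In PostgreSQL the refresh is explicit (run when the data changes or
    -- on a schedule); some databases, such as Oracle, can refresh on commit.
    REFRESH MATERIALIZED VIEW parent_child_sums;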

Read the documentation for your database, because this can get complicated.

Using Materialized Views is no more a denormalization antipattern than using indices. But making the application layer ensure data consistency when the database could be doing it: that is an antipattern.

dspeyer

If you are pulling data from a database to display in an app and running up against performance issues of this kind, I would seriously consider a caching layer. It's significantly faster to retrieve a value from a cache than from the database.

At its simplest, the cache would store the computed sum against the parent's ID whenever it is calculated, and any CRUD operation that changes a parent's children would clear the relevant values from the cache. When fetching the data for the UI, first check whether it's in the cache and, if not, run the database sum query.

This solution would open up further performance improvements as you find other things that could easily be cached.


Whether denormalization is seen as an anti-pattern or not, I will refrain from commenting.

That said, there are many reasons to denormalize datasets. Optimizing for performance, storage and/or reliability should most often be the goal when doing so. (See: CAP theorem and data storage)

Whether it is pragmatic to do so depends heavily on how you decide to denormalize said data. Looking at the natural access patterns and query patterns driven by users, are there logical mappings to the dataset?

For example, if you have millions of products, do these products have some taxonomy like nested categories? Chiefly: does your user-facing navigation already divide cleanly into a sorted structure? Then make the leaf nodes of that structure the primary identity of each record.
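
As a sketch of that idea (invented names, and whether it fits depends entirely on your data): the leaf category becomes the leading part of each record's key, so lookups and any later partitioning follow the same hierarchy users already navigate.

    -- The leaf category is part of each product's identity, so rows
    -- cluster by the same hierarchy the user-facing navigation uses.
    CREATE TABLE product (
        category_path  TEXT    NOT NULL,  -- e.g. 'electronics/audio/headphones'
        product_id     INTEGER NOT NULL,
        name           TEXT    NOT NULL,
        PRIMARY KEY (category_path, product_id)
    );

    -- Queries that follow the navigation hit one contiguous slice of the key.
    SELECT product_id, name
    FROM product
    WHERE category_path = 'electronics/audio/headphones';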

Very, very often we are already "sharding" traffic or lookups by some hierarchy that is more convenient for users. Users tend to be humans, just like developers.

Turns out these same groupings are actually quite useful for scaling out by further distributing the categories which use the most resources. After that point, if you need to, you can always normalize within the categories and use widely available algorithms. You end up with a flexible structure that scales based on the needs of the domain your project deals with.