111

Are there any guidelines or rules of thumb to determine when to store aggregate values and when to calculate them on the fly?

For example, suppose I have widgets which users can rate (see schema below). Each time I display a widget I could calculate the average user rating from the Ratings table. Alternatively I could store the average rating on the Widget table. This would save me from having to calculate the rating every time I display the widget, but then I'd have to recalculate the average rating each time a user rated a widget.

Ratings       Widgets
---------     -------
widget_id     widget_id
user_id       name              
rating        avg_rating  <--- The column in question
BenV
  • 4,923
  • 7
  • 40
  • 38

5 Answers5

64

It depends. Pre-calculating aggregate values places a larger load on writes, deriving them makes reads more difficult

If you are frequently accessing a derived value, pre-calculation is a valid de-normalization step. However, in this instance, I recommend using a Materialized View (a view, written to disk, linked by trigger to the parent tables). The materialized view is designed to store frequently asked but tedious-to-derive data, and is useful for high numbers of writes and low numbers of reads.

In a high-write, high-read scenario, consider having a task in the background which mimics the effects of a materialized view, but in less than real-time. This will present a "good enough" average while preserving write and read performance.

In no circumstances, should you treat the derived column like a "normal" column: make sure the data presented in the Widgets "view" is present elsewhere in the table, such that the entire tuple can be derived by whatever processes you emplace. This question is also strongly database (and database-version) specific, so I recommend performance testing of the aggregate (with appropriate indexes) against a normal-sized data set and the materialized view.

Brian Ballsun-Stanton
  • 4,731
  • 2
  • 30
  • 36
12

How often you need to calculate/display the values relative to how often the underlying numbers are changed/updated.

So, if you have a website with 10k daily hits that's displaying a value that's only going to change once an hour, I'd calculate it when the underlying values change (could be a database trigger, whatever).

If you have a tool to go and look at stats, where the stats are changing by the second, but you only three people have access, and they only look at it a couple of times a day, I'd be more likely to calculate it on the fly. (unless, it takes a couple of minutes to calculate that having had stale data in the first place isn't a big deal ... and my boss tells me to just generate the thing from cron every hour, so he doesn't have to wait when he wants to look at it.)

Joe
  • 5,189
  • 1
  • 29
  • 39
4

Use StaleWidgets table as a queue of "invalid" (to be recalculated) widgets. Use other thread (asynchronous) task that can recalculate these values. Period or moment of recalculations depends on system requirements:

  • just on read,
  • at the end of month,
  • for some user at the start of day
  • ...
garik
  • 6,782
  • 10
  • 44
  • 56
2

For the case in particular there is a diferent solution where you don't have to add all the ratings and divide it by the total to find the average. Instead you can have an other field that contains the total of the reviews, thus each time you add a rating you calculate the new average using (avg_ratingĂ—total+new_rating)/total, this is much faster than aggregate and reduces disk readings since you don't have to access to all the rating values. Similar solutions might apply to other cases.

The downside of this is that it is not an acid transaction, so you might end with an outdated rating. But still you can solve that by using triggers in the database. The other problem is that the database is not normalized anymore, but don't be afraid to denormalize data in exchange to performance.

2

I would suggest calulating on fly if calulation is not too cumbersome and in case where you have complex calcutaion and frequent update but not that frequnet read than you can store calculated data and have extra column(bool) which will store whether recalculation is required or not. e.g. set this column to true whenever recalculation should be done but don't do recalculation and when you do recalculation set this column as false(this will represent calculated value is latest and not stale).

This way you don't have to recalculate everytime, you will calculate only when you have to read and recalculation column value is true. This way you will save lots of recalculation.

techExplorer
  • 261
  • 1
  • 4