0

Consider application where users rates products (e.g. 1-5 stars).

Through passage of time, there might be millions of records. One can create desired indexes and/or keep sum and count of all ratings to easily get overall average score of particular product or seller.

The problem I am trying to tackle is how to efficiently provide an average score based on ratings from last year (365 days). In other words, any rating that is older than 365 days MUST NOT be considered for average score.

Three most naïve approaches are:

  1. Calculate it on DB level, using GROUP BY and AVG (or similar DB functions).
  • cons: it might analyze thousands of rows to give actual rating
  1. Have a background process that updates average_score once per day for all products and sellers.
  • cons: average_score is updated only once per day
  1. Recalculate average_score every time new rating is added.
  • cons: average_score is recalculated too often

My question is if there is better/smarter way to do it, taking into account there might be hundreds of thousands ratings per day?

1 Answers1

2

This depends on how important the one-year window is. If you get a request at 3:15 PM on January 31st of a year, is it important that you exclude a rating from 3:14 on January 31st of the previous year?

A nice thing about averages is that you can break them up and recalculate them. So if it is acceptable to include ratings from one year ago by day (rather than by minute - that is, it would be OK to include the 3:14 request from the previous year), then you can calculate a daily average and use that as a baseline, and re-calculate against that baseline value throughout the day. For example, in the January 31st example, you would have calculated a baseline at the start of the day, and keep track of not only what the average is but also the number of ratings that it was calculated using. So say your baseline was 4.315 stars, based on 1000 ratings. You receive a request at 3:15. You grab all ratings that occurred since the baseline, and find that 20 ratings have given you an average of 2.5 stars. You then multiply the averages back out - 4.315 * 1000 = 4315, 2.5 * 20 = 50. You can then add those together and re-divide: (4315 + 50) / (1000 + 20) = 4.279.

You can use this strategy more generally to break down your recalculations - for example, you could store a monthly average for each month, for example. And if minute-by-minute granularity is truly important, you actually could break this down further, so for example storing weekly averages, and recalculating using the partial weeks for the current week and the one-year-ago week.

autophage
  • 880