Store time series data efficiently without wasting space

Question

I'm building an application to calculate different KPI metrics for customers of an ecommerce website e.g. (avg. order value, avg. items count and so on). KPIs are integer and or double values e.g. number of items bought, avg. order value, gross margin...

The application fetches orders data, calculate metrics and store them. I'm using MySQL as a relational database.

about metrics:

I currently have 10 metrics to calculate for each customer.

Metrics can increase in the future but not so frequently so I can consider "10" as quite definitive. Anyway altering schema in the future is not a problem at all.

I need to calculate each metric on a weekly basis (at minimum). Metrics are about customers.

about customers:

Customers are 30k and they are growing + 0.5k/month rate.

Not all customers buy with the same frequency. I can have occasional buyers but also heavy buyers.

I want to show a graph with the overall trend of a specific KPI in a given timespan.

I want to show a graph with the trend of a metric for a specific customer in a given timespan.

My entites are:

orders
customers
customers_kpi

I'm worried about storing a huge amount of useless data

52 weeks * 30k users * 4+ years = 6.2M rows at minimum

I have 2 questions:

Should I store rows for customers without orders for a given timespan (e.g. the row will be all filled with NULL)? Can avoid it somehow without affecting data visualization?
Which table structure is more efficient ("thin" table vs "fat" table) given that not all customers buy every week and number of metrics are unlikely to change often?

I'm in doubt among these 2 structures for customers_kpi table:

customer_id	kpi1	kpi2	kpi3	..kpiN	from	to

VS

customer_id	kpi_name	kpi_value	from	to

J.D. · Accepted Answer · 2021-05-14T15:35:38.563

First off, a few million rows is nothing to be afraid of. Modern relational database management systems can handle up to trillions of rows on pretty standard hardware, when architected and indexed properly. Secondly, I like to measure things by a minimum of 10 years as a good test of time, so let's take your metric and annualize it over 10 years which gives us closer to ~15.6 million rows, and let's round it up to say 20 million (to account for customer growth etc). We're still in medium sized numbers, as far as number of records in a single table, so no big deal.

To answer your questions directly:

1. "Should I store rows for customers without orders for a given timespan (e.g. the row will be all filled with NULL)? Can avoid it somehow without affecting data visualization?"

A: There's no need to store empty rows. You can create a dates dimension table which will store a single row per date and then you can outer join to it such as SELECT * FROM dateDimensions AS D LEFT OUTER JOIN orders AS O ON D.date = O.orderDate WHERE D.date >= 'some date value' AND D.date < 'some other date value'. (You can join in your other tables here as needed.) This will help keep your table much leaner (maybe that 20 million rows in 10 years becomes 10 million or 5 million roughly) because now you're not storing multiple empty order or empty customer_kpi rows every week for every customer who didn't make an order, you're only storing one row per date (or your dates dimensions table can even be simplified to just storing one row per week - though probably overkill). And outer joining to your dates dimension table will maintain the same visualization you're looking for.

2. Which table structure is more efficient ("thin" table vs "fat" table) given that not all customers buy every week and number of metrics are unlikely to change often?

A: Both table structures have their place, and sometimes a denormalized table structure like your first example structure is better for an OLAP heavy / heavy reporting database regarding performance, but in general normalization is best practice when all things are equal, like your second example structure. Because 20 million rows (worse case number of rows based on my answer to your first question) is really nothing to be concerned about in a single table, and because you get better flexibility with querying about only specific KPIs (which can be more performant by not bringing back a fat row with a bunch of unneeded data), e.g. SELECT * FROM customers_kpi WHERE kpi_name = 'SomeSpecificKPI', I'd personally recommend starting with your second table structure. You can always easily transform it later on to look like your first structure if you need or should you change your mind and want to materialize it as that structure into a new table.

score 3 · Answer 2 · edited May 14 '21 at 13:00

You should use system two:

customer_id	kpi_name	kpi_value	from	to

Empty fields don't cost space in an innodb table.

A row has a fixed max row size, which only allow a certain number of columns, depending on datatype.

So let's say you have for every row a small fixed size of columns, the first would allow simple queries to show the data in a table, as you need only to select all rows from a customer, from time A to B.

Options 2 needs dynamic SQL or a stored procedure for the same result (or you have this at application level).

Lots of columns needs on application level a fair amount of programming usually in loops, to handle the data, which needs also a higher level of testing.

A I said from the start the second approach has a lot more in favor than option 1.

Rick James · Answer 3 · 2021-05-16T18:55:20.180

(In addition to https://dba.stackexchange.com/a/291559/1876,..)

the row will be all filled with NULL

Not a problem. I see 3 choices for that:

Not store the row. (But then you might need to LEFT JOIN from a table with all dates).
You could turn NULL into 0 before creating the row. (62M rows might be 6GB of disk space -- no big deal, these days.)
In any case, you can use IFNULL() or some other function can turn NULL into 0 if needed for making the data more friendly to your graphing program.

Since the typical user will vanish within a year, it feels cleaner to avoid storing anything for the "empty" months, then deal with the LEFT JOIN to fetch the data.

When summarizing the data, do something like:

INSERT INTO summary_table
    SELECT user_id, month,
           COUNT(...), SUM(...), ...
        FROM  ... JOIN ...
        WHERE ... -- limit to "last" month
        GROUP BY user_id, month

That inserts only non-empty rows. More on summary tables: http://mysql.rjweb.org/doc.php/summarytables

(If you ever move to MariaDB, the date table goes away because you can create it on the fly with a "Sequence Storage Engine".)

Store time series data efficiently without wasting space

3 Answers3