
I am storing time-series data in a PostgreSQL database.

I am wondering what the performance ramifications are of the following two proposed design patterns.

Option 1: A single table with a timestamp and many columns (maybe 100) containing various measurement data.

TABLE measurement (id, timestamp, measurementA, measurementB, measurementC ...) 

Option 2:

Two tables: one with an id and timestamp, and another that references the measurement table and contains several fields describing the recorded data.

TABLE measurement (id, timestamp)
TABLE measurement_data (id, measurement_id, description, type, value)

I like the idea of option 2, as it keeps my table structure much smaller and simpler. However, I am concerned about the effect this has on queries. Will there be a big performance hit for a simple query such as the following?

Using Option 1's design:

SELECT measurementA 
FROM measurement 
WHERE timestamp > X AND timestamp < Y

Using Option 2's design:

SELECT value 
FROM measurement_data 
WHERE description = 'measurementA' 
  AND measurement_id IN (SELECT id 
                         FROM measurement 
                         WHERE timestamp > X AND timestamp < Y)

1 Answer


(The type column of option 2 is not represented in option 1, which makes the comparison slightly unfair, but that hardly matters.)

The major difference between the two options is storage size. Assuming datatype real for your measurements and bigint for IDs (for lack of info), the row size of option 1 is:

8 (id) + 8 (timestamp) + 100 x 4 (measurements) + 24 (tuple header) + 4 (item pointer) = 444 bytes.

While the same data in option 2 occupies:

8 + 8 + 24 + 4 = 44 bytes in table measurement. Plus ...
(8 + 8 + 13 + 4 + 4 + 24 + 4) x 100 = 6500 bytes in measurement_data (the 13 bytes being the short text 'measurementA').
So 6544 bytes (!) total in option 2 for one row in option 1.
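
These are back-of-the-envelope numbers (alignment padding ignored). If you want to sanity-check them against real data, pg_column_size() on a whole-row reference gives a rough per-row figure - a minimal sketch, assuming the table names above:

-- Rough average row size per table; pg_column_size() reports the size of the
-- composite datum, which is close to (but not exactly) the on-disk tuple size.
SELECT avg(pg_column_size(m.*))  AS avg_row_bytes FROM measurement m;
SELECT avg(pg_column_size(md.*)) AS avg_row_bytes FROM measurement_data md;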

Plus, you need more indexes in option 2, half of which are bigger than their option-1 counterparts by a similar factor. Seems hopeless. Size matters.

OTOH, the query for option 2 only retrieves rows of 65 bytes per hit in measurement_data, but it adds the overhead of at least an additional index-only scan on measurement. It might still win for a large number of hits and for physically clustered data (more considerations there!). And updates should be cheaper, producing less table bloat - but such a design typically does not see many updates. Many related considerations depend on the complete picture.
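
For that index-only scan to be possible at all, the index on measurement has to cover both the filter column and the selected column. A sketch of the indexes option 2 would want (purely illustrative, not part of the original design):

-- Covers the subquery (filter on timestamp, returning id) for an index-only scan:
CREATE INDEX ON measurement (timestamp, id);
-- Supports the outer lookup by measurement_id plus the filter on description:
CREATE INDEX ON measurement_data (measurement_id, description);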

In any case, ditch option 2, which stores some attributes redundantly over and over. Consider option 3 instead, a classical m:n relationship:

TABLE measurement (measurement_id bigint PK, ts timestamp)
TABLE measurement_kind (measurement_kind_id int PK, type int?, description text)
TABLE measurement_data (measurement_id bigint, measurement_kind_id int, value real, PK (measurement_id, measurement_kind_id))

Still 44 bytes for measurement. Plus:
(8 + 4 + 4 + 24 + 4) x 100 = 4400 bytes in measurement_data
So 4444 bytes total in option 3 for one row in option 1.

(And a negligible 100 x (4 + 4 + 13 + 3 + 24 + 4) = 5200 bytes once --> 8k mini-table for measurement_kind.)
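
Spelled out as full DDL, option 3 could look like this - a sketch only, assuming Postgres 10+ identity columns and the datatypes guessed above; constraint details are illustrative:

CREATE TABLE measurement (
   measurement_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
 , ts             timestamp NOT NULL
);

CREATE TABLE measurement_kind (
   measurement_kind_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
 , type                int
 , description         text NOT NULL
);

CREATE TABLE measurement_data (
   measurement_id      bigint NOT NULL REFERENCES measurement
 , measurement_kind_id int    NOT NULL REFERENCES measurement_kind
 , value               real
 , PRIMARY KEY (measurement_id, measurement_kind_id)
);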

The equivalent query to go with it:

SELECT md.value 
FROM   measurement m
CROSS  JOIN (
   SELECT measurement_kind_id
   FROM   measurement_kind
   WHERE  description = 'measurementA'
   ) mk
JOIN   measurement_data md USING (measurement_id, measurement_kind_id)
WHERE  m.ts > X
AND    m.ts < Y;
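
The only extra index this query needs is one supporting the range predicate on ts; adding measurement_id to it allows an index-only scan on measurement (index name is illustrative). measurement_data is already served by its primary key, and measurement_kind is tiny enough for a sequential scan:

CREATE INDEX measurement_ts_idx ON measurement (ts, measurement_id);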

Résumé: Maybe option 1, maybe option 3, never option 2.

