
We have timeseries data from several sensors in a table, and I am interested in getting N equally-spaced samples from the full range, since the whole data set would be too much to display in a web app.

I am currently using the following query for N = 500:

SELECT sensor_id, sensor_data_id, val1, val2, datetime
FROM (
   SELECT sensor_id, sensor_data_id, val1, val2, datetime
        , ROW_NUMBER() OVER (ORDER BY datetime) AS rn
   FROM sensordata
   WHERE sensor_id = 22
  ) x
WHERE mod(rn, (SELECT count(1) / 500 FROM sensordata WHERE sensor_id = 22)) = 0;

The table contains val1 and val2 for several timestamps per sensor_id (roughly 2M rows for sensor_id = 22). The table has the primary key (sensor_data_id) and the following other indexes: one on (sensor_id) and a composite index on (sensor_id, datetime).

The above query takes roughly 36 s, and the result size for sensor_id = 22 is around 278 MB, found using SELECT pg_size_pretty(sum(pg_column_size(sensordata))) FROM sensordata WHERE sensor_id = 22;. While the above query does not consider a time range, I eventually want to add another condition to the WHERE clause to downsample the data for a selected datetime range for a particular sensor_id.
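
For example, something along these lines, where the range bounds are just placeholders:

SELECT sensor_id, sensor_data_id, val1, val2, datetime
FROM (
   SELECT sensor_id, sensor_data_id, val1, val2, datetime
        , ROW_NUMBER() OVER (ORDER BY datetime) AS rn
   FROM sensordata
   WHERE sensor_id = 22
   AND   datetime >= '2022-01-01'  -- placeholder lower bound
   AND   datetime <  '2022-02-01'  -- placeholder upper bound
  ) x
WHERE mod(rn, (SELECT count(1) / 500
               FROM sensordata
               WHERE sensor_id = 22
               AND   datetime >= '2022-01-01'
               AND   datetime <  '2022-02-01')) = 0;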

I have looked at EXPLAIN ANALYZE output a couple of times, and the only tweak that seemed to help a little was increasing work_mem, which is now at 300 MB. The database does not need to serve many connections simultaneously, so such a high value is not an issue currently.

Are there more efficient ways to achieve the above? The table sensordata is likely to grow over time; it is currently around 7 GB across 121 distinct sensor_id values.

About PG: "PostgreSQL 12.11 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit"

Running on: AWS RDS db.t3.micro.

Here and here are execution plans for my query.


1 Answer


In any case, you need the multicolumn index on (sensor_id, datetime) that you already have.
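
For reference, a minimal sketch of that index (the name below is simply what Postgres would generate by default; you already have an equivalent one):

CREATE INDEX sensordata_sensor_id_datetime_idx
ON sensordata (sensor_id, datetime);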

Optimized simple query

You can improve your original query to:

SELECT sensor_data_id, val1, val2, datetime
FROM  (
   SELECT sensor_data_id, val1, val2, datetime
        , row_number() OVER (ORDER BY datetime ROWS UNBOUNDED PRECEDING) AS rn
   FROM   sensordata
   WHERE  sensor_id = 22
   ) x
WHERE rn % (SELECT COUNT(*) / 500 FROM sensordata WHERE sensor_id = 22) = 0;

The added ROWS UNBOUNDED PRECEDING is a workaround for a sneaky performance issue in window functions that is fixed in Postgres 16.
See my bug report on pgsql-bugs and the discussion on pgsql-hackers.

Either way, upgrade to a current version of Postgres. There has been a steady stream of improvements to performance.

Correct query

Neither query gets you exactly 500 equidistant rows. The function width_bucket() would get us there:

SELECT DISTINCT ON (bucket)
       sensor_data_id, val1, val2, datetime
FROM  (
   SELECT sensor_data_id, val1, val2, datetime
        , width_bucket(row_number() OVER (ORDER BY datetime ROWS UNBOUNDED PRECEDING), 1
                     , (SELECT COUNT(*) + 1 FROM sensordata WHERE sensor_id = 22), 500) AS bucket
   FROM   sensordata
   WHERE  sensor_id = 22
   ) sub
ORDER BY bucket, datetime;

Adding 1 to the total count makes the last row land inside the 500th bucket instead of the overflow bucket (the 501st).
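
A quick sanity check of that boundary behavior, not part of the query itself, just for illustration:

-- upper bound equal to the max operand: the last value overflows into bucket 6
SELECT width_bucket(10, 1, 10, 5);  -- returns 6
-- upper bound bumped by 1: the last value lands in the last regular bucket
SELECT width_bucket(10, 1, 11, 5);  -- returns 5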

Much faster index-based selection?

If there is any reliable (!) pattern in your datetime column (or any other column, for that matter; maybe sensor_data_id?), or if sampled rows don't have to be exactly equidistant, a much faster index-based selection may be possible. Say you have a (mostly) regular time schedule for your sensor, i.e. regular intervals in datetime; then you can fetch one row per computed time interval:

WITH RECURSIVE rcte AS (   
   (  -- parentheses required!
   SELECT sensor_data_id, val1, val2, datetime
        , (SELECT (max(s2.datetime) - s.datetime) / 500
           FROM   sensordata s2
           WHERE  s2.sensor_id = 22) AS intvl  -- very fast idx-based
   FROM   sensordata s
   WHERE  sensor_id = 22
   ORDER  BY datetime
   LIMIT  1
   )

UNION ALL
SELECT s.*, r.intvl
FROM   rcte r
CROSS  JOIN LATERAL (
   SELECT s.sensor_data_id, s.val1, s.val2, s.datetime
   FROM   sensordata s
   WHERE  s.sensor_id = 22
   AND    s.datetime > r.datetime + r.intvl
   ORDER  BY s.datetime
   LIMIT  1
   ) s
)
TABLE  rcte;

This should be substantially faster for big tables, as it can skip over many rows and pick the next one via the index, recursively.
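
To honor the datetime range from the question, a sketch under the assumption that the same bounds (placeholders again) go into the first leg, the interval computation, and the LATERAL lookup:

WITH RECURSIVE rcte AS (
   (
   SELECT sensor_data_id, val1, val2, datetime
        , (SELECT (max(s2.datetime) - s.datetime) / 500
           FROM   sensordata s2
           WHERE  s2.sensor_id = 22
           AND    s2.datetime <  '2022-02-01') AS intvl  -- placeholder upper bound
   FROM   sensordata s
   WHERE  sensor_id = 22
   AND    datetime >= '2022-01-01'  -- placeholder lower bound
   ORDER  BY datetime
   LIMIT  1
   )

UNION ALL
SELECT s.*, r.intvl
FROM   rcte r
CROSS  JOIN LATERAL (
   SELECT s.sensor_data_id, s.val1, s.val2, s.datetime
   FROM   sensordata s
   WHERE  s.sensor_id = 22
   AND    s.datetime >  r.datetime + r.intvl
   AND    s.datetime <  '2022-02-01'  -- recursion stops at the upper bound
   ORDER  BY s.datetime
   LIMIT  1
   ) s
)
TABLE  rcte;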


Aside

the following other indexes: one on (sensor_id) and a composite index on (sensor_id, datetime).

The index on just (sensor_id) is largely redundant: the composite index on (sensor_id, datetime) covers the same queries. It only earns its keep if index deduplication in Postgres 13 or later compresses it enough to be useful in addition.
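
If you decide to drop it, that's a single statement; the index name below is hypothetical, so look up the real one first (e.g. with \d sensordata in psql):

DROP INDEX sensordata_sensor_id_idx;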
