
I have the following table (simplified):

create table tunnels
(
    interval          varchar          not null,
    ts_start          timestamp        not null,
    ts_end            timestamp,
    price_top         double precision not null,
    price_bottom      double precision not null
)

Those represent segments, delimited in time by ts_start and ts_end, with a price range between price_bottom and price_top.

The data is sparse: rows are created only when something changes. Several timeframes run at the same time, identified by 'interval'.

I would like a query where I request a time range and get back rows with the values for every interval that exists.

Some sample data is here: https://pastebin.com/P1mugdtk

What I'm looking for is an array with values interpolated by the minute, per interval (intervals are represented as strings: M1, M5, etc. right now), with rows like:

[ ts: xxx, m1_price_top: xx, m1_price_bottom: xx, m5_price_top: xx, m5_price_bottom: xx, ... ]
[ ts: xxx, m1_price_top: xx, m1_price_bottom: xx, m5_price_top: xx, m5_price_bottom: xx, ... ]

To add a bit more complexity: within a time range, we need the previous record for each interval, so we know where to start, and ts_end can be null if there is no data that closes that segment (it's always the last record).

How can I achieve that in Postgres?

Thomas

1 Answer

This seems to me to be an excellent use case for PostgreSQL's range data type(s)!

I've created your table with a few changes as follows (all of the code below is available on the fiddle here):

CREATE TABLE tunnels
(
  t_interval   TEXT         NOT NULL,  -- TEXT is better virtually **always**
  ts_start     TIMESTAMPTZ  NOT NULL,  -- With timezone better if possible!
  ts_end       TIMESTAMPTZ,
  price_top    REAL         NOT NULL,  -- REAL should be good enough - adjust as you
  price_bottom REAL         NOT NULL,   -- see fit!

  CONSTRAINT ts_lt_te    CHECK  (ts_start < ts_end),
  CONSTRAINT ti_ts_te_uq UNIQUE (t_interval, ts_start, ts_end)
);

A few points to note:

  • the TEXT type is almost always preferable to VARCHAR (first field) - see here, here and here (these posters are PostgreSQL experts!). I also changed the field's name, because INTERVAL is a PostgreSQL keyword: SQL keywords should never be used as field or table names - it makes debugging difficult and makes your app non-portable

  • again, TIMESTAMPTZ is virtually always preferable to TIMESTAMP - see here, here and here

  • I've used REAL instead of DOUBLE PRECISION - this just depends on how much precision you actually need - the former takes up 4 bytes, the latter 8 (see here for details). Basically, you should only use as much precision as you require, but I'm not aware of your requirements in this area, so this is just a suggestion - the choice is yours!

  • I've added a couple of constraints which would help to weed out bad data - there's no bad data in your sample (however, see below, and the quick demonstration after this list)

  • I've hidden the table INSERT statement snippet in the fiddle (it's a pain to scroll through all the time - it's just below the table definition)
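
As a quick demonstration of the CHECK constraint at work (the values below are made up), an INSERT whose ts_start falls after its ts_end is rejected:

--
--  ts_start is after ts_end here, so the ts_lt_te CHECK constraint
--  should reject the row with something like:
--  ERROR:  new row for relation "tunnels" violates check constraint "ts_lt_te"
--
INSERT INTO tunnels
VALUES ('M1', '2023-10-03 19:05:00+00', '2023-10-03 19:00:00+00', 27479.5, 27230);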

So, now we come to PostgreSQL's range types (Functions and Operators (manual) - also see here and here for a couple of nice use cases). PostgreSQL is the only mainstream RDBMS which has these out of the box.

We will use the TSTZRANGE (range of times with TIMESTAMPTZ) type as follows:

SELECT 
  t_interval AS t_int, 
  TSTZRANGE(ts_start, ts_end) AS range,
  price_bottom AS pb, price_top AS pt,
  ts_start AS ts, ts_end AS te
FROM 
  tunnels
WHERE ts_start >= '2023-10-02' AND ts_end < '2023-10-04'
  AND 
  TSTZRANGE(ts_start, ts_end) 
  @> 
  '2023-10-03 19:20:00'::TIMESTAMPTZ  
ORDER BY 
  t_interval, ts_start
LIMIT 20;

Result:

t_int  range                                                 pb       pt       ts                      te
H1     ["2023-10-03 18:00:00+00","2023-10-03 21:00:00+00")   27212.3  27479.5  2023-10-03 18:00:00+00  2023-10-03 21:00:00+00
H2     ["2023-10-03 06:00:00+00","2023-10-03 22:00:00+00")   27255    27633.3  2023-10-03 06:00:00+00  2023-10-03 22:00:00+00
M15    ["2023-10-03 19:15:00+00","2023-10-03 19:30:00+00")   27230    27479.5  2023-10-03 19:15:00+00  2023-10-03 19:30:00+00
M30    ["2023-10-03 18:00:00+00","2023-10-03 20:30:00+00")   27212.3  27479.5  2023-10-03 18:00:00+00  2023-10-03 20:30:00+00
M5     ["2023-10-03 19:05:00+00","2023-10-03 19:25:00+00")   27230    27479.5  2023-10-03 19:05:00+00  2023-10-03 19:25:00+00

(I've put another couple of examples at the bottom of the fiddle.)

  • Check out the @> notation - it means "containment" - you can find the other operators in the manual.

  • Note the bounds of the range - starts with [ and ends with ) - the [ is inclusive whereas the ) is exclusive. This is very important!

In the text form of a range, an inclusive lower bound is represented by “[” while an exclusive lower bound is represented by “(”. Likewise, an inclusive upper bound is represented by “]”, while an exclusive upper bound is represented by “)”. (See Section 8.17.5 for more details).
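
You can see the exclusive upper bound in action with a one-liner (the timestamps here are just for illustration):

-- The upper bound is exclusive, so a range does NOT contain its own
-- end point - this returns FALSE:
SELECT TSTZRANGE('2023-10-03 18:00:00+00', '2023-10-03 21:00:00+00')
       @> '2023-10-03 21:00:00+00'::TIMESTAMPTZ AS contains_end;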

If you run SELECT * FROM tunnels ORDER BY t_interval, ts_start LIMIT 2; with your sample data, the result is:

t_interval  ts_start                ts_end                  price_top  price_bottom
D1          2023-10-03 00:00:00+00  2023-10-04 00:00:00+00  28613      25965
D1          2023-10-04 00:00:00+00  2023-10-09 00:00:00+00  28613      27123

Now, you can see where ts_end of the first record matches the ts_start of the second - conceivably you could have two different prices at the same point in time - this is probably not a good idea. I ran a query (see fiddle) and your data doesn't appear to suffer from this, but it could.

Now, notice the range columns - they start with [ and end with ) - so no overlaps! For any instant, there can only be one value of price_bottom, price_top - see my question in the fiddle as to why values match in succeeding records.
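
A sketch of the sort of check I ran - a self-join looking for two rows of the same interval whose half-open ranges overlap (the aliases are mine; this should return no rows on clean data):

SELECT a.t_interval, a.ts_start AS a_start, a.ts_end AS a_end,
       b.ts_start AS b_start, b.ts_end AS b_end
FROM tunnels AS a
JOIN tunnels AS b
  ON  a.t_interval = b.t_interval
  AND a.ts_start   < b.ts_start  -- pairs each combination once (assumes
                                 -- distinct start times per interval)
  AND TSTZRANGE(a.ts_start, a.ts_end) && TSTZRANGE(b.ts_start, b.ts_end);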

Now, we come to the pièce de résistance of PostgreSQL's capabilities in this area! First we'll do an INSERT as follows:

--
--  Works, but overlaps with (see above)
--  D1  2023-10-03  2023-10-04  2   2   2023-10-04  27123   28613
--

INSERT INTO tunnels VALUES ('D1', '2023-10-03 01:00:00+00', '2023-10-04 05:00:00+00', 2, 2);

We don't want overlaps to be possible!

So, we do the following: CREATE EXTENSION btree_gist;. From the documentation, we have:

btree_gist provides GiST index operator classes that implement B-tree equivalent behavior for the data types int2, int4, int8, float4, float8, numeric, timestamp with time zone, timestamp without time zone, time with time zone, time without time zone, date, interval, oid, money, char, varchar, text, bytea, bit, varbit, macaddr, macaddr8, inet, cidr, uuid, bool and all enum types.

Note "text" (and "varchar") - without the extension, these types won't work with GiST indexes - not for btree like indexes anyway - try running the fiddle without the extension.

There is a whole ecosystem of PostgreSQL extensions (another feature unique to PostgreSQL) - here are the "contrib" modules that are available, and this is only scratching the surface - see here, here & here. Contrib modules are not part of the core, but they are held in high esteem and are distributed with it.

After DELETE-ing the offending record, we create an index:

CREATE INDEX ti_ts_te_range_ix 
ON tunnels
USING GIST (t_interval, TSTZRANGE(ts_start, ts_end));

and we alter our table and establish a non-overlapping constraint as follows:

ALTER TABLE tunnels ADD CONSTRAINT ts_te_no_overlap_ct
EXCLUDE USING GIST (t_interval WITH =, TSTZRANGE(ts_start, ts_end) WITH &&);

and then we retry our INSERT - Result:

ERROR:  conflicting key value violates exclusion constraint "ts_te_no_overlap_ct"
DETAIL:  Key (t_interval, tstzrange(ts_start, ts_end))=(D1, ["2023-10-03 01:00:00+00","2023-10-04 05:00:00+00")) 
conflicts with existing key (t_interval, tstzrange(ts_start, ts_end))=(D1, ["2023-10-03 00:00:00+00","2023-10-04 00:00:00+00")).

So, I hope this answers your question, or at least goes some way towards doing so. You say you want an array of arrays - I'm not sure that's a good idea - manipulating long strings is not SQL's forte - it's far better at "slicing and dicing" data "atoms" - i.e. no .csv fields, for example - see Codd's rules:

Rule 2: The guaranteed access rule:

Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.
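
That said, as a starting point for the per-minute rows you describe, something along these lines may serve (a sketch only: the window bounds are made up, and only M1 and M5 are pivoted out - extend the FILTER list for your other intervals):

--
--  One row per minute in the requested window; the containment join
--  picks, per interval, the segment covering that minute. A NULL
--  ts_end gives an unbounded range, so the open "last record" of each
--  interval is matched too. MAX(...) FILTER pivots the intervals into
--  columns.
--
SELECT
  m.ts,
  MAX(s.price_top)    FILTER (WHERE s.t_interval = 'M1') AS m1_price_top,
  MAX(s.price_bottom) FILTER (WHERE s.t_interval = 'M1') AS m1_price_bottom,
  MAX(s.price_top)    FILTER (WHERE s.t_interval = 'M5') AS m5_price_top,
  MAX(s.price_bottom) FILTER (WHERE s.t_interval = 'M5') AS m5_price_bottom
FROM generate_series('2023-10-03 19:00:00+00'::TIMESTAMPTZ,
                     '2023-10-03 19:30:00+00'::TIMESTAMPTZ,
                     INTERVAL '1 minute') AS m(ts)
LEFT JOIN tunnels AS s
       ON TSTZRANGE(s.ts_start, s.ts_end) @> m.ts
GROUP BY m.ts
ORDER BY m.ts;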

Two final points:

    1. PostgreSQL recently introduced the multirange type - that might be of interest to you if you wish to have long strings (a processing nightmare!) - a couple of articles - code before multirange and code after! (A tiny sketch follows after this list.)
    2. If you like, another thing you could do is to use a GENERATED column - it can make your SQL more elegant at the price of a bit of disk space - you can do this as follows (see small sample fiddle here):

    ALTER TABLE test ADD COLUMN z TSTZRANGE GENERATED ALWAYS AS (TSTZRANGE(x, y)) STORED;
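
By way of illustration, here is a minimal multirange example (PostgreSQL 14+) - the column alias is mine:

-- Collapse each interval's segments into a single TSTZMULTIRANGE
-- value (PostgreSQL 14+):
SELECT t_interval,
       range_agg(TSTZRANGE(ts_start, ts_end)) AS covered
FROM tunnels
GROUP BY t_interval;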

If the answer above doesn't satisfy you, maybe the multirange approach can be adapted to your requirements - ping me in the comments! Interesting question - I learnt a lot (+1)!

Vérace