11

I have data like following:

    created_at       |    status    
---------------------+-------------
 2016-04-05 1:27:15  | info
 2016-04-05 3:27:15  | info
 2016-04-05 5:27:15  | warn
 2016-04-05 10:27:15 | info
 2016-04-05 11:27:15 | warn

With this data, I want to convert like as follows:

 status  | 2016-04-05 1:00:00 | 2016-04-05 4:00:00 | 2016-04-05 8:00:00 | 2016-04-05 12:00:00
---------+--------------------+--------------------+--------------------+-------------------
 info    | 1                  | 1                  | 0                  | 1
 warn    | 0                  | 0                  | 1                  | 1                  

Can anyone suggest the best way to do this?

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633
user91740
  • 111
  • 1
  • 1
  • 3

2 Answers2

8

Assuming 2016-04-05 0:27:15 instead of 2016-04-05 1:27:15 in the underlying table, the question would make more sense to me:

CREATE TABLE tbl (created_at timestamp, status text);
INSERT INTO tbl VALUES
  ('2016-04-05 00:27:15', 'info')
, ('2016-04-05 03:27:15', 'info')
, ('2016-04-05 05:27:15', 'warn')
, ('2016-04-05 10:27:15', 'info')
, ('2016-04-05 11:27:15', 'warn');

The logic would be to count events that happened up to and excluding the next bound. This fits the often overlooked function width_bucket() perfectly. To be precise, it requires the variant with arbitrary bounds (since there is no regular pattern in the OP's bounds) introduced with Postgres 9.5. Explanation straight from the manual:

width_bucket(operand anyelement, thresholds anyarray)

return the bucket number to which operand would be assigned given an array listing the lower bounds of the buckets; returns 0 for an input less than the first lower bound; the thresholds array must be sorted, smallest first, or unexpected results will be obtained

For regular buckets you can use another variant that's available in Postgres 9.1 as well.
Combine it with crosstab() re-using the same bounds as column names (the rest of the query works with Postgres 9.1):

SELECT * FROM crosstab(
 $$SELECT status
        , width_bucket(created_at, '{2016-04-05 01:00
                                   , 2016-04-05 04:00
                                   , 2016-04-05 08:00
                                   , 2016-04-05 12:00}'::timestamp[])
        , count(*)::int
   FROM   tbl
   WHERE  created_at < '2016-04-05 12:00'  -- exclude later rows
   GROUP  BY 1, 2
   ORDER  BY 1, 2$$
, 'SELECT generate_series(0,3)'
   ) AS t(status text, "2016-04-05 01:00" int
                     , "2016-04-05 04:00" int
                     , "2016-04-05 08:00" int
                     , "2016-04-05 12:00" int);

Result:

 status | 2016-04-05 01:00 | 2016-04-05 04:00 | 2016-04-05 08:00 | 2016-04-05 12:00
--------+------------------+------------------+------------------+------------------
 info   |                1 |                1 |                  |  1
 warn   |                  |                  |                1 |  1

The second crosstab parameter ('SELECT generate_series(0,3)') is a query string when executed returning one row for every target column. Every value not found on either side - not in the raw data or not generated by the 2nd parameter - is simply ignored.

Basics for crosstab():

Replace NULL with 0

If you need 0 instead of NULL in the result, fix with COALESCE(), but that's merely a cosmetic problem:

SELECT status
     , COALESCE(t0, 0) AS "2016-04-05 01:00"
     , COALESCE(t1, 0) AS "2016-04-05 04:00"
     , COALESCE(t2, 0) AS "2016-04-05 08:00"
     , COALESCE(t3, 0) AS "2016-04-05 12:00"
FROM   crosstab(
 $$SELECT status
        , width_bucket(created_at, '{2016-04-05 01:00
                                   , 2016-04-05 04:00
                                   , 2016-04-05 08:00
                                   , 2016-04-05 12:00}'::timestamp[])
        , count(*)::int
   FROM   tbl
   WHERE  created_at < '2016-04-05 12:00'
   GROUP  BY 1, 2
   ORDER  BY 1, 2$$
, 'SELECT generate_series(0,3)'
   ) AS t(status text, t0 int, t1 int, t2 int, t3 int);

Result:

 status | 2016-04-05 01:00 | 2016-04-05 04:00 | 2016-04-05 08:00 | 2016-04-05 12:00
--------+------------------+------------------+------------------+------------------
 info   |                1 |                1 |                0 |  1
 warn   |                0 |                0 |                1 |  1

Adding totals

To add totals per status use the new GROUPING SETS in Postgres 9.5+

SELECT status
     , COALESCE(t0, 0) AS "2016-04-05 01:00"
     , COALESCE(t1, 0) AS "2016-04-05 04:00"
     , COALESCE(t2, 0) AS "2016-04-05 08:00"
     , COALESCE(t3, 0) AS "2016-04-05 12:00"
     , COALESCE(t4, 0) AS total
FROM   crosstab(
 $$SELECT status, COALESCE(slot, -1), ct  -- special slot for totals
   FROM  (
      SELECT status
           , width_bucket(created_at, '{2016-04-05 01:00
                                      , 2016-04-05 04:00
                                      , 2016-04-05 08:00
                                      , 2016-04-05 12:00}'::timestamp[]) AS slot
           , count(*)::int AS ct
      FROM   tbl
      WHERE  created_at < '2016-04-05 12:00'
      GROUP  BY GROUPING SETS ((1, 2), 1)  -- add totals per status
      ORDER  BY 1, 2
      ) sub$$
 , 'VALUES (0), (1), (2), (3), (-1)'  -- switched to VALUES for more sophisticated series
   ) AS t(status text, t0 int, t1 int, t2 int, t3 int, t4 int);

Result like above, plus:

...  | total
... -+-------
...  |     3
...  |     2

Note that total includes all rows not excluded before aggregation, even if filtered by crosstab().

This is in reply to @Vérace's request in the comments rather than to the unclear question.

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633
5

To answer this, I did the following. (I found the following three threads helpful - 1, 2 and 3. I also found the documentation on generate_series and CROSSTAB here and here respectively to be useful). This should work on 9.1 - not tested, but the documentation indicates that no post-9.1 stuff is being used.

Created a table:

ntest=# create table pv_tab(created_at timestamp, status varchar(10));

and populated it.

INSERT INTO pv_tab VALUES('2016-04-05 01:27:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 03:27:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 05:27:15', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:27:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 11:27:15', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 00:27:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 00:24:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 00:24:13', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 00:24:13', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 01:24:13', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 01:24:13', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 01:12:13', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 01:12:22', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 02:05:45', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 02:34:45', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:34:45', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:35:45', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:36:45', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:36:45', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 10:36:34', 'info');
(20 rows)

The (correct) result of my query is:

stat        slot1   slot2   slot3   slot4  slot5  slot6   Total
----        -----   -----   -----   -----  -----  -----   -----
info            6       2       0       3      0      0      11
warn            3       1       1       4      0      0       9

The working query is:

SELECT * FROM CROSSTAB
(
'
WITH time_slots AS
(
  SELECT
    status,
    CASE

      -- Here I put the "created_at" values into "buckets" - it would
      -- not be desirable to have too many of these buckets - certainly
      -- any more than 12 would make the `SQL` and result unwieldy!

      -- I recommend that you create 2hr slots - 00:00 - 02:00, &c.
      -- This `CTE` splits the times into the various slots sample slots
      -- 1-4 - you can, of course, have more but it makes the `SQL` and 
      -- the answer more messy. Here, I;ve deliberately only used 4
      -- out of 6 in order to illustrate dealing with sparse data in 
      -- the result. (I used the OP;s initial slots - easy to change).

      WHEN created_at <  ''2016-04-05 02:00'' THEN 1
      WHEN created_at >= ''2016-04-05 02:00'' AND created_at < ''2016-04-05 04:00'' THEN 2
      WHEN created_at >= ''2016-04-05 04:00'' AND created_at < ''2016-04-05 08:00'' THEN 3
      WHEN created_at >= ''2016-04-05 08:00'' AND created_at < ''2016-04-05 12:00'' THEN 4
    END AS time_slot,
    COUNT(status) AS stat_count
  FROM pv_tab
  GROUP BY status, time_slot
  ORDER BY status, time_slot
),
statuae AS
  -- Get all statuses. Hardly necessary when there are only two, but
  -- could be an issue later if more values are required ("unknown"..  &c.).
(
  SELECT DISTINCT(status) AS stati
  FROM pv_tab
),
all_slots (slots) AS
  -- This `CTE` is necessary to perform a cross-join between statuses 
  -- and slots. This is because the `CROSSTAB` table function doesn;t 
  -- appear to play well with `NULL`s - see my question to Erwin
  -- Brandstetter in comments.
(
  SELECT generate_series(1, 6) -- six (should be) 2 hour slots. In any case, it is arbitrary!
),
stat_slots AS
  -- Here the statuses slots are cross-joined - i.e. all slots with all statuses.
(
  SELECT statuae.stati, all_slots.slots
  FROM statuae, all_slots
),
individual_stati AS
  -- `Left-join` the complete status/slot table with the actual slots in
  -- the sample table. NULL counts are `COALESCE`ed into 0 - necessary, otherwise
  -- `NULL`s "back up" the result and leaves blanks in the right-most 
  -- columns - and the totals appear in what should be slots. 
(
  SELECT ss.stati AS status, ss.slots AS time_slot, COALESCE(ts.stat_count, 0) AS counts
  FROM stat_slots ss
  LEFT JOIN time_slots ts
  ON ss.stati = ts.status AND ss.slots = ts.time_slot
  ORDER BY 1, 2
),
total_stati AS
  -- This is just pure showing off :-). I;m using this `CTE` to add
  -- a totals field to the query. Not asked for by the OP - can be
  -- ripped out! I got the idea for this from the 3rd link (top of post).
(
  SELECT status, 7 AS time_slot, count(status) AS counts -- 7 - an exta slot for totals
  FROM pv_tab
  GROUP BY status
)
-- Final query bringing it all together - nice, simple and elegant. :-)
SELECT status, time_slot, counts FROM individual_stati
UNION
SELECT status, time_slot, counts FROM total_stati
ORDER BY 1, 2
'
) AS My_Tab("stat" varchar(10), "slot1" bigint, "slot2" bigint, "slot3" bigint, "slot4" bigint, "slot5" bigint, "slot6" bigint, "Total" bigint);
Vérace
  • 30,923
  • 9
  • 73
  • 85