
db<>fiddle for all of the data and queries below

I have a table events with the following structure:

create table events (
    correlation_id char(26) not null,
    user_id        bigint,
    task_id        bigint not null,
    location_id    bigint,
    type           bigint not null,
    created_at     timestamp(6) with time zone not null,
    constraint events_correlation_id_created_at_user_id_unique
        unique (correlation_id, created_at, user_id)
);

This table holds records of tasks being performed, which look like this:

correlation_id user_id task_id location_id type created_at
01CN4HP4AN0000000000000001 4 58 30 0 2018-08-17 18:17:15.348629
01CN4HP4AN0000000000000001 4 58 30 1 2018-08-17 18:17:22.852299
01CN4HP4AN0000000000000001 4 58 30 99 2018-08-17 18:17:25.535593
01CN4J9SZ80000000000000003 4 97 30 0 2018-08-17 18:28:00.104093
01CN4J9SZ80000000000000003 4 97 30 99 2018-08-17 18:29:09.016840
01CN4JC1430000000000000004 4 99 30 0 2018-08-17 18:29:12.963264
01CN4JC1430000000000000004 4 99 30 99 2018-08-17 18:32:09.272632
01CN4KJCDY0000000000000005 139 97 30 0 2018-08-17 18:50:09.725668
01CN4KJCDY0000000000000005 139 97 30 3 2018-08-17 18:50:11.842000
01CN4KJCDY0000000000000005 139 97 30 99 2018-08-17 18:51:42.240895
01CNC4G1Y40000000000000008 139 99 30 0 2018-08-20 17:00:40.260430
01CNC4G1Y40000000000000008 139 99 30 99 2018-08-20 17:00:47.583501
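
For anyone reproducing this locally without the fiddle, here are the same rows as an insert statement (timestamps copied from the table above, with a UTC offset assumed to match the +00:00 values in the results further down):

insert into events (correlation_id, user_id, task_id, location_id, type, created_at) values
    ('01CN4HP4AN0000000000000001',   4, 58, 30,  0, '2018-08-17 18:17:15.348629+00'),
    ('01CN4HP4AN0000000000000001',   4, 58, 30,  1, '2018-08-17 18:17:22.852299+00'),
    ('01CN4HP4AN0000000000000001',   4, 58, 30, 99, '2018-08-17 18:17:25.535593+00'),
    ('01CN4J9SZ80000000000000003',   4, 97, 30,  0, '2018-08-17 18:28:00.104093+00'),
    ('01CN4J9SZ80000000000000003',   4, 97, 30, 99, '2018-08-17 18:29:09.016840+00'),
    ('01CN4JC1430000000000000004',   4, 99, 30,  0, '2018-08-17 18:29:12.963264+00'),
    ('01CN4JC1430000000000000004',   4, 99, 30, 99, '2018-08-17 18:32:09.272632+00'),
    ('01CN4KJCDY0000000000000005', 139, 97, 30,  0, '2018-08-17 18:50:09.725668+00'),
    ('01CN4KJCDY0000000000000005', 139, 97, 30,  3, '2018-08-17 18:50:11.842000+00'),
    ('01CN4KJCDY0000000000000005', 139, 97, 30, 99, '2018-08-17 18:51:42.240895+00'),
    ('01CNC4G1Y40000000000000008', 139, 99, 30,  0, '2018-08-20 17:00:40.260430+00'),
    ('01CNC4G1Y40000000000000008', 139, 99, 30, 99, '2018-08-20 17:00:47.583501+00');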

Rows with type = 0 indicate the start of a task, and rows with type = 99 indicate the end of a task. (Other values mean other things that are not relevant for this question, but two example rows are included here for completeness.)

Each task_id corresponds to a row from a tasks table. The only other field in the tasks table that is relevant to this question is called inprogress_status, and it can be 1 or 2, which represent Opening task and Closing task respectively.
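
For reference, a stripped-down sketch of the tasks table with just the columns used below (the real table has more fields, and the exact types here are guesses):

create table tasks (
    id                bigint primary key,
    inprogress_status smallint not null  -- exact type may differ; 1 = Opening task, 2 = Closing task
);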

I was originally asked for a query that would return the list of tasks, ordered by start date and location, with a single row that includes the start (type = 0) and end (type = 99) for each task.

Here is the query I used to do that:

SELECT e.created_at::DATE, e.location_id, e.task_id
     , CASE t.inprogress_status WHEN 2 THEN 'CLOSE' WHEN 1 THEN 'OPEN' END AS task_type
     , e.correlation_id
     , json_object_agg(e.type, json_build_object('timestamp', e.created_at, 'user_id', e.user_id)) AS events
FROM events e
JOIN tasks t on e.task_id = t.id
WHERE e.type IN (0, 99)
AND t.inprogress_status IN (1, 2)
GROUP BY created_at::DATE, location_id, task_id, correlation_id, inprogress_status
ORDER BY 1, 2, 3;

Here is the result for that query using the data shown above:

created_at location_id task_id task_type correlation_id events
2018-08-17 30 58 OPEN 01CN4HP4AN0000000000000001 {"0": {"timestamp": "2018-08-17T18:17:15.348629+00:00", "user_id": 4}, "99": {"timestamp": "2018-08-17T18:17:25.535593+00:00", "user_id": 4} }
2018-08-17 30 97 CLOSE 01CN4J9SZ80000000000000003 {"0": {"timestamp": "2018-08-17T18:28:00.104093+00:00", "user_id": 4}, "99": {"timestamp": "2018-08-17T18:29:09.01684+00:00", "user_id": 4} }
2018-08-17 30 99 OPEN 01CN4JC1430000000000000004 { "0": {"timestamp": "2018-08-17T18:29:12.963264+00:00", "user_id": 4}, "99": {"timestamp": "2018-08-17T18:32:09.272632+00:00", "user_id": 4} }
2018-08-17 30 97 CLOSE 01CN4KJCDY0000000000000005 { "0": {"timestamp": "2018-08-17T18:50:09.725668+00:00", "user_id": 139}, "99": {"timestamp": "2018-08-17T18:51:42.240895+00:00", "user_id": 139} }
2018-08-20 30 99 OPEN 01CNC4G1Y40000000000000008 { "0": {"timestamp": "2018-08-20T17:00:40.26043+00:00", "user_id": 139}, "99" : {"timestamp": "2018-08-20T17:00:47.583501+00:00", "user_id" : 139} }

In the above example, task_id 58 and 99 have inprogress_status = 1 and task_id 97 has inprogress_status = 2.
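
For completeness, the matching tasks rows for this sample data would be (using the stripped-down tasks sketch above):

insert into tasks (id, inprogress_status) values
    (58, 1),  -- Opening task, reported as OPEN
    (97, 2),  -- Closing task, reported as CLOSE
    (99, 1);  -- Opening task, reported as OPEN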

Now I have been asked to modify the returned data structure so that it aggregates by the inprogress_status as well, and returns the rows as pairs of OPEN+CLOSE events.

To work towards this, I started by trying to get this format (the final format I actually want is further below):

created_at location_id events
2018-08-17 30 {"OPEN": [{"correlation_id": "01CN4HP4AN0000000000000001", "0" : {"timestamp" : "2018-08-17T18:17:15.348629+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:17:25.535593+00:00", "user_id" : 4} }, {"OPEN": {"correlation_id": "01CN4JC1430000000000000004", "0" : {"timestamp" : "2018-08-17T18:29:12.963264+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:32:09.272632+00:00", "user_id" : 4} }], "CLOSE": [{"correlation_id": "01CN4J9SZ80000000000000003", "0" : {"timestamp" : "2018-08-17T18:28:00.104093+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:29:09.01684+00:00", "user_id" : 4} }, { "correlation_id": "01CN4KJCDY0000000000000005", "0" : {"timestamp" : "2018-08-17T18:50:09.725668+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-17T18:51:42.240895+00:00", "user_id" : 139} }]}
2018-08-20 30 {"OPEN": [{"correlation_id": "01CNC4G1Y40000000000000008", "0" : {"timestamp" : "2018-08-20T17:00:40.26043+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-20T17:00:47.583501+00:00", "user_id" : 139} }], "CLOSE": null}

Here is the first query I wrote to try to make this work:

WITH grouped_events AS (
    SELECT e.created_at::DATE AS created_date,
        location_id,
        task_id,
        CASE t.inprogress_status WHEN 2 THEN 'CLOSE' WHEN 1 THEN 'OPEN' END AS task_type,
        jsonb_build_object('id', e.correlation_id) ||
                jsonb_object_agg(type, jsonb_build_object('timestamp', e.created_at, 'user_id', user_id)) AS events
    FROM events e
    JOIN tasks t on e.task_id = t.id
    WHERE type IN (0, 99)
    AND inprogress_status IN (1, 2)
    GROUP BY e.created_at::DATE, location_id, task_id, correlation_id, t.inprogress_status
)
SELECT created_date, location_id, json_object_agg(task_type, events)
FROM grouped_events
GROUP BY 1, 2
ORDER BY 1, 2

The problem is that this produces invalid JSON, with multiple identical keys:

{
    "OPEN": {
        "0": { "user_id": 4, "timestamp": "2018-08-17T18:29:12.963264+00:00" },
        "99": { "user_id": 4, "timestamp": "2018-08-17T18:32:09.272632+00:00" },
        "id": "01CN4JC1430000000000000004"
    },
    "OPEN": {
        "0": { "user_id": 4, "timestamp": "2018-08-17T18:17:15.348629+00:00" },
        "99": { "user_id": 4, "timestamp": "2018-08-17T18:17:25.535593+00:00" },
        "id": "01CN4HP4AN0000000000000001"
    },
    // ... etc.
}
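
The duplicate keys come from json_object_agg: the text-based json type keeps every key/value pair it is handed, while jsonb_object_agg would silently keep only the last value per key (which would drop one of the OPEN tasks here rather than fix the problem). A minimal illustration:

-- json keeps duplicate keys verbatim; jsonb keeps only the last value per key.
SELECT json_object_agg(k, v)  AS json_result    -- {"OPEN" : 1, "OPEN" : 2}
     , jsonb_object_agg(k, v) AS jsonb_result   -- {"OPEN": 2}
FROM (VALUES ('OPEN', 1), ('OPEN', 2)) t(k, v);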

I found that this query returns the data in the format shown above:

WITH grouped_events1 AS (
    SELECT e.created_at::DATE AS created_date,
        location_id,
        task_id,
        CASE t.inprogress_status WHEN 2 THEN 'CLOSE' WHEN 1 THEN 'OPEN' END AS task_type,
        jsonb_build_object('id', e.correlation_id) ||
                jsonb_object_agg(type, jsonb_build_object('timestamp', e.created_at, 'user_id', user_id)) AS events
    FROM events e
    JOIN tasks t on e.task_id = t.id
    WHERE type IN (0, 99)
    AND inprogress_status IN (1, 2)
    GROUP BY e.created_at::DATE, location_id, task_id, correlation_id, t.inprogress_status
), grouped_events2 AS (
    SELECT created_date, location_id, task_type, json_agg(events) AS events
    FROM grouped_events1
    GROUP BY 1, 2, 3
)
SELECT created_date, location_id, json_object_agg(task_type, events)
FROM grouped_events2
GROUP BY 1, 2
ORDER BY 1, 2

However, the format I actually need should just pair a single OPEN with a single CLOSE, like this (each OPEN with the CLOSE that follows it in time):

created_at location_id events
2018-08-17 30 {"OPEN": {"correlation_id": "01CN4HP4AN0000000000000001", "0" : {"timestamp" : "2018-08-17T18:17:15.348629+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:17:25.535593+00:00", "user_id" : 4} }, "CLOSE": {"correlation_id": "01CN4J9SZ80000000000000003", "0" : {"timestamp" : "2018-08-17T18:28:00.104093+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:29:09.01684+00:00", "user_id" : 4} }}
2018-08-17 30 {"OPEN": {"OPEN": {"correlation_id": "01CN4JC1430000000000000004", "0" : {"timestamp" : "2018-08-17T18:29:12.963264+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:32:09.272632+00:00", "user_id" : 4} }, "CLOSE": { "correlation_id": "01CN4KJCDY0000000000000005", "0" : {"timestamp" : "2018-08-17T18:50:09.725668+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-17T18:51:42.240895+00:00", "user_id" : 139} }}
2018-08-20 30 {"OPEN": [{"correlation_id": "01CNC4G1Y40000000000000008", "0" : {"timestamp" : "2018-08-20T17:00:40.26043+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-20T17:00:47.583501+00:00", "user_id" : 139} }], "CLOSE": null}

Now I'm trying to figure out if I'm heading in the wrong direction, because I can't see how to get to my final format from what I have.

Am I approaching this wrong? How can I get the result I'm looking for?

1 Answer

This produces your desired result:

SELECT the_day, location_id
     , jsonb_object_agg(task_type, events || jsonb_build_object('correlation_id', correlation_id)) AS events
FROM  (
   SELECT e.created_at::date AS the_day, e.location_id, e.correlation_id
        , count(*) FILTER (WHERE t.inprogress_status = 1)
                   OVER (PARTITION BY e.location_id ORDER BY min(e.created_at) FILTER (WHERE e.type = 0)) AS task_nr
        , CASE t.inprogress_status WHEN 2 THEN 'CLOSE' WHEN 1 THEN 'OPEN' END AS task_type     
        , jsonb_object_agg(e.type, jsonb_build_object('timestamp', e.created_at, 'user_id', e.user_id)) AS events
   FROM   events e
   JOIN   tasks t on e.task_id = t.id
   WHERE  e.type IN (0, 99)
   AND    t.inprogress_status IN (1, 2)
   GROUP  BY 1, 2, e.correlation_id, t.inprogress_status
   ) sub
GROUP  BY the_day, location_id, task_nr
ORDER  BY the_day, location_id, task_nr;

db<>fiddle here

One caveat: an 'OPEN' event missing at the start of a day, or a 'CLOSE' event missing at the end, is simply absent from the result.

I use jsonb instead of json to allow the jsonb || jsonb concatenation operator. You can cast the result to json if you actually need that type.
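
A throwaway example of the difference, plus the cast back to json if a consumer insists on it:

-- The || concatenation operator is defined for jsonb, but not for plain json.
SELECT '{"a": 1}'::jsonb || '{"b": 2}'::jsonb AS merged;              -- {"a": 1, "b": 2}
SELECT ('{"a": 1}'::jsonb || '{"b": 2}'::jsonb)::json AS merged_json; -- same content, json type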

The core feature is this somewhat sophisticated expression to form task numbers:

    , count(*) FILTER (WHERE t.inprogress_status = 1)
               OVER (PARTITION BY e.location_id ORDER BY min(e.created_at) FILTER (WHERE e.type = 0)) AS task_nr

Every 'OPEN' task starts a new group. The created_at of the row with type = 0 defines the sequence of tasks. Technically, this works because we can nest aggregate functions (even with the aggregate FILTER clause) inside window functions.
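
To see just the numbering in isolation, here is the same counting trick on a throwaway row set (the values are invented for the demonstration):

-- Every row flagged as a group start bumps the running count,
-- so a start row and the rows that follow it share one group number.
SELECT val, is_start
     , count(*) FILTER (WHERE is_start) OVER (ORDER BY val) AS grp
FROM  (VALUES (1, true), (2, false), (3, true), (4, false), (5, false)) t(val, is_start);
-- grp: 1, 1, 2, 2, 2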