db<>fiddle for all of the data and queries below
I have a table events with the following structure:
create table events (
correlation_id char(26) not null,
user_id bigint,
task_id bigint not null,
location_id bigint,
type bigint not null,
created_at timestamp(6) with time zone not null,
constraint events_correlation_id_created_at_user_id_unique
unique (correlation_id, created_at, user_id)
);
This table holds records of tasks being performed, that look like this:
| correlation_id | user_id | task_id | location_id | type | created_at |
|---|---|---|---|---|---|
| 01CN4HP4AN0000000000000001 | 4 | 58 | 30 | 0 | 2018-08-17 18:17:15.348629 |
| 01CN4HP4AN0000000000000001 | 4 | 58 | 30 | 1 | 2018-08-17 18:17:22.852299 |
| 01CN4HP4AN0000000000000001 | 4 | 58 | 30 | 99 | 2018-08-17 18:17:25.535593 |
| 01CN4J9SZ80000000000000003 | 4 | 97 | 30 | 0 | 2018-08-17 18:28:00.104093 |
| 01CN4J9SZ80000000000000003 | 4 | 97 | 30 | 99 | 2018-08-17 18:29:09.016840 |
| 01CN4JC1430000000000000004 | 4 | 99 | 30 | 0 | 2018-08-17 18:29:12.963264 |
| 01CN4JC1430000000000000004 | 4 | 99 | 30 | 99 | 2018-08-17 18:32:09.272632 |
| 01CN4KJCDY0000000000000005 | 139 | 97 | 30 | 0 | 2018-08-17 18:50:09.725668 |
| 01CN4KJCDY0000000000000005 | 139 | 97 | 30 | 3 | 2018-08-17 18:50:11.842000 |
| 01CN4KJCDY0000000000000005 | 139 | 97 | 30 | 99 | 2018-08-17 18:51:42.240895 |
| 01CNC4G1Y40000000000000008 | 139 | 99 | 30 | 0 | 2018-08-20 17:00:40.260430 |
| 01CNC4G1Y40000000000000008 | 139 | 99 | 30 | 99 | 2018-08-20 17:00:47.583501 |
Rows with type = 0 indicate the start of a task, and rows with type = 99 indicate the end of a task. (Other values mean other things that are not relevant for this question, but two example rows are included here for completeness.)
Each task_id corresponds to a row from a tasks table. The only other field in the tasks table that is relevant to this question is called inprogress_status, and it can be 1 or 2, which represent Opening task and Closing task respectively.
I was originally asked for a query that would return the list of tasks, ordered by start date and location, with a single row that includes the start (type = 0) and end (type = 99) for each task.
Here is the query I used to do that:
SELECT e.created_at::DATE, e.location_id, e.task_id
, CASE t.inprogress_status WHEN 2 THEN 'CLOSE' WHEN 1 THEN 'OPEN' END AS task_type
, e.correlation_id
, json_object_agg(e.type, json_build_object('timestamp', e.created_at, 'user_id', e.user_id)) AS events
FROM events e
JOIN tasks t on e.task_id = t.id
WHERE e.type IN (0, 99)
AND t.inprogress_status IN (1, 2)
group by created_at::DATE, location_id, task_id, correlation_id, inprogress_status
ORDER BY 1, 2, 3;
Here is the result for that query using the data shown above:
| created_at | location_id | task_id | task_type | correlation_id | events |
|---|---|---|---|---|---|
| 2018-08-17 | 30 | 58 | OPEN | 01CN4HP4AN0000000000000001 | {"0": {"timestamp": "2018-08-17T18:17:15.348629+00:00", "user_id": 4}, "99": {"timestamp": "2018-08-17T18:17:25.535593+00:00", "user_id": 4} } |
| 2018-08-17 | 30 | 97 | CLOSE | 01CN4J9SZ80000000000000003 | {"0": {"timestamp": "2018-08-17T18:28:00.104093+00:00", "user_id": 4}, "99": {"timestamp": "2018-08-17T18:29:09.01684+00:00", "user_id": 4} } |
| 2018-08-17 | 30 | 99 | OPEN | 01CN4JC1430000000000000004 | { "0": {"timestamp": "2018-08-17T18:29:12.963264+00:00", "user_id": 4}, "99": {"timestamp": "2018-08-17T18:32:09.272632+00:00", "user_id": 4} } |
| 2018-08-17 | 30 | 97 | CLOSE | 01CN4KJCDY0000000000000005 | { "0": {"timestamp": "2018-08-17T18:50:09.725668+00:00", "user_id": 139}, "99": {"timestamp": "2018-08-17T18:51:42.240895+00:00", "user_id": 139} } |
| 2018-08-20 | 30 | 99 | OPEN | 01CNC4G1Y40000000000000008 | { "0": {"timestamp": "2018-08-20T17:00:40.26043+00:00", "user_id": 139}, "99" : {"timestamp": "2018-08-20T17:00:47.583501+00:00", "user_id" : 139} } |
In the above example, task_id 58 and 99 have inprogress_status = 1 and task_id 97 has inprogress_status = 2.
Now I have been asked to modify the returned data structure so that it aggregates by the inprogress_status as well, and returns the rows as pairs of OPEN+CLOSE events.
To try to figure out how to build this, I started by trying to get this format (the final format I actually want is below):
| created_at | location_id | events |
|---|---|---|
| 2018-08-17 | 30 | {"OPEN": [{"correlation_id": "01CN4HP4AN0000000000000001", "0" : {"timestamp" : "2018-08-17T18:17:15.348629+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:17:25.535593+00:00", "user_id" : 4} }, {"OPEN": {"correlation_id": "01CN4JC1430000000000000004", "0" : {"timestamp" : "2018-08-17T18:29:12.963264+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:32:09.272632+00:00", "user_id" : 4} }], "CLOSE": [{"correlation_id": "01CN4J9SZ80000000000000003", "0" : {"timestamp" : "2018-08-17T18:28:00.104093+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:29:09.01684+00:00", "user_id" : 4} }, { "correlation_id": "01CN4KJCDY0000000000000005", "0" : {"timestamp" : "2018-08-17T18:50:09.725668+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-17T18:51:42.240895+00:00", "user_id" : 139} }]} |
| 2018-08-20 | 30 | {"OPEN": [{"correlation_id": "01CNC4G1Y40000000000000008", "0" : {"timestamp" : "2018-08-20T17:00:40.26043+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-20T17:00:47.583501+00:00", "user_id" : 139} }], "CLOSE": null} |
Here is the first query I wrote to try to make this work:
WITH grouped_events AS (
SELECT e.created_at::DATE AS created_date,
location_id,
task_id,
CASE t.inprogress_status WHEN 2 THEN 'CLOSE' WHEN 1 THEN 'OPEN' END AS task_type,
jsonb_build_object('id', e.correlation_id) ||
jsonb_object_agg(type, jsonb_build_object('timestamp', e.created_at, 'user_id', user_id)) AS events
FROM events e
JOIN tasks t on e.task_id = t.id
WHERE type IN (0, 99)
AND inprogress_status IN (1, 2)
GROUP BY e.created_at::DATE, location_id, task_id, correlation_id, t.inprogress_status
)
SELECT created_date, location_id, json_object_agg(task_type, events)
FROM grouped_events
GROUP BY 1, 2
ORDER BY 1, 2
The problem is that this produces invalid JSON. with multiple identical keys:
{
"OPEN": {
"0": { "user_id": 4, "timestamp": "2018-08-17T18:29:12.963264+00:00" },
"99": { "user_id": 4, "timestamp": "2018-08-17T18:32:09.272632+00:00" },
"id": "01CN4JC1430000000000000004"
},
"OPEN": {
"0": { "user_id": 4, "timestamp": "2018-08-17T18:17:15.348629+00:00" },
"99": { "user_id": 4, "timestamp": "2018-08-17T18:17:25.535593+00:00" },
"id": "01CN4HP4AN0000000000000001"
},
// ... etc.
}
I found that this query returns the data in the format show above:
WITH grouped_events1 AS (
SELECT e.created_at::DATE AS created_date,
location_id,
task_id,
CASE t.inprogress_status WHEN 2 THEN 'CLOSE' WHEN 1 THEN 'OPEN' END AS task_type,
jsonb_build_object('id', e.correlation_id) ||
jsonb_object_agg(type, jsonb_build_object('timestamp', e.created_at, 'user_id', user_id)) AS events
FROM events e
JOIN tasks t on e.task_id = t.id
WHERE type IN (0, 99)
AND inprogress_status IN (1, 2)
GROUP BY e.created_at::DATE, location_id, task_id, correlation_id, t.inprogress_status
), grouped_events2 AS (
SELECT created_date, location_id, task_type, json_agg(events) AS events
FROM grouped_events1
GROUP BY 1, 2, 3
)
SELECT created_date, location_id, json_object_agg(task_type, events)
FROM grouped_events2
GROUP BY 1, 2
ORDER BY 1, 2
However, the format I actually need should just pair a single OPEN with a single CLOSE, like this (each OPEN with the CLOSE that follows it in time):
| created_at | location_id | events |
|---|---|---|
| 2018-08-17 | 30 | {"OPEN": {"correlation_id": "01CN4HP4AN0000000000000001", "0" : {"timestamp" : "2018-08-17T18:17:15.348629+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:17:25.535593+00:00", "user_id" : 4} }, "CLOSE": {"correlation_id": "01CN4J9SZ80000000000000003", "0" : {"timestamp" : "2018-08-17T18:28:00.104093+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:29:09.01684+00:00", "user_id" : 4} }} |
| 2018-08-17 | 30 | {"OPEN": {"OPEN": {"correlation_id": "01CN4JC1430000000000000004", "0" : {"timestamp" : "2018-08-17T18:29:12.963264+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:32:09.272632+00:00", "user_id" : 4} }, "CLOSE": { "correlation_id": "01CN4KJCDY0000000000000005", "0" : {"timestamp" : "2018-08-17T18:50:09.725668+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-17T18:51:42.240895+00:00", "user_id" : 139} }} |
| 2018-08-20 | 30 | {"OPEN": [{"correlation_id": "01CNC4G1Y40000000000000008", "0" : {"timestamp" : "2018-08-20T17:00:40.26043+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-20T17:00:47.583501+00:00", "user_id" : 139} }], "CLOSE": null} |
Now I'm trying to figure out if I'm heading the wrong direction, because I can't see how to get to my final format from what I have.
Am I approaching this wrong? How can I get the result I'm looking for?