I'd like your help to optimise an increasingly slow query - or better yet, help me understand what is the problem and point me in the right direction.
Every day I scrape 3 supermarkets and I record their prices in a django app. Each scrape results in a new Retrieval record, a few thousand Price records and maybe in a few Item records (most items already exist, so I only record their prices for a given day).
Once this is done, a process runs which computes the price changes for all of today's items, by retrieving each item's last two prices. These last two prices may not always be on consecutive days, because some items come and go.
The 3 tables that we're interested in and their indexes look like this:
CREATE TABLE public.main_item (
id uuid NOT NULL,
name character varying(512) NOT NULL
);
CREATE TABLE public.main_price (
id uuid NOT NULL,
per_item double precision NOT NULL,
item_id uuid NOT NULL,
retrieval_id uuid NOT NULL
);
CREATE TABLE public.main_retrieval (
id uuid NOT NULL,
"timestamp" timestamp with time zone NOT NULL,
);
CREATE INDEX name_index ON public.main_item USING btree (name);
CREATE INDEX timestamp_index ON public.main_retrieval USING btree ("timestamp");
I have a query which returns a result that looks like this where id is the Item id and latest_prices is a json which contains the last 2 prices for that item and then I process the results in python.
| id | latest_prices |
|---|---|
| 0003db22-3c8a-4f21-aea1-667361ebe377 | {"{"per_item": 2.44, "price_id": "24dc5524-35d5-472b-8f16-5840308a9cc4"}","{"per_item": 2.44, "price_id": "a415d740-0e50-43ba-b33f-3d6c9328a319"}"} |
| 0011cc73-07ca-415d-85e0-1c6782e0b041 | {"{"per_item": 3.48, "price_id": "e754cc25-9fb5-4e88-8689-55878e47f7dc"}","{"per_item": 3.48, "price_id": "553a6cf2-2c6e-421c-b7e0-c43d5c0cbf85"}"} |
However, the query is getting increasingly slow. My assumption is because of the size of the Price table which by now is ~16 million rows.
Running an EXPLAIN ANALYZE on that query I see that the majority of the time is spent in a Bitmap Heap Scan on the main_price table: https://explain.depesz.com/s/ZX78#stats
Below you can find the query that the django ORM has generated for today:
SELECT "main_item"."id", ARRAY(
SELECT JSONB_BUILD_OBJECT(('price_id')::text, U0."id", ('per_item')::text, U0."per_item") AS "json"
FROM "main_price" U0
INNER JOIN "main_retrieval" U2
ON (U0."retrieval_id" = U2."id")
WHERE (U0."item_id" = ("main_item"."id") AND U2."timestamp" < '2024-09-24 00:00:00+00:00')
ORDER BY U2."timestamp" DESC LIMIT 2
) AS "latest_prices"
FROM "main_item"
WHERE "main_item"."id" IN (
SELECT V1."item_id"
FROM "main_retrieval" V0
LEFT OUTER JOIN "main_price" V1
ON (V0."id" = V1."retrieval_id")
WHERE (
V0."timestamp" >= '2024-09-23 00:00:00+00:00'
AND V0."timestamp" < '2024-09-24 00:00:00+00:00'
AND NOT (
EXISTS(
SELECT 1 AS "a"
FROM "main_retrieval" U0
LEFT OUTER JOIN "main_price" U1
ON (U0."id" = U1."retrieval_id")
WHERE (U1."item_id" IS NULL AND U0."id" = (V0."id"))
LIMIT 1))))
Note
I'm more than happy to ditch the query generated by the ORM and write it by hand.
However, I'm struggling to find an optimal way to fetch the 2 most recent prices for each Item. Fetching the most recent is easy, but fetching the previous one is surprisingly difficult.