Aggregate the last two entries per item efficiently

Question

I'd like your help to optimise an increasingly slow query - or better yet, help me understand what is the problem and point me in the right direction.

Every day I scrape 3 supermarkets and I record their prices in a django app. Each scrape results in a new Retrieval record, a few thousand Price records and maybe in a few Item records (most items already exist, so I only record their prices for a given day).

Once this is done, a process runs which computes the price changes for all of today's items, by retrieving each item's last two prices. These last two prices may not always be on consecutive days, because some items come and go.

The 3 tables that we're interested in and their indexes look like this:

CREATE TABLE public.main_item (
    id uuid NOT NULL,
    name character varying(512) NOT NULL
);
CREATE TABLE public.main_price (
    id uuid NOT NULL,
    per_item double precision NOT NULL,
    item_id uuid NOT NULL,
    retrieval_id uuid NOT NULL
);
CREATE TABLE public.main_retrieval (
    id uuid NOT NULL,
    "timestamp" timestamp with time zone NOT NULL,
);
CREATE INDEX name_index ON public.main_item USING btree (name);
CREATE INDEX timestamp_index ON public.main_retrieval USING btree ("timestamp");

I have a query which returns a result that looks like this where id is the Item id and latest_prices is a json which contains the last 2 prices for that item and then I process the results in python.

id	latest_prices
0003db22-3c8a-4f21-aea1-667361ebe377	{"{"per_item": 2.44, "price_id": "24dc5524-35d5-472b-8f16-5840308a9cc4"}","{"per_item": 2.44, "price_id": "a415d740-0e50-43ba-b33f-3d6c9328a319"}"}
0011cc73-07ca-415d-85e0-1c6782e0b041	{"{"per_item": 3.48, "price_id": "e754cc25-9fb5-4e88-8689-55878e47f7dc"}","{"per_item": 3.48, "price_id": "553a6cf2-2c6e-421c-b7e0-c43d5c0cbf85"}"}

However, the query is getting increasingly slow. My assumption is because of the size of the Price table which by now is ~16 million rows.

Running an EXPLAIN ANALYZE on that query I see that the majority of the time is spent in a Bitmap Heap Scan on the main_price table: https://explain.depesz.com/s/ZX78#stats

Below you can find the query that the django ORM has generated for today:

SELECT "main_item"."id", ARRAY(
    SELECT JSONB_BUILD_OBJECT(('price_id')::text, U0."id", ('per_item')::text, U0."per_item") AS "json"
    FROM "main_price" U0
        INNER JOIN "main_retrieval" U2
        ON (U0."retrieval_id" = U2."id")
    WHERE (U0."item_id" = ("main_item"."id") AND U2."timestamp" < '2024-09-24 00:00:00+00:00')
    ORDER BY U2."timestamp" DESC LIMIT 2
    ) AS "latest_prices"
FROM &quot;main_item&quot;
WHERE &quot;main_item&quot;.&quot;id&quot; IN (
    SELECT V1.&quot;item_id&quot;
    FROM &quot;main_retrieval&quot; V0
        LEFT OUTER JOIN &quot;main_price&quot; V1
        ON (V0.&quot;id&quot; = V1.&quot;retrieval_id&quot;)
    WHERE (
        V0.&quot;timestamp&quot; &gt;= '2024-09-23 00:00:00+00:00'
        AND V0.&quot;timestamp&quot; &lt; '2024-09-24 00:00:00+00:00'
        AND NOT (
            EXISTS(
                SELECT 1 AS &quot;a&quot;
                FROM &quot;main_retrieval&quot; U0
                    LEFT OUTER JOIN &quot;main_price&quot; U1
                    ON (U0.&quot;id&quot; = U1.&quot;retrieval_id&quot;)
                WHERE (U1.&quot;item_id&quot; IS NULL AND U0.&quot;id&quot; = (V0.&quot;id&quot;))
                LIMIT 1))))

Note

I'm more than happy to ditch the query generated by the ORM and write it by hand.

However, I'm struggling to find an optimal way to fetch the 2 most recent prices for each Item. Fetching the most recent is easy, but fetching the previous one is surprisingly difficult.

score 1 · Accepted Answer · answered Sep 28 '24 at 15:17

You are using an index to find the rows of main_price which match item_id, but then within those rows are doing a slow filter on retrieval_id. So this should be sped up by a joint index

create index on main_price (item_id, retrieval_id);

I didn't test this, as it not feasible to test on an empty schema (All plans are good when the tables have no rows!), but I think it should work.

Erwin Brandstetter · Answer 2 · 2024-10-02T20:13:21.883

Basically, main_price implements the n:m relationship between items and retrievals. You should have indexes accordingly. See:

How to implement a many-to-many relationship in PostgreSQL?

As for ...

find an optimal way to fetch the 2 most recent prices for each Item.

Assuming a standard many-to-many implementation with referential integrity enforced, the following, simpler query is equivalent - and much faster:

SELECT p.item_id
     , ARRAY (
         SELECT jsonb_build_object('price_id', p1.id, 'per_item', p1.per_item) AS js
         FROM   main_price     p1
         JOIN   main_retrieval r1 ON r1.id = p1.retrieval_id
         WHERE  p1.item_id = p.item_id
         AND    r1.timestamp < '2024-09-24+0'
         ORDER  BY r1.timestamp DESC
         LIMIT  2
         ) AS latest_prices
FROM   main_retrieval r
JOIN   main_price     p ON r.id = p.retrieval_id
WHERE  r.timestamp >= '2024-09-23+0'
AND    r.timestamp <  '2024-09-24+0';

An array of jsonb objects is an odd result type. Don't you want a json(b) array of objects instead?

SELECT p.item_id, p2.latest_prices
FROM   main_retrieval r
JOIN   main_price     p ON r.id = p.retrieval_id
CROSS  JOIN LATERAL (
   SELECT jsonb_agg(p2.*) AS latest_prices
   FROM  (
      SELECT p1.id AS price_id, p1.per_item
      FROM   main_price     p1
      JOIN   main_retrieval r1 ON r1.id = p1.retrieval_id
      WHERE  p1.item_id = i.id
      AND    r1.timestamp < '2024-09-24+0'
      ORDER  BY r1.timestamp DESC
      LIMIT  2
      ) p2
   ) p2
WHERE  r.timestamp >= '2024-09-23+0'
AND    r.timestamp <  '2024-09-24+0';

See:

Build JSON object from one-to-many relationship data in a single query?

If it's just for display use json_agg() instead of jsonb_agg().

Either way, have both of these indexes on table main_price:

CREATE INDEX main_price_item_id_retrieval_id ON main_price (item_id, retrieval_id);
CREATE INDEX main_price_retrieval_id_item_id ON main_price (retrieval_id, item_id);

Plus the index on main_retrieval.timestamp you already have, and the PK index on main_retrieval.id you should have and just forgot to declare.

Actually, this set of indexes would be the optimum for the query:

CREATE INDEX main_price_item_id_retrieval_id ON main_price (item_id) INCLUDE (retrieval_id);
CREATE INDEX main_price_retrieval_id_item_id ON main_price (retrieval_id) INCLUDE (item_id);
CREATE INDEX main_retrieval_retrieval_id_item_id ON main_retrieval (timestamp DESC) INCLUDE (id);
CREATE INDEX main_retrieval_retrieval_id_item_id ON main_retrieval (id) INCLUDE (timestamp);

But that buys only little additional optimization over basic indexes while being less versatile for other queries.

I doubt you need uuid. Chances are, bigint (or just int for retrievals) would serve you better.

Don't use double precision for prices, which are exact entities. Use integer or numeric.

Aggregate the last two entries per item efficiently

Note

2 Answers2