SUM over distinct rows with multiple joins

Question

Schema:

CREATE TABLE "items" (
  "id"            SERIAL                   NOT NULL PRIMARY KEY,
  "country"       VARCHAR(2)               NOT NULL,
  "created"       TIMESTAMP WITH TIME ZONE NOT NULL,
  "price"         NUMERIC(11, 2)           NOT NULL
);
CREATE TABLE "payments" (
  "id"      SERIAL                   NOT NULL PRIMARY KEY,
  "created" TIMESTAMP WITH TIME ZONE NOT NULL,
  "amount"  NUMERIC(11, 2)           NOT NULL,
  "item_id" INTEGER                  NULL
);
CREATE TABLE "extras" (
  "id"      SERIAL                   NOT NULL PRIMARY KEY,
  "created" TIMESTAMP WITH TIME ZONE NOT NULL,
  "amount"  NUMERIC(11, 2)           NOT NULL,
  "item_id" INTEGER                  NULL
);

Data:

INSERT INTO items VALUES
  (1, 'CZ', '2016-11-01', 100),
  (2, 'CZ', '2016-11-02', 100),
  (3, 'PL', '2016-11-03', 20),
  (4, 'CZ', '2016-11-04', 150)
;
INSERT INTO payments VALUES
  (1, '2016-11-01', 60, 1),
  (2, '2016-11-01', 60, 1),
  (3, '2016-11-02', 100, 2),
  (4, '2016-11-03', 25, 3),
  (5, '2016-11-04', 150, 4)
;
INSERT INTO extras VALUES
  (1, '2016-11-01', 5, 1),
  (2, '2016-11-02', 1, 2),
  (3, '2016-11-03', 2, 3),
  (4, '2016-11-03', 3, 3),
  (5, '2016-11-04', 5, 4)
;

So, we have:

3 items in CZ in 1 in PL
370 earned in CZ and 25 in PL
350 cost in CZ and 20 in PL
11 extra earned in CZ and 5 extra earned in PL

Now I want to get answers for the following questions:

How many items we had last month in every country?
What was the total earned amount (sum of payments.amounts) in every country?
What was the total cost (sum of items.price) in every country?
What was the total extra earnings (sum of extras.amount) in every country?

With the following query (SQLFiddle):

SELECT
  country                  AS "group_by",
  COUNT(DISTINCT items.id) AS "item_count",
  SUM(items.price)         AS "cost",
  SUM(payments.amount)     AS "earned",
  SUM(extras.amount)       AS "extra_earned"
FROM items
  LEFT OUTER JOIN payments ON (items.id = payments.item_id)
  LEFT OUTER JOIN extras ON (items.id = extras.item_id)
GROUP BY 1;

The results are wrong:

 group_by | item_count |  cost  | earned | extra_earned
----------+------------+--------+--------+--------------
 CZ       |          3 | 450.00 | 370.00 |        16.00
 PL       |          1 |  40.00 |  50.00 |         5.00

Cost and extra_earned for CZ are invalid - 450 instead of 350 and 16 instead of 11. Cost and earned for PL are also invalid - they are doubled.

I understand, that in case of LEFT OUTER JOIN there will be 2 rows for item with items.id = 1 (and so on for other matches), but I don't know how to build a proper query.

Questions:

How to avoid wrong results in aggregation in queries on multiple tables?
What is the best way to calculate sum over distinct values (items.id in that case)?

PostgreSQL version: 9.6.1

Erwin Brandstetter · Accepted Answer · 2021-09-10T01:36:15.450

Since there can be multiple payments and multiple extras per item, you run into a "proxy cross join" between those two tables. Aggregate rows per item_id before joining to item and it should all be correct:

SELECT i.country         AS group_by
     , COUNT(*)          AS item_count
     , SUM(i.price)      AS cost
     , SUM(p.sum_amount) AS earned
     , SUM(e.sum_amount) AS extra_earned
FROM  items i
LEFT  JOIN (
   SELECT item_id, SUM(amount) AS sum_amount
   FROM   payments
   GROUP  BY 1
   ) p ON p.item_id = i.id
LEFT  JOIN (
   SELECT item_id, SUM(amount) AS sum_amount
   FROM   extras
   GROUP  BY 1
   ) e ON e.item_id = i.id
GROUP BY 1;

Consider the "fishmarket" example:

Two SQL LEFT JOINS produce incorrect result

To be precise, SUM(i.price) would be incorrect after joining to a single n-table, which multiplies each price by the number of related rows. Doing it twice just makes it worse - and also potentially computationally expensive.

Oh, and since we do not multiply rows in items now, we can just use the cheaper count(*) instead of count(DISTINCT i.id). (id being NOT NULL PRIMARY KEY.)

db<>fiddle here
_{Old sqlfiddle}

But if I want to filter by `items.created`?

Addressing your comment.

It depends. Can we apply the same filter to payments.created and extras.created?

If yes, then just add the filters in the subqueries as well. (Does not seem likely in this case.)

If no, but we are still selecting most items, the above query would still be most efficient. Some of the aggregations in the subqueries are eliminated in the joins, but that's still cheaper than more complex queries.

If no, and we are selecting a small fraction of items, I suggest correlated subqueries or LATERAL joins. Examples:

Multiple array_agg() calls in a single query

SUM over distinct rows with multiple joins

1 Answers1

But if I want to filter by `items.created`?

Linked

SUM over distinct rows with multiple joins

1 Answers1

But if I want to filter by items.created?

Linked

But if I want to filter by `items.created`?