How do I make PostgreSQL include all the months where I did not buy anything?

Question

I have this query:

SELECT avg(sum)
FROM
(
    SELECT date_trunc('month', "timestamp"), sum(amount) as sum
    FROM bookkeeping WHERE summary LIKE 'Bitcoin purchase:%'
    GROUP BY date_trunc('month', "timestamp")
) a

It gave me an impossible average cost of my Bitcoin purchases per month. I soon found out why: it ignores all the numerous months over the years where I did not buy Bitcoin at all, including only months when I did buy at least once. Since there aren't a ton of "empty" months included (with 0 for sum), it does the average of the sums for only the months where I bought at least once, so the average becomes completely wrong.

How do I make it "invent" rows for the months which aren't represented by at least one record, so that the avg(sum) will actually be meaningful?

Vérace · Answer 1 · 2021-01-30T11:04:25.760

You might like to try this (see the fiddle here):

CREATE TABLE cal (dc) AS 
SELECT GENERATE_SERIES
       (
           (DATE '2020-01-01'),
           (DATE '2020-12-31'),
           interval '1 MONTH'
       )::DATE;
SELECT * FROM cal;

Result:

Having done this, we now look at our bitcoin purchases:

CREATE TABLE bp  -- bitcoin_purchases
(
  bp_date TIMESTAMP NOT NULL,
  bp_count SMALLINT NOT NULL,
  coin_price DOUBLE PRECISION NOT NULL CHECK (coin_price > 0),
  bp_total DOUBLE PRECISION GENERATED ALWAYS AS (bp_count * coin_price) STORED
);

Note the use of a GENERATED column here - it's not absolutely necessary but helps make subsequent SQL easier.

Populate it:

INSERT INTO bp VALUES
('2020-01-05 10:00:00', '10',  '500'),
('2020-01-25 10:00:00', '20', '1000'),
('2020-03-05 10:00:00', '10', '1500'),
('2020-03-05 10:00:00', '10', '2000'),
('2020-05-17 10:00:00', '10', '2500'),
('2020-05-19 10:00:00', '10', '3000'),
('2020-07-23 10:00:00', '10', '500.45'),
('2020-07-27 10:00:00', '10', '500.45'),
('2020-09-30 10:00:00', '10', '500.45');

Then, you might want to do something like this:

SELECT 
  c.dc, DATE_TRUNC('MONTH', c.dc), bp.*, DATE_TRUNC('MONTH', bp_date) 
FROM cal c
LEFT JOIN bp
  ON c.dc = DATE_TRUNC('MONTH', bp.bp_date);

Result:

dc  date_trunc  bp_date bp_count    coin_price  bp_total    date_trunc
2020-01-01  2020-01-01 00:00:00+00  2020-01-05 10:00:00 10  500 5000    2020-01-01 00:00:00
2020-01-01  2020-01-01 00:00:00+00  2020-01-25 10:00:00 20  1000    20000   2020-01-01 00:00:00
2020-02-01  2020-02-01 00:00:00+00                  
2020-03-01  2020-03-01 00:00:00+00  2020-03-05 10:00:00 10  1500    15000   2020-03-01 00:00:00
2020-03-01  2020-03-01 00:00:00+00  2020-03-05 10:00:00 10  2000    20000   2020-03-01 00:00:00
2020-04-01  2020-04-01 00:00:00+01                  
2020-05-01  2020-05-01 00:00:00+01  2020-05-17 10:00:00 10  2500    25000   2020-05-01 00:00:00
2020-05-01  2020-05-01 00:00:00+01  2020-05-19 10:00:00 10  3000    30000   2020-05-01 00:00:00
2020-06-01  2020-06-01 00:00:00+01                  
2020-07-01  2020-07-01 00:00:00+01  2020-07-23 10:00:00 10  500.45  5004.5  2020-07-01 00:00:00
2020-07-01  2020-07-01 00:00:00+01  2020-07-27 10:00:00 10  500.45  5004.5  2020-07-01 00:00:00
2020-08-01  2020-08-01 00:00:00+01                  
2020-09-01  2020-09-01 00:00:00+01  2020-09-30 10:00:00 10  500.45  5004.5  2020-09-01 00:00:00
2020-10-01  2020-10-01 00:00:00+01                  
2020-11-01  2020-11-01 00:00:00+00                  
2020-12-01  2020-12-01 00:00:00+00                  
16 rows

Now, the DATE_TRUNC function isn't the one you want here - you want EXTRACT(field FROM source) - you have those large TIMESTAMP fields - whereas you could use the easier INTEGER(^*) from EXTRACT - easier on the CPU...

SELECT 
  EXTRACT('MONTH'FROM c.dc) AS "Month:",
  COALESCE(SUM(bp_total), 0) "Amount/mth",
  COALESCE(SUM(bp_count), 0) AS "No. of coins", 
  COALESCE(ROUND(SUM(bp_total::NUMERIC)/SUM(bp_count), 2), 0) AS "Avg price/mth"
FROM cal c
LEFT JOIN bp
  ON EXTRACT('MONTH' FROM c.dc) = EXTRACT('MONTH' FROM bp.bp_date)
GROUP BY EXTRACT('MONTH' FROM c.dc)
ORDER BY EXTRACT('MONTH' FROM c.dc);

(^*) it's not actually an INT but for the purposes of this dicussion, it's easier to deal with than the results of DATE_TRUNC.

Result (better viewed on fiddle):

Month:  Amount/mth  No. of coins    Avg price/mth
1   25000   30  833.33
2   0   0   0
3   35000   20  1750.00
4   0   0   0
5   55000   20  2750.00
6   0   0   0
7   10009   20  500.45
8   0   0   0
9   5004.5  10  500.45
10  0   0   0
11  0   0   0
12  0   0   0

Note the user of the COALESCE function which puts 0s in place of NULLs - now you can use this in a sub-SELECT and not have to worry about 3-valued logic which can trip people up!

You can also make use of window functions as follows:

SELECT 
  bp.bp_date, bp.coin_price AS "price",
  SUM(bp_count) OVER (PARTITION BY EXTRACT('MONTH' FROM bp.bp_date)) AS "No./mth",
  SUM(bp_count) OVER () AS "No. coints"
FROM bp
ORDER BY bp.bp_date;

Result:

bp_date price   No./mth No. coints
2020-01-05 10:00:00 500 50  120
2020-01-25 10:00:00 1000    50  120
2020-01-28 10:00:00 1250    50  120
2020-03-05 10:00:00 1500    20  120
2020-03-05 10:00:00 2000    20  120
2020-05-17 10:00:00 2500    20  120
2020-05-19 10:00:00 3000    20  120
2020-07-23 10:00:00 500.45  20  120
2020-07-27 10:00:00 500.45  20  120
2020-09-30 10:00:00 500.45  10  120
10 rows

See fiddle here. Window functions are very powerful - see here for an introduction and here for more detail.

p.s. welcome to the forum!

How do I make PostgreSQL include all the months where I did not buy anything?

1 Answers1