
I have a website for showing matches of an RTS game.

One of the main use cases is to show recent matches of a player or a group of players.

PostgreSQL 14.13

Table structure:

create table public.match (
  match_id integer primary key not null,
  started timestamp(3) without time zone,
  ...
);
create index "IDX_e7b6cfca8139b9aa85880aab9e" on match using btree (started);

create table public.player (
  match_id integer not null,
  profile_id integer not null,
  slot smallint not null,
  ...,
  primary key (match_id, profile_id, slot),
  foreign key (match_id) references public.match (match_id) match simple on update cascade on delete restrict,
  foreign key (profile_id) references public.profile (profile_id) match simple on update cascade on delete restrict
);
create index idx_player_profile_match on player using btree (profile_id, match_id);
create index "IDX_58afd2c450f166eacbdf982841" on player using btree (match_id);
create index "IDX_ba3de28aa98207f3a21145feb8" on player using btree (profile_id);

The match table has 50 million records. The player table 200 million records.

There are about 50000 matches with 4 players on average added per day to the database.

The tables currently contain data for the last 3 years. The match table has relation_size 6 GB and total_relation_size 12 GB. The player table 15 GB and 37 GB.

Querying the DB for the most recent 50 matches of a player, or for all of a player's matches in the last month, takes almost a second. After the first execution it is faster because the data is cached, but that does not help me much, since different users are requested most of the time.

-- most recent 50 matches for a player
EXPLAIN ANALYSE
SELECT * FROM match m
JOIN player p on p.match_id = m.match_id
WHERE p.profile_id=582058 ORDER BY started desc LIMIT 50;

Limit  (cost=12075.34..12081.09 rows=50 width=107) (actual time=675.517..682.833 rows=50 loops=1)
  ->  Gather Merge  (cost=12075.34..12158.48 rows=723 width=107) (actual time=675.516..682.828 rows=50 loops=1)
        Workers Planned: 1
        Workers Launched: 1
        ->  Sort  (cost=11075.33..11077.14 rows=723 width=107) (actual time=663.726..663.731 rows=38 loops=2)
              Sort Key: m.started DESC
              Sort Method: top-N heapsort  Memory: 39kB
              Worker 0:  Sort Method: top-N heapsort  Memory: 37kB
              ->  Nested Loop  (cost=22.66..11051.31 rows=723 width=107) (actual time=3.286..661.017 rows=1882 loops=2)
                    ->  Parallel Bitmap Heap Scan on player p  (cost=22.09..4853.22 rows=723 width=40) (actual time=2.774..210.891 rows=1882 loops=2)
                          Recheck Cond: (profile_id = 582058)
                          Heap Blocks: exact=1855
                          ->  Bitmap Index Scan on "IDX_ba3de28aa98207f3a21145feb8"  (cost=0.00..21.79 rows=1229 width=0) (actual time=4.818..4.819 rows=3772 loops=1)
                                Index Cond: (profile_id = 582058)
                    ->  Index Scan using match_pkey on match m  (cost=0.56..8.57 rows=1 width=67) (actual time=0.237..0.237 rows=1 loops=3764)
                          Index Cond: (match_id = p.match_id)
Planning Time: 0.305 ms
Execution Time: 682.897 ms

-- matches in the last month for a player
EXPLAIN ANALYSE
SELECT * FROM match m
JOIN player p on p.match_id = m.match_id
WHERE m.started > '2024-09-01' AND p.profile_id=271202 ORDER BY started desc;

Gather Merge  (cost=12053.93..12057.61 rows=32 width=107) (actual time=797.710..797.881 rows=161 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  ->  Sort  (cost=11053.92..11054.00 rows=32 width=107) (actual time=782.572..782.578 rows=80 loops=2)
        Sort Key: m.started DESC
        Sort Method: quicksort  Memory: 36kB
        Worker 0:  Sort Method: quicksort  Memory: 35kB
        ->  Nested Loop  (cost=22.66..11053.12 rows=32 width=107) (actual time=171.842..781.756 rows=80 loops=2)
              ->  Parallel Bitmap Heap Scan on player p  (cost=22.09..4853.22 rows=723 width=40) (actual time=1.254..268.523 rows=1682 loops=2)
                    Recheck Cond: (profile_id = 271202)
                    Heap Blocks: exact=1721
                    ->  Bitmap Index Scan on "IDX_ba3de28aa98207f3a21145feb8"  (cost=0.00..21.79 rows=1229 width=0) (actual time=1.712..1.712 rows=3378 loops=1)
                          Index Cond: (profile_id = 271202)
              ->  Index Scan using match_pkey on match m  (cost=0.56..8.58 rows=1 width=67) (actual time=0.304..0.304 rows=0 loops=3365)
                    Index Cond: (match_id = p.match_id)
                    Filter: (started > '2024-09-01 00:00:00'::timestamp without time zone)
                    Rows Removed by Filter: 1
Planning Time: 0.446 ms
Execution Time: 797.931 ms

When executing this again the execution time drops to about 40ms because of caching.

As far as I understand, Postgres first finds all player records for the given profile_id, then joins and orders by started date, and only then applies the count/started-date filter. This means the query will become more and more inefficient as more matches are added to the database.

My idea is to add the started date of the match to the player table and create an index on profile_id, started.

Do you think this is a good solution? Are there other options?

I also thought about partitioning match/player table by started date but this seems to be a lot of effort.

Update using the answer from Erwin Brandstetter:

Thank you for your advice!

I created the index and filled the started column of the player table:

CREATE INDEX player_profile_id_started_match_id ON player (profile_id, started) INCLUDE (match_id);
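The column add and backfill are not shown above; a minimal sketch, assuming the new column is named started like in match (for 200 million rows, batching the UPDATE by match_id ranges would be advisable to limit lock time and WAL bursts):

```sql
-- hypothetical sketch: add and backfill the redundant started column
ALTER TABLE player ADD COLUMN started timestamp(3) without time zone;

UPDATE player p
SET    started = m.started
FROM   match m
WHERE  m.match_id = p.match_id;
```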

The recent matches query for one player is very fast now:

-- get recent 50 matches for one player
EXPLAIN ANALYSE
SELECT *
FROM  (
   SELECT p.match_id
   FROM   player p
   WHERE  p.profile_id=199325
   ORDER  BY p.started DESC
   LIMIT  50
   ) p
JOIN   match m USING (match_id);

Nested Loop  (cost=1.14..500.47 rows=50 width=67) (actual time=2.721..21.442 rows=50 loops=1)
  ->  Limit  (cost=0.57..70.85 rows=50 width=12) (actual time=2.702..14.494 rows=50 loops=1)
        ->  Index Only Scan Backward using player_profile_id_started_match_id on player p  (cost=0.57..1739.19 rows=1237 width=12) (actual time=2.701..14.480 rows=50 loops=1)
              Index Cond: (profile_id = 199325)
              Heap Fetches: 50
  ->  Index Scan using match_pkey on match m  (cost=0.56..8.58 rows=1 width=67) (actual time=0.138..0.138 rows=1 loops=50)
        Index Cond: (match_id = p.match_id)
Planning Time: 0.222 ms
Execution Time: 21.482 ms

I also have the use case of querying the recent matches for multiple players (One or more of the players need to be in the match).

-- get recent 50 matches from multiple players
EXPLAIN ANALYSE
SELECT *
FROM  (
   SELECT p.match_id
   FROM   player p
   WHERE  p.profile_id IN (1136191, 19455781)
   ORDER  BY p.started DESC
   LIMIT  50
   ) p
JOIN   match m USING (match_id);

Nested Loop  (cost=3557.08..3986.26 rows=50 width=67) (actual time=511.692..515.601 rows=50 loops=1)
  ->  Limit  (cost=3556.51..3556.64 rows=50 width=12) (actual time=511.658..511.671 rows=50 loops=1)
        ->  Sort  (cost=3556.51..3562.70 rows=2473 width=12) (actual time=511.657..511.663 rows=50 loops=1)
              Sort Key: p.started DESC
              Sort Method: top-N heapsort  Memory: 29kB
              ->  Index Only Scan using player_profile_id_started_match_id on player p  (cost=0.57..3474.36 rows=2473 width=12) (actual time=0.592..508.846 rows=5277 loops=1)
                    Index Cond: (profile_id = ANY ('{1136191,19455781}'::integer[]))
                    Heap Fetches: 2965
  ->  Index Scan using match_pkey on match m  (cost=0.56..8.58 rows=1 width=67) (actual time=0.078..0.078 rows=1 loops=50)
        Index Cond: (match_id = p.match_id)
Planning Time: 1.090 ms
Execution Time: 515.662 ms

In this example it was very slow at first. But after running VACUUM ANALYZE public.player; (which took 7 minutes, and which updates the visibility map so the index-only scan no longer needs its 2965 heap fetches), it runs fast (about 30 ms).

Note: I have not yet updated the foreign key and unique index because the started date will never change after the creation of match and player records.

Dennis
2 Answers


For the second query, an index on the match table on (match_id, started) should help (if the planner chooses to use it), because with it we read less data from the table that would be filtered away later.
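A minimal sketch of the suggested index (the name is made up):

```sql
-- hypothetical: lets the started filter be checked from the index entry,
-- avoiding a heap fetch for matches that fail the date condition
CREATE INDEX match_match_id_started_idx ON match (match_id, started);
```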

If that's not enough then implementing your idea (with additional column and index) will help much more for both queries.

alexius

My idea is to add the started date of the match to the player table and create an index on (profile_id, started).

Yes, that. But use type timestamp (not date) to match table match. Plus, add match_id to the index to allow index-only scans, possibly just as an INCLUDE column:

CREATE INDEX player_profile_id_started_match_id ON player (profile_id, started) INCLUDE (match_id);

This adds some unpleasant redundancy. You wouldn't want related started values in both tables to disagree. To enforce integrity you might expand the FK constraint to:

FOREIGN key (started, match_id) REFERENCES public.match (started, match_id)

... which in turn requires a UNIQUE constraint on public.match (started, match_id). I put started first, so the constraint index can then replace the index "IDX_e7b6cfca8139b9aa85880aab9e" on just (started) to avoid additional costs.
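Spelled out as DDL, this could look like the following sketch. The constraint names are assumptions; the actual name of the existing FK constraint to drop may differ (check \d player):

```sql
-- hypothetical sketch: enforce that player.started agrees with match.started
ALTER TABLE match
  ADD CONSTRAINT match_started_match_id_uni UNIQUE (started, match_id);

ALTER TABLE player
  DROP CONSTRAINT player_match_id_fkey,   -- actual name may differ
  ADD  CONSTRAINT player_started_match_id_fkey
       FOREIGN KEY (started, match_id) REFERENCES match (started, match_id)
       ON UPDATE CASCADE ON DELETE RESTRICT;
```

The unique constraint's backing index on (started, match_id) then also serves queries that previously used the plain index on (started).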

Then optimize your query a bit. LIMIT in a subquery before the join:

SELECT *  -- ???
FROM  (
   SELECT p.match_id  -- more ?
   FROM   player p
   WHERE  p.profile_id = 582058
   ORDER  BY p.started DESC
   LIMIT  50
   ) p
JOIN   match m USING (match_id);

Or:

SELECT *  -- ???
FROM  (
   SELECT p.match_id  -- more ?
   FROM   player p
   WHERE  p.profile_id = 582058
   AND    p.started >= '2024-09-01'  -- surely, you want >=
   ) p
JOIN   match m USING (match_id);

Assuming (profile_id, started) is unique. (The same player can't start multiple matches at the same time?) Else, add another column to ORDER BY to break ties.
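If ties are possible, a sketch of the tie-breaking inner query, using match_id as an arbitrary but stable second sort key:

```sql
SELECT p.match_id
FROM   player p
WHERE  p.profile_id = 582058
ORDER  BY p.started DESC, p.match_id DESC  -- match_id breaks ties deterministically
LIMIT  50;
```

Note that for this to remain a sort-free top-N index scan, match_id would have to be a key column of the index, i.e. (profile_id, started, match_id), rather than an INCLUDE column.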

Aside

Drop the indexes "IDX_ba3de28aa98207f3a21145feb8" and "IDX_58afd2c450f166eacbdf982841". Both barely add anything but write cost over the existing indexes idx_player_profile_match and the PK index. Same logic as for (started, match_id) vs. just (started) above.
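The cleanup itself is just:

```sql
DROP INDEX "IDX_ba3de28aa98207f3a21145feb8";  -- (profile_id)
DROP INDEX "IDX_58afd2c450f166eacbdf982841";  -- (match_id)
```

Lookups on profile_id are still served by the leading column of idx_player_profile_match, and lookups on match_id by the leading column of the primary key index on (match_id, profile_id, slot).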

Erwin Brandstetter