
I am looking to remove older data from a table that consists of 889 million rows.

I have a script, but I am trying to make it more robust, as I will be removing approximately 418 million rows of data.

I am running Postgres 9.6, with the following table and FK constraints:

          Column          |           Type           | Collation | Nullable |                Default
--------------------------+--------------------------+-----------+----------+---------------------------------------
 game_id                  | integer                  |           | not null | nextval('game_game_id_seq'::regclass)
 game_id                  | integer                  |           | not null | 
 session_id               | integer                  |           |          | 
 game_created_on          | timestamp with time zone |           | not null | 
 currency_code            | character(3)             |           | not null | 
 game_cash_staked         | numeric(12,2)            |           |          | 
 game_cash_won            | numeric(12,2)            |           |          | 
 game_bonus_staked        | numeric(12,2)            |           |          | 
 game_bonus_won           | numeric(12,2)            |           |          | 
 game_created_by_id       | integer                  |           | not null | 
 game_remote_ref          | character varying(50)    |           |          | 
 game_description         | text                     |           |          | 
 game_batch_id            | integer                  |           |          | 
 game_rejection_code_id   | integer                  |           |          | 
 game_rejection_message   | character varying(255)   |           |          | 
 game_transfer_remote_ref | character varying(128)   |           |          |

Indexes: "game_pkey" PRIMARY KEY, btree (game_id) "idx_game_created_on_rejection_code" btree (game_created_on) WHERE game_rejection_code_id IS NULL "idx_game_game_created_on" btree (game_created_on) "idx_game_session_id" btree (session_id) "game_idx_01" btree (game_remote_ref) "game_idx_game_id" btree (game_id) Foreign-key constraints: "ref_game_to_currency" FOREIGN KEY (currency_code) REFERENCES currency(currency_code) "ref_game_to_game" FOREIGN KEY (game_id) REFERENCES game(game_id) "ref_game_to_game_rejection_code" FOREIGN KEY (game_rejection_code_id) REFERENCES game_rejection_code(game_re jection_code_id)

The script I have looked to use:

begin;
CREATE TABLE gamearchived AS
SELECT t.*
FROM  game t
where t.game_created_on < NOW() - interval '1 year'; -- this grabs stuff older than 1 year
delete from game t
where  t.game_id in (select gamearchived.game_id from gamearchived);
select count(*) from gamearchived;
COMMIT;

I am wondering whether this is the safest way to remove the old data from the main table, or whether it would be better to do it in batches. Also, the table I will be removing data from has indexes and foreign key constraints; would it be better to drop the indexes first, prior to deleting, and then add them back in once done? The amount of data removed is approximately 450 million rows.

I need to keep the old data, so it has to remain accessible. Any advice is much appreciated.

rdbmsNoob

1 Answer


The first step would be to upgrade to a current version. Postgres 9.6 reaches EOL on November 11, 2021, in three weeks. Postgres 13 or 14 are much faster with big data. Dropping and recreating all indexes has an additional benefit in this case: you tap into the index deduplication feature of Postgres 13 or later, which shrinks indexes with duplicative data to a fraction of their size.
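
To quantify that effect, you could compare index sizes before and after the rebuild. A minimal check, using one of the indexes listed in your question:

SELECT pg_size_pretty(pg_relation_size('idx_game_game_created_on'));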

Assuming no concurrent access.

would it be better to drop the indexes first prior to deleting and then adding them back in once done.

Because you delete half the table, typically yes. The added benefit is that recreated indexes are in pristine condition without bloat.
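
For illustration, a sketch of what that drop / recreate DDL could look like for the secondary indexes listed in your question (the primary key can stay in place; dropping it would additionally require dropping the self-referencing FK "ref_game_to_game" and any other incoming FKs first):

-- drop secondary indexes before the big DELETE ...
DROP INDEX idx_game_created_on_rejection_code;
DROP INDEX idx_game_game_created_on;
DROP INDEX idx_game_session_id;
DROP INDEX game_idx_01;
DROP INDEX game_idx_game_id;

-- ... and recreate them afterwards (definitions taken from the \d output above)
CREATE INDEX idx_game_created_on_rejection_code ON game (game_created_on) WHERE game_rejection_code_id IS NULL;
CREATE INDEX idx_game_game_created_on ON game (game_created_on);
CREATE INDEX idx_game_session_id ON game (session_id);
CREATE INDEX game_idx_01 ON game (game_remote_ref);
CREATE INDEX game_idx_game_id ON game (game_id);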

This will be much faster:

BEGIN;
CREATE TABLE public.gamearchived (LIKE public.game);

-- DROP all indexes on table game here (be sure to remember the DDL!)

WITH del AS (
   DELETE FROM game
   WHERE  game_created_on < NOW() - interval '1 year'  -- older than 1 year
   RETURNING *
   )
, ins AS (
   INSERT INTO public.gamearchived
   SELECT *
   FROM   del
   ORDER  BY game_created_on  -- optional, only if it helps future queries
   )
SELECT count(*) FROM del;  -- get your count

-- run checks if you are not sure; last chance.

COMMIT;

If anything goes wrong, the transaction is rolled back. So this is safe.

Creating the new table within the same transaction as the big insert saves a lot of overhead: with wal_level = minimal, no WAL needs to be written for it.
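
A quick way to check whether that optimization applies to your instance:

SHOW wal_level;  -- must return 'minimal' for the WAL skip to apply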

The main difference: this only needs a single sequential scan on the big table. Your original does a lot of additional (pointless) work.

Also, we don't need any indexes for the big DELETE. It's cheaper to drop them before and recreate them after, than to incrementally keep them up to date.

I would at least run this afterwards:

VACUUM ANALYZE game;
VACUUM ANALYZE gamearchived;

To free up space (and while there is no concurrent access), even:

VACUUM FULL ANALYZE game;

(No point in running VACUUM FULL on gamearchived.)

Ideally, recreate all indexes on table game now. Also, create any constraints or indexes (PK?) you want on table gamearchived. This doesn't have to happen in the same transaction. The important bit is to do it after the big insert, which is much cheaper and typically produces better results (indexes balanced, without bloat).
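
A possible sketch for the archive table, assuming game_id should stay unique there and that archived rows will mostly be queried by date (both assumptions; the names are hypothetical, adapt to your access patterns):

ALTER TABLE gamearchived ADD CONSTRAINT gamearchived_pkey PRIMARY KEY (game_id);  -- hypothetical PK
CREATE INDEX gamearchived_created_on_idx ON gamearchived (game_created_on);       -- hypothetical index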

Outgoing FK constraints in table game don't matter, they don't impose any additional work for DELETE. (Incoming FK constraints would matter, as Postgres would have to check for possible referencing rows in linked tables.)
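
To double-check whether any incoming FK constraints reference game, a generic catalog query (not specific to your schema) would be:

SELECT conname, conrelid::regclass AS referencing_table
FROM   pg_constraint
WHERE  contype = 'f'
AND    confrelid = 'public.game'::regclass;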

Related:

Basics in this related chapter of the manual: Populating a Database

Erwin Brandstetter