I have two tables: table1 with 61 million rows and table2 with 59 million rows. The columns in both tables are identical in name and type. Both were imported from backup files.
I want to merge these two tables into a single table, keeping only the unique records.
For example, in table1 I have the following records:
NAME CODE ...
Name1 001 ...
Name2 002 ...
Name3 003 ...
Name4 004 ...
And in table2:
NAME CODE ...
Name1 001 ...
Name2 002 ...
Name5 005 ...
Name2 002 ...
The result table should be something like this:
NAME CODE ...
Name1 001 ...
Name2 002 ...
Name3 003 ...
Name4 004 ...
Name5 005 ...
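Conceptually, a UNION (without ALL) seems to be what I want, since it removes duplicate rows across both inputs. Here is a sketch with the example columns (merged is just a placeholder name):

CREATE TABLE merged AS  -- "merged" is a placeholder name
SELECT name, code       -- every column that defines uniqueness
FROM   table1
UNION                   -- UNION, unlike UNION ALL, keeps only distinct rows
SELECT name, code
FROM   table2;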
Edit to provide more info as requested by @Erwin Brandstetter:
table1 takes up 15 GB. I am still inserting records into table2, but when I finish it will have almost the same size. I have more than 300 GB of free space (and an external HD if needed).
Total RAM is 8 GB.
No indexes (I can create them if needed).
No concurrent access (it is accessed only by me, on a local machine).
I can create a new, third table if needed.
Here is the full table definition:
CREATE TABLE inss
(
    nb          double precision,
    nome        character varying(200),
    nasc        double precision,
    cpf         double precision,
    especie     double precision,
    dib         double precision,
    valor       double precision,
    banco_pagt  double precision,
    ag_banco    double precision,
    orgao_pag   double precision,
    aps         double precision,
    meio_pagto  double precision,
    banco_empr  double precision,
    contrato    character varying(200),
    vl_empres   double precision,
    comp_ini_d  double precision,
    parcelas    double precision,
    vl_parcela  double precision,
    tipo_empre  double precision,
    endereco    character varying(200),
    bairro      character varying(200),
    municipio   character varying(200),
    uf          character varying(2),
    cep         double precision,
    sit_empres  double precision,
    dt_averb    double precision,
    dt_exc      double precision,
    id          double precision
);
The id column is filled with a row number, so it could be used to run the operation in batches if the full operation turns out to be too slow.
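For example, each batch could cover a window of id values (a sketch only; the 5,000,000-row window and the merged_table name are assumptions on my part):

-- Hypothetical batching pattern: append an id window to whatever the final
-- merge statement turns out to be, then loop over the windows.
INSERT INTO merged_table           -- "merged_table" is a placeholder name
SELECT *
FROM   table2
WHERE  id >  0                     -- lower bound of this batch (exclusive)
AND    id <= 5000000;              -- upper bound of this batch (inclusive)
-- Repeat with (5000000, 10000000], and so on, up to the maximum id in table2.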
All columns except the id column should be considered for the unique check.
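Putting it together, this is roughly what I picture for the real tables (a sketch; inss_merged is a placeholder name, and it relies on SELECT DISTINCT treating two NULLs in the same column as equal, so rows containing NULLs still collapse into one copy):

-- Combine both tables, then keep one copy of each distinct row
-- over every column except id:
CREATE TABLE inss_merged AS
SELECT DISTINCT
       nb, nome, nasc, cpf, especie, dib, valor, banco_pagt,
       ag_banco, orgao_pag, aps, meio_pagto, banco_empr, contrato,
       vl_empres, comp_ini_d, parcelas, vl_parcela, tipo_empre,
       endereco, bairro, municipio, uf, cep, sit_empres, dt_averb, dt_exc
FROM  (
    SELECT * FROM table1
    UNION ALL
    SELECT * FROM table2
) AS both_tables;

-- Re-create id as a fresh row number for the merged table:
ALTER TABLE inss_merged ADD COLUMN id bigserial;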