37

I have two tables in a PostgreSQL 9.3 database: Table link_reply has a foreign key named which_group pointing to table link_group.

I want to delete all rows from link_group where no related row in link_reply exists. Sounds basic enough but I've been struggling with it.

Will it be something simple like this (not working)?

DELETE FROM link_group WHERE link_reply = NULL;
Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633
Hassan Baig
  • 2,079
  • 8
  • 31
  • 44

2 Answers2

46

Quoting the manual:

There are two ways to delete rows in a table using information contained in other tables in the database: using sub-selects, or specifying additional tables in the USING clause. Which technique is more appropriate depends on the specific circumstances.

Bold emphasis mine. Using information that is not contained in another table is a tad bit tricky, but there are easy solutions. From the arsenal of standard techniques to ...

... a NOT EXISTS anti-semi-join is probably simplest and most efficient for DELETE:

DELETE FROM link_group lg
WHERE  NOT EXISTS (
   SELECT FROM link_reply lr
   WHERE  lr.which_group = lg.link_group_id
   );

Assuming (since table definitions are not provided) link_group_id as column name for the primary key of link_group.

The technique @Mihai commented works as well (applied correctly):

DELETE FROM link_group lg
USING  link_group      lg1
LEFT   JOIN link_reply lr ON lr.which_group = lg1.link_group_id
WHERE  lg1.link_group_id = lg.link_group_id
AND    lr.which_group IS NULL;

But since the table expression in the USING clause is joined to the target table (lg in the example) with a CROSS JOIN, you need another instance of the same table as stepping stone (lg1 in the example) for the LEFT JOIN, which is less elegant and typically slower.

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633
1

A further comment on how to delete rows from table a not referenced in table b.

So the difficulty in Postgres's DELETE FROM ... is that the syntax does not allow us to put a "JOIN" directly inside "FROM" clause. Otherwise any solution that works for SELECT would work here too. One way around this obstacle is to use a CTE:

-- My solution
WITH cte AS (
    SELECT a."Id"
    FROM a
    LEFT JOIN b ON a."Id" = b."Id"
)   WHERE b."Id" IS NULL
)
DELETE FROM a
USING cte
WHERE a."Id" = cte."Id"

Another way is to use a fictitious self-join, as suggested in @Mihai's solution. Perhaps unsurprisingly, this produces exactly the same execution plan (scroll down), but the query itself is much nicer and shorter:

-- Mihai's solution
DELETE FROM a AS a1
USING a AS a2
LEFT JOIN b ON a2.id = b.id
WHERE (a1.id = a2.id) AND (b.id IS NULL);

Interestingly, Erwin Brandstetter's solution

-- Erwin Brandstetter's solution
DELETE FROM a
WHERE NOT EXISTS (
   SELECT FROM b
   WHERE b."Id" = a."Id"
)

produces a different execution plan.


I ran explain analyze a couple of times for my case, where

  • both tables a and b have a btree index on primary key "Id";
  • out of 12M rows, 8K will be deleted.

Observations:

  • the difference is not great, execution times are very similar, with subquery-based solution being slightly ahead;
  • the join-based solutions (1, 2) have a "Nested Loop" node which causes an additional index scan on a;
  • the subquery-based solution (3) does not use any indexes (it does not have that extra scan)
  • in about 3 out of 10 times, join-based solutions randomly took 50% longer and I couldn't get the subquery-based solution to stumble in a similar way, maybe a coincidence (public AWS RDS, shared hardware)

Plan and timing for the join-based solutions (1, 2):

 Delete on a1  (cost=425213.99..1044473.87 rows=0 width=0) (actual time=25359.152..25360.192 rows=0 loops=1)
   ->  Nested Loop  (cost=425213.99..1044473.87 rows=1 width=18) (actual time=7128.534..24946.717 rows=8003 loops=1)
         Output: a1.ctid, a2.ctid, b.ctid
         Inner Unique: true
         ->  Hash Anti Join  (cost=425213.56..1044466.81 rows=1 width=16) (actual time=7127.232..24852.555 rows=8003 loops=1)
               Output: a2.ctid, a2."Id", b.ctid
               Hash Cond: (a2."Id" = b."Id")
               ->  Seq Scan on a2  (cost=0.00..203587.61 rows=12421261 width=10) (actual time=0.439..5483.911 rows=12421040 loops=1)
                     Output: a2.ctid, a2."Id"
               ->  Hash  (cost=206399.36..206399.36 rows=12587936 width=10) (actual time=6917.431..6917.433 rows=12587936 loops=1)
                     Output: b.ctid, b."Id"
                     Buckets: 262144  Batches: 128  Memory Usage: 6282kB
                     ->  Seq Scan on b  (cost=0.00..206399.36 rows=12587936 width=10) (actual time=0.980..3881.084 rows=12587936 loops=1)
                           Output: b.ctid, b."Id"
         ->  Index Scan using "a_pkey" on a1  (cost=0.43..7.06 rows=1 width=10) (actual time=0.009..0.009 rows=1 loops=8003)
               Output: a1.ctid, a1."Id"
               Index Cond: (a1."Id" = a2."Id")
 Planning Time: 4.300 ms
 Execution Time: 25362.914 ms

Plan and timing for the subquery-based solution (3):

 Delete on a  (cost=425213.56..1044466.81 rows=0 width=0) (actual time=23464.673..23464.678 rows=0 loops=1)
   ->  Hash Anti Join  (cost=425213.56..1044466.81 rows=1 width=12) (actual time=8080.284..23387.870 rows=8003 loops=1)
         Output: a.ctid, b.ctid
         Hash Cond: (a."Id" = b."Id")
         ->  Seq Scan on a  (cost=0.00..203587.61 rows=12421261 width=10) (actual time=0.009..4519.007 rows=12421040 loops=1)
               Output: a.ctid, a."Id"
         ->  Hash  (cost=206399.36..206399.36 rows=12587936 width=10) (actual time=7712.061..7712.062 rows=12587936 loops=1)
               Output: b.ctid, b."Id"
               Buckets: 262144  Batches: 128  Memory Usage: 6282kB
               ->  Seq Scan on  b  (cost=0.00..206399.36 rows=12587936 width=10) (actual time=0.004..4209.453 rows=12587936 loops=1)
                     Output: b.ctid, b."Id"
 Planning Time: 0.187 ms
 Execution Time: 23464.731 ms