4

I have a table (t0), in my Postgres DB, with data that looks something like this:

t1_id  t2_id
1      1      
2      1
2      1
4      null
4      null
5      null

And I have a query to return my desired results of:

t1_id  t2_id
1      1
4      null
5      null

My query looks something like this:

(
  SELECT DISTINCT ON (t2_id) t1_id, t2_id
  FROM t0
  WHERE t2_id IS NOT NULL
)
UNION ALL
(
  SELECT DISTINCT ON (t1_id) t1_id, t2_id
  FROM t0
  WHERE t2_id IS NULL
)

Is there a faster way to do an operation like this? It's not too bad, but I'm doing it in several places (with joins) and all these repeated queries seems to slow stuff down a bit. Seems like there must be a better way.

Here's the query in fiddle form: http://sqlfiddle.com/#!15/d41d8/3603

Aaron Bertrand
  • 181,950
  • 28
  • 405
  • 624
Sam
  • 637
  • 2
  • 7
  • 12

2 Answers2

5

For the simple case, I can only think of minor improvements to the query:

(
  SELECT DISTINCT ON (t2_id)
         t1_id, t2_id
  FROM   t0
  WHERE  t2_id IS NOT NULL
  ORDER  BY t2_id, t1_id  -- to get consistent results
)
UNION ALL
(
  SELECT DISTINCT ON (t1_id)
         t1_id, NULL      -- cheaper
  FROM   t0
  WHERE  t2_id IS NULL
  -- if you retrieve more columns, add ORDER BY, too
)
  • As ypercube mentioned You need to add ORDER BY with an unambiguous list of expressions to get deterministic results.

  • You can use the constant NULL instead of t2_id in the second leg of the query. Also relevant for below index support.

  • The key to performance is indexing. Try two partial indexes matching the two parts of the query:

    CREATE INDEX t0_part1_idx ON t0 (t2_id, t1_id) WHERE t2_id IS NOT NULL;
    CREATE INDEX t0_part2_idx ON t0 (t1_id)        WHERE t2_id IS NULL;
    

    You may or may not want to include additional columns to allow index-only scans.

Depending on table size and data distribution, there may be faster alternatives:

Single SELECT

If you want to condense it into a single SELECT:

SELECT DISTINCT ON (coalesce(t2_id, t1_id), t2_id)
       t1_id, t2_id
FROM   t0
ORDER  BY coalesce(t2_id, t1_id), t2_id, t1_id;

Equivalent, except for sort order. If you want this to be fast, try a functional index:

CREATE INDEX t0_func_idx ON t0 (coalesce(t2_id, t1_id), t2_id, t1_id);
Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633
5

Give your example, this will do it:

select distinct on (coalesce(t2_id, t1_id)) t1_id, t2_id
from t0
order by coalesce(t2_id, t1_id);

It essentially says "do a distinct on t2_id, but if that is null use t1_id instead".