
I have two tables representing a directed graph: nodes contains about 60 million rows and edges about 500 million rows (both growing fast).

CREATE TABLE edges (
  node_from bigint NOT NULL,
  node_to bigint NOT NULL,
  prop bigint NOT NULL
);

CREATE INDEX edges_prop_idx   ON edges (prop);
CREATE INDEX from_prop_idx    ON edges (node_from, prop);
CREATE INDEX from_to_prop_idx ON edges (node_from, node_to, prop);

and

CREATE TABLE nodes (
  name text,
  type text NOT NULL,
  id bigint NOT NULL,
  CONSTRAINT nodes_pkey PRIMARY KEY (id)
);

CREATE INDEX name_idx          ON nodes ("substring"(name, 1, 2700));
CREATE INDEX trgm_gin_name_idx ON nodes USING gin (name gin_trgm_ops);

I have the following query:

SELECT nodes.id, nodes.name, nodes.type
FROM nodes
LEFT JOIN edges B1 ON 
    B1.node_to = nodes.id
LEFT JOIN edges A1 ON
    A1.node_to = B1.node_from
LEFT JOIN edges C1 ON
    C1.node_from = B1.node_to
WHERE
    B1.prop = (SELECT id FROM nodes WHERE substring(name, 1, 2700) = 'name' AND type = 'property')::bigint -- Property
    AND A1.prop = (SELECT id FROM nodes WHERE substring(name, 1, 2700) = '...' AND type = 'property')::bigint -- Property
    AND C1.prop = 2
    AND C1.node_to = ANY ((SELECT array(SELECT id FROM nodes WHERE substring(name, 1, 2700) = 'de'))::bigint[])
LIMIT 50

This query works so far (it returns in <= 200 ms) but returns a lot of duplicates. I would love to eliminate those duplicates and thought about adding a DISTINCT ON (nodes.id) to the query. Unfortunately, that leads to a never-ending query. I'm using PostgreSQL 9.4 (beta).
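For reference, the DISTINCT ON variant only changes the SELECT list; the joins and WHERE clause stay exactly as above:

```sql
SELECT DISTINCT ON (nodes.id)
       nodes.id, nodes.name, nodes.type
FROM   nodes
LEFT   JOIN edges B1 ON B1.node_to = nodes.id
-- ... same remaining joins and WHERE clause as above ...
LIMIT  50;
```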

Query plans

Without DISTINCT:

"Limit  (cost=39.55..111.62 rows=50 width=33)"
"  InitPlan 1 (returns $0)"
"    ->  Index Scan using name_idx on nodes nodes_1  (cost=0.57..12.61 rows=1 width=8)"
"          Index Cond: ("substring"(name, 1, 2700) = 'name'::text)"
"          Filter: (type = 'property'::text)"
"  InitPlan 2 (returns $1)"
"    ->  Index Scan using name_idx on nodes nodes_2  (cost=0.57..12.61 rows=1 width=8)"
"          Index Cond: ("substring"(name, 1, 2700) = '...'::text)"
"          Filter: (type = 'property'::text)"
"  InitPlan 4 (returns $3)"
"    ->  Result  (cost=12.60..12.61 rows=1 width=0)"
"          InitPlan 3 (returns $2)"
"            ->  Index Scan using name_idx on nodes nodes_3  (cost=0.57..12.60 rows=2 width=8)"
"                  Index Cond: ("substring"(name, 1, 2700) = 'de'::text)"
"  ->  Nested Loop  (cost=1.72..53798189845.30 rows=37322005381 width=33)"
"        Join Filter: (b1.node_to = c1.node_from)"
"        ->  Nested Loop  (cost=1.14..8489574975.42 rows=1038114946 width=41)"
"              ->  Nested Loop  (cost=0.57..128899706.72 rows=1038114946 width=8)"
"                    ->  Seq Scan on edges a1  (cost=0.00..6963386.60 rows=1724439 width=8)"
"                          Filter: (prop = $1)"
"                    ->  Index Scan using from_prop_idx on edges b1  (cost=0.57..70.51 rows=20 width=16)"
"                          Index Cond: ((node_from = a1.node_to) AND (prop = $0))"
"              ->  Index Scan using nodes_pkey on nodes  (cost=0.56..8.04 rows=1 width=33)"
"                    Index Cond: (id = b1.node_to)"
"        ->  Index Only Scan using from_to_prop_idx on edges c1  (cost=0.58..43.63 rows=1 width=8)"
"              Index Cond: ((node_from = nodes.id) AND (node_to = ANY ($3)) AND (prop = 2))"
"Planning time: 1.790 ms"

With DISTINCT:

"Limit  (cost=7608487867.87..7608488177.20 rows=50 width=33)"
"  InitPlan 1 (returns $0)"
"    ->  Index Scan using name_idx on nodes nodes_1  (cost=0.57..12.61 rows=1 width=8)"
"          Index Cond: ("substring"(name, 1, 2700) = 'name'::text)"
"          Filter: (type = 'property'::text)"
"  InitPlan 2 (returns $1)"
"    ->  Index Scan using name_idx on nodes nodes_2  (cost=0.57..12.61 rows=1 width=8)"
"          Index Cond: ("substring"(name, 1, 2700) = '...'::text)"
"          Filter: (type = 'property'::text)"
"  InitPlan 4 (returns $3)"
"    ->  Result  (cost=12.60..12.61 rows=1 width=0)"
"          InitPlan 3 (returns $2)"
"            ->  Index Scan using name_idx on nodes nodes_3  (cost=0.57..12.60 rows=2 width=8)"
"                  Index Cond: ("substring"(name, 1, 2700) = 'de'::text)"
"  ->  Unique  (cost=7608487830.05..7981707883.86 rows=60328104 width=33)"
"        ->  Sort  (cost=7608487830.05..7701792843.50 rows=37322005381 width=33)"
"              Sort Key: nodes.id, nodes.name, nodes.type"
"              ->  Hash Join  (cost=15672379.12..34349332.83 rows=37322005381 width=33)"
"                    Hash Cond: (b1.node_to = nodes.id)"
"                    ->  Merge Join  (cost=5419844.59..20186145.96 rows=1038114946 width=8)"
"                          Merge Cond: (b1.node_from = a1.node_to)"
"                          ->  Sort  (cost=2506073.10..2510384.20 rows=1724439 width=16)"
"                                Sort Key: b1.node_from"
"                                ->  Bitmap Heap Scan on edges b1  (cost=32280.97..2327441.09 rows=1724439 width=16)"
"                                      Recheck Cond: (prop = $0)"
"                                      ->  Bitmap Index Scan on edges_prop_idx  (cost=0.00..31849.86 rows=1724439 width=0)"
"                                            Index Cond: (prop = $0)"
"                          ->  Sort  (cost=2506073.10..2510384.20 rows=1724439 width=8)"
"                                Sort Key: a1.node_to"
"                                ->  Bitmap Heap Scan on edges a1  (cost=32280.97..2327441.09 rows=1724439 width=8)"
"                                      Recheck Cond: (prop = $1)"
"                                      ->  Bitmap Index Scan on edges_prop_idx  (cost=0.00..31849.86 rows=1724439 width=0)"
"                                            Index Cond: (prop = $1)"
"                    ->  Hash  (cost=10251247.23..10251247.23 rows=102984 width=41)"
"                          ->  Nested Loop  (cost=3001857.14..10251247.23 rows=102984 width=41)"
"                                ->  Bitmap Heap Scan on edges c1  (cost=3001856.57..9416003.75 rows=102984 width=8)"
"                                      Recheck Cond: (prop = 2)"
"                                      Filter: (node_to = ANY ($3))"
"                                      ->  Bitmap Index Scan on edges_prop_idx  (cost=0.00..3001830.82 rows=162545367 width=0)"
"                                            Index Cond: (prop = 2)"
"                                ->  Index Scan using nodes_pkey on nodes  (cost=0.56..8.10 rows=1 width=33)"
"                                      Index Cond: (id = c1.node_from)"
"Planning time: 1.102 ms"

Any ideas on how I could optimize my query (with DISTINCT) to make it reasonably fast?

Erwin Brandstetter
fls0815

1 Answer


Table layout

  • It would be more efficient to reverse the column order in nodes: fixed-length NOT NULL columns first. This is only a tiny optimization; it comes first here because the physical table layout does.

    CREATE TABLE nodes(
       id bigint PRIMARY KEY
     , type text NOT NULL
     , name text
    );
    

Index

  • Replace substring(name,1,2700) with left(name, 2700) everywhere, it's slightly faster and shorter. Better yet, don't use this at all. Consider the next point.

  • An index on 2700 characters of text is very inefficient. Only about three index entries fit on a data page - or the index cannot be built at all: a b-tree entry must fit into roughly a third of a data page, and with many multi-byte characters the actual byte size can be considerably bigger than 2700, which raises an "index row size exceeds maximum" exception.

    Even if you don't run into that exception, the index is still going to be huge and relatively slow. If all you do are equality checks with the = operator, use a functional index on a hash value instead.
    And I would append the id to it to allow index-only scans:

    CREATE INDEX name_md5_idx ON nodes (md5(name), id);
    

    And adapt your queries to it.

  • I don't see a PRIMARY KEY on edges. Every table should have one.
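A sketch of one option, assuming (and that is an assumption about your data) the combination (node_from, node_to, prop) is unique - it would then also make from_to_prop_idx redundant:

```sql
-- assumes (node_from, node_to, prop) is unique in edges
ALTER TABLE edges ADD CONSTRAINT edges_pkey
  PRIMARY KEY (node_from, node_to, prop);
```

If duplicate edges are legitimate, add a surrogate key column instead.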

  • I don't see an index on edges with node_to as the leading column. It may or may not be needed, depending on your typical access patterns.
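If reverse traversal (looking up edges by node_to) turns out to matter, it might look like this (index name and column order are my suggestion, not something given in the question):

```sql
CREATE INDEX to_from_prop_idx ON edges (node_to, node_from, prop);
```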

Query

Your LEFT JOIN clauses are voided by the later WHERE conditions on the left-joined tables, which force them to act as inner joins anyway. Use [INNER] JOIN instead. And adapt to the suggested functional index:

Step 1

SELECT n.id, n.name, n.type
FROM   nodes n
JOIN   edges B1 ON B1.node_to = n.id
JOIN   edges A1 ON A1.node_to = B1.node_from
JOIN   edges C1 ON C1.node_from = B1.node_to
WHERE  B1.prop = (SELECT id FROM nodes WHERE md5(name) = md5('name') AND type = 'property')
AND    A1.prop = (SELECT id FROM nodes WHERE md5(name) = md5('...')  AND type = 'property')
AND    C1.prop = 2
AND    C1.node_to = ANY (ARRAY(SELECT id FROM nodes WHERE md5(name) = md5('de')))
LIMIT  50;

Step 2

Untangle:

SELECT n.id, n.name, n.type
FROM   nodes an
JOIN   edges a  ON a.prop = an.id
JOIN   edges b  ON b.node_from = a.node_to
JOIN   nodes bn ON bn.id = b.prop
JOIN   edges c  ON c.node_from = b.node_to
JOIN   nodes cn ON cn.id = c.node_to
JOIN   nodes n  ON n.id = c.node_from
WHERE  md5(an.name) = md5('...')
AND    an.type = 'property'
AND    md5(bn.name) = md5('name')
AND    bn.type = 'property'
AND    c.prop = 2
AND    md5(cn.name) = md5('de')
LIMIT  50;

In particular, replace the sub-selects with plain joins, thereby getting rid of the intermediary array (which is probably expensive).

If your nodes table grows fast enough that collisions in the md5 key space seem possible, add exact predicates to recheck (not expensive):

AND    an.name = '...'
AND    bn.name = 'name'
AND    cn.name = 'de'
Erwin Brandstetter