I have two tables representing a directed graph: nodes with about 60 million rows and edges with about 500 million rows (both growing fast).
CREATE TABLE edges (
    node_from bigint NOT NULL,
    node_to   bigint NOT NULL,
    prop      bigint NOT NULL
);

CREATE INDEX edges_prop_idx   ON edges (prop);
CREATE INDEX from_prop_idx    ON edges (node_from, prop);
CREATE INDEX from_to_prop_idx ON edges (node_from, node_to, prop);
and
CREATE TABLE nodes (
    name text,
    type text NOT NULL,
    id   bigint NOT NULL,
    CONSTRAINT nodes_pkey PRIMARY KEY (id)
);

CREATE INDEX name_idx          ON nodes ("substring"(name, 1, 2700));
CREATE INDEX trgm_gin_name_idx ON nodes USING gin (name gin_trgm_ops);
I have the following query:
SELECT nodes.id, nodes.name, nodes.type
FROM nodes
LEFT JOIN edges B1 ON B1.node_to = nodes.id
LEFT JOIN edges A1 ON A1.node_to = B1.node_from
LEFT JOIN edges C1 ON C1.node_from = B1.node_to
WHERE B1.prop = (SELECT id FROM nodes
                 WHERE substring(name, 1, 2700) = 'name'
                   AND type = 'property')::bigint  -- property
  AND A1.prop = (SELECT id FROM nodes
                 WHERE substring(name, 1, 2700) = '...'
                   AND type = 'property')::bigint  -- property
  AND C1.prop = 2
  AND C1.node_to = ANY ((SELECT array(SELECT id FROM nodes
                                      WHERE substring(name, 1, 2700) = 'de'))::bigint[])
LIMIT 50;
This query is fast as it stands (returns in <= 200 ms), but it produces a lot of duplicate rows. I would like to eliminate those duplicates and thought about adding DISTINCT ON (nodes.id) to the query. Unfortunately, that turns it into a query that never finishes. I'm using PostgreSQL 9.4 (beta).
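For reference, the DISTINCT ON variant I tried only changes the select list; everything else stays the same:

```sql
-- Same query as above, only the select list changes:
SELECT DISTINCT ON (nodes.id) nodes.id, nodes.name, nodes.type
FROM nodes
-- ... joins, WHERE clause, and LIMIT 50 unchanged ...
```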
Query plans
Without DISTINCT:
"Limit (cost=39.55..111.62 rows=50 width=33)"
" InitPlan 1 (returns $0)"
" -> Index Scan using name_idx on nodes nodes_1 (cost=0.57..12.61 rows=1 width=8)"
" Index Cond: ("substring"(name, 1, 2700) = 'name'::text)"
" Filter: (type = 'property'::text)"
" InitPlan 2 (returns $1)"
" -> Index Scan using name_idx on nodes nodes_2 (cost=0.57..12.61 rows=1 width=8)"
" Index Cond: ("substring"(name, 1, 2700) = '...'::text)"
" Filter: (type = 'property'::text)"
" InitPlan 4 (returns $3)"
" -> Result (cost=12.60..12.61 rows=1 width=0)"
" InitPlan 3 (returns $2)"
" -> Index Scan using name_idx on nodes nodes_3 (cost=0.57..12.60 rows=2 width=8)"
" Index Cond: ("substring"(name, 1, 2700) = 'de'::text)"
" -> Nested Loop (cost=1.72..53798189845.30 rows=37322005381 width=33)"
" Join Filter: (b1.node_to = c1.node_from)"
" -> Nested Loop (cost=1.14..8489574975.42 rows=1038114946 width=41)"
" -> Nested Loop (cost=0.57..128899706.72 rows=1038114946 width=8)"
" -> Seq Scan on edges a1 (cost=0.00..6963386.60 rows=1724439 width=8)"
" Filter: (prop = $1)"
" -> Index Scan using from_prop_idx on edges b1 (cost=0.57..70.51 rows=20 width=16)"
" Index Cond: ((node_from = a1.node_to) AND (prop = $0))"
" -> Index Scan using nodes_pkey on nodes (cost=0.56..8.04 rows=1 width=33)"
" Index Cond: (id = b1.node_to)"
" -> Index Only Scan using from_to_prop_idx on edges c1 (cost=0.58..43.63 rows=1 width=8)"
" Index Cond: ((node_from = nodes.id) AND (node_to = ANY ($3)) AND (prop = 2))"
"Planning time: 1.790 ms"
With DISTINCT:
"Limit (cost=7608487867.87..7608488177.20 rows=50 width=33)"
" InitPlan 1 (returns $0)"
" -> Index Scan using name_idx on nodes nodes_1 (cost=0.57..12.61 rows=1 width=8)"
" Index Cond: ("substring"(name, 1, 2700) = 'name'::text)"
" Filter: (type = 'property'::text)"
" InitPlan 2 (returns $1)"
" -> Index Scan using name_idx on nodes nodes_2 (cost=0.57..12.61 rows=1 width=8)"
" Index Cond: ("substring"(name, 1, 2700) = '...'::text)"
" Filter: (type = 'property'::text)"
" InitPlan 4 (returns $3)"
" -> Result (cost=12.60..12.61 rows=1 width=0)"
" InitPlan 3 (returns $2)"
" -> Index Scan using name_idx on nodes nodes_3 (cost=0.57..12.60 rows=2 width=8)"
" Index Cond: ("substring"(name, 1, 2700) = 'de'::text)"
" -> Unique (cost=7608487830.05..7981707883.86 rows=60328104 width=33)"
" -> Sort (cost=7608487830.05..7701792843.50 rows=37322005381 width=33)"
" Sort Key: nodes.id, nodes.name, nodes.type"
" -> Hash Join (cost=15672379.12..34349332.83 rows=37322005381 width=33)"
" Hash Cond: (b1.node_to = nodes.id)"
" -> Merge Join (cost=5419844.59..20186145.96 rows=1038114946 width=8)"
" Merge Cond: (b1.node_from = a1.node_to)"
" -> Sort (cost=2506073.10..2510384.20 rows=1724439 width=16)"
" Sort Key: b1.node_from"
" -> Bitmap Heap Scan on edges b1 (cost=32280.97..2327441.09 rows=1724439 width=16)"
" Recheck Cond: (prop = $0)"
" -> Bitmap Index Scan on edges_prop_idx (cost=0.00..31849.86 rows=1724439 width=0)"
" Index Cond: (prop = $0)"
" -> Sort (cost=2506073.10..2510384.20 rows=1724439 width=8)"
" Sort Key: a1.node_to"
" -> Bitmap Heap Scan on edges a1 (cost=32280.97..2327441.09 rows=1724439 width=8)"
" Recheck Cond: (prop = $1)"
" -> Bitmap Index Scan on edges_prop_idx (cost=0.00..31849.86 rows=1724439 width=0)"
" Index Cond: (prop = $1)"
" -> Hash (cost=10251247.23..10251247.23 rows=102984 width=41)"
" -> Nested Loop (cost=3001857.14..10251247.23 rows=102984 width=41)"
" -> Bitmap Heap Scan on edges c1 (cost=3001856.57..9416003.75 rows=102984 width=8)"
" Recheck Cond: (prop = 2)"
" Filter: (node_to = ANY ($3))"
" -> Bitmap Index Scan on edges_prop_idx (cost=0.00..3001830.82 rows=162545367 width=0)"
" Index Cond: (prop = 2)"
" -> Index Scan using nodes_pkey on nodes (cost=0.56..8.10 rows=1 width=33)"
" Index Cond: (id = c1.node_from)"
"Planning time: 1.102 ms"
Any ideas on how I could optimize my query (with DISTINCT) to make it reasonably fast?
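One direction I have considered, but not yet benchmarked, is rewriting the joins as EXISTS semi-joins, so that each node is emitted at most once and no DISTINCT is needed at all (the WHERE conditions already make the LEFT JOINs behave like inner joins, so this should be equivalent). Roughly:

```sql
SELECT nodes.id, nodes.name, nodes.type
FROM nodes
WHERE EXISTS (
    SELECT 1
    FROM edges B1
    WHERE B1.node_to = nodes.id
      AND B1.prop = (SELECT id FROM nodes
                     WHERE substring(name, 1, 2700) = 'name'
                       AND type = 'property')
      AND EXISTS (
          SELECT 1
          FROM edges A1
          WHERE A1.node_to = B1.node_from
            AND A1.prop = (SELECT id FROM nodes
                           WHERE substring(name, 1, 2700) = '...'
                             AND type = 'property')
      )
)
AND EXISTS (
    SELECT 1
    FROM edges C1
    WHERE C1.node_from = nodes.id   -- B1.node_to = nodes.id, so this is equivalent
      AND C1.prop = 2
      AND C1.node_to = ANY ((SELECT array(SELECT id FROM nodes
                                          WHERE substring(name, 1, 2700) = 'de'))::bigint[])
)
LIMIT 50;
```

Would this kind of semi-join rewrite be expected to plan better than sorting billions of join rows for a Unique node, or is there a better approach?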