I have a Postgres table containing the following three columns:
- origin (INT, B-tree index)
- destination (INT, B-tree index)
- minutes (SMALLINT, B-tree index)
The table has ~1.4 billion rows. Each row stores the travel time in minutes from one place (origin) to another (destination).
The table is effectively read-only; we might need to change it, but that will happen once a year or so.
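For reference, a minimal DDL sketch of the setup as described (the table name and the origin/minutes index names are placeholders; only idx_destination appears in the plan further down):

-- Sketch of the schema; "travel_times", "idx_origin" and "idx_minutes"
-- are placeholder names, not taken from the real database.
CREATE TABLE travel_times (
    origin      INT      NOT NULL,
    destination INT      NOT NULL,
    minutes     SMALLINT NOT NULL
);
CREATE INDEX idx_origin      ON travel_times (origin);
CREATE INDEX idx_destination ON travel_times (destination);
CREATE INDEX idx_minutes     ON travel_times (minutes);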
I need to do queries like:
SELECT min(minutes)
FROM table
WHERE origin IN([long list])
AND destination IN([long list])
GROUP BY origin
Performance is really bad: depending on the number of origins and destinations, the query takes more than 10 minutes. We need it under 3 minutes, ideally under 10 seconds. The DBMS is PostgreSQL 9.4 with 16 GB of RAM.
Is there a way to improve this? Partitioning the table, optimizing indexes, etc.? Or another DBMS?
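For example, would a multicolumn index covering all three columns help, so the query could be answered by an index-only scan without touching the heap at all? A sketch, using the placeholder table name from above:

-- Hypothetical covering index: the two filtered columns first,
-- then minutes, so the whole query can be served from the index alone.
CREATE INDEX idx_dest_origin_minutes ON travel_times (destination, origin, minutes);

Since the table is effectively read-only, the visibility map should stay all-visible after a VACUUM, which is what makes index-only scans viable in the first place.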
Edit 2018-11-28:
- Added an index on (origin, destination); see the sketch below.
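What I ran was essentially this (again with the placeholder table name; the index name is also a placeholder):

-- The new composite index mentioned above.
CREATE INDEX idx_origin_destination ON travel_times (origin, destination);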
EXPLAIN ANALYSE
SELECT min(minutes), origin
FROM table
WHERE destination = ANY('{100225471, 100236548, 10263073, 10827564, 7435519, 100272388, 10688802, 10823750, 10853634, 10681223, 100213867, 100234761, 100113775, 100229234, 100234067, 100235418, 100229220, 1000053957, 1000059198, 1000028857, 1000057809, 1000058848, 1000059188, 1000057802}')
GROUP BY origin
"HashAggregate (cost=228043.99..228044.04 rows=5 width=6) (actual time=10118.146..10126.308 rows=50022 loops=1)"
" Group Key: origin"
" -> Index Scan using idx_destination on table (cost=0.60..226716.69 rows=265460 width=6) (actual time=0.022..10021.269 rows=208174 loops=1)"
" Index Cond: (destination = ANY ('{100225471,100236548,10263073,10827564,7435519,100272388,10688802,10823750,10853634,10681223,100213867,100234761,100113775,100229234,100234067,100235418,100229220,1000053957,1000059198,1000028857,1000057809,1000058848,1000 (...)"
"Planning time: 0.306 ms"
"Execution time: 10127.949 ms"