I currently have a multi-table, multi-column GROUP BY query that is pretty slow. Below are the schemas of the tables used by the query:
report Table
Table "report"
Column | Type | Modifiers | Storage | Stats target | Description
--------------------+--------------------------+---------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('dashboard_report_id_seq'::regclass) | plain | |
county | character varying(200) | not null | extended | |
district | character varying(200) | not null | extended | |
report_name | character varying(500) | not null | extended | |
Indexes:
"dashboard_report_pkey" PRIMARY KEY, btree (id)
question Table
Table "question"
Column | Type | Modifiers | Storage | Stats target | Description
-------------+------------------------+-----------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('dashboard_question_id_seq'::regclass) | plain | |
name | character varying(255) | not null | extended | |
label | character varying(255) | not null | extended | |
report_type | character varying(255) | not null | extended | |
Indexes:
"dashboard_question_pkey" PRIMARY KEY, btree (id)
questionanswer Table
Table "questionanswer"
Column | Type | Modifiers | Storage | Stats target | Description
-------------+------------------------+-----------------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('dashboard_questionanswer_id_seq'::regclass) | plain | |
answer | character varying(255) | not null | extended | |
question_id | integer | not null | plain | |
report_id | integer | not null | plain | |
Indexes:
"dashboard_questionanswer_pkey" PRIMARY KEY, btree (id)
"dashboard_questionanswer_6f78b20c" btree (report_id)
"dashboard_questionanswer_7aa0f6ee" btree (question_id)
Foreign-key constraints:
"dashboard_que_report_id_4c2a87ee585b6121_fk_dashboard_report_id" FOREIGN KEY (report_id) REFERENCES dashboard_report(id) DEFERRABLE INITIALLY DEFERRED
"dashboard_question_id_522f82097923c241_fk_dashboard_question_id" FOREIGN KEY (question_id) REFERENCES dashboard_question(id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no
Query:
SELECT
"report"."county",
"report"."district",
"report"."report_name",
"question"."report_type",
COUNT(DISTINCT "questionanswer"."report_id")
FROM "questionanswer"
INNER JOIN "question" ON ( "questionanswer"."question_id" = "question"."id" )
INNER JOIN "report" ON ( "questionanswer"."report_id" = "report"."id" )
WHERE
"question"."name" = 'touch'
GROUP BY
"report"."county",
"report"."district",
"report"."report_name",
"question"."report_type" ;
I'm not sure where to start with optimising this query; at the moment it is fairly slow on my system at ~300 ms. I did try individually indexing all the columns referenced by the GROUP BY clause, but that didn't improve the query time either.
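For reference, the individual indexes I tried were roughly the following (the index names are just placeholders I picked):

CREATE INDEX report_county_idx ON report (county);
CREATE INDEX report_district_idx ON report (district);
CREATE INDEX report_report_name_idx ON report (report_name);
CREATE INDEX question_report_type_idx ON question (report_type);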
The workflow is: a user submits a report, which is stored in the report table; each report has multiple question answers, which are stored in the questionanswer table; and the question metadata lives in the question table. The questionanswer table will have the largest number of records, followed by the report table, while the question table holds a fixed set of only 40 questions. The objective of the query is to figure out, per county, district and report type, how many reports have answered a specific question.
My PostgreSQL version is 9.3.5.
Query plan is here.
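For completeness, this is how I captured the timing and the plan (the BUFFERS option is optional; plain EXPLAIN ANALYZE produces the same plan shape):

EXPLAIN (ANALYZE, BUFFERS)
SELECT
    "report"."county",
    "report"."district",
    "report"."report_name",
    "question"."report_type",
    COUNT(DISTINCT "questionanswer"."report_id")
FROM "questionanswer"
INNER JOIN "question" ON ( "questionanswer"."question_id" = "question"."id" )
INNER JOIN "report" ON ( "questionanswer"."report_id" = "report"."id" )
WHERE "question"."name" = 'touch'
GROUP BY
    "report"."county",
    "report"."district",
    "report"."report_name",
    "question"."report_type";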
My machine is running Mac OS X 10.9.5 with a 4-core CPU, 8 GB RAM and an SSD with an encrypted file system.
Would indexing all the columns referenced in the GROUP BY clause help?
Or should I go for a multicolumn index?
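To make the second question concrete, this is the kind of thing I have in mind (untested; the index names are again just placeholders):

-- index to narrow the WHERE "question"."name" = 'touch' filter
CREATE INDEX question_name_idx ON question (name);

-- multicolumn index covering the join and aggregate columns on the big table
CREATE INDEX questionanswer_question_id_report_id_idx
    ON questionanswer (question_id, report_id);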