CREATE TABLE test_table
(
id uuid NOT NULL,
"RefId" uuid NOT NULL,
"timestampCol" timestamp without time zone NOT NULL,
"bigint1" bigint NOT NULL,
"bigint2" bigint NOT NULL,
"int1" integer NOT NULL,
"int2" integer NOT NULL,
"bigint3" bigint NOT NULL,
"bigint4" bigint NOT NULL,
"bigint5" bigint NOT NULL,
"hugeText" text NOT NULL,
"bigint6" bigint NOT NULL,
"bigint7" bigint NOT NULL,
"bigint8" bigint NOT NULL,
"denormalizedData" jsonb NOT NULL,
"textCol" text NOT NULL,
"smallText" text NOT NULL,
"createdAt" timestamp with time zone NOT NULL,
"updatedAt" timestamp with time zone NOT NULL,
CONSTRAINT test_pkey PRIMARY KEY (id)
);
SELECT "textCol", SUM("bigint1"), SUM("bigint2") -- etc, almost every single column gets aggregated
FROM "test_table"
WHERE "timestampCol" BETWEEN '2016-06-12' AND '2016-06-17'
GROUP BY "textCol"
ORDER BY SUM("bingint2"), SUM("bigint3") DESC -- the ORDER BY columns are dynamic, but there's only 4 possible combination of columns.
LIMIT 50;
Please correct me where my understanding is incorrect. In Postgres, I can leverage an index on either timestampCol or textCol, but never both at the same time? The query plans pasted below are only meant to show which algorithms Postgres picks; the real tables have a few million rows, not just ~66,000.
1. CREATE INDEX timestamp_col_index ON test_table USING btree ("timestampCol");

   An index (btree) on "timestampCol" means that the query planner will first slice the whole dataset down to only the rows between '2016-06-12' and '2016-06-17', then use a HashAggregate or a Sort + GroupAggregate to group the rows by "textCol".

   GroupAggregate  (cost=3925.50..4483.19 rows=22259 width=41) (actual time=80.764..125.342 rows=22663 loops=1)
     Group Key: "textCol"
     ->  Sort  (cost=3925.50..3981.45 rows=22380 width=41) (actual time=80.742..84.915 rows=22669 loops=1)
           Sort Key: "textCol"
           Sort Method: quicksort  Memory: 2540kB
           ->  Index Scan using timestamp_col_index on test_table  (cost=0.29..2308.56 rows=22380 width=41) (actual time=0.053..13.939 rows=22669 loops=1)
                 Index Cond: (("timestampCol" >= '2016-06-12 00:00:00'::timestamp without time zone) AND ("timestampCol" <= '2016-06-17 00:00:00'::timestamp without time zone))

2. CREATE INDEX text_col_index ON test_table USING btree ("textCol");

   An index (btree) on "textCol" means the query planner already has the rows "pre-grouped", but it has to traverse every single row in the index to filter out those that don't match timestampCol BETWEEN timestamp1 AND timestamp2.

   GroupAggregate  (cost=0.42..16753.91 rows=22259 width=41) (actual time=0.281..127.047 rows=22663 loops=1)
     Group Key: "textCol"
     ->  Index Scan using text_col_index on test_table  (cost=0.42..16252.18 rows=22380 width=41) (actual time=0.235..76.182 rows=22669 loops=1)
           Filter: (("timestampCol" >= '2016-06-12 00:00:00'::timestamp without time zone) AND ("timestampCol" <= '2016-06-17 00:00:00'::timestamp without time zone))
           Rows Removed by Filter: 43719

3. Creating both indexes means Postgres will run a cost analysis to decide which of 1. and 2. it thinks will be the fastest, but it will never leverage both indexes at the same time.
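   With both indexes in place, one way to compare the two candidate plans is to hide one index inside a transaction and run EXPLAIN ANALYZE against the other (a sketch for experimentation only; DROP INDEX takes an exclusive lock, so not something to run on a busy production table):

   BEGIN;
   DROP INDEX text_col_index;   -- temporarily hide the index on "textCol"
   EXPLAIN ANALYZE
   SELECT "textCol", SUM("bigint1"), SUM("bigint2")
   FROM test_table
   WHERE "timestampCol" BETWEEN '2016-06-12' AND '2016-06-17'
   GROUP BY "textCol"
   ORDER BY SUM("bigint2"), SUM("bigint3") DESC
   LIMIT 50;
   ROLLBACK;                    -- the rollback restores the index untouched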
4. Creating a multicolumn index will not help in any way. From my testing, Postgres will not change its query plan at all, whether the index is ("textCol", "timestampCol") or ("timestampCol", "textCol").
5. I've tried the btree_gin and btree_gist extensions, but I was never able to get the query planner to keep the rows "pre-grouped", or to get decent speed gains at the ~4,000,000-row scale compared to 1. and 2. Maybe I wasn't using them correctly? How would I structure those indexes and adapt my query to them?
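   Roughly the shape of what I attempted (a sketch only; index names are illustrative, and this may well be exactly the part I'm getting wrong):

   CREATE EXTENSION IF NOT EXISTS btree_gin;
   CREATE EXTENSION IF NOT EXISTS btree_gist;

   -- GIN over both columns; btree_gin supplies GIN operator classes for scalar types
   CREATE INDEX gin_text_ts_idx ON test_table USING gin ("textCol", "timestampCol");

   -- GiST equivalent via btree_gist
   CREATE INDEX gist_text_ts_idx ON test_table USING gist ("textCol", "timestampCol");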
Please let me know what I might be misunderstanding. How can I optimize such a query for a table containing a few million rows?
Important information about the structure of the data:
The timestamps used in BETWEEN are 99% of the time similar to "last 2 weeks" or "last month". In some cases, the BETWEEN will end up selecting up to 99% of the rows, but very rarely will it select 100% of them.
The textCol column can be incredibly varied or incredibly regular. In some cases, out of, let's say, 3 million rows there'll be 2.9 million unique textCol values. In other cases, for the same number of rows, there'll be only 30,000-100,000 unique textCol values.
I'm using Postgres 9.4, but upgrading to 9.5 is feasible as long as the performance gains can justify it.