I have two tables in Postgres 9.4.1, events and event_refs, with the following schemas:
events table
CREATE TABLE events (
    id serial NOT NULL PRIMARY KEY,
    event_type text NOT NULL,
    event_path jsonb,
    event_data jsonb,
    created_at timestamp with time zone NOT NULL
);
-- Index on type and created time
CREATE INDEX events_event_type_created_at_idx
ON events (event_type, created_at);
event_refs table
CREATE TABLE event_refs (
    event_id integer NOT NULL,
    reference_key text NOT NULL,
    reference_value text NOT NULL,
    CONSTRAINT event_refs_pkey PRIMARY KEY (event_id, reference_key, reference_value),
    CONSTRAINT event_refs_event_id_fkey FOREIGN KEY (event_id)
        REFERENCES events (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
);
Both tables hold 2M rows. This is the query I'm running:
SELECT
    EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at))) AS funnel_time
FROM
    events
INNER JOIN
    event_refs
ON
    event_refs.event_id = events.id AND
    event_refs.reference_key = 'project'
WHERE
    events.event_type = 'event1' OR
    events.event_type = 'event2' AND
    events.created_at >= '2015-07-01 00:00:00+08:00' AND
    events.created_at < '2015-12-01 00:00:00+08:00'
GROUP BY event_refs.reference_value
HAVING COUNT(*) > 1
I am aware of the operator precedence in the WHERE clause; the date range is only supposed to filter events of type 'event2'.
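Since AND binds tighter than OR, the predicate is equivalent to the following explicitly parenthesized form (which matches the Recheck Cond in the plan below):

WHERE
    events.event_type = 'event1' OR
    (
        events.event_type = 'event2' AND
        events.created_at >= '2015-07-01 00:00:00+08:00' AND
        events.created_at < '2015-12-01 00:00:00+08:00'
    )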
This is the EXPLAIN ANALYZE output:
GroupAggregate  (cost=116503.86..120940.20 rows=147878 width=14) (actual time=3970.530..4163.041 rows=53532 loops=1)
  Group Key: event_refs.reference_value
  Filter: (count(*) > 1)
  Rows Removed by Filter: 41315
  ->  Sort  (cost=116503.86..116873.56 rows=147878 width=14) (actual time=3970.509..4105.316 rows=153766 loops=1)
        Sort Key: event_refs.reference_value
        Sort Method: external merge  Disk: 3904kB
        ->  Hash Join  (cost=24302.26..101275.04 rows=147878 width=14) (actual time=101.667..1394.281 rows=153766 loops=1)
              Hash Cond: (event_refs.event_id = events.id)
              ->  Seq Scan on event_refs  (cost=0.00..37739.00 rows=2000000 width=10) (actual time=0.007..368.661 rows=2000000 loops=1)
                    Filter: (reference_key = 'project'::text)
              ->  Hash  (cost=21730.79..21730.79 rows=147878 width=12) (actual time=101.524..101.524 rows=153766 loops=1)
                    Buckets: 16384  Batches: 2  Memory Usage: 3315kB
                    ->  Bitmap Heap Scan on events  (cost=3761.23..21730.79 rows=147878 width=12) (actual time=23.139..75.814 rows=153766 loops=1)
                          Recheck Cond: ((event_type = 'event1'::text) OR ((event_type = 'event2'::text) AND (created_at >= '2015-07-01 04:00:00+12'::timestamp with time zone) AND (created_at < '2015-12-01 05:00:00+13'::timestamp with time zone)))
                          Heap Blocks: exact=14911
                          ->  BitmapOr  (cost=3761.23..3761.23 rows=150328 width=0) (actual time=21.210..21.210 rows=0 loops=1)
                                ->  Bitmap Index Scan on events_event_type_created_at_idx  (cost=0.00..2349.42 rows=102533 width=0) (actual time=12.234..12.234 rows=99864 loops=1)
                                      Index Cond: (event_type = 'event1'::text)
                                ->  Bitmap Index Scan on events_event_type_created_at_idx  (cost=0.00..1337.87 rows=47795 width=0) (actual time=8.975..8.975 rows=53902 loops=1)
                                      Index Cond: ((event_type = 'event2'::text) AND (created_at >= '2015-07-01 04:00:00+12'::timestamp with time zone) AND (created_at < '2015-12-01 05:00:00+13'::timestamp with time zone))
Planning time: 0.493 ms
Execution time: 4178.517 ms
I am aware that the filter on the event_refs table scan is not filtering anything; that is an artifact of my test data, and different reference keys will be added later.
Everything under and including the Hash Join seems reasonable given my test data, but I'm wondering whether it's possible to speed up the sort introduced by the GROUP BY clause.
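One thing I noticed is that the sort spills to disk (Sort Method: external merge  Disk: 3904kB). I'm guessing that raising work_mem for the session would at least keep it in memory, something like the following (16MB is just a guess on my part):

-- The default work_mem is 4MB, and the spilled sort was only ~4MB on disk,
-- so a modest bump should be enough to avoid the external merge
SET work_mem = '16MB';

But I'm not sure whether that addresses the underlying problem or just papers over it.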
I've tried adding a b-tree index to the reference_value column (roughly the statement shown below), but the planner doesn't seem to use it. If I'm not mistaken (and I very well could be, so please tell me), it's sorting 153,766 rows. Wouldn't an index be beneficial to this sorting process?
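For reference, this is the index I added (the name is my own choice):

CREATE INDEX event_refs_reference_value_idx
    ON event_refs (reference_value);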