I am trying to optimize a text search that involves two large tables, persons and emails, and currently uses ILIKE operators. I have those two indexes:
CREATE INDEX idx_persons_full_name ON persons
USING gin (full_name gin_trgm_ops, to_iso_date(date_created) gin_trgm_ops);
CREATE INDEX idx_emails_gin_trgm_idx ON emails USING gin (email gin_trgm_ops);
This query uses the GIN index:
explain analyze select p.id, e.id from persons p left join emails e on p.email_id=e.id
where email ilike '%john%';
-> Bitmap Heap Scan on emails e (cost=97.80..1000.32 rows=1265 width=4) (actual time=0.884..2.662 rows=89 loops=1)
Recheck Cond: (email ~~* '%john%'::text)
Heap Blocks: exact=81
-> Bitmap Index Scan on idx_emails_gin_trgm_idx (cost=0.00..97.49 rows=1265 width=0) (actual time=0.829..0.829 rows=89 loops=1)
Index Cond: (email ~~* '%john%'::text)
... however when I add a second ilike clause I end up with a sequential scan on emails:
explain analyze select p.id, e.id from persons p left join emails e on p.email_id=e.id
where p.full_name ilike '%john%' or e.email ilike '%john%';
Hash Left Join (cost=2254.88..4527.57 rows=1598 width=8) (actual time=30.053..97.757 rows=138 loops=1)
Hash Cond: (p.email_id = e.id)
Filter: ((p.full_name ~~* '%john%'::text) OR (e.email ~~* '%john%'::text))
Rows Removed by Filter: 78569
-> Seq Scan on persons p (cost=0.00..2066.07 rows=78707 width=23) (actual time=0.016..4.569 rows=78707 loops=1)
-> Hash (cost=1472.17..1472.17 rows=62617 width=27) (actual time=29.364..29.364 rows=62459 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 4148kB
-> Seq Scan on emails e (cost=0.00..1472.17 rows=62617 width=27) (actual time=0.049..13.182 rows=62459 loops=1)
I can think a a few ways to address the issue, what would you recommend?
- Changing something about the existing queries or indexes to get the indexes to be used
- bringing the e.email into the persons table (using triggers, or in a separate materialized view) and using a single gin index on that denormalized table
- Replacing the ILIKEs with trigram searches
The final query looks like this, with pagination on date_created & id
SELECT
p.id AS id, p.is_staff AS is_staff, p.first_name AS first_name, p.last_name AS last_name, p.email_id AS email_id, p.address_id AS address_id, p.is_instructor AS is_instructor, p.date_created AS date_created, p.birthdate AS birthdate, p.phone AS phone, p.full_name AS full_name, p.external_id AS external_id, p.venue_id AS venue_id, p.ssn AS ssn, p.gender AS gender, e.id AS id, e.email AS email, e.confirmed AS confirmed, e.owner_id AS owner_id, e.account_id AS account_id
FROM
persons p
LEFT JOIN emails e ON p.email_id = e.id
WHERE
p.date_archived IS NULL
AND p.date_created < '2023-06-13T08:51:00Z'::timestampwithtimezone
AND p.id < 321482
AND p.is_staff = false
AND (
p.full_name ILIKE UNACCENT('%john%')
OR e.email ILIKE '%john%'
OR TO_ISO_DATE(p.date_created) ILIKE '%john%'
)
ORDER BY
p.date_created DESC
LIMIT 40
Thanks! Franck