I have configured a trigram index for fuzzy search in Postgres and inserted 8,500,000 rows into this table, each with a four-word text value.

CREATE DATABASE people;
\connect people;
CREATE TABLE person
(
    id   bigserial NOT NULL,
    name text      NULL
);

CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION pg_trgm;

CREATE INDEX name_trigram_idx ON person USING gin (name gin_trgm_ops);

INSERT INTO person (name)
SELECT CONCAT(g.name, ' ', g.name, ' ', g.name, ' ', (SELECT array_to_string(
                                                                     array(select substr(
                                                                                          'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
                                                                                          ((random() * (36 - 1) + 1)::integer),
                                                                                          1)
                                                                           from generate_series(1, 10)), '')))
FROM generate_series(1, 8500000) AS g (name);

Now I do a fuzzy search:

SELECT * FROM person WHERE name % '1200000'

And get these results:

   id    |                name                
---------+------------------------------------
   12000 | 12000 12000 12000 CLXBHQVEKC
  120000 | 120000 120000 120000 CLXBHQVEKC
 1200000 | 1200000 1200000 1200000 CLXBHQVEKC
 1202000 | 1202000 1202000 1202000 CLXBHQVEKC
 1212000 | 1212000 1212000 1212000 CLXBHQVEKC
(5 rows)

It takes about 1 second. That's good, but I would like it to be faster (around 300 ms).

What are possible ways to optimize the speed? Here is the EXPLAIN ANALYZE output:

EXPLAIN (ANALYZE) SELECT * FROM person WHERE name % '1200000'
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on person  (cost=201.87..24804.13 rows=8499 width=42) (actual time=45.135..972.583 rows=5 loops=1)
   Recheck Cond: (name % '1200000'::text)
   Rows Removed by Index Recheck: 147980
   Heap Blocks: exact=25286
   ->  Bitmap Index Scan on name_trigram_idx  (cost=0.00..199.74 rows=8499 width=0) (actual time=40.833..40.840 rows=147985 loops=1)
         Index Cond: (name % '1200000'::text)
 Planning Time: 1.538 ms
 Execution Time: 973.046 ms
(8 rows)

And a link to my test repo.

Generally speaking, I would like a typo-tolerant search over multi-word text in a table with about 9 million rows, and it shouldn't take longer than 300 ms to respond.

P.S. For this experiment I'm running Postgres in Docker with 6 CPUs, 8 GB of memory and 1 GB of swap.

Maxim Yefremov

1 Answer

You create the fuzzystrmatch extension, but you never use it. Are you missing an example?

This type of query is very context sensitive, so you have to test it on realistic example data with realistic queries. Your example doesn't look very realistic. For one thing, your random string is the same random string in every row, rather than a different one for each row. Also, having small integers embedded in your names seems pretty odd.
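As an aside, one way to get a different random suffix per row (a sketch, not code from the question; it produces hex digits rather than the question's A-Z0-9 alphabet) is to put a volatile expression directly in the select list instead of an uncorrelated subquery:

-- random() is volatile, so it is re-evaluated for every row,
-- giving each name its own 10-character suffix.
INSERT INTO person (name)
SELECT CONCAT(g.name, ' ', g.name, ' ', g.name, ' ',
              upper(substr(md5(random()::text), 1, 10)))
FROM generate_series(1, 8500000) AS g (name);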

Your query of '1200000' also doesn't seem very realistic, or if it is, then it is pretty unfortunate. Due to the repetition of zeros, it is a low-complexity query and doesn't give the query engine much to grab on to. If you are worried about the average case rather than the worst case, it is a poor choice of example.

Using the "%" operator on strings of quite unequal length is rarely the right thing to do. Perhaps you want "%>" or "%>>" instead. These were added to pg_trgm in 9.6 and 11, respectively. They measure how similar the query is to the most similar substring chunk of the column, rather than to the whole of the column. They differ in that %>> requires the "chunk" to be aligned at word boundaries, while %> does not. Not only do they give what I think are better results, they are also much faster (in this case; I don't know about in general).
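For example (queries sketched here, not taken from the question), both operators are supported by the existing gin_trgm_ops index:

-- Word similarity: compares the query against the most similar substring
-- of name; governed by pg_trgm.word_similarity_threshold (default 0.6). 9.6+.
SELECT * FROM person WHERE name %> '1200000';

-- Strict word similarity: the matched chunk must align with word boundaries;
-- governed by pg_trgm.strict_word_similarity_threshold (default 0.5). PostgreSQL 11+.
SELECT * FROM person WHERE name %>> '1200000';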

Finally, your query would be much faster if you increased the setting of pg_trgm.similarity_threshold. Of course, then you also get no results, because none of your data is really all that similar to '120000'. This goes back to the choice of operator, as the presence of 'CLXBHQVEKC' drags down how similar anything can be when using "%".
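For example (the value here is illustrative, not a tuned recommendation):

-- The default threshold is 0.3; raising it leaves far fewer candidate rows
-- for the bitmap heap scan to recheck. With this data it also returns no
-- rows, for the reason described above.
SET pg_trgm.similarity_threshold = 0.5;
SELECT * FROM person WHERE name % '1200000';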

jjanes