
I have a TEXT keyvalues column in Postgres:

select * from test5 limit 5;

 id |                      keyvalues
----+------------------------------------------------------
  1 | ^ first 1 | second 3
  2 | ^ first 1 | second 2 ^ first 2 | second 3
  3 | ^ first 1 | second 2 | second 3
  4 | ^ first 2 | second 3 ^ first 1 | second 2 | second 2
  5 | ^ first 2 | second 3 ^ first 1 | second 3
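
For reference, a table of this shape can be reproduced with something like the following; the row generator is only an assumption to get a few million rows of similar strings, the column layout is taken from the sample above.

create table test5 (
    id        bigserial primary key,
    keyvalues text not null
);

-- assumed generator: roughly 5 million rows of "^ first X | second Y" style strings
insert into test5 (keyvalues)
select '^ first '    || (1 + floor(random() * 2))::int
    || ' | second '  || (2 + floor(random() * 2))::int
from   generate_series(1, 5000000);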

My queries must exclude the ^ character from the middle of the match, so I'm using regular expressions:

explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=78383.31..78383.32 rows=1 width=8) (actual time=7332.030..7332.030 rows=1 loops=1)
   ->  Gather  (cost=78383.10..78383.30 rows=2 width=8) (actual time=7332.021..7337.138 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=77383.10..77383.10 rows=1 width=8) (actual time=7328.155..7328.156 rows=1 loops=3)
               ->  Parallel Seq Scan on test5  (cost=0.00..77382.50 rows=238 width=0) (actual time=7328.146..7328.146 rows=0 loops=3)
                     Filter: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
                     Rows Removed by Filter: 1666668
 Planning Time: 0.068 ms
 Execution Time: 7337.184 ms

The query works (zero rows match), but is way too slow at > 7 seconds.
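
For illustration, the intent of the pattern is that "first 1" and "second 0" must fall inside the same ^-delimited group, which can be sanity-checked against literal strings:

select s, s ~* '\^ first 1[^\^]+second 0' as matches
from (values
    ('^ first 1 | second 0'),                       -- same group: matches
    ('^ first 1 | second 2 ^ first 2 | second 0')   -- "second 0" only in another group: no match
) as t(s);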

I thought indexing with trigrams would help, but no luck:

create extension if not exists pg_trgm;
create index on test5 using gin (keyvalues gin_trgm_ops);

explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1484.02..1484.03 rows=1 width=8) (actual time=23734.646..23734.646 rows=1 loops=1)
   ->  Bitmap Heap Scan on test5  (cost=1480.00..1484.01 rows=1 width=0) (actual time=23734.641..23734.641 rows=0 loops=1)
         Recheck Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
         Rows Removed by Index Recheck: 5000005
         Heap Blocks: exact=47620
         ->  Bitmap Index Scan on test5_keyvalues_idx  (cost=0.00..1480.00 rows=1 width=0) (actual time=1756.158..1756.158 rows=5000005 loops=1)
               Index Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
 Planning Time: 0.412 ms
 Execution Time: 23734.722 ms

The query with the trigram index is 3x slower! It still returns the correct result (zero rows). I expected the trigram index to figure out immediately that there's no 'second 0' string anywhere, and be super fast.

(Motivation: I want to avoid normalizing the keyvalues into another table, so I'm looking to encode the matching logic in a single TEXT field using text indexing and regexps instead. The logic works, but is too slow, as is JSONB.)

user124114

2 Answers


I expected the trigram index to figure out immediately there's no second 0 string anywhere

'second' and '0' are separate words, so it cannot detect their joint absence as such. It seems like it could detect the absence of ' 0', but this comment from "contrib/pg_trgm/trgm_regexp.c" seems pertinent:

 * Note: Using again the example "foo bar", we will not consider the
 * trigram "  b", though this trigram would be found by the trigram
 * extraction code.  Since we will find " ba", it doesn't seem worth
 * trying to hack the algorithm to generate the additional trigram.

Since 0 is the last character in the pattern string, there will be no trigram of the form " 0a", either, so it just misses that opportunity.
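
You can see the word-by-word extraction on the data side with pg_trgm's show_trgm() (the output sketched in the comment is indicative):

select show_trgm('second 0');
-- trigrams for "second" ("  s", " se", "sec", "eco", "con", "ond", "nd ")
-- plus the padded ones for "0" ("  0", " 0 "); nothing spans the two words,
-- and the pattern-side extraction generates neither the "  0" nor any " 0x"
-- trigram here, as described above.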

Even if it were not for this limitation, your approach seems extremely fragile.

jjanes

Jeff provided the explanation you were looking for.

Phrase search as provided by the text search infrastructure might be your solution.

CREATE INDEX test5_keyvalues_ts_idx ON test5 USING GIN (to_tsvector('simple', keyvalues));

Query:

SELECT *
FROM   test5
WHERE  to_tsvector('simple', keyvalues) @@ phraseto_tsquery('simple', 'first 1 second 0')
AND    keyvalues ~* '\^ first 1[^\^]+second 0';

Use the 'simple' text search configuration as you don't want language-specific stemming.

The ^ character is significant in your case, but it is considered noise by the 'default' text search parser (the only one implemented, actually), so it is not indexed. Hence: narrow the result down with the text search index, then filter with the regexp.
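
For illustration, this is how the 'simple' configuration sees a sample row and the phrase (output shown in the comments is indicative):

SELECT to_tsvector('simple', '^ first 1 | second 2 ^ first 2 | second 3');
-- '1':2 '2':4,6 '3':8 'first':1,5 'second':3,7   (the ^ and | are dropped as noise)

SELECT phraseto_tsquery('simple', 'first 1 second 0');
-- 'first' <-> '1' <-> 'second' <-> '0'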

It's possible to add a custom parser, but I never tried that. And your pattern only needs to assert the absence of ^ inside the match anyway, so I guess the above is your best shot.


Erwin Brandstetter