
I'm trying to optimize a somewhat convoluted report-type query that's performed over two large tables and 3 smaller ones. Part of the filtering on the two large tables is done on a nullable varchar column (though it's almost never null, and overwhelmingly holds a 20-character string). The database has a GIN index on that column for both tables, and the condition is something like (in regex form):

table1.code ~ any(array[
          '^.{3}foo',
          '^.{3}bar(?!list|of|exceptions|for|bar)',
          '^.{3}baz(?!list|of|exceptions|for|baz)',
          '^.{3}quu(?!list|of|exceptions|for|quu)',
          '^.{3}bat(?!list|of|exceptions|for|bat)',
          '^.{3}abc(?!list|of|exceptions|for|abc)',
          '^.{3}xyz(?!list|of|exceptions|for|xyz)'
      ])

with the exception list varying for each of the options. Only the list varies per "root" (e.g. bar), though; the .{3} part is fixed. The condition is also identical for both large tables.

I've tried rewriting this condition via similar to, which looked something like

table1.code similar to '___(foo|bar|baz|quu|bat|abc|xyz)%'
and table1.code not similar to '___bar(list|of|exceptions|for|bar)%'
and ...

in hopes of first filtering out records that don't match the positive condition (an overwhelming majority of them don't). I even created a partial index with the exact wording of the similar to snippet, but creating the index somehow led to worse performance, despite the GIN index on the column being used instead of the partial one (based on a breakdown of the same query on the same database with the same conditions from explain.dalibo.com).
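Roughly (with the real table, index, and column names replaced by placeholders), that partial index looked like:

```sql
-- Partial index matching the positive SIMILAR TO condition
-- (names are placeholders, not the real ones):
CREATE INDEX table1_code_partial_idx ON table1 (code)
WHERE code similar to '___(foo|bar|baz|quu|bat|abc|xyz)%';
```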

I feel like similar to was the wrong choice and that I should try again with regex. My questions are therefore:

  1. Does similar to lend itself well to partial indexes? I sense that regex is a better choice, but maybe I just set my index up wrong. The query plan seems to convert all similar to conditions to plain regex, but checking the source of the index reveals a bunch of field::text ~ similar_escape('^.{3}foo'::text, null::text), which may be preventing the optimizer from using it?
  2. Is there any meaningful performance gain in making the positive condition separate and joining the exceptions via and?
  3. In the case of regex, is there any meaningful performance difference in using ~ any(array[]) vs a bunch of and-joined matches against individual entries in the array?
  4. Is a better way to handle this evading me? The regex solution was something I came up with to optimize the previous version, which did a bunch of substr() calls and performed orders of magnitude worse.

Edit 1

Related information:

  • PostgreSQL 12.3
  • table1 cardinalities:
    • Total size: 25200000 records
    • Records filtered by an indexed bool flag: 9700000
    • Records filtered by an indexed datetime field: 9800000 (using "worst case" scenario, an end of the year report filtering records from January 1st to December 31st)
    • Records filtered by the code condition in the post: 9770000
    • Records filtered by combining the conditions: 1000000 (yes, the query analyzer did warn about filtering out most of the table)
  • table2 cardinalities:
    • Total size: 40700000 records
    • Records filtered by an indexed bool flag: 31400000
    • Records filtered by an indexed datetime field: 10700000 (same scenario)
    • Records filtered by a partially indexed (where table2.other_field ~ '^abcde') varchar field: 7800000
    • Records filtered by the code condition in the post: 9400000
    • Records filtered by combining the conditions: 1000000

Another thing of note is that both tables are filtered through a CTE first.

bqback

1 Answer


I feel like similar to was the wrong choice, and that I should try again with regex

Oh yes, SIMILAR TO is always the wrong choice, but for different reasons. See:

TLDR: every SIMILAR TO expression is (currently) translated to an equivalent regex expression internally. Plus, nobody needs the odd hybrid syntax for SIMILAR TO patterns when we can use universally applicable regex patterns to begin with.
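You can inspect that translation directly. The internal function similar_escape() (callable, though undocumented, in Postgres 12) converts a SIMILAR TO pattern into the regex that is actually matched:

```sql
-- SIMILAR TO is rewritten to a regex match internally:
-- '_' becomes '.', '%' becomes '.*', and the result is anchored.
SELECT similar_escape('___(foo|bar)%', NULL);
-- typically yields: ^(?:...(foo|bar).*)$
```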

  1. Does similar to lend itself well to partial indexes?

It follows logically that it can use the same indexes a regex expression can.

In hopes of filtering out records that don't match the first condition (an overwhelming majority of them don't) first, and even created a partial index with the exact wording ...

(Bold emphasis mine.)

What index exactly? I suspect you tried a GIN index. But a partial B-tree index will do a much better job, given the selective base filter (according to your statement). It will also be much smaller and cheaper to maintain.

Solution

CREATE INDEX tbl_fld_idx ON tbl (field)
WHERE field ~ '^...(foo|bar|baz|quu|bat|abc|xyz)';  -- split off for partial idx!

SELECT *
FROM   tbl
WHERE  field ~ '^...(foo|bar|baz|quu|bat|abc|xyz)'  -- split off to match partial idx!
AND    field !~ '^...((bar(list|of|exceptions|for|bar))|(baz(list|of|exceptions|for|baz))|(quu(list|of|exceptions|for|quu))|(bat(list|of|exceptions|for|bat))|(abc(list|of|exceptions|for|abc))|(xyz(list|of|exceptions|for|xyz)))';

fiddle

I also simplified to regular expressions with simple branches (nested 2 levels), as those perform better than negative lookaheads in my experience.
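For illustration, here is the rewrite for one "root", checked against a few made-up sample values; the negative lookahead becomes a plain branch regex plus a !~ condition:

```sql
-- Made-up sample values; both styles accept '001barrel'
-- and reject '001barlist' and '001quux':
SELECT code
     , code ~ '^.{3}bar(?!list|of|exceptions|for|bar)'  AS lookahead_style
     , code ~ '^...bar'
       AND code !~ '^...bar(list|of|exceptions|for|bar)' AS branch_style
FROM  (VALUES ('001barrel'), ('001barlist'), ('001quux')) v(code);
```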

We might even use the complete regex for the partial index, but if the base expression already filters out the overwhelming majority of rows, the full regex would only make the index more expensive and less versatile overall.

If the SELECT list is a subset of the index expressions (only field in the example), index-only scans are possible.
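To verify, an EXPLAIN on a query that selects only field should report an index-only scan, assuming the table has been vacuumed recently so the visibility map is current (index and table names as in the example above):

```sql
VACUUM ANALYZE tbl;  -- keep the visibility map current

EXPLAIN (COSTS OFF)
SELECT field
FROM   tbl
WHERE  field ~ '^...(foo|bar|baz|quu|bat|abc|xyz)';
-- ideally shows: Index Only Scan using tbl_fld_idx on tbl
```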

Erwin Brandstetter