I'm trying to optimize a somewhat convoluted report-type query that runs over two large tables and three smaller ones. Part of the filtering on the two large tables is done on a nullable varchar column (though it's almost never null, and is overwhelmingly a 20-character string). The database has a GIN index on that column for both tables, and the condition is something like (in regex form):
table1.code ~ any(array[
'^.{3}foo',
'^.{3}bar(?!list|of|exceptions|for|bar)',
'^.{3}baz(?!list|of|exceptions|for|baz)',
'^.{3}quu(?!list|of|exceptions|for|quu)',
'^.{3}bat(?!list|of|exceptions|for|bat)',
'^.{3}abc(?!list|of|exceptions|for|abc)',
'^.{3}xyz(?!list|of|exceptions|for|xyz)'
])
with only the exception list varying per "root" (e.g. `bar`); the `.{3}` part is fixed. The condition is also identical for both large tables.
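For reference, a GIN index that can serve regex (`~`) matches on a varchar column is typically a `pg_trgm` trigram index; I'm assuming that's the setup here. A minimal sketch (the index name is a placeholder):

```sql
-- Minimal sketch of a trigram GIN index that can back regex (~) filters.
-- Assumes the pg_trgm extension; the index name is hypothetical.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX table1_code_trgm_idx
    ON table1 USING gin (code gin_trgm_ops);
```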
I've tried rewriting this condition via `similar to`, which looked something like
table1.code similar to '___(foo|bar|baz|quu|bat|abc|xyz)%'
and table1.code not similar to '___bar(list|of|exceptions|for|bar)%'
and ...
in hopes of first filtering out the records that don't match the positive condition (the overwhelming majority of them don't). I even created a partial index with the exact wording of the `similar to` snippet, but creating that index somehow led to worse performance, even though, based on a breakdown of the same query on the same database with the same conditions from explain.dalibo.com, the GIN index on the column was used instead of the partial one.
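For concreteness, a sketch of what that partial index would look like (the index name is a placeholder):

```sql
-- Sketch of the partial index described above; the index name is
-- hypothetical. The planner will only consider a partial index when it
-- can prove the query's WHERE clause implies the index predicate.
CREATE INDEX table1_code_similar_idx
    ON table1 (code)
    WHERE code SIMILAR TO '___(foo|bar|baz|quu|bat|abc|xyz)%';
```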
I feel like `similar to` was the wrong choice and that I should try again with regex, so my questions are:
- Does `similar to` lend itself well to partial indexes? I sense that regex is a better choice, but maybe I just set my index up wrong. The query plan seems to convert all `similar to` conditions to plain regex, but checking the source of the index reveals a bunch of `field::text ~ similar_escape('^.{3}foo'::text, null::text)`, which maybe prevents the optimizer from utilizing it?
- Is there any meaningful performance gain in making the positive condition separate and joining the exceptions via `and`?
- In the case of regex, is there any meaningful performance difference in using `~ any(array[...])` vs a bunch of and-joined matches against individual entries in the array? (See the sketch after this list for the and-joined form.)
- Is a better way to handle this evading me? The regex solution was something I came up with to optimize the previous version, which did a bunch of `substr()` calls and performed orders of magnitude worse.
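To make the and-joined variant concrete, here's a sketch of what I mean: the cheap positive prefix test split out, and each root's exceptions expressed as a plain negated regex (`!~`) instead of a lookahead (exception lists abbreviated):

```sql
-- Sketch only: same semantics as the lookahead version above, but with
-- the positive prefix test separated out and each root's exceptions
-- expressed as an and-joined negative match.
SELECT *
FROM table1
WHERE table1.code ~ '^.{3}(foo|bar|baz|quu|bat|abc|xyz)'
  AND table1.code !~ '^.{3}bar(list|of|exceptions|for|bar)'
  AND table1.code !~ '^.{3}baz(list|of|exceptions|for|baz)'
  -- ... one !~ clause per root's exception list
```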
Edit 1
Related information:
- PostgreSQL 12.3
- `table1` cardinalities:
  - Total size: 25200000 records
  - Records filtered by an indexed bool flag: 9700000
  - Records filtered by an indexed datetime field: 9800000 (using the "worst case" scenario: an end-of-year report filtering records from January 1st to December 31st)
  - Records filtered by the `code` condition in the post: 9770000
  - Records filtered by combining the conditions: 1000000 (yes, the query analyzer did warn about filtering out most of the table)
- `table2` cardinalities:
  - Total size: 40700000 records
  - Records filtered by an indexed bool flag: 31400000
  - Records filtered by an indexed datetime field: 10700000 (same scenario)
  - Records filtered by a partially indexed (`where table2.other_field ~ '^abcde'`) varchar field: 7800000
  - Records filtered by the `code` condition in the post: 9400000
  - Records filtered by combining the conditions: 1000000
Another thing of note is that both tables are first filtered using a CTE.
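Since this is PostgreSQL 12, it may matter that a non-recursive, single-reference CTE is inlined into the main query by default; the `MATERIALIZED` keyword restores the pre-12 optimization-fence behavior. A sketch of the general shape of the filtering CTE (the column names are made up):

```sql
-- Hypothetical shape of the filtering CTE; flag and created_at are
-- made-up column names. On PostgreSQL 12+, NOT MATERIALIZED (the
-- default for a single-reference CTE) lets the planner push conditions
-- into the CTE instead of materializing it first.
WITH filtered AS NOT MATERIALIZED (
    SELECT *
    FROM table1
    WHERE flag
      AND created_at >= '2020-01-01'
      AND created_at <  '2021-01-01'
      AND code ~ '^.{3}(foo|bar|baz|quu|bat|abc|xyz)'
)
SELECT count(*)
FROM filtered;
```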