149

I had to write a simple query where I go looking for people's name that start with a B or a D:

SELECT s.name 
FROM spelers s 
WHERE s.name LIKE 'B%' OR s.name LIKE 'D%'
ORDER BY 1

I was wondering if there is a way to rewrite this to become more performant. So I can avoid or and / or like?

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633
Lucas Kauffman
  • 1,835
  • 4
  • 17
  • 18

8 Answers8

242

Pattern matching operators

  • LIKE (~~) is simple and fast but limited in its capabilities.
    ILIKE (~~*) the case insensitive variant.

  • ~ (regular expression match) is powerful but more complex and may be slow for anything more than basic expressions.
    ~* is the case insensitive variant.

  • SIMILAR TO is just pointless. A peculiar blend of LIKE and regular expressions. I never use it. See below.

All of the above can use a trigram index. For left-anchored patterns, also a B-tree index using COLLATE "C". Or with any other collation and the operator class text_pattern_ops. Or even an SP-GiST index. See below.

Basics about pattern matching in the manual.

Related operators

  • ^@ is "starts with" operator (for prefix matching), equivalent to the starts_with() function.
    Added with Postgres 11, can use an SP-GiST index. Since Postgres 15 also a B-tree index using a "C" collation. See below.

  • % is the "similarity" operator, provided by the additional module pg_trgm. See below.

  • @@ is the text search operator. See below.

Your query

... is pretty much the optimum. Syntax won't get much shorter, query won't get much faster:

SELECT name FROM spelers
WHERE  name LIKE 'B%' OR name LIKE 'D%'
ORDER  BY 1;

Or equivalent, using regular expressions:

... WHERE name ~ '^B' OR name ~ '^D'

A bit shorter (can use the same index in modern Postgres):

... WHERE name LIKE ANY ('{B%,D%}')
... WHERE name ~ ANY ('{^B,^D}')

A regular expression with branches shortens the syntax some more:

... WHERE name ~ '^(B|D).*'

Or a character class (only for the case with a single character):

... WHERE name ~ '^[BD].*'

For big tables, index support improves performance by orders of magnitude.

Postgres 11 added the "starts with" operator ^@- fast when supported with an SP-GiST index:

... WHERE name ^@ 'B' OR name ^@ 'D'

Or:

... WHERE name ^@ ANY ('{B,D}')

Since Postgres 15 the first variant can also use a B-tree index using COLLATE "C".

db<>fiddle here

Index

For performance with big tables, create an index like this supporting left-anchored search patterns (matching from the start of the string):

CREATE INDEX spelers_name_special_idx ON spelers (name COLLATE "C");

Requires per-column collation support added with Postgres 9.1.

See:

In DBs running with the "C" locale (uncommon), a plain B-tree index does the job.

In older versions (or still today), the special operator class text_pattern_ops fills the same role:

CREATE INDEX spelers_name_special_idx ON spelers (name text_pattern_ops);

SIMILAR TO or regular expressions with basic left-anchored expressions can use this index, too. Very old versions failed to use the index for more complex expressions.

Or use an expression index on a truncated substring for columns with long strings, and repeat the same expression in queries.
Or use the ^@ operator with a matching index.

Trigram matching

Trigram matches or text search use special GIN or GiST indexes.

You can install the additional module pg_trgm to provide index support for any LIKE / ILIKE pattern (and simple regexp patterns with ~ / ~*) using a GIN or GiST index.

Details, example and links:

pg_trgm provides additional operators like:

  • % - the "similarity" operator
  • <% (commutator: %>) - the "word_similarity" operator in Postgres 9.6 or later
  • <<% (commutator: %>>) - the "strict_word_similarity" operator in Postgres 11 or later

Text search

Is a special type of pattern matching with separate infrastructure and index types. It uses dictionaries and stemming and is a great tool to find words in documents, especially for natural languages.

Prefix matching is also supported:

Phrase search is supported since Postgres 9.6:

Consider introductions in the manual and the overview of operators and functions.

Additional tools for fuzzy string matching

The additional module fuzzystrmatch offers more options, but performance is generally inferior to all of the above.

In particular, various implementations of the levenshtein() function may be instrumental.

Why are regular expressions (~) always faster than SIMILAR TO?

SIMILAR TO expressions are rewritten into regular expressions internally. For every SIMILAR TO expression, there is at least one faster regular expression (saving the overhead of rewriting the expression). There is no performance gain in using SIMILAR TO ever.

Simple expressions that can make do with LIKE (~~) are faster with LIKE anyway.

SIMILAR TO is only supported in PostgreSQL because it ended up in early drafts of the SQL standard. They still haven't gotten rid of it. But there are plans to remove it and include regexp matches instead - or so I heard.

EXPLAIN ANALYZE reveals it. Just try yourself:

EXPLAIN ANALYZE
SELECT * FROM spelers WHERE name SIMILAR TO 'B%';

The query plan reveals:

...  
Seq Scan on spelers  (cost= ...  
  Filter: (name ~ '^(?:B.*)$'::text)

SIMILAR TO has been rewritten with a regular expression (~).

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633
14

How about adding a column to the table. Depending on your actual requirements:

person_name_start_with_B_or_D (Boolean)

person_name_start_with_char CHAR(1)

person_name_start_with VARCHAR(30)

PostgreSQL doesn't support computed columns in base tables a la SQL Server but the new column can be maintained via trigger. Obviously, this new column would be indexed.

Alternatively, an index on an expression would give you the same, cheaper. E.g.:

CREATE INDEX spelers_name_initial_idx ON spelers (left(name, 1)); 

Queries that match the expression in their conditions can utilize this index.

This way, the performance hit is taken when the data is created or amended, so may only be appropriate for a low activity environment (i.e. much fewer writes than reads).

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633
onedaywhen
  • 2,632
  • 1
  • 20
  • 21
10

You could try

SELECT s.name
FROM   spelers s
WHERE  s.name SIMILAR TO '(B|D)%' 
ORDER  BY s.name

I've no idea whether or not either the above or your original expression are sargable in Postgres though.

If you create the suggested index would also be interested to hear how this compares with the other options.

SELECT name
FROM   spelers
WHERE  name >= 'B' AND name < 'C'
UNION ALL
SELECT name
FROM   spelers
WHERE  name >= 'D' AND name < 'E'
ORDER  BY name
Martin Smith
  • 87,941
  • 15
  • 255
  • 354
2

What I have done in the past, faced with a similar performance issue, is to increment the ASCII character of the last letter, and do a BETWEEN. You then get the best performance, for a subset of the LIKE functionality. Of course, it only works in certain situations, but for ultra-large datasets where you're searching on a name for instance, it makes performance go from abysmal to acceptable.

Mel Padden
  • 387
  • 1
  • 5
2

Very old question, but I found another fast solution to this problem:

SELECT s.name 
FROM spelers s 
WHERE ascii(s.name) in (ascii('B'),ascii('D'))
ORDER BY 1

Since function ascii() looks only at first character of the string.

Sole021
  • 21
  • 1
2

For checking of initials, I often use casting to "char" (with the double quotes). It's not portable, but very fast. Internally, it simply detoasts the text and returns the first character, and "char" comparison operations are very fast because the type is 1-byte fixed length:

SELECT s.name 
FROM spelers s 
WHERE s.name::"char" =ANY( ARRAY[ "char" 'B', 'D' ] )
ORDER BY 1

Note that casting to "char" is faster than the ascii() slution by @Sole021, but it is not UTF8 compatible (or any other encoding for that matter), returning simply the first byte, so should only be used in cases where the comparison is against plain old 7-bit ASCII characters.

Ezequiel Tolnay
  • 5,028
  • 1
  • 17
  • 23
1

There are two methods not mentioned yet for dealing with such cases:

  1. partial (or partitioned - if created for full range manually) index - most useful when only a subset of data is required (for example during some maintenance or temporary for some reporting):

    CREATE INDEX ON spelers WHERE name LIKE 'B%'
    
  2. partitioning the table itself (using the first character as partitioning key) - this technique is especially worth considering in PostgreSQL 10+ (less painful partitioning) and 11+ (partition pruning during query execution).

Moreover, if the data in a table is sorted, one can benefit from using BRIN index (over the first character).

Tomasz Pala
  • 181
  • 5
-4

Probably faster to do a single character comparison:

SUBSTR(s.name,1,1)='B' OR SUBSTR(s.name,1,1)='D'
ypercubeᵀᴹ
  • 99,450
  • 13
  • 217
  • 306