1

Trigrams don't work for non-Latin languages on the Mac.

On the Mac, I'm getting the following behavior:

select show_trgm('peace') -> {"  p"," pe",ace,"ce ",eac,pea}
select show_trgm('über')  -> {0xb1fa7d,0x0d7f89,0x4f84fe,ber,"er "}
select show_trgm('שלום')  -> {}

After trying with the default collations and getting the empty results, I tried changing to a UTF-8 collation, which still doesn't work.

postgres=# \l
                              List of databases
   Name    | Owner | Encoding |   Collate   |    Ctype    | Access privileges
-----------+-------+----------+-------------+-------------+-------------------
 testdb    | tomer | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 postgres  | tomer | UTF8     | C           | C           |
 template0 | tomer | UTF8     | C           | C           | =c/tomer         +
           |       |          |             |             | tomer=CTc/tomer
 template1 | tomer | UTF8     | C           | C           | =c/tomer         +
           |       |          |             |             | tomer=CTc/tomer

I'm using postgresql 14 on macOS 13.4.

Running the third query works on a postgres 13 instance in AWS RDS, with the UTF-8 encoding, collation and ctype.

select show_trgm('שלום')  -> {0x46e146,0x64182e,0x8c40cc,0x9c1fe2,0xda89ec}

I tried running this statement on both machines:

select * FROM ts_debug('english', 'שלום');

On the Mac, I got

 alias │ description   │ token │ dictionaries │ dictionary │ lexemes 
═══════╪═══════════════╪═══════╪══════════════╪════════════╪═════════
 blank │ Space symbols │ שלום  │ {}           │ NULL       │ NULL

On the cloud server, I got

 alias │    description    │ token │  dictionaries  │  dictionary  │ lexemes 
═══════╪═══════════════════╪═══════╪════════════════╪══════════════╪═════════
 word  | Word, all letters | שלום  | {english_stem} | english_stem | {שלום}

Ideas?

Laurenz Albe
  • 61,070
  • 4
  • 55
  • 90
Tomer
  • 111
  • 2

1 Answers1

0

Looks like the CTYPE setting for the English locale on MacOS considers Hebrew characters as white space. PostgreSQL doesn't have its own locale definitions, it uses the ones from the operating system.

I recommend creating the database with the Israeli locale he_IL.utf8 or, if PostgreSQL was built with ICU support, use an ICU locale.

Laurenz Albe
  • 61,070
  • 4
  • 55
  • 90