Compare occurence of chars in PostgreSQL string

Question

I have some data like this:

metaphone	lag
FLKSW	[null]
PPS	FLKSW
PPS	PPS
PSP	PPS

And I want to compare the string values in both columns on the folowing condition: they're similar (assign some value, like 1) if they share at least 2 chars. Otherwise, they're not similar.

So in the example, PPS and PSP would be similar.

How can this substring comparison be achieved?

I know one approach would be to extract substrings and manually compare them, but it feels hacky and I don't know the maximum number of chars that can occur.

Erwin Brandstetter · Accepted Answer · 2022-05-06T03:45:17.050

they're similar ... if they share at least 2 chars.

Unfortunately, there is no built-in "intersect" operator or function for strings or arrays. You can roll your own function to count overlapping characters:

CREATE FUNCTION f_count_overlapping_char(text, text)
  RETURNS int
  LANGUAGE sql PARALLEL SAFE IMMUTABLE STRICT AS
$func$
SELECT count(*)::int
FROM (
   SELECT unnest(string_to_array($1, NULL))
   INTERSECT ALL
   SELECT unnest(string_to_array($2, NULL))
   ) sub;
$func$;

INTERSECT ALL includes duplicate matching characters. To fold duplicates, use just INTERSECT instead.

Then your query can be:

SELECT *, f_count_overlapping_char(t1.metaphone, t2.metaphone) AS overlap
FROM   tbl t1
JOIN   tbl t2 ON t1.id < t2.id
             AND f_count_overlapping_char(t1.metaphone, t2.metaphone) >= 2;

db<>fiddle here

But it's expensive and does not scale well with more rows in the table - O(N²). Depending on your actual objective there are various superior alternatives - like trigram similarity provided by the additional module pg_trgm. See:

What does the Postgres <-> operator do?

Compare occurence of chars in PostgreSQL string

1 Answers1