Compare words in a string without considering their positions?

Question

In Postgres 9.6 I want to test whether two strings like these are considered the same:

'this is a test number 01', 'number this is 01 a test'

So I have created this function:

CREATE OR REPLACE FUNCTION sort_text(a text) RETURNS text AS $$
    declare t1 text;
    BEGIN
      select(array_to_string (
        array(
          select * from unnest(string_to_array(a, ' ')) order by 1), ' ')) into t1;
      RETURN t1;
    END;

$$ LANGUAGE plpgsql;

select (sort_text('this is a test number 01') = sort_text('number this is 01 a test'));

which actually looks to be working correctly.

I was wondering, is there any better way to do this?

'this and this' and 'and this' are considered to be different.

All the strings are already cleaned up (extra spaces and punctuation removed), and duplication is not a problem. Maximum string length is an estimated 50 characters.

Erwin Brandstetter · Accepted Answer · 2017-06-05T18:48:20.883

I suggest a single SELECT in a plain SQL function:

CREATE OR REPLACE FUNCTION strings_equivalent(a text, b text)
  RETURNS bool AS
$func$
   SELECT a1 = b1
   FROM  (
      SELECT string_agg(w, ' ') AS a1
      FROM  (
         SELECT w
         FROM   unnest(string_to_array(a, ' ')) w
         ORDER  BY w
         ) a1
      ) a2
   , (
      SELECT string_agg(w, ' ') AS b1
      FROM  (
         SELECT w
         FROM   unnest(string_to_array(b, ' ')) w
         ORDER  BY w
         ) b1
      ) b2
   WHERE  length(a) = length(b)
   UNION ALL
   SELECT FALSE
   LIMIT 1;  -- for clarity, not needed
$func$ LANGUAGE sql IMMUTABLE;

Call:

SELECT strings_equivalent('this is a test number 01', 'number this is 01 a test');

This is assuming:

Duplicate words count like any other words
Separator is a single space
No leading or trailing spaces

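Because of the length shortcut, the function returns FALSE for a NULL input or when only one string is empty; it returns NULL only when both inputs are empty strings.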

The UNION ALL construct is a shortcut to return FALSE immediately if the input strings don't have the same length and avoid more expensive processing.

LIMIT 1 is not needed because the function only returns the first column of the first row anyway and ignores the rest if there are more rows.
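For readers who find the SQL dense, the length shortcut can be sketched outside the database. This is only a rough Python illustration, not the function above: the name strings_equivalent_sketch is hypothetical, it assumes a single-space separator with no leading or trailing spaces, and it does not model SQL NULL handling.

```python
def strings_equivalent_sketch(a, b):
    """Hypothetical helper: same words in any order, duplicates count."""
    # Cheap pre-check: reordering words separated by single spaces never
    # changes the total length, so differing lengths cannot be equivalent.
    if len(a) != len(b):
        return False
    # More expensive part: split on a single space, sort, and compare.
    return sorted(a.split(" ")) == sorted(b.split(" "))


if __name__ == "__main__":
    print(strings_equivalent_sketch("this is a test number 01",
                                    "number this is 01 a test"))
    print(strings_equivalent_sketch("this and this", "and this"))
```

The pre-check only rules out pairs early; strings of equal length still go through the sort and compare step.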

IMMUTABLE (since the result never changes for the same input) helps performance with repeated evaluation and allows indexes on functional expressions.

You could use regexp_split_to_table(a, ' ') instead of unnest(string_to_array(a, ' ')), but regular expression functions are typically more expensive. (You can cover more sophisticated separator characters with the regex, though, like '\s+' for any white space). Related:

How to preserve the original order of elements in an unnested array?
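To make the separator trade-off concrete, here is a small Python analogy (not Postgres itself; the helper names are made up for illustration). Splitting on a single literal space leaves empty strings and tabs in place, while a '\s+' pattern treats any run of whitespace as one separator.

```python
import re


def split_single_space(s):
    # Analogous in spirit to splitting on the literal ' ' separator.
    return s.split(" ")


def split_whitespace(s):
    # Analogous in spirit to splitting on the regex '\s+'.
    return re.split(r"\s+", s)


if __name__ == "__main__":
    text = "this  is\ta test"  # two spaces, then a tab
    print(split_single_space(text))
    print(split_whitespace(text))
```

If the input is guaranteed to use exactly one space between words, as stated in the question, the simpler split should be enough.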

BTW, your simple function sort_text() looks good. But use string_agg() instead of array_to_string(ARRAY(...)) in a simple SQL function. No variable assignment needed:

CREATE OR REPLACE FUNCTION sort_text(a text)
  RETURNS text AS
$$
   SELECT string_agg(w, ' ')   
   FROM  (
      SELECT w
      FROM   unnest(string_to_array(a, ' ')) w
      ORDER  BY 1
      ) sub
$$ LANGUAGE sql IMMUTABLE;

stefan · Answer 2 · 2017-06-07T05:48:35.727

Update: combining the ideas of @Erwin Brandstetter (use a "single" SELECT), and a set operation, the following may also be a possibility:

create or replace function strings_equivalent(a text, b text) 
returns boolean as $$
select 
  case 
    when length(a) >= length(b) then ( 
      select count( exa1 ) from (
        select unnest( string_to_array( a, ' ' ) ) 
        except all
        select unnest( string_to_array( b, ' ' ) ) ) exa1 
      ) = 0
    when length(a) < length(b) then ( 
      select count( exa2 ) from (
        select unnest( string_to_array( b, ' ' ) ) 
        except all
        select unnest( string_to_array( a, ' ' ) ) ) exa2
      ) = 0
    else false
  end
$$ language sql immutable;

Testing:

select 
  strings_equivalent('s1 s2 s3','s5 s4 s3 s2 s1')       false1
, strings_equivalent('s1 s2 s3 s4 s5','s1 s3 s5 s2 s4') true1
, strings_equivalent('s1 s2 s3 s4 s5','s6 s7')          false2
, strings_equivalent('a b','b a')   true2
, strings_equivalent('a b','a z')   false3
, strings_equivalent('a z','z a')   true3
, strings_equivalent('a b','a z x') false4
, strings_equivalent('a','a')       true4
, strings_equivalent('a 1 b','a a 1 1 b b')    false5
, strings_equivalent('this is','this is this') false6
;

Output:

 false1 | true1 | false2 | true2 | false3 | true3 | false4 | true4 | false5 | false6 
--------+-------+--------+-------+--------+-------+--------+-------+--------+--------
 f      | t     | f      | t     | f      | t     | f      | t     | f      | f
(1 row)
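The idea behind EXCEPT ALL here is a multiset difference: each word on the right-hand side cancels one occurrence on the left. As a rough sketch of that logic in Python (the names except_all and strings_equivalent_except are hypothetical, and collections.Counter stands in for the SQL set operation):

```python
from collections import Counter


def except_all(left, right):
    """Multiset difference: every word in `right` cancels one in `left`."""
    # Counter subtraction keeps only words with a positive remaining count.
    return Counter(left) - Counter(right)


def strings_equivalent_except(a, b):
    # Subtract the shorter string's words from the longer string's words,
    # mirroring the CASE on length(a) and length(b) in the function above.
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    leftover = except_all(longer.split(" "), shorter.split(" "))
    return not leftover  # an empty Counter means nothing was left over


if __name__ == "__main__":
    print(strings_equivalent_except("s1 s2 s3 s4 s5", "s1 s3 s5 s2 s4"))
    print(strings_equivalent_except("this is", "this is this"))
```

Subtracting from the longer string matters: if the longer side has nothing left over, the shorter side cannot have anything left over either.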

Evan Carroll · Answer 3 · 2017-06-05T19:28:59.583

I would use plperl, but having seen the performance of Erwin's answer I'm not sure that it matters much; both run in approximately the same time.

CREATE OR REPLACE FUNCTION strings_equivalent_pl(a text, b text)
RETURNS bool AS
$func$
   my (%arg1, %arg2);
   $arg1{$_}++ for grep /\w/, split /\s+/, $_[0];
   $arg2{$_}++ for grep /\w/, split /\s+/, $_[1];

   return 0 if keys(%arg1) != keys(%arg2);  # compare the number of distinct words

   foreach ( keys %arg1 ) {
     return 0 if $arg1{$_} != $arg2{$_};
   }
   return 1;
$func$
LANGUAGE plperl
STRICT IMMUTABLE;

In Perl, a hash is a collection of key-value pairs. Here we

create two hashes whose keys are your words and whose values are the number of occurrences of each word.
ensure that both hashes have the same number of keys (distinct words).
ensure that both hashes contain the same words with the same number of occurrences.

Note that we split on \s+, so any run of whitespace characters counts as one separator, and leading and trailing spaces do not matter (the grep drops the empty leading field that split produces).
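For anyone who does not read Perl, the three steps above can be sketched in Python. This is only an approximation of the plperl function: the helper names are hypothetical, and Python's \w may match a slightly different character set than Perl's.

```python
import re
from collections import Counter


def word_counts(s):
    # Split on runs of whitespace, then keep only tokens that contain a
    # word character; this also discards the empty leading or trailing
    # tokens produced when the string starts or ends with whitespace.
    tokens = re.split(r"\s+", s)
    return Counter(t for t in tokens if re.search(r"\w", t))


def strings_equivalent_counts(a, b):
    counts_a, counts_b = word_counts(a), word_counts(b)
    # Step 2: both maps must have the same number of distinct words.
    if len(counts_a) != len(counts_b):
        return False
    # Step 3: every word must occur the same number of times in both.
    return all(counts_b.get(w, 0) == n for w, n in counts_a.items())


if __name__ == "__main__":
    print(strings_equivalent_counts("  a b  b", "b a b"))
    print(strings_equivalent_counts("a b", "a b c"))
```

Checking the number of distinct words first is what makes the one-directional loop sufficient: without it, a second string with extra words would slip through.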

Abelisto · Answer 4 · 2017-06-07T09:44:11.757

There are several answers here. Let's test them all :o)

--drop table if exists t;
--drop function if exists strings_equivalent_erwin(text, text);
--drop function if exists strings_equivalent_stefan(text, text);
--drop function if exists strings_equivalent_evan(text, text);
--drop function if exists strings_equivalent_my(text, text);
create table t(x text);
insert into t select concat((random()*100)::int, ' ',(random()*100)::int, ' ',(random()*100)::int) from generate_series(1,500);

CREATE OR REPLACE FUNCTION strings_equivalent_erwin(a text, b text)
  RETURNS bool AS
$func$
   SELECT a1 = b1
   FROM  (
      SELECT string_agg(w, ' ') AS a1
      FROM  (
         SELECT w
         FROM   unnest(string_to_array(a, ' ')) w
         ORDER  BY w
         ) a1
      ) a2
   , (
      SELECT string_agg(w, ' ') AS b1
      FROM  (
         SELECT w
         FROM   unnest(string_to_array(b, ' ')) w
         ORDER  BY w
         ) b1
      ) b2
   WHERE  length(a) = length(b)
   UNION ALL
   SELECT FALSE
   LIMIT 1;  -- for clarity, not needed
$func$ LANGUAGE sql IMMUTABLE;

create or replace function strings_equivalent_stefan(a text, b text) 
returns boolean as $$
select 
  case 
    when length(a) >= length(b) then ( 
      select count( exa1 ) from (
        select unnest( string_to_array( a, ' ' ) ) 
        except all
        select unnest( string_to_array( b, ' ' ) ) ) exa1 
      ) = 0
    when length(a) < length(b) then ( 
      select count( exa2 ) from (
        select unnest( string_to_array( b, ' ' ) ) 
        except all
        select unnest( string_to_array( a, ' ' ) ) ) exa2
      ) = 0
    else false
  end
$$ language sql immutable;

CREATE OR REPLACE FUNCTION strings_equivalent_evan(a text, b text)
RETURNS bool AS
$func$
   my (%arg1, %arg2);
   $arg1{$_}++ for grep /\w/, split /\s+/, $_[0];
   $arg2{$_}++ for grep /\w/, split /\s+/, $_[1];

   return 0 if length %arg1 != length %arg2;

   foreach ( keys %arg1 ) {
     return 0 if $arg1{$_} != $arg2{$_};
   }
   return 1;
$func$
LANGUAGE plperl
STRICT IMMUTABLE;

create or replace function strings_equivalent_my(a text, b text) returns bool immutable language plpythonu as $$
  return sorted(a.split(' ')) == sorted(b.split(' '))
$$;

The slowest is Stefan's variant. On my aged hardware it is:

explain analyse select * from t as t1 join t as t2 on (strings_equivalent_stefan(t1.x, t2.x));

Execution time: 37323.874 ms

Erwin's accepted answer is a bit faster (but not significantly):

explain analyse select * from t as t1 join t as t2 on (strings_equivalent_erwin(t1.x, t2.x));

Execution time: 29939.032 ms

Evan's variant is more than twice as fast:

explain analyse select * from t as t1 join t as t2 on (strings_equivalent_evan(t1.x, t2.x));

Execution time: 13452.271 ms

But the winner is my lovely Python :o) (and it is also much more readable)

explain analyse select * from t as t1 join t as t2 on (strings_equivalent_my(t1.x, t2.x));

Execution time: 5818.160 ms

Good luck.

PS: To be fair, there is also a regex variant that trims whitespace and compares case-insensitively:

create or replace function strings_equivalent_my(a text, b text) returns bool immutable language plpythonu as $$
  import re
  return sorted(re.split('\s+', a.strip().lower())) == sorted(re.split('\s+', b.strip().lower()))
$$;

It is much slower (Execution time: 12127.945 ms), but it is still faster than Perl.
