2

Postgres' full text search to_tsvector() function returns a sorted array of lexemes. I need exactly that, just with lexemes in original sort order. Is there any such Postgres function, or do I need to do it myself? Is so, how?

I need it for Jaro-Winkler string similarity function, for which order of strings/words is important. I need tsvector only because of language support for "lexemization". Actually, I just need to convert text, e.g.:

select array_to_string(tsvector_to_array(to_tsvector('english', 'I just want to lexemize this string and keep the original order of lexemes...')), ' ');

This is OK, except lexemes are ordered alphabetically instead of their original relative order in original text.

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633
zlatko
  • 175
  • 1
  • 6

2 Answers2

2

unnest(tsvector) is the key, like you already used in your answer. But the function can be a lot simpler - and improved for modern Postgres:

CREATE OR REPLACE FUNCTION public.tokenize_orig(_string text, _dictionary regconfig = 'english')
  RETURNS text[]
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
SELECT ARRAY(
   SELECT lexeme
   FROM   unnest(to_tsvector($2, $1)) arr
   ORDER  BY positions[1]
   );
$func$;

PARALLEL SAFE in Postgres 9.6 or later. See:

ARRAY constructor is simpler and cheaper than array_agg(). See:

I added 'english' as default for the function parameter _dictionary, so you can skip that unless you want a different dictionary. See:

The above function returns each lexeme once, in the position of its first appearance.
To keep all appearances:

CREATE OR REPLACE FUNCTION pg_temp.tokenize_orig3(_string text, _dictionary regconfig = 'english')
  RETURNS text[]
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE
BEGIN ATOMIC
SELECT ARRAY(
   SELECT lexeme
   FROM   unnest(to_tsvector($2, $1)) arr
        , unnest(positions) pos
   ORDER  BY pos
   );
END;

I made it a standard SQL function this time (requires at least Postgres 14), which is totally optional, but slightly more efficient. See:

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633
0

This function seems to work, Please, suggest if there is a better way.

CREATE OR REPLACE FUNCTION public.tokenize_orig(
    p_string text,
    p_dictionary regconfig)
  RETURNS text[] AS
$BODY$ 
/* This function turns documents into array of lexemes, keeping original order of lexemes. */ 

select array_agg(lexeme) 
from 
    (
    select (arr).lexeme as lexeme 
    from 
        (select unnest(tsv) arr 
        from to_tsvector($2, $1) as tsv) arr 
        where array_length((arr).positions, 1) > 0 
        order by (arr).positions
    ) as qry

$BODY$
  LANGUAGE sql IMMUTABLE STRICT
  COST 100;
zlatko
  • 175
  • 1
  • 6