Your first query was pretty close. To avoid the unwanted stemming, create a text search configuration using a simple dictionary, which does no stemming at all.
I suggest using a separate schema for text search objects, but that's totally optional:
CREATE SCHEMA ts;
GRANT USAGE ON SCHEMA ts TO public;
COMMENT ON SCHEMA ts IS 'text search objects';
CREATE TEXT SEARCH DICTIONARY ts.english_simple_dict (
   TEMPLATE = pg_catalog.simple
 , STOPWORDS = english
);

CREATE TEXT SEARCH CONFIGURATION ts.english_simple (COPY = simple);

ALTER TEXT SEARCH CONFIGURATION ts.english_simple
   ALTER MAPPING FOR asciiword WITH ts.english_simple_dict;  -- 1, 'Word, all ASCII'
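As a quick sanity check (the sample phrase is just an arbitrary example), you can compare the built-in english config to the new one:
SELECT to_tsvector('english', 'Hiking and running shoes') AS stemmed
     , to_tsvector('ts.english_simple', 'Hiking and running shoes') AS unstemmed;
The first column should show stemmed lexemes like 'hike' and 'run', while the second keeps 'hiking' and 'running' verbatim (just lowercased); 'and' is removed as a stop word in both.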
Then your query works, and it's very fast, too:
SELECT *
FROM ts_stat($$SELECT to_tsvector('ts.english_simple', title) FROM item$$)
ORDER BY ndoc DESC
LIMIT 50;
dbfiddle here
This works with lowercased words, without stemming, and doesn't break for non-ASCII letters.
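For example (the sample title is made up), accented characters pass through intact, just lowercased:
SELECT to_tsvector('ts.english_simple', 'The Crème Brûlée Recipes');
'Crème' and 'Brûlée' are tokens of type word rather than asciiword, which the copied simple config already maps to the simple dictionary, so you should get something like 'brûlée':3 'crème':2 'recipes':4, with 'The' dropped as a stop word.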
Background
Read the chapter Simple Dictionary in the manual.
The exact definition of a "word" is a tricky matter. The default text search parser (currently it's the only one) identifies 23 different types of tokens. See:
SELECT * FROM ts_token_type('default');
Built-in text search configurations map most of those to (built-in) dictionaries. Mappings for the english config:
SELECT tt.*, m.mapdict::regdictionary AS dictionary
FROM pg_ts_config_map m
LEFT JOIN ts_token_type(3722) tt ON tt.tokid = m.maptokentype  -- 3722 = OID of the default parser
WHERE mapcfg = 'english'::regconfig  -- or: 'ts.english_simple'::regconfig
ORDER BY tt.tokid;
The demo above creates a new configuration based on the simple config. Since all English stop words are of token type 'asciiword', we only need to remap that single token type to remove stop words; no stemming or anything else.
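To verify which dictionary handles each token and which tokens get dropped, ts_debug() is handy (the sample string is arbitrary):
SELECT alias, token, dictionaries, lexemes
FROM   ts_debug('ts.english_simple', 'The Hiking Club in Genève');
Stop words like 'The' and 'in' should show an empty lexemes array (removed), plain ASCII words come through lowercased but unstemmed, and 'Genève' arrives as token type word, handled by the plain simple dictionary.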