5

I have a very simple table:

CREATE TABLE content
(
  id serial NOT NULL,
  text text,
  fullfilename text,
  CONSTRAINT "PK_ID" PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE content
  OWNER TO postgres;

CREATE INDEX content_idx
  ON content
  USING gin
  (to_tsvector('danish'::regconfig, text));

Where text is a rather long aggregated text to perform full text searching.

When I run the following query it resorts to a seq scan:

EXPLAIN ANALYZE SELECT id
FROM content
WHERE to_tsvector('danish', text) @@ to_tsquery('danish','felter')

"Seq Scan on content  (cost=0.00..164.57 rows=249 width=4) (actual time=41.147..7235.823 rows=289 loops=1)"
"  Filter: (to_tsvector('danish'::regconfig, text) @@ '''felt'''::tsquery)"
"  Rows Removed by Filter: 1149"
"Planning time: 0.366 ms"
"Execution time: 7235.914 ms"

That is absolutely not okay. But when disabling seq scans with SET enable_seqscan TO 'off' I get the following result:

"Bitmap Heap Scan on content  (cost=17.94..168.53 rows=249 width=4) (actual time=0.145..0.323 rows=289 loops=1)"
"  Recheck Cond: (to_tsvector('danish'::regconfig, text) @@ '''felt'''::tsquery)"
"  Heap Blocks: exact=70"
"  ->  Bitmap Index Scan on content_idx  (cost=0.00..17.87 rows=249 width=0) (actual time=0.121..0.121 rows=289 loops=1)"
"        Index Cond: (to_tsvector('danish'::regconfig, text) @@ '''felt'''::tsquery)"
"Planning time: 0.373 ms"
"Execution time: 0.383 ms"

What exactly is going on, and what parameters needs tuning for it to run better. I don't like the idea of removing a tool from the query planners toolbox. As it stands, the setup is running a stock PostgreSQL 9.4.4

hvidgaard
  • 205
  • 2
  • 7

3 Answers3

8

The problem here is that it thinks to_tsvector is a very cheap operation to perform, even when performing it on an entire table. Basically, it think it is the same cost as adding together two integers. But of course it is not. In the upcoming release 9.5, the default cost for that function has been increased 100-fold (which is probably still not enough to be accurate...but it should be enough to tip the scales for your case, which is all you need).

You can change the cost of that function yourself:

alter function to_tsvector ( regconfig, text) cost 1000;

(Note that because this is a built-in function, changes you make will not survive a pg_upgrade, or a dump, initdb and restore cycle)

This is much safer than fiddling around with the global cost parameters.

The general (global) cost parameters are described here, and ones for specific functions are described here. The way the two interact with each other don't seem to be well-described anywhere in the docs.

jjanes
  • 42,332
  • 3
  • 44
  • 54
6

I encountered this problem myself once, and for my case, I was able to fix it by setting the random_page_cost to a low value for just the query, as

SET random_page_cost = 0.01;

Now, this will of course have set your random page cost low, and you may not want to keep it that way. So, you could do it locally for a single transaction as

BEGIN;

SET LOCAL random_page_cost = 0.01;

SELECT id
FROM content
WHERE to_tsvector('danish', text) @@ to_tsquery('danish','felter');

COMMIT;

In my case, this cleared up the issue and allowed me to "hack" the query and force a bitmap scan.

If this does solve your problem, you could build this as a function as well:

CREATE OR REPLACE FUNCTION matched_ids(in_text text)
RETURNS SETOF content.id%TYPE AS
$$

  SET LOCAL random_page_cost = 0.01;

  SELECT id
  FROM content
  WHERE to_tsvector('danish', text) @@ to_tsquery('danish',in_text);

$$ LANGUAGE SQL;

Then call it using

SELECT * FROM matched_ids('felter');
Chris
  • 2,457
  • 17
  • 20
4

The core of your problem is here:

As it stands, the setup is running a stock PostgreSQL 9.4.4

Stock installations typically come with very conservative settings that are not good for bigger databases. For instance random_page_cost of 4.0, which is fine for casual use and minimal resources, but you need much more aggressive settings for a serious DB server with plenty of RAM. Like random_page_cost = 1.5 or lower. With modern SSD drives and plenty of RAM go down to 1.1 or even lower. You have to calibrate some other settings, too. Basic instructions:

Related:

Hacks to force a different query plan

If cost estimates are still off, after you have configured the server properly, you can still use one of the evil hacks you or @Chris mentioned to force a different query plan. Here is a simple SQL function:

CREATE OR REPLACE FUNCTION matched_ids(_search text)
  RETURNS SETOF content
  LANGUAGE sql STABLE SET random_page_cost = 0.01 AS
$func$
   SELECT *
   FROM   content
   WHERE  to_tsvector('danish', text) @@ to_tsquery('danish', _search);
$func$;

Or (more radically):

...
       SET enable_seqscan = off;

Call:

SELECT * FROM matched_ids('felter');

I supply the configuration parameter to the function instead of using SET LOCAL. There is a subtle but very important difference, the documentation:

The SET clause causes the specified configuration parameter to be set to the specified value when the function is entered, and then restored to its prior value when the function exits.

Bold emphasis mine. SET LOCAL would persist for the duration of the transaction. If you use this function in the context of a bigger transaction or a single nested query, that might be poison for the rest.

You could use SET LOCAL in the function instead and RESET at the end, but this can still have a different effect. You would really need to remember the current setting when entering the function and restore that one ...

Warning: hacks with "forged" cost settings (especially the 2nd) can turn on you later, when they are forgotten and a new Postgres wouldn't need an evil hack any more. This can also produce poor results for very common search terms where a sequential scan would be faster.

This function returns the whole table row. Use RETURNS TABLE(...) matching your actual return type to return selected columns.

Aside: Don't use a basic type name like text as identifier, that's prone to confusing errors.

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633