Improving performance of count where only result > 0 matters

Question

I have following query:

select count(*) -- rp.tag
from submission 
inner join post rp on submission.id = rp.submission_id 
where user_name = ?1
and lower(rp.tag) like ('%verified%')
and not lower(rp.tag) like ('%unverified%');

If the result is greater than 0 I do something in my application for this user.

I run this query against a big list of users and it takes close to an hour to finish the whole process. And I only need to know whether the result is greater than 0.

Is there a way to finish the query early if any of the rp.tag match my condition?

Erwin Brandstetter · Accepted Answer · 2019-04-21T22:18:25.350

You can add LIMIT 1, but LIMIT is applied after aggregation (so after the count). Adding it in the same query level does nothing useful, since there is only 1 row left to return anyway: the row count. Would be misleading nonsense.

If you want to go that route, add LIMIT in a subquery and count in the outer SELECT, like:


SELECT count(*) FROM (
   SELECT             -- empty SELECT list
   FROM   submission
   JOIN   ...
   WHERE  ...
   LIMIT 1
   ) sub

For this special purpose it's fastest to leave the SELECT list empty. count(*) only counts the existence of rows without regard to any columns.

You can do that with any LIMIT, really. To check whether there are at least N rows. See:

Fast way to discover the row count of a table in PostgreSQL

Checking for N = 1 is just your special case. And while we only care whether any rows exist and not how many, EXISTS is the better, faster tool:

SELECT EXISTS (
   SELECT FROM submission s
   JOIN   post p ON s.id = p.submission_id 
   WHERE  s.user_name = ?
   AND    p.tag ~* 'verified'
   AND    NOT p.tag ~* 'unverified'
   );

Without index, lower(col) LIKE '%something%' is a bit faster than col ~* 'something'. But if your tables are big and performance matters (like your question indicates), you should have matching indexes to begin with, a trigram index on tag to be precise. Assuming tag is from table post (you left room for interpretation there):

CREATE INDEX post_tag_gin_trgm_idx ON post USING gin (tag gin_trgm_ops);

Then the simpler expression is just as fast. See:

Depending on cardinalities and data distribution it might make sense to combine that with submission_id in a multicolumn index:

CREATE INDEX post_foo_trgm_idx ON post USING gin (submission_id, tag gin_trgm_ops);

You'd need the additional module btree_gin for that. See:

Best index for similarity function

And it might make sense to have another multicolumn index on submission(id, user_name) - or on submission(user_name, id), again depending on the mentioned details. The first variant is more likely to be useful in case Postgres expects the predicate on submission.user_name to be more selective.

(Your handling of tags might be optimized further.)

Aside:

If the result is greater than 0 I do something in my application for this user.

If that something is another SQL DML statement, make it a single statement for additional gains. Example:

UPDATE foo
SET    bar = 'baz'
WHERE  EXISTS ( ...) -- like above
AND    bar IS DISTINCT FROM 'baz' -- avoid empty updates

About the added last line:

How do I (or can I) SELECT DISTINCT on multiple columns?

Or you embed the expression in plpgsql code for procedural needs, like demonstrated here:

PL/pgSQL checking if a row exists

Improving performance of count where only result > 0 matters

1 Answers1