
Here is the query:

SELECT "products".* 
FROM "products"
WHERE (status > 100) 
AND "products"."above_revenue_average" = 't' 
AND ("products"."category_id" NOT IN (5))
ORDER BY "products"."start_date" DESC 

I have an index on status and start_date.

Each time I run the query from my application, I get the following in the logs:

[WHITE] temporary file:
path "pg_tblspc/16386/PG_9.3_201306121/pgsql_tmp/pgsql_tmp2544.0", size 37093376 
Query: SELECT "products".* FROM "products"  WHERE (status > 100) 
AND "products"."above_revenue_average" = 't' 
AND ("products"."category_id" NOT IN (5))  ORDER BY "products"."start_date" DESC 

I believe this temporary file creation is the cause of the slow performance.

Running EXPLAIN ANALYZE, I get the following results:

QUERY PLAN
 Sort  (cost=63395.28..63403.51 rows=16460 width=524)
       (actual time=524.134..562.635 rows=65294 loops=1)
   Sort Key: start_date
   Sort Method: external merge  Disk: 36224kB
   ->  Bitmap Heap Scan on products  
        (cost=4803.40..60389.73 rows=16460 width=524)
        (actual time=27.390..397.879 rows=65294 loops=1)
         Recheck Cond: (status > 100)
         Filter: (above_revenue_average AND (category_id <> 5))
         Rows Removed by Filter: 25115
         ->  Bitmap Index Scan on index_products_on_status
            (cost=0.00..4802.58 rows=89662 width=0) 
            (actual time=18.006..18.006 rows=90411 loops=1)
               Index Cond: (status > 100)
 Total runtime: 577.870 ms
(10 rows)

I then used http://explain.depesz.com/ to make it a bit more readable:

Per node type stats

+-------------------+-------+--------------+------------+
|     node type     | count | sum of times | % of query |
+-------------------+-------+--------------+------------+
| Bitmap Heap Scan  |     1 | 379.873 ms   | 67.5 %     |
| Bitmap Index Scan |     1 | 18.006 ms    | 3.2 %      |
| Sort              |     1 | 164.756 ms   | 29.3 %     |
+-------------------+-------+--------------+------------+

Per table stats

+------------------+------------+--------------+------------+
|    Table name    | Scan count |  Total time  | % of query |
+------------------+------------+--------------+------------+
| products         | 1          | 379.873 ms   | 67.5 %     |
+------------------+------------+--------------+------------+
|    scan type     | count      | sum of times | % of table |
+------------------+------------+--------------+------------+
| Bitmap Heap Scan | 1          | 379.873 ms   | 100.0 %    |
+------------------+------------+--------------+------------+

Can I increase the database performance by adding some more indexes? Maybe some composite ones? Any ideas?

bnussey

3 Answers


work_mem

Obviously, the sort operation spills to disk:

Sort Method: external merge  Disk: 36224kB

More work_mem can help this query, as @Kassandry already suggested. Increase the setting until you see Memory instead of Disk in the EXPLAIN output. But it's probably a bad idea to raise the global setting based on a single query; the proper value depends on available RAM and your overall workload. Start by reading the relevant section in the Postgres Wiki.

To fix just this query, raise work_mem for the current transaction only, with SET LOCAL inside the same transaction:

BEGIN;
SET LOCAL work_mem = '45MB';
SELECT ...;
COMMIT; -- or ROLLBACK

You probably need a bit more than 40 MB, since the in-memory representation of the data is a bit bigger than the on-disk representation.

Query

Your query (after trimming some noise):

SELECT * 
FROM   products
WHERE  status > 100
AND    above_revenue_average  -- boolean can be used directly
AND    category_id <> 5
ORDER  BY start_date DESC;

Your rows are half a kilobyte wide (width=524). Do you need to return all columns? (Typically, you don't.) List only the columns you actually need in the SELECT list to improve overall performance, especially since you already have work_mem issues.
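
For example, a sketch with a placeholder column list (id and name stand in for whichever columns you actually need):

SELECT id, name, start_date   -- placeholder columns; pick the ones you really use
FROM   products
WHERE  status > 100
AND    above_revenue_average
AND    category_id <> 5
ORDER  BY start_date DESC;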

Can any of the involved columns be NULL? That's particularly important for category_id and start_date; you may need to adapt the query and index in that case ...

Index

A multicolumn index can certainly help performance (as @Paul outlined), but you have to weigh cost and gain. If this query is important or very common, go for it. Still, don't create a special index for every query: as few as possible, as many as necessary. Indexes are more useful when shared by many queries, and that also increases the chances that more of them stay in cache.

A boolean column like above_revenue_average is a typical candidate for a condition in a partial index rather than for an index column.

My wild guess based on incomplete information:

CREATE INDEX prod_special_idx ON products (start_date DESC)
WHERE  above_revenue_average
AND    status > 100
AND    category_id <> 5;

Use DESC NULLS LAST in both the index and the query if start_date can be NULL.
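
A sketch of that variant, under the same assumptions as above (it replaces the index shown before):

CREATE INDEX prod_special_idx ON products (start_date DESC NULLS LAST)
WHERE  above_revenue_average
AND    status > 100
AND    category_id <> 5;

SELECT *
FROM   products
WHERE  status > 100
AND    above_revenue_average
AND    category_id <> 5
ORDER  BY start_date DESC NULLS LAST;  -- sort order must match the index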

Erwin Brandstetter

One of the easiest and most effective ways to increase performance for this sort of query is to execute SET work_mem = '40MB' (the sort spills roughly 36 MB to a temp file, and a little extra often helps), then run your query and see whether the EXPLAIN ANALYZE plan changes from an on-disk to an in-memory sort. Afterwards, run RESET work_mem to put the value back to the default from your postgresql.conf.
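
A sketch of that workflow, using the query from the question:

SET work_mem = '40MB';   -- session-level change, affects only this connection

EXPLAIN ANALYZE
SELECT "products".*
FROM   "products"
WHERE  (status > 100)
AND    "products"."above_revenue_average" = 't'
AND    ("products"."category_id" NOT IN (5))
ORDER  BY "products"."start_date" DESC;
-- look for "Sort Method: quicksort  Memory: ..." instead of "external merge  Disk: ..."

RESET work_mem;          -- back to the default from postgresql.conf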

The basic idea is covered under work_mem in the PostgreSQL wiki article Tuning Your PostgreSQL Server. Changing this setting for an individual query should give you the benefits while avoiding the downsides mentioned there.

Hope that helps. =)

Kassandry

Depending on how selective the combined predicates are, I would imagine a good index for this particular query would be:

CREATE INDEX index_name
ON products
    (above_revenue_average ASC, start_date DESC)
WHERE
    status > 100
    AND category_id <> 5;

The SELECT * is potentially problematic because the index above does not contain all the columns. If that causes the planner not to choose the index, you could fetch just the primary key values first, then join back to the table to collect the remaining columns for only those keys.
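
A minimal sketch of that pattern, assuming the primary key column is id (not shown in the question):

-- Fetch only the keys in the desired order, then join back for the remaining columns.
SELECT p.*
FROM  (
        SELECT id                    -- assumed primary key
        FROM   products
        WHERE  status > 100
        AND    above_revenue_average
        AND    category_id <> 5
        ORDER  BY start_date DESC
      ) AS k
JOIN   products AS p ON p.id = k.id
ORDER  BY p.start_date DESC;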

Paul White