
I'm trying to create a report on my data, but the query is really slow on a big table.

The table structure is:

CREATE TABLE posts
(
  id serial NOT NULL,
  project_id integer,
  moderation character varying(255),
  keyword_id integer,
  author_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  server_id character varying(255),
  social_creation_time integer,
  social_id character varying(255),
  network character varying(255),
  mood character varying(255) DEFAULT NULL::character varying,
  url text,
  source_id integer,
  location character varying(255),
  subject_id integer,
  conversation_id integer,
  CONSTRAINT posts_pkey PRIMARY KEY (id)
);

CREATE INDEX index_posts_on_author_id ON posts (author_id);
CREATE INDEX index_posts_on_keyword_id ON posts (keyword_id);
CREATE INDEX index_posts_on_project_id_and_network_and_social_id 
    ON posts (project_id, network, social_id);
CREATE INDEX index_posts_on_project_id_and_social_creation_time 
    ON posts (project_id, social_creation_time DESC);
CREATE INDEX index_posts_on_server_id ON posts (server_id);
CREATE INDEX index_posts_on_social_id ON posts (social_id);

The query:

SELECT date_trunc('hour', timestamp 'epoch' 
           + (posts.social_creation_time * INTERVAL '1 second')) creating,
       network, 
       count(*) posts 
FROM posts 
WHERE posts.project_id = 7 
   AND (posts.moderation NOT IN ('junk','spam')) 
   AND (posts.social_creation_time BETWEEN 1391716800 AND 1392839999) 
GROUP BY network, creating 
ORDER BY creating

The count is 3940689.

Explain plan

 GroupAggregate  (cost=631282.11..671932.05 rows=338750 width=12) (actual time=22576.318..23826.124 rows=1776 loops=1)
   ->  Sort  (cost=631282.11..639750.85 rows=3387494 width=12) (actual time=22576.188..23438.485 rows=3536790 loops=1)
         Sort Key: (date_trunc('hour'::text, ('1970-01-01 00:00:00'::timestamp without time zone + ((social_creation_time)::double precision * '00:00:01'::interval)))), network
         Sort Method: external merge  Disk: 92032kB
         ->  Seq Scan on posts  (cost=0.00..205984.62 rows=3387494 width=12) (actual time=29.542..1954.865 rows=3536790 loops=1)
               Filter: (((moderation)::text <> ALL ('{junk,spam}'::text[])) AND (social_creation_time >= 1391716800) AND (social_creation_time <= 1392839999) AND (project_id = 7))
               Rows Removed by Filter: 404218
 Total runtime: 23842.532 ms
(8 rows)

Time: 23860.876 ms

It's a seq scan, but forcing it to use indexes doesn't help either:

GroupAggregate  (cost=815927.00..856583.47 rows=338804 width=12) (actual time=24634.378..25873.754 rows=1778 loops=1)
   ->  Sort  (cost=815927.00..824397.09 rows=3388039 width=12) (actual time=24634.243..25498.578 rows=3537295 loops=1)
         Sort Key: (date_trunc('hour'::text, ('1970-01-01 00:00:00'::timestamp without time zone + ((social_creation_time)::double precision * '00:00:01'::interval)))), network
         Sort Method: external merge  Disk: 92048kB
         ->  Bitmap Heap Scan on posts  (cost=191020.29..390555.96 rows=3388039 width=12) (actual time=4074.171..5685.734 rows=3537295 loops=1)
               Recheck Cond: (project_id = 7)
               Filter: (((moderation)::text <> ALL ('{junk,spam}'::text[])) AND (social_creation_time >= 1391716800) AND (social_creation_time <= 1392839999))
               Rows Removed by Filter: 67925
               ->  Bitmap Index Scan on index_posts_on_project_id_and_network_and_social_id  (cost=0.00..190173.29 rows=3617164 width=0) (actual time=4054.817..4054.817 rows=3605225 loops=1)
                     Index Cond: (project_id = 7)
 Total runtime: 25891.215 ms

An example row from the table:

   id   | project_id | moderation | keyword_id | author_id |         created_at         |         updated_at         | server_id | social_creation_time |    social_id    |  network  | mood |                url                 | source_id |      location      | subject_id | conversation_id
---
 204202 |          2 | pending    |            |    125845 | 2014-01-22 15:14:14.786454 | 2014-01-22 15:14:14.786454 | 20620977  |           1390318030 | -64193113_14905 | vkontakte |      | https://vk.com/wall-64193113_14905 |        64 | Россия, Черепаново |            |

** UPDATE **

What really helped was increasing work_mem.

My new plan:

HashAggregate  (cost=247145.17..254270.53 rows=356268 width=12) (actual time=2564.201..2564.731 rows=1853 loops=1)
    ->  Seq Scan on posts  (cost=0.00..220425.11 rows=3562675 width=12) (actual time=32.916..1914.618 rows=3729876 loops=1)
         Filter: (((moderation)::text <> ALL ('{junk,spam}'::text[])) AND (social_creation_time >= 1391716800) AND (social_creation_time <= 1392839999) AND (project_id = 7))
         Rows Removed by Filter: 501865
Total runtime: 2566.071 ms
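For reference, the setting can be raised for the current session like this (the exact value I ended up with isn't shown here; 256MB below is purely illustrative):

SET work_mem = '256MB';   -- illustrative value, not the one actually used

-- ... run the report query ...

RESET work_mem;           -- back to the server default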

UPDATE #2: I'm thinking about adding an integer column and storing the date as e.g. 20140220 (YYYYMMDD). What do you think, would that improve performance?
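What I have in mind would look roughly like this (just a sketch; creation_day is a placeholder name):

ALTER TABLE posts ADD COLUMN creation_day integer;

UPDATE posts
SET    creation_day = to_char(timestamp 'epoch'
           + social_creation_time * interval '1 second', 'YYYYMMDD')::int;

CREATE INDEX index_posts_on_creation_day ON posts (creation_day);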

PS: Sorry, everyone, for my bad English.

2 Answers


In addition to the good advice by @Craig and @dezso:

Statistics

The count is 3940689.

Yet, your query plan says:

Seq Scan on posts  (cost=0.00..205984.62 rows=**3387494** width=12)

And your count is based on a selection:

Rows Removed by Filter: 404218

4344907 (3940689 + 404218) >> 3387494. Your statistics are not up to date. Something may be wrong with your autovacuum settings, which (among other things) runs ANALYZE automatically. That is very bad for overall db performance. Before you retry anything, run this for the query at hand:

ANALYZE posts

If you can afford a lock on the table for some time, run

VACUUM FULL ANALYZE posts

to clean house. More here.
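To check whether autovacuum has been analyzing the table at all, the standard statistics view pg_stat_user_tables can be queried, for instance:

SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'posts';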

Index

The numbers indicate your query uses about 90 % of all rows. Therefore, a sequential scan will be faster than any possible index scan - except for a covering index (index-only scan). Requires Postgres 9.2+. Be sure to read the Postgres Wiki on the topic.

Since you only use two small columns out of a long list of columns, such an index would be smaller and faster. While you're at it, and depending on your overall requirements, a tailored index like the following would squeeze out the maximum performance: a partial, functional, multicolumn, covering index - at some cost to write operations:

CREATE INDEX test_idx ON posts (
   date_trunc('hour', timestamp 'epoch' + social_creation_time * interval  '1 sec')
  ,network)
WHERE moderation NOT IN ('junk','spam')
AND   project_id = 7                         -- ??
AND   social_creation_time BETWEEN 1391716800 AND 1392839999 -- ??

The actual WHERE conditions depend on your requirements and have to appear in a more or less identical form in any query that wants to use this index. Trim rows that are never needed from the index, but only use conditions that eliminate more than a few rows; tailor the index to a superset of the rows your queries need.
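For example, a query that can use test_idx (ideally with an index-only scan) has to repeat the index expression and predicates more or less verbatim - a sketch using the constants from the question:

SELECT date_trunc('hour', timestamp 'epoch'
           + social_creation_time * interval '1 sec') AS creating,
       network,
       count(*) AS posts
FROM   posts
WHERE  moderation NOT IN ('junk', 'spam')
AND    project_id = 7
AND    social_creation_time BETWEEN 1391716800 AND 1392839999
GROUP  BY network, creating
ORDER  BY creating;

Verify with EXPLAIN ANALYZE that the plan actually shows an Index Only Scan.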

Generally, covering indexes work well for fairly static tables. Read the Wiki. A quick test:

SELECT relallvisible, relpages
FROM   pg_class
WHERE  oid = 'posts'::regclass

If relallvisible is not much smaller than relpages, chances are good. Be sure autovacuum is running properly before you try this.

I would also test without the functional aspect to see which is used / faster:

CREATE INDEX test_idx ON posts (social_creation_time, network)
WHERE moderation NOT IN ('junk','spam')
AND   project_id = 7                         -- ??
AND   social_creation_time BETWEEN 1391716800 AND 1392839999; -- ??

Search for related answers dealing with partial covering indexes.

Table layout

Finally, in your table definition, integer and text columns alternate, which bloats the table quite a bit due to data alignment and padding. More in this related answer:
Configuring PostgreSQL for read performance
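If you want to see the alignment requirements behind this, the per-column typlen / typalign values can be read from the system catalogs, e.g.:

SELECT a.attname, t.typname, t.typlen, t.typalign
FROM   pg_attribute a
JOIN   pg_type t ON t.oid = a.atttypid
WHERE  a.attrelid = 'posts'::regclass
AND    a.attnum > 0
AND    NOT a.attisdropped
ORDER  BY a.attnum;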

I would recreate your table along these lines:

CREATE TABLE post (
  post_id serial PRIMARY KEY,
  project_id integer,
  created_at timestamp,
  updated_at timestamp,
  keyword_id integer,
  author_id integer,
  source_id integer,
  subject_id integer,
  conversation_id integer,
  social_creation_time integer,

  server_id text,  -- could be integer?
  social_id text,
  moderation text,
  network text,
  url text,
  location text,
  mood text
);

This will be a bit smaller, helping overall performance.
See also: Why text instead of varchar(255)?
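To compare the on-disk footprint before and after such a rewrite, the standard size functions can be used:

SELECT pg_size_pretty(pg_total_relation_size('posts')) AS total_size,
       pg_size_pretty(pg_relation_size('posts'))       AS heap_size;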

Erwin Brandstetter
From your plan:

     Sort Method: external merge  Disk: 92048kB

Throw more work_mem at the problem. Lots more. Try:

SET LOCAL work_mem = '300MB';

Be aware that if you run this in lots of concurrent connections you could exhaust system RAM, so only SET it in individual sessions.
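SET LOCAL only takes effect inside a transaction block and reverts automatically at COMMIT or ROLLBACK, so a safe pattern is:

BEGIN;
SET LOCAL work_mem = '300MB';  -- applies to this transaction only

-- ... run the report query ...

COMMIT;                        -- work_mem reverts to its previous value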

Your row-count estimate on the aggregate is a bit dodgy (http://explain.depesz.com/s/RXbq) but not too bad. The killer seems to be that big sort.

Craig Ringer