Optimizing select count result in Postgresql

Question

I'm trying to optimize a table containing 80million+ rows. It takes 20+ minutes to get the row count results. I've tried clustering, vacuum full and reindex but the performance didn't improve. What do i need to configure or adjust in order to improve data query and retrieval? I'm using Postgresql 12 under Windows 2019.

Update info:

Total rows now around 92million+
Table column count = 44

Explain query result using 'select count(*) from doc_details':  
    Finalize Aggregate  (cost=5554120.84..5554120.85 rows=1 width=8) (actual time=1249204.001..1249210.027 rows=1 loops=1)  
  ->  Gather  (cost=5554120.63..5554120.83 rows=2 width=8) (actual time=1249203.642..1249210.020 rows=3 loops=1)  
        Workers Planned: 2  
        Workers Launched: 2  
        ->  Partial Aggregate  (cost=5553120.63..5553120.63 rows=1 width=8) (actual time=1249153.615..1249153.616 rows=1 loops=3)  
              ->  Parallel Seq Scan on doc_details  (cost=0.00..5456055.30 rows=38826130 width=0) (actual time=3.793..1245165.604 rows=31018949 loops=3)  
Planning Time: 1.290 ms  
Execution Time: 1249210.115 ms

(I don't know how to get the row size in kb/mb)

Machine Info:

Windows 2019 Datacenter
32GB Memory
Postgresql 12

Table info :

                                                Table "public.doc_details"
         Column          |              Type              | Collation | Nullable |                   Default
-------------------------+--------------------------------+-----------+----------+----------------------------------------------
 id                      | integer                        |           | not null | nextval('doc_details_id_seq'::regclass)
 trans_ref_number        | character varying(30)          |           | not null |
 outbound_time           | timestamp(0) without time zone |           |          |
 lm_tracking             | character varying(30)          |           | not null |
 cargo_dealer_tracking   | character varying(30)          |           | not null |
 order_sn                | character varying(30)          |           |          |
 operations_no           | character varying(30)          |           |          |
 box_no                  | character varying(30)          |           |          |
 box_size                | character varying(30)          |           |          |
 parcel_weight_kg        | numeric(8,3)                   |           |          |
 parcel_size             | character varying(30)          |           |          |
 box_weight_kg           | numeric(8,3)                   |           |          |
 box_volume              | integer                        |           |          |
 parcel_volume           | integer                        |           |          |
 transportation          | character varying(100)         |           |          |
 channel                 | character varying(30)          |           |          |
 service_code            | character varying(20)          |           |          |
 country                 | character varying(60)          |           |          |
 destination_code        | character varying(20)          |           |          |
 assignee_name           | character varying(100)         |           |          |
 assignee_province_state | character varying(30)          |           |          |
 assignee_city           | character varying(30)          |           |          |
 postal_code             | character varying(20)          |           |          |
 assignee_telephone      | character varying(30)          |           |          |
 assignee_address        | text                           |           |          |
 shipper_name            | character varying(100)         |           |          |
 shipper_country         | character varying(60)          |           |          |
 shipper_province        | character varying(30)          |           |          |
 shipper_city            | character varying(30)          |           |          |
 shipper_address         | text                           |           |          |
 shipper_telephone       | character varying(30)          |           |          |
 package_qty             | integer                        |           |          |
 hs_code                 | integer                        |           |          |
 hs_code_manual          | integer                        |           |          |
 reviewed                | boolean                        |           |          |
 created_at              | timestamp(0) without time zone |           |          |
 updated_at              | timestamp(0) without time zone |           |          |
 invalid                 | boolean                        |           |          |
 arrival_id              | integer                        |           |          |
 excel_row_number        | integer                        |           |          |
 is_additional           | boolean                        |           |          |
 arrival_datetime        | timestamp(6) without time zone |           |          |
 invoice_date            | timestamp without time zone    |           |          |
 unit_code               | character varying(100)         |           |          |
Indexes:
    "doc_details_pkey" PRIMARY KEY, btree (id) CLUSTER
    "doc_details_box_no_idx" btree (box_no)
    "doc_details_trans_ref_number_idx" btree (trans_ref_number)
Triggers:
    trigger_log_awb_box AFTER INSERT ON doc_details FOR EACH ROW EXECUTE FUNCTION log_awb_box()

John K. N. · Accepted Answer · 2022-10-01T08:02:55.877

From the PostgreSQL Wiki:

The reason is related to the MVCC implementation in PostgreSQL. The fact that multiple transactions can see different states of the data means that there can be no straightforward way for "COUNT(*)" to summarize data across the whole table. PostgreSQL must walk through all rows to determine visibility. This normally results in a sequential scan reading information about every row in the table.

^{Reference: Slow Counting (PostgreSQL Wiki)}

Because of that, there is no faster way (for PostgreSQL) to read the 94 million + rows. PostgreSQL is going to painstakingly read row for row as can be seen in your explain plan.

Possible Solutions

Increasing the shared_buffers setting in the postgresql.conf file might help alleviate the performance issue slightly, by allowing PostgreSQL to store more data in memory.

Sets the amount of memory the database server uses for shared memory buffers. The default is typically 128 megabytes (128MB), but might be less if your kernel settings will not support it (as determined during initdb). This setting must be at least 128 kilobytes. However, settings significantly higher than the minimum are usually needed for good performance. If this value is specified without units, it is taken as blocks, that is BLCKSZ bytes, typically 8kB. (Non-default values of BLCKSZ change the minimum value.) This parameter can only be set at server start.

If you have a dedicated database server with 1GB or more of RAM, a reasonable starting value for shared_buffers is 25% of the memory in your system. There are some workloads where even larger settings for shared_buffers are effective, but because PostgreSQL also relies on the operating system cache, it is unlikely that an allocation of more than 40% of RAM to shared_buffers will work better than a smaller amount. Larger settings for shared_buffers usually require a corresponding increase in max_wal_size, in order to spread out the process of writing large quantities of new or changed data over a longer period of time.

On systems with less than 1GB of RAM, a smaller percentage of RAM is appropriate, so as to leave adequate space for the operating system.

^{Reference: 20.4. Resource Consumption (PostgreSQL Documentation)}

Biller Builder · Answer 2 · 2022-10-01T10:52:47.173

At this scale you have to resort to tricks to handle counting and their effectiveness depends entirely on your business logic.
The most "simple" one is just cache the result in the application layer and run the 20 minutes long counting query as a background task which would update the cache on its completion. This ofc means the count number will always be behind the "real" one by 20 minutes + cooldown between runs.
If the exact amount isn't critical then you can use the advice from this article.
Another approach is to rely on the fact that you know the exact amount of items on INSERT and DELETE queries and therefore can increment/decrement the "current" value on the successful commit. Let's say we have a table for counts:

CREATE TABLE table_counts (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  created_at timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
  schema text NOT NULL,
  name text NOT NULL,
  type text NOT NULL,
  count bigint NOT NULL
)

The type column is a discriminator key for conditional counts of schema and name combos. For each INSERT and DELETE query you will also increase/decrease count value in the related row. This approach requires query and db changes on top of introducing maintenance overhead, but will yield (almost) precise counts. The drawback it can only work if the application is the only consumer of the database and therefore has a complete control over data-modifying queries. And migrations involving changes to foreign keys (as they can affect the count numbers implicitly) will become a complete PITA, so you'd have to run the background task mentioned in the start every now and then too.

score 0 · Answer 3 · answered Dec 22 '23 at 04:51

0

This article makes performance tests:

https://www.citusdata.com/blog/2016/10/12/count-performance/

answered Dec 22 '23 at 04:51

Karl Zillner

101
1

Optimizing select count result in Postgresql

3 Answers3

Possible Solutions

Linked