0

We have a query that intermittently runs very poorly for no apparent reason.

We have Postgres 14.13 configured with pg.log_min_duration_statement set to 30000.

We have a query that runs as part of our core hosting service that is basically

SELECT * FROM table 
LEFT OUTER JOIN table2 ON table.table2_id = table2_id 
WHERE "table"."url" = ? LIMIT 1;

This runs thousands of times every hour, and is fine, but then occasionally we'll get a log saying it took minutes.

Here's an example log

 pid=864204,user=service_account,db=app_db,app=pg-node,client=XX.XX.XX.XX LOG:  duration: 156414.095 ms  statement:  SELECT * FROM table LEFT OUTER JOIN table2 ON table.table2_id = table2_id WHERE "table"."url" = ? LIMIT 1;

Here's an example log for the auto_explain

pid=2377906,user=service_account,db=app_database,app=pg-node,client=XX.XX.XX.XX LOG: duration: 156414.095 ms plan:
Query Text: SELECT * ...
Limit (cost=5.29..8.33 rows=1 width=741) (actual time=0.153..0.157 rows=1 loops=1
-> Nested Loop Left Join (cost=5.29..8.33 rows=1 width=741) (actual time=0.152..0.154 rows=1 loops=1
-> Bitmap Heap Scan on XXXX (cost=5.01..6.02 rows=1 width=546) (actual time=0.112..0.113 rows=1 loops=1)
Recheck Cond: (XXXXXX)
Filter: ((XXXXX) AND (XXXXX))
 Heap Blocks: exact=1
Bitmap Index Scan on XXXXX (cost=0.00..5.01 rows=1 width=0) (actual time=0.097..0.097 rows=1 loops=1)
Index Cond: (XXXXX)
-> Index Scan using XXXX on table (cost=0.29..2.30 rows=1 width=422) (actual time=0.037..0.037 rows=1 loops=1)
Index Cond: (XXXXX)
Filter: (XXXXX)

Metrics show no extra load on the server (CPU is below 20%, memory is unchanged, there's no spike in other queries).

We turned on auto_explain to try and track this down, but the analysis shows that all parts of the query took <1ms

The database where this is occurring is a read replica.

We've contacted support for the hosting service providing the database. They've suggested that network errors with the connection to the client could be causing the issue.

Is this possible? If not, what other explanations could there be?

ChrisJ
  • 621
  • 1
  • 8
  • 22

1 Answers1

0

Network time is reported as part of the "duration" field but not part of the "actual time" field, so it is plausible that a network freeze up could account for this if it prevented the response from being delivered for a long time. But I think it is unlikely unless the entire response doesn't fit in one TCP/pipe buffer, as I think that PostgreSQL only waits for the server's kernel to accept all the data, not for the client machine to ACK the data. So I think it is more likely that the row being selected has a very large value in it, and it is taking a long time to send that value over the network.

Is it just some rows which reproducibly cause the problem? or do all rows do it but not consistently?

jjanes
  • 42,332
  • 3
  • 44
  • 54