-1

I have an "insert only" table that store user information

  • id bigint -> Primary key, not null
  • user_id bigint -> foreign key, not null
  • first_name character varying null
  • middle_name character varying null
  • last_name character varying not null
  • email character varying not null
  • birth_date date not null
  • phone_number character varying not null
  • mobile_number character varying not null
  • street_name character varying null

and a user table (insert + update) (most of the times, I don't query this table)

  • id primary key, not null
  • created_at date, not null
  • modified_at date, null

I have an index on birth_date column in user_info table and my query looks like this

SELECT * FROM user_info WHERE birth_date = @p1 and (first_name = @p2 or email = @p3) and (last_name = @p4 or email =@p3);

Most of the times, the query latency on production environment is 7-26ms but sometimes, and it's happening frequently actually, the latency jumps up to 900ms so I see in logs 3-4 queries with latency 180, 200, 700, 900ms and then goes back to 7-26ms.

  • The table user_info is insert-only, so no updates nor deletes
  • I am using PostgreSQL 11
  • The table user_info contains 5 million records, data distribution looks good on production, most of birth_date(s) are between 200-1000 records, but only one particular birth date has 110K record (Would this be the issue?)
  • server configurations are the default as specified in (Azure Postgres single server)
  • Server specs are (4 vCPU, 20 GB memory, Gen 5)
  • Maximum concurrent connection to production db is 20 per second
  • Best case scenario is only one SQL query executed (the one attached in this question)
  • Worst case scenario is First: SQL Query attached in this question, Second: Insert query to user table, Third Insert query to user_info table
  • Another scenario is First: SQL Query attached in this question, Second: Insert query to user_info table, Third: update modified time in the user table
  • querying the pg_stat_user_tables trying to understand how many sequential scan vs sequential scan yielded the following result: enter image description here
  • executing two queries with explain analyze from my local machine using pgAdmin against azure postgres server, one with the birth date that contains the most count and one with another birth date, the rest of the values are completely random yieleded the following results

enter image description here

My problem is:

  • I need the latency to not exceed certain threshold

My questions are:

  • What the reason behind the latency jump on user_info table?
  • Would breaking the relationship between the two tables solve the issue? Perhaps when doing updates on user table user_info table needs to update the foreign key column, because update is actually insert and delete which causes deadrows in the table?
  • Is it the data-distribution? And what can I do to improve the latency?

UPDATED:

I enabled auto_explain on production, I used a condition to log only queries with latency > 40ms. And ran automated script that queries existing data, 4 of them is with birth date that has 100K row. And as expected I saw only logs for that particular birth date value:

2022-09-19 {TIME REDACTED}-LOG:  duration: 42.421 ms  plan:
    Query Text: 
                    select * from unf.user_info unf
                    where (
                     unf.birth_date = $1
                     and ( unf.first_name = $2 or unf.email = $3 )
                     and ( unf.last_name = $4 or unf.email = $3 ))
    Index Scan using idx_b_date on unf.user_info unf  (cost=0.43..76483.96 rows=1 width=112) (actual time=0.044..42.411 rows=2 loops=1)
      Output: id, user_id, birth_date, first_name, last_name, email,  phone_number, mobile_number, street_name, created_date
      Index Cond: (unf.birth_date = '{REDACTED} 00:00:00'::timestamp without time zone)
      Filter: ((((unf.last_name)::text = '{REDACTED}'::text) OR ((unf.email)::text = '{REDACTED}'::text)) AND (((unf.first_name)::text = '{REDACTED}'::text) OR ((unf.email)::text = '{REDACTED}'::text)))
      Rows Removed by Filter: 119228
      Buffers: shared hit=11025
2022-09-19 {TIME REDACTED}--LOG:  duration: 41.370 ms  plan:
    Query Text: 
                    select * from unf.user_info unf
                    where (
                     unf.birth_date = $1
                     and ( unf.first_name = $2 or unf.email = $3 )
                     and ( unf.last_name = $4 or unf.email = $3 ))
    Index Scan using idx_b_date on unf.user_info unf  (cost=0.43..76483.96 rows=1 width=112) (actual time=0.087..41.359 rows=2 loops=1)
      Output: id, user_id, birth_date, first_name, last_name, email,  phone_number, mobile_number, street_name, created_date
      Index Cond: (unf.birth_date = '{REDACTED} 00:00:00'::timestamp without time zone)
      Filter: ((((unf.last_name)::text = '{REDACTED}'::text) OR ((unf.email)::text = '{REDACTED}'::text)) AND (((unf.first_name)::text = '{REDACTED}'::text) OR ((unf.email)::text = '{REDACTED}'::text)))
      Rows Removed by Filter: 119228
      Buffers: shared hit=11025
2022-09-19 {TIME REDACTED}--LOG:  duration: 41.709 ms  plan:
    Query Text: 
                    select * from unf.user_info unf
                    where (
                     unf.birth_date = $1
                     and ( unf.first_name = $2 or unf.email = $3 )
                     and ( unf.last_name = $4 or unf.email = $3 ))
    Index Scan using idx_b_date on unf.user_info unf  (cost=0.43..76483.96 rows=1 width=112) (actual time=0.079..41.682 rows=2 loops=1)
      Output: id, user_id, birth_date, first_name, last_name, email,  phone_number, mobile_number, street_name, created_date
      Index Cond: (unf.birth_date = '{REDACTED} 00:00:00'::timestamp without time zone)
      Filter: ((((unf.last_name)::text = '{REDACTED}'::text) OR ((unf.email)::text = '{REDACTED}'::text)) AND (((unf.first_name)::text = '{REDACTED}'::text) OR ((unf.email)::text = '{REDACTED}'::text)))
      Rows Removed by Filter: 119228
      Buffers: shared hit=11025
2022-09-19 {TIME REDACTED}--LOG:  duration: 40.581 ms  plan:
    Query Text: 
                    select * from unf.user_info unf
                    where (
                     unf.birth_date = $1
                     and ( unf.first_name = $2 or unf.email = $3 )
                     and ( unf.last_name = $4 or unf.email = $3 ))
    Index Scan using idx_b_date on unf.user_info unf  (cost=0.43..76483.96 rows=1 width=112) (actual time=0.057..40.568 rows=2 loops=1)
      Output: id, user_id, birth_date, first_name, last_name, email,  phone_number, mobile_number, street_name, created_date
      Index Cond: (unf.birth_date = '{REDACTED} 00:00:00'::timestamp without time zone)
      Filter: ((((unf.last_name)::text = '{REDACTED}'::text) OR ((unf.email)::text = '{REDACTED}'::text)) AND (((unf.first_name)::text = 'n'::text) OR ((unf.email)::text = '{REDACTED}'::text)))
      Rows Removed by Filter: 119228
      Buffers: shared hit=11025
Rickless
  • 101
  • 5

2 Answers2

2

In your slow query, there are 1970 rows with birth_date = 'REDACTED', while in the slow query, there are 124287. So it is only natural that the index scan takes longer.

Since you put an OR into your WHERE condition (which makes indexing harder), the only simple improvement I can think of is to add the filter columns to the index:

CREATE INDEX ON user_info (birth_date, first_name, email);

Perhaps you can rewrite the query to avoid the OR, then more is possible.

Laurenz Albe
  • 61,070
  • 4
  • 55
  • 90
2

The query's index condition is unnecessary complex:

-- version 1, original
WHERE birth_date = @p1 
  AND ( first_name = @p2 OR email = @p3 )
  AND ( last_name = @p4 OR email = @p3 )

It is equivalent to:

-- version 2
WHERE birth_date = @p1 
  AND (  first_name = @p2 AND last_name = @p4
      OR email = @p3
      )

and to this (parentheses redundant, only shown for clarity):

-- version 3
WHERE ( birth_date = @p1 AND first_name = @p2 AND last_name = @p4 )
   OR ( birth_date = @p1 AND email = @p3 )

The 3rd variation above makes it easier to identify better index candidates, both for humans and Postgres' planner. The planner has some capabilities of rewriting/simplifying boolean conditions but it can't identify every possible simplification.

Therefore, I suggest you add both these indexes:

INDEX (birth_date, first_name, last_name)
INDEX (birth_date, email)

and check both the 3rd option above and the rewriting using UNION:

-- version 3
SELECT * 
FROM user_info 
WHERE birth_date = @p1 AND first_name = @p2 AND last_name = @p4
   OR birth_date = @p1 AND email = @p ;

-- version 4: UNION SELECT * FROM user_info WHERE birth_date = @p1 AND first_name = @p2 AND last_name = @p4 UNION SELECT * FROM user_info WHERE birth_date = @p1 AND email = @p3 ;

ypercubeᵀᴹ
  • 99,450
  • 13
  • 217
  • 306