
I have the following query and expect the result to be a list of IDs without duplicates. But sometimes it produces duplicates (1-2 per 4 million rows).

Why can this happen? I run the query at the default (read committed) isolation level. I can't use TABLOCK or SERIALIZABLE because these are OLTP tables with hundreds of changes.

CREATE TABLE #characterId
(
    CharacterId BIGINT
);

DECLARE @dateTimeFrom DATETIME = '2025-05-04', @dateTimeTo DATETIME = '2025-05-07';

INSERT INTO #characterId (CharacterId)
SELECT Id 
FROM table1 u 
WHERE u.DateUpdated >= @dateTimeFrom 
    AND u.DateUpdated < @dateTimeTo
UNION
SELECT CharacterId 
FROM table2 usi 
WHERE usi.DateUpdated >= @dateTimeFrom 
    AND usi.DateUpdated < @dateTimeTo
UNION
SELECT ust.CharacterId 
FROM table3 ust 
WHERE ust.DateCreated >= @dateTimeFrom 
    AND ust.DateCreated < @dateTimeTo
    AND ust.TokenType = 2;

Current execution plan:

(execution plan image)

SQL Server version - Microsoft SQL Server 2019 (RTM-CU26) (KB5035123) - 15.0.4365.2 (X64) Mar 29 2024 23:02:47 Copyright (C) 2019 Microsoft Corporation Standard Edition (64-bit) on Windows Server 2022 Datacenter 10.0 (Build 20348: ) (Hypervisor)

Plan with UNION ALL instead of first UNION:

(execution plan image)

1 Answer

Uniqueness

This can happen under locking read committed (the default for SQL Server on-premises) but not under read committed snapshot isolation (RCSI, the default for Azure SQL Database).

When a uniqueness guarantee exists (a UNIQUE index or constraint), the optimizer can omit plan operators that remove duplicates explicitly. As an example, in the UNION case, it means the optimizer does not have to introduce a de-duplicating operator on the lower input to a Hash Union.
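
For illustration, the kind of guarantee involved might come from a declaration like the following (a hypothetical table definition, not necessarily how table2 is actually declared):

-- A unique constraint guarantees CharacterId values are distinct, so the
-- optimizer can omit an explicit duplicate-removal operator for this input.
CREATE TABLE dbo.table2
(
    CharacterId BIGINT NOT NULL,
    DateUpdated DATETIME NOT NULL,
    CONSTRAINT UQ_table2_CharacterId UNIQUE (CharacterId)
);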

Concurrency

The problem with this is that concurrent database changes may cause the plan to (apparently) encounter duplicate values at lower locking isolation levels. This can happen when rows change or move around such that a value that has been seen earlier in the seek or scan operation appears again later.

The chance of concurrent changes happening in just the right way to generate a false duplicate is quite low, which explains why you only see a few of them from time to time.

Note that the uniqueness guarantee is never violated. The query simply sees the same value more than once because it observes the current state of rows at different points in time.

Example

Say you encounter the value 123 on some early row. Meanwhile, another process changes that row's value to 456 and adds a new row with the value 123. That new row happens to appear later in the index structure we're reading from, so it will be encountered in the future. Thus, we encountered the value 123 twice, even though it only ever appeared once in the index at any given moment.
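
In code form, the timing looks something like this (a minimal sketch using a hypothetical dbo.Characters table; the two sessions run concurrently):

-- Session 1 (locking read committed): a scan in progress.
SELECT CharacterId
FROM dbo.Characters;           -- reads 123 early in the scan...

-- Session 2: commits while session 1's scan is still running.
BEGIN TRANSACTION;

    UPDATE dbo.Characters
    SET CharacterId = 456
    WHERE CharacterId = 123;   -- 123 vanishes from where it was already read

    INSERT dbo.Characters (CharacterId)
    VALUES (123);              -- 123 reappears later in the index

COMMIT TRANSACTION;

-- Session 1's scan eventually reaches the new row and returns 123 again,
-- even though 123 never existed twice at any single point in time.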

There is a db<>fiddle demo illustrating the general idea.

Forrest McDaniel also illustrates the principle in A Distinct (Query) Murder Mystery. The details are different (showing a DISTINCT apparently being ignored) but the underlying cause is similar enough to be useful.

Workarounds

This effect is always a possibility at lower isolation levels. Using a higher (or row versioning) isolation level or table locking hints will eliminate the root cause.

Use a versioning isolation level

The issue cannot occur when using a row-versioning isolation level like RCSI or SI because these operate on a 'snapshot' of the database as of a particular point in time.

You should look into changing from locking read committed to RCSI in the future. Its behaviour makes more intuitive sense to most people.
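
A minimal sketch of making that change (the database name is a placeholder; WITH ROLLBACK IMMEDIATE terminates open transactions so the switch can complete, so schedule it carefully on a busy OLTP system):

ALTER DATABASE YourDatabase
SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;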

Higher isolation or hints

If you cannot use a row versioning isolation level, run the query at the SERIALIZABLE isolation level or add table locking hints:

INSERT #characterId (CharacterId)                                                          
SELECT Id 
FROM table1 AS u WITH (TABLOCK)
WHERE 
    u.DateUpdated >= @dateTimeFrom 
    AND u.DateUpdated < @dateTimeTo
UNION
SELECT CharacterId 
FROM table2 AS usi WITH (TABLOCK)
WHERE 
    usi.DateUpdated >= @dateTimeFrom 
    AND usi.DateUpdated < @dateTimeTo
UNION 
SELECT ust.CharacterId
FROM table3 AS ust WITH (TABLOCK)
WHERE 
    ust.DateCreated >= @dateTimeFrom 
    AND ust.DateCreated < @dateTimeTo
    AND ust.TokenType = 2
OPTION (HASH UNION);
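
If you take the isolation level route instead of the hints, a sketch (the INSERT is the same statement as above, without the TABLOCK hints):

-- SERIALIZABLE holds range locks until the transaction ends, so expect
-- more blocking on busy OLTP tables.
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- ...run the INSERT ... SELECT ... UNION statement here...

-- Restore the session default afterwards.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;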

Final distinct or ignore options

Or you could accept that the results may contain duplicates due to concurrency, and perform a final DISTINCT on the temporary table after the insert is complete.
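
For example, a sketch of two ways that could look, using the #characterId table from the question:

-- Option A: apply DISTINCT wherever the temp table is consumed.
SELECT DISTINCT CharacterId
FROM #characterId;

-- Option B: remove duplicate rows in place, once, after the insert.
WITH Numbered AS
(
    SELECT rn = ROW_NUMBER() OVER (
               PARTITION BY CharacterId
               ORDER BY (SELECT NULL))
    FROM #characterId
)
DELETE FROM Numbered
WHERE rn > 1;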

Or, add a unique index or constraint to the target table with the IGNORE_DUP_KEY option.
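
A sketch of that option (the index name is arbitrary); duplicate key values are then discarded with a warning rather than raising an error:

CREATE TABLE #characterId
(
    CharacterId BIGINT NOT NULL
);

CREATE UNIQUE INDEX ux_characterId
ON #characterId (CharacterId)
WITH (IGNORE_DUP_KEY = ON);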

Use a different UNION implementation

The Hash Union physical implementation of UNION is particularly prone to this issue.

Without the uniqueness guarantee, the optimizer would take steps to ensure rows arriving on the lower (probe) input were unique. For example, it might introduce a Distinct Sort. The build input ignores duplicates as it builds the hash table, so no explicit operator is required here.

The issue would not arise with a Distinct Sort on the probe input, since the isolation level only affects rows as they are fetched from the database. Duplicates might well still be read from time to time, but the Distinct Sort would remove them.

Execution plan behaviour can depend on such physical implementation details. The optimizer chose a Hash Union in your case because that was the alternative with the lowest cost, and it is likely to perform best.

You may find the issue does not occur with a different physical UNION implementation. You can give the optimizer a choice between Merge and Concat Union with the following query hint:

OPTION (CONCAT UNION, MERGE UNION);

Further reading

For more background, see Query Plans and Read Committed Isolation Level by Craig Freedman, which concludes:

I've demonstrated three ways that queries may behave unexpectedly when running in read committed isolation level. I want to emphasize that these results are not incorrect. SQL Server guarantees that the committed data is consistent at all times and does not permit any constraints to be violated.

These results are merely a consequence of running at a relatively low isolation level. If we repeat the above experiments with snapshot read committed enabled or at a higher isolation level, we do not see these results.

The benefit of read committed is higher concurrency with less blocking and fewer deadlocks. The disadvantage is lower consistency guarantees.

Paul White