
I have the following query and expect the result to be a list of IDs without duplicates. But sometimes it produces duplicates (1-2 per 4 million rows).

Why can this happen? I run the query at the default (read committed) isolation level. I can't use TABLOCK or SERIALIZABLE because these are OLTP tables with hundreds of changes.

CREATE TABLE #characterId
(
    CharacterId BIGINT
);

DECLARE @dateTimeFrom DATETIME = '2025-05-04', @dateTimeTo DATETIME = '2025-05-07';

INSERT INTO #characterId (CharacterId)
SELECT Id 
FROM table1 u 
WHERE u.DateUpdated >= @dateTimeFrom 
    AND u.DateUpdated < @dateTimeTo
UNION
SELECT CharacterId 
FROM table2 usi 
WHERE usi.DateUpdated >= @dateTimeFrom 
    AND usi.DateUpdated < @dateTimeTo
UNION
SELECT ust.CharacterId 
FROM table3 ust 
WHERE ust.DateCreated >= @dateTimeFrom 
    AND ust.DateCreated < @dateTimeTo
    AND ust.TokenType = 2;

Current execution plan:

(execution plan image)

SQL Server version - Microsoft SQL Server 2019 (RTM-CU26) (KB5035123) - 15.0.4365.2 (X64) Mar 29 2024 23:02:47 Copyright (C) 2019 Microsoft Corporation Standard Edition (64-bit) on Windows Server 2022 Datacenter 10.0 (Build 20348: ) (Hypervisor)

Plan with UNION ALL instead of first UNION:

(execution plan image)

1 Answer

Uniqueness

This can happen under locking read committed (the default for SQL Server on-premises) but not under read committed snapshot isolation (RCSI, the default for Azure SQL Database).

When a uniqueness guarantee exists (a UNIQUE index or constraint), the optimizer can omit plan operators that remove duplicates explicitly. As an example, in the UNION case, it means the optimizer does not have to introduce a de-duplicating operator on the lower input to a Hash Union.
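
For illustration, the kind of guarantee involved might come from a declaration like the following (a hypothetical table definition, not necessarily how table2 is actually declared):

-- A unique constraint guarantees CharacterId values are distinct, so the
-- optimizer can omit an explicit duplicate-removal operator for this input.
CREATE TABLE dbo.table2
(
    CharacterId BIGINT NOT NULL,
    DateUpdated DATETIME NOT NULL,
    CONSTRAINT UQ_table2_CharacterId UNIQUE (CharacterId)
);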

Concurrency

The problem with this is that concurrent database changes may cause the plan to (apparently) encounter duplicate values at lower locking isolation levels. This can happen when rows change or move around such that a value that has been seen earlier in the seek or scan operation appears again later.

The chance of concurrent changes happening in just the right way to generate a false duplicate is quite low, which explains why you only see a few of them from time to time.

Note that the uniqueness guarantee is never violated. The query simply sees the same value more than once because it observes the current state of rows at different points in time.

Example

Say you encounter the value 123 on some early row. Meanwhile, another process changes that row's value to 456 and adds a new row with the value 123. That new row happens to appear later in the index structure we're reading from, so it will be encountered in the future. Thus, we encountered the value 123 twice, even though it only ever appeared once in the index at any given moment.
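
In code form, the timing looks something like this (a minimal sketch using a hypothetical dbo.Characters table; the two sessions run concurrently):

-- Session 1 (locking read committed): a scan in progress.
SELECT CharacterId
FROM dbo.Characters;           -- reads 123 early in the scan...

-- Session 2: commits while session 1's scan is still running.
BEGIN TRANSACTION;

    UPDATE dbo.Characters
    SET CharacterId = 456
    WHERE CharacterId = 123;   -- 123 vanishes from where it was already read

    INSERT dbo.Characters (CharacterId)
    VALUES (123);              -- 123 reappears later in the index

COMMIT TRANSACTION;

-- Session 1's scan eventually reaches the new row and returns 123 again,
-- even though 123 never existed twice at any single point in time.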

There is a db<>fiddle demo illustrating the general idea.

Forrest McDaniel also illustrates the principle in A Distinct (Query) Murder Mystery. The details are different (showing a DISTINCT apparently being ignored) but the underlying cause is similar enough to be useful.

Workarounds

This effect is always a possibility at lower isolation levels. Using a higher (or row versioning) isolation level or table locking hints will eliminate the root cause.

Use a versioning isolation level

The issue cannot occur when using a row-versioning isolation level like RCSI or SI because these operate on a 'snapshot' of the database as of a particular point in time.

You should look into changing from locking read committed to RCSI in the future. Its behaviour makes more intuitive sense to most people.
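
A minimal sketch of making that change (the database name is a placeholder; WITH ROLLBACK IMMEDIATE terminates open transactions so the switch can complete, so schedule it carefully on a busy OLTP system):

ALTER DATABASE YourDatabase
SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;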

Higher isolation or hints

If you cannot use a row versioning isolation level, run the query at the SERIALIZABLE isolation level or add table locking hints:

INSERT #characterId (CharacterId)                                                          
SELECT Id 
FROM table1 AS u WITH (TABLOCK)
WHERE 
    u.DateUpdated >= @dateTimeFrom 
    AND u.DateUpdated < @dateTimeTo
UNION
SELECT CharacterId 
FROM table2 AS usi WITH (TABLOCK)
WHERE 
    usi.DateUpdated >= @dateTimeFrom 
    AND usi.DateUpdated < @dateTimeTo
UNION 
SELECT ust.CharacterId
FROM table3 AS ust WITH (TABLOCK)
WHERE 
    ust.DateCreated >= @dateTimeFrom 
    AND ust.DateCreated < @dateTimeTo
    AND ust.TokenType = 2
OPTION (HASH UNION);
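
If you take the isolation level route instead of the hints, a sketch (the INSERT is the same statement as above, without the TABLOCK hints):

-- SERIALIZABLE holds range locks until the transaction ends, so expect
-- more blocking on busy OLTP tables.
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- ...run the INSERT ... SELECT ... UNION statement here...

-- Restore the session default afterwards.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;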

Final distinct or ignore options

Or you could accept that the results may contain duplicates due to concurrency, and perform a final DISTINCT on the temporary table after the insert is complete.
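
For example, a sketch of two ways that could look, using the #characterId table from the question:

-- Option A: apply DISTINCT wherever the temp table is consumed.
SELECT DISTINCT CharacterId
FROM #characterId;

-- Option B: remove duplicate rows in place, once, after the insert.
WITH Numbered AS
(
    SELECT rn = ROW_NUMBER() OVER (
               PARTITION BY CharacterId
               ORDER BY (SELECT NULL))
    FROM #characterId
)
DELETE FROM Numbered
WHERE rn > 1;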

Or, add a unique index or constraint to the target table with the IGNORE_DUP_KEY option.
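
A sketch of that option (the index name is arbitrary); duplicate key values are then discarded with a warning rather than raising an error:

CREATE TABLE #characterId
(
    CharacterId BIGINT NOT NULL
);

CREATE UNIQUE INDEX ux_characterId
ON #characterId (CharacterId)
WITH (IGNORE_DUP_KEY = ON);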

Use a different UNION implementation

The Hash Union physical implementation of UNION is particularly prone to this issue.

Without the uniqueness guarantee, the optimizer would take steps to ensure rows arriving on the lower (probe) input were unique. For example, it might introduce a Distinct Sort. The build input ignores duplicates as it builds the hash table, so no explicit operator is required here.

The issue would not arise with a Distinct Sort on the probe input, since the isolation level only affects rows as they are fetched from the database. Duplicates might well still be read from time to time, but the Distinct Sort would remove them.

Execution plan behaviour can depend on such physical implementation details. The optimizer chose a Hash Union in your case because that was the alternative with the lowest cost, and it is likely to perform best.

You may find the issue does not occur with a different physical UNION implementation. You can give the optimizer a choice between Merge and Concat Union with the following query hint:

OPTION (CONCAT UNION, MERGE UNION);

Further reading

For more background, see Query Plans and Read Committed Isolation Level by Craig Freedman, which concludes:

I've demonstrated three ways that queries may behave unexpectedly when running in read committed isolation level. I want to emphasize that these results are not incorrect. SQL Server guarantees that the committed data is consistent at all times and does not permit any constraints to be violated.

These results are merely a consequence of running at a relatively low isolation level. If we repeat the above experiments with snapshot read committed enabled or at a higher isolation level, we do not see these results.

The benefit of read committed is higher concurrency with less blocking and fewer deadlocks. The disadvantage is lower consistency guarantees.

Paul White