I'm trying to produce a sample query plan to show why UNIONing two result-sets can be better than using OR in a JOIN clause. A query plan I've written has me stumped. I'm using the StackOverflow database with a nonclustered index on Users.Reputation.
CREATE NONCLUSTERED INDEX IX_NC_REPUTATION ON dbo.USERS(Reputation)
SELECT DISTINCT Users.Id
FROM dbo.Users
INNER JOIN dbo.Posts
ON Users.Id = Posts.OwnerUserId
OR Users.Id = Posts.LastEditorUserId
WHERE Users.Reputation = 5
The query plan is at https://www.brentozar.com/pastetheplan/?id=BkpZU1MZE, query duration for me is 4:37 min, 26612 rows returned.
I haven't seen this style of constant-scan being created from an existing table before - I'm unfamiliar with why there's a constant scan ran for every single row, when a constant scan is usually used for a single row inputted by the user for example SELECT GETDATE(). Why is it used here? I would really appreciate some guidance in reading this query plan.
If I split that OR out into a UNION, it produces a standard plan running in 12 seconds with the same 26612 rows returned.
SELECT Users.Id
FROM dbo.Users
INNER JOIN dbo.Posts
ON Users.Id = Posts.OwnerUserId
WHERE Users.Reputation = 5
UNION
SELECT Users.Id
FROM dbo.Users
INNER JOIN dbo.Posts
ON Users.Id = Posts.LastEditorUserId
WHERE Users.Reputation = 5
I interpret this plan as doing this:
- Get all 41782500 rows from Posts (the actual number of rows matches the CI scan on Posts)
- For each 41782500 rows in Posts:
- Produce scalars:
- Expr1005: OwnerUserId
- Expr1006: OwnerUserId
- Expr1004: The static value 62
- Expr1008: LastEditorUserId
- Expr1009: LastEditorUserId
- Expr1007: The static value 62
- In the concatenate:
- Exp1010: If Expr1005 (OwnerUserId) is not null, use that else use Expr1008 (LastEditorUserID)
- Expr1011: If Expr1006 (OwnerUserId) is not null, use that, else use Expr1009 (LastEditorUserId)
- Expr1012: If Expr1004 (62) is null use that, else use Expr1007 (62)
- In the Compute scalar: I don't know what an ampersand does.
- Expr1013: 4 [and?] 62 (Expr1012) = 4 and OwnerUserId IS NULL (NULL = Expr1010)
- Expr1014: 4 [and?] 62 (Expr1012)
- Expr1015: 16 and 62 (Expr1012)
- In the Order By sort by:
- Expr1013 Desc
- Expr1014 Asc
- Expr1010 Asc
- Expr1015 Desc
- In the Merge Interval it removed Expr1013 and Expr1015 (these are inputs but not outputs)
- In the Index seek below the nested loops join it is using Expr1010 and Expr1011 as seek predicates, but I don't understand how it has access to these when it hasn't done the nested loop join from IX_NC_REPUTATION to the subtree containing Expr1010 and Expr1011.
- The Nested Loops join returns only the Users.IDs that have a match in the earlier subtree. Because of predicate pushdown, all rows returned from the index seek on IX_NC_REPUTATION are returned.
- The last Nested Loops join: For each Posts record, output Users.Id where a match is found in the below dataset.



