Filtering a result set on a very large table (using INCLUDE index in SQL Server)

Question

Consider a very large table having hundreds of millions of rows defined as

-- Here ObjectIdName and ObjectTypeName together are unique.
-- Due to their size they aren't indexed, but rather they are
-- hashed to ObjectIdName -> ObjectId and ObjectTypeName -> ObjectType
-- which are used in the index instead. 
CREATE TABLE VeryLarge
(
    ObjectId INT NOT NULL,
    ObjectIdName NVARCHAR(512) NOT NULL,
    ObjectType INT NOT NULL,
    ObjectTypeName NVARCHAR(512) NOT NULL,
    PayLoad VARBINARY(MAX) NULL,
    IsDeleted BIT NOT NULL
);

-- This was originally in the question. By mistake includes the Payload column.
CREATE NONCLUSTERED INDEX IX_VeryLarge1 ON VeryLarge(ObjectId, ObjectType) INCLUDE(IsDeleted, PayLoad, ObjectIdName, ObjectTypeName);

-- This was the index I intended to have, without the Payload column.
CREATE NONCLUSTERED INDEX IX_VeryLarge2 ON VeryLarge(ObjectId, ObjectType) INCLUDE(IsDeleted, ObjectIdName, ObjectTypeName);

If one assumes the ObjectId and ObjectType parameters locate a single row in 99.99% of the cases, it looks to me there needs to be an interplay between the query and the indexing to be so effective the query will always return predictably in sensible time (considering hardware and the actual amount of rows). However, I'm not sure how should the query be constructed. Basically what I would like to do is query first with ObjectId and ObjectType and filter this small result set with ObjectIdName and ObjectTypeName (if there are more than one row).

Considering this, with the defined indexing, is the following query sensible on SQL Server?

SELECT
    Payload
FROM
    VeryLarge
WHERE
    ObjectId = @objectId
    AND ObjectType = @objectType
    AND IsDeleted = 0
    AND ObjectIdName = @objectIdName
    AND ObjectTypeName = @objectTypeName;

-- Or alternatively, considering the question:
SELECT
    PayLoad
FROM
(
    SELECT
        PayLoad,
        ObjectIdName,
        ObjectTypeName
    FROM
        VeryLarge
    WHERE
        ObjectId = @objectId
        AND ObjectType = @objectType
        AND IsDeleted = 0
) AS x
WHERE x.ObjectIdName = @objectIdName AND x.ObjectTypeName = @objectTypeName;

I don't know how to construct such a query that would prune the results of the first query that will is guaranteed to hit an index and return only one or a few rows. It would look that using INCLUDE has the same effect, forcing and index seek but I don't know if there will be costs associated with this. Also, I know MySQL doesn't have this concept of included index, but it works somehow similarily and I assume filtering the result set of at most a few rows would work on any database that supports indexing regardless of other structures.

This is related to my other question at Hashed and heap indexed object storage table insert and query performance, which has the table and query defined.

<edit: Additional question: Having those repeating ObjectType columns in the same table looks like wasteful. Might be subject for another post, but I believe they ought to be separated to a different table. Then the question is that should the INNER JOIN be avoid with a query that does the join only if there are more rows than one after the first filtering. And how to do that.

Hannah Vernon · Accepted Answer · 2016-04-13T16:59:27.693

From the details you've provided it seems reasonable that the IX_VeryLarge non-clustered index would support both queries you've shown in your question. You have the Payload column typed as VARBINARY(MAX) - if you expect large objects to be stored in that column, I'd likely not INCLUDE it in the index since that will cause the index to be much much larger than it would otherwise be. If the vast majority of your queries result in one or two rows, a simple lookup into the clustered index/heap (the table) is preferable. The pattern you've shown whereby you have a clustered index/heap and supporting non-clustered indexes with included columns (a so-called covering index) is extremely common, and perhaps the single best way to ensure good performance, along with good statistics.

Having said that, I wouldn't expect the 2nd query to outperform the 1st query since the results returned would be exactly the same. SQL is not a procedural language, it is a declarative language. With a procedural language, you tell the computer what to do, whereas with a declarative language you describe the desired results. Since SQL is declarative, and since both queries return the same result set, it is reasonable to assume SQL Server will likely come up with the same plan for both queries. Since the 2nd query is more difficult for a human to understand, I'd favor the 1st variant. Try not to outsmart the SQL Server query optimizer, it ~~cannot be done~~ is very hard to do.

At the end of the day, the only way for you to truly understand the best option in your situation, on your data, with your table definition and output requirements, is to test them.

score 1 · Answer 2 · answered Apr 13 '16 at 18:38

ObjectId is a misnomer if it is not unique. You are saying that it takes 4 columns to uniquely identify a row? Rethink.

This WHERE does not make sense; it seems like you are over-specifying the row by filtering on so many things, including a flags:

WHERE ObjectId = @objectId
  AND ObjectType = @objectType
  AND IsDeleted = 0
  AND ObjectIdName = @objectIdName
  AND ObjectTypeName = @objectTypeName;

"repeating ObjectType" -- Yes, do normalize, and turn into some suitably sized INT for type_id. Ditto for any other "repeating" columns.

If 99.99% of cases work with 2 columns, don't bother to have 2 indexes. The loss of cache space would cause more performance trouble than the slowdown for the 0.01%. In fact, don't even us INCLUDE because that costs cache space, too. KISS.

MySQL-specific notes:

Use VARCHAR(256) CHARACTER SET utf8 if possible; this allows using the column in indexes; 512 does not.
It smells like ObjectTypeName could/should be normalized. (How many different value of are there?
Use Engine=InnoDB.
In MySQL, a PK is both unique and clustered. This is the most efficient way to get to a row. And there can be only one `PRIMARY KEY. Because of 'clustering', the PK "includes" all the other columns.
Payload would probably be TEXT or BLOB, or some variant on them.
Using the suggested subquery would be inefficient.

Filtering a result set on a very large table (using INCLUDE index in SQL Server)

2 Answers2