Clustered Index Selection - PK or FK?

Question

I have a SQL Server 2014 table that looks like the following:

OrderId     int           not null IDENTITY --this is the primary key column
OrderDate   datetime2     not null
CustomerId  int           not null
Description nvarchar(255) null

Some folks on my team have suggested that the clustered index should be on OrderId, but I think that the CustomerId + OrderId would be a better choice for the following reasons:

Almost all queries will be looking WHERE CustomerId = @param, not OrderId
CustomerId is a foreign key to the Customer table, so having a clustered index with CustomerId should speed up joins
While CustomerId isn't unique, adding OrderId as the second key column makes the key unique. (We can then use the UNIQUE keyword when creating the clustered index on those 2 columns, which avoids the hidden uniquifier that SQL Server adds to rows with duplicate keys in a non-unique clustered index.)
Once data is inserted, the CustomerId and OrderId never change, so these rows wouldn't be moving around after initial write.
Data access happens via an ORM that requests all columns by default, so when a query based on CustomerId comes in, the clustered index will be able to provide all columns without any additional work.
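
The proposal in the list above could be sketched in T-SQL roughly as follows. The table name dbo.Orders and the constraint and index names are assumptions, since the question gives only the column list, and the foreign key to Customer is omitted. Because a table can have only one clustered index, the primary key is declared NONCLUSTERED in this sketch.

```sql
-- Hypothetical table, constraint, and index names; the columns are from the question.
CREATE TABLE dbo.Orders
(
    OrderId     int           NOT NULL IDENTITY(1, 1),
    OrderDate   datetime2     NOT NULL,
    CustomerId  int           NOT NULL,
    Description nvarchar(255) NULL,
    -- The primary key stays on OrderId but is enforced by a nonclustered index.
    CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderId)
);

-- CustomerId leads so that WHERE CustomerId = @param can seek directly into the table.
-- OrderId makes the key unique, so no uniquifier is needed.
CREATE UNIQUE CLUSTERED INDEX CIX_Orders_CustomerId_OrderId
    ON dbo.Orders (CustomerId, OrderId);
```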

Does the CustomerId and OrderId approach sound like the best option given the above? Or, is OrderId on its own better, since it's a single column that's guaranteeing uniqueness by itself?

Currently, the table has a clustered index on OrderId and a nonclustered index on CustomerId. The nonclustered index isn't covering, and because our ORM requests all columns, each matching row needs extra work (a key lookup) to retrieve the remaining columns. With this post, I'm trying to work out whether a better clustered index (CI) would improve performance.
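
As a rough illustration of the current state, the existing indexes and a typical ORM-style query might look like the sketch below. The names dbo.Orders, PK_Orders, and IX_Orders_CustomerId are assumptions, as is the example parameter value.

```sql
-- Hypothetical names; assumes dbo.Orders exists with the four columns from the question.
-- Current state: clustered primary key on OrderId...
ALTER TABLE dbo.Orders
    ADD CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId);

-- ...plus a nonclustered index keyed only on CustomerId.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId);

-- Typical ORM-style query: every column, filtered by customer.
-- OrderDate and Description are not stored in IX_Orders_CustomerId, so if the
-- optimizer uses that index, each matching row needs a key lookup into the
-- clustered index to fetch them.
DECLARE @CustomerId int = 42;  -- example value

SELECT OrderId, OrderDate, CustomerId, Description
FROM dbo.Orders
WHERE CustomerId = @CustomerId;
```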

The activity on our DB is about 85% reads and 15% writes.

score 5 · Accepted Answer · 2017-08-31T06:44:33.400

Community wiki answer:

I think a composite clustered index key with CustomerID as the first column will be best since that's in the WHERE clause of nearly all queries.

This may cause more page splits than an ever-increasing key would (or, more likely, suboptimal page density for a time, if you set and maintain a fill factor to avoid 'bad' splits). However, the overall performance improvement for customer queries is substantial, because the key lookup is avoided.

OrderID or OrderDate may be best for the second column depending on your most critical queries.

For example, if customers see a chronological list of recent orders after logging in to a web site, OrderDate should be next, to optimize ORDER BY OrderDate DESC.
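
A minimal sketch of that OrderDate variant, assuming a table named dbo.Orders that does not already have a clustered index (both the table name and the index name are hypothetical). OrderId is appended as a third key column here only so that the key can still be declared UNIQUE. That is an addition for illustration, not something the answer prescribes.

```sql
-- Hypothetical table and index names.
-- OrderId acts as a tiebreaker so the key stays unique even when one customer
-- has two orders with the same OrderDate.
CREATE UNIQUE CLUSTERED INDEX CIX_Orders_CustomerId_OrderDate
    ON dbo.Orders (CustomerId, OrderDate DESC, OrderId DESC);

-- A "recent orders" query can then read one customer's rows already in the
-- requested order, without a separate sort step.
DECLARE @CustomerId int = 42;  -- example value

SELECT TOP (20) OrderId, OrderDate, CustomerId, Description
FROM dbo.Orders
WHERE CustomerId = @CustomerId
ORDER BY OrderDate DESC;
```

The DESC keywords are optional for this query, since SQL Server can also scan an ascending index backward to return rows in descending order.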

If you choose OrderID as the clustered index, with a non-clustered index on CustomerID, you'll still get splits and fragmentation, just in the non-clustered index.

score 3 · Answer 2 · edited Aug 31 '17 at 06:24

If this table is heavily write-intensive (i.e. many more INSERT statements than SELECT statements run against it), I'm going to disagree with the wiki answer.

Choosing CustomerID as the first column of a composite clustered key is going to generate a lot of mid-page splits. You hopefully have lots of existing customers and also get many new customers all the time. Because existing customers are (hopefully) placing more orders as your business continues to grow, each new order is inserted next to that customer's earlier rows, in the middle of the index rather than at the end. The resulting mid-page splits are going to kill performance not only on writes but also on reads, as your indexes will be heavily fragmented and will likely contain more white space (which means wasted storage and memory).

If you feel CustomerID should be the leading column of a composite clustered index, you can reduce the impact of the mid-page splits by adjusting FILLFACTOR on all indexes for this table. This reduces the number of mid-page splits at the cost of a larger table/index. If you want to go this route, I'd suggest testing with a value of 80 and reducing it if analysis reveals mid-page splits are still killing performance.
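
Applying the suggested starting value could look like the following sketch; dbo.Orders is an assumed table name. Note that the fill factor is applied only when an index is created or rebuilt. SQL Server does not keep pages at that density as rows are inserted afterward, so the rebuild has to be repeated as part of regular maintenance.

```sql
-- Hypothetical table name.
-- Rebuild every index on the table, leaving roughly 20% free space on each
-- leaf page so that later inserts into the middle of the index have room.
ALTER INDEX ALL ON dbo.Orders
    REBUILD WITH (FILLFACTOR = 80);
```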

My suggestion is to use OrderId. OrderID should naturally be sequential, so it generates mostly end-of-index page splits, which are good and expected with table growth. Additionally, this approach will play better with table partitioning if you choose to use the OrderDate column as a partition key. For the queries that constantly filter on CustomerID, create a nonclustered index to handle them. This index would need to be defined with a suitable FILLFACTOR, as it will suffer from the mid-page splits I mention above, though these won't be as bad overall as splits occurring in the clustered index.
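
One way this alternative might be written is sketched below. The table and index names are hypothetical, and the table is assumed to keep its clustered index on OrderId. The INCLUDE list is an addition that the answer does not mention: it makes the index covering for an ORM that requests every column, at the cost of storing most of the table a second time. If a CustomerId index of the same name already exists, it would need to be dropped first (or the statement run with DROP_EXISTING = ON).

```sql
-- Hypothetical table and index names; the clustered index stays on OrderId.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    -- Optional: carry the remaining columns so no key lookup is needed.
    -- OrderId is already present because it is the clustering key.
    INCLUDE (OrderDate, Description)
    -- Leave free space for new orders from existing customers.
    WITH (FILLFACTOR = 80);
```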

Quoting the question: "The activity on our DB is about 85% reads and 15% writes."

CustomerID + OrderID (and specifying a fillfactor to allow for growth without splits) is probably better if that assessment holds true. Just make sure that assessment is accurate. Test test test.
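
For the testing step, a sketch along these lines could be run against each candidate design to compare logical reads for a typical customer query and to watch fragmentation and page density over time. The table name dbo.Orders and the parameter value are assumptions.

```sql
-- Hypothetical table name and example parameter value.
-- 1) Compare I/O and timing for a typical customer query under each design.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DECLARE @CustomerId int = 42;

SELECT OrderId, OrderDate, CustomerId, Description
FROM dbo.Orders
WHERE CustomerId = @CustomerId;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

-- 2) Check leaf-level fragmentation and page fullness for every index on the table.
--    SAMPLED mode is used because LIMITED mode does not report page fullness.
SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,
       ps.avg_page_space_used_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Orders'), NULL, NULL, 'SAMPLED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id  = ps.index_id
WHERE ps.index_level = 0;
```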
