Does my table need a primary key and clustered index change?

Question

I have a table holding 1.7 million rows.

Definition:

CREATE TABLE T(
[ID] [uniqueidentifier] NOT NULL,
[AID] [uniqueidentifier] NOT NULL,
[BID] [uniqueidentifier] NOT NULL,
[iType] [int] NOT NULL,
[MT] [ntext] NOT NULL,
[isM] [tinyint] NOT NULL,
[CDate] [datetime] NOT NULL,
[CBy] [nvarchar](50) NOT NULL,
[xi] [tinyint] NOT NULL
);

There is no Primary Key and no clustered index. There is one non clustered index on a uniqueidentifier column (that is not the PK candidate).

Looking at the usage statistics I see the following:

           Seeks Scans Lookups Updates TotalKB  UsedKB
HEAP       0     2     1500    65000   1810736  1757128
Non Cl.IX  1500  0     0       65000   56280    56240

I am thinking about adding a primary key on the "ID" column (which was obviously created for this purpose), just to follow normalization requirements and to have one. I am also wondering whether a clustered index would be helpful or not. There is no really good candidate that fulfils the clustered index requirements. Maybe CDate is the least bad one?

Unfortunately I have no insight into the regularly running queries, which I could otherwise use to identify whether there are range queries and to compare performance before and after.

Questions:

1. After adding a primary key, can I expect any performance changes or improvements?
2. Is it even possible to add a primary key WITHOUT adding a clustered index?
3. Would you suggest adding a clustered index in this scenario to improve performance? I have read tons of opinions, discussions and benchmark tests about this. Brent Ozar says "create it as long as you can't prove it's better without". Others say (sorry, it is in German) that performance usually does not get better but often worse with a clustered index in place, because it requires more reads to get the required data. A benchmark on SE resulted in "CIs are good for updates only". So right now I am completely confused about what to do with my several heap tables (like this one).

score 3 · Accepted Answer · answered Oct 16 '15 at 08:59

To answer question 2 first

2

A primary key is NOT always the clustered index. It can be, and in the majority of cases that is how things are done, but it isn't always the best choice for your data. The clustered index determines the order in which your data is stored, whereas the primary key (which can be composite) designates a column, or a set of columns, as unique. That is beneficial for preventing duplicate values and for foreign key lookups if you wish to do joins. In summary for 2: yes, you can add a primary key without a clustered index.

NOTE: sometimes, if you want an identifiable row, you can add a new column to the table that simply acts as the primary key (not always advisable, but it can sometimes be a solution to improve performance).

1

Adding a primary key can change performance; however, the only true way to know whether it will improve performance is to test it. If you have a pre-live environment, consider adding it there and running your queries against it.
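One way to run such a before/after comparison is sketched below. It assumes the table lives in the dbo schema, and it uses a made-up lookup by ID because the real queries are not known; the GUID literal is only a placeholder.

```sql
-- Report logical reads and CPU/elapsed time for each statement in this session.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Hypothetical probe query: replace it with a query your application really runs.
DECLARE @id uniqueidentifier = '00000000-0000-0000-0000-000000000000';

SELECT ID, AID, BID, CDate
FROM dbo.T
WHERE ID = @id;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Run it once before and once after adding the key, and compare the logical reads reported in the Messages output.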

If you have queries that join to this table on ID, AID and BID all together (sorry, I'm not 100% sure how these all come together), then potentially create a composite primary key across all three. That means anything that looks up data in this table by comparing all three columns can find the row with ease. (Hope this makes sense.)
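As a rough sketch, such a composite key could look like the statement below. The schema name dbo and the constraint name are assumptions, and the key is declared NONCLUSTERED here so the table stays a heap. Keep in mind that three uniqueidentifier columns make a 48-byte key.

```sql
-- Sketch only: a composite, nonclustered primary key over the three GUID columns.
-- The statement fails if any (ID, AID, BID) combination occurs more than once.
ALTER TABLE dbo.T
    ADD CONSTRAINT PK_T_ID_AID_BID PRIMARY KEY NONCLUSTERED (ID, AID, BID);
```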

3

Adding a clustered index completely depends on your data; once again, a pre-live environment would be ideal for testing.

A few things to consider when creating a clustered index: what data are you retrieving, and what are you inserting?

(This is a general example.) If you're inserting and retrieving mostly the most recent data, then a clustered index on the date column sounds like the best idea. However, if there is a LOT of data going in and out, you will have very high contention on the most recent pages in your table. An alternative would be to build the clustered index around a category that those dates belong to, e.g. client. The data is then grouped by client, which is more likely how it is gathered, and the read/write load is spread across the disk or disks.
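The two alternatives might be sketched like this for the table in the question. The index names are invented, and AID is only a stand-in for the "client" category, since the question does not say what the columns mean.

```sql
-- Option A: cluster on the creation date, so new rows land at the end of the index.
CREATE CLUSTERED INDEX CIX_T_CDate ON dbo.T (CDate);

-- Option B (instead of A, because a table can have only one clustered index):
-- cluster on a grouping column first, then on the date.
-- AID is an assumed stand-in for the "client" category.
-- CREATE CLUSTERED INDEX CIX_T_AID_CDate ON dbo.T (AID, CDate);
```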

If data retrieval is very random then a clustered index is quite pointless; if the data you get back has no real order to it then a heap is completely acceptable.

Ultimately there is no be-all and end-all answer to "should I add a clustered index or a primary key", because every situation is slightly different and will react in different ways.

A pre-live environment (even a cut-down version) can help you make your decisions. Personally, we have tables with clustered composite primary keys, and some tables that are simply heaps.

Hope this helps (and makes sense, sometimes I find I ramble)

Julien Vavasseur · Answer 2 · 2015-10-29T13:31:12.880

Let's start with the differences between Primary Key, Clustered or NonClustered Index and Heap.

Primary Key

Restrictions:

The column used in a Primary Key must be unique
If a primary key contains more than one column, each combination of values for all these columns must be unique
Must be NOT NULL

The PK itself is not meant to improve performance. It is a constraint which is used to make sure that the key (one column or a combination of columns) is unique in the table:

ALTER TABLE dbo.test ADD CONSTRAINT PK_test PRIMARY KEY NONCLUSTERED(ID);

A Primary Key will be either NONCLUSTERED (just a special constraint used for Foreign Keys) or CLUSTERED (clustered index used for storage at the page level). By default, SQL Server (and therefore SQL Server Management Studio) creates them as CLUSTERED, unless the table already has a clustered index.

Clustered Index

A Clustered Index is used to store and order rows and their data at the leaf level. Unlike non clustered indexes, it contains the data for all the columns at the leaf level of the index. Rows will be ordered by the columns used in the Index key.

To be efficient, a Clustered index key should be:

Narrow = as small as possible (number of bytes for the index key)
Unique = when duplicates exist, SQL Server adds a 4-byte "uniqueifier" value to the duplicates (= the key is bigger)
Static = is never updated
Ever Increasing = consecutive values are inserted in order, to avoid fragmentation. Data will be added at the end of the index (1, 2, 3, 4 but not 1, 2, 4, 3...)

You can only have one Clustered Index on a table. Rows are recorded in the Clustered key order. This is why it is better to have an ever increasing key. New data will be recorded at the end of the index in that case. With a random key, rows will have to be inserted in between data pages and it will create fragmentation and row movements.
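To see how much fragmentation a given key choice actually produces, the physical stats DMV can be queried. This is a minimal sketch that assumes the table is called dbo.test, as in the statements in this answer.

```sql
-- Fragmentation and size per index (index_id 0 is the heap, 1 the clustered index).
SELECT ips.index_id,
       ips.index_type_desc,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.test'), NULL, NULL, 'LIMITED') AS ips;
```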

NonClustered Index

They don't store the full row data at the leaf level. Besides their own key columns, they only store the value(s) of the clustered key(s) from the Clustered Index. The key is then used as a pointer to the real data in the Clustered Index.

Because the clustered index keys are used in index branches and leaves, a smaller key makes nonclustered indexes smaller. A smaller index has fewer pages to read, keeps more data in the memory cache, and returns more rows from each page of data that is read.

Heap

A Heap is a table without a Clustered Index. It can have a Primary Key or no PK. In order to distinguish rows, it uses a Row IDentifier (RID) of 8 bytes.

Because new rows are added at the end of the Heap in no specific order, there is no fragmentation when you insert new rows.

Question 1

The primary key won't improve performance. It will just make sure the combination of columns used in the PK is unique.

The clustered index may improve performance if its key and the other nonclustered indexes are well chosen, based on your queries and typical usage.

Question 2

You can use either of these statements:

ALTER TABLE dbo.test ADD CONSTRAINT PK_test PRIMARY KEY CLUSTERED(ID);
ALTER TABLE dbo.test ADD CONSTRAINT PK_test PRIMARY KEY NONCLUSTERED(ID);

If you use the first, you cannot add a second, custom clustered index. If you use the second, your table will remain a Heap (no clustered index, with an 8-byte RID key) until you add a Clustered index.
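Whichever statement you pick, you can check what the table ended up with by listing its indexes. A small sketch, again assuming the table name dbo.test:

```sql
-- One row per index on the table, including the heap or clustered index itself.
SELECT i.index_id,
       i.name,
       i.type_desc,
       i.is_primary_key,
       i.is_unique
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID(N'dbo.test');
```

A heap shows up as a row with index_id 0 and type_desc HEAP; once a clustered index exists, that row is replaced by one with index_id 1.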

Question 3

If we look at your columns, what are the candidate keys:

RID key...

You already have a nonclustered index. Because there is no clustered index, this index uses the Row IDentifier (RID) of 8 bytes at the moment. 8 bytes is not too big, but it is not a narrow key either.

ID, AID and BID

They are not good candidates because they are wide (16 bytes) and random (not ever increasing). However, they are unique. They will make nonclustered indexes bigger and slower. NEWSEQUENTIALID() could be used to make new values ever increasing, but it won't make them smaller.

CDate

It can be a good candidate if you insert rows in date order (ever increasing). You may have duplicates, which will have to rely on a uniqueifier (4 bytes) to distinguish them. It also cannot be a Primary Key if it is not unique. Datetime storage size is 8 bytes. It would be better to convert it to datetime2, which takes 6 to 8 bytes depending on the precision you need.
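The storage sizes can be checked directly with DATALENGTH. The snippet below is only an illustration with an arbitrary sample value. The commented-out ALTER shows what a conversion could look like, with the precision of 3 and the table name dbo.test as assumptions.

```sql
-- Compare the storage size of datetime with datetime2 at several precisions.
DECLARE @d datetime = '2015-10-29T13:31:12.880';

SELECT DATALENGTH(@d)                        AS datetime_bytes,
       DATALENGTH(CAST(@d AS datetime2(0))) AS datetime2_0_bytes,
       DATALENGTH(CAST(@d AS datetime2(3))) AS datetime2_3_bytes,
       DATALENGTH(CAST(@d AS datetime2(7))) AS datetime2_7_bytes;

-- Hypothetical conversion of the existing column (test on a copy first):
-- ALTER TABLE dbo.test ALTER COLUMN CDate datetime2(3) NOT NULL;
```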

New column of type int with identity

In that case, the storage size would be 4 bytes, which is big enough for 1.7M rows. This is much smaller than datetime or uniqueidentifier, hence smaller indexes. Identity will also make the values unique and ever increasing.
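To put the key widths in perspective, here is a back-of-the-envelope calculation of how many bytes of row locator each nonclustered index would carry for 1.7 million rows. It deliberately ignores row and page overhead and any uniqueifier, so treat the numbers as relative, not exact: roughly 27 MB for a uniqueidentifier key against roughly 7 MB for an int.

```sql
-- Rough estimate: rows * width of the clustering key (or RID) stored in each nonclustered index.
SELECT c.candidate,
       c.key_bytes,
       1700000 * c.key_bytes AS locator_bytes_per_nc_index
FROM (VALUES
        (N'uniqueidentifier (ID, AID or BID)', 16),
        (N'RID (current heap)',                 8),
        (N'datetime (CDate), no duplicates',    8),
        (N'int identity (new column)',          4)
     ) AS c(candidate, key_bytes)
ORDER BY c.key_bytes DESC;
```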

Solution?

There is no black-or-white answer; it has to be decided according to your needs, queries and usage. I would probably not keep the Heap, and I would create a Clustered index. Without knowing the queries you execute on this table, I would either create a Clustered index on a new int identity column or create it on the CDate column converted to datetime2.
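The first option could be sketched as follows. The column name RowNo and the index and constraint names are invented for illustration, and GO is the batch separator used by SSMS and sqlcmd. Creating the clustered index rebuilds the existing nonclustered index, so try this on a copy of the table first.

```sql
-- 1. Add a narrow, ever-increasing surrogate column (existing rows get values too).
ALTER TABLE dbo.test ADD RowNo int IDENTITY(1, 1) NOT NULL;
GO

-- 2. Cluster the table on it; the heap becomes a clustered table.
CREATE UNIQUE CLUSTERED INDEX CIX_test_RowNo ON dbo.test (RowNo);
GO

-- 3. Keep ID as the logical key, enforced by a nonclustered primary key.
ALTER TABLE dbo.test ADD CONSTRAINT PK_test PRIMARY KEY NONCLUSTERED (ID);
GO
```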

It would help to know which queries are executed. That way, good indexes can be created around an efficient Clustered Index.
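If nobody can tell you which queries run, the plan cache can give a first impression. This is a sketch only: the LIKE filter on the table name is a crude assumption that you will need to adapt, and the cache only covers statements executed since their plans were last cached.

```sql
-- Cached statements that mention the table, most expensive (by logical reads) first.
SELECT TOP (20)
       qs.execution_count,
       qs.total_logical_reads,
       qs.last_execution_time,
       st.text AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE N'%dbo.test%'   -- hypothetical filter, adjust to your table name
ORDER BY qs.total_logical_reads DESC;
```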

Both Clustered and Non Clustered indexes also create Statistics. These help the engine when it looks for the best execution plan. Too many, useless or poor indexes, whether Clustered or Non Clustered, will reduce performance.
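The statistics that exist on a table, and when they were last updated, can be listed with a query along these lines (table name dbo.test assumed, as above):

```sql
-- Statistics objects on the table: index statistics plus auto- or user-created ones.
SELECT s.name,
       s.auto_created,
       s.user_created,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.test');
```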
