NVARCHAR column as PRIMARY KEY or as UNIQUE column

Question

I'm developing a SQL Server 2012 database and I have a doubt about nvarchar columns as primary keys.

I have this table:

CREATE TABLE [dbo].[CODES]
(
    [ID_CODE] [bigint] IDENTITY(1,1) NOT NULL,
    [CODE_LEVEL] [tinyint] NOT NULL,
    [CODE] [nvarchar](20) NOT NULL,
    [FLAG] [tinyint] NOT NULL,
    [IS_TRANSMITTED] [bit] NOT NULL DEFAULT 0,
     CONSTRAINT [PK_CODES] PRIMARY KEY CLUSTERED 
    (
        [CODE_LEVEL] ASC,
        [CODE] ASC
    )
)

But now I want to use [CODE] column as primary key and remove [ID_CODE] column.

Is there any issue or penalty if I have a NVARCHAR column as PRIMARY KEY?

[CODE] column value must be unique, so I've thought that I can set an UNIQUE constraint to that column.

Do I have to use [CODE] as primary key or it is better if I set an UNIQUE constraint on [CODE] column?

Solomon Rutzky · Answer 1 · 2018-12-14T21:33:00.623

Yes, absolutely there are negative consequences for using a string instead of a numeric type for a Primary Key, and even more so if that PK is Clustered (which it indeed is in your case). However, the degree to which you see the effect(s) of using a string field is a function of a) how many rows are in this table, and b) how many rows in other tables are Foreign Keyed to this PK. If you only have 10k rows in this table and 100k rows in a few other tables that FK to this table via that field, then maybe it won't be so noticeable. But those effects certainly become more noticeable as row counts increase.

You need to consider that the fields in a Clustered Index are carried over to Non-Clustered Indexes. So you aren't just looking at up to 40 bytes per row, but (40 * some_number) bytes. And in any FK tables you have those same 40 bytes in the row plus more often than not there will be a Non-Clustered index on that field as it is being used in JOINs, so now it is really doubled in any tables that FK to this one. If one is inclined to think that 40 bytes * 1 million rows * 10 copies of it is nothing to be concerned about, please see my article Disk Is Cheap! ORLY? which details all (or at least most) of the areas impacted by this decision.

The other thing to consider is that filtering and sorting on strings, especially when not using a binary Collation (I assume you are using the database default which is typically case-insensitive) is far less efficient (i.e. takes longer) than when using INT / BIGINT. This impacts all queries that filter / join / sort on this field.

Hence, using something like CHAR(5) would probably be OK for a Clustered PK, but mostly if it was also defined with COLLATE Latin1_General_100_BIN2 (or something like that).

And can the value of [CODE] ever change? If yes then that is even more reason to not use it as a PK (even if you do set the FKs to ON UPDATE CASCADE). If it can't or won't ever change that is fine, but still there is more than enough reason already to not use it as a Clustered PK.

Of course, the question might be incorrectly phrased as it appears that you currently already have this field in your PK.

Regardless, your best option, by far, is to use [ID_CODE] as the Clustered PK, use that field in related tables as the FK, and keep [CODE] as a UNIQUE INDEX (which means it is an "alternate key").

Update
A little more info based on this question in a comment on this answer:

Is [ID_CODE], as PRIMARY KEY, the best option if I use [CODE] column to look up the table?

This all depends on a great many factors, some of which I have already mentioned but will restate:

A Primary Key is how the individual row is identified, whether or not it is referenced by any Foreign Keys. How your system internally identifies the row is related to, but not necessarily the same as, how your users identify themselves / that row. Any NOT NULL column with unique data could work, but there are practicality issues to consider, especially if the PK is, in fact, referenced by any FKs. For example GUIDs are unique and some people really like using them for various reasons, but they are quite bad for Clustered Indexes (NEWSEQUENTIALID is better, but not perfect). On the other hand, GUIDs are just fine as alternate keys and used by the app to look up the row, but the JOINs are still done using an INT (or similar) PK.

So far you haven't told us how the [CODE] field fits into the system from all angles, outside of now mentioning that this is how you look up rows, but is that for all queries or just some? Hence:

Regarding the [CODE] value:
- How is it generated?
- Is it incremental or psuedo-random?
- Is it uniform length or varying length?
- What characters are used?
- If using alphabetical characters: is it case-sensitive or insensitive?
- Can it ever change after being inserted?
Regarding this table:
- Do any other tables FK to this table? Or are these fields ([CODE] or [ID_CODE]) used in other tables, even if not explicitly Foreign Keyed?
- If [CODE] is the only field used to get individual rows, then what purpose does the [ID_CODE] field serve? If it isn't used, why have it in the first place (which might depend on the answer to "Can the [CODE] field ever change?")?
- How many rows in this table?
- If other tables to reference this table, how many and how many rows in each of them?
- What are the indexes for this table?

This decision can't be made purely on the question of "NVARCHAR yes or no?". I again will say that generally speaking I do not find it to be a good idea, but there are certainly times when it is fine. Given so few fields in this table it is not likely that there are any more, or at least not many, indexes. So you might be fine either way to have [CODE] as the Clustered Index. And if no other tables reference this table then you might also be fine making it the PK. But, if other tables do reference this table then I would opt for the [ID_CODE] field as the PK, even if Non-Clustered.

score 7 · Answer 2 · answered Feb 11 '15 at 07:55

You have to separate the concepts:

primary key is a design concept, a logical property of the entries in the table. It should be immutable during the lifetime of the table entry, and should be the key used in the application to reference the entry.
clustered index is a storage concept, a physical property. It should be the most common access path for queries, it should serve to satisfy as covering index for most cases, and satisfy as many range queries as possible.

Is not required for the primary key to be the clustered index. You can have ID_CODE as PK and (CODE_LEVEL, CODE) as clustered key. Or the other way around.

A larger clustered key has some negative repercussions, as the wider key means lower density on the index pages and larger size consumed on all non-clustered indexes. there have already been tonnes of ink spilled on this topic, eg. start from More considerations for the clustering key – the clustered index debate continues!.

But the gist of the matter is that the choice of clustered index key is primarily a trade-off. On one hand you have storage size requirements, with general repercussions in performance (larger key -> bigger size -> more IO, and IO bandwidth is probably the most scarce resource you have). On the other hand choosing the wrong clustered key in the name of space savings can have query performance consequences, often worse than the problems resulted from a wide key.

As for the primary key choice, it shouldn't even be an issue: your data model, your app logic, should dictate what the primary key is.

That being said, my 2c: NVARCHAR(20) is not wide. Is a perfectly acceptable clustered key size, even for a large table.

score 4 · Answer 3 · edited Feb 10 '15 at 17:49

I would never allow anyone to make a nvarchar(20) to be a PK in my database. You waste disk space and cache memory. Every index on this table and all FKs to it replicates this wide value. Maybe a char(20) if they can justify it. What kind of data are you trying to store in CODE? Do you really need to store nvarchar characters? I tend to make PKs "internal" values not seen by the users, and I try to keep values that are displayed separate. Displayed values sometimes need changed, which becomes very problematic with PKs+FKs.

Also, do you realize that a 'bigint identity (1,1)' can increment up to 9,223,372,036,854,775,807?

[ID_CODE] [bigint] IDENTITY(1,1)

Unless you are building this database for Google, won't a normal int identity (1,1) with its over 2 Billion limit be enough?

score 3 · Answer 4 · answered Feb 10 '15 at 08:15

There should be no inherent/noticeable penalty other than you risk using wide keys when using nvarchar/varchar if not aware. Especially if you start combining them in composite keys.

But in your example of a (20) length you should be fine and I wouldn't worry much about it. Because if CODE is how you mainly query your data - a clustered index on that sounds very sensible.

However, you should consider whether you actually want it as a primary key or just a unique (clustered) index. There's a (small) difference between the clustered index and the primary key (basically - primary key identifies your data, but the index is how you query data), so if you wish you could just as easily make your ID_Code as a primary key and make a unique clustered index over CODE. (notice: SQL Server will automatically make your Primary Key into a clustered index, unless you've manually created the clustered index yourself)

Also consider whether you actually need ID_Code now you have a unique CODE.

NVARCHAR column as PRIMARY KEY or as UNIQUE column

4 Answers4