Primary Key Int vs String

Question

I have an attributes table, where I want a unique name for each attribute in order to prevent confusion between potentially similar attributes.

In terms of the table design, the way I see it, there are at least two realistic options.

1. Have an auto-increment int PK, with an additional unique constraint on the column that holds the name.
2. Drop the auto-incrementing integer PK and use the name as the primary key; it will then, by definition, have a unique constraint.
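
For concreteness, here is a minimal sketch of the two options. It uses Python's built-in sqlite3 module as a stand-in for MySQL so it runs anywhere, and the table names (`attribute_int`, `attribute_str`) and the `VARCHAR(64)` length are hypothetical, chosen only for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Option 1: surrogate integer key plus a unique constraint on the name.
    CREATE TABLE attribute_int (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        name VARCHAR(64) NOT NULL UNIQUE
    );
    -- Option 2: the name itself is the primary key.
    CREATE TABLE attribute_str (
        name VARCHAR(64) NOT NULL PRIMARY KEY
    );
""")


def duplicate_rejected(table):
    """Insert the same name twice; return True if the second insert fails."""
    conn.execute(f"INSERT INTO {table} (name) VALUES ('product_weight')")
    try:
        conn.execute(f"INSERT INTO {table} (name) VALUES ('product_weight')")
    except sqlite3.IntegrityError:
        return True
    return False


rejected_int = duplicate_rejected("attribute_int")
rejected_str = duplicate_rejected("attribute_str")
```

Both designs reject a duplicate name, so the choice between them is about key size, joins, and how stable the names are, not about uniqueness.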

This attributes table will be joined all over the place; for example, many thousands of products will each have one or more attributes associated with them.

I am wondering if there is any disadvantage to #2 over #1, such as the indexing cost of adding new attributes later, or slower joins. What is the best practice?

I suspect #1 is the better option for performance, but #2 has advantages for readability of the data (i.e. pk:product_weight is more intuitive than pk:2).

The database is MySQL, and I am writing code using the SQLAlchemy/Python ORM abstraction layer, with migrations etc.

score 0 · Answer 1 · answered Sep 09 '19 at 15:08

There is no problem in using the name as a primary key, provided that you never update that column.

With the name as the primary key you need one unique index instead of two, and avoiding unnecessary constraints is a performance gain.

Laurenz Albe

score 0 · Answer 2 · answered Sep 10 '19 at 02:08

Generally speaking, less is more. I don't think this is clear-cut in your case, however.

Let's say your integer takes 4 bytes. With this you can have a few billion unique values. Four case-insensitive ASCII Latin letters, stored in the same 4 bytes, give you only 26^4 = 456,976 unique values. Is this enough?

That will give you values 'aaaa', 'aaab', .. 'zzza' .. 'zzzy', 'zzzz'. However, most people, when they see letters, expect meaning. If your codes / names are to be meaningful you will need significantly longer columns. This has incremental overhead in storage, IO, backup, memory etc. This is not, generally, a problem as long as you don't get silly (names hundreds of characters long) and the server isn't starved of resources to start with.
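
The 4-byte comparison at the start of this answer is easy to check. This is only a back-of-the-envelope sketch; it assumes one byte per character (a single-byte character set) and counts all 2^32 bit patterns of the integer.

```python
# Distinct values that fit in 4 bytes under each encoding.
int_values = 2 ** 32    # every bit pattern of a 4-byte integer
code_values = 26 ** 4   # four case-insensitive ASCII letters, one byte each

# How many letters would a code need to cover the same range as the integer?
letters_needed = 1
while 26 ** letters_needed < int_values:
    letters_needed += 1

print(f"{int_values:,} integers vs {code_values:,} four-letter codes")
print(f"letters needed to match the integer range: {letters_needed}")
```

Even meaningless letter codes need nearly twice the width to match the integer's range, and meaningful names will be longer still.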

System-generated incremental IDs should be internal to the DB only. All interaction with users or external systems should be using the attribute names. So, if you use the int PK, all external queries must join to the attribute table, predicated on the name. Conversely, if the attribute name is the PK & propagated as foreign keys in other tables, the child tables can be queried directly using the externally-provided value without joining the attribute table.
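
To make the difference in query shape concrete, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL. All table and column names (`attribute`, `product_attribute`, `attribute_n`, `product_attribute_n`) and the sample rows are made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Int-keyed design: child rows carry the surrogate id.
    CREATE TABLE attribute (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE product_attribute (
        product_id   INTEGER NOT NULL,
        attribute_id INTEGER NOT NULL REFERENCES attribute(id),
        value        TEXT
    );
    -- Name-keyed design: child rows carry the name itself.
    CREATE TABLE attribute_n (name TEXT PRIMARY KEY);
    CREATE TABLE product_attribute_n (
        product_id     INTEGER NOT NULL,
        attribute_name TEXT NOT NULL REFERENCES attribute_n(name),
        value          TEXT
    );
    INSERT INTO attribute VALUES (1, 'product_colour'), (2, 'product_weight');
    INSERT INTO product_attribute VALUES (10, 2, '1.5kg'), (11, 1, 'red');
    INSERT INTO attribute_n VALUES ('product_colour'), ('product_weight');
    INSERT INTO product_attribute_n VALUES
        (10, 'product_weight', '1.5kg'), (11, 'product_colour', 'red');
""")

name = "product_weight"  # value supplied by the external caller

# Int key: join to the attribute table to translate the name into an id.
via_join = conn.execute(
    """SELECT pa.product_id, pa.value
       FROM product_attribute AS pa
       JOIN attribute AS a ON a.id = pa.attribute_id
       WHERE a.name = ?""",
    (name,),
).fetchall()

# Name key: the child table can be filtered directly, with no join.
direct = conn.execute(
    """SELECT product_id, value
       FROM product_attribute_n
       WHERE attribute_name = ?""",
    (name,),
).fetchall()
```

Both queries return the same rows; the name-keyed version simply skips the lookup through the attribute table.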

So you have int-keyed queries being slightly faster because of shorter tuples, and char-keyed queries being slightly faster because of one fewer join. The overall winner is going to be very workload-dependent.

User-meaningful names have a nasty habit of changing over time. This happens no matter what promises are made initially. New products are introduced, confusingly similar words are altered, new managers want to stamp their authority. While it is possible to cascade primary key changes to child tables, it is a pain.
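
As a sketch of what cascading a key change looks like, the example below renames an attribute under a name-keyed design with `ON UPDATE CASCADE`. It uses Python's sqlite3 module as a stand-in for MySQL (InnoDB also supports `ON UPDATE CASCADE`); the schema and the new name `shipping_weight` are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; schema and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE attribute (name TEXT PRIMARY KEY);
    CREATE TABLE product_attribute (
        product_id     INTEGER NOT NULL,
        attribute_name TEXT NOT NULL
            REFERENCES attribute(name) ON UPDATE CASCADE
    );
    INSERT INTO attribute VALUES ('product_weight');
    INSERT INTO product_attribute VALUES
        (10, 'product_weight'), (11, 'product_weight');
""")

# Renaming the key rewrites every child row that references it.
conn.execute(
    "UPDATE attribute SET name = 'shipping_weight' WHERE name = 'product_weight'"
)

child_names = [
    row[0]
    for row in conn.execute(
        "SELECT attribute_name FROM product_attribute ORDER BY product_id"
    )
]
```

The rename works, but every referencing row in every child table is rewritten, which is where the pain comes from once those tables are large. With an int key, the same rename touches a single row in the attribute table.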

On balance, for your case, I'd go with the int-keyed design. If the names were set by some external authority and stable over time (ISO country codes, planets) I'd likely use the names.
