How to handle primary keys and UUIDs in a database

Question

I've heard advice from several people about using UUIDs as ID in your database. For one this has the benefit of making your URLs unguessable. It also masks how many objects you have in the system. e.g. If your user ID is #412, then you know there are about 412 users in the system.

One way to do this is to use a UUID as the primary key in your database instead of a long or an integer. I've heard this can have performance implications on your database.

I've also heard that you should never expose the primary key to your end users. This is because it exposes an implementation detail that the users don't need to know about. The problem with this is you essentially have two IDs, one public and one private and it adds complexity between swapping between the two.

For instance let's say I have an account object and each account can have multiple users. My user object would need a reference to the foreign key of the accounts table. Which ID do I reference? The public one or the private one? When end users call the API they will POST with the public ID of the account because they don't know the private, primary ID.

Or is this all just too much work and not worth it. I know some companies don't bother and just expose the primary key. Then again I don't want to be Parler. https://www.wired.com/story/parler-hack-data-public-posts-images-video/

score 11 · Answer 1 · edited Oct 07 '24 at 22:53

Lots of questions here,

There are many reason to use a UUID as a PK in your database rather than an auto incremented int. The main reason though is because you can create it before and without calling the database.

This means you can construct a complex object with FKs in your front end, or distributed microservice without any round trips to a DB.
There are many reasons not to expose a db/internally generated PK to external users. But the main reason is that usually this key is meaningless outside of the internal system

By exposing, or using, some other, natural, key you avoid the number of ids expanding with every system the entity is saved to. This doesn't mean you don't need a key of some kind though. It also doesn't mean you should create extra keys that are not used.

For instance let's say I have an account object and each account can have multiple users. My user object would need a reference to the foreign key of the accounts table. Which ID do I reference?

On your accounts table it would be great if you have a natural key. For example an email address is unique and can serve as a PK. You don't need to add an extra UUID or int ID.

Or is this all just too much work and not worth it. I know some companies don't bother and just expose the primary key

You have to expose a unique key of some kind if you have an interface that returns a single record. The key can be:

A natural key like email address
An internally generated key like a UUID or auto inc Int
A composite key, for example firstname + lastname + company

The advice you are reading is saying, don't just add an auto inc int id to everything and expose it. Natural keys are the best, and if the natural key isn't unique maybe there is a composite natural key. If there is no natural key, you don't have to hit the database or some other shared service and expose information with an auto inc id. You can use a random string of some kind.

DavidT · Answer 2 · 2024-10-06T22:13:14.527

At the end of the day a UUID is just a 128 bit integer with:

A special formatting rule used to display it to humans.
A non-sequential generation rule.

Note: there are multiple UUID generation formats and nothing stopping you from inventing your own sequential one (although I would question the value of doing that).

Performance

Generally speaking processing less data (i.e. smaller 32 or 64 bit integers) is quicker than processing more data. However having non sequential data can have some advantages for example, when indexing it means that the majority of new records are not all being inserting at the end of the index.

However there is an open question of whether the extra 8 or 12 bytes per record have a significant impact on overall system performance - frankly you will have to benchmark it and weigh the pro's and con's against the other factors.

Leaking Growth/Size Information

If you use a sequential number for anything and an adversary has the ability to see newly created numbers on some frequent basis, then the adversary is able to identify how many of that thing are being created in a set time frame.

One adversarial use case would be in conjunction with the financial markets for estimating new customer or order growth - however any use case that relies on data growth numbers would also benefit.

Data Sparseness

If you had 10,000 widgets, but store the widget ID in a short (16 bit) integer, even if you distribute the ID's across the entire range (0 - 65535) it would still be practical for an attacker to brute force all of them by iterating through the entire range.

Hence with UUID's it is the combination of both the size of the range (128 bits) and the non sequential generation that provides the sparseness benefit.

Human Entry

Typically businesses use "short" numeric IDs for customer, order or account numbers as it is likely someone will telephone a call center and need to give the number verbally to a representative, who will type it into a computer.

If you expose an API to a third party that also interacts with humans they may need a lookup function, to identify a record by a "Human ID" however after the initial lookup they can probably use a UUID for all future functionality.

Unfortunately these human ID's likely violate the previous two points (Data Growth and Sparseness), however its going to be difficult to avoid that and still keep the ID's small for humans.

Summary

There is no "right" answer here its all trade offs, I typically make my life simple by using UUID's for everything I can:

Primary keys
Foreign keys between tables
ID's in API/URL paths.

Then supplement these with Human ID's where required for business functions.

However your milage may vary - for example you might choose to use a composite key of the OrderID and a simple line number on an OrderLines table.

JimmyJames supports Canada · Answer 3 · 2024-10-07T19:57:18.960

A couple of points of clarification here.

For one this has the benefit of making your URLs unguessable.

You are likely thinking of v4 UUIDs. If you plan to use UUIDs for any purpose, you should familiarize yourself with the various types of UUIDs.

Most versions of UUIDs are not fully random. Some encode information about the system that created them, timestamps, etc. On a side note, using UUIDs as a security feature is likely misguided. They are not designed for that purpose (even v4) so it's basically rolling your own security. Also, the unpredictability of an v4 Id depends on the quality of the PRNG which has some other implications that are discussed below.

I've also heard that you should never expose the primary key to your end users. This is because it exposes an implementation detail that the users don't need to know about. The problem with this is you essentially have two IDs, one public and one private and it adds complexity between swapping between the two.

I agree with the guidance to not expose primary keys. The reason is that once you expose a key, it can constrain your flexibility to make changes to your design or at least complicate them. It shouldn't matter to your clients whether your key is the actual primary key of your table or a key on a table at all. They just need an identifier. Having an Id designed for the client's use and independent of your internal identifiers can help you avoid some really thorny problems. For example, if an account is compromised and you want to issue a new Id, it's usually a lot easier if you don't have to worry about modifying primary keys or internal ids. This is a useful feature, not a burden.

Large numbers

One of the issues with using UUIDs that you don't mention is that as 128-bit numbers, UUIDs have a huge number space. I mean, really extraordinarily large. That's a big part of how they work but creates some challenges. The big one when working with DBs is that most common indexing strategies don't work well with sparse numbers, and this can lead to performance issues. Some of the UUID versions produce values which are particularly bad for many indexing strategies. If you index a UUID, make sure you are using an index that can support them efficiently. Also realize this limits your indexing options.

Another potential issue is generating good random numbers. The fact that UUIDs are such large numbers does not, by itself, preclude collisions. You must also ensure your generator(s) have a good distribution across the available number space. Often this means using a secure PRNG (SPRNG) and those generally require a source or pool of entropy. In a high-volume situation, this pool can be 'drained' and the SPRNG can become a bottleneck for Id generation. If you are using UUIDs generated outside of your system, you also have a risk that the clients do not generate good, random UUIDs.

Another issue with UUIDs is they are really hard to read. If you need to have a user or support team member saying these Ids over the phone, for example, it's a pretty miserable situation. And when comparing UUIDs, I think it's pretty common to look at the beginning and end of the UUID without checking every character. This can create opportunities for social engineering.

While UUIDs are useful in a lot of scenarios, I think they are often oversold as a panacea. They introduce their own issues. You can provide a non-PK Id to your clients without using UUIDs.