38

I generally use auto increment IDs as Primary Keys in databases. I am trying to learn the benefits of using GUIDs. I have read this article: https://betterexplained.com/articles/the-quick-guide-to-guids/

I realise that these GUIDs are used to identify objects at the application level. Are they also stored as the primary key at the database level. For example, say I had the following class:

public class Person
{
public GUID ID;
public string Name;
..

//Person Methods follow
}

Say I wanted to create a new person in memory and then insert the Person into a database. Can I just do this:

Person p1 = new Person();
p1.ID=GUID.NewGUID();
PersonRepository.Insert(p1);

Say I had a database containing millions and millions of rows with a GUID as the Primary Key. Will this always be unique? Am I even understanding GUIDs correctly?

I read this article earlier: http://enterprisecraftsmanship.com/2014/11/15/cqs-with-database-generated-ids/. It confuses me a little as it appears to recommend a happy medium between GUIDs and integers as Primary Keys.

Edit 11/06/18

I have come to believe that Guids are more suitable than ints for my requirements. I am using CQRS more these days and GUIDs fit in more nicely.

I do notice that some developers model the GUIDs as strings in the domain model e.g. here: https://github.com/dotnet-architecture/eShopOnContainers/blob/dev/src/Services/Ordering/Ordering.Domain/AggregatesModel/BuyerAggregate/Buyer.cs - in this case: IdentityGuid is a GUID modelled as a string. Is there any reason to do this other than what is stated here: Use a custom value object or a Guid as an entity identifier in a distributed system?. Is it "normal" to model the GUID as a string or should I be modelling it as a GUID in the model and database?

w0051977
  • 7,139

10 Answers10

43

GUIDs are by definition "Globally Unique IDentifiers". There's a similar but slightly different concept in Java called UUIDs "Universally Unique IDentifiers". The names are interchangeable for all practical use.

GUIDs are central to how Microsoft envisioned database clustering to work, and if you need to incorporate data from sometimes connected sources, they really help prevent data collisions.

Some Pro-GUID Facts:

  • GUIDs prevent key collisions
  • GUIDs help with merging data between networks, machines, etc.
  • SQL Server has support for semi-sequential GUIDS to help minimize index fragmentation (ref, some caveats)

Some Ugliness with GUIDs

  • They are big, 16 bytes each
  • They are out of order, so you can't sort on ID and hope to get the insertion order like you can on auto-increment ids
  • They are more cumbersome to work with, particularly on small data sets (like look up tables)
  • The new GUID implementation is more robust on SQL Server than it is in the C# library (you can have sequential GUIDS from SQL Server, in C# it is random)

GUIDs will make your indexes bigger, so the disk space cost of indexing a column will be higher. Random GUIDs will fragment your indexes.

If you know you aren't going to synchronize data from different networks, GUIDs can carry more overhead than they are worth.

If you have a need to ingest data from sometimes connected clients, they can be a lot more robust for preventing key collisions than relying on setting sequence ranges for those clients.

28

Will this always be unique?

Always? no, not always; it's a finite sequence of bits.

Say I had a database containing millions and millions of rows with a GUID as the Primary Key.

Millions and millions, you are probably safe. A million millions, and the likelihood of a collision becomes significant. There's good news, though: you've already run out of disk space by the time that happens.

Can I just do this?

You can; it's not a wholly good idea. Your domain model shouldn't normally be generating random numbers; they should be inputs to your model.

Beyond that, when you are dealing with an unreliable network, where you might get duplicate messages, a deterministically generated UUID will protect you from having duplicate entities. But if you assign a new random number to each, then you have more work to do to identify the duplication.

See the description of name-based uuid in RFC 4122

Is it "normal" to model the GUID as a string or should I be modelling it as a GUID in the model and database?

I don't think it matters very much. For most of your domain model, it is an identifier; the only query you ask of it is whether or not it is the same as some other identifier. Your domain model won't normally be looking at the in memory representation of an identifier.

If GUID is available as a "primitive type" in your domain agnostic setting, I would use it; it allows the supporting context to choose appropriate optimizations that may be available.

What you should recognize, however, is that the representation of the identifier, both in memory and in storage, is a decision you are making in your implementation, and therefore you should be taking steps to ensure that the foot print of code coupled to that decision is small -- see Parnas 1972.

VoiceOfUnreason
  • 34,589
  • 2
  • 44
  • 83
12

The GUID or UUID will very likely be unique because of how they are generated and they provide a safe way to guarantee uniqueness without having to communicate with a central authority.

Benefits of GUIDs as a Primary Key:

  • You can copy data between different shards of a cluster and not need to worry about PK collisions.
  • It allows you to know your primary key before you have inserted any records.
  • Simplifies the transaction logic for inserting child records.
  • Cannot be easily guessed.

In the example you provided:

Person p1 = new Person();
p1.ID = GUID.NewGUID();
PersonRepository.Insert(p1);

Specifying the GUID before insertion time can save a round trip to the database when inserting successive child records and allow you to commit them in the same transaction.

Person p2 = new Person();
p2.ParentID = p1.ID
PersonRepository.Insert(p2);

Detriments to GUIDs as a Primary Key:

  • They are large 16 bytes meaning they will consume more space as indexes and foreign keys are added.
  • They do not sort well as they are essentially random numbers.
  • Index usage is very, very, very bad.
  • A lot of leaf moving.
  • They are hard to remember.
  • They are hard to verbalize.
  • They can make URL's harder to read.

If your application has no need for sharding or clustering it would be best to stick with a smaller, simpler data types such as int or bigint.

Many databases have their own internal implementations that attempt to mitigate the storage problems caused by GUID's and SQL Server even has a function newsequentialid to help with the ordering of UUID's allowing better usage of indexes and they generally have better performance characteristics.

Additionally, from the perspective of a tester, user, or developer working with the application, using an ID over a GUID will significantly improve communication. Imagine having to read a GUID over a phone.

In the end, unless large scale clustering or obfuscating URLs is a requirement it's more pragmatic to stick with auto-incrementing IDs.

icirellik
  • 429
6
Person p1 = new Person();
p1.ID=GUID.NewGUID();
PersonRepository.Insert(p1);

This is by far the most important reason for using GUIDs.

The fact that you can create a unique id without your code knowing about or communicating with your persistence layer is a huge benefit.

You can be sure that the Person object you just generated on your server, pc phone, laptop, offline device or whatever is unique across all your servers all over the world however distributed.

You can stick it in any kind of database rdb or no-sql, file, send it to any webservice or throw it away immediately as uneeded

No you will never get a collision.

Yes inserts can be slightly slower as the index may need to be fiddled with.

Yes it is bigger than an int.

  • edit. had to shoot off before finishing up.

I know many people feel strongly about auto inc ints and this is a controversial topic with DBAs

But I really cant state strongly enough how superior guids are. You should use guids by default in any application.

auto inc ints have many many flaws

  • You use a No-Sql distributed db. You simply cant talk to all the other instances to find out what the next number is.

  • You use a message queue system. Things need IDs before they hit the db

  • You are creating several items and editing them before saving. Each needs an id before you have hit the db

  • You want to delete and re-insert rows. Make sure you don't count up your auto inc ids and run out!

  • You want not to expose how many Orders you have taken this year to every user

  • You want to move anonymised data from production to test and keep the relationships intact. But not delete all the existing test data.

  • You want to merge your single tenant product into a multi tenanted database but everyone has an order 56.

  • You create objects which are persisted but ephemeral. (incomplete orders) again, dont use up all your ints with stuff that no longer exists.

The list is endless and they are all real problems which happen to people all the time. unlike running out of diskspace because of slightly larger FK cols

Finally the massive issue with ints is you run out of them!!! ok in theory you dont, there are loads. But in practice you do because people don't treat them like random numbers with no meaning. they do things like

  • oh I don't want customers to think we are new. start at 10,000

  • I had to import a load of data so I just upped the seed to 1m so we know what is imported

  • we need category's of data. every period start at the next million so we can use the first digits as a magic number

  • I deleted and reimported all the data again with new ids. Yes even the audit logs.

  • use this number, which is a composite key, as the id of this other thing

Ewan
  • 83,178
5

I'd say no, don't use GUIDs as primary keys. I'm actually dealing with such a DB now, and they are one of the primary causes of performance issues.

The extra 12 bytes add up quickly; remember, most PKs will be FKs in other tables, and just three FKs in a table you now have 48 bytes extra for every row. That adds up in the table and in indexes. It also adds up in disk I/O. Those extra 12 bytes need to be read and written.

And if you're not using sequential guids and the PKs are clustered (which is what happens by default), SQL will from time to time have to move entire pages of data around to squeeze more into the right "spot." For a highly transaction database with lots of inserts, updates and deletes, things bog down fast.

If you need some kind of unique identifier for syncing or something, add a guid column. Just don't make it the PK.

Andy
  • 2,045
3

Always use database generated, auto-incrementing Primary Keys(PKs).

Why use auto-incrementing instead of GUID/UUID?

  • GUID(UUID)s do not prevent key collisions since they are not unique and there is no way to make them unique since they are generated from numerous sources.
  • GUIDs do not help with merging as they greatly increase the already time-consuming merge process with extremely long, non-integer PK and FK columns that take a great deal of time to process. Remember that for most PKs, there will be at least 1 other table with at least 2 keys of the same size: it's own PK and a FK back to the first table. All have to be resolved in a merge.

But how then to handle shards, clusters, etc.?

  • Create multi-column PKs made up of separate columns identifying each shard/cluster/database/whatever that manages it's own auto-incrementing keys. For example...

A 3-column PK for a clustered table might be...

 DB | SH | KEY     |
----|----|---------|
 01 | 01 | 1234567 |

But What About...?

  • Multiple trips to the database -- Most applications have no need to uniquely identify a record being created until it's inserted into the database since that thread/session/whatever is only working on one at a time. If the application really does need this ability, use an application generated temporary PK that is not sent to the database. Let the database then put it's own auto-increment PK on the row when it's inserted. Inserts will use the temporary PK, while updates and deletes will use the permanent PK assigned by the database.

  • Performance -- Computers can process simple integers far faster than anything else because of the greatly larger domain if possible values per element in a GUID(37) vs. an integer (10). Remember too that each character in a GUID must first be converted into a number to be manipulated by the CPU.

Common Misuses of Primary Keys PKs have only one purpose... to absolutely uniquely identify a row in a table. Anything else is an all-too-common misuse.

Detecting Missing Records

  • Missing records cannot be detected by looking at the PKs. Bless QA for at least attempting to ensure data quality. However, they and programmer's lack of understanding of how keys in modern database systems are assigned often leads them to the misbelief that a missing number in an auto-incrementing PK means missing data. It does not because...
  • For performance, database systems allocate blocks of numbers in 'sequences'(batches, ranges) to minimize trips to the actual database in storage. The size of these sequences of numbers is often under the control of the DBA but may not be tunable on a per-table basis.
  • The key takeaway is... unused numbers from these sequences are never returned to the database so there are always gaps in the PK numbers.
  • Why would there be unused numbers you ask? Because a variety of database maintenance actions may cause sequences to be abandoned. These are things like restarts, bulk reloads of tables, some types of restoration from backups and some other operations.

Sorting

  • Sorting by PK is very error-prone since most people will think it lists the rows in the order they were created and that that corresponds to clock time. Mostly, but not necessarilly.
  • Database engines are optimized for maximum performance and that may mean delaying insert of the results of a long-running complicated transaction in order to insert short simple ones, "out-of-turn" so to speak.
2

I realise that these GUIDs are used to identify objects at the application level. Are they also stored as the primary key at the database level.

That's where you should stop, right there, and rethink.

Your database primary key should NEVER have business meaning. It should be meaningless by definition.

So add the GUID as your business key, and a normal primary key (usually a long int) as the database primary key. You can always put a unique index on the GUID to ensure uniqueness.

That's talking database theory of course, but it's good practice as well. I've dealt with databases where the primary keys had business meaning (one customer had thought to save some database resources by using them as employee numbers, customer numbers, etc. etc. for example) and it always leads to trouble.

jwenting
  • 10,099
2

Like anything, there are advantages and disadvantages to doing this:

The Good:

  1. Your keys are always the same length (very large databases can have very large keys)

  2. Uniqueness is pretty much guaranteed - even when you're generating them from a separate system, and/or haven't read the last ID from the database

The Bad:

  1. As mentioned a lot above - larger indexes and data store.

  2. You can't order by ID, you have to order by something else. More indexes, probably less efficient.

  3. They're less human readable. Integers are generally easier to parse, remember and type for people. Using GUIDs as IDs in WHERE clauses across multiple joined tables can make your head melt.

Like everything, use them where appropriate, don't be dogmatic – in many situations auto-incrementing integers are better, occasionally GUIDs are great.

Glorfindel
  • 3,167
Phil S
  • 209
1

Yes you can use GUID as primary key. The down side is size and rapid fragmentation of the index.

Unless you need uniqueness across databases (e.g. a cluster) integer is preferred.

paparazzo
  • 1,927
1

Here's my take on this issue - the solution is a halfway house between GUID and int values, taking the best of both.

The class generates a pseudo random (but increasing over time) Id value, which is similar to a Comb GUID.

The key advantage is that it allows Id values to be generated on the client, rather than using auto-increment values generated on the server (which requires a round trip) with almost zero risk of duplicated values.

The generated values only use 8 bytes rather than 16 for a GUID, and not dependent on one specific database sort order (e.g. Sql Server for GUIDs). The values could be expanded to use the whole unsigned long range, but this would cause issues with any database or other data repository that only has signed integer types.

public static class LongIdGenerator
{
    // set the start date to an appropriate value for your implementation 
    // DO NOT change this once any application that uses this functionality is live, otherwise existing Id values will lose their implied date
    private static readonly DateTime PeriodStartDate = new DateTime(2017, 1, 1, 0, 0, 0, DateTimeKind.Utc);
    private static readonly DateTime PeriodEndDate = PeriodStartDate.AddYears(100);
    private static readonly long PeriodStartTicks = PeriodStartDate.Ticks;
    private static readonly long PeriodEndTicks = PeriodEndDate.Ticks;
    private static readonly long TotalPeriodTicks = PeriodEndTicks - PeriodStartTicks;

    // ensures that generated Ids are always positve
    private const long SEQUENCE_PART_PERMUTATIONS = 0x7FFFFFFFFFFF; 

    private static readonly Random Random = new Random();

    private static readonly object Lock = new object();
    private static long _lastSequencePart;

    public static long GetNewId()
    {
        var sequencePart = GetSequenceValueForDateTime(DateTime.UtcNow);

        // extra check, just in case we manage to call GetNewId() twice before enough ticks have passed to increment the sequence 
        lock (Lock)
        {
            if (sequencePart <= _lastSequencePart)
                sequencePart = _lastSequencePart + 1;

            _lastSequencePart = sequencePart;
        }

        // shift so that the sequence part fills the most significant 6 bytes of the result value
        sequencePart = (sequencePart << 16);

        // randomize the lowest 2 bytes of the result, just in case two different client PCs call GetNewId() at exactly the same time
        var randomPart = Random.Next() & 0xFFFF;

        return sequencePart + randomPart;
    }

    // used if you want to generate an Id value for a historic time point (within the start and end dates)
    // there are no checks, compared to calls to GetNewId(), but the chances of colliding values are still almost zero
    public static long GetIdForDateTime(DateTime dt)
    {
        if (dt < PeriodStartDate || dt > PeriodStartDate)
            throw new ArgumentException($"value must be in the range {PeriodStartDate:dd MMM yyyy} - {PeriodEndDate:dd MMM yyyy}");

        var sequencePart = GetSequenceValueForDateTime(dt.ToUniversalTime());
        var randomPart = Random.Next() & 0xFFFF;
        return ( sequencePart << 16 ) + randomPart;
    }

    // Get a 6 byte sequence value from the specified date time - startDate => 0 --> endDate => 0x7FFFFFFFFFFF
    // For a 100 year time period, 1 unit of the sequence corresponds to about 0.022 ms
    private static long GetSequenceValueForDateTime(DateTime dt)
    {
        var ticksFromStart = dt.ToUniversalTime().Ticks - PeriodStartTicks;
        var proportionOfPeriod = (decimal)ticksFromStart / TotalPeriodTicks;
        var result = proportionOfPeriod * SEQUENCE_PART_PERMUTATIONS;
        return (long)result;
    }

    public static DateTime GetDateTimeForId(long value)
    {
        // strip off the random part - the two lowest bytes
        var timePart = value >> 16;
        var proportionOfTotalPeriod = (decimal) timePart / SEQUENCE_PART_PERMUTATIONS;
        var ticks = (long)(proportionOfTotalPeriod * TotalPeriodTicks);
        var result = PeriodStartDate.AddTicks(ticks);
        return result;
    }
}
Peregrine
  • 1,246