33

Has anybody done any real research on the probability of UUID collisions, especially with version 4 (random) UUIDs, given that the random number generators we use aren't truly random and that we might have dozens or hundreds of identical machines running the same code generating UUIDs?

My co-workers consider testing for UUID collision to be a complete waste of time, but I always put in code to catch a duplicate key exception from the database and try again with a new UUID. But that's not going to solve the problem if the UUID comes from another process and refers to a real object.
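For concreteness, the catch-and-retry approach described above might look like the sketch below. It assumes SQLite through Python's sqlite3 module; the table name `items`, the helper `insert_with_retry`, and the retry limit are hypothetical, and other databases raise their own duplicate-key exceptions.

```python
import sqlite3
import uuid


def insert_with_retry(conn, payload, max_attempts=3, new_id=uuid.uuid4):
    """Insert payload under a fresh UUID key, retrying on a duplicate key."""
    for _ in range(max_attempts):
        key = str(new_id())
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO items (id, payload) VALUES (?, ?)",
                    (key, payload),
                )
            return key
        except sqlite3.IntegrityError:
            # Assumed here to be a primary-key violation: try another UUID.
            continue
    raise RuntimeError("could not generate a unique UUID")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, payload TEXT)")
first = insert_with_retry(conn, "a")
print(first)
```

As the question notes, this only protects inserts made by this process; it does nothing for a UUID that arrives from elsewhere and already refers to a real object.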

Paul Tomblin
  • 1,949

6 Answers

18

Wikipedia has some details:

http://en.wikipedia.org/wiki/Universally_unique_identifier

http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates

But the probability only holds if the bits are perfectly random. However, the RFC https://www.rfc-editor.org/rfc/rfc4122#page-14 linked in the other answer defines this for version 4:

"4.4. [...] The version 4 UUID is meant for generating UUIDs from truly-random or pseudo-random numbers. [...] Set all the other bits to randomly (or pseudo-randomly) chosen values."

This pretty much allows anything from the xkcd random generator http://xkcd.com/221/ to a hardware device using quantum noise. The security considerations in the RFC:

"6. Distributed applications generating UUIDs at a variety of hosts must be willing to rely on the random number source at all hosts. If this is not feasible, the namespace variant should be used."

I read this as: You're on your own. You're responsible for your random generator within your own application, but this and anything else is based on trust. If you don't trust your own ability to correctly understand and use the random generator of your choice, then it is indeed a good idea to check for collisions. If you do not trust the programmer of the other processes, then check for collisions or use a different UUID version.
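As an illustration of "a different UUID version": I take the RFC's "namespace variant" to mean name-based UUIDs (versions 3 and 5), which are derived from a namespace and a name rather than from a random source. A minimal sketch with Python's uuid module; the namespace "example.com" and the names "order/42" and "order/43" are made up for the example.

```python
import uuid

# Hypothetical namespace derived from a DNS name you control.
namespace = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")

a = uuid.uuid5(namespace, "order/42")
b = uuid.uuid5(namespace, "order/42")
c = uuid.uuid5(namespace, "order/43")

print(a == b)     # same namespace and name give the same UUID
print(a == c)     # a different name gives a different UUID
print(a.version)
```

Uniqueness then depends on the names being unique within the namespace, not on the quality of anybody's random number generator.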

Secure
  • 1,928
11

You should certainly detect if a collision occurs, and your application should throw an exception if it does happen. E.g. if the UUID is used as primary key in the database, then the database should throw an error when inserting a colliding ID.

However, I believe that writing code to generate a new UUID and try again after a collision is a waste of time. The chance of a collision occurring is so small that throwing an exception is a perfectly reasonable way of dealing with it.

Remember, it is not only a waste of your own time writing the code, but it also makes the code more complex, making it more difficult for the next person to read, for almost no gain at all.

Pete
  • 9,016
7

This is a very good question. I don't believe it's been adequately considered in the rush to use UUIDs everywhere. I haven't found any solid research.

A suggestion: tread very carefully here, and know your cryptography well. If you use a 128-bit UUID, the 'birthday effect' tells us that a collision is likely after you've generated about 2^64 keys, provided you have 128 bits of entropy in each key.

It is actually rather difficult to ensure that this is the case. True randomness can be generated from (a) radioactive decay (b) random background radio noise, often contaminated unless you're careful (c) suitably chosen electronic noise, e.g. taken from a reverse-biased Zener diode. (I've played with the last, and it works like a charm, BTW).

I wouldn't trust pronouncements like "I haven't seen this in a year of usage" unless the user has generated something approaching 2^64 (i.e. about 10^19) keys and checked them all against one another, which is a non-trivial exercise.

The problem is this. Let's say you have just 100 bits of entropy when comparing your keys against all of the other keys everyone else is generating in a common keyspace. You'll start seeing collisions at about 2^50, i.e. about 10^15, keys. Your chances of seeing a collision if you've populated your database with just 1000 billion (10^12) keys are still negligible. But if you don't check, then you will later get unexpected errors that creep into your peta-row sized database. This could bite hard.
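These figures can be checked with the usual birthday approximation, p ≈ 1 - exp(-n^2 / (2 * 2^b)) for n keys with b bits of entropy each. A rough sketch follows; the function name is mine, and the formula assumes the keys are independent and uniformly distributed.

```python
import math


def collision_probability(n_keys, entropy_bits):
    """Birthday approximation: p ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    exponent = -(n_keys ** 2) / (2.0 * 2.0 ** entropy_bits)
    return -math.expm1(exponent)  # expm1 keeps precision for tiny probabilities


for bits in (128, 122, 100):
    for n in (10 ** 12, 2 ** (bits // 2)):
        p = collision_probability(n, bits)
        print(f"{bits} bits, {n:.3e} keys: p ~= {p:.3e}")
```

At n = 2^(b/2) the exponent is exactly -1/2, so the probability is about 39% regardless of b. That is the sense in which a collision becomes "likely" around the square root of the keyspace.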

The very fact that there are multiple approaches to generating such UUIDs should cause a momentary spasm of concern. When you realise that few generators use 'truly random' processes with sufficient entropy for a type 4 UUID, you should be excessively concerned unless you've carefully examined the entropy content of the generator. (Most people will not do this, or even know how to; you might start with the DieHarder suite). Do NOT confuse pseudorandom number generation with true random number generation.

It's critical that you realise that the entropy you put in is the entropy that you have, and simply perturbing the key by applying a cryptographic function doesn't alter the entropy. It may not be intuitively obvious that if my entire space comprises the digits 0 and 1, the entropy content is the same as that of the following two strings, provided they are the only two options: "This is a really really complex string 293290729382832*!@@#&^%$$),.m}" and "And NOW FOR SOMETHING COMPLETELY DIFFERENT". There are still just two options.
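A tiny demonstration of that point, assuming a deliberately terrible source that can only ever produce two seed values: running it through SHA-256 gives keys that look random, but there are still only two of them.

```python
import hashlib

# Hypothetical low-entropy source: only two possible seed values.
seeds = [b"0", b"1"]

# Hash 1000 draws from that source into 256-bit "keys".
keys = {hashlib.sha256(seeds[i % 2]).hexdigest() for i in range(1000)}

print(len(keys))  # 2
```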

Randomness is tricky to get right, and simply believing that "experts have looked at it, it's therefore OK" may not suffice. Expert cryptographers (and there are few of these who are really proficient) are the first to admit they often get it wrong. We trusted Heartbleed, DigiNotar, etc.

I think Paul Tomblin is exercising appropriate caution. My 2c.

6

The issue you have is that if you use a "random number generator" and you don't know how random that generator is, then the probability of collision is actually unknown. If the random number generators are correlated in some way, the probability of collision may increase dramatically, possibly by many, many orders of magnitude.
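An extreme case of correlation is easy to reproduce: two hosts running the same code that happen to seed an ordinary PRNG with the same value. The sketch below uses Python's random.Random as a stand-in for such a generator; the seed 1234 and the helper name are hypothetical.

```python
import random
import uuid


def uuid4_from(rng):
    """Build a version 4 UUID from the given PRNG (not a secure source)."""
    return uuid.UUID(int=rng.getrandbits(128), version=4)


# Hypothetical: both hosts end up seeding with the same value.
machine_a = random.Random(1234)
machine_b = random.Random(1234)

ids_a = [uuid4_from(machine_a) for _ in range(5)]
ids_b = [uuid4_from(machine_b) for _ in range(5)]

print(ids_a == ids_b)  # True: every "random" UUID collides
```

Here the collision probability is 1, not 2^-122, even though each UUID on its own looks perfectly well formed.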

Even if you have a very small probability of collision, you have a fundamental problem: the probability is NOT 0. This means that, given enough UUIDs, a collision WILL eventually occur; collisions just won't occur very often.

The more frequently you generate and use the UUIDs, the sooner that collision is likely to be seen. (Generating one a year means a longer waiting time than generating a million per second, all other things being equal.)

If that probability is nonzero and unknown, and you use a lot of UUIDs, then you need to consider the consequences of a collision. If it is not acceptable to throw an exception and shut down a business application, then don't do it! (Examples off the top of my head: "It's OK to shut down the web server in the middle of updating a library check-in... it won't happen often" and "It's OK to shut down the payroll system in the middle of doing the pay run". These decisions may be career-limiting moves.)

You may have a worse case though, again depending on your application. If you test for the presence of a UUID (i.e. do a lookup) and create a new record only if one is not already there, which is a common enough kind of thing to do, then you may find you are linking records or making relationships when in fact you are hooking up two things via a UUID that should not be hooked up. This is a case where throwing an exception won't solve anything, and you have an undetectable mess created somewhere. This is the kind of thing that leads to information leakage and can be very embarrassing. (Example: log in to your bank and find you can see the balance of somebody else's account! Bad!)

Summary: you need to consider the way your UUIDs are used, and the consequences of a collision. This determines whether you should take care to detect and avoid collisions, take some simple action in the event of a collision, or do nothing. A simple, single, one-size-fits-all solution is likely to be inappropriate in some circumstances.

quickly_now
  • 15,060
0

There are two issues involved:

  1. Quality of the random number generators that are used.

  2. Number of UUIDs that may be generated.

A "random" UUID has 122 random bits. Assuming perfect randomness, you can expect the first collision at around 2^61 generated UUIDs (that's the square root of 2^122). If everyone on this earth were to generate a UUID per second, that's 10,000,000,000*365*24*60*60 = 315360000000000000 UUIDs per year, which is quite close to 2^58. That is, after a few years you would get the first collisions. Unless your application gets anywhere near those numbers, you can be pretty sure that you won't get a collision if your random generator is of decent quality.
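The arithmetic is easy to verify. A quick check in Python, using the same assumptions as above (10 billion people, one UUID per person per second, 365-day years):

```python
import math

per_year = 10_000_000_000 * 365 * 24 * 60 * 60
print(per_year)             # 315360000000000000
print(math.log2(per_year))  # a little over 58

# Square root of the 2^122 keyspace.
print(math.isqrt(2 ** 122) == 2 ** 61)  # True

# Years needed to generate 2^61 UUIDs at that rate.
years_to_2_61 = 2 ** 61 / per_year
print(years_to_2_61)
```

The last figure comes out between 7 and 8, which matches "after a few years".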

Talking about the random number generator: if you use the standard C library's generators (directly, indirectly, or similar generators), probably seeding them with the time, you are screwed. These can't draw on enough entropy to avoid collisions. However, if you are on Linux, just read 16 bytes of data from /dev/urandom: this draws on an entropy pool that's stirred by the kernel, which has access to some real random events. Unless you typically generate UUIDs really, really early in the boot sequence, /dev/urandom should behave like a true random source.
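A minimal sketch of the 16-bytes-from-the-kernel approach in Python. os.urandom reads from the operating system's random source (on Linux, the same pool that backs /dev/urandom); the helper name is mine. The two masking lines set the version and variant fields, which is why only 122 of the 128 bits end up random.

```python
import os
import uuid


def uuid4_from_urandom():
    """Build a version 4 UUID from 16 bytes of OS-provided randomness."""
    raw = bytearray(os.urandom(16))
    raw[6] = (raw[6] & 0x0F) | 0x40  # version 4 in the high nibble of byte 6
    raw[8] = (raw[8] & 0x3F) | 0x80  # RFC 4122 variant in the top two bits of byte 8
    return uuid.UUID(bytes=bytes(raw))


u = uuid4_from_urandom()
print(u, u.version)
```

In practice there is little reason to hand-roll this in Python: as far as I know, CPython's uuid.uuid4() already draws its bytes from os.urandom.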

-1

I tested this once using a quite simple (brute-force) program that generated 10 million UUIDs, and I didn't see a single collision.

The UUID RFC says that the UUID is not just a bunch of (pseudo) random numbers.

xea
  • 115