9

The Austrian electronic ID card relies on so-called sector identifiers. For example, a hospital identifies a person by obtaining a sectorId for that person, which is computed roughly as follows:

sha1(personalId + "+" + prefix + sectorId); // prefix is constant and irrelevant
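For concreteness, here is a minimal Python sketch of that formula. The function name, the sample IDs, the `"PREFIX"` constant and the UTF-8 encoding are all assumptions made for illustration; the real scheme may encode and format its inputs differently.

```python
import hashlib


def sector_specific_id(personal_id: str, sector_id: str, prefix: str = "PREFIX") -> str:
    """Hypothetical helper mirroring sha1(personalId + "+" + prefix + sectorId)."""
    material = personal_id + "+" + prefix + sector_id
    # SHA-1 works on bytes, so the string has to be encoded first (UTF-8 is assumed).
    return hashlib.sha1(material.encode("utf-8")).hexdigest()


# Made-up sample values: the same person seen from two different sectors.
health = sector_specific_id("1234567890", "HEALTH")
tax = sector_specific_id("1234567890", "TAX")
print(health)
print(tax)
```

The same person gets a different 160-bit identifier (40 hex characters) in each sector, and the same inputs always produce the same identifier.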

Is that a good idea? I think the possibility of collision, no matter how small, poses a risk.

In hash tables, when there's a collision, you have other means of establishing equality, but primary keys must be unique: you can't have two that are identical. That can be circumvented with a composite key, but then the point of a unique sector identifier is lost.

Is it OK to do that, and is there a good way to set it up so that it doesn't break at some point?

svidgen
Bozho

3 Answers

9

This former SO article tells you how to calculate the collision probability. For SHA-1, b (the hash length in bits) is 160. The number of people living in Austria is below 10 million. Even if every person living in Austria is registered in a hospital with a unique person/sector ID, the collision probability is still less than 3.5 x 10^-35. I guess that should be small enough for most practical purposes.
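The figure can be checked with the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)). This is a rough sketch that assumes SHA-1 output behaves like uniformly random 160-bit values; the function name is made up for illustration.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Birthday approximation for n uniformly random values of the given bit length."""
    pairs = n * (n - 1) // 2      # number of distinct pairs that could collide
    x = pairs / 2**bits           # expected number of colliding pairs
    # 1 - exp(-x), computed with expm1 so that tiny values do not round to 0.
    return -math.expm1(-x)


p = collision_probability(10_000_000, 160)
print(p)  # roughly 3.4e-35 for 10 million IDs and a 160-bit hash
```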

Doc Brown
4

Hashes will inevitably collide if the number of possible hash values is smaller than the number of possible inputs.

See this excellent answer: https://softwareengineering.stackexchange.com/a/145633

If primary keys are not supposed to be meaningful (human readable; containing retrievable traits of data), I would just go with GUIDs.

Yes, theoretically they can collide as well, but the heat death of the universe is likely to happen first. See https://stackoverflow.com/a/184897
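As a rough sketch of what the GUID approach could look like: the class below hands out a random version-4 UUID per (person, sector) pair and remembers it. The class name, the method name and the in-memory dict are hypothetical; a real system would keep this mapping in a database table.

```python
import uuid


class SectorIdRegistry:
    """Hypothetical registry mapping (personal ID, sector ID) pairs to random UUIDs."""

    def __init__(self) -> None:
        self._ids: dict[tuple[str, str], uuid.UUID] = {}

    def get_or_create(self, personal_id: str, sector_id: str) -> uuid.UUID:
        key = (personal_id, sector_id)
        if key not in self._ids:
            # uuid4() is random, so the ID reveals nothing about the person or sector.
            self._ids[key] = uuid.uuid4()
        return self._ids[key]


registry = SectorIdRegistry()
first = registry.get_or_create("1234567890", "HEALTH")
again = registry.get_or_create("1234567890", "HEALTH")
other = registry.get_or_create("1234567890", "TAX")
print(first == again, first == other)
```

The trade-off is that a random identifier cannot be recomputed from the inputs, so the mapping has to be stored somewhere.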


EDIT: addressing @DocBrown's counterpoints to clear things up (and to avoid lengthy discussion in comments)

Generating the identifier from the person ID or sector ID was not the OP's requirement (indeed, he said that using GUIDs was what he had suggested himself).

I never claimed GUIDs are suitable as an overall replacement for SHA-1, or for hashing in general (of course they're not). I'm only saying they could be used in this particular case, for uniquely identifying some entities, as this is what they're for by definition.

It was never a requirement that these identifiers must be reconstructible from the data (which is an advantage of hash functions). Please evaluate my answer within the context of the actual question.

1

Using a hash or GUID as a primary key is also a bad idea, because the values are effectively random and so cause index fragmentation and frequent page splits.