15

I know this type of question comes up a lot, but I've yet to read any compelling arguments to help me make this decision. Please bear with me!

I have a huge database - it grows by about 10,000,000 records per day. The data is relational, and for performance reasons I load the table with BULK COPY. Because of this, the keys have to be generated before the rows are loaded, so I cannot rely on an IDENTITY column.

A 64-bit integer - a bigint - is wide enough for me, but to guarantee uniqueness I need a centralised generator to make my IDs. I currently have such a generator service, which lets a service reserve a block of X sequence numbers and guarantees no collisions. However, this makes every service dependent on the one centralised generator, so I'm limited in how I can distribute my system, and I'm not happy about the other dependencies (such as requiring network access) that this design imposes. It has been a problem on occasion.

I'm now considering using sequential GUIDs as my primary keys (generated externally to SQL). As far as I've been able to ascertain from my own testing, the only drawback to these is the disk space overhead of a wider data type (which is exacerbated by their use in indexes). I've not witnessed any discernible slowdown in query performance, compared to the bigint alternative. Loading the table with BULK COPY is slightly slower, but not by much. My GUID-based indexes are not becoming fragmented thanks to my sequential GUID implementation.
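
For illustration, here is a common COMB-style expression in T-SQL that produces this kind of value - my actual generator runs outside SQL and may differ in detail, so treat this as a sketch of the idea only:

    -- Overwrite the last 6 bytes of a random GUID with the current datetime.
    -- Those bytes are the most significant part of SQL Server's
    -- uniqueidentifier sort order, so values generated this way sort
    -- (roughly) ever-increasing and don't fragment the index.
    SELECT CAST(
        CAST(NEWID() AS BINARY(10)) + CAST(GETDATE() AS BINARY(6))
        AS UNIQUEIDENTIFIER) AS SequentialGuid;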

Basically, what I want to know is if there are any other considerations I may have overlooked. At the moment, I'm inclined to take the leap and start using GUIDs. I'm by no means a database expert, so I'd really appreciate any guidance.

Barguast

5 Answers

4

I'm in a similar situation. Currently, I'm using the sequential GUID approach and have no fragmentation and easy key generation.

I have noticed two disadvantages that caused me to start migrating to bigint:

  1. Space usage. A GUID is 8 bytes wider per row than a bigint, and the key appears in every index. Multiply that by 10 or so indexes and you get a huge waste of space.
  2. Columnstore indexes do not support GUIDs.

(2) was the killer for me.

I will now generate my keys like this:

yyMMddHH1234567890

I'll be using a leading date plus hour, with a sequential part after that. That allows me to range-query my data by date without any additional index at all. This is a nice bonus for me.

I'll generate the sequential part of the bigint using a HiLo algorithm, which lends itself well to being distributed - see the sketch below.
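
Here is a minimal sketch of the composition in T-SQL, assuming a hypothetical dbo.HiValue table for block reservation and an assumed block size of 1000 (in practice the lo counter lives in each service, so only the block reservation touches the database):

    -- Hypothetical single-row table holding the last reserved "hi" block.
    CREATE TABLE dbo.HiValue (Hi BIGINT NOT NULL);
    INSERT INTO dbo.HiValue (Hi) VALUES (0);
    GO

    -- Reserve one "hi" block atomically; the caller can then hand out
    -- @blockSize keys locally without further round trips.
    DECLARE @hi BIGINT;
    UPDATE dbo.HiValue SET @hi = Hi = Hi + 1;

    DECLARE @blockSize BIGINT = 1000;   -- assumed block size
    DECLARE @lo BIGINT = 0;             -- local counter, 0 .. @blockSize - 1
    DECLARE @seq BIGINT = @hi * @blockSize + @lo;

    -- Compose the final key: yyMMddHH prefix, then a 10-digit sequential
    -- part (a real implementation would keep @seq within 10 digits).
    DECLARE @scale BIGINT = 10000000000;
    DECLARE @prefix BIGINT = CAST(FORMAT(SYSUTCDATETIME(), 'yyMMddHH') AS BIGINT);
    SELECT @prefix * @scale + @seq AS NewKey;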

Hope some of this transfers to your situation. I definitely recommend using bigint.

usr
3

With the INT type, starting at 1, you get over 2 billion possible rows - that should be more than sufficient for the vast majority of cases. With BIGINT, you get roughly 9.2 quintillion (a 19-digit number - 9'223'372'036 billions) - enough for you??

If you use an INT IDENTITY starting at 1, and you insert a row every second, you need about 68 years before you hit the 2 billion limit ....

If you use a BIGINT IDENTITY starting at 1, and you insert one thousand rows every second, you need a mind-boggling 292 million years before you hit the 9.2 quintillion limit ....

Using your 10 million rows per day, you have enough numbers for roughly 922'337'203'685 days (922 billion days, or a bit over 2.5 billion years) of data - is that good enough for your needs?
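
A quick sanity check of that arithmetic in T-SQL:

    -- How long BIGINT lasts at 10 million keys per day
    SELECT CAST(9223372036854775807 AS BIGINT) / 10000000       AS DaysOfKeys,  -- ~922 billion
           CAST(9223372036854775807 AS BIGINT) / 10000000 / 365 AS YearsOfKeys; -- ~2.5 billion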

Read more about it (with all the options there are) in the MSDN Books Online.

marc_s
2

I recommend you use a SEQUENCE of BIGINT data type in SQL Server 2012. It is much more flexible than IDENTITY, with options like CACHE/NOCACHE, and you can reserve a whole range of sequence values for your batch operation with sp_sequence_get_range.
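
For example (the sequence name here is just a placeholder):

    -- A bigint sequence; CACHE keeps allocation fast at the cost of
    -- possibly losing some unused values after a restart.
    CREATE SEQUENCE dbo.RowKey AS BIGINT
        START WITH 1 INCREMENT BY 1 CACHE 1000;
    GO

    -- Reserve a contiguous block of 1,000,000 values in one call; the
    -- loading service can then assign keys locally before the BULK COPY.
    DECLARE @first SQL_VARIANT;
    EXEC sys.sp_sequence_get_range
         @sequence_name     = N'dbo.RowKey',
         @range_size        = 1000000,
         @range_first_value = @first OUTPUT;

    SELECT CAST(@first AS BIGINT) AS FirstKeyInBlock;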

2

Is the reason you can't use IDENTITY that there are already foreign key relationships between the separate tables you are loading, and no other natural key you could use to link them up when moving from a staging area to the production area? I'd like to know a little more about how the rows are currently "linked" in the source system before you bulk copy. Do multiple source systems simply use their own sequences, with the possibility of conflicting values when brought into a shared database?

The COMB ID/sequential GUID technique is one I am familiar with, and it is workable any time you need global uniqueness assigned outside the database - it gives you a usable row identity both inside and outside the database. For that reason, it's an OK choice in highly-distributed environments or disconnected scenarios.

Skip it if you don't really need it, though, because the extra width is significant as the data grows: these keys end up in every index and in the working sets of a lot of queries.

Also, with generation distributed, rows may not actually arrive in the order of the GUID column, so the usual concerns about the clustered index key (it should be narrow, static, and increasing) still apply: you can get some fragmentation compared to clustering on an IDENTITY.

Cade Roux
0

In general, it's possible to use the OUTPUT clause of the INSERT command to capture the IDENTITY values as rows are inserted, so that related rows in other tables can reference them.
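
For example, with hypothetical Parent/Child/Staging tables linked by a natural key:

    -- Capture each generated IDENTITY value alongside its natural key.
    DECLARE @inserted TABLE (Id BIGINT, NaturalKey VARCHAR(50));

    INSERT INTO dbo.Parent (NaturalKey)
    OUTPUT INSERTED.Id, INSERTED.NaturalKey INTO @inserted
    SELECT DISTINCT NaturalKey FROM dbo.Staging;

    -- Use the captured mapping to relate the child rows.
    INSERT INTO dbo.Child (ParentId, Payload)
    SELECT i.Id, s.Payload
    FROM dbo.Staging AS s
    JOIN @inserted AS i ON i.NaturalKey = s.NaturalKey;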

An identifier based on a timestamp should not be considered reliable - it depends on the system clock, which in turn depends on many things, from the hardware clock to time-synchronization services.

Serg