18

My workplace's database has a pattern that I've not seen before. Every column that is intended to be a key, whether primary or foreign, ends in _SK. This is shorthand for "surrogate key". It appears to be an informal way to tell the developer that said column is safe to use in joins and won't have any type mismatches or unexpected behavior. For example, our table of dates has many columns that represent the date. DATE_PLAIN is the typical SQL DATE variable that shows data ISO style (e.g. 2022-10-30), DATE_VENDOR puts the date in the style that our vendor uses (e.g. 44300... their epoch is weird), and DATE_SK is always an INT that uses the familiar 20221030 format. By reading these column names, the pattern immediately tells the developer that DATE_SK is the one that you want for joins. Someone who uses either of the other two options in joins will run in to type mismatches and trouble (I learned that the hard way, e.g. our vendor inconsistently stored their dates as INT and DECIMAL).

This strikes me as a remarkably good idea, which raises the question of why I've not seen it before. Is it a known anti-pattern?

Christophe
  • 81,699
J. Mini
  • 1,005

3 Answers3

54

The use of surrogate keys is by itself not an anti pattern. It is a way to create a stable primary key for an entity, that will never change and not depend on any application data.

But there are three anti-patterns in the practice you describe:

  • Using a suffix _SK to identify potential surrogate keys is a variant of the Hungarian notation, with all its drawbacks
  • Calling surrogate key DATE_SK that bears value such as 20220531 is misleading since it is in reality a natural key with a special encoding, so just the contrary of a surrogate key that should be completely unrelated to meaning to the data it refers to.
  • Keeping several DATEs columns in a same table with different encodings to refer to the same date seems to be a denormalization, with the risk of inconsistencies.
Christophe
  • 81,699
29

There is nothing wrong in using surrogate keys as primary keys in a consistent way throughout a whole database schema, for each table, with a defined type and a column name suffix like "_SK". There is also nothing wrong in using corresponding foreign keys, with corresponding column names to the primary key columns whenever possible (hence also getting the suffix "_SK"). I have seen this working well in practice for larger data models (with different suffixes, but that does not matter).

However, IMHO it is not a good idea to abuse a standard term like "surrogate key" with a specific meaning for a key which does not fit to that definition - that will sooner or later create confusion. The whole point of a surrogate key is that it is completely artifical with zero domain logic encoded, for example a GUID or some raw ID or ordinal number. Your example for DATE_SK which encodes a date is clearly an example of not being a surrogate key, it is quite the opposite of it, a natural or domain key.

So I would recommend to either call those kind of keys differently, or use real surrogate keys instead.

Doc Brown
  • 218,378
9

First of all, whoever came up with this idea seems to misunderstood what a surrogate key is. A surrogate key is a form of primary key. There are two competing ideas among database administrators about how primary keys should be structured. One philosophy is that of "natural keys". This philosophy says that when your data already has an unique identifier, use it. The other philosophy is that of "surrogate keys". Adherents to this philosophy believe that natural keys are often either not as unique and immutable as you assume them to be, or much longer than required. So you should use an additional ID column as primary key for each table which contains auto-generated values which are guaranteed to be unique, also known as a surrogate key.

Which of those philosophies is correct is besides the point. But fact is that a column which isn't the primary key isn't a surrogate key... at least for this table. It might be the surrogate key of another table. Then it would be a foreign key referencing a surrogate key. But it would not be a surrogate key in itself.

However, a consistent naming convention for columns which contain keys from other tables is actually a good idea. Especially if that column contains a true surrogate key, because a true surrogate key would not allow you to guess what table it could refer to just based on context. I have seen this naming convention before. But usually the name for it is ID. For example, when you have a table customer, that table has a primary key ID containing a surrogate key. When you then have a table order, that table would have a column customer_ID which references customer.ID.

And here we see another advantage of surrogate keys: Because of the synthetic nature of surrogate keys, nobody would ever get the idea to store them in different formats. When your customer.ID is, say, a 8 character hexadecimal string (because someone thought that was a good idea at some time), then it would make zero sense to encode customer_ID as an integer. You just use the format of the column you reference.

What is a bad idea, though, is to have multiple fields which contain the exact same data in different formats. Not only does it increase the size of the data, it also allows inconsistencies. If you do need multiple representations of the same data for convenience, then create a VIEW for the table which projects the one date column onto multiple view columns showing that date in different formats. And when that conversion in real-time eats up too much performance in your particular use-case, then materialize that view.

Philipp
  • 23,488