2

I was following a tutorial on the normal forms of SQL databases, and I got confused landing on this example : https://www.tutorialspoint.com/sql/third-normal-form.htm.

From

CREATE TABLE CUSTOMERS(
   CUST_ID       INT              NOT NULL,
   CUST_NAME     VARCHAR (20)      NOT NULL,
   DOB           DATE,
   STREET        VARCHAR(200),
   CITY          VARCHAR(100),
   STATE         VARCHAR(100),
   ZIP           VARCHAR(12),
   EMAIL_ID      VARCHAR(256),
   PRIMARY KEY (CUST_ID)
);

It creates a new table ADDRESS like this because there is a "transitive dependency between zipcode and address".

CREATE TABLE ADDRESS(
   ZIP           VARCHAR(12),
   STREET        VARCHAR(200),
   CITY          VARCHAR(100),
   STATE         VARCHAR(100),
   PRIMARY KEY (ZIP)
);

This is where I am really confused. Why use the zipcode as a primary key? Primary keys have to be unique, can't you have two addresses with the same zipcode?

Either

  • I don't understand how a zipcode work
  • I don't understand how a primary key work
  • This example is clearly wrong
  • I don't understand something else.
Ricola
  • 289
  • 1
  • 3
  • 7

3 Answers3

15

The example is making a fundamental mistake: it's using data as a primary key. It should create and use unique IDs.

The comments debate how correct it is to assume that a zipcode maps to a particular street. Whether that's correct or not, the simple fact is that for this to work without a unique ID it must be correct, not only now but forever more. That is exactly why it's wrong to do this. You can't possibly know the future. Use a unique ID.

If zipcode uniquely identifies the data you're normalizing now then you have a natural key. But adding just one more record can destroy that. Natural keys can be used when importing data to help build unique ID relations. They should not be used when an application is collecting data from a user that can ensure the relationship is real.

People get these two use cases of structuring data confused all the time. Unique ID's should always be preferred in operational systems. The problem is they don't always exist. When they don't we can construct uniqueness by selecting data fields to be natural keys as we normalize. But that constructed uniqueness is ALWAYS brittle. It might only be true now. It's ok to use the fact that it's true now to generate unique IDs now. But after that new data should be assigned unique ID's properly.

Unique ID's don't erode as more data is added. Natural keys often do. Developers who field systems that insist that their natural key assumptions hold regardless of reality often cause problems that operators find themselves having to work around. Please don't do that.

candied_orange
  • 119,268
4

Using a zip code as a primary key would be incorrect. The author of the tutorial is correct in that tables should be carefully looked at to determine what can be split apart to reduce large quantities of duplicate data in the grand scheme of things. However, a zip code isn't unique as more than one customer may (and most likely would) live in the same zip code. Even if the extra 4 digits the post office uses are added it is still not always unique. The address itself would also not be unique, as more than one resident in a household could be a customer. In fact the best course of action would be to create a unique id for the table, if the address is going to split away like this, as any combination for a composite key would still not remove the possibility of duplicate keys. The tutorialspoint example is not a good example or correct example of 3nF.

The basics to look for in each normalization form are this:

  • 1nF: Make sure there aren't duplicate columns (horizontal).
  • 2nF: break tables up until there is only one purpose. Like breaking apart a customer table that includes orders so that customers and orders are separated.
  • 3nF: "Transitive" basically just means can one column be determined by another without looking at the primary key. Like a customer order table may have the orderID, customer, manufacturer, and product as columns. The product may be determined by the manufacturer, so the product column or vice versa doesn't actually have to rely on the order number. This would be a table where it could be split into two tables where the productID or manufacturer can be the primary key. The new tables primary key would then be used as a foreign key in the order table.
  • 4nF: Make sure there is only on piece of data in each column. Let's use the manufacturer table. If the manufacturer column is the primary key, there could be more than one product in the same row/column. I guess this isn't the best example, but I hope you get the idea. So for 4nF one would make sure that having more than one piece of data, like multiple products, in the same spot doesn't happen.

3nF and 4nF aren't always used by companies as strict standards, but they are good to know and use when possible. Also, like others have mentioned using an ID as a primary key instead of one of the columns can be very useful. For example, instead of making a composite key out of the customers first name, last name, and user name or storing sensitive information like their SSN, an auto-generated ID can be used instead as the primary key.

3

The example is clearly wrong.

It is a good example of over-normalization. Where you by normalizing everything theoretically possible create a solution that does not hold in the future.

Zip codes even those in the US that might look rather permanent (I'm not an expert in US zip codes but I give the author of the tutorial the benefit of the doubt) and may define the street are not a good key for addresses in general.

In some countries it does make sense to have a zip code table to help the user enter an address, but the address needs to be save for each record, not to be referenced through a key.

But streets changes names, are rebuild, split into two zip codes when new housings are built along them etc.

In a little time one of the addresses in your database is going to change, perhaps the street name is changed. Now you edit the entry for the zip code and all the other addresses using that zip code are now wrong and you will later have no idea why, and you no longer have a clue of what the address was before the change.

Where I grew up you could derive the zip code from the first 4 digits of the phone number. If you go by the logic of the author of the tutorial both could not be in the customers table.

Bent
  • 2,596