Suppose I have a table named `agency` with these columns:

- `internal_id` (integer, unique)
- `external_id` (bigint, unique)
- `name`, `location`, `created_at`, ...

`internal_id` and `external_id` are each unique, and both are candidates for the primary key.
There are some other tables (say `A`, `B`, `C`, `D`, `E`) that reference this table, and each of them may contain millions or billions of rows.

Normally I already have the `external_id` at hand when I need to filter the data in `A`, `B`, `C`, `D`, `E`.
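For reference, here is a minimal sketch of the `agency` table as described above, assuming PostgreSQL-style syntax (the non-key column types are placeholders; which of the two unique columns should become the `PRIMARY KEY` is exactly my question):

CREATE TABLE agency (
    internal_id integer NOT NULL UNIQUE,  -- 4 bytes per value
    external_id bigint  NOT NULL UNIQUE,  -- 8 bytes per value
    name        text,                     -- placeholder type
    location    text,                     -- placeholder type
    created_at  timestamptz               -- placeholder type
    -- ...
);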
Which of the following scenarios is the best way to go, considering performance and storage space?
- Use `internal_id` as the primary key in `agency` and as the foreign key in the other tables (a table definition for this option is sketched after this list). Because this column takes 4 bytes of storage, we can save billions of bytes. However, since I normally have the `external_id`, I pay the penalty of an extra `JOIN` on every query:
SELECT A.* FROM A
INNER JOIN agency ON A.internal_id=agency.internal_id
WHERE agency.external_id=5;
- Use `internal_id` as the primary key in `agency` and as the foreign key in the other tables, but avoid the extra `JOIN`: in my application I could first map `external_id` to `internal_id` with a simple query (`SELECT internal_id FROM agency WHERE external_id=5`), and then use the fetched `internal_id` in a second simple query:
SELECT * FROM A
WHERE internal_id=59; -- 59 is the fetched internal_id from the other query
Does this perform better than the `JOIN`, considering the extra round trip between the application and the database? (A single-statement variant of this option is sketched after this list.)
- Forget `internal_id` and use `external_id` as both the primary key and the foreign key, with the penalty of 4 extra bytes per row in each of the other tables (`A`, `B`, `C`, `D`, `E`). That means billions of bytes of additional storage (roughly 4 GB per billion rows in each table, before index overhead) and potentially even slower database operations (because of bigger database files):
SELECT * FROM A
WHERE external_id=5
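For the first scenario, a referencing table would look roughly like this sketch (again PostgreSQL-style; `payload` and `created_at` stand in for the real columns of `A`). The foreign-key column also needs its own index so that filtering or joining on `internal_id` does not scan the whole table:

CREATE TABLE A (
    internal_id integer NOT NULL REFERENCES agency (internal_id),  -- 4-byte FK into agency
    payload     text,                                              -- placeholder column
    created_at  timestamptz                                        -- placeholder column
    -- ...
);

CREATE INDEX a_internal_id_idx ON A (internal_id);  -- supports the WHERE/JOIN on the FK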
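For the second scenario, the two steps can also be collapsed into a single statement with a scalar subquery, which keeps the 4-byte `internal_id` foreign key but avoids the extra application round trip. Whether this is actually faster than the explicit `JOIN` or the two separate queries presumably depends on the planner, so it is just another variant to consider:

SELECT *
FROM A
WHERE internal_id = (SELECT internal_id FROM agency WHERE external_id = 5);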
Update:

- The `agency` table may contain tens of thousands, or at most a few million, rows.
- `internal_id` and `external_id` will not change over time, but other non-identity columns may rarely change.
- There are about 5 to 7 related tables (`A`, `B`, `C`, `D`, `E`, ...), and a few of them may get very large over time, say a few million rows per day (billions of rows over a year).