Suppose I have a table named `agency` with these columns:

- `internal_id` (integer, unique)
- `external_id` (bigint, unique)
- `name`, `location`, `created_at`, ...

`internal_id` and `external_id` are each unique, and both are candidates for the primary key.
There are some other tables (say `A`, `B`, `C`, `D`, `E`) that reference this table, and each of them may contain millions or billions of rows.

Normally I already have the `external_id` at hand when I need to filter the data in `A`, `B`, `C`, `D`, `E`.
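For reference, here is a minimal sketch of the `agency` table as described above, assuming PostgreSQL-style syntax (the non-key column types are placeholders; which of the two unique columns should become the `PRIMARY KEY` is exactly my question):

CREATE TABLE agency (
    internal_id integer NOT NULL UNIQUE,  -- 4 bytes per value
    external_id bigint  NOT NULL UNIQUE,  -- 8 bytes per value
    name        text,                     -- placeholder type
    location    text,                     -- placeholder type
    created_at  timestamptz               -- placeholder type
    -- ...
);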
Which of the following scenarios is the best way to go, considering performance and storage space?
- Use `internal_id` as the primary key in `agency` and as the foreign key in the other tables (a table definition for this option is sketched after this list). Because this column takes 4 bytes of storage, we can save billions of bytes. However, since I normally have the `external_id`, I pay the penalty of an extra `JOIN` on every query:
SELECT A.* FROM A
INNER JOIN agency ON A.internal_id=agency.internal_id
WHERE agency.external_id=5;
- Use `internal_id` as the primary key in `agency` and as the foreign key in the other tables, but avoid the extra `JOIN`: in my application I could first map `external_id` to `internal_id` with a simple query (`SELECT internal_id FROM agency WHERE external_id=5`), and then use the fetched `internal_id` in a second simple query:
SELECT * FROM A
WHERE internal_id=59; -- 59 is the fetched internal_id from the other query
Does this perform better than the `JOIN`, considering the extra round trip between the application and the database? (A single-statement variant of this option is sketched after this list.)
- Forget `internal_id` and use `external_id` as both the primary key and the foreign key, with the penalty of 4 extra bytes per row in each of the other tables (`A`, `B`, `C`, `D`, `E`). That means billions of bytes of additional storage (roughly 4 GB per billion rows in each table, before index overhead) and potentially even slower database operations (because of bigger database files):
SELECT * FROM A
WHERE external_id=5
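For the first scenario, a referencing table would look roughly like this sketch (again PostgreSQL-style; `payload` and `created_at` stand in for the real columns of `A`). The foreign-key column also needs its own index so that filtering or joining on `internal_id` does not scan the whole table:

CREATE TABLE A (
    internal_id integer NOT NULL REFERENCES agency (internal_id),  -- 4-byte FK into agency
    payload     text,                                              -- placeholder column
    created_at  timestamptz                                        -- placeholder column
    -- ...
);

CREATE INDEX a_internal_id_idx ON A (internal_id);  -- supports the WHERE/JOIN on the FK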
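For the second scenario, the two steps can also be collapsed into a single statement with a scalar subquery, which keeps the 4-byte `internal_id` foreign key but avoids the extra application round trip. Whether this is actually faster than the explicit `JOIN` or the two separate queries presumably depends on the planner, so it is just another variant to consider:

SELECT *
FROM A
WHERE internal_id = (SELECT internal_id FROM agency WHERE external_id = 5);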
Update:

- The `agency` table may contain tens of thousands, or at most a few million, rows.
- `internal_id` and `external_id` will not change over time, but other non-identity columns may rarely change.
- There are about 5 to 7 related tables (`A`, `B`, `C`, `D`, `E`, ...), and a few of them may get very large over time, say a few million rows per day (billions of rows over a year).