
I have a table in mind which basically would look like this:

table1
--------
id      bigserial autoincrement primary key,
token1  varchar(32) unique not null,
token2  varchar(32) unique not null,
token3  varchar(32) unique not null
[........some other data ......]

My client web app is built in such a way that I can't always use "id" to retrieve a record. Namely, I need to be able to retrieve a record by one of the "tokens" instead:

 select * from table1 where id = 1;
 select * from table1 where token1 = 'fdsafd';
 select * from table1 where token2 = 'ghgfdhg';
 select * from table1 where token3 = 'rewqrw';

Also, my client web app may need to update the "token(s)", different ones at different times, if a certain condition occurs. Not too often.

I estimate that the table would grow by, at most, 100,000 records per day, evenly distributed throughout the day, with around 20k updates per day.

Questions:

1) Will using these 3 tokens, each 32 characters long, as opposed to using only "id",

and, also,

2) will updating them

worsen the performance significantly?

3) Ways to improve it? All 3 tokens must remain. Should I create a separate table2 for tokens, to which I'd insert tokens only, to avoid having to update them in "table1"?

goja3318

2 Answers


Maintaining those indexes will have a cost. It is not possible to say whether that cost is "too bad" or significant to you. The cost will depend on your RAM and memory settings, on your data size, and on your storage system. Do some benchmarking on your own system (or a clone of it) and see.
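For that benchmarking, `EXPLAIN` with the `ANALYZE` and `BUFFERS` options shows both the timing and the pages touched by a query; a sketch using one of the lookups from the question:

```sql
-- Run against a realistically sized copy of the data.
-- In the BUFFERS output, "shared hit" = found in cache, "read" = fetched from storage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM table1 WHERE token1 = 'fdsafd';
```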

If 100,000 rows arrive evenly throughout the day, any reasonably sized server should be able to cope with this load. You might not want to use an AWS t2.nano. And the roughly 20k updates per day you mention, "not too often", don't seem hard to deal with either.

Assuming all index pages other than the leaf pages are already cached, an update of one random unique token to another random unique token would need to: read one index leaf page (for the old value), read and dirty one table page, and read and dirty one index leaf page (for the new value). It may also need to read and dirty a 2nd table page, either because the new record can't fit on the same page as the old one, or because there is an index entry for the new value already and it needs to verify that that entry is not "live" in violation of the unique constraint.

The WAL records covering all of this will need to be written (most likely fitting into one write request) and flushed. This shouldn't take more than 100ms in the typical case if you have a 5400 RPM hard disk, and substantially less if many of those random reads are satisfied by the caches. In the worst case it could be a lot longer, if someone else is syncing large amounts of data to disk at the same time and you get contention on that.
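You can observe these page reads and dirties for a single token update on your own system; a sketch (the token value is made up, and note that `EXPLAIN ANALYZE` really executes the update, hence the rollback):

```sql
BEGIN;
-- BUFFERS shows heap and index pages hit, read, and dirtied by the update.
EXPLAIN (ANALYZE, BUFFERS)
UPDATE table1
SET    token2 = 'b2c1d9e8f7a6453210fedcba98765432'
WHERE  id = 1;
ROLLBACK;  -- measure the cost without keeping the change
```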

jjanes

First off, as commented, work with the data type uuid for valid UUID values, not varchar(32). Cheaper, safer, cleaner. See:
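If the existing 32-character strings are hexadecimal, the columns can be converted in place; a sketch, assuming every stored value is a valid UUID in hex form (the uuid input accepts 32 hex digits with or without hyphens):

```sql
-- Rewrites the table; take a maintenance window for a big table.
ALTER TABLE table1
  ALTER COLUMN token1 TYPE uuid USING token1::uuid
, ALTER COLUMN token2 TYPE uuid USING token2::uuid
, ALTER COLUMN token3 TYPE uuid USING token3::uuid;
```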

The answer to the main question depends on the complete picture. You provided some good information, much more is relevant:

  • Do these UUIDs naturally belong to the main table? Are they attributes of the entity described in the table?
  • How wide is the row?
  • How much concurrent write activity? Typical read and write patterns?
  • How fast / limited / costly is your storage and your RAM?
  • Ultimately, the sum total of all activity and the outlook how it might develop over time is relevant.

That's why a definitive answer is hard. There are arguments to be made for either design:

  1. Add 3 columns to the main table.
  2. Add a separate 1:1 table with 3 UUIDs per row.
  3. Add a separate 1:n table with 1 UUID per row.

Assuming:

  • There are always exactly 3 UUID values (as indicated by your NOT NULL constraints).
  • The main row is relatively wide.
  • There are lots of updates (you mentioned 20k/day) to the main row and/or the added UUIDs, but hardly ever updates involving both at the same time.
  • You always query one of the three UUID values and know exactly which column to look in. Never in all three of them at once.
  • Postgres 11 or later (needed for INCLUDE clause).

Then I would go with option 2:

CREATE TABLE table1_uuids (
  id      bigint PRIMARY KEY REFERENCES table1
, token1  uuid NOT NULL
, token2  uuid NOT NULL
, token3  uuid NOT NULL
, CONSTRAINT token1_uni UNIQUE (token1) INCLUDE (id)
, CONSTRAINT token2_uni UNIQUE (token2) INCLUDE (id)
, CONSTRAINT token3_uni UNIQUE (token3) INCLUDE (id)
);

Your example lookup would look like this:

SELECT *
FROM   table1
WHERE  id = (SELECT id FROM table1_uuids WHERE token1 = 'xyz');

Basically, the subquery resolves a given token UUID to the PK id, using one of the three indexes that are created internally to implement the UNIQUE constraints. The INCLUDE clauses allow index-only scans.

Or like this:

SELECT t.*
FROM   table1_uuids tu
JOIN   table1 t USING (id)
WHERE  tu.token1 = 'xyz';

Rationale

The split adds 28 bytes of overhead per row, plus the size of the PK index, plus some complications and costs for queries that have to join both tables or insert / update in two tables instead of just one. But it avoids table and index bloat, makes updates faster, and reduces locking contention between updates on just the UUIDs and updates on just the main table. And it speeds up queries on the main table that don't need the additional UUIDs a little (I expect most queries don't).
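The two-table insert can still be a single statement with a data-modifying CTE; a sketch, assuming gen_random_uuid() is available (Postgres 13+, or the pgcrypto extension before that) and a made-up "payload" column standing in for your other data:

```sql
WITH ins AS (
   INSERT INTO table1 (payload)   -- "payload" is hypothetical
   VALUES ('some data')
   RETURNING id
   )
INSERT INTO table1_uuids (id, token1, token2, token3)
SELECT id, gen_random_uuid(), gen_random_uuid(), gen_random_uuid()
FROM   ins;
```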

Since Postgres 11 we can conveniently add id to each UNIQUE constraint with the INCLUDE clause. This allows for very fast index-only scans, without having to add another, redundant index. Related:

One of the reasons I did not go for option 3: there is a UNIQUE constraint on each of the three UUID columns. If you put all of them in a single column, you can't make each of the three UNIQUE on its own. It would have to be a multicolumn constraint including a "token_type" or something similar. More expensive, less elegant.
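For comparison, option 3 might look like this (a sketch; "token_type" is a made-up discriminator column):

```sql
CREATE TABLE table1_tokens (
  id         bigint NOT NULL REFERENCES table1
, token_type int    NOT NULL  -- 1, 2 or 3
, token      uuid   NOT NULL
, PRIMARY KEY (id, token_type)
, CONSTRAINT token_uni UNIQUE (token_type, token) INCLUDE (id)
);
```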

The width of the main row and the frequency of updates to the main rows and to each UUID column are of key importance. Every update has to write a new row version in Postgres' MVCC model (TOASTed values may be excluded). Plus, except for HOT updates, a new index entry in every attached index. See:

This cost scales up with the width of the main row. If that happens a lot and/or rows are wide, your tables and indexes bloat and/or you get a lot of additional work for VACUUM. All bad for performance.
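Note that HOT updates are only possible when no indexed column changes, so they can help with updates to the rest of the main row, not with the token updates themselves. Leaving free space on each heap page raises the chance of HOT updates; a sketch (the value 90 is an assumption to tune by benchmarking):

```sql
-- Leave ~10 % of every heap page free so the new row version can stay on
-- the same page (a prerequisite for HOT updates).
ALTER TABLE table1 SET (fillfactor = 90);
-- Applies to newly written pages; VACUUM FULL would rewrite existing ones,
-- at the cost of an exclusive lock.
```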

With no updates on the UUID columns, option 1 would look a lot better; with a lot more updates, option 3 would.

Erwin Brandstetter