I need to store emails (like 'ivan@email.com') in Postgres 15, be able to search them in case-insensitive manner ('iVaN@email.com', 'IVAN@email.com', etc are the same), and be able to retrieve the original email to use it for actual sending of emails.
The suggested approach for handling case-insensitive data is to use collations:
DROP TABLE IF EXISTS test_collation;
DROP COLLATION IF EXISTS case_insensitive;
CREATE COLLATION case_insensitive (PROVIDER = icu, LOCALE = '@colStrength=secondary', DETERMINISTIC = FALSE);
CREATE TABLE test_collation
(
original_email TEXT COLLATE case_insensitive NOT NULL UNIQUE PRIMARY KEY
);
INSERT INTO test_collation (original_email)
VALUES ('IVAN@email.com');
-- This entry fails as expected as a duplicate:
INSERT INTO test_collation (original_email)
VALUES ('ivAn@email.com');
-- Getting the original email provided by the user regardless of case in which it is entered:
SELECT original_email
FROM test_collation
WHERE original_email = 'iVaN@EmaIl.com';
So far so good, but the documentation states that this approach has its drawbacks in terms of performance. This makes sense as it is challenging to make indexes on non-deterministic data.
This got me thinking: what if we use two columns, one will store the original email, and the second one will store the lower-cased email. To make it fool-proof, the lower-cased email will be a generated column (so it cannot be manually changed), and a primary key at the same time to avoid duplication. Searching on the lower-cased email column can be very efficient as it is deterministic in nature and can use B-trees. Example:
DROP TABLE IF EXISTS test_two_columns;
CREATE TABLE test_two_columns
(
original_email TEXT NOT NULL UNIQUE,
lowered_email TEXT NOT NULL UNIQUE PRIMARY KEY GENERATED ALWAYS AS ( LOWER(original_email) ) STORED
);
INSERT INTO test_two_columns (original_email)
VALUES ('IVAN@email.com');
-- This entry fails as expected as a duplicate:
INSERT INTO test_two_columns (original_email)
VALUES ('ivAn@email.com');
-- Getting the original email provided by the user regardless of case in which it is entered:
SELECT original_email
FROM test_two_columns
WHERE lowered_email = LOWER('iVaN@EmaIl.com');
What are the downsides of such solution except the obvious waste of space for an additional column?