Why not use two columns instead of collation for case insensitive data?

Question

I need to store emails (like 'ivan@email.com') in Postgres 15, be able to search them in case-insensitive manner ('iVaN@email.com', 'IVAN@email.com', etc are the same), and be able to retrieve the original email to use it for actual sending of emails.

The suggested approach for handling case-insensitive data is to use collations:

DROP TABLE IF EXISTS test_collation;
DROP COLLATION IF EXISTS case_insensitive;
CREATE COLLATION case_insensitive (PROVIDER = icu, LOCALE = '@colStrength=secondary', DETERMINISTIC = FALSE);
CREATE TABLE test_collation
(
    original_email TEXT COLLATE case_insensitive NOT NULL UNIQUE PRIMARY KEY
);
INSERT INTO test_collation (original_email)
VALUES ('IVAN@email.com');
-- This entry fails as expected as a duplicate:
INSERT INTO test_collation (original_email)
VALUES ('ivAn@email.com');
-- Getting the original email provided by the user regardless of case in which it is entered:
SELECT original_email
FROM test_collation
WHERE original_email = 'iVaN@EmaIl.com';

So far so good, but the documentation states that this approach has its drawbacks in terms of performance. This makes sense as it is challenging to make indexes on non-deterministic data.

This got me thinking: what if we use two columns, one will store the original email, and the second one will store the lower-cased email. To make it fool-proof, the lower-cased email will be a generated column (so it cannot be manually changed), and a primary key at the same time to avoid duplication. Searching on the lower-cased email column can be very efficient as it is deterministic in nature and can use B-trees. Example:

DROP TABLE IF EXISTS test_two_columns;
CREATE TABLE test_two_columns
(
    original_email TEXT NOT NULL UNIQUE,
    lowered_email  TEXT NOT NULL UNIQUE PRIMARY KEY GENERATED ALWAYS AS ( LOWER(original_email) ) STORED
);
INSERT INTO test_two_columns (original_email)
VALUES ('IVAN@email.com');
-- This entry fails as expected as a duplicate:
INSERT INTO test_two_columns (original_email)
VALUES ('ivAn@email.com');
-- Getting the original email provided by the user regardless of case in which it is entered:
SELECT original_email
FROM test_two_columns
WHERE lowered_email = LOWER('iVaN@EmaIl.com');

What are the downsides of such solution except the obvious waste of space for an additional column?

score 11 · Answer 1 · answered Mar 21 '23 at 08:18

Third option:

CREATE TABLE test
(
    original_email TEXT NOT NULL
);
CREATE UNIQUE INDEX test_email_uniq on test(lower(original_email));
INSERT INTO test (original_email)
VALUES ('IVAN@email.com');
-- This entry fails as expected as a duplicate:
INSERT INTO test (original_email)
VALUES ('ivAn@email.com');
-- Getting the original email provided by the user regardless of case in which it is entered:
SELECT original_email
FROM test
WHERE LOWER(original_email) = LOWER('iVaN@EmaIl.com');

In this case, we do not store an additional field in the table, only the unique index itself. The condition WHERE LOWER(original_email) = lower(?) will use this unique index to speed up this query.

score 7 · Answer 2 · answered Mar 21 '23 at 07:28

The downsides of using a generated column are the wasted space and the processing time it takes to calculate the computed column whenever you modify the row. On the other hand, the index would be faster to modify and search (particularly if you use the C collation).

Both are viable solutions. If good performance is your main goal, I suggest that you benchmark both approaches with realistic data.

score 7 · Accepted Answer · edited Mar 22 '23 at 06:23

I did some testing with all approaches plus the case-sensitive one as a baseline by inserting 1,000,000 entries to each table with random strings of random case of equal length, and here are the results for PostgreSQL 15 in Podman on Macos:

I. Case-sensitive (only for benchmarking baseline) Insert: 1,000,000 rows affected in 11 s 698 ms Select: 0 rows retrieved in 24 ms (execution: 3 ms, fetching: 21 ms)

II. Using case-insensitive collation Insert: 1,000,000 rows affected in 11 s 880 ms Select: 0 rows retrieved in 27 ms (execution: 3 ms, fetching: 24 ms)

III. Lower-index solution by @melkij Insert: 1,000,000 rows affected in 15 s 418 ms Select: 0 rows retrieved in 24 ms (execution: 2 ms, fetching: 22 ms)

IV. Two columns solution Insert: 1,000,000 rows affected in 15 s 627 ms Select: 0 rows retrieved in 23 ms (execution: 1 ms, fetching: 22 ms)

Conclusions: The two columns solution I toyed with is slow, more room for wrong usage in code, and wastes space. The 'index on lowercase' and case-insensitive collation solutions give me comparable results, but collation-based is simpler to use and a little bit faster, so I will use the collation-based solution.

Here is the full code:

--
-- Case-SENSITIVE solution (just as a baseline for benchmarking)
--
DROP TABLE IF EXISTS test_case_sensitive;
CREATE TABLE test_case_sensitive
(
    original_email TEXT NOT NULL UNIQUE PRIMARY KEY
);
INSERT INTO test_case_sensitive (original_email)
VALUES ('IVAN@email.com');
-- This entry SUCCEEDS unlike case-insensitive solutions:
INSERT INTO test_case_sensitive (original_email)
VALUES ('ivAn@email.com');
-- Getting the original email provided by the user:
SELECT original_email
FROM test_case_sensitive
WHERE original_email = 'iVaN@EmaIl.com';
--
-- Case-insensitive solution with collation
DROP TABLE IF EXISTS test_collation;
DROP COLLATION IF EXISTS case_ins;
CREATE COLLATION case_ins (PROVIDER = icu, LOCALE = '@colStrength=secondary', DETERMINISTIC = FALSE);
CREATE TABLE test_collation
(
    original_email TEXT COLLATE case_ins NOT NULL UNIQUE PRIMARY KEY
);
INSERT INTO test_collation (original_email)
VALUES ('IVAN@email.com');
-- This entry fails as expected:
INSERT INTO test_collation (original_email)
VALUES ('ivAn@email.com');
-- Sending an email to exact email provided by the user:
SELECT original_email
FROM test_collation
WHERE original_email = 'iVaN@EmaIl.com';
--
-- Case-insensitive solution with an index on 'lowered' original email
DROP TABLE IF EXISTS test_lower_index;
DROP INDEX IF EXISTS test_email_uniq;
CREATE TABLE test_lower_index
(
    original_email TEXT NOT NULL UNIQUE PRIMARY KEY
);
CREATE UNIQUE INDEX test_email_uniq ON test_lower_index (LOWER(original_email));
INSERT INTO test_lower_index (original_email)
VALUES ('IVAN@email.com');
-- This entry fails as expected as a duplicate:
INSERT INTO test_lower_index (original_email)
VALUES ('ivAn@email.com');
-- Getting the original email provided by the user regardless of case in which it is entered:
SELECT original_email
FROM test_lower_index
WHERE LOWER(original_email) = LOWER('iVaN@EmaIl.com');
--
-- Case-insensitive solution with two columns
DROP TABLE IF EXISTS test_two_columns;
CREATE TABLE test_two_columns
(
    original_email TEXT NOT NULL UNIQUE,
    lowered_email  TEXT NOT NULL UNIQUE PRIMARY KEY GENERATED ALWAYS AS ( LOWER(original_email) ) STORED
);
INSERT INTO test_two_columns (original_email)
VALUES ('IVAN@email.com');
-- This entry fails as expected:
INSERT INTO test_two_columns (original_email)
VALUES ('ivAn@email.com');
-- Sending an email to exact email provided by the user:
SELECT original_email
FROM test_two_columns
WHERE lowered_email = LOWER('iVaN@EmaIl.com');
--
-- Testing
CREATE OR REPLACE FUNCTION random_string(length INTEGER) RETURNS TEXT AS
$$
DECLARE
    chars  TEXT[] := '{A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}';
    result TEXT   := '';
BEGIN
    IF length < 0 THEN
        RAISE EXCEPTION 'Given length cannot be less than 0';
    END IF;
    FOR _ IN 1..length
        LOOP
            result := result || chars[1 + RANDOM() * (ARRAY_LENGTH(chars, 1) - 1)];
        END LOOP;
    RETURN result;
END;
$$ LANGUAGE plpgsql;
TRUNCATE test_case_sensitive;
INSERT INTO test_case_sensitive(original_email)
SELECT random_string(30)
FROM GENERATE_SERIES(1, 1000000) id;
TRUNCATE test_collation;
INSERT INTO test_collation(original_email)
SELECT random_string(30)
FROM GENERATE_SERIES(1, 1000000) id;
TRUNCATE test_lower_index;
INSERT INTO test_lower_index(original_email)
SELECT random_string(30)
FROM GENERATE_SERIES(1, 1000000) id;
TRUNCATE test_two_columns;
INSERT INTO test_two_columns(original_email)
SELECT random_string(30)
FROM GENERATE_SERIES(1, 1000000) id;
SELECT original_email
FROM test_case_sensitive
WHERE original_email = 'iVaN@EmaIl.com';
SELECT original_email
FROM test_collation
WHERE original_email = 'iVaN@EmaIl.com';
SELECT original_email
FROM test_lower_index
WHERE LOWER(original_email) = LOWER('iVaN@EmaIl.com');
SELECT original_email
FROM test_two_columns
WHERE lowered_email = LOWER('iVaN@EmaIl.com');

Why not use two columns instead of collation for case insensitive data?

3 Answers3

-- Case-insensitive solution with collation

-- Case-insensitive solution with an index on 'lowered' original email

-- Case-insensitive solution with two columns

-- Testing