You can try the following - all of the code below is available on the fiddles here (GENERATED column - solution 1) and here (expression index - solution 2).
As per convention, I'm calling your referenced table parent and the other table child with the Ref: string suffixed to the field referencing the PRIMARY KEY of parent.
Given the scenario described, you have (simplified):
Parent:
CREATE TABLE parent
(
x INTEGER PRIMARY KEY, -- presumably IDENTITY or SERIAL...
y TEXT NOT NULL -- dummy field for this example, just the PK as TEXT
-- other fields
--
);
and then populate it with 100 INTs and the dummy text (the dummy is the PK as text concatenated with the string ' PK text' - makes reading results easier):
INSERT INTO parent
SELECT s, s::TEXT || ' PK text' FROM GENERATE_SERIES(1, 100) AS r (s);
Child (initial):
So, your current child table looks something like:
CREATE TABLE child
(
b TEXT, -- i.e. 'RefXXXXX'
--
-- other fields
--
t TEXT -- dummy field for this example
);
Solution 1:
Now, we add the GENERATED field:
ALTER TABLE child ADD COLUMN c INT
GENERATED ALWAYS AS (REPLACE(b, 'Ref:', '')::INT) STORED
and turn the new field into a FOREIGN KEY pointing at the parent's PK:
ALTER TABLE child
ADD CONSTRAINT ch_c_pa_x_fk FOREIGN KEY (c) REFERENCES parent (x);
and add an INDEX for efficient JOINing:
CREATE INDEX ch_c_ix ON child (c);
now, we populate the child table as follows:
INSERT INTO child
SELECT GENERATE_SERIES (1, 100) AS m, n.t::TEXT || ' text part!'
FROM GENERATE_SERIES(1, 5) AS n(t);
What the above does is to have the numbers 1 - 100 cycle through 5 times - so there are 5 records for each record in the parent table - the result of SELECT * FROM child; is:
b t c
1 1 text part! 1
2 1 text part! 2
3 1 text part! 3
4 1 text part! 4
...
... snipped for brevity
...
The Join:
SELECT
pa.x, pa.y, ch.b, ch.c, ch.t
FROM
parent pa
JOIN child ch
ON pa.x = ch.c
WHERE pa.x = 23;
Result:
x y b c t
23 23 PK text 23 23 1 text part!
23 23 PK text 23 23 2 text part!
23 23 PK text 23 23 3 text part!
23 23 PK text 23 23 4 text part!
23 23 PK text 23 23 5 text part!
So, we can see that for each parent record where the PK = 23, we have 5 child records with a b value of 23 (as expected). The inverse also works (WHERE ch.c = 23; - c is the new GENERATED column) in the same way.
Performance:
I decided to test this solution against the solution proposed by @a_horse_with_no_name - i.e. where the JOIN is on the non-sargeable b column expression:
replace(st.the_column, 'Ref:', '')::int -- st.the_column is ch.b in this case!
The same parent table was used, but the child obviously didn't have the GENERATED column and there was no point putting an index on the ch.b field because its non-sargeability.
It should be remembered that with db<>fiddle, I (we) have no idea what else is going on at the same time on the server(s) which sit on that/those box/es, so the results should be taken with a pinch of salt. Having said that, they appear to be very consistent between runs.
all timings are in milliseconds.
on db<>fiddle the number of records is 100 in the parent table and 500 in the child table (5 per parent).
on my home machine1, there are 10,000 parent records and 1,000,000 in the child (100 per parent).
1 - spec. of home machine - 8GB RAM, CPU 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 1.38 GHz, running W10, 228 GB NVMe disk
Note: the important factor here is the difference between the two home queries with a) a relatively large dataset and b) knowledge that there's nothing else going on with the server!
These queries - note different WHERE clauses were run. The times were the same with either WHERE. (I found this post very helpful for timings):
EXPLAIN (ANALYZE, TIMING OFF)
SELECT
pa.x, pa.y, ch.b, ch.c, ch.t
FROM
parent pa
JOIN child ch
ON pa.x = ch.c
WHERE pa.x = 23; -- or WHERE ch.c = 23;
- db<>fiddle - first run 0.070 (vs. 0.450 home) ms and then ~ 0.045 - 0.050 (vs. ~ 0.270 home) on subsequent runs - approximately 5x between the fiddle and home.
However, as discussed above, these comparisons are largely irrelevant - it's the comparison between the queries above and below that is important! A point of note above is that the home test took ~ 5 times as long as the fiddle but has 2,000 times more records!
For these queries (non-sargeable):
EXPLAIN (ANALYZE, TIMING OFF)
SELECT
pa.x, pa.y, ch.b, ch.t
FROM
parent pa
JOIN child ch
ON pa.x = REPLACE(ch.b, 'Ref:', '')::INT
WHERE REPLACE(ch.b, 'Ref:', '')::INT = 23; -- or WHERE pa.x = 23;
- db<>fiddle - first run 0.150 (vs. 150.00 home) ms and then ~ 0.130 (vs. ~ 140.00 home) on subsequent runs.
So, we can see that using the expression is far slower than using a GENERATED column with an index - in the home tests, we have a factor of 140/0.27 ~= 500 fold increase when using the expression compared to the indexed computed column. This is approximately linear (O(n)) with the number of records, which is rarely good with databases.
Interestingly, the difference between the fiddle and home has gone from 5x to 1000x - exactly linear with the number of records!
Solution 2 (fiddle):
However, I did try to use an expression index on the ch.b column, as follows:
CREATE INDEX ch_b_replace_ix_bis ON child (((REPLACE(b, 'Ref:', ''))::INT));
and reran the tests:
And, thanks to the index, the results for the db<>fiddle were approx. 2.5x, but with the home tests, the times with the expression index were approx. 1.5
- 2x those of the
GENERATED column - possibly because it is impossible to create a FOREIGN KEY link using an expression index?
Conclusions:
My choices in order of preference would be:
The reason for my preference is the speed factor and being able to have an FK is "cleaner" IMHO - YMMV! Furthermore Declarative Referential Integrity (DRI) will be enforced which is what RDBMSs are all about!
The expression index's advantage is because all you're doing is creating an index which will (?) take less space (is this even an issue nowadays?) than a new field + an index on that new field in solution 1.
Your choice may well depend on your version - GENERATED columns are only supported from versions >= 12 and expression indexes are supported on versions back to 8 (only versions >= 10 are now supported).
You say in one of your comments:
I won't be able to change this in my case
As the others have done, I would strongly recommend a complete refactoring of the codebase. Your problem looks like a tip-of-the-iceberg scenario and if the issue isn't dealt with now, it'll just come back bigger and nastier to bite you on the ...
You should make the stakeholders aware of this - the design is really appalling and should be fixed - but I understand that sometimes inertia is a very powerful force but tech debt will only increase over time unless paid - "Ya cannae beet th' laws o' physics, Jim".
However, you should, in the first instance, apply one of the solutions above - even add something sneakily if you have to! :-) p.s. welcome to dba.se and +1 for a question that got me thinking!