JOINing tables when PK and FK differ

Question

I'm new to SQL so this might be easy. I'm trying to join two tables where the first has a standard incremented primary key, but the second table has a column with values like "Ref:36736", where 36736 is the primary key from the first table but it's prefixed with "Ref:".

How do I formulate the JOIN? :)

score 3 · Accepted Answer · answered Nov 23 '21 at 08:59

3

You can join on an expression:

select *
from first_table ft
  join second_table st on ft.id = replace(st.the_column, 'Ref:', '')::int

However the better solution would be to fix your broken data model.

answered Nov 23 '21 at 08:59

score 2 · Answer 2 · edited Nov 23 '21 at 12:38

... two tables ... the first has a standard incremented primary key, but the second table has a column with values like "Ref:36736", where 36736 is the primary key from the first table but it's prefixed with "Ref:".

You've embedded a Presentational aspect of your Data into the Data itself and made a [big] problem for yourself in the process.

The second table should have a column of the same Data Type and, since we're tablking Foreign Keys, the same values as the first table. Once you've done this, you can enforce this Referential Integrity using a [proper] Foreign Key.

Add the missing "Ref:" literal back in when you select the Data, or with a View that adds it, or in the client Application itself.

-- Add new foreign key column
alter table2 add column fkey_column int ;
-- Convert values 
update table2 
set fkey_column = cast( substring( ref_column from 4 ) as int );
-- Add Foreign Key Constraint 
alter table2 
add constraint fkey_con foreign key ( fkey_column ) 
references table1 ( p_key );
-- Lose the old column
alter table table2 drop column ref_column ;
-- Retrieve values with leading prefix
select ..., 'Ref:' || fkey_column ref 
from table2 
where ...

Vérace · Answer 3 · 2021-11-25T22:42:21.870

You can try the following - all of the code below is available on the fiddles here (GENERATED column - solution 1) and here (expression index - solution 2).

As per convention, I'm calling your referenced table parent and the other table child with the Ref: string suffixed to the field referencing the PRIMARY KEY of parent.

Given the scenario described, you have (simplified):

Parent:

CREATE TABLE parent
(
  x INTEGER PRIMARY KEY, -- presumably IDENTITY or SERIAL...
  y TEXT NOT NULL        -- dummy field for this example, just the PK as TEXT
  -- other fields
  --
);

and then populate it with 100 INTs and the dummy text (the dummy is the PK as text concatenated with the string ' PK text' - makes reading results easier):

INSERT INTO parent 
SELECT s, s::TEXT || ' PK text' FROM GENERATE_SERIES(1, 100) AS r (s);

Child (initial):

So, your current child table looks something like:

CREATE TABLE child
(
  b TEXT,  -- i.e. 'RefXXXXX'
  --
  -- other fields
  --
  t TEXT  -- dummy field for this example
);

Solution 1:

Now, we add the GENERATED field:

ALTER TABLE child ADD COLUMN c INT
GENERATED ALWAYS AS (REPLACE(b, 'Ref:', '')::INT) STORED

and turn the new field into a FOREIGN KEY pointing at the parent's PK:

ALTER TABLE child
ADD CONSTRAINT ch_c_pa_x_fk FOREIGN KEY (c) REFERENCES parent (x);

and add an INDEX for efficient JOINing:

CREATE INDEX ch_c_ix ON child (c);

now, we populate the child table as follows:

INSERT INTO child
SELECT GENERATE_SERIES (1, 100) AS m, n.t::TEXT || ' text part!'
FROM GENERATE_SERIES(1, 5) AS n(t);

What the above does is to have the numbers 1 - 100 cycle through 5 times - so there are 5 records for each record in the parent table - the result of SELECT * FROM child; is:

b         t         c
1   1 text part!    1
2   1 text part!    2
3   1 text part!    3
4   1 text part!    4 
...
...  snipped for brevity
...

The Join:

SELECT
  pa.x, pa.y, ch.b, ch.c, ch.t
FROM 
  parent pa
JOIN child ch
  ON pa.x = ch.c
WHERE pa.x = 23;

Result:

x        y      b   c       t
23  23 PK text  23  23  1 text part!
23  23 PK text  23  23  2 text part!
23  23 PK text  23  23  3 text part!
23  23 PK text  23  23  4 text part!
23  23 PK text  23  23  5 text part!

So, we can see that for each parent record where the PK = 23, we have 5 child records with a b value of 23 (as expected). The inverse also works (WHERE ch.c = 23; - c is the new GENERATED column) in the same way.

Performance:

I decided to test this solution against the solution proposed by @a_horse_with_no_name - i.e. where the JOIN is on the non-sargeable b column expression:

replace(st.the_column, 'Ref:', '')::int  -- st.the_column is ch.b in this case!

The same parent table was used, but the child obviously didn't have the GENERATED column and there was no point putting an index on the ch.b field because its non-sargeability.

It should be remembered that with db<>fiddle, I (we) have no idea what else is going on at the same time on the server(s) which sit on that/those box/es, so the results should be taken with a pinch of salt. Having said that, they appear to be very consistent between runs.

all timings are in milliseconds.
on db<>fiddle the number of records is 100 in the parent table and 500 in the child table (5 per parent).
on my home machine¹, there are 10,000 parent records and 1,000,000 in the child (100 per parent).

^{1 - spec. of home machine - 8GB RAM, CPU 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 1.38 GHz, running W10, 228 GB NVMe disk}

Note: the important factor here is the difference between the two home queries with a) a relatively large dataset and b) knowledge that there's nothing else going on with the server!

These queries - note different WHERE clauses were run. The times were the same with either WHERE. (I found this post very helpful for timings):

EXPLAIN (ANALYZE, TIMING OFF)
SELECT
  pa.x, pa.y, ch.b, ch.c, ch.t
FROM 
  parent pa
JOIN child ch
  ON pa.x = ch.c
WHERE pa.x = 23;  -- or WHERE ch.c = 23;

db<>fiddle - first run 0.070 (vs. 0.450 home) ms and then ~ 0.045 - 0.050 (vs. ~ 0.270 home) on subsequent runs - approximately 5x between the fiddle and home.

However, as discussed above, these comparisons are largely irrelevant - it's the comparison between the queries above and below that is important! A point of note above is that the home test took ~ 5 times as long as the fiddle but has 2,000 times more records!

For these queries (non-sargeable):

EXPLAIN (ANALYZE, TIMING OFF)
SELECT
  pa.x, pa.y, ch.b, ch.t
FROM 
  parent pa
JOIN child ch
  ON pa.x = REPLACE(ch.b, 'Ref:', '')::INT
WHERE REPLACE(ch.b, 'Ref:', '')::INT = 23; -- or WHERE pa.x = 23;

db<>fiddle - first run 0.150 (vs. 150.00 home) ms and then ~ 0.130 (vs. ~ 140.00 home) on subsequent runs.

So, we can see that using the expression is far slower than using a GENERATED column with an index - in the home tests, we have a factor of 140/0.27 ~= 500 fold increase when using the expression compared to the indexed computed column. This is approximately linear (O(n)) with the number of records, which is rarely good with databases.

Interestingly, the difference between the fiddle and home has gone from 5x to 1000x - exactly linear with the number of records!

Solution 2 (fiddle):

However, I did try to use an expression index on the ch.b column, as follows:

CREATE INDEX ch_b_replace_ix_bis ON child (((REPLACE(b, 'Ref:', ''))::INT));

and reran the tests:

And, thanks to the index, the results for the db<>fiddle were approx. 2.5x, but with the home tests, the times with the expression index were approx. 1.5

2x those of the GENERATED column - possibly because it is impossible to create a FOREIGN KEY link using an expression index?

Conclusions:

My choices in order of preference would be:

have a GENERATED column and put an index on that, or
create an expression index as shown above.

The reason for my preference is the speed factor and being able to have an FK is "cleaner" IMHO - YMMV! Furthermore Declarative Referential Integrity (DRI) will be enforced which is what RDBMSs are all about!

The expression index's advantage is because all you're doing is creating an index which will (?) take less space (is this even an issue nowadays?) than a new field + an index on that new field in solution 1.

Your choice may well depend on your version - GENERATED columns are only supported from versions >= 12 and expression indexes are supported on versions back to 8 (only versions >= 10 are now supported).

You say in one of your comments:

I won't be able to change this in my case

As the others have done, I would strongly recommend a complete refactoring of the codebase. Your problem looks like a tip-of-the-iceberg scenario and if the issue isn't dealt with now, it'll just come back bigger and nastier to bite you on the ...

You should make the stakeholders aware of this - the design is really appalling and should be fixed - but I understand that sometimes inertia is a very powerful force but tech debt will only increase over time unless paid - "Ya cannae beet th' laws o' physics, Jim".

However, you should, in the first instance, apply one of the solutions above - even add something sneakily if you have to! :-) p.s. welcome to dba.se and +1 for a question that got me thinking!