2

In order to generate test data for a Postgres database, I want to update a foreign key in a table, let's say 'users' with random values from the referenced table, let's say 'shift'.

How to do it?

I have tried this way:

UPDATE users
SET id_shift = (
SELECT shift.id_shift
FROM shift
ORDER BY random()
LIMIT 1
);

I expected to have some users in one shift and others in another. But the result is that all users end up with the same shift.

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633

2 Answers2

1

The problem is that your subquery is not “correlated” (it does not reference a column in the outer query). So it is executed only once. You can fake it:

UPDATE users
SET id_shift = (SELECT shift.id_shift
                FROM shift
                /* pointless, but references the outer table */
                WHERE users.id IS NOT NULL
                ORDER BY random()
                LIMIT 1);

Then the subquery is executed for each row in users, and you'll get different rows.

Laurenz Albe
  • 61,070
  • 4
  • 55
  • 90
0

Laurenz' answer fixes the technical issue in your query. I am addressing the issue with your query:

Ordering the whole table shift to pick a single random row is expensive. Doing that for every row in table users can get very expensive. Depending on what kind of "random" you actually need (there are many kinds!), there are much faster ways. Like:

UPDATE users u
SET    id_shift = s.id_shift
FROM  (SELECT row_number() OVER () % (SELECT count(*) FROM shift) + 1 AS rn, id FROM users) u1
JOIN  (SELECT row_number() OVER (ORDER BY random()) AS rn, id_shift FROM shift) s USING (rn)
WHERE  u.id = u1.id;

db-fiddle

This orders the table shift only once. Faster by orders of magnitude. You get a flat, pseudo-random distribution of target values. If there are more users than shifts, we cycle through shifts repeatedly.

The subquery u1 is another instance of the table users, with a generated row number (rn). If there are more users than shifts, we want to re-cycle shift numbers, literally. That's achieved by taking the row number modulo the number of shifts, i.e. % (SELECT count(*) FROM shift). Plus 1 to make it 1-based.

The subquery s orders the target table shift randomly, and also generates a row number (rn).

Now we can join by rn and link back to the updated table u (users) via id.

Related:

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633