How to retrieve the current dataset in a table function with RETURN QUERY

Question

I have an iterative function that returns results via RETURN QUERY statement. I would like to prevent possible duplicate by making sure none of the new results are part of the current dataset being returned. Is there any way to access the current dataset being generated to remove rows that are already included?

CREATE FUNCTION "table_function"
(
    _limit INTEGER,
    _iteration INTEGER
)
RETURNS SETOF "table"
LANGUAGE plpgsql
AS $$
DECLARE 
    _row_count INTEGER;
BEGIN
    WHILE _limit > 0 AND _iteration > 0
    LOOP
        -- Simplified query for example, this query is random and might return duplicate results when executed multiple times
        -- I would like to filter rows from this query that have already been returned
        RETURN QUERY SELECT * FROM "table" LIMIT _limit;
        GET DIAGNOSTICS _row_count := ROW_COUNT;
        _limit := _limit - _row_count;
        _iteration := _iteration - 1;
    END LOOP;
END $$;

This is my real problem: I have to query random rows from a potentially huge table (over millions records). Every rows have to have equal probability of being picked, and I must have exactly N rows picked. I have a working query that executes in almost constant time, but it's not guaranteed to give me N rows, so I have to do multiple iterations in some cases to have N rows. However, the subsequent iterations might give me the same random rows.

Laurenz Albe · Answer 1 · 2022-09-30T18:57:52.243

You could use a function similar to this one:

CREATE FUNCTION get_random_rows(n integer) RETURNS SETOF part
   LANGUAGE plpgsql AS
$$DECLARE
   max_id bigint;
   min_id bigint;
   i integer;
   some_id bigint;
BEGIN
   /* fast with a primary key index */
   SELECT max(id), min(id) INTO max_id, min_id
   FROM part;
/* the rows we already returned */
   CREATE TEMP TABLE ids (id bigint PRIMARY KEY) ON COMMIT DROP;
FOR i IN 1..n LOOP
      LOOP
         some_id := random() * (max_id - min_id) + min_id - 0.5;
     /* avoid duplicates */
     CONTINUE WHEN EXISTS (SELECT FROM ids WHERE id = some_id);

     RETURN QUERY SELECT * FROM part WHERE id = some_id;

     /* repeat if that &quot;id&quot; does not exist */
     EXIT WHEN FOUND;
  END LOOP;

END LOOP;
DROP TABLE ids;
END;$$;

How to retrieve the current dataset in a table function with RETURN QUERY

1 Answers1