I have an iterative function that returns results via RETURN QUERY statement. I would like to prevent possible duplicate by making sure none of the new results are part of the current dataset being returned. Is there any way to access the current dataset being generated to remove rows that are already included?
CREATE FUNCTION "table_function"
(
_limit INTEGER,
_iteration INTEGER
)
RETURNS SETOF "table"
LANGUAGE plpgsql
AS $$
DECLARE
_row_count INTEGER;
BEGIN
WHILE _limit > 0 AND _iteration > 0
LOOP
-- Simplified query for example, this query is random and might return duplicate results when executed multiple times
-- I would like to filter rows from this query that have already been returned
RETURN QUERY SELECT * FROM "table" LIMIT _limit;
GET DIAGNOSTICS _row_count := ROW_COUNT;
_limit := _limit - _row_count;
_iteration := _iteration - 1;
END LOOP;
END $$;
This is my real problem: I have to query random rows from a potentially huge table (over millions records). Every rows have to have equal probability of being picked, and I must have exactly N rows picked. I have a working query that executes in almost constant time, but it's not guaranteed to give me N rows, so I have to do multiple iterations in some cases to have N rows. However, the subsequent iterations might give me the same random rows.