SQL order by query results in any arbitrary (but reproducible) manner

Question

For testing purposes I need to get some data from a set of database tables in any arbitrary (but reproducible) order. The idea being that I can later compare two runs using a textual diff tool. Is there an idiom for that?

E.g. I can obviously do a:

SELECT * FROM table_with_N_columns ORDER BY column_1, ... , column_N

I am just asking if there is an idiomatic way to achieve the same effect (for my purposes) without bothering to list every column in the ORDER BY clause. Any ordering will do so long as it is reproducible with subsequent runs of the query.

Erwin Brandstetter · Answer 1 · 2024-06-09T23:31:24.637

Use PK columns

Sorting by all columns of a table is potentially expensive or sometimes simply impossible. Example: table with a json column or xml, or point or some other data type that cannot be used in ORDER BY, lacking the necessary comparison operators. You get an exception.

If it doesn't fail for you, it still doesn't guarantee a truly stable sort order. Without unique or primary key, completely duplicate rows are allowed, which are still sorted arbitrarily. Two concurrent sessions could both change the "first" in a set of dupes without conflict, resulting in two different rows being changed separately.

Use the primary key column(s) of each table. Unique constraint would work, too, but you really should have a pk on every table to begin with. Besides being safe & correct, it's also typically much faster.

Function

Query the system catalogs pg_constraint and pg_attribute to get columns for a PRIMARY KEY or UNIQUE constraint.

I wrapped the query in a handy SQL function:

CREATE OR REPLACE FUNCTION f_order_cols(_tbl regclass)
  RETURNS text
  LANGUAGE sql AS
$func$
SELECT string_agg(_tbl::text || '.' || quote_ident(attname), ',')
FROM  (
   SELECT conrelid AS attrelid, conkey
   FROM   pg_catalog.pg_constraint
   WHERE  conrelid = $1
   AND    contype = IN ('p', 'u')          -- primary or unique constraint
   ORDER  BY contype                       -- pk has priority ('p' < 'u')
           , array_length(conkey, 1)       -- else, fewest columns as quick tie breaker
   LIMIT  1                                -- ... but any 1 of them is good
   ) c
CROSS  JOIN unnest(c.conkey) attnum        -- unnest array, possibly mult. cols
JOIN   pg_catalog.pg_attribute a USING (attrelid, attnum)
UNION  ALL                                 -- Default if first query returns no row
SELECT _tbl::text || '::text COLLATE "C"'  -- See chapter "Default" below
LIMIT  1
$func$;

This is very fast.
The function takes the table name as parameter, type regclass to be precise. Resolves the name according to the current search_path and raises an exception if invalid. At the same time provides a properly quoted identifier in the result automatically.
Any PRIMARY KEY or UNIQUE constraint is good. PK has priority, else the unique constraint with the fewest columns is picked.
The constraint can consist of multiple columns, attribute numbers of which are stored in the array pg_constraint.conkey. Using an implicit JOIN LATERAL to unnest() that.
Result columns are table-qualified to get syntax ready for queries on multiple tables. This optional feature may have to be modified if you are using table aliases in queries.
Also quote column names correctly to avoid SQL injection. quote_ident() does that for you. See:
- SQL injection in Postgres functions vs prepared queries

Default

If no constraint is found, the function returns NULL. Wrap the result into COALESCE and use the text representation of the whole row as default: tbl::text.

This is not going to sort rows like original values would, but you get a consistent sort order that works for every table. Since you are going to compare text representations anyway.
Better still, add COLLATE "C" to ignore collation rules. Faster, and just as good for your purpose.

ORDER BY tbl::text COLLATE "C"

Still a potentially very inefficient measure of last resort. Use PKs if at all possible. I put the default into the function above. Everything put together, for a query:

SELECT * FROM u JOIN p USING (a) JOIN n ON n.b = p.b ORDER BY ?

This returns the complete string for ?:

SELECT concat_ws(', ', f_order_cols('u')
                     , f_order_cols('p')
                     , f_order_cols('n')) AS order_by;

_{Old sqlfiddle}

score 4 · Accepted Answer · answered Sep 06 '14 at 16:13

without bothering to list every column in the ORDER BY clause

For just that goal, ORDER BY tablename.* does work.

If there are several tables involved, we can always write:

SELECT * FROM (select ... complex query...) as T order by T.*;

If the resultset has many wide rows making the sort inefficient, sorting a hashed representation instead might be a better idea:

SELECT * FROM (select ... complex query...) as T order by md5(T.*::text);

SQL order by query results in any arbitrary (but reproducible) manner

2 Answers2

Use PK columns

Function

Default

Linked