Is there a way to SELECT n ON (like DISTINCT ON, but more than one of each)

Question

I have a table us_customers that looks like this (with hundreds of thousands of rows):

+----------+----------+
|    id    | us_state |
+----------+----------+
| 12345678 | MA       |
| 23456781 | AL       |
| 34567812 | GA       |
| 45678123 | FL       |
| 56781234 | AZ       |
| 67812345 | MA       |
| 78123456 | CO       |
| 81234567 | FL       |
+----------+----------+

... and I want to SELECT a sample of n customers from each us_state.

Is there a way to do that cleanly in PostgreSQL 9.3?

I can get one customer from each us_state easily with:

SELECT DISTINCT ON (us_state) id
FROM us_customers
ORDER BY us_state;

But if I want, say, three customers from each state, is there a way I can do this without running the same query multiple times?

Julien Vavasseur · Accepted Answer · 2016-03-25T18:26:33.400

You can number and order id by us_state using the ROW_NUMBER() Window Function and only keep the n first values:

SELECT * 
FROM (
  SELECT *
    , ROW_NUMBER() OVER(PARTITION BY us_state ORDER BY id) as n
  FROM data
) as ord
WHERE n <= 2
ORDER BY us_state
;

Or you can CROSS JOIN with a subquery:

SELECT l.*
FROM (
  SELECT DISTINCT us_state FROM data
) as s
CROSS JOIN LATERAL (
  SELECT * 
  FROM data d
  WHERE d.us_state = s.us_state
  ORDER BY id
  LIMIT 2
) as l
ORDER BY l.us_state
;

Sample SQL Fiddle here
I used a small sample of 1 to 3 rows per state. Therefore I limit only to 2 values
I ordered it by ids but you can change that and order by whatever works best for you

Output with my small sample:

       id | us_state | n
      123 |       AL | 1 
      456 |       AL | 2 
 56781234 |       AZ | 1 
 78123456 |       CO | 1 
 45678123 |       FL | 1 
 81234567 |       FL | 2 
 34567812 |       GA | 1 
      123 |       MA | 1 
      456 |       MA | 2

Note that n is the result of ROW_NUMBER and doesn't exist in the second query. On a big table, an index on the partition (us-state) and order (id here) columns will help.

Sample used:

CREATE TABLE data
    ("id" int, "us_state" varchar(2))
;

INSERT INTO data
    ("id", "us_state")
VALUES
    (12345678, 'MA'),
    (123, 'MA'),
    (456, 'MA'),
    (23456781, 'AL'),
    (123, 'AL'),
    (456, 'AL'),
    (34567812, 'GA'),
    (45678123, 'FL'),
    (56781234, 'AZ'),
    (67812345, 'MA'),
    (78123456, 'CO'),
    (81234567, 'FL')
;

Erwin Brandstetter · Answer 2 · 2021-11-09T22:12:52.650

With `us_states` table

If you don't have another table with a complete set of possible values for us_state then create it:

CREATE TABLE us_states (
  us_state varchar(2) PRIMARY KEY
  -- ... more columns?
);

For one, you may want to add a FK constraint on us_customers.us_state to enforce legal values. But more to the point, it allows for a substantially faster query:

SELECT u.us_state, d.id  -- more columns?
FROM   us_states u
LEFT   JOIN LATERAL (
   SELECT id             -- more columns?
   FROM   us_customers c
   WHERE  c.us_state = u.us_state
   LIMIT  3              -- 3 customers per state
   ) d ON true
ORDER  BY u.us_state, d.id;

Extracting 50 distinct values from "hundreds of thousands of rows" with DISTINCT every time can be more expensive than the rest of the query and would be a big waste of time.

While it's undefined which rows to pick, an arbitrary selection is good enough. So no ORDER BY in he subquery, that's cheaper.

Make that a LEFT JOIN to include every state at least once, even with no matching rows in us_customers.

Without `us_states` table

If you don't have a us_states table, there are still substantially faster ways than with the techniques suggested by @Julien. Emulate a loose index scan with a CTE. You need the index outlined below.

WITH RECURSIVE us_states AS (
   (  -- parentheses required
   SELECT us_state
   FROM   us_customers
   ORDER  BY 1
   LIMIT  1
   )
UNION ALL
   SELECT (SELECT c.us_state 
           FROM   us_customers c
           WHERE  c.us_state > u.us_state
           ORDER  BY 1
           LIMIT  1)                    -- correlated subquery
   FROM   us_states u
   WHERE  u.us_state IS NOT NULL
   )
SELECT us_state
FROM   us_states
WHERE  us_state IS NOT NULL;

This query can be a drop-in replacement for the missing us_states table, or it can be used to create the missing table.

WITH RECURSIVE us_states AS (
   (  -- parentheses required
   SELECT us_state
   FROM   us_customers
   ORDER  BY us_state
   LIMIT  1
   )
UNION ALL
   SELECT (SELECT c.us_state
           FROM   us_customers c
           WHERE  c.us_state > u.us_state
           ORDER  BY 1
           LIMIT  1)
   FROM   us_states u
   WHERE  u.us_state IS NOT NULL
   )

SELECT u.us_state, c.id  -- more columns?
FROM   us_states u
CROSS  JOIN LATERAL (
   SELECT c.id           -- more columns?
   FROM   us_customers c
   WHERE  c.us_state = u.us_state  -- eliminates NULL value from CTE
   LIMIT  3              -- 3 customers per state
   ) c
ORDER  BY u.us_state;

Using CROSS JOIN this time since the CTE only found existing values anyway and we don't need to add WHERE us_state IS NOT NULL this way.

Or, if you can guarantee there are at least 3 rows per us_state, this would be very fast:

WITH RECURSIVE us_states AS (
   (
   SELECT us_state, id
   FROM   us_customers3
   ORDER  BY us_state
   LIMIT  3
   )
UNION ALL
   SELECT c.*
   FROM  (SELECT us_state FROM us_states LIMIT 1) u
   ,      LATERAL (
      SELECT c.us_state, c.id
      FROM   us_customers3 c
      WHERE  c.us_state > u.us_state
      ORDER  BY c.us_state
      LIMIT  3
      ) c
   )
TABLE us_states
ORDER BY us_state;

The last query fails if any state has less than three rows.

Index

Either way, create a multicolumn index like this!

CREATE INDEX data_covering_idx ON data (us_state, id);

db<>fiddle here
_{Old sqlfiddle}

Is there a way to SELECT n ON (like DISTINCT ON, but more than one of each)

2 Answers2

With us_states table

Without us_states table

Index

With `us_states` table

Without `us_states` table