8

I have a table us_customers that looks like this (with hundreds of thousands of rows):

+----------+----------+
|    id    | us_state |
+----------+----------+
| 12345678 | MA       |
| 23456781 | AL       |
| 34567812 | GA       |
| 45678123 | FL       |
| 56781234 | AZ       |
| 67812345 | MA       |
| 78123456 | CO       |
| 81234567 | FL       |
+----------+----------+

... and I want to SELECT a sample of n customers from each us_state.

Is there a way to do that cleanly in PostgreSQL 9.3?

I can get one customer from each us_state easily with:

SELECT DISTINCT ON (us_state) id
FROM us_customers
ORDER BY us_state;

But if I want, say, three customers from each state, is there a way I can do this without running the same query multiple times?

Shaun Scovil
  • 341
  • 1
  • 5
  • 11

2 Answers2

10

You can number and order id by us_state using the ROW_NUMBER() Window Function and only keep the n first values:

SELECT * 
FROM (
  SELECT *
    , ROW_NUMBER() OVER(PARTITION BY us_state ORDER BY id) as n
  FROM data
) as ord
WHERE n <= 2
ORDER BY us_state
;

Or you can CROSS JOIN with a subquery:

SELECT l.*
FROM (
  SELECT DISTINCT us_state FROM data
) as s
CROSS JOIN LATERAL (
  SELECT * 
  FROM data d
  WHERE d.us_state = s.us_state
  ORDER BY id
  LIMIT 2
) as l
ORDER BY l.us_state
;
  • Sample SQL Fiddle here
  • I used a small sample of 1 to 3 rows per state. Therefore I limit only to 2 values
  • I ordered it by ids but you can change that and order by whatever works best for you

Output with my small sample:

       id | us_state | n
      123 |       AL | 1 
      456 |       AL | 2 
 56781234 |       AZ | 1 
 78123456 |       CO | 1 
 45678123 |       FL | 1 
 81234567 |       FL | 2 
 34567812 |       GA | 1 
      123 |       MA | 1 
      456 |       MA | 2 

Note that n is the result of ROW_NUMBER and doesn't exist in the second query. On a big table, an index on the partition (us-state) and order (id here) columns will help.

Sample used:

CREATE TABLE data
    ("id" int, "us_state" varchar(2))
;

INSERT INTO data
    ("id", "us_state")
VALUES
    (12345678, 'MA'),
    (123, 'MA'),
    (456, 'MA'),
    (23456781, 'AL'),
    (123, 'AL'),
    (456, 'AL'),
    (34567812, 'GA'),
    (45678123, 'FL'),
    (56781234, 'AZ'),
    (67812345, 'MA'),
    (78123456, 'CO'),
    (81234567, 'FL')
;
Julien Vavasseur
  • 10,180
  • 2
  • 28
  • 47
3

With us_states table

If you don't have another table with a complete set of possible values for us_state then create it:

CREATE TABLE us_states (
  us_state varchar(2) PRIMARY KEY
  -- ... more columns?
);

For one, you may want to add a FK constraint on us_customers.us_state to enforce legal values. But more to the point, it allows for a substantially faster query:

SELECT u.us_state, d.id  -- more columns?
FROM   us_states u
LEFT   JOIN LATERAL (
   SELECT id             -- more columns?
   FROM   us_customers c
   WHERE  c.us_state = u.us_state
   LIMIT  3              -- 3 customers per state
   ) d ON true
ORDER  BY u.us_state, d.id;

Extracting 50 distinct values from "hundreds of thousands of rows" with DISTINCT every time can be more expensive than the rest of the query and would be a big waste of time.

While it's undefined which rows to pick, an arbitrary selection is good enough. So no ORDER BY in he subquery, that's cheaper.

Make that a LEFT JOIN to include every state at least once, even with no matching rows in us_customers.

Without us_states table

If you don't have a us_states table, there are still substantially faster ways than with the techniques suggested by @Julien. Emulate a loose index scan with a CTE. You need the index outlined below.

WITH RECURSIVE us_states AS (
   (  -- parentheses required
   SELECT us_state
   FROM   us_customers
   ORDER  BY 1
   LIMIT  1
   )

UNION ALL SELECT (SELECT c.us_state FROM us_customers c WHERE c.us_state > u.us_state ORDER BY 1 LIMIT 1) -- correlated subquery FROM us_states u WHERE u.us_state IS NOT NULL ) SELECT us_state FROM us_states WHERE us_state IS NOT NULL;

This query can be a drop-in replacement for the missing us_states table, or it can be used to create the missing table.

WITH RECURSIVE us_states AS (
   (  -- parentheses required
   SELECT us_state
   FROM   us_customers
   ORDER  BY us_state
   LIMIT  1
   )

UNION ALL SELECT (SELECT c.us_state FROM us_customers c WHERE c.us_state > u.us_state ORDER BY 1 LIMIT 1) FROM us_states u WHERE u.us_state IS NOT NULL )
SELECT u.us_state, c.id -- more columns? FROM us_states u CROSS JOIN LATERAL ( SELECT c.id -- more columns? FROM us_customers c WHERE c.us_state = u.us_state -- eliminates NULL value from CTE LIMIT 3 -- 3 customers per state ) c ORDER BY u.us_state;

Using CROSS JOIN this time since the CTE only found existing values anyway and we don't need to add WHERE us_state IS NOT NULL this way.

Or, if you can guarantee there are at least 3 rows per us_state, this would be very fast:

WITH RECURSIVE us_states AS (
   (
   SELECT us_state, id
   FROM   us_customers3
   ORDER  BY us_state
   LIMIT  3
   )

UNION ALL SELECT c.* FROM (SELECT us_state FROM us_states LIMIT 1) u , LATERAL ( SELECT c.us_state, c.id FROM us_customers3 c WHERE c.us_state > u.us_state ORDER BY c.us_state LIMIT 3 ) c ) TABLE us_states ORDER BY us_state;

The last query fails if any state has less than three rows.

Index

Either way, create a multicolumn index like this!

CREATE INDEX data_covering_idx ON data (us_state, id);

db<>fiddle here
Old sqlfiddle

Related:

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633