PostgreSQL - Create table as select with distinct on specific columns

Question

I need a script to create a table . The problem is my select for creating the table contains equal values in some rows. I need to use distinct for these columns. The other columns' values can come from any matching row.

My current result table has data like this:

|   CITY   |   STREET   |   STREET_NUM   |   VAL_X   |   VAL_Y   |
------------------------------------------------------------------
| CityA    | Street abc |   5            | 11.5      |   0.5     |
| CityA    | Street abc |   5            | 15.4      |   1.8     |
| CityA    | Street abc |   5            | 12.4      |   2.8     |
| CityB    | Street xyz |   18           |  5.4      |   1.9     |
| CityB    | Street xyz |   18           |  8.4      |   1.1     |
| CityC    | Street klm |   55           |  9.6      |   0.8     |

But I need data like this:

|   CITY   |   STREET   |   STREET_NUM   |   VAL_X   |   VAL_Y   |
------------------------------------------------------------------
| CityA    | Street abc |   5            | 11.5      |   0.5     |
| CityB    | Street xyz |   18           |  5.4      |   1.9     |
| CityC    | Street klm |   55           |  9.6      |   0.8     |

For columns city, street and street_num I need to apply distinct. val_x and val_y should be used anyone, for example first of that group with same city, street and street_num.

Can you give me advice how to edit this script?

joanolo · Accepted Answer · 2017-03-12T20:31:40.947

"VAL_X" and "VAL_Y" chosen through some aggregate function

You should consider using GROUP BY for the columns whose values you consider that should be "distinct" (as a group), and, for the rest of columns, choose an appropriate aggregate function (for instance, MIN):

CREATE TABLE my_result AS 
SELECT
  city, street, streetnum, min(val_x) AS val_x, min(val_y) AS val_y
FROM
  tableA
WHERE
  true /* your condition goes here */ 
GROUP BY
  city, street, streetnum

If you need to put together values from several tables, UNION ALL of them before you GROUP BY:

CREATE TABLE my_result AS 
SELECT
  city, street, streetnum, min(val_x) AS val_x, min(val_y) AS val_y
FROM
  (
  SELECT city, street, streetnum, val_x, val_y FROM tableA
  UNION ALL
  SELECT city, street, streetnum, val_x, val_y FROM tableB
  UNION ALL
  SELECT city, street, streetnum, val_x, val_y FROM tableC
  ) AS s0
WHERE
  true /* your condition goes here */ 
GROUP BY
  city, street, streetnum ;

Using always "VAL_X" and "VAL_Y" from same row, using a WINDOW

If you need to make sure your values are always from the same row, the best way is to use a WINDOW in your query: PARTITION BY "CITY", "STREET", "STREET_NUM" and ORDER BY "VAL_X", "VAL_Y", and choose the first row of every partition.

You can do this with two steps:

1) Add the row_num() to every partition:

SELECT 
  *,   
  (row_number() OVER (PARTITION BY "CITY", "STREET", "STREET_NUM" ORDER BY "VAL_X", "VAL_Y")) AS rn
FROM 
  table_a

  |  CITY |     STREET | STREET_NUM | VAL_X | VAL_Y | rn |
  |-------|------------|------------|-------|-------|----|
  | CityA | Street abc |          5 |  11.5 |   0.5 |  1 |
  | CityA | Street abc |          5 |  12.4 |   2.8 |  2 |
  | CityA | Street abc |          5 |  15.4 |   1.8 |  3 |
  | CityB | Street xyz |         18 |   5.4 |   1.9 |  1 |
  | CityB | Street xyz |         18 |   8.4 |   1.1 |  2 |
  | CityC | Street klm |         55 |   9.6 |   0.8 |  1 |

2) At this point, choose only the rows WHERE rn=1 (and ORDER them, if necessary):

SELECT
   "CITY", "STREET", "STREET_NUM", "VAL_X", "VAL_Y"
FROM
  (
  SELECT 
    *,   
    (row_number() OVER (PARTITION BY "CITY", "STREET", "STREET_NUM" ORDER BY "VAL_X", "VAL_Y")) AS rn
  FROM 
    table_a
  ) AS table_a_grouped 
WHERE
  rn = 1
ORDER BY 
  "CITY", "STREET", "STREET_NUM"

The result is:

|  CITY |     STREET | STREET_NUM | VAL_X | VAL_Y |
|-------|------------|------------|-------|-------|
| CityA | Street abc |          5 |  11.5 |   0.5 |
| CityB | Street xyz |         18 |   5.4 |   1.9 |
| CityC | Street klm |         55 |   9.6 |   0.8 |

You can see the example at SQLFiddle

Erwin Brandstetter · Answer 2 · 2021-04-19T00:43:04.577

Assumptions:

Resulting rows shall be distinct on (city, street, street_num) i.e. the combination of these 3 columns shall be unique. Individual columns can have dupes.
val_x and val_y (and possibly more columns) shall come from the same source row (they make sense together, like coordinates / a geocode).
All columns are defined NOT NULL.
_{NULL values are considered equal in by DISTINCT or DISTINCT ON (per SQL standard). So the query works with NULL values all the same. But if you add ORDER BY to define which row to pick from each set of dupes, pay attention if expressions can be NULL. See linked answer below for details.}

UNION ALL multiple sources, then apply DISTINCT ON (city, street, street_num) on the derived table:

CREATE TABLE my_result AS 
SELECT DISTINCT ON (city, street, street_num)
       city, street, street_num, val_x, val_y     -- more columns?
FROM (
   SELECT city, street, street_num, val_x, val_y  -- more columns?
   FROM   tableA
   WHERE  ...
UNION ALL  -- here probably cheaper than UNION
   SELECT city, street, street_num, val_x, val_y  -- more columns?
   FROM   tableB
   WHERE  ...
   ) sub;

Results as desired.

dbfiddle here

It's probably cheaper to use a simple UNION ALL instead of UNION - which would not be wrong but impose a separate sort or hash operation on each SELECT of the UNION query. Since we do that (and more) in the outer query anyway, UNION ALL is probably cheaper. (Test to verify.)

Combining 2 SELECT queries and printing the results in PostgreSQL

I did not add ORDER BY, since you explicitly stated:

The other columns' values can come from any matching row.

You get an arbitrary pick from each set of dupes. The result can change with every invocation.

Works in basically any Postgres version. There are query techniques with LATERAL joins, CTE or window functions that require a more modern version.

Detailed discussion of the technique, indexes and possible alternatives:

Select first row in each GROUP BY group?

PostgreSQL - Create table as select with distinct on specific columns

2 Answers2

"VAL_X" and "VAL_Y" chosen through some aggregate function

Using always "VAL_X" and "VAL_Y" from same row, using a WINDOW