DISTINCT ON vs ROW_NUMBER() for getting a row with a maximum value of a group

Question

I have the following table in PostgreSQL 9.4:

CREATE TABLE dpa(
  id serial NOT NULL,
  currency_id integer,
  amount numeric(14,3),
  date timestamp without time zone,
  plat_id integer,
  pl_id integer,
  p_id integer,
  CONSTRAINT dpa_pkey PRIMARY KEY (id), 
)

and settings:

work_mem = 128MB
table_size = 16 MB

And index:

CREATE INDEX idx1
  ON dpa
  USING btree
  (plat_id, p_id, pl_id, currency_id, date DESC NULLS LAST, amount)

The table consists of approximately 242K rows. I don't have NOT NULL constraints on the column, but they are actually NOT NULL.

Now, I'm measuring performance of the queries:

I

SELECT plat_id, p_id, pl_id, player_account player_account
FROM(
    SELECT plat_id, p_id, pl_id, 
    COALESCE(amount, 0) player_account,
    ROW_NUMBER() OVER (PARTITION BY plat_id, p_id, pl_id, currency_id
                       ORDER BY date DESC NULLS LAST) rn
    FROM dpa
) sub WHERE rn = 1;

Analyzed plan:

Subquery Scan on sub  (cost=0.42..25484.16 rows=1214 width=44) (actual time=0.044..296.810 rows=215274 loops=1)
  Filter: (sub.rn = 1)
  Rows Removed by Filter: 27556
  ->  WindowAgg  (cost=0.42..22448.79 rows=242830 width=28) (actual time=0.043..255.690 rows=242830 loops=1)
        ->  Index Only Scan using idx1 on dpa  (cost=0.42..16378.04 rows=242830 width=28) (actual time=0.037..91.576 rows=242830 loops=1)"
              Heap Fetches: 242830

II

SELECT DISTINCT ON(plat_id, p_id, pl_id, currency_id)
       plat_id, p_id, pl_id, currency_id, amount
FROM dpa
ORDER BY plat_id, p_id, pl_id, currency_id, date DESC NULLS LAST

Analyzed plan:

Unique  (cost=0.42..18794.73 rows=82273 width=28) (actual time=0.017..128.277 rows=215274 loops=1)
  ->  Index Only Scan using idx1 on dpa  (cost=0.42..16366.43 rows=242830 width=28) (actual time=0.016..72.110 rows=242830 loops=1)
        Heap Fetches: 242830

As can be seen, the second query is faster than the first one. But when I execute this queries in PGAdmin I got the following average statistics:

The query with ROW_NUMBER()(the first): 4999 ms

The query with DISTINCT ON(the second): 5654 ms

I understand that bandwith/latency overhead on such large result set is significant. All queries produce 215274 rows.

QUESTION: Why does it take more time to receive all rows in the second case than in the first one, although the planner shows that the second plan is more optimal?

András Váczi · Accepted Answer · 2016-01-25T19:15:14.100

The timing you see is given by pgAdmin (but it could be any other client) - that means it displays the time it needed to get and render the output. As you know the time needed for the database to produce the data (by using EXPLAIN ANALYZE), the difference you see must come from the transfer and/or rendering. You show less columns in the first query, that might be a reason, for example.

If you want to have an idea about how much time is consumed during the transfer, you can time the execution of the query from an application. If you only fetch the data but don't process it in any way (not to mention render it), you get a good approximation on the time needed for transferring the data. Just take the time needed for fetching the data and subtract the (already known) execution time of the database.

This way you will have an idea about the rendering time of pgAdmin, too, by comparing the above numbers to the one it gives you.

DISTINCT ON vs ROW_NUMBER() for getting a row with a maximum value of a group

I

II

1 Answers1

Linked