How do I tell PostgreSQL to return first encountered value instead of an aggregate column?

                          Table "public.cache"
  Column   |  Type   | Modifiers | Storage | Stats target | Description 
-----------+---------+-----------+---------+--------------+-------------
 user_id   | integer |           | plain   |              | 
 object_id | integer |           | plain   |              | 
 data      | integer |           | plain   |              | 
 data2     | boolean |           | plain   |              | 
Indexes:
    "cache_object_id_user_id_key" UNIQUE CONSTRAINT, btree (object_id, user_id)
    "cache_user_id_object_id_key" UNIQUE CONSTRAINT, btree (user_id, object_id)
Has OIDs: no

A query grouping by object_id and data2 results in a hash aggregate, which I want to avoid:

SELECT object_id, data2 FROM cache GROUP BY object_id, data2;

I found bool_or(), but in the worst case it will scan all values:

SELECT object_id, bool_or(data2) FROM cache GROUP BY object_id;

Moreover, there is no such function for arbitrary datatypes. What I want is to get any one of the values from the data2 column, so that the engine doesn't have to iterate over all rows.

What about the data column, which is an integer?

4 Answers

There are various ways to do this, with different performance, depending on the distribution of the data (number of distinct object_id values, etc.).

The easiest query to write, but not necessarily the most efficient, is of course using an aggregate, MIN() or MAX():

SELECT object_id, MIN(data2) AS data2 
FROM cache 
GROUP BY object_id ;

If you have an index on (object_id, data2), this will not be too bad in recent versions of Postgres, which can use an index-only scan for the execution plan.
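For illustration, such an index could be created like this (the index name is my own invention; the existing unique constraints on the table don't cover data2):

-- Hypothetical covering index: with it in place, recent Postgres
-- versions can answer the MIN()/GROUP BY query above with an
-- index-only scan, never touching the heap.
CREATE INDEX cache_object_id_data2_idx ON cache (object_id, data2);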


Another way would be to use the DISTINCT ON syntax. The same index as above would help:

SELECT DISTINCT ON (object_id)
    object_id, data2 
FROM cache 
ORDER BY object_id ;

If, compared to the table size, there is a small number of object_id values, a different approach would be much more efficient, provided you also have another table (say objects) that has object_id as its primary key:

SELECT o.object_id, c.data2 
FROM objects AS o
  CROSS JOIN LATERAL 
     ( SELECT data2
       FROM cache AS c 
       WHERE c.object_id = o.object_id
       ORDER BY c.data2 
       LIMIT 1
     ) AS c ;

The same index would be needed. The ORDER BY is not required, but with the index in place it won't hurt efficiency. If you don't have an objects table, that part would have to be replaced with:

---
FROM ( SELECT DISTINCT object_id FROM cache) AS o
  CROSS JOIN LATERAL 
---

But you'd lose some efficiency, especially in older versions. In that case, you could replace this subquery with a complicated recursive query that traverses the object_id index efficiently. See the Postgres docs for more details: Loose Index Scan.
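For reference, a sketch of that recursive emulation of a loose index scan (adapted from the well-known pattern in the Postgres wiki; treat it as an illustration, and note it assumes object_id is never NULL):

WITH RECURSIVE o AS (
   -- start with the smallest object_id
   (SELECT object_id FROM cache ORDER BY object_id LIMIT 1)
   UNION ALL
   -- then repeatedly jump to the next larger one via the index
   SELECT (SELECT c.object_id
           FROM cache AS c
           WHERE c.object_id > o.object_id
           ORDER BY c.object_id
           LIMIT 1)
   FROM o
   WHERE o.object_id IS NOT NULL
)
SELECT object_id FROM o WHERE object_id IS NOT NULL ;

This would take the place of the ( SELECT DISTINCT object_id FROM cache ) AS o derived table above.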

Read also these great answers by Erwin in the related questions.


And last but not least, the main reason for the problems is this:

I have denormalized data and I want to avoid ... (I know that all values in that group are equal booleans).

Normalizing the table would lead to much more efficient queries.
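As a rough sketch of what that could look like (my own illustration, and only valid under the stated assumption that data2 depends on object_id alone):

-- data2 is stored once per object instead of once per (user, object):
CREATE TABLE objects (
    object_id integer PRIMARY KEY,
    data2     boolean NOT NULL
);

CREATE TABLE cache (
    user_id   integer NOT NULL,
    object_id integer NOT NULL REFERENCES objects (object_id),
    data      integer,
    PRIMARY KEY (user_id, object_id)
);

-- The original GROUP BY query then becomes trivial:
SELECT object_id, data2 FROM objects ;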

ypercubeᵀᴹ

I've run into the same problem a few times, but Stack Exchange was no help to me. You can create a user-defined aggregate function.

Such a function is run over the rows in a group. At minimum, you need to provide a state transition function and a state type; in my example the transition function is func_first_value.

The functions below use the polymorphic anyelement type, so they work with any type Postgres can deduce. The aggregate's initial state defaults to NULL. I've added prefixes so the functions are easy to drop if necessary.

CREATE OR REPLACE FUNCTION func_first_value(v0 anyelement, v1 anyelement) RETURNS anyelement AS $$
BEGIN
    -- Keep the accumulated state (v0) once it is non-NULL;
    -- otherwise take the incoming value (v1).
    IF v0 IS NOT NULL THEN
        RETURN v0;
    END IF;
    RETURN v1;
END;
$$ LANGUAGE plpgsql;

-- test function
SELECT func_first_value('text'::VARCHAR, NULL), func_first_value(NULL::VARCHAR, NULL), func_first_value(NULL, 'text'::VARCHAR);

CREATE AGGREGATE agg_first_value (anyelement)
(
    sfunc = func_first_value,
    stype = anyelement
);

In your case you would need to query:

SELECT object_id, agg_first_value(data2) FROM cache GROUP BY object_id;

To avoid scanning all values, one would need to provide a SORTOP, or sort operator, as the Postgres documentation calls it. I didn't find any example of how to use it. Also, writing my own aggregate function was a lot slower (~6x) than using the built-in MIN or MAX aggregates.
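For completeness, the declaration syntax would look like the sketch below (my own illustration for a concrete boolean variant, not from the original answer). Note that per the CREATE AGGREGATE documentation, SORTOP is only meant for aggregates that behave exactly like MIN or MAX, which a first-non-NULL aggregate does not, so take this as syntax illustration rather than a recommended definition:

-- Boolean-specific transition function, so that a concrete sort
-- operator (<) can be named; COALESCE keeps the first non-NULL value.
CREATE OR REPLACE FUNCTION func_first_bool(v0 boolean, v1 boolean)
RETURNS boolean AS $$
    SELECT COALESCE($1, $2);
$$ LANGUAGE sql;

CREATE AGGREGATE agg_first_bool (boolean)
(
    sfunc  = func_first_bool,
    stype  = boolean,
    sortop = <   -- caveat: only semantically valid for MIN-like aggregates
);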

Paweł Szczur

Here's one attempt:

SELECT distinct object_id
     , first_value(data2) over (partition by object_id) 
FROM cache

first_value will work for (almost) any type.

Lennart - Slava Ukraini

Starting from PostgreSQL 16, there's an any_value aggregate function that you can use. If you want the first or last value in the group, there are some proposed implementations in simple SQL here.

Update: using any_value, the following query will give you what you want:

SELECT object_id, ANY_VALUE(data2) AS data2 
FROM cache 
GROUP BY object_id;

Defining your own any_value (for PostgreSQL versions prior to 16) is also doable, using the code snippet I linked above:

-- Create a function that always returns any (actually the first) non-NULL value:
CREATE OR REPLACE FUNCTION public.any_value_agg (anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $1';

-- Then wrap an aggregate around it:
CREATE AGGREGATE public.any_value (anyelement)
(
    SFUNC = public.any_value_agg,
    STYPE = anyelement,
    PARALLEL = safe
);

And finally, another option is to use array_agg, though of course it's less efficient:

SELECT object_id, (ARRAY_AGG(data2))[1] AS data2 
FROM cache 
GROUP BY object_id;