
I have a table structured as follows:

legal_entity_id (text)   solution_string (text)   total_score (bigint)
asjY4474dgd              Server                   9

Given user-supplied lists of companies and solutions, I need to determine, for each solution, how many of those companies have expressed interest in it (you can assume that if a company is listed in this table, it is interested in at least one solution). Here is an example of the desired output:

solution_string   count
Server            315
Automotive        256

The table contains approximately 2.5 million rows, and I need the query to execute in under 100 ms in 95% of cases when querying for 25,000 companies and 20 solutions. Indexes are already in place on both the legal_entity_id and solution_string columns. However, there is no primary key on this table (edit: a primary key has since been added on (legal_entity_id, solution_string)). The current query is as follows:

SELECT solution_string, COUNT(*)
FROM intent_signals
WHERE legal_entity_id IN (id1, id2, ...)
  AND solution_string IN (solution_1, solution_2, ...)
GROUP BY solution_string;

The query is extremely slow, and we have no idea how to optimize it. Currently, it takes an average of 2.8 seconds to complete with 5000 legal_entity_ids.
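
For reference, a minimal DDL sketch of the setup as described above (a reconstruction from the description, not the actual DDL; the composite primary key is the one mentioned in the edit):

CREATE TABLE intent_signals (
    legal_entity_id text NOT NULL,
    solution_string text NOT NULL,
    total_score     bigint,
    PRIMARY KEY (legal_entity_id, solution_string)  -- added later, per the edit above
);

-- the two single-column indexes mentioned above
CREATE INDEX ON intent_signals (legal_entity_id);
CREATE INDEX ON intent_signals (solution_string);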

Edit: additional info

PostgreSQL 10.22 (Ubuntu 10.22-0ubuntu0.18.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
QUERY PLAN
Finalize GroupAggregate  (cost=4888296.66..4888317.51 rows=27 width=26) (actual time=5830.528..5848.222 rows=2 loops=1)
  Group Key: solution_string
  ->  Gather Merge  (cost=4888296.66..4888316.83 rows=81 width=26) (actual time=5830.303..5848.206 rows=8 loops=1)
        Workers Planned: 3
        Workers Launched: 3
        ->  Partial GroupAggregate  (cost=4887296.62..4887307.28 rows=27 width=26) (actual time=5807.789..5807.904 rows=2 loops=4)
              Group Key: solution_string
              ->  Sort  (cost=4887296.62..4887300.08 rows=1385 width=18) (actual time=5807.701..5807.759 rows=955 loops=4)
                    Sort Key: solution_string
                    Sort Method: quicksort  Memory: 116kB
                    ->  Parallel Seq Scan on intent_signals  (cost=0.00..4887224.35 rows=1385 width=18) (actual time=0.149..5807.172 rows=955 loops=4)
                          Filter: ((solution_string = ANY ('{Automotive,"Banking Other",Financial}'::text[])) AND (legal_entity_id = ANY ('{0010800002nSsILAA0,}'::text[])))
                          Rows Removed by Filter: 600911
Planning time: 36.703 ms
Execution time: 5848.318 ms

1 Answer


From here, you can get an approximate COUNT(*)! You've got to keep your tables ANALYZEd or VACUUMed for it to stay accurate, though.

The query to run is:

SELECT reltuples AS estimate FROM pg_class WHERE relname = 'table_name';

See also Erwin Brandstetter's post here for the ins and outs.
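
One refinement from that post: if the same relname exists in more than one schema (or the table is not on your search_path), cast to regclass instead of matching on relname. A sketch using the table name from the question:

SELECT reltuples::bigint AS estimate  -- reltuples is stored as a float, hence the cast
FROM pg_class
WHERE oid = 'intent_signals'::regclass;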

You also have the TABLESAMPLE technique, explained here. Again, Erwin Brandstetter to the rescue here. The query to run is:

SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);

Depending on the accuracy desired, increase the 1 (the percentage of the table sampled) to a suitable value, scaling the 100 multiplier down to match.
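
For example, sampling 10% of the table means multiplying by 10 (table name taken from the question):

SELECT 10 * count(*) AS estimate  -- multiplier = 100 / sample percentage
FROM intent_signals TABLESAMPLE SYSTEM (10);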

Maybe of interest - bit of a monster post on sampling (by me), see here!

I would suggest testing on your own system(s) to see what works best for you!

Your times seem very high. Is your IN clause very long? If so, maybe try a sub-SELECT or a join instead, as sketched below.
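
A minimal sketch of that rewrite, using the table from the question and hypothetical placeholder values: feed the ids in through a VALUES list (or a temp table) that the planner can join against, rather than one enormous IN list.

-- 'id1' etc. and the solution names are placeholders for the real user-supplied values
SELECT s.solution_string, COUNT(*) AS count
FROM intent_signals s
JOIN (VALUES ('id1'), ('id2'), ('id3')) AS ids(legal_entity_id)
  ON ids.legal_entity_id = s.legal_entity_id
WHERE s.solution_string IN ('Server', 'Automotive')
GROUP BY s.solution_string;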

Vérace