
I'm writing a ledger system where every transaction can have multiple classifications. For example, if someone purchases a widget for $50, I can categorize that transaction as having an account of "Revenue" and an SKU of "SKU1".

Users can then select the dimensions they wish to report on, and I can generate aggregates.

With 10M+ transactions in the database, the following query is prohibitively slow: after about 10 seconds I get a "Memory limit exceeded" error on my 8 GB laptop.

Hence the question: I don't actually care about the individual rows, only about the aggregated values. In my test I expect only about 10 rows after aggregation.

Here is a fiddle: http://sqlfiddle.com/#!17/4a7d8/10/0

select
   year,
   sum(amount),
   t1.value as account,
   t2.value as sku
from 
    transactions 
left join
    tags t1 on transactions.id = t1.transaction_id and t1.name ='account'
left join
    tags t2 on transactions.id = t2.transaction_id and t2.name = 'sku'
group by
    year,
    t1.value,
    t2.value;

Here is the query plan:

Expression ((Projection + Before ORDER BY))
  Aggregating
    Expression (Before GROUP BY)
      Join (JOIN)
        Expression ((Before JOIN + (Projection + Before ORDER BY)))
          Join (JOIN)
            Expression (Before JOIN)
              ReadFromMergeTree (default.transactions)
            Expression ((Joined actions + (Rename joined columns + (Projection + Before ORDER BY))))
              ReadFromMergeTree (default.tags)
        Expression ((Joined actions + (Rename joined columns + (Projection + Before ORDER BY))))
          ReadFromMergeTree (default.tags)

And, finally, here is the schema:

CREATE TABLE default.transactions
(
    `id` Int32,
    `date` Date,
    `amount` Float32
)
ENGINE = MergeTree
PRIMARY KEY id
ORDER BY id
SETTINGS index_granularity = 8192

CREATE TABLE default.tags
(
    `transaction_id` Int32,
    `name` String,
    `value` String,
    INDEX idx_tag_value value TYPE set(0) GRANULARITY 4,
    INDEX idx_tag_name name TYPE set(0) GRANULARITY 4
)
ENGINE = MergeTree
PRIMARY KEY (transaction_id, name)
ORDER BY (transaction_id, name)
SETTINGS index_granularity = 8192
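
To make the data shape concrete, each classification is one row in the tags table; the widget purchase above would be stored roughly like this (the id and values are illustrative):

-- Illustrative rows only; the id and values are made up to match the example above.
INSERT INTO default.transactions VALUES (1, '2023-01-15', 50.0);
INSERT INTO default.tags (transaction_id, name, value) VALUES
    (1, 'account', 'Revenue'),
    (1, 'sku', 'SKU1');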

My questions are:

  • Is there a different schema, or different set of Clickhouse features I might use?
  • Should I instead pre-compute aggregates?
  • Is there a different DB which can perform this kind of calculation more efficiently?
poundifdef

2 Answers


I found a solution that works with Clickhouse. Rather than having two tables (transactions and tags), the tags can be stored in a single column of Map type on the transactions table.

The new structure would look like this:

CREATE TABLE default.transactions
(
    `id` Int32,
    `date` Date,
    `amount` Float32,
    `tags` Map(String, String)
)
ENGINE = MergeTree
PRIMARY KEY id
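
To migrate existing data into this shape, one approach (a sketch, assuming a new table named transactions_new with the schema above; ClickHouse can build a Map by casting a tuple of key/value arrays) would be:

-- Fold each transaction's tag rows into a single Map column.
-- Transactions with no tags would need a LEFT JOIN plus extra handling.
INSERT INTO default.transactions_new
SELECT
    t.id,
    t.date,
    t.amount,
    CAST((groupArray(tg.name), groupArray(tg.value)), 'Map(String, String)') AS tags
FROM default.transactions AS t
INNER JOIN default.tags AS tg ON tg.transaction_id = t.id
GROUP BY t.id, t.date, t.amount;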

From there, you could query:

SELECT
  year,
  SUM(amount),
  tags['account'],
  tags['sku']
FROM
  transactions
GROUP BY
  year,
  tags['account'],
  tags['sku']

When I run this against a table with 280M rows, it returns in 7 seconds.
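
If even that is too slow, or to address the pre-computation question directly, the aggregate can also be maintained incrementally with a materialized view. This is only a sketch: the view name is made up, and I'm assuming year is derived with toYear(date), since the base table only has a date column.

-- SummingMergeTree keeps one row per (year, account, sku) and sums amount on merges.
-- POPULATE or a manual backfill is needed for rows inserted before the view exists.
CREATE MATERIALIZED VIEW default.tx_by_account_sku
ENGINE = SummingMergeTree
ORDER BY (year, account, sku)
AS SELECT
    toYear(date) AS year,
    tags['account'] AS account,
    tags['sku'] AS sku,
    sum(amount) AS amount
FROM default.transactions
GROUP BY year, account, sku;

Reads then re-aggregate the much smaller view, e.g. SELECT year, account, sku, sum(amount) FROM default.tx_by_account_sku GROUP BY year, account, sku, since rows are only collapsed at merge time.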

poundifdef

Clickhouse is notoriously memory-hungry, and not really suitable for running on a single node with just 8 GiB of memory. I suggest you choose something more suitable for your "infrastructure". For example, Postgres won't have any problem aggregating 10 million rows.

Also, your model looks suspiciously like EAV, which is a very bad thing for analytics queries. If you insist on using Clickhouse, denormalise your model.
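
For a sense of what that denormalised shape looks like, here is a minimal Postgres sketch (table and column names are illustrative, not from the original post):

-- One row per transaction, with the classifications as plain columns.
CREATE TABLE transactions_denorm (
    id      integer PRIMARY KEY,
    date    date,
    amount  numeric(12,2),
    account text,
    sku     text
);

-- The reporting query is then a straightforward GROUP BY.
SELECT
    extract(year FROM date) AS year,
    account,
    sku,
    sum(amount)
FROM transactions_denorm
GROUP BY year, account, sku;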

mustaccio