
I'm writing a ledger system where every transaction can have multiple classifications. For example, if someone purchases a widget for $50, I can categorize that transaction as having an account of "Revenue" and an SKU of "SKU1".

Users can then select the dimensions they wish to report on, and I can generate aggregates.

With 10M+ transactions in the database, the following query is prohibitively slow: after about 10 seconds I get a "Memory limit exceeded" error on my 8 GB laptop.

Hence the question: I don't actually care about the individual rows, only about the aggregated values. In my test I expect only about 10 rows after aggregation.

Here is a fiddle: http://sqlfiddle.com/#!17/4a7d8/10/0

select
   year,
   sum(amount),
   t1.value as account,
   t2.value as sku
from 
    transactions 
left join
    tags t1 on transactions.id = t1.transaction_id and t1.name ='account'
left join
    tags t2 on transactions.id = t2.transaction_id and t2.name = 'sku'
group by
    year,
    t1.value,
    t2.value;

Here is the query plan:

Expression ((Projection + Before ORDER BY))
  Aggregating
    Expression (Before GROUP BY)
      Join (JOIN)
        Expression ((Before JOIN + (Projection + Before ORDER BY)))
          Join (JOIN)
            Expression (Before JOIN)
              ReadFromMergeTree (default.transactions)
            Expression ((Joined actions + (Rename joined columns + (Projection + Before ORDER BY))))
              ReadFromMergeTree (default.tags)
        Expression ((Joined actions + (Rename joined columns + (Projection + Before ORDER BY))))
          ReadFromMergeTree (default.tags)

And, finally, here is the schema:

CREATE TABLE default.transactions
(
    `id` Int32,
    `date` Date,
    `amount` Float32
)
ENGINE = MergeTree
PRIMARY KEY id
ORDER BY id
SETTINGS index_granularity = 8192

CREATE TABLE default.tags
(
    `transaction_id` Int32,
    `name` String,
    `value` String,
    INDEX idx_tag_value value TYPE set(0) GRANULARITY 4,
    INDEX idx_tag_name name TYPE set(0) GRANULARITY 4
)
ENGINE = MergeTree
PRIMARY KEY (transaction_id, name)
ORDER BY (transaction_id, name)
SETTINGS index_granularity = 8192
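
To make the data shape concrete, each classification is one row in the tags table; the widget purchase above would be stored roughly like this (the id and values are illustrative):

-- Illustrative rows only; the id and values are made up to match the example above.
INSERT INTO default.transactions VALUES (1, '2023-01-15', 50.0);
INSERT INTO default.tags (transaction_id, name, value) VALUES
    (1, 'account', 'Revenue'),
    (1, 'sku', 'SKU1');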

My questions are:

  • Is there a different schema, or different set of Clickhouse features I might use?
  • Should I instead pre-compute aggregates?
  • Is there a different DB which can perform this kind of calculation more efficiently?
poundifdef

2 Answers


I found a solution that works with Clickhouse. Rather than having two tables (transactions and tags), the tags can be stored in a single column of Map type on the transactions table.

The new structure would look like this:

CREATE TABLE default.transactions
(
    `id` Int32,
    `date` Date,
    `amount` Float32,
    `tags` Map(String, String)
)
ENGINE = MergeTree
PRIMARY KEY id
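
To migrate existing data into this shape, one approach (a sketch, assuming a new table named transactions_new with the schema above; ClickHouse can build a Map by casting a tuple of key/value arrays) would be:

-- Fold each transaction's tag rows into a single Map column.
-- Transactions with no tags would need a LEFT JOIN plus extra handling.
INSERT INTO default.transactions_new
SELECT
    t.id,
    t.date,
    t.amount,
    CAST((groupArray(tg.name), groupArray(tg.value)), 'Map(String, String)') AS tags
FROM default.transactions AS t
INNER JOIN default.tags AS tg ON tg.transaction_id = t.id
GROUP BY t.id, t.date, t.amount;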

From there, you could query:

SELECT
  year,
  SUM(amount),
  tags['account'],
  tags['sku']
FROM
  transactions
GROUP BY
  year,
  tags['account'],
  tags['sku']

When I run this against a table with 280M rows, it returns in 7 seconds.
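
If even that is too slow, or to address the pre-computation question directly, the aggregate can also be maintained incrementally with a materialized view. This is only a sketch: the view name is made up, and I'm assuming year is derived with toYear(date), since the base table only has a date column.

-- SummingMergeTree keeps one row per (year, account, sku) and sums amount on merges.
-- POPULATE or a manual backfill is needed for rows inserted before the view exists.
CREATE MATERIALIZED VIEW default.tx_by_account_sku
ENGINE = SummingMergeTree
ORDER BY (year, account, sku)
AS SELECT
    toYear(date) AS year,
    tags['account'] AS account,
    tags['sku'] AS sku,
    sum(amount) AS amount
FROM default.transactions
GROUP BY year, account, sku;

Reads then re-aggregate the much smaller view, e.g. SELECT year, account, sku, sum(amount) FROM default.tx_by_account_sku GROUP BY year, account, sku, since rows are only collapsed at merge time.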

poundifdef

Clickhouse is notoriously memory-hungry, and not really suitable for running on a single node with just 8 GiB of memory. I suggest you choose something more suitable for your "infrastructure". For example, Postgres won't have any problem aggregating 10 million rows.

Also, your model looks suspiciously like EAV, which is a very bad thing for analytics queries. If you insist on using Clickhouse, denormalise your model.
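
For a sense of what that denormalised shape looks like, here is a minimal Postgres sketch (table and column names are illustrative, not from the original post):

-- One row per transaction, with the classifications as plain columns.
CREATE TABLE transactions_denorm (
    id      integer PRIMARY KEY,
    date    date,
    amount  numeric(12,2),
    account text,
    sku     text
);

-- The reporting query is then a straightforward GROUP BY.
SELECT
    extract(year FROM date) AS year,
    account,
    sku,
    sum(amount)
FROM transactions_denorm
GROUP BY year, account, sku;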

mustaccio