We currently have a MySQL database storing aggregated statistics in different tables (recent hours, hours, days, months).
The tables are updated by workers running at different rates depending on the freshness required for the data.
Those aggregates are then queried by the applications, usually with queries that involve even more aggregation.
This solution is showing its limits in performance, scalability and flexibility when it comes to querying the data.
Our goal is to replace it with a system based on event sourcing.
Our first prototype uses Dataflow (a bit like MapReduce, but streaming) to pre-compute aggregates for part of the data and store them in Bigtable, and puts the raw events (partitioned) into BigQuery for the aggregations we can't pre-compute.
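For reference, here is roughly what the Dataflow job does, as a minimal Beam (Python) sketch; the Pub/Sub topic, table names and the `event_pb2` protobuf module are placeholders, and the Bigtable sink is elided:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

import event_pb2  # hypothetical generated protobuf module for our events


def parse_event(raw):
    """Decode a protobuf-encoded event into a (key, value) pair for aggregation."""
    ev = event_pb2.Event.FromString(raw)  # hypothetical message class
    return (ev.entity_id, 1)


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    events = p | 'ReadEvents' >> beam.io.ReadFromPubSub(
        topic='projects/my-project/topics/events')

    # Branch 1: pre-computed aggregates, later written to Bigtable
    # (mutation building and bigtableio.WriteToBigTable are omitted here).
    aggregates = (events
                  | 'Parse' >> beam.Map(parse_event)
                  | 'HourlyWindows' >> beam.WindowInto(window.FixedWindows(3600))
                  | 'CountPerKey' >> beam.CombinePerKey(sum))

    # Branch 2: raw (partitioned) events kept in BigQuery for the
    # aggregations we cannot pre-compute.
    _ = (events
         | 'ToRow' >> beam.Map(lambda raw: {'payload': raw})
         | 'WriteRaw' >> beam.io.WriteToBigQuery(
             'my-project:events.raw_events',
             schema='payload:BYTES',
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```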
The system globally works, but its cost is prohibitive, with an estimated $25K/month for BigQuery alone.
The cost is mainly due to the high number of queries for which we cannot pre-compute an efficient aggregate (usually because computing the aggregate requires an earlier event that Dataflow no longer has at the time it processes the current event).
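To give an idea, a typical "last state" query looks like the sketch below (run here through the `google-cloud-bigquery` client; table and column names are made up). Because the last state of each entity has to be rebuilt from raw events, the query scans up to 3 months of partitions every time it runs, and that scanned volume is what we pay for:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical schema: entity_id, event_ts and status are illustrative columns
# of our raw, ingestion-time-partitioned events table.
sql = """
SELECT status, COUNT(*) AS entities
FROM (
  SELECT
    entity_id,
    ARRAY_AGG(status ORDER BY event_ts DESC LIMIT 1)[OFFSET(0)] AS status
  FROM `my-project.events.raw_events`
  WHERE _PARTITIONDATE BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
                           AND CURRENT_DATE()
  GROUP BY entity_id
)
GROUP BY status
"""

for row in client.query(sql).result():
    print(row.status, row.entities)
```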
As alternative back ends to BigQuery we tested a few other options, like Kudu, ClickHouse, Spanner…
So far only ClickHouse has actually performed very well, with sub-second response times, but maintaining a ClickHouse cluster ourselves seems a bit hazardous.
From there, where else could we look? Did we make a fundamental error in the design that would explain the poor performance? Is it even possible to balance cost and usability in such a system?
I have the feeling that the most used solution is still big Hadoop clusters.
A few technical details under normal load:
- ~300 events/s (with regular peaks at 800 events/s)
- ~34,000,000 events/day
- an event encoded in protobuf weighs ~200 B
- an event has 20 properties
- computing the last state can require 3-10 events spread over a time range of 3 months (see the sketch below)
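For that last point, computing a state is basically a fold over the handful of events that concern one entity, but those events can be up to 3 months apart, which is why the streaming job cannot simply hold them in a window. A minimal sketch (the fields `ts`, `status` and `increments` are hypothetical):

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical aggregate state of a single entity."""
    status: str = 'unknown'
    counters: dict = field(default_factory=dict)


def apply_event(state: State, event: dict) -> State:
    """Apply one event to the current state, event-sourcing style."""
    state.status = event.get('status', state.status)
    for name, value in event.get('increments', {}).items():
        state.counters[name] = state.counters.get(name, 0) + value
    return state


def last_state(events):
    """Fold 3-10 events, possibly spread over ~3 months, into the latest state."""
    state = State()
    for event in sorted(events, key=lambda e: e['ts']):
        state = apply_event(state, event)
    return state
```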