We currently have a MySQL database storing aggregated statistics in different tables (recent hours, hours, days, months).
The tables are updated by workers running at different rates depending on the freshness required for the data.
Those aggregates are then queried by the applications, usually with queries that involve even more aggregation.
This solution is showing its limits in performance, scalability and flexibility when it comes to querying the data.
Our goal is to replace it with a system based on event sourcing.
Our first prototype uses Dataflow (a bit like MapReduce, but streaming) to pre-compute aggregates for part of the data and store them in Bigtable, and puts the raw events (partitioned) into BigQuery for the aggregations we can't pre-compute.
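For reference, here is roughly what the Dataflow job does, as a minimal Beam (Python) sketch; the Pub/Sub topic, table names and the `event_pb2` protobuf module are placeholders, and the Bigtable sink is elided:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

import event_pb2  # hypothetical generated protobuf module for our events


def parse_event(raw):
    """Decode a protobuf-encoded event into a (key, value) pair for aggregation."""
    ev = event_pb2.Event.FromString(raw)  # hypothetical message class
    return (ev.entity_id, 1)


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    events = p | 'ReadEvents' >> beam.io.ReadFromPubSub(
        topic='projects/my-project/topics/events')

    # Branch 1: pre-computed aggregates, later written to Bigtable
    # (mutation building and bigtableio.WriteToBigTable are omitted here).
    aggregates = (events
                  | 'Parse' >> beam.Map(parse_event)
                  | 'HourlyWindows' >> beam.WindowInto(window.FixedWindows(3600))
                  | 'CountPerKey' >> beam.CombinePerKey(sum))

    # Branch 2: raw (partitioned) events kept in BigQuery for the
    # aggregations we cannot pre-compute.
    _ = (events
         | 'ToRow' >> beam.Map(lambda raw: {'payload': raw})
         | 'WriteRaw' >> beam.io.WriteToBigQuery(
             'my-project:events.raw_events',
             schema='payload:BYTES',
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```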
The system globally works, but its cost is prohibitive, with an estimated $25K/month for BigQuery alone.
The cost is mainly due to the high number of queries for which we cannot pre-compute an efficient aggregate (usually because computing the aggregate requires an earlier event that Dataflow no longer has at the time it processes the current event).
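To give an idea, a typical "last state" query looks like the sketch below (run here through the `google-cloud-bigquery` client; table and column names are made up). Because the last state of each entity has to be rebuilt from raw events, the query scans up to 3 months of partitions every time it runs, and that scanned volume is what we pay for:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical schema: entity_id, event_ts and status are illustrative columns
# of our raw, ingestion-time-partitioned events table.
sql = """
SELECT status, COUNT(*) AS entities
FROM (
  SELECT
    entity_id,
    ARRAY_AGG(status ORDER BY event_ts DESC LIMIT 1)[OFFSET(0)] AS status
  FROM `my-project.events.raw_events`
  WHERE _PARTITIONDATE BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
                           AND CURRENT_DATE()
  GROUP BY entity_id
)
GROUP BY status
"""

for row in client.query(sql).result():
    print(row.status, row.entities)
```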
As alternative back ends to BigQuery we tested a few other options, like Kudu, ClickHouse, Spanner…
So far only ClickHouse has actually performed very well, with sub-second response times, but maintaining a ClickHouse cluster ourselves seems a bit hazardous.
From there, where else could we look? Did we make a fundamental error in the design that would explain the poor performance? Is it even possible to balance cost and usability in such a system?
I have the feeling that the most used solution is still big Hadoop clusters.
A few technical details under normal load:
- ~300 events/s (with regular peaks at 800 events/s)
- ~34,000,000 events/day
- an event encoded in protobuf weighs ~200 B
- an event has 20 properties
- computing the last state can require 3-10 events spread over a time range of 3 months (see the sketch below)
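For that last point, computing a state is basically a fold over the handful of events that concern one entity, but those events can be up to 3 months apart, which is why the streaming job cannot simply hold them in a window. A minimal sketch (the fields `ts`, `status` and `increments` are hypothetical):

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical aggregate state of a single entity."""
    status: str = 'unknown'
    counters: dict = field(default_factory=dict)


def apply_event(state: State, event: dict) -> State:
    """Apply one event to the current state, event-sourcing style."""
    state.status = event.get('status', state.status)
    for name, value in event.get('increments', {}).items():
        state.counters[name] = state.counters.get(name, 0) + value
    return state


def last_state(events):
    """Fold 3-10 events, possibly spread over ~3 months, into the latest state."""
    state = State()
    for event in sorted(events, key=lambda e: e['ts']):
        state = apply_event(state, event)
    return state
```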