
One of the issues I am encountering at the moment is that we have some very large tables (more than 10 million rows). When we reference these large tables or join against them, queries are extremely slow.

One hypothesis for solving this is to create pre-computed tables, where the computation for each use case has already been done, so that instead of referencing the raw data we query the pre-computed table. A rough sketch of what I have in mind is below.
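For concreteness, this is a minimal sketch of the idea with made-up table and column names (a hypothetical orders table pre-aggregated into daily per-customer totals):

    -- Hypothetical raw table, >10M rows:
    --   orders(id, customer_id, order_date, amount, created_at, ...)

    -- Pre-computed table: one row per customer per day
    CREATE TABLE orders_daily_summary (
        order_date   DATE           NOT NULL,
        customer_id  INT UNSIGNED   NOT NULL,
        order_count  INT UNSIGNED   NOT NULL,
        total_amount DECIMAL(14,2)  NOT NULL,
        PRIMARY KEY (order_date, customer_id)
    );

    -- One-time initial load from the raw data
    INSERT INTO orders_daily_summary (order_date, customer_id, order_count, total_amount)
    SELECT order_date, customer_id, COUNT(*), SUM(amount)
    FROM   orders
    GROUP  BY order_date, customer_id;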

Are there any resources on how to implement this? Should we do it purely in MySQL, or can we also use Pandas or other such modules to accomplish the same thing?

Which is the optimal way?

1 Answer


Yes.

See my blog on Summary Tables. It discusses their purpose (similar to what you describe), how to build them, some metrics on properly sizing them, etc.
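As a minimal sketch of one common way to keep such a summary current (not necessarily the blog's exact recipe), new raw rows can be folded in incrementally with INSERT ... ON DUPLICATE KEY UPDATE. This assumes the hypothetical tables sketched in the question, plus a created_at ingestion timestamp on the raw table:

    -- Fold recently arrived raw rows into the summary.
    -- (Simplified: a production job would track the last-processed
    --  boundary explicitly instead of relying on NOW().)
    INSERT INTO orders_daily_summary (order_date, customer_id, order_count, total_amount)
    SELECT order_date, customer_id, COUNT(*), SUM(amount)
    FROM   orders
    WHERE  created_at >= NOW() - INTERVAL 1 HOUR
    GROUP  BY order_date, customer_id
    ON DUPLICATE KEY UPDATE
        order_count  = order_count  + VALUES(order_count),
        total_amount = total_amount + VALUES(total_amount);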

Often I see upwards of a 10-fold speedup.

A well-designed Data Warehouse uses the "Fact" table only when you need to fetch individual entries, which is rare. Most queries can be answered from the Summary table(s).
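For example, continuing with the hypothetical summary table above, a typical report query touches only Summary rows and no Fact rows at all:

    -- Monthly revenue report read entirely from the Summary table
    SELECT DATE_FORMAT(order_date, '%Y-%m') AS month,
           SUM(order_count)                 AS orders,
           SUM(total_amount)                AS revenue
    FROM   orders_daily_summary
    WHERE  order_date >= '2023-01-01'
    GROUP  BY month;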

And, by using PARTITIONing, you can efficiently toss "old" Fact rows, while keeping the Summary data "forever". This makes disk space more manageable.
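A sketch of monthly RANGE partitioning on the hypothetical orders Fact table (the partition layout here is illustrative, not a prescription):

    CREATE TABLE orders (
        id          BIGINT UNSIGNED NOT NULL,
        customer_id INT UNSIGNED    NOT NULL,
        order_date  DATE            NOT NULL,
        amount      DECIMAL(10,2)   NOT NULL,
        created_at  DATETIME        NOT NULL,
        PRIMARY KEY (id, order_date)   -- the partition column must be in every unique key
    )
    PARTITION BY RANGE (TO_DAYS(order_date)) (
        PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
        PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
        PARTITION pfuture VALUES LESS THAN MAXVALUE
    );

    -- Dropping a month of old Fact rows is nearly instant;
    -- the Summary data built from them is untouched
    ALTER TABLE orders DROP PARTITION p202401;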

It is usually good to heavily 'normalize' the Fact table, saving disk space. Meanwhile, the Summary tables can be 'denormalized', improving speed.
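A sketch of that contrast, assuming a hypothetical customers dimension table with a region column: the Fact table keeps only the compact customer_id, while the Summary table copies the region in so reports need no join.

    -- Normalized Fact table: compact ids only (see the CREATE TABLE above);
    -- customer attributes live once in a dimension table:
    --   customers(id, name, region, ...)

    -- Denormalized Summary table: region is repeated, so reports skip the join
    CREATE TABLE revenue_daily_by_region (
        order_date   DATE           NOT NULL,
        region       VARCHAR(50)    NOT NULL,
        order_count  INT UNSIGNED   NOT NULL,
        total_amount DECIMAL(14,2)  NOT NULL,
        PRIMARY KEY (order_date, region)
    );

    INSERT INTO revenue_daily_by_region
    SELECT o.order_date, c.region, COUNT(*), SUM(o.amount)
    FROM   orders o
    JOIN   customers c ON c.id = o.customer_id
    GROUP  BY o.order_date, c.region;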

If you want more specifics, please divulge more info.

Rick James