21

Let's say you are running a query on your database.

In development and staging, where the amount of data is small, everything looks good. Unit and integration tests pass as well, and the logic is correct.

But when you get to production, where the database is much bigger, it suddenly turns out that the query is inefficient and runs very slowly.

So you might add an index, change the order of the parameters in the query, or find a different solution.

How can you identify such a situation before getting to production?

Meaning: just as unit tests let us "shift left" the detection of problematic code, how can we do the same with queries? Are there known techniques or suggested frameworks for that?

Some more information:

  1. You cannot clone the production database into the development/staging environment.
  2. Load tests are less relevant here: the problem is not load on the system (and thus on the database), but a specific query for which even a single execution takes too long.
riorio

6 Answers

33

You should have a test system with similar hardware and data sizes to your production database. This can be a copy of the production database with any personal/sensitive data anonymized, or it can be completely generated data; if the data is generated, you need to be careful to ensure it follows patterns similar to the production data.

It may not be practical or cost-effective to have a system totally identical to production, so this becomes a tradeoff: the closer you are to a real production system, the lower the risk of bugs, but the higher the cost. Disk and memory are typically fairly cheap, so it may be feasible to use a realistic database size with slower disks and fewer CPUs, and assume that query throughput scales roughly linearly with better hardware. But keep in mind that this is just an assumption.

In addition to testing the performance of individual queries, you may want to test performance under load, so you may need tools to generate a representative volume of queries.

This can be the staging server, or it could be named the "stress-test" system, or whatever you prefer.

JonasH
20

In my personal experience, when queries are slow in production, most are SELECT or read queries. When I am building out such a query and I am suspicious of its performance, usually because it has several joins, I'll just run it manually in production (on a read slave using a read-only account) and monitor its time and CPU usage.

If it takes more than a couple of seconds, I stop it. Or if I expect it to take a while, like a couple of minutes for large aggregation reports, I'll monitor the CPU usage and kill it if it is likely to affect production traffic. At greater cost, you could have a read slave that doesn't serve production traffic.
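In PostgreSQL, for example, this kind of guarding and monitoring can look roughly like the sketch below; the 5-second budget and the pid are placeholders:

```sql
-- Cap any single statement in this session at 5 seconds, so an accidental
-- full scan cannot hog the replica while experimenting.
SET statement_timeout = '5s';

-- See what is currently running and for how long.
SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC;

-- Cancel a runaway statement by its pid (12345 is a placeholder).
SELECT pg_cancel_backend(12345);
```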

I also like PostgreSQL's \timing, EXPLAIN, and EXPLAIN ANALYZE commands, which give you actual timing information and execution plans. I guess that in theory you could write a unit test that checks the execution plans and looks for patterns, such as full table scans, that are usually the cause of long-running queries.
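For instance (the orders and customers tables here are made up for illustration):

```sql
-- \timing on   -- psql meta-command: print client-side execution time

-- Planner estimate only; the query is not actually executed:
EXPLAIN
SELECT * FROM orders WHERE customer_id = 1234;

-- Runs the query and reports actual row counts and per-node timings:
EXPLAIN ANALYZE
SELECT o.id, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'DE';
-- Watch for "Seq Scan" nodes on large tables and for estimated row counts
-- that are far from the actual ones.
```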

Another option: although you can't use production data (I'm guessing it is hundreds of GB in size, making it impractical to generate similar quantities for a unit test), you could still generate thousands of rows and then test the queries against them. If normal, fast queries take 100 ms in production, but the same queries take 5 ms on your local or test machines, then you could simply require that queries stay under 5 ms in the test environment.
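A minimal PostgreSQL sketch of that idea, with a made-up orders table and an arbitrary 100k rows:

```sql
-- Hypothetical table filled with synthetic rows via generate_series.
CREATE TABLE orders (
    id          bigint PRIMARY KEY,
    customer_id bigint NOT NULL,
    created_at  timestamptz NOT NULL,
    total       numeric NOT NULL
);

INSERT INTO orders (id, customer_id, created_at, total)
SELECT g,
       (random() * 10000)::bigint,               -- ~10k distinct customers
       now() - random() * interval '365 days',   -- spread over the last year
       round((random() * 500)::numeric, 2)
FROM generate_series(1, 100000) AS g;

ANALYZE orders;  -- refresh planner statistics before measuring anything

-- Time the query under test; a test harness can fail the build if this
-- exceeds the budget you calibrated for the test machine (e.g. 5 ms).
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 1234;
```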

Those are things I have personally done, but the number of new DB queries my team added was just a few a week, and I had a good enough grasp of the database that I could catch potentially bad queries in code reviews before the code was deployed. If a bad query was deployed, we had a pretty solid revert process.

If this is something you do constantly and absolutely need an automated process for, then what I would try (theoretical at this point) is creating some static-analysis tests that check queries for:

  • WHERE clauses on foreign keys
  • WHERE clauses on indexed columns
  • execution plans that contain full table scans (see the sketch after this list)
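In PostgreSQL, such checks could be built on top of the machine-readable plan output and the catalog; a rough sketch, with a hypothetical orders table:

```sql
-- Machine-readable plan that a test can parse, failing when a "Seq Scan"
-- node appears over a table above a certain size:
EXPLAIN (FORMAT JSON)
SELECT * FROM orders WHERE customer_id = 1234;

-- Catalog check: list the indexes on a table so a static test can verify
-- that the columns used in WHERE clauses and foreign keys are covered.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'orders';
```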

If you are doing tons of UPDATEs, INSERTs, and/or DELETEs, then that is a little more complicated and I don't have as much experience there. The little experience I do have is with situations where those processes take 20 minutes.

11

How can you identify such a situation before getting to prod?

I would argue that it isn't really possible to do that safely, reliably, and economically. Instead, you should try to minimize the impact of query changes, detect performance regressions quickly, and build a software engineering process that allows you to fix issues and regressions quickly.

Use feature flagging to limit new features or changes to just a small sample of users, possibly even to known beta users. You could slowly increase the number of users who have access to the feature, monitor for performance issues, and automatically roll back if problems show up.

Modern telemetry systems make it trivial to implement performance monitoring and to identify which part of the code has performance issues.
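On the database side, one low-effort complement to application telemetry (in PostgreSQL, assuming the pg_stat_statements extension is available on the server) is to let the database itself keep per-query timing statistics:

```sql
-- Requires the pg_stat_statements extension (it must also be listed in
-- shared_preload_libraries in postgresql.conf).
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top queries by mean execution time (column names as of PostgreSQL 13+;
-- older versions use total_time / mean_time instead).
SELECT calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       left(query, 80)                    AS query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
```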

And having a solid automated deployment pipeline, with a reliable automated regression test suite, should make it possible to quickly deploy fixes to queries or schemas. This reduces the need to catch issues before they hit production, and combined with canary releases and good monitoring it should reduce the need for up-front performance testing.

Euphoric
5

In fact, the querying behaviour of mainstream database engines is practically non-deterministic (from the perspective of a developer), so there is no test that can decisively identify or anticipate every problematic operation.

The reason the behaviour is non-deterministic is to allow the database engine to automatically adapt to the workload it is applied to from time to time.

The alternative to this automatic adaptation would not only be a lot of additional detailed up-front work which the developer would have to perform, but potentially massive rework every time a change occurred (including not just changes in what we think of as the code or configuration, but even mere changes in data volumes, which can tip the balance of certain algorithms or require different approaches to storage).

It's also very difficult to predict, define, or emulate the exact pattern of concurrent loads that a database engine will experience in production use, and therefore extremely difficult to reproduce in testing how a database engine will react to it.

The timing of everything is often crucial, and brute-forcing an analysis and testing of all possible theoretical conflicts would be both computationally intensive (probably impossibly so) and would not be very informative about what matters.

It's impossible for a developer to predict exactly what the timings of everything will be in production use under every possible current and future circumstance, so even if you had a brute-force analysis of every possible problem, you wouldn't know which problems would actually apply to the real world use in production.

In reality, the design of established database engines incorporates a lot of accumulated experience in real production use, and refinement of the technology from that experience.

The problems themselves are considered fairly essential. That is, they don't arise from poor tooling or inexpert design choices (at least not always), but arise from the very nature of business information systems and from the circumstances they have to cope with.

The need for concurrent working on mutable shared data is really the root of all the difficulty.

Steve
2

Most databases are smart at planning queries and will use statistical information from the actual data to do so.

Therefore, if you use a test database with generated dummy data, it is absolutely essential that the size and statistical distribution of this dummy data be a good representation of real world data.

If you test with a small dataset that is not representative of your real data, you will get query plans optimized for your test dataset, which can completely miss the mark. Ironically, the smarter the database, the worse it gets, because query plans can be wildly different depending on data size and statistics.
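In PostgreSQL you can check how far off your test data is by looking at the statistics the planner itself collects; the posts table here is just an example:

```sql
-- After loading the dummy data, refresh statistics and inspect what the
-- planner actually believes about the distribution.
ANALYZE posts;

SELECT attname, n_distinct, null_frac, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'posts';
-- If these numbers look nothing like production, the query plans you are
-- testing against are not the plans production will use.
```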

Example 1: Geographic search

If you generate test data with a random uniform distribution, it is not representative: searching for points within an area of a certain size will always give more or less the same number of results. In the real world, there are very dense cities and vast empty areas. A statistical distribution that models this would allow you to catch performance problems such as using a bounding box that is too large over an area that is too densely populated with results.
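As a sketch of what a better distribution can mean in practice, here is one way to generate clustered rather than uniform points in PostgreSQL (the table name, city centres, and row counts are all made up):

```sql
CREATE TABLE places (
    id  bigserial PRIMARY KEY,
    lat double precision,
    lon double precision
);

-- ~90% of the points clustered around a handful of "city" centres,
-- the rest scattered uniformly as background noise.
WITH cities(lat, lon) AS (
    VALUES (52.52, 13.40), (48.85, 2.35), (40.71, -74.00)
)
INSERT INTO places (lat, lon)
SELECT c.lat + (random() - 0.5) * 0.2,   -- dense cluster, roughly +-0.1 degrees
       c.lon + (random() - 0.5) * 0.2
FROM cities AS c, generate_series(1, 30000)
UNION ALL
SELECT random() * 180 - 90,              -- sparse uniform background
       random() * 360 - 180
FROM generate_series(1, 10000);
```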

Example 2: Forum

Likewise, if test data is generated with a uniform distribution, with all topics having more or less the same number of posts, you will miss the typical runaway topic with thousands of pages. If it is paginated with a naive "ORDER BY timestamp OFFSET ...", the database has to scan through tens of thousands of posts and skip them just to display the last page. This also kills your cache. So the most frequently displayed page (the last one) is the slowest, and the topic in question is the one with the highest traffic, which is a recipe for bad performance.

In this case it could have been caught earlier by thinking a little bit more: the developer's efforts should be focused on making the most frequently displayed pages (the first and the last) the fastest, and it's okay if page 500 of a 1000-page topic is slow to display, because no one will ever display it. In this real-world case, the solution was to switch the ORDER BY from ASC to DESC depending on the page number, and tweak the OFFSET/LIMIT values accordingly. For the last page, it would then scan the (topic_id, timestamp) index starting from the end, resulting in "OFFSET 0", no rows skipped, no cache pollution, and much higher performance.
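In SQL terms the change looked roughly like this; the table and column names are illustrative (created_at stands in for the timestamp column above), assuming an index on (topic_id, created_at):

```sql
-- Naive pagination: the last (hottest) page skips the most rows.
SELECT *
FROM posts
WHERE topic_id = 42
ORDER BY created_at ASC
OFFSET 49980 LIMIT 20;

-- Last page rewritten to walk the index from the end: nothing is skipped,
-- and the application (or an outer query) reverses the 20 rows for display.
SELECT *
FROM posts
WHERE topic_id = 42
ORDER BY created_at DESC
OFFSET 0 LIMIT 20;
```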

It's important to add some outliers to the test data: if you have groups or categories, there should be one that is much larger than the others.

If you have TEXT or binary fields allowing near-unlimited length, these should not all contain "Hello, world" but instead a representative sample of text lengths, especially if you want to test full-text search.
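One quick way to get varied lengths in PostgreSQL (with a hypothetical posts.body column):

```sql
-- Skew the lengths so most texts are short but a few are huge,
-- instead of a constant "Hello, world".
UPDATE posts
SET body = repeat('lorem ipsum dolor sit amet ',
                  (1 + random() * random() * 500)::int);
```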

The test server should not be overpowered. In fact, if you test with a single user, having an overpowered test server can be misleading, because in real use, server resources are shared between all users. So if it has too much RAM and SSDs, everything's going to be too fast!
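If you are stuck with an oversized test box, you can at least make PostgreSQL behave a bit more modestly per session (shared_buffers itself can only be changed with a server restart); a small sketch:

```sql
SET work_mem = '4MB';               -- memory per sort/hash before spilling to disk
SET effective_cache_size = '1GB';   -- tells the planner to expect less OS cache
```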

If you work on a website, it really helps to have a "developer mode" which displays all the queries done for the page at the bottom, along with timings. The way I implemented it was to allow users with admin rights to impersonate a normal user, so they would see the website exactly as seen by users, plus the query log and other performance information at the bottom of the page.

bobflux
1

I had a very similar situation a few times in my career. The worst was being told "we don't need to import old data into the new system" until two weeks before we went live, despite my asking for nearly a year. The home page of that app went from loading in less than a second to taking a minute and forty-five seconds once we had 400k+ base records in the system.

Here's my checklist of what I look for as I build out a project:

  • How many tables are being joined together in a query?
  • Do we have any subqueries in our SELECT statements, especially within columns?
  • When using Entity Framework, or any other ORM, are we including children and collections of children in the queries?
  • Do we have any triggers on the tables when we perform INSERTs or UPDATEs?
  1. In the example I mentioned earlier, the home page was displaying a table that gathered data from FOURTEEN different tables in the database. All of the JOINs were on primary key columns, and no inefficient columns were being used for those connections.

Their DBA suggested we start throwing a ton of Indexes at the tables. However, remember that having too many Indexes can also slow the tables down because of the complexity of the page structures (especially in SQL Server).

By applying all of their DBA's suggested indexes, the query time went from 1:45 down to 1:35, hardly the savings he expected. Instead of throwing indexes at multiple tables, take a look at creating a VIEW. A view allows the database engine to keep the execution plan in memory for far longer than the minor optimizations it performs when it is sent the same command string repeatedly. Using a VIEW for that home page to grab only the columns we wanted, with the full execution plan cached, dropped the load time of that home page back to a single (full) second.
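As an illustration only (generic SQL with made-up names, not the actual fourteen-table query), the idea is to define the join once and let the page select from it:

```sql
-- Narrow view for the home page: only the columns the page needs,
-- with the join defined once in the database.
CREATE VIEW homepage_summary AS
SELECT o.id,
       o.created_at,
       c.name  AS customer,
       s.label AS status
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN statuses  s ON s.id = o.status_id;
-- (the remaining joins of the real query are elided here)

SELECT *
FROM homepage_summary
WHERE created_at >= CURRENT_DATE - 30;
```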

  2. This approach would also apply if you have a complex or subqueried column in a SELECT.

  3. If you are including multiple children, or collections of children, in your queries through an ORM, you'll notice just how convoluted the SELECT statements it generates can be. This can cause massive duplication and explode the amount of data being sent back and forth.

In these cases, you'll be far better off making a separate query for the children using the distinct IDs you get back from the parent objects, then manually assigning the children before returning from your data access layer (sketched below). It's a bit more work to do it this way, but if you're noticing lots of slowdown, you'll save time overall by doing the logic yourself instead of letting the ORM handle it for you.
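Sketched as plain SQL with made-up tables and placeholder IDs, the two-step approach looks like this:

```sql
-- First fetch only the parents the page needs:
SELECT id, customer_id, created_at
FROM orders
WHERE created_at >= CURRENT_DATE - 7;

-- Then fetch all of their children in a single second query, using the
-- parent ids collected in application code (placeholder ids shown here):
SELECT id, order_id, sku, quantity
FROM order_items
WHERE order_id IN (101, 102, 103);

-- The data access layer then attaches each order_items row to its parent
-- order by order_id, instead of letting the ORM duplicate every parent
-- column for every child row.
```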

  4. For the most part, you shouldn't need triggers on your tables. Especially if they wind up cascading to other tables that have their own triggers, they will drastically increase the time required to write data. If other code is sending the data to your database, try to do that work there before saving your data. Granted, this isn't always possible.
krillgar