Broken Query Store

Question

We have SQL Server 2022 Enterprise (16.0.4131.2) on Windows 2022 VM (8 vCPU cores) and there is an issue with the Query Store on one of the databases.

Below is the sequence of steps I can proceed to reproduce the issue on one specific database and observations I have made around it:

1 - The server is in an idle state with just a minor user generated load and negligible CPU utilization as a starting point.

2 - The Query Store is turned off and empty for the problematic database as shown from the query below.

3 - When I switch the Query Store to the READ_WRITE mode, everything is perfectly fine for about 10 - 15 minutes. However, after this period of time the CPU utilization of the server suddenly rises to 15-20% See the Activity Monitor screenshot below.

The wait statistics shows a high increase in SOS_SCHEDULLER_YELD and PREEMPTIVE_OS_QUERYREGISTRY waits - they jump into the top server wait types in that time period.

The CPU utilization generated seems to be generated by an internal background process (SPID < 51) related with a Service Broker activity over the problematic user database. The Service Broker feature is not turned on in the database itself.

4 - The high CPU utilization, the above stated waits and the above mentioned SPID activity are continuously presented until the Query Store on the database is turned off.

There are no signs of any potential "self-recovery", if I keep the Query Store on. Once turned off, the CPU utilization goes down, waits disappear and the SPID stops generating the load and become "sleeping". Switching the Query Store into the READ_ONLY mode does not help. The Query Store must be turned off to get rid off the issue.

There is not any significant amount of data in the Query Store when the issue triggers - see below (queries taken after the QS was turned off).

Other related observations

If I turn on (READ_WRITE) the Query Store and keep it running just a while but not long enough to trigger the issue, then switch it to READ_ONLY the issue is not triggered in the 10-15 minutes time period as described above. However, then it is triggered almost instantly if I switch the Query Store to READ_WRITE any time later.
When the user load increases at the time when the issue is triggered, it can be seen that use queries over the database that takes a few milliseconds to process under the standard conditions starts to slow down and shows also not negligible PREEMPTIVE_OS_QUERYREGISTRY waits (as per sp_WhoIsActive shown below). These queries definitely makes no queries to the Windows registry.

We have the same database (meaning same structure) on several other SQL Server instances with the same load pattern (and even much bigger data and query volumes there) and the Query Store works perfectly fine there. It seems that something is broken in the Query Store just for this specific database.
It can be important, that the Query Store for that database was originally full and "size based cleanup process" triggered and SQL Server (likely) restarted during that time. I am suspicious that there is something broken in the QS in that database in relation to that ... I have tried also the sys.sp_query_store_consistency_check procedure, but without success.
Once the issue is triggered and "suspended" by turning the QS off, then when the QS is turned on again (without prior cleanup), it seems the issue is triggered again almost instantly.

The Query Store is completely unusable for the database under such conditions. Any ideas of how to fix this?

UPDATE 2024-08-01

Based on hint from Paul White's comment, I have made some experiments with the CE_FEEDBACK database scoped option and it seems to be the true trigger. The behavior is as described below:

I have the uncleared Query Store from previous attempts (just a few hundreds of catch queries) as the starting point.

When I set CE_FEEDBACK = OFF and turn the QS to READ_WRITE the issue does not trigger.
When I set CE_FEEDBACK = ON and turn the QS to READ_WRITE the issue triggers almost instantly.
When I set CE_FEEDBACK = OFF and turn the QS to READ_WRITE the issue does not trigger. Then (with the running QS) I switch the CE_FEEDBACK = ON. The issue again triggers instantly. The interesting also is that when I set the CE_FEEDBACK = OFF again, it is not enough to stop the issue. The QS must be turned off to stop it as well.

WORKLOAD TYPE

As the issue seems to be related also with the workload type, I will try to describe it here, as the pattern is quite straightforward. It is almost 99% ad-hoc with the following query patterns:

About 85% of the queries executed during the experiment is a simple join to table variable. The number rows passed to the variable varies from +/- 10 to +/- 500. The Source DB table vary between queries but the execution plan pattern is same as shown below.

The rest of the workload are queries that follows the pattern as below. The problem with this portion is that due to a high variability of the parameters (the number of parameters often changes), then is is capable to fill a 2GB query store in a few days with "unique" queries, but it is not the case for the current scenario

I can imagine, that likely the first portion of the workload could be the problematic one for the CE_FEEDBACK.

score 13 · Accepted Answer · answered Aug 01 '24 at 10:08

The issue is with the CE feedback mechanism based on the workload and queries of that database coupled with how the CE Feedback QDS integration is designed.

Looking at the trace, the hot stack is indeed QDS but in RunCEFeedbackAnalysis which has a death by 1000 cuts of GetStmt which seems to be executed to get every item in QDS for that database.

I believe a little known fact is that UCS (which utilizes service broker under the covers) is used for task execution for QDS items, which is why you see that the thread looks like a service broker thread, though it isn't. You can also see this in the trace above at CSbTask::BeginTaskStart.

Since we know what the database scoped settings are and with the updates, you've already confirmed turning off CE Feedback does indeed stop the issue, which is what the stack suggests as well.

For now, you can leave CE Feedback turned off to not run into this issue in that database or go to a compat level that doesn't have it available. It's hard to say without a deep investigation of the database workload and QDS if this is something that could be more efficiently completed in SQL or if it's a edge case for database workload/QDS settings that doesn't fit well which is definitely out of scope of this site.

If you have an agreement you could open up a support ticket with Microsoft, though there would be no immediate changes or outcome and if anything it'd be changed far into the future.

Broken Query Store

1 Answers1

Linked

Related