
The database, MS SQL 2014 Standard Edition, is becoming a bottleneck in our multi-tenant web application. I'm aware that most performance issues can be solved by making the code more efficient, but I'm looking beyond that now.

When load is high, the database seems to grind to a halt. At seemingly random times it slows down massively (my guess is table locking when one tenant inserts a load of data). Other times we see deadlocked transactions. In the latest slowdown, a big chunk of the activity was deleting single rows from one table with queries like DELETE FROM tableX WHERE ID = x, resulting in a lot of "Transaction (Process ID 183) was deadlocked on lock resources with another process and has been chosen as the deadlock victim" errors.

Which of the below strategies would you recommend, if any?

  1. Changing clustered indexes from the auto-incrementing int PK to tenantID + id (note: all tables have a tenantID column and already have a non-clustered index on tenantID):

    This would be an attempt to stop page/table locks caused by one tenant's activity from affecting all other tenants. This seems to be recommended practice for multi-tenant DBs. Otherwise, if a tenant inserts 4,000 rows or so, it seems to lock the whole table.

  2. Changing from the default isolation level to Read Committed Snapshot so writes don't block reads and vice versa

  3. Moving to faster physical disks - we're currently on a RAID 10 array of SSDs and looking at Intel NVMe PCIe SSDs

    We use SQL Mirroring in High Safety Mode because we cannot lose a single booking/row to a hardware error. I was under the impression that all writes go to disk and the transaction won't complete until it has also been written to disk on the mirror. Networking between the two servers is sub-1ms with no contention. There's plenty of RAM in the machine (64GB, doubling to 128GB in a few days), but it also hosts a second multi-tenant DB (~40GB) for a different product.

Marcus

2 Answers


Your comment:

Possibly, but there are a lot of queries against one table of bookings, with many people trying to get bookings at the same time. It's more complex than that, with queries needing to join other tables, data being loaded into those tables by other tenants at the same time, etc.

So, it's a database! Great, let's answer some questions.

Changing clustered indexes from the auto-incrementing int PK to tenantID + id (note: all tables have a tenantID column and already have a non-clustered index on tenantID)

You know that Simpsons episode where they bring in lizards to catch pigeons, snakes to catch the lizards, and snake-eating gorillas to catch the snakes? That's what you're doing here. The identity column clustered index is not the problem.

Changing from the default isolation level to Read Committed Snapshot so writes don't block reads and vice versa

This is a wonderful option that you should fully explore, but you have to make sure that you don't run into any race conditions with queueing type queries that depend on locking and blocking to assign work. Since your database is so small now (20 GB), you're in a great place to start exploring optimistic isolation levels. If parts of your code can't handle RCSI, Snapshot Isolation may be easier to approach because you can apply it to specific queries.
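For reference, here's a minimal sketch of what enabling each option looks like. The database name is a placeholder, and flipping RCSI on needs exclusive-ish access, so it's typically done in a maintenance window:

    -- Database name (YourAppDb) is a placeholder. Both options keep row
    -- versions in tempdb, which is worth remembering for the disk talk below.

    -- Opt-in Snapshot Isolation: only sessions that ask for it are affected.
    ALTER DATABASE YourAppDb SET ALLOW_SNAPSHOT_ISOLATION ON;

    -- RCSI: changes the default READ COMMITTED behavior for every query.
    -- ROLLBACK IMMEDIATE kicks out open transactions so the switch can happen.
    ALTER DATABASE YourAppDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

    -- Opting a specific transaction into Snapshot (needs only the first ALTER):
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
    BEGIN TRANSACTION;
        SELECT COUNT(*) FROM dbo.tableX WHERE TenantID = 42;  -- placeholder query
    COMMIT TRANSACTION;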

Moving to faster physical disks - currently on a RAID 10 array of SSDs, looking at the Intel NVMe PCI-e SSDs

That's silly all the way around. You have a 20 GB database. Buy 32 GB of RAM. If it's all in memory, disk doesn't matter, and 32 GB of RAM is way cheaper than all that SSD. Though I'm totally on board with you grabbing some NVMes for tempdb, memory is the better approach.

Of course, no change means anything without some basic troubleshooting. Try a free script like sp_BlitzFirst (full disclosure, I contribute to this OSS) to figure out what your queries are actually waiting on.

Run EXEC sp_BlitzFirst @SinceStartup = 1 and look at your wait stats. Pay attention to the avg ms per wait column.
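If you want to eyeball the raw numbers as well, the underlying DMV is sys.dm_os_wait_stats. A rough query (note this still includes benign wait types that sp_BlitzFirst would filter out for you):

    EXEC sp_BlitzFirst @SinceStartup = 1;

    -- Raw cumulative wait stats since the last restart.
    SELECT TOP (20)
        wait_type,
        waiting_tasks_count,
        wait_time_ms,
        wait_time_ms * 1.0 / NULLIF(waiting_tasks_count, 0) AS avg_ms_per_wait
    FROM sys.dm_os_wait_stats
    WHERE waiting_tasks_count > 0
    ORDER BY wait_time_ms DESC;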

If you have any questions about those, post back here with them.

Update

  1. Cluster indexes to include tenantID - this was an attempt to stop page/table locks caused by one tenant's activity from affecting all other tenants. This seems to be recommended practice for multi-tenant DBs. Otherwise, if a tenant inserts 4,000 rows or so, it seems to lock the whole table.

You likely won't get much different locking behavior with this than with proper nonclustered indexes. If you really want one tenant not to bother another, you can try a partitioned view (or, if you upgrade to 2016 SP1, actual partitioning), or separate tenants out into their own databases.
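To sketch the partitioned-view idea (table, column, and tenant values below are placeholders, and the rules for making such a view updatable are stricter, e.g. no IDENTITY columns in the member tables if you want to insert through the view):

    -- One member table per tenant, each with a CHECK constraint on TenantID
    -- and TenantID in the primary key, so the optimizer can eliminate member
    -- tables that don't match a query's TenantID.
    CREATE TABLE dbo.Bookings_Tenant1
    (
        TenantID int NOT NULL CHECK (TenantID = 1),
        ID       int NOT NULL,
        BookedAt datetime2 NOT NULL,
        CONSTRAINT PK_Bookings_Tenant1 PRIMARY KEY CLUSTERED (TenantID, ID)
    );

    CREATE TABLE dbo.Bookings_Tenant2
    (
        TenantID int NOT NULL CHECK (TenantID = 2),
        ID       int NOT NULL,
        BookedAt datetime2 NOT NULL,
        CONSTRAINT PK_Bookings_Tenant2 PRIMARY KEY CLUSTERED (TenantID, ID)
    );
    GO

    CREATE VIEW dbo.Bookings
    AS
    SELECT TenantID, ID, BookedAt FROM dbo.Bookings_Tenant1
    UNION ALL
    SELECT TenantID, ID, BookedAt FROM dbo.Bookings_Tenant2;
    GO

    -- A query filtered on TenantID only touches (and locks) that tenant's table:
    SELECT ID, BookedAt FROM dbo.Bookings WHERE TenantID = 2;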

Lock escalation will occur regardless of whether locks are taken against clustered or nonclustered indexes. I'm not sure what you've done for index tuning, but missing indexes can definitely be a factor if lock escalations are inappropriately frequent. Same full disclosure about contributing as above: sp_BlitzIndex can help you identify both aggressively locked indexes and missing indexes.
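A quick way to see whether escalations and lock waits are concentrated on particular indexes is sys.dm_db_index_operational_stats; run this in the affected database (the sp_BlitzIndex database name below is a placeholder):

    SELECT
        OBJECT_NAME(ios.object_id)              AS table_name,
        i.name                                  AS index_name,
        ios.index_lock_promotion_attempt_count,
        ios.index_lock_promotion_count,         -- escalations that actually happened
        ios.row_lock_wait_in_ms,
        ios.page_lock_wait_in_ms
    FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) AS ios
    JOIN sys.indexes AS i
        ON  i.object_id = ios.object_id
        AND i.index_id  = ios.index_id
    ORDER BY ios.index_lock_promotion_count DESC;

    -- Aggressively locked and missing index diagnosis:
    EXEC sp_BlitzIndex @DatabaseName = N'YourAppDb';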

  2. Disk IO vs RAM. We use SQL Mirroring in High Safety Mode because we cannot lose a single booking/row to a hardware error. I was under the impression that all writes go to disk and the transaction won't complete until it has also been written to disk on the mirror. Networking between the two servers is sub-1ms with no contention. There's plenty of RAM in the machine (64GB, doubling to 128GB in a few days), but it also hosts a second multi-tenant DB (~40GB) for a different product.

This is the sort of important information that should have been in your original question. If the network is fine, then your two culprits for Mirroring latency will likely be transaction size or transaction frequency. Some of that might be mitigated by putting the log file on faster disks, but ultimately controlling the size or frequency of transactions is a smarter approach.
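To make that concrete, here's a hedged sketch of what controlling transaction frequency can look like: instead of committing one row at a time (each commit is a synchronous round trip to the mirror in high safety mode), work in moderate batches. The table and filter are placeholders based on the question:

    DECLARE @TenantID int = 42;   -- placeholder tenant
    DECLARE @rows     int = 1;

    WHILE @rows > 0
    BEGIN
        -- One commit (one mirror round trip) per 500 rows instead of per row,
        -- while keeping each transaction small enough not to strain the mirror.
        DELETE TOP (500)
        FROM dbo.tableX
        WHERE TenantID = @TenantID
          AND ExpiresAt < DATEADD(DAY, -30, SYSDATETIME());  -- placeholder filter

        SET @rows = @@ROWCOUNT;
    END;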

Just to add: in the latest slowdown, a big chunk of the activity was deleting single rows from this table with queries like DELETE FROM tableX WHERE ID = x, along with a lot of "Transaction (Process ID 183) was deadlocked on lock resources with another process and has been chosen as the deadlock victim" errors.

Theoretically, two write queries won't deadlock; they'll just block each other until one finishes. Are you using foreign keys or other constraints that touch other tables with cascading actions? Have you identified which queries they deadlocked with?
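That's worth answering before changing anything. SQL Server keeps recent deadlock graphs in the built-in system_health Extended Events session, so you can pull the actual victims and their partners with a query along these lines (no setup needed):

    SELECT XEventData.XEvent.query('(event/data/value/deadlock)[1]') AS DeadlockGraph
    FROM (
        SELECT CAST(st.target_data AS xml) AS TargetData
        FROM sys.dm_xe_session_targets AS st
        JOIN sys.dm_xe_sessions AS s
            ON s.address = st.event_session_address
        WHERE s.name = N'system_health'
          AND st.target_name = N'ring_buffer'
    ) AS src
    CROSS APPLY TargetData.nodes('RingBufferTarget/event[@name="xml_deadlock_report"]') AS XEventData (XEvent);

The graphs will show whether those single-row deletes are tangling with cascading foreign keys, triggers, or readers.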

RCSI won't help with write query blocking, though Snapshot can if you're willing to spend a decent amount of time around error handling and query retries.
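If you do go the Snapshot route for writers (it needs the ALLOW_SNAPSHOT_ISOLATION setting shown earlier), the retry logic usually centers on error 3960, the snapshot update-conflict error. A minimal sketch with placeholder names:

    DECLARE @ID      int = 123;  -- placeholder key
    DECLARE @retries int = 3;

    WHILE @retries > 0
    BEGIN
        BEGIN TRY
            SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

            BEGIN TRANSACTION;
                DELETE FROM dbo.tableX WHERE ID = @ID;
            COMMIT TRANSACTION;

            BREAK;  -- success, stop retrying
        END TRY
        BEGIN CATCH
            IF XACT_STATE() <> 0 ROLLBACK TRANSACTION;

            IF ERROR_NUMBER() = 3960 AND @retries > 1
                SET @retries -= 1;  -- update conflict: try again
            ELSE
                THROW;              -- anything else, or out of retries: re-raise
        END CATCH;
    END;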

But again, this is all hand-wavey without wait stats or other diagnostic information.

Best of luck with everything.

Erik Reasonable Rates Darling

Not as a counterpoint to anything stated in @sp_BlitzErik's answer, but to more fully address one specific question out of the several posed by the OP:

Changing clustered indexes from the auto-incrementing int PK to tenantID + id (note: all tables have a tenantID column and already have a non-clustered index on tenantID)

Simply stated: Yes, yes, yes, and probably again yes :-). Absolutely cluster on TenantID (first) and then the ID / Identity column. Right now you have the clustered index on just the IDENTITY column, and that value gets copied into the non-clustered index on TenantID, so you have two indexes and the Identity value is in both. Consider the following regarding the current setup:

  • It is taking up more disk space to have the Identity column duplicated
  • When you INSERT or DELETE rows, both indexes need to be modified to account for the new / removed rows. That increases the potential for getting deadlocks (and you mention in a comment on @sp_BlitzErik's answer that this is indeed happening).
  • If it is not obvious, maintaining two indexes takes more time and more transaction log space than maintaining just one.
  • Given that most queries should (hopefully!) have the TenantID in the WHERE clause, this forces most queries to use the non-clustered index instead of going straight to the clustered index. This is an unnecessary performance hit on most of your queries. Only back-end / maintenance queries (i.e. those that operate across Tenants) would use only the Identity column to find rows, and it is highly doubtful that those represent the majority of the queries being processed.
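The mechanical change itself is small, though on a large table it is a size-of-data operation, and any foreign keys referencing the primary key have to be dropped and re-created around it. Table, constraint, and index names below are placeholders based on the question:

    -- Drop the existing clustered PK on just the IDENTITY column...
    ALTER TABLE dbo.tableX DROP CONSTRAINT PK_tableX;

    -- ...and re-create it clustered on (TenantID, ID). ID keeps its IDENTITY
    -- property and still makes every row unique on its own.
    ALTER TABLE dbo.tableX
        ADD CONSTRAINT PK_tableX
        PRIMARY KEY CLUSTERED (TenantID, ID);

    -- The now-redundant nonclustered index on TenantID alone can usually go:
    DROP INDEX IX_tableX_TenantID ON dbo.tableX;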

For more details as to why you should proceed with this change, and at least one problem you might run into and how you can get past it, please see my answer to the following question, also here on DBA.StackExchange:

Composite Primary Key in multi-tenant SQL Server database

Solomon Rutzky