
My application is centered around self-contained "workspaces". For many good reasons (everything from management to security), we have always had a one-database-per-workspace architecture. Each database has identical schema, stored procedures, triggers, etc. A "database of databases" coordinates all of this. It works great.

The problem: scalability. It was recently proposed that a customer might want 100,000 workspaces. Obviously this is a non-starter for one SQL Server instance. Moreover, while each workspace might be rather small, there would be a very wide size distribution: the biggest workspace could be 100x the size of the median, and the top 1% of workspaces could easily constitute 90+% of the rows across all workspaces.

I'm looking for options for rearchitecting things to support this scenario, and here are some things I've considered and the issues I see with each.

  • Keep the multi-database architecture but spread across multiple SQL Server instances. The problem is cost, both administrative and infrastructure. If we stick to a limit of 1,000 databases per instance, that's still 100 instances, spread across who knows how many actual VMs. And since so many of the workspaces will be small (much smaller than our current average), revenue won't come close to scaling with that cost. So I think this is probably out of the question, and I'm focusing now on single-database architectures.

  • Every workspace shares the same tables, keyed by workspace ID. Every table would need a new WorkspaceID column, and every query would need a workspace condition in its WHERE clause (or, more likely, every real table would be wrapped in an inline table-valued function that takes the WorkspaceID - see the first sketch after this list). The primary key of every table would also have to be redefined to include the workspace ID, since the current PKs are not globally unique. Programming-wise this is all fine. But even with proper indexing and perfect query design (and no, not all our queries are perfect - the dreaded table scan still happens on occasion), is there any conceivable way this could perform as well - for everyone - as separate databases? More specifically, can we guarantee that small workspaces won't suffer from the presence of big ones holding 100x more rows? And what specific steps would need to be taken - the type of index to use, the way queries are written - to guarantee that the optimizer always narrows things down by workspace ID before it does anything else?

  • Partitioning - from what I've read, partitioning by itself doesn't improve query performance, and SQL Server caps a table or index at 15,000 partitions (with older guidance recommending no more than 1,000), so it can't cover 100,000 workspaces either.

  • Create the same set of tables under a new schema for each workspace (see the second sketch below). I thought of this because there is no limit on the number of tables a database can have other than the overall limit of roughly 2 billion objects per database. But I haven't explored this idea much, and I'm wondering whether there would be performance concerns with 100,000 schemas and millions of tables, views, stored procedures, etc.
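Something like this is what I have in mind for the shared-tables option - a minimal sketch with made-up table and column names:

    -- Hypothetical shared table: WorkspaceID leads the clustered PK,
    -- so each workspace's rows are physically stored together.
    CREATE TABLE dbo.Orders
    (
        WorkspaceID int       NOT NULL,
        OrderID     int       NOT NULL,
        CustomerID  int       NOT NULL,
        OrderDate   datetime2 NOT NULL,
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (WorkspaceID, OrderID)
    );
    GO

    -- Inline table-valued function wrapping the table, so application
    -- queries can't forget the filter; the optimizer expands inline TVFs,
    -- so there is no call overhead.
    CREATE FUNCTION dbo.Orders_ws (@WorkspaceID int)
    RETURNS TABLE
    AS
    RETURN
        SELECT OrderID, CustomerID, OrderDate
        FROM dbo.Orders
        WHERE WorkspaceID = @WorkspaceID;
    GO

    -- Usage: SELECT * FROM dbo.Orders_ws(@CurrentWorkspace) WHERE OrderID = 42;
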
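And for the schema-per-workspace option, provisioning would look something like this (the ws-prefix naming and the single example table are made up):

    -- Hypothetical provisioning for one workspace; in practice this would
    -- be generated and run by the "database of databases" layer.
    DECLARE @WorkspaceID int = 42;
    DECLARE @SchemaName sysname = CONCAT(N'ws', @WorkspaceID);
    DECLARE @sql nvarchar(max);

    -- CREATE SCHEMA must be alone in its batch, hence the dynamic SQL.
    SET @sql = N'CREATE SCHEMA ' + QUOTENAME(@SchemaName) + N';';
    EXEC sys.sp_executesql @sql;

    -- One copy of every table per workspace, each in its own schema.
    SET @sql = N'CREATE TABLE ' + QUOTENAME(@SchemaName) + N'.Orders
    (
        OrderID    int       NOT NULL PRIMARY KEY,
        CustomerID int       NOT NULL,
        OrderDate  datetime2 NOT NULL
    );';
    EXEC sys.sp_executesql @sql;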

With all that, here is the specific question: what features of SQL Server, and/or general strategies (including but not limited to the ones above), would be most useful for maintaining a large number of self-contained data sets with identical schemas in a single giant database? To reiterate, keeping performance as close as possible to the multi-database architecture is the top priority.

And needless to say, if any part of my assessment above seems incorrect or misguided I'd be glad to be corrected. Many thanks.

Peter Moore

2 Answers


More specifically, can we guarantee that small workspaces won't suffer from the presence of big ones holding 100x more rows? And what specific steps would need to be taken - the type of index to use, the way queries are written - to guarantee that the optimizer always narrows things down by workspace ID before it does anything else?

It's not all-or-nothing. You can retain your multi-database architecture while letting several workspaces share a database: co-locate only the smaller workspaces, and give the large ones their own databases.

The normal indexing approach is to add the WorkspaceID as the leading column of every primary key, which physically co-locates the rows for a particular workspace.
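For an existing table whose PK is, say, OrderID alone, the change would look something like this (table and constraint names are illustrative, and any foreign keys referencing the PK would have to be dropped and re-created around it):

    -- Rebuild the PK with WorkspaceID leading; both statements rebuild
    -- the table, so do this in a maintenance window.
    ALTER TABLE dbo.Orders DROP CONSTRAINT PK_Orders;

    ALTER TABLE dbo.Orders
        ADD CONSTRAINT PK_Orders
        PRIMARY KEY CLUSTERED (WorkspaceID, OrderID);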

You'll need a procedure to delete a workspace from a database. Then, to split a database, just restore a second copy of it and delete the appropriate workspaces from each copy.
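A sketch of such a procedure, with made-up table names - children are deleted before parents to satisfy foreign keys, and very large workspaces would want batched deletes to keep the transaction log in check:

    CREATE PROCEDURE dbo.DeleteWorkspace
        @WorkspaceID int
    AS
    BEGIN
        SET NOCOUNT ON;
        BEGIN TRANSACTION;
            -- One DELETE per table, in dependency order (children first).
            DELETE FROM dbo.OrderLines WHERE WorkspaceID = @WorkspaceID;
            DELETE FROM dbo.Orders     WHERE WorkspaceID = @WorkspaceID;
            DELETE FROM dbo.Customers  WHERE WorkspaceID = @WorkspaceID;
        COMMIT TRANSACTION;
    END;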

David Browne - Microsoft

Keep the multi-database architecture but spread across multiple SQL Server instances. The problem is cost, both administrative and infrastructure. If we stick to a limit of 1,000 databases per instance, that's still 100 instances, spread across who knows how many actual VMs. And since so many of the workspaces will be small (much smaller than our current average), revenue won't come close to scaling with that cost. So I think this is probably out of the question, and I'm focusing now on single-database architectures.

Given the current architecture, this is the obvious way to scale. You would need to find some way to balance the load: perhaps up to 25,000 small-workspace databases on some servers but no more than 20 large-workspace databases on others. There is an administrative cost to this, but the infrastructure cost should be about the same, since SQL Server is licensed per core and the total number of cores (as well as memory and storage) should be roughly the same even when spread over more servers.
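The "database of databases" you already have could carry the placement data this balancing needs; a minimal sketch, with a hypothetical directory table:

    -- Maps each workspace to the server and database that host it; the
    -- routing layer consults this on every connection.
    CREATE TABLE dbo.WorkspaceDirectory
    (
        WorkspaceID  int     NOT NULL PRIMARY KEY,
        ServerName   sysname NOT NULL,
        DatabaseName sysname NOT NULL,
        SizeMB       bigint  NOT NULL DEFAULT 0  -- refreshed periodically; drives placement
    );

    -- Pick the least-loaded server when placing a new workspace.
    SELECT TOP (1) ServerName
    FROM dbo.WorkspaceDirectory
    GROUP BY ServerName
    ORDER BY SUM(SizeMB) ASC;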

Any of the other solutions will reduce the isolation of the data, which might be a problem.

Aardvark