How to create an efficient, scalable, multi-tenant data layer using MongoDB?

Question

I'm working on the architecture for my upcoming Project Management app (as an example) and I'm seeking clarity on how best to design the MongoDB data layer, with specific regard to multi-tenancy. The app will have multiple 'sub-apps' (e.g. Calendar, Tasklist, Media, Team, etc.) which would each map to a Collection in the database (either a centralized DB or its own Project DB).

DB Server == Replica Set.

The Questions

Should I use one giant, centralized database to store all the application data, or create an individual database for each Project that is created on the system?
If I choose the individual DB strategy, does that obviate the need to shard the data layer, given that the DBs are 'naturally' dispersed across several servers, thus 'naturally' spreading the load across several servers? The application would contain logic that tells it which server holds the data for any given Project.
Would using individual DBs for each Project give me better performance (given that to find any given document, Mongo would only have to search at most a few thousand docs in the individual Project DB vs. potentially millions in a giant, centralized DB)?
Is it at all possible to reduce the 32 MB minimum footprint for a MongoDB database? I've read the documentation for the --smallfiles option, but that didn't really answer my question. Is this a hard minimum?
If any given Project received a large amount of traffic and became a 'noisy neighbour', would the solution just be to spin up a new DB Server and move that Project to the new server? Or would it be a better approach to shard the DB Server that houses the noisy neighbour to increase performance on that server?
What 'maintenance' concerns would I have with regard to cleaning up space for any given deleted Project, and/or 'shrink-wrapping' each DB to bring its footprint as close as possible to the actual amount of data stored in any given Project database?
What concerns should I be aware of with regard to future changes in the data 'schema' that would have to be 'rolled out' across all the Project DBs? Given that Mongo is 'schema-less', is it correct to assume that if I want to add a new 'field' to any given Collection, I would just do so in the app logic, without having to roll out any updates to the DBs themselves?
What MongoDB 'tools' would I use to get information about the current 'status' of any given DB Server?
Are there any limits to the number of DBs that can be housed on any given DB Server that I should be aware of?
How does the individual DB strategy affect backups? Are there any concerns I should be aware of when backing up (to S3 for disaster recovery) many DBs across many DB Servers?

The App Stack

Ubuntu 12.04 LTS
Nginx
node.js
express.js
MongoDB

Current Working Strategy

My current working strategy is to use one database to store the higher-level, 'global' data like Users, Notifications, Messages, Usage, and Preferences. And then create a new database for each project created on the system.

This seems like the ideal approach for many reasons: security (each DB has its own creds), catastrophic recovery (since if one DB Server goes down the entire app doesn't go down), and performance (I think, since Mongo would have to search far fewer docs to find the one it's looking for).

The application would contain logic that automatically detects available space on any given DB Server and creates the new Project database on the next available DB Server.
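To make the placement idea concrete, here is a minimal sketch of what that logic might look like. It is only an illustration: the `DbServer` shape, the server names, and the "most free space wins" rule are assumptions of the sketch, and the 32 MB figure is simply the empty-database footprint mentioned in this question.

```typescript
// Hypothetical description of one DB Server (replica set) known to the app.
interface DbServer {
  name: string;       // illustrative label, e.g. "rs1"
  capacityMb: number; // total storage the server may use
  usedMb: number;     // storage already allocated
}

// Assumption: an empty Project database reserves about 32 MB.
const EMPTY_DB_MB = 32;

// Return the server with the most free space that can still hold one new
// Project database, or undefined when no server has room.
function pickServerForNewProject(servers: DbServer[]): DbServer | undefined {
  let best: DbServer | undefined;
  for (const s of servers) {
    const free = s.capacityMb - s.usedMb;
    if (free < EMPTY_DB_MB) continue; // not enough room for even an empty DB
    if (best === undefined || free > best.capacityMb - best.usedMb) {
      best = s;
    }
  }
  return best;
}
```

The app would then record the chosen server against the Project in the 'global' database so later requests know where to connect.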

According to this article provided by MongoHQ, this is the 'best' strategy, although it consumes a large amount of storage, especially since each DB takes up 32 MB even when it's empty. That gets very expensive on a service like MongoHQ if you're offering a 'Freemium' app that gets Techcrunch'd.

So in a scenario where ProjectManager has three projects on the system my data layer would look like so:

ProjectManager
  Users
  Notifications
  Messages
  Usage
  Preferences

Project01
  Calendar
  Tasks
  Media
  Team

Project02
  Calendar
  Tasks
  Media
  Team

Project03
  Calendar
  Tasks
  Media
  Team

Each of the above ProjectXX DBs would be tiny, storing about 2000-3000 documents at most.

Thanks in advance for any insight.

score 8 · Accepted Answer · answered Mar 02 '14 at 13:25

A few things to keep in mind:

What is efficient at large scale is not always efficient at small scale.
What you think you'll need at large scale is often not what you actually need when you get to that scale.
Best performance is application specific, not generic. What performs best for your app may not be what performs best for my app.
All schema-less means is that the DB system won't fight you when you change it. Your application code still has a schema you'll have to engineer around.
- Adding a new field? Mongo don't care, only your app-logic does.
- Changing a field from single to multi-valued? Mongo don't care. But the functions in your code most definitely would. You'll need to build a data migration path, or engineer your code to handle both cases.
Operational limitations (what AWS instances you can afford, for example) will drive when you need to expand your server footprint.
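As a rough illustration of the single- to multi-valued case above, the app code can read both shapes while a migration is in progress. This is a sketch only: the `TaskDoc` type and its `assignee` field are hypothetical names, not part of the question's schema.

```typescript
// Hypothetical task document: `assignee` was a single string in the old
// shape and is an array of strings in the new one.
interface TaskDoc {
  title: string;
  assignee?: string | string[];
}

// Read the field as an array regardless of which shape is stored.
function getAssignees(doc: TaskDoc): string[] {
  if (doc.assignee === undefined) return [];
  return Array.isArray(doc.assignee) ? doc.assignee : [doc.assignee];
}

// Migration step: return a copy of the document in the new (array) shape.
function migrateTask(doc: TaskDoc): TaskDoc {
  return { ...doc, assignee: getAssignees(doc) };
}
```

Running `migrateTask` over old documents in the background and reading through `getAssignees` everywhere else covers both options: the code handles either shape, and the data converges on the new one.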

Given that, there are a few design patterns to follow as you build your system right now. These are items that will make scaling later on easier depending on what you learn along the way.

Shard now
This forces you to start thinking about good shard keys, since the shard key is a piece of your non-schema that is hard to change down the road. You're not sharding for performance at this point, you're sharding now to ensure your code can handle it and to boot the performance question down the road a few milestones.

Engineer multi-database support in now
If you expect you'll need more than one database, or even more than one cluster of Mongo databases, building in data-locality at the early stages will ease putting it in later. Right now it may all be in one cluster, and all projects/tasks/calendar/users in the same three MongoD instances, but when you learn that the Calendars database is slowing everything down and needs to move to SSD-backed instances you can make that change a lot easier.
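One way to build that in early is to have every data access go through a small lookup that says where a data set lives, even while the answer is always the same cluster. A minimal sketch follows; the cluster labels, URIs, and the `locate` helper are all made up for illustration.

```typescript
// Hypothetical cluster labels and connection URIs.
const clusters = new Map<string, string>([
  ["default", "mongodb://cluster-a.example.internal:27017"],
  ["ssd", "mongodb://cluster-ssd.example.internal:27017"],
]);

// Which cluster each logical data set lives on. Anything not listed here
// falls back to "default", so at first everything shares one cluster.
const placement = new Map<string, string>();

// Resolve a logical data set name to the place the app should connect to.
function locate(dataSet: string): { uri: string; dbName: string } {
  const cluster = placement.get(dataSet) ?? "default";
  const uri = clusters.get(cluster);
  if (uri === undefined) throw new Error(`unknown cluster: ${cluster}`);
  return { uri, dbName: dataSet };
}
```

When Calendars needs to move to SSD-backed instances, the change on the application side is a single `placement.set("calendar", "ssd")` (plus moving the data), rather than a hunt through every query.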

Database compaction only matters in some cases
Since the database files themselves are mmapped, having a 2GB database file contain a 200MB database doesn't actually hurt performance so long as your storage subsystem handles random I/O well. Compaction also takes nodes offline for a while, which can significantly affect normal operations. And you don't have to worry about compaction at all if you never delete documents.

Know what you get with separated Collections and separated Databases
Collections in the same database share the same Database Lock, whose scope the MongoDB developers have been steadily whittling down with each version.
Databases in the same instance share I/O with each other, along with the very few Global Lock events that still remain.

Indexes matter a lot
If you don't have enough RAM to keep at least your indexes in memory, performance will be really bad. Depending on how large you get, you may end up sharding or splitting collections in order to get indexes that will fit into RAM again. This is one area where multi-tenancy can be an issue: if you've got a few large, unused tenants in a single collection, all those indexes have to be kept in RAM to make the whole system run. If you split collections based on tenant, the unused indexes can get paged out with little penalty.
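A back-of-the-envelope way to see the difference, as a sketch under simplifying assumptions: treat each tenant's share of the index data as a known number of bytes (the figures and the `TenantIndexStats` shape here are invented), and assume a shared collection's indexes must stay entirely hot while per-tenant collections only need the active tenants' indexes in RAM.

```typescript
// Hypothetical per-tenant figures; real numbers would have to be measured.
interface TenantIndexStats {
  tenant: string;
  indexBytes: number; // index data attributable to this tenant
  active: boolean;    // whether the tenant is currently being used
}

// Shared collection: every tenant's entries sit in the same indexes,
// so the whole total needs to stay in RAM.
function hotIndexBytesShared(stats: TenantIndexStats[]): number {
  return stats.reduce((sum, s) => sum + s.indexBytes, 0);
}

// Collection per tenant: only active tenants' indexes need to stay in RAM;
// the rest can be paged out.
function hotIndexBytesPerTenant(stats: TenantIndexStats[]): number {
  return stats
    .filter((s) => s.active)
    .reduce((sum, s) => sum + s.indexBytes, 0);
}
```

With two large idle tenants and one small active one, the shared layout needs room for all three while the split layout only needs room for the small one. Real behaviour is messier than this, since paging happens at a finer grain than whole indexes.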
