128

I am just confused about how Sharding and Replication work.

According to the definitions I found in the documentation:

Replication: A replica set in MongoDB is a group of mongod processes that maintain the same data set.

Sharding: Sharding is a method for storing data across multiple machines.

As per my understanding if I have 75 GB of data then by using replication (3 servers), it will store 75GB data on each server means 75GB on Server-1, 75GB on server-2 and 75GB on server-3. (correct me if I am wrong).

And by using sharding, it will be stored as 25GB data on server-1, 25Gb data on server-2 and 25GB data on server-3. (Right?).

But then I encountered this line in the tutorial:

Shards store the data. To provide high availability and data consistency, in a production sharded cluster, each shard is a replica set

As a replica set is 75GB in size, but shard is 25GB in size, then how can they be equivalent?

This makes me quite confused. I think I am missing something obvious. Please help me with this.

Saad Saadi
  • 1,411
  • 2
  • 10
  • 7

4 Answers4

181

A Replica-Set means that you have multiple instances of MongoDB which each mirror all the data of each other. A replica-set consists of one "Primary" and one or more "Secondaries".

All write-operations go to the primary and are then replicated to the secondaries. So writes won't get faster when you add more secondaries.

Read-operations, on the other hand, can be served by any secondary. So when you have a lot of read requests, then you can increase read-performance by adding more secondaries to the replica-set and having your clients distribute their requests to different members of the replica-set.

Replica-sets also offer fault-tolerance. When one of the members of the replica-set goes down, the others take over. When the primary goes down, the secondaries will elect a new primary. For that reason it is recommended for productive deployment to always use MongoDB as a replica-set of at least three servers, with at least two of them holding data. In that scenario, the third one is a data-less "arbiter" which serves no purpose except electing the remaining secondary as the new primary when the actual primary goes down.

A Sharded Cluster means that each shard of the cluster (which can also be a replica-set) takes care of a part of the data. Each request, both reads and writes, is served by the cluster where the data resides. This means that both read- and write performance can be increased by adding more shards to a cluster. Which document resides on which shard is determined by the shard key of each collection. It should be chosen in a way that the data can be evenly distributed on all clusters and so that it is clear for the most common queries where the shard-key resides (example: when you frequently query by user_name, your shard-key should include the field user_name so each query can be delegated to only the one shard which has that document).

The drawback is that the fault-tolerance suffers. When one shard of the cluster goes down, any data on it is inaccessible. For that reason each member of the cluster should also be a replica-set. This is not required. When you don't care about high-availability, a shard can also be a single mongod instance without replication. But for production-use you should always use replication.

So what does that mean for your example?

                            Sharded Cluster             
             /                    |                    \
      Shard A                  Shard B                  Shard C
        / \                      / \                      / \
+-------+ +---------+    +-------+ +---------+    +-------+ +---------+
|Primary| |Secondary|    |Primary| |Secondary|    |Primary| |Secondary|
|  25GB |=| 25GB    |    | 25 GB |=| 25 GB   |    | 25GB  |=| 25GB    |   
+-------+ +---------+    +-------+ +---------+    +-------+ +---------+

When you want to split your data of 75GB into 3 shards of 25GB each, you need at least 6 database servers organized in three replica-sets. Each replica-set consists of two servers who have the same 25GB of data.

You also need servers for the arbiters of the three replica-sets as well as the mongos router and the config server for the cluster. The arbiters are very lightweight and are only needed when a replica-set member goes down, so they can usually share the same hardware with something else. But Mongos router and config-server should be redundant and on their own servers.

Philipp
  • 2,015
  • 1
  • 12
  • 7
28
  • Sharding partitions the data-set into discrete parts.
  • Replication duplicates the data-set.

These two things can stack since they're different. Using both means you will shard your data-set across multiple groups of replicas. Put another way, you Replicate shards; a data-set with no shards is a single 'shard'.

A Mongo cluster with three shards and 3 replicas would have 9 nodes.

  • 3 sets of 3-node replicas.
  • Each replica-set holds a single shard.
sysadmin1138
  • 825
  • 5
  • 10
12

By sharding, you split your collection into several parts.
Replicating your database means you make mirrors of your data-set.

haper
  • 121
  • 2
7

In terms of functionality delivered. Sharding provides scalability and parallelism. Replication provides availability

Ashish Kumar
  • 71
  • 1
  • 1