We recently experienced a disruptive incident in a MongoDB 3.2 cluster related to index creation. Scenario:
- Several indexes were created on the primary using background: true.
- After each build finished on the primary, MongoDB automatically replicated the index build to the secondary, also in the background.
- This eventually led to several background index operations running in parallel on the secondary. So far, everything behaved more or less as expected.
- When 4 indexer threads ended up running in parallel, all read operations on the secondary appeared to block (edit: to clarify, this even included read operations against databases on which no indexes were being built, which according to the docs shouldn't happen even for foreground builds). We perform some reads with the secondaryPreferred option, and these reads ended up hanging completely (see below).
- We immediately aborted further index creation as well as the background indexer thread still running on the primary. However, read operations on the secondary only resumed after all background index threads had finished. All in all, we were completely unable to read from the secondary for 20 minutes.
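For reference, the index builds above were issued roughly like this. This is a hedged sketch using the 3.x Java driver; the host, database, collection, and field names are placeholders, not our actual schema:

```java
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class BackgroundIndexBuild {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("primary-host");
        try {
            MongoCollection<Document> coll =
                    client.getDatabase("mydb").getCollection("mycoll");

            // background: true lets the primary keep serving reads and writes
            // during the build; on 3.2, the secondary replays the build after
            // the primary finishes, also in the background.
            coll.createIndex(Indexes.ascending("someField"),
                    new IndexOptions().background(true));
        } finally {
            client.close();
        }
    }
}
```

Several such builds were started in succession, which is how multiple indexer threads came to run concurrently on the secondary.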
What could explain this, specifically the fact that read operations on the secondary eventually blocked completely? That index creation can significantly impact performance is well known, and we had taken this into account.
However, nothing in the documentation that we've been able to find explains the secondary hanging completely until index creation is finished. We were also unable to find anything suspicious in the logs on either the primary or the secondary: log entries simply showed the creation and progress of the individual index builds, but nothing that would explain the blocking of read operations.
The situation caused some major issues for us, as we perform certain reads with the ReadPreference.secondaryPreferred option of the Java driver. Rather than falling back to reading from the primary, these operations hung with no timeout while waiting for the secondary, and consequently quickly built up a large queue of in-flight (but hanging) requests on our application servers.
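Part of the problem, as far as we understand it, is that secondaryPreferred only falls back to the primary when no secondary is selectable at all; a secondary that is alive but blocked is still chosen. As a mitigation we are considering bounding these reads with a server-side time limit. A hedged sketch with the 3.x Java driver (host, database, and collection names are again placeholders):

```java
import com.mongodb.MongoClient;
import com.mongodb.ReadPreference;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.concurrent.TimeUnit;

public class BoundedSecondaryRead {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("replica-set-host");
        try {
            MongoCollection<Document> coll = client
                    .getDatabase("mydb")
                    .getCollection("mycoll")
                    .withReadPreference(ReadPreference.secondaryPreferred());

            // maxTime maps to the server-side maxTimeMS limit: a query that
            // exceeds it is aborted by the server and the driver throws,
            // instead of the application thread hanging indefinitely.
            Document first = coll.find()
                    .maxTime(5, TimeUnit.SECONDS)
                    .first();
            System.out.println(first);
        } finally {
            client.close();
        }
    }
}
```

Whether maxTimeMS would actually have fired while the secondary's reads were blocked in our incident is something we haven't verified; a driver-level socket timeout may be needed as an additional safety net.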
If anyone could shine some light on this situation it would be greatly appreciated.
Technical details:
- MongoDB version: 3.2.16 (not the very newest, but nothing in the .17/.18 changelogs sounds like it might affect this)
- Engine: WiredTiger
- Resources: Primary and secondary server VMs, each with 6 CPUs and 128 GB of memory, connected by a 1 Gbit link.
- Dataset: approx. 2 TB total of which ~100 GB is particularly active.