I have a replica set containing only two members:
test:PRIMARY> rs.config()
{
"_id" : "test",
"version" : 8,
"protocolVersion" : NumberLong(1),
"members" : [
{
"_id" : 0,
"host" : "localhost:50000",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 2,
"host" : "localhost:50002",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : true,
"priority" : 0,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
}
],
"settings" : {
"chainingAllowed" : true,
"heartbeatIntervalMillis" : 2000,
"heartbeatTimeoutSecs" : 10,
"electionTimeoutMillis" : 10000,
"catchUpTimeoutMillis" : 2000,
"catchUpTakeoverDelayMillis" : 30000,
"getLastErrorModes" : {
},
"getLastErrorDefaults" : {
"w" : 1,
"wtimeout" : 0
},
"replicaSetId" : ObjectId("591de3c57f979533a0e4defa")
}
}
test:PRIMARY>
Maybe you'll get triggered by the second member being on localhost:50002, that is because that port is a ssh tunnel to another machine, it is irrelevant in this context (see MongoDB replica set over ssh tunnels)
The issue is when localhost:50002 gets rebooted, sometimes the primary on localhost:50000, the "main machine" that is supposed to be the primary, steps down to secondary. Update: Turns out it does this always after the secondary has been down for some time (seconds to minutes).
This should not happen, since localhost:50002 is hidden:true with a priority:0, and localhost:50000 is still the only one which could get elected (should an election even be taking place).
The clients cease to be able to read/write to the database (because it's a secondary).
Both servers are v3.6.7.
The log shows the following entries (grepped to "repl")
2018-09-13T23:34:15.886+0200 I REPL [replexec-149] Member localhost:50002 is now in state RS_DOWN
2018-09-13T23:34:15.886+0200 I REPL [replexec-149] can't see a majority of the set, relinquishing primary
2018-09-13T23:34:15.886+0200 I REPL [replexec-149] Stepping down from primary in response to heartbeat
2018-09-13T23:34:15.886+0200 I REPL [replexec-149] transition to SECONDARY from PRIMARY
2018-09-13T23:34:15.886+0200 I NETWORK [replexec-149] Skip closing connection for connection # 12889
2018-09-13T23:34:15.886+0200 I NETWORK [replexec-149] Skip closing connection for connection # 12858
2018-09-13T23:34:15.886+0200 I NETWORK [replexec-149] Skip closing connection for connection # 12857
2018-09-13T23:34:15.887+0200 I NETWORK [replexec-149] Skip closing connection for connection # 11417
2018-09-13T23:34:16.779+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Ending connection to host localhost:50002 due to bad connection status; 0 connections to that host remain open
2018-09-13T23:34:16.780+0200 I REPL_HB [replexec-150] Error in heartbeat (requestId: 1509228) to localhost:50002, response status: NetworkInterfaceExceededTimeLimit: Operation timed out
2018-09-13T23:34:17.280+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to localhost:50002
2018-09-13T23:34:26.503+0200 I REPL [replexec-150] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)
2018-09-13T23:34:27.280+0200 I REPL_HB [replexec-149] Error in heartbeat (requestId: 1509230) to localhost:50002, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2018-09-13T23:34:36.771+0200 I REPL [replexec-149] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)
2018-09-13T23:34:37.280+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:50002 - NetworkInterfaceExceededTimeLimit: Operation timed out
2018-09-13T23:34:37.280+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to localhost:50002
2018-09-13T23:34:37.780+0200 I REPL_HB [replexec-150] Error in heartbeat (requestId: 1509232) to localhost:50002, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2018-09-13T23:34:47.307+0200 I REPL [replexec-150] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)
2018-09-13T23:34:48.280+0200 I REPL_HB [replexec-150] Error in heartbeat (requestId: 1509234) to localhost:50002, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2018-09-13T23:34:52.353+0200 I NETWORK [conn12897] received client metadata from 127.0.0.1:36478 conn12897: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.6.7" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "16.04" } }
2018-09-13T23:34:57.280+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:50002 - NetworkInterfaceExceededTimeLimit: Operation timed out
2018-09-13T23:34:57.280+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to localhost:50002
2018-09-13T23:34:58.012+0200 I REPL [replexec-150] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)
2018-09-13T23:34:58.780+0200 I REPL_HB [replexec-150] Error in heartbeat (requestId: 1509235) to localhost:50002, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2018-09-13T23:35:06.309+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:50002 - HostUnreachable: Connection reset by peer
2018-09-13T23:35:06.309+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:50002 due to failed operation on a connection
2018-09-13T23:35:06.309+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to close stream: Transport endpoint is not connected
2018-09-13T23:35:06.309+0200 I REPL_HB [replexec-150] Error in heartbeat (requestId: 1509237) to localhost:50002, response status: HostUnreachable: Connection reset by peer
...
2018-09-13T23:35:09.322+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to localhost:50002
2018-09-13T23:35:09.322+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:50002 - HostUnreachable: Connection refused
2018-09-13T23:35:09.322+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:50002 due to failed operation on a connection
2018-09-13T23:35:09.322+0200 I REPL_HB [replexec-151] Error in heartbeat (requestId: 1509276) to localhost:50002, response status: HostUnreachable: Connection refused
2018-09-13T23:35:09.355+0200 I REPL [replexec-150] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)
2018-09-13T23:35:09.823+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to localhost:50002
2018-09-13T23:35:09.823+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:50002 - HostUnreachable: Connection refused
2018-09-13T23:35:09.823+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:50002 due to failed operation on a connection
2018-09-13T23:35:09.823+0200 I REPL_HB [replexec-150] Error in heartbeat (requestId: 1509278) to localhost:50002, response status: HostUnreachable: Connection refused
What could be causing this?