I am building a 6-node cluster spread across 2 racks (this will eventually be configured as a stretched cluster with a separate witness). Everything is ESXi 8.0U1. Each rack is in a different L3 subnet. I set up rack 1 and configured a vSAN cluster with 3 nodes, deployed a few test VMs, and everything is working fine. When I started adding nodes from rack 2, I ended up with the following 3 issues:
- vSAN Cluster Partition
- vMotion: Basic (unicast) connectivity check
- vMotion: MTU check
I'm parking the vSAN Cluster Partition error for now, as the vMotion one should be simply solvable. If I SSH to the nodes in Rack 1, I can vmkping all vMotion interfaces (using -S vMotion) in Rack 1 but cannot ping the vMotion interfaces of the nodes in Rack 2. It fails with the error:
sendto() failed (Network is unreachable)
I have confirmed that the default gateway is configured for the vMotion interface and is correct. Furthermore, if I add a new VMK Adapter to each node in both racks using the same vMotion Distributed Port Group with IPs in the same subnet and configured with the same gateway (but using the default IP stack!) I can ping between nodes in both racks.
It appears as if there's some issue with the routing logic of the vMotion interface, but other than configuring the default gateway I'm not sure what else there is to configure here.
I'm focusing on this vMotion error as I'm wondering if it is the same root problem as the vSAN partition.
Can anyone point me in any debugging directions?
For clarity:
Rack 1:
- Management Subnet: 10.73.8.0/25 (GW: 10.73.8.126)
- vMotion Subnet: 10.73.10.0/25 (GW: 10.73.10.126)
- vSAN Subnet: 10.73.11.0/25 (GW: 10.73.11.126)
Rack 2:
- Management Subnet: 10.73.8.128/25 (GW: 10.73.8.254)
- vMotion Subnet: 10.73.10.128/25 (GW: 10.73.10.254)
- vSAN Subnet: 10.73.11.128/25 (GW: 10.73.11.254)
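As a quick sanity check on the addressing above, the following sketch uses Python's standard ipaddress module to confirm that the two vMotion subnets do not overlap and that each gateway sits inside its own /25. The subnets and gateways are copied from the list; the host address 10.73.10.130 is a made-up Rack 2 example, and the function names are purely illustrative.

```python
import ipaddress

# Subnets and gateways are taken from the post.
VMOTION = {
    "rack1": {"subnet": "10.73.10.0/25", "gateway": "10.73.10.126"},
    "rack2": {"subnet": "10.73.10.128/25", "gateway": "10.73.10.254"},
}


def subnets_overlap(a: str, b: str) -> bool:
    """True if the two networks share any addresses."""
    return ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b))


def gateway_is_on_link(subnet: str, gateway: str) -> bool:
    """True if the gateway address lies inside the given subnet."""
    return ipaddress.ip_address(gateway) in ipaddress.ip_network(subnet)


def needs_gateway(src_subnet: str, dest_ip: str) -> bool:
    """True if dest_ip is outside src_subnet, so traffic must be routed."""
    return ipaddress.ip_address(dest_ip) not in ipaddress.ip_network(src_subnet)


if __name__ == "__main__":
    r1, r2 = VMOTION["rack1"], VMOTION["rack2"]
    print("subnets overlap:", subnets_overlap(r1["subnet"], r2["subnet"]))
    for name, cfg in VMOTION.items():
        print(name, "gateway on-link:", gateway_is_on_link(cfg["subnet"], cfg["gateway"]))
    # 10.73.10.130 is a hypothetical Rack 2 vMotion address.
    print("rack1 -> 10.73.10.130 needs a gateway:",
          needs_gateway(r1["subnet"], "10.73.10.130"))
```

Since the subnets are disjoint, any vMotion traffic between the racks has to leave via a gateway, so the stack carrying that traffic needs a usable route off its own /25.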
Answer: I finally figured it out, for anyone who stumbles upon this. The gateway had to be set on the vMotion TCP/IP stack in addition to being set on the VMK adapter itself...
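One way to picture why this fix helps: each TCP/IP stack keeps its own routing table, so a stack that only knows its directly connected subnet has nowhere to send off-subnet traffic. The sketch below is a toy model of that lookup, not ESXi's actual implementation. The NetStack class is invented for illustration, and 10.73.10.130 and 10.73.10.5 are hypothetical host addresses.

```python
import ipaddress


class NetStack:
    """Toy model of one TCP/IP stack with its own routing table."""

    def __init__(self, name: str):
        self.name = name
        self.routes = []  # (network, gateway); gateway None means on-link

    def add_route(self, network: str, gateway: str = None) -> None:
        gw = ipaddress.ip_address(gateway) if gateway else None
        self.routes.append((ipaddress.ip_network(network), gw))

    def lookup(self, dest: str) -> str:
        """Return the next hop for dest using longest-prefix match."""
        addr = ipaddress.ip_address(dest)
        matches = [(net, gw) for net, gw in self.routes if addr in net]
        if not matches:
            # Mirrors the error text seen from vmkping in the question.
            raise OSError("Network is unreachable")
        _, gw = max(matches, key=lambda route: route[0].prefixlen)
        return "on-link" if gw is None else str(gw)


if __name__ == "__main__":
    vmotion = NetStack("vmotion")
    vmotion.add_route("10.73.10.0/25")  # connected route only (Rack 1)

    try:
        vmotion.lookup("10.73.10.130")  # hypothetical Rack 2 host
    except OSError as err:
        print("before:", err)

    # Assumed fix: give the stack itself a default route via the Rack 1 gateway.
    vmotion.add_route("0.0.0.0/0", "10.73.10.126")
    print("after:", vmotion.lookup("10.73.10.130"))
    print("same rack:", vmotion.lookup("10.73.10.5"))
```

On a real host, the stack's routes can be inspected with `esxcli network ip route ipv4 list -N vmotion`. Treat the model above only as an aid for reading that output.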