We have a problem with VSAN hosts dropping off our cluster, and I'm not getting anywhere with the regular troubleshooting methods. I've had a ticket open with GSS and the VSAN team since July, but we've made no progress on the cause.
We have a 4-host VSAN cluster that was perfect for the first year, until we upgraded VSAN from 6.1 to 6.2. In July, host #3 dropped off the cluster. The VMs kept running on it, but the clients briefly lost their connection to those VMs. The host showed as disconnected from the cluster. Attempts to reconnect it to the cluster failed with timeouts. Attempts to connect to it with a web browser failed, as did attempts with the original vSphere client.
I was able to connect to it via SSH, which shows the VMkernel IP was still reachable, yet attempts to restart the services hung. I looked at the vpxa logs from the host from around the failure time, as well as the vpxd logs from vCenter, but I could not find anything unusual.
All hosts are R730XDs and have the latest firmware/drivers from the VMware HCL.
In August, host #4 had the exact same issue, and 3 days ago host #2 had it as well.
The only way to get the cluster up and working again was to RDP into the guests, do a graceful shutdown of each, start them on a different host, then reboot the failed host, rejoin it to the cluster, and vMotion the guests back onto it.
Even the host itself cannot be shut down gracefully, as attempting to shut it down via SSH or the DCUI hangs as well.
The hosts are connected by Brocade VDX 10G switches, which show no errors on any ports, only discards on the receiving ends. There are two distributed switches set up, one for VSAN (dedicated ports) and one for VM traffic/vMotion traffic. The VMware management network is on a separate VLAN and separate physical switches, using standard vSwitches. The fact that the VSAN and management networks dropped off at the same time on only one host seems to point to the host.
I think it is a bug in the hosts' OS. The build number is 3825889.
Has anyone experienced this? Any suggestions for a troubleshooting route?
Thanks,
B