Uplink teaming/failover on dvSwitch when upstream switch stops passing traffic but stays up


We recently had a failure incident that caused significant disruption in our environment when a core (physical) network switch failed. Essentially, the failure mode was that the core switch remained live (port carriers stayed up) but stopped passing traffic and providing L3 gateway functionality[*]. The access switch between the core and the ESXi hosts did not shut down its downlinks to the hosts because it did not detect the failure on its uplink. This has led me to analyse similar failure modes in every single component, e.g. what happens if the access switch stays up but stops passing traffic? (I may be able to change the access switch configuration to shut down its downlinks based on an assessment of link quality to the core switch, but what happens if the access switch itself stops responding?)

 

I cannot see any straightforward method (short of rolling my own link quality assessment scripts, which would be unsupported) for an ESXi host to detect high/complete packet loss on an uplink. The dvSwitches concerned only have two uplinks, so beacon probing is not useful (beacon probing needs at least three uplinks to work out which link has actually failed). If I were able to specify an IP target for each uplink which ESXi would ping (via that uplink) to assess uplink packet loss, that would resolve the issue - but it looks like there is no such option (interestingly, storage connectivity does seem to take a more proactive approach to path management).
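
For what it's worth, the kind of unsupported script I'm thinking of would look something like the sketch below. Everything in it is illustrative: it assumes two test VMkernel ports (vmk10 and vmk11, hypothetical names) have each been pinned to one uplink via a port-level teaming override, and that 10.0.0.1 is just an example target reachable through either uplink. It shells out to vmkping to measure per-uplink loss and, above a threshold, takes the corresponding vmnic down with esxcli so the teaming policy fails the VMs over to the surviving uplink.

```python
#!/usr/bin/env python
# Rough, unsupported sketch of a per-uplink link-quality check, run on the ESXi host
# (which ships a Python interpreter). All names/addresses below are illustrative:
# vmk10/vmk11 are assumed to be test VMkernel ports pinned to vmnic0/vmnic1 via
# teaming overrides, and 10.0.0.1 is a placeholder ping target.
import re
import subprocess

PING_TARGET = "10.0.0.1"   # hypothetical target, e.g. an SVI on the core switch
LOSS_THRESHOLD = 50        # percent packet loss at which we give up on an uplink
UPLINKS = {"vmnic0": "vmk10", "vmnic1": "vmk11"}

def packet_loss(vmk):
    """Ping the target out of one VMkernel port and return the reported % packet loss."""
    proc = subprocess.Popen(["vmkping", "-I", vmk, "-c", "5", PING_TARGET],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out = proc.communicate()[0].decode("utf-8", "replace")
    match = re.search(r"(\d+)% packet loss", out)
    return int(match.group(1)) if match else 100

for vmnic, vmk in UPLINKS.items():
    loss = packet_loss(vmk)
    print("%s (tested via %s): %d%% loss" % (vmnic, vmk, loss))
    if loss >= LOSS_THRESHOLD:
        # Forcing the link down makes the dvSwitch teaming policy fail the affected
        # virtual ports over to the surviving uplink.
        subprocess.call(["esxcli", "network", "nic", "down", "-n", vmnic])
```

Of course a script like this only reacts after the event, and if the ping target itself dies it could take both uplinks down - which is exactly why I'd rather this existed as a supported per-uplink probe in the teaming policy than as something I have to maintain myself.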

 

Does anyone have any suggestions - is there a feature I'm not seeing, or is this actually a scenario not accounted for in ESXi network uplink teaming?

 

I can provide switch/host hardware models etc. if it makes any difference, but I think this is a generic question really.

 

Thanks

 

John

 

[*] I use "route based on originating virtual port", which I suspect means that roughly half of the VMs were cut off when the fault occurred, and it was several hours before manual intervention could resolve it (we might have managed a faster support response, but automation would be even better). This led to significant business disruption.
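
To illustrate why that policy strands so many VMs: with "route based on originating virtual port" the uplink is chosen from each virtual port's ID rather than from any assessment of the path, so with two active uplinks roughly half the ports sit behind whichever one fails silently. A toy illustration of the idea (not ESXi's actual selection code, which I haven't seen):

```python
# Toy illustration of port-ID based uplink selection (not ESXi's actual algorithm):
# each virtual port maps to an active uplink purely by its ID, so an uplink that
# stays link-up but drops traffic strands roughly half the ports indefinitely.
active_uplinks = ["vmnic0", "vmnic1"]
for port_id in range(8):
    print("virtual port %d -> %s" % (port_id, active_uplinks[port_id % len(active_uplinks)]))
```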

