Imagine the following situation :-) ...
We have iSCSI Storage Connections in our ESXi V6.5U1 environment as follows:
Fabric1, iSCSI1: vswitch1 (mtu1500) -> VMKernel Port1 on vmnic1 (mtu1500) -> physical iSCSI Switch1 (Support mtu9000 enabled) -> storage nic1 (mtu1500)
Fabric2, iSCSI2: vswitch2 (mtu1500) -> VMKernel Port2 on vmnic2 (mtu1500) -> physical iSCSI Switch2 (Support mtu9000 enabled) -> storage nic2 (mtu1500)
Both adapters are configured for iSCSI (software initiator), and ESXi load-balances across the two vSwitches (PSP = Nimble NCM plugin, which behaves roughly like round robin...). Everything worked fine in this configuration, even when we restarted a physical switch for maintenance.
Now we shut down iSCSI Switch1 in Fabric1 -> everything still worked fine, because all storage traffic went via iSCSI2. While iSCSI Switch1 was shut down, we reconfigured all iSCSI1 interfaces (VMKernel Port1 on vmnic1 and storage nic1) to MTU=9000. We did not change the MTU of vswitch1 (yes, this is a mistake... so don't do it, just imagine...).
So we had the following conditions:
iSCSI1: vswitch1 (mtu1500) -> VMKernel Port1 on vmnic1 (mtu9000) -> physical iSCSI Switch1 (Support mtu9000 enabled) -> storage nic1 (mtu9000) ->poweredOff
iSCSI2: vswitch2 (mtu1500) -> VMKernel Port2 on vmnic2 (mtu1500) -> physical iSCSI Switch2 (Support mtu9000 enabled) -> storage nic2 (mtu1500) ->poweredOn
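To make the mismatch explicit, here is a minimal sketch of the two paths as listed above. It is only a simplified model and rests on an assumption: the two endpoints (VMKernel port and storage NIC) send frames up to their own MTU, and any hop with a smaller MTU cannot pass them. The class and function names are made up for illustration and are not ESXi APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hop:
    name: str
    mtu: int
    endpoint: bool = False  # True for devices that originate/terminate traffic


# MTU values copied from the two lines above (state before Switch1 was powered on).
FABRIC1: List[Hop] = [
    Hop("vswitch1", 1500),
    Hop("VMKernel Port1 / vmnic1", 9000, endpoint=True),
    Hop("iSCSI Switch1", 9000),
    Hop("storage nic1", 9000, endpoint=True),
]
FABRIC2: List[Hop] = [
    Hop("vswitch2", 1500),
    Hop("VMKernel Port2 / vmnic2", 1500, endpoint=True),
    Hop("iSCSI Switch2", 9000),
    Hop("storage nic2", 1500, endpoint=True),
]


def largest_frame_sent(path: List[Hop]) -> int:
    """Largest frame both endpoints are willing to use (assumed: min of their MTUs)."""
    return min(h.mtu for h in path if h.endpoint)


def effective_mtu(path: List[Hop]) -> int:
    """Largest frame that fits through every hop on the path."""
    return min(h.mtu for h in path)


def path_is_consistent(path: List[Hop]) -> bool:
    """True if the endpoints never send a frame that some hop cannot carry."""
    return largest_frame_sent(path) <= effective_mtu(path)


if __name__ == "__main__":
    for label, path in (("Fabric1", FABRIC1), ("Fabric2", FABRIC2)):
        print(label, largest_frame_sent(path), effective_mtu(path), path_is_consistent(path))
```

In this model Fabric1 is inconsistent (the endpoints agree on 9000, but vswitch1 only carries 1500), while Fabric2 is consistent. Small frames would still fit through Fabric1, which is why I suspect the path can look "up" while large frames go missing.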
After powering iSCSI Switch1 in Fabric1 back on, all iSCSI1 interfaces came up again and we had a complete production outage... ESXi at 100% CPU, unavailable in vCenter, VMs online but "frozen", etc...
I'm not sure whether HA events caused a "problem chain". I do not think HA is the root cause... it seems more like a reaction, maybe to an APD (All-Paths-Down) condition.
But how can this happen? What does ESXi do when a load-balanced iSCSI software initiator sees two fabrics, one of them misconfigured as described above? What happens on Fabric1 if vSwitch1 is configured with MTU=1500 and all interfaces with MTU=9000? Will this result in not seeing the storage at all, even though Fabric2 is still configured correctly? Is ESXi fine with balancing storage paths over a misconfigured network, or can this generate an All-Paths-Down (APD) condition? What exactly is ESXi doing in this condition?
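For anyone who wants to check a path like this end to end, a minimal sketch of the arithmetic behind a don't-fragment ping test: the ICMP payload has to be the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. The helper names are hypothetical, and the interface name vmk1 and the address 192.0.2.10 are placeholders, not our real values. The script only builds the vmkping command line (-I interface, -d don't fragment, -s payload size); it does not run it.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits into a single frame of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_command(vmk: str, target_ip: str, mtu: int) -> str:
    """Build (not execute) a vmkping call that tests the path with DF set."""
    return f"vmkping -I {vmk} -d -s {icmp_payload_for_mtu(mtu)} {target_ip}"


if __name__ == "__main__":
    # Placeholder interface and address, for illustration only.
    print(vmkping_command("vmk1", "192.0.2.10", 9000))
    print(vmkping_command("vmk1", "192.0.2.10", 1500))
```

If the jumbo-sized ping fails while the 1500-sized one succeeds, some hop on that path is not passing 9000-byte frames.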
Any input is much appreciated!