Channel: VMware Communities : Discussion List - ESXi

Need some help troubleshooting network problems - ESXi 5.5


We are having some very odd networking problems and, after working with the network team, we are running out of ideas.

The problem:
VMs on standard vSwitches are having trouble talking to other systems on the same VLAN, resulting in dropped packets and RPC errors, even when both systems are on the same VM host.

Here is a quick and dirty diagram of how the network is laid out: http://imgur.com/a/iJIFJ

ESXi NICs are hooked up to Nexus1 and Nexus2. When VMs attempt to communicate the path will often go to the wrong physical address, fail to communicate, and then update ARP tables and go to the other NIC.

The vSwitches are configured for load balancing with "Route based on IP hash", and because we are using standard vSwitches we do NOT have LACP enabled.
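
For anyone following along, here is a rough sketch of how "Route based on IP hash" is commonly described as picking an uplink: XOR the source and destination IP addresses, then take the result modulo the number of NICs in the team. This is only an illustration under that assumption, not the actual ESXi code; the function name ip_hash_uplink and the 10.0.0.x addresses are made up for the example.

```python
import ipaddress


def ip_hash_uplink(src_ip: str, dst_ip: str, num_uplinks: int) -> int:
    """Return the index of the uplink chosen for a src/dst IP pair.

    Assumption: uplink = (src XOR dst) mod number_of_uplinks, which is
    how the IP hash policy is usually explained. Hypothetical helper.
    """
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % num_uplinks


if __name__ == "__main__":
    vm = "10.0.0.10"  # made-up VM address
    # The same VM talks to four made-up peers over a 2-NIC team.
    for peer in ("10.0.0.20", "10.0.0.21", "10.0.0.22", "10.0.0.23"):
        uplink = ip_hash_uplink(vm, peer, 2)
        print(f"{vm} -> {peer} uses uplink {uplink}")
```

Under this scheme the same VM sends through different NICs depending on the destination, so its MAC address can show up on both upstream switches. That would at least be consistent with traffic landing on the "wrong" physical NIC as described above.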

We have noticed this problem only exists on the Dell cluster, which is hooked directly into the Nexus environment. The UCS cluster, which is plugged into FIs, does not share the problem. I suspect this is because the UCS FIs share ARP tables and the traffic never makes it back to the Nexus 5ks.

My suspicion is that there is a problem between the 2 Nexus 5k switches and they are not sharing ARP tables properly, but the network team is insisting that the problem lies with either the ESXi or Windows OS layer within the VMs. I'm not versed enough in low level network operations to argue this, but I'm at a loss for how to troubleshoot this further and get a definitive answer.

 

Some solutions we've tried:
1. Setting up a dvSwitch on a test box and enabling LACP. This saw no change.
2. Dropping 1 NIC on the vSwitch and forcing paths to go up Nexus1. This caused the problem to stop, but is unacceptable as a solution as it removes our path redundancy.

 

Some things network team wants us to try but we haven't done yet:
1. Manually changing the VMs' "reachable time" on the NICs to a lower value.
2. Changing out the VMXNET3 interfaces for E1000
3. Enabling LACP on the Standard vSwitches (This isn't supported)
4. Upgrading to ESXi 6 (we're not ready for this migration yet)

 

We're really pulling our hair out over this one...has anyone ever encountered these problems before?

