Channel: VMware Communities : Discussion List - ESXi

VM utilization causes high CPU in a different VM


Hello VMware community,

 

Please note, I have placed all of the config information at the end of this post so I can jump right into the issue discussion.

 

Basically, the situation is that at certain times, when one VM (VM-bad) becomes highly active, the CPU utilization on another VM (VM-good) climbs despite not having any additional load.  In extreme situations, the CPU load on VM-good gets so high as to incur a DoS event and traffic through VM-good ceases.  VM-good does eventually recover but only after VM-bad has calmed down.  The issue is intermittent in nature and only seems to occur when VM-bad is booting or at other select periods of high-activity.

 

After much troubleshooting and stat collection (using esxtop --> CSV files; parsed w/ perfmon), it appears that disk access/utilization is the root cause.  I say this because the stats paint a picture of very high IOPS (peak > 150) and long read/write times (peak > 2500ms) for both VMs.  The core apps running on VM-bad require near-constant disk access because they perform full packet capture as well as pretty constant updates to several MySQL databases.  Now I'm sure those reading this are thinking "Well DUH - yeah, that's your issue" and I'm pretty sure I don't disagree, but I do have some questions re: how ESXi handles shared resources and where my expectations are inaccurate.
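For anyone wanting to repeat the stat collection: esxtop batch mode (e.g. `esxtop -b -d 5 -n 720 > stats.csv`) emits a perfmon-compatible CSV, and the peaks can be pulled out with a short script instead of perfmon. A rough Python sketch, assuming the standard batch-mode layout (first row = quoted counter names, remaining rows = samples); the header substring you search for, e.g. "MilliSec/Read", depends on which counters you exported, so treat it as a placeholder:

```python
import csv
import io

def peak_by_metric(csv_text, header_substring):
    """Scan an esxtop batch-mode (perfmon-style) CSV and return the peak
    value seen in every column whose header contains header_substring."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    peaks = {}
    for col, name in enumerate(header):
        if header_substring in name:
            # Skip blank cells; esxtop leaves gaps for objects that vanish.
            values = [float(row[col]) for row in samples
                      if col < len(row) and row[col].strip()]
            if values:
                peaks[name] = max(values)
    return peaks
```

Pointing this at the collected CSV with substrings like "MilliSec/Read" or "Commands/sec" gives the per-device peaks without ever opening perfmon.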

 

Additionally, I think I was able to validate this as a disk access issue because, after migrating the packet capture destination and MySQL databases to a second mechanical HDD, VM-good's CPU utilization and DoS issues seem to have pretty much gone away.  Even when VM-bad's disk latency times run high (e.g. 600+ ms), CPU utilization on VM-good remains nearly constant and packets flow unimpeded.  I have not captured a new set of esxtop stats post-migration but I may if VM-good starts acting up again.

 

As an aside, prior to the above-mentioned migration, I did tinker with the various tuning options such as IOPS limits, CPU allocation (both p- and v-CPUs), CPU cycles, etc.  None of those seemed to help.  However, considered in the context of the suspected root cause (disk latency), that probably makes sense.  My belief is that most disk latency issues - and especially those in ESXi - are not really solved by tuning so much as by adding more spindles, SSDs, or RAID (e.g. RAID 10 or 0).
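For reference, the per-disk IOPS cap I experimented with can also be expressed directly in the VM's .vmx file rather than through the client UI. A minimal sketch, assuming the cap targets the first disk on the first SCSI controller; the value is illustrative only, and the exact option name should be verified against your ESXi build before use:

```
# .vmx fragment - cap scsi0:0 at ~300 IOPS (illustrative value)
sched.scsi0:0.throughputCap = "300"
```

As noted above, though, caps and reservations only ration a saturated spindle; they don't add capacity.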

 

So, with all of the above as backstory, what I'm really looking for here is some clarification and validation\correction about my findings & expectations:

 

  1. Overall, does it make sense for high-activity in one VM to directly cause a measurable CPU utilization increase in a different VM?  I think it does if I'm correct about disk latency being the root cause but need some validation and\or elaboration here.
  2. If #1 does make sense, I expected at least some improvement from tuning the VMs via IOPS reservations and/or limits, but none of those seemed to have any real effect at all...though maybe this is expected if the root cause is disk latency.
  3. Also, despite VM-bad's high resource demands, I expected ESXi to do a better job balancing the load & resources especially given how much overhead this system has (please see specs below for more details).  But based on my observations & testing, it appears my expectation was inaccurate so please fill me in here.
  4. My theory on why there is a directly measurable CPU utilization increase on VM-good is that, due to increased disk latency, VM-good's CPU is getting bogged down having to manage its own set of resource issues such as buffers filling, etc.  Does this make sense?
  5. But...if #4 does make sense, then why doesn't VM-bad also have issues?  That's part of what makes this kinda strange - the VM that is actually bogged down still gets its own job done...packets are not dropped, packet capture is flawless, and the DBs all have the expected set of data and do not get corrupted.
  6. Finally, if the root cause is disk latency, is there any other tuning that can be done in order to not require the disk migration I performed?  I am pretty sure the answer is no and that more spindles, faster drives, RAID, etc. are the only true remedies to disk latency issues.
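On #1 and #4, a toy single-queue model illustrates why one shared spindle punishes the quiet VM: once the combined IOPS from both VMs approaches the disk's service capacity, response time blows up for every requester, not just the busy one. A rough back-of-the-envelope sketch (M/M/1 queue; the 8ms service time is an assumed figure for a 7200rpm SATA disk, not a measurement from this host):

```python
def mm1_response_ms(service_ms, offered_iops):
    """Mean response time of a single disk modeled as an M/M/1 queue.

    service_ms:   average time the disk needs per I/O with no queueing
    offered_iops: combined I/O rate from every VM sharing the disk
    """
    capacity_iops = 1000.0 / service_ms   # max IOPS the disk can absorb
    rho = offered_iops / capacity_iops    # utilization of the spindle
    if rho >= 1.0:
        return float("inf")               # queue grows without bound
    return service_ms / (1.0 - rho)

# With an ~8ms service time the disk tops out near 125 IOPS, so the
# observed 150+ IOPS peaks sit past saturation - queueing delay is
# unbounded there, consistent with the 2500ms+ times seen in esxtop.
```

This is only illustrative (ESXi's I/O scheduler and the drive's own queue are more complex), but it matches the observation that CPU-side tuning did nothing: the bottleneck is the single spindle, not CPU arbitration, and the second HDD fixed things by splitting the offered load across two queues.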

 

Please comment where appropriate and thanks in advance for taking the time to read and respond.

 

Platform

  • Dell PowerEdge R710.
  • 1x 1TB WD Red 3.5" e-SATA mechanical HDD - NO RAID.
  • 32GB RAM.
  • 8x GE nics.
  • 2x Xeon X5550 4-core CPUs - 16 logical CPUs total (due to HT).
  • Hyper-threading & other virtualization settings are enabled in system BIOS.
  • ESXi v6.5.
  • NO OVER-SUBSCRIPTION is done on this host.

 

ESXi base config

  • 2x VMs - both running *nix - VM-bad = Ubuntu 14.04; VM-good = FreeBSD variant.

 

VM-bad config

  • 16GB RAM, 8GB swap; 500GB disk space; 4 vCPUs across 2 cores; 2x GE NICs.
  • CPU utilization varies widely from 2-99%; the latter occurs infrequently during periods of high-activity such as boot, DB cleanup, PCAP trimming, etc.
  • RAM in use hovers between 13-16GB w/ very little swap being used (< 300MB).
  • All resource allocations meet or exceed those specified by the devs.
  • All resource tuning settings are at default (no limitations OR reservations).
  • open-vm-tools & daemon v9.4.0.25793 (build-1280544) installed; daemon is running.

 

VM-good config

  • 4GB RAM; 0 swap; 150GB disk space; 2 vCPUs across 2 cores; 3x GE NICs.
  • CPU utilization prior to migration varied widely from 2-99% in lock step with VM-bad's disk latency times; post-migration utilization varies between 2-35% and appears independent of VM-bad's behavior.
  • RAM in use hovers between 1-2GB; no swap.
  • All resource allocations meet or exceed those specified by the devs.
  • All resource tuning settings are at default (no limitations OR reservations).
  • VMware Tools daemon v10.0.5.52125 (build-3227872) installed & running.
