Last week I patched our (7) lab ESXi 6.5 servers to the current patch level (Build 5146846, from March 9, 2017). The (5) Dell servers have been running flawlessly on this new patch level. However, the (2) Cisco UCS B200 M4 servers have been crashing constantly with a Purple Screen of Death (PSOD).
I verified that the Cisco UCS servers are running the current firmware (BIOS & VIC) that Cisco supports with ESXi 6.5. I also used Update Manager to apply the current Cisco Image Profile ("Vmware-ESXi-6.5a.0-4887370-Custom-Cisco-6.5.0.2-Bundle.zip" from 2017-03-14) to get the current drivers, but that did not help; the PSODs continued.
The PSODs keep occurring on both ESXi hosts in the 2-node Cisco cluster, causing a complete cluster outage. First, one ESXi host PSODs. Then, a little while later, the second ESXi host PSODs. Sometimes the PSOD takes 5 hours to occur; sometimes it takes only 25 minutes. (Another PSOD occurred while I was typing this post, about 25 minutes after a reboot of the blade.) I've captured dozens of PSOD screenshots, and every single one contains the following line:
NOT_IMPLEMENTED bora/vmkernel/sched/cpusched.c:9581
On each PSOD, a different VM name is listed two lines below the above error code, so there is no consistency as to which VM triggers the panic.
The interesting thing is that the (5) Dell servers are humming along without issue on this patch level. The Cisco servers use Intel(R) Xeon(R) CPU E5-2670 v3 CPUs, while the Dell servers use earlier-generation CPUs (Sandy Bridge or Westmere). Both the Dell servers and the Cisco servers are using vSphere Replication 6.5. The Cisco servers are running a heavier load of VMs, many of which have multiple vCPUs, so perhaps another vCPU scheduling bug is being triggered?
We have vRealize Log Insight running, and the following are some of the last messages it received from the ESXi host before the PSOD:
[Originator@6876 sub=VpxaHalCnxHostagent opID=WFU-491dcb0b] Applying updates from 215636 to 215637 (at 215636)
[Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest, 45. Sent notification immediately.
The vRealize Log Insight VM is running in the Cisco cluster, so I might be missing some of the most important log entries right at the time of the PSOD.
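For anyone who wants to compare the final entries across several PSODs, here is a minimal sketch that splits lines like the two above into their bracketed fields and counts them per `sub=` component. It assumes the entries have been exported from Log Insight as plain text, one per line, in exactly the format shown; the function names `parse_entry` and `count_by_sub` are my own, not part of any VMware tool.

```python
import re
from collections import Counter

# Matches the bracketed prefix seen in the entries above, e.g.
# "[Originator@6876 sub=VpxaHalCnxHostagent opID=WFU-491dcb0b] message text"
ENTRY_RE = re.compile(
    r"\[Originator@(?P<originator>\d+)\s+(?P<fields>[^\]]*)\]\s*(?P<message>.*)"
)

def parse_entry(line):
    """Return a dict of the bracketed fields plus the message, or None if no match."""
    m = ENTRY_RE.search(line)
    if m is None:
        return None
    # Split "key=value" pairs inside the brackets (assumes no spaces in values).
    fields = dict(
        pair.split("=", 1) for pair in m.group("fields").split() if "=" in pair
    )
    return {
        "originator": m.group("originator"),
        "sub": fields.get("sub"),
        "opID": fields.get("opID"),
        "message": m.group("message"),
    }

def count_by_sub(lines):
    """Count how many parsed entries came from each 'sub=' component."""
    parsed = (parse_entry(line) for line in lines)
    return Counter(entry["sub"] for entry in parsed if entry is not None)
```

Feeding it the last few dozen lines captured before each crash would show whether the same component is always the last to log, though with the Log Insight VM living in the affected cluster the final entries may simply be missing.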
Is anyone else running into this?? (Note that this is a "lab" environment, so it's not production-impacting. This is exactly why we have a "lab" environment!!!)
Below is one example screenshot of the PSOD. They all look pretty much like this: