Hello all,
I encountered a strange issue on servers running the latest vSphere 6.7 U3 on AMD Rome CPUs.
I enabled the CCX as NUMA setting on those servers, following the performance recommendations in the AMD whitepaper and the VMware VROOM! blog articles.
After 30 days of running a very light workload, I started to see strange warnings on them, such as failures to create state.tgz or to write to the bootbank.
I found the relevant KB (VMware Knowledge Base) and at first it seemed that I could ignore these warnings.
Unfortunately, in my case the warnings are persistent, and since my boot devices are a pair of Intel S4610 SATA SSDs in RAID1, the suggested cause, storage overutilization, does not seem correct.
I put the affected hosts into maintenance mode (MM), opened an SR, and together with the assigned engineer dug through the logs.
The logs showed the following issues:
inability to set FC queues to the default level
failures to open files
failures to create files
heap memory allocation errors
The host is empty; it has 15 GB of RAM used out of 1024 GB available.
VMKMEM rsvd is around 33k
NUMA node memory size is 32k
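To put those two figures side by side, here is a rough arithmetic sketch. It assumes that both the "33k" and the "32k" values are in MB (the units are my assumption, they are not stated above), and the function name is just made up for illustration. It only shows how many NUMA nodes the numbers imply and that the reserved figure is about the size of a single node; it does not prove a cause.

```python
# Rough arithmetic on the figures quoted above.
# ASSUMPTION: "32k" and "33k" are values in MB; the post does not state units.

TOTAL_RAM_GB = 1024              # RAM installed in the host
NUMA_NODE_SIZE_MB = 32 * 1024    # "NUMA node memory size is 32k"
VMK_RESERVED_MB = 33 * 1024      # "VMKMEM rsvd is around 33k"


def implied_numa_nodes(total_ram_gb: int, node_size_mb: int) -> int:
    """Number of equally sized NUMA nodes implied by total RAM and node size."""
    return (total_ram_gb * 1024) // node_size_mb


nodes = implied_numa_nodes(TOTAL_RAM_GB, NUMA_NODE_SIZE_MB)
reserved_exceeds_node = VMK_RESERVED_MB > NUMA_NODE_SIZE_MB

print(f"Implied NUMA nodes: {nodes}")
print(f"Reserved memory larger than one node: {reserved_exceeds_node}")
```

With CCX as NUMA enabled the memory is split into many small nodes, so under these assumptions the reserved amount is roughly what one whole node holds.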
Unfortunately, my SR was closed after 3 weeks without my case being properly resolved.
As I could not wait for GSS, I reconfigured the NUMA settings on the majority of my AMD servers.
They are now running with NPS=4 and CCX as NUMA disabled, under a much bigger load than before, and with no issues so far.
I'm writing this as a public warning and to ask whether any of you have seen something like this before.