I am running ESXi on 3 different machines under high load (CPU + disk), and I am encountering the following error event:
Lost access to volume GUID (XXX) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. 3/21/2015 3:48:57 PM
Shortly thereafter, access is restored:
Successfully restored access to volume GUID (XXX) following connectivity issues. 3/21/2015 3:49:24 PM
Interestingly, these events occur almost exactly every 6 hours on each affected machine (these lines are from vmkernel.log):
2015-03-21T13:39:01.303Z cpu30:32857)HBX: 270: Reclaimed heartbeat for volume GUID (XXX): [Timeout] Offset 3796992
2015-03-21T19:39:13.824Z cpu20:32855)HBX: 270: Reclaimed heartbeat for volume GUID (XXX): [Timeout] Offset 3796992
2015-03-22T01:39:09.569Z cpu0:32856)HBX: 270: Reclaimed heartbeat for volume GUID (XXX): [Timeout] Offset 3796992
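(Those lines were pulled from the vmkernel logs; a grep along these lines should list every heartbeat-timeout occurrence on a host, assuming the default /var/run/log location and standard .gz rotation:)

# current vmkernel log (default location; adjust if scratch is redirected)
grep "Reclaimed heartbeat" /var/run/log/vmkernel.log

# rotated, compressed copies, if any exist
for f in /var/run/log/vmkernel.*.gz; do [ -f "$f" ] && zcat "$f" | grep "Reclaimed heartbeat"; done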
Most of the search results (and the VMware KB articles) discuss issues with FC/iSCSI/network-attached datastores. However, these datastores are local disks attached to a MegaRAID SAS controller.
The fact that this occurs every 6 hours made me think some sort of scheduled task (a cron job or similar) was running and generating a burst of disk I/O, which, combined with the already high disk load, was saturating the controller. However, I can't find any such job in /var/spool/cron/crontabs/root. I've checked several logs in /var/log, and nothing stands out around those time frames. I've also updated to the latest ESXi patches, but that didn't help. Any ideas?
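For reference, the checks above amount to something like the following (assuming default log locations; file names and rotation can vary between builds):

# list any scheduled jobs for root
cat /var/spool/cron/crontabs/root

# look for anything logged around one of the event timestamps (example: the 13:39 occurrence)
grep "2015-03-21T13:3" /var/run/log/vmkernel.log /var/run/log/hostd.log /var/run/log/vobd.log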
FWIW, my hw/sw is:
ESXi 5.5.0, build 2456374
SuperMicro X10-DRHC
MegaRAID SAS Invader controller (LSI 3108, SAS3)
3 consumer SSD drives in RAID0 (for the affected datastore)
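In case the storage driver matters: the LSI 3108 is normally handled by either the megaraid_sas (vmklinux) or lsi_mr3 (native) driver, and something like this should show which one owns the controller and at what version (the module names here are assumptions; query whichever is actually loaded):

# ESXi version/build and the driver bound to each storage adapter
vmware -vl
esxcli storage core adapter list

# details for whichever MegaRAID module is loaded
esxcli system module get -m megaraid_sas
esxcli system module get -m lsi_mr3

As far as I know, per-disk SMART data isn't visible through the RAID virtual disk, so checking the SSDs themselves would require the LSI MegaCLI/StorCLI tools for ESXi.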