Hello,
I'm investigating a strange issue we experienced following the automated deletion of ~110 VMs.
Our infrastructure consists of 30 Dell R740 hosts running ESXi 6.7 P01, one vCenter Server 6.7 U3b, and one Kaminario K2 all-flash storage array. Connectivity is iSCSI with multipathing.
On April 16th, between 1:04 PM and 1:06 PM, a colleague initiated the removal of ~110 VMs via GitLab/Terraform.
Moments later, I started seeing errors in the vmkernel and hostd logs of our ESXi hosts.
Lots of "Lost access to volume" / "Successfully restored access to volume" pairs, for all of our LUNs:
2020-04-16T13:07:51.044Z info hostd[2100730] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14158 : Lost access to volume 5e0e34c7-fe079ae4-880d-b02628657b90 (fake_vol_name_001) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2020-04-16T13:07:51.611Z info hostd[2100680] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14159 : Successfully restored access to volume 5e0e34c7-fe079ae4-880d-b02628657b90 (fake_vol_name_001) following connectivity issues.
2020-04-16T13:08:04.051Z info hostd[2258728] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14508 : Lost access to volume 5e846ee7-9eeaf25e-c425-b02628c83b80 (fake_vol_name_002) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2020-04-16T13:08:04.668Z info hostd[2256191] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14509 : Successfully restored access to volume 5e846ee7-9eeaf25e-c425-b02628c83b80 (fake_vol_name_002) following connectivity issues.
Lots of failed 0x42 (UNMAP) and 0x89 (COMPARE AND WRITE) SCSI commands:
2020-04-16T13:08:43.433Z cpu42:2097285)ScsiDeviceIO: 3449: Cmd(0x459a969307c0) 0x89, CmdSN 0x9e2bfd from world 2097233 to dev "eui.0024f4008148000e" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0
2020-04-16T13:08:47.358Z cpu47:2098510)ScsiDeviceIO: 3399: Cmd(0x45a29262e7c0) 0x42, CmdSN 0x75a7d7 from world 3558716 to dev "eui.0024f4008148000d" failed H:0x8 D:0x0 P:0x0
2020-04-16T13:08:48.694Z cpu63:2098510)NMP: nmp_ThrottleLogForDevice:3802: Cmd 0x42 (0x45a28efa72c0, 3559157) to dev "eui.0024f400814801be" on path "vmhba64:C1:T0:L3" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
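For anyone reading along, the H: byte in those lines is the VMkernel host status. A small helper sketch for decoding it (the function name is mine; the code values are the standard SCSI host status codes that ESXi reports, so 0x5 is an abort and 0x8 is a reset):

```shell
#!/bin/sh
# decode_host_status is a hypothetical helper, not an esxcli command.
# It maps the H: (host status) byte from vmkernel ScsiDeviceIO/NMP log
# lines to its standard meaning.
decode_host_status() {
  case "$1" in
    0x0) echo "OK" ;;           # no host-side error
    0x1) echo "NO_CONNECT" ;;   # could not connect to the target
    0x2) echo "BUS_BUSY" ;;     # bus stayed busy
    0x3) echo "TIMEOUT" ;;      # command timed out
    0x5) echo "ABORT" ;;        # command was aborted
    0x7) echo "ERROR" ;;        # internal host adapter error
    0x8) echo "RESET" ;;        # bus or device reset cleared the command
    *)   echo "UNKNOWN($1)" ;;
  esac
}

decode_host_status 0x5   # ABORT  (the failed 0x89 COMPARE AND WRITE)
decode_host_status 0x8   # RESET  (the failed 0x42 UNMAPs)
```

So the COMPARE AND WRITE commands were aborted, and the UNMAPs were wiped out by a reset, which is why I suspect the array side rather than the network.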
These errors repeated for more than an hour, until 2:27 PM.
I would have expected the storage array to handle the deletion of ~110 VMs almost all at once.
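That said, until we understand the root cause, one mitigation I'm considering is throttling Terraform so the deletions (and the resulting deletes/UNMAPs against the array) are not issued nearly simultaneously. Terraform's standard `-parallelism` flag (default 10) limits concurrent operations; the value below is just an example:

```shell
# Limit Terraform to 2 concurrent resource operations during the destroy,
# spreading the VM deletions (and the storage I/O they trigger) over time.
terraform destroy -parallelism=2
```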
According to my network colleague, there was no network outage during this timeframe.
I have a support case open with Kaminario and opened another with VMware today.
Any idea what could have happened?