Channel: VMware Communities : Discussion List - ESXi

Datastore connectivity issues after big VM delete job


Hello,

I'm investigating a strange issue we experienced following the automated deletion of ~110 VMs.

Our infrastructure consists of 30 Dell R740 hosts running ESXi 6.7 P01, one vCenter Server 6.7 U3b, and one Kaminario K2 all-flash storage array. Storage connectivity is iSCSI with multipathing.

On April 16th, between 1:04pm and 1:06pm, my colleague initiated the removal of ~110 VMs via GitLab/Terraform.

Moments later, I started seeing these errors in the vmkernel and hostd logs of our ESXi hosts:

Lots of "Lost access to volume" / "Successfully restored access to volume" pairs, for all of our LUNs:

2020-04-16T13:07:51.044Z info hostd[2100730] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14158 : Lost access to volume 5e0e34c7-fe079ae4-880d-b02628657b90 (fake_vol_name_001) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

2020-04-16T13:07:51.611Z info hostd[2100680] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14159 : Successfully restored access to volume 5e0e34c7-fe079ae4-880d-b02628657b90 (fake_vol_name_001) following connectivity issues.

2020-04-16T13:08:04.051Z info hostd[2258728] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14508 : Lost access to volume 5e846ee7-9eeaf25e-c425-b02628c83b80 (fake_vol_name_002) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

2020-04-16T13:08:04.668Z info hostd[2256191] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14509 : Successfully restored access to volume 5e846ee7-9eeaf25e-c425-b02628c83b80 (fake_vol_name_002) following connectivity issues.
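To quantify these events, something like the following can tally lost/restored counts per datastore. This is a minimal sketch that embeds the sample lines above for illustration; on an actual host you would point LOG at /var/log/hostd.log instead.

```shell
#!/bin/sh
# Sketch: count "Lost access" vs "restored access" events per volume.
# Sample lines are embedded for illustration; on an ESXi host, set
# LOG=/var/log/hostd.log instead.
LOG=/tmp/hostd_sample.log
cat > "$LOG" <<'EOF'
2020-04-16T13:07:51.044Z info hostd[2100730] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14158 : Lost access to volume 5e0e34c7-fe079ae4-880d-b02628657b90 (fake_vol_name_001) due to connectivity issues.
2020-04-16T13:07:51.611Z info hostd[2100680] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14159 : Successfully restored access to volume 5e0e34c7-fe079ae4-880d-b02628657b90 (fake_vol_name_001) following connectivity issues.
2020-04-16T13:08:04.051Z info hostd[2258728] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14508 : Lost access to volume 5e846ee7-9eeaf25e-c425-b02628c83b80 (fake_vol_name_002) due to connectivity issues.
EOF

echo "Lost-access events per volume:"
grep 'Lost access to volume' "$LOG" | grep -o '([^)]*)' | sort | uniq -c

echo "Restored-access events per volume:"
grep 'restored access to volume' "$LOG" | grep -o '([^)]*)' | sort | uniq -c
```

Running this across all 30 hosts would show whether every host flapped on every LUN at the same moment (pointing at the array) or only a subset did (pointing at paths or network).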

Lots of failed 0x42 (UNMAP) and 0x89 (COMPARE AND WRITE) SCSI commands:

2020-04-16T13:08:43.433Z cpu42:2097285)ScsiDeviceIO: 3449: Cmd(0x459a969307c0) 0x89, CmdSN 0x9e2bfd from world 2097233 to dev "eui.0024f4008148000e" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0

2020-04-16T13:08:47.358Z cpu47:2098510)ScsiDeviceIO: 3399: Cmd(0x45a29262e7c0) 0x42, CmdSN 0x75a7d7 from world 3558716 to dev "eui.0024f4008148000d" failed H:0x8 D:0x0 P:0x0

2020-04-16T13:08:48.694Z cpu63:2098510)NMP: nmp_ThrottleLogForDevice:3802: Cmd 0x42 (0x45a28efa72c0, 3559157) to dev "eui.0024f400814801be" on path "vmhba64:C1:T0:L3" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
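The H: field in those lines is the host (VMkernel) status: H:0x5 means the command was aborted by the initiator, and H:0x8 means it returned after a reset. A quick tally per device and host status can show which LUNs took the most failures. Again a sketch with the sample lines embedded; on a host you would set LOG to /var/log/vmkernel.log.

```shell
#!/bin/sh
# Sketch: count failed SCSI commands per device and host status.
# Sample lines are embedded for illustration; on an ESXi host, set
# LOG=/var/log/vmkernel.log instead.
LOG=/tmp/vmkernel_sample.log
cat > "$LOG" <<'EOF'
2020-04-16T13:08:43.433Z cpu42:2097285)ScsiDeviceIO: 3449: Cmd(0x459a969307c0) 0x89, CmdSN 0x9e2bfd from world 2097233 to dev "eui.0024f4008148000e" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0
2020-04-16T13:08:47.358Z cpu47:2098510)ScsiDeviceIO: 3399: Cmd(0x45a29262e7c0) 0x42, CmdSN 0x75a7d7 from world 3558716 to dev "eui.0024f4008148000d" failed H:0x8 D:0x0 P:0x0
2020-04-16T13:08:48.694Z cpu63:2098510)NMP: nmp_ThrottleLogForDevice:3802: Cmd 0x42 (0x45a28efa72c0, 3559157) to dev "eui.0024f400814801be" on path "vmhba64:C1:T0:L3" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
EOF

# Extract the device id and the H: (host/VMkernel) status from each failure.
awk 'match($0, /dev "[^"]*"/) {
       dev = substr($0, RSTART + 5, RLENGTH - 6)
       if (match($0, /H:0x[0-9a-fA-F]+/))
         print dev, substr($0, RSTART, RLENGTH)
     }' "$LOG" | sort | uniq -c
```

With D:0x0 (no device/check-condition status from the array) on every line, the failures look like host-side aborts and resets rather than the array rejecting the commands, which is part of why I suspect a latency or queueing problem rather than a hard array fault.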

These errors kept repeating for more than an hour, until 2:27pm.

I would expect the storage array to be able to handle the deletion of 110 VMs almost all at once.
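That said, while we wait on support, one knob we could try on the Terraform side is limiting how many deletions run concurrently, so the array receives the resulting UNMAP load more gradually. The `-parallelism` flag is a real Terraform CLI option (default 10); whether throttling it actually avoids the flapping is only an assumption at this point.

```shell
# Hypothetical mitigation (untested here): cap concurrent resource
# operations during the destroy run so VM deletions are serialized
# in smaller batches instead of ~10 at a time.
terraform destroy -parallelism=2
```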

According to my network colleague, there was no network outage during this timeframe.

I have a support case open with Kaminario, and I opened another one with VMware today.

Any idea what could have happened?

