We had a serious incident last month that originated on our NetApp. We have several large generic iSCSI VMFS volumes hosted on a NetApp cluster, and our VMs live on them. The NetApp decided to hang one of these volumes for a little over 8 seconds. We are still looking for finer-grained detail, but it appears that for that 8+ second window all IO to the NetApp volume froze. At that point all the filesystems (all ext3 or ext4) on all the VMs on the volume went into read-only mode. The VMs are all running CentOS (mostly 6, some 7).
It would have been easy to fix by rebooting all the VMs, but there was file corruption on the root filesystems of all of them, and each one had to be booted into single-user mode so fsck could be run.
We think ESXi put the VMs' VMDK files into read-only mode, since the VMs' filesystems are all mounted with the default "continue" error behavior. However, the filesystems behaved as if the error behavior had been set to "remount-ro", and on top of that there was filesystem corruption.
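For what it's worth, this is roughly how we're checking the error behavior on each guest (assuming /dev/sda1 is the root device; adjust for your layout):

    # default error behavior baked into the superblock
    tune2fs -l /dev/sda1 | grep -i "errors behavior"

    # options the running kernel actually mounted root with
    grep " / " /proc/mounts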
I've found nothing in the vCenter logs except vmkernel entries showing that ESXi did in fact lose contact with the volume, with SCSI sense data H:0x5 (ABORT) and the rest all zeros.
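In case it helps anyone compare notes, this is more or less what I grepped for on the hosts (log path assumes a stock ESXi install):

    # lost-connectivity and abort entries around the time of the hang
    grep -iE "lost access|abort|H:0x5" /var/log/vmkernel.log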
Has anyone seen this happen? What I'm really looking for is a description of the tweaks I can make to ESXi to change the way it handles iSCSI timeouts / SCSI aborts. I would rather have all the VMs panic in this situation than go read-only. The VMs are in a cloud and spread out over multiple volumes; in our new cloud-based world, maintaining VM uptime is *not* always the best strategy.
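On the guest side, the closest workaround I've found so far is forcing the ext filesystems to panic on errors instead of continuing, either in the superblock or per mount. A rough sketch of what I mean (device name and fstab line are examples, not our actual layout):

    # make an IO/metadata error panic the guest kernel instead of continuing
    tune2fs -e panic /dev/sda1

    # or set it per mount in /etc/fstab:
    # /dev/sda1  /  ext4  defaults,errors=panic  1 1

That still leaves the ESXi side open, which is what I'm really asking about.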