Hi all
We have a number of clusters that each contain about 15 hosts. We also utilise RDMs for Microsoft failover clusters quite heavily in our environment - up to 70 RDMs. Our SAN array is a VNX 7500, and all hosts within each cluster are defined in a host group on the array.
ESXi hosts are Dell M620s, M630s, and R730s, running ESXi 5.5 Update 3.
All works well on a day-to-day basis, however we have been having issues with random clusters experiencing a failure/failover whenever we add a new host to the host group on the SAN array. It appears that when the host is added to the storage group it automatically kicks off a storage rescan (I can see this because the datastores start appearing on the host automatically). Some time after the host is added to the storage group - sometimes 15 minutes, sometimes up to 5 hours - some of the clusters start failing because the physical disks they use become unavailable. Errors we are seeing in the event log:
Cluster resource 'INST01_Log' of type 'Physical Disk' in clustered role 'SQL Server (clustername\INST01)' failed.
Ownership of cluster disk 'INST02_Data' has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.
In most cases the cluster will successfully fail over to the passive node. In other instances I'll need to manually bring the disk resource back online if it hasn't automatically recovered.
The reason it takes so long before an issue appears is that, as the RDMs are scanned for the first time, there is a SCSI reservation on them that prevents them from being read. The host waits for each device to time out before moving on to the next one. As good practice we perennially reserve all of our cluster RDMs, however it isn't possible to do this until the disk has been presented to the host for the first time. If we happen to reboot a host that doesn't yet have its disks perennially reserved, it can take up to 6 hours for the host to start responding.
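For anyone unfamiliar with the perennially reserved flag, this is roughly what we run on each host once the RDMs are visible - a sketch only, the naa ID below is a placeholder for one of your actual RDM device IDs:

```shell
# Mark an MSCS RDM as perennially reserved so the host skips it during
# boot/rescan instead of waiting for the SCSI reservation to time out.
# naa.600601601234567890... is a placeholder - substitute your RDM's device ID.
esxcli storage core device setconfig -d naa.600601601234567890 --perennially-reserved=true

# Confirm the flag took effect ("Is Perennially Reserved: true" in the output)
esxcli storage core device list -d naa.600601601234567890
```

Note this is per-host configuration, so it has to be repeated on every host in the cluster (or pushed out via host profiles), and as mentioned above it can only be set after the device has been presented for the first time.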
We logged a job with VMware, however they came back saying the issue is being caused by the array and we should contact EMC. I don't necessarily agree with this, as things usually operate fine - it's only when a host is added for the first time and a scan takes place that some sort of lock is placed on the RDM, preventing the MSCS cluster from reading/writing to it. We have seen no issues with the VMFS datastores themselves.
Has anyone else seen this, or know what could be causing the issue? Should a host performing a scan on an RDM that is in use by an MSCS cluster cause the cluster to fail?
Cheers
Brady