
Problem with Microsoft Failover Cluster Shared Disks


Weird one. Well, weird to me. 

 

I have a 2-node MS cluster. I've tried both 2008 R2 and now 2012.

 

ESXi 5.1 on 2 hosts; 5 others are still on 4.1. The 4.1 hosts are out of scope for now (pretty sure!).

 

2 individual Windows cluster nodes, VM-A on Host-A, VM-B on Host-B, both on 5.1. The OS drives are each hosted on separate thick-provisioned VMFS5 datastores. The RDMs are set for physical compatibility; I have 2 presented and attached.

 

The SAN is an Oracle Openstorage 7320 two-node storage cluster, ALUA type. Each node has a pool of storage it serves as the optimized/primary owner. All paths show correctly in vCenter.
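
For reference, this is roughly how I check the path state per LUN with PowerCLI (just a sketch; the host name and the NAA ID below are placeholders):

# PowerCLI: list the paths for one of the RDM LUNs and their state
Get-VMHost "esx-host-a" |
    Get-ScsiLun -CanonicalName "naa.600144f000000001" |
    Get-ScsiLunPath |
    Select-Object Name, State, Preferred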

 

The primary reason for using RDMs is that I need a 4TB file partition for use as a medium-sized pool for roaming profiles in a Citrix implementation. I'd prefer not to get crazy with DFS and different volumes, because of the high intra-breeding (ha!) of people across departments, locations and functions. It's very challenging to find a logical structure that sticks for long, so I'd rather go for one large store.

 

2 RDMs are presented to all hosts. Device attachment has the perennially-reserved flag set to true.
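
For what it's worth, here's roughly how I set and verify that flag from PowerCLI (a sketch, assuming a recent PowerCLI with the Get-EsxCli -V2 argument interface; the host name and NAA ID are placeholders, and the same can be done with plain esxcli on each host):

# Set and verify the perennially-reserved flag on one host (device ID is a placeholder)
$esxcli = Get-VMHost "esx-host-a" | Get-EsxCli -V2
$esxcli.storage.core.device.setconfig.Invoke(@{device = "naa.600144f000000001"; perenniallyreserved = $true})
$esxcli.storage.core.device.list.Invoke(@{device = "naa.600144f000000001"}) |
    Select-Object Device, IsPerenniallyReserved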

 

VM-A does a raw mapping of the 2GB LUN meant for quorum. The device attaches just fine, set to physical compatibility with physical SCSI bus sharing, at 1:0. Next, VM-A does an RDM to the 4TB LUN, same deal, at 2:0. Each mapping file is stored with the VM (I've tried placing them on an "RDM Mapping LUN", which didn't work; the problem described below still occurred).
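
In PowerCLI terms, the VM-A attach looks roughly like this (just a sketch; the NAA IDs are placeholders, and creating the controllers this way lets vSphere pick the 1:0 / 2:0 addresses rather than setting them explicitly):

# Rough PowerCLI equivalent of the VM-A steps above
$vmA  = Get-VM "VM-A"
$naaQ = "naa.600144f000000001"   # 2GB quorum LUN (placeholder ID)
$naaD = "naa.600144f000000002"   # 4TB data LUN (placeholder ID)

# Physical-mode RDM for the quorum LUN, moved onto its own controller with physical bus sharing
$qDisk = New-HardDisk -VM $vmA -DiskType RawPhysical -DeviceName "/vmfs/devices/disks/$naaQ"
New-ScsiController -HardDisk $qDisk -Type VirtualLsiLogicSAS -BusSharingMode Physical

# Same again for the 4TB LUN on a second shared controller
$dDisk = New-HardDisk -VM $vmA -DiskType RawPhysical -DeviceName "/vmfs/devices/disks/$naaD"
New-ScsiController -HardDisk $dDisk -Type VirtualLsiLogicSAS -BusSharingMode Physical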

 

Go over to VM-B and map to the existing drives: physical compatibility, physical bus sharing, 1:0... yadda yadda.
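
On VM-B it's the same idea, except pointing at the existing mapping files instead of creating new ones; something like this (the datastore and file names are placeholders):

# Attach the existing RDM pointer files to VM-B on shared physical-mode controllers
$vmB    = Get-VM "VM-B"
$qDiskB = New-HardDisk -VM $vmB -DiskPath "[Datastore-A] VM-A/VM-A_1.vmdk"
New-ScsiController -HardDisk $qDiskB -Type VirtualLsiLogicSAS -BusSharingMode Physical
$dDiskB = New-HardDisk -VM $vmB -DiskPath "[Datastore-A] VM-A/VM-A_2.vmdk"
New-ScsiController -HardDisk $dDiskB -Type VirtualLsiLogicSAS -BusSharingMode Physical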

 

Turn on VM-A. Drives seen, great. Turn on VM-B. Drives there. Great.

 

Create the cluster. Run the validation tests; it passes everything 100%. Set up the cluster, and all seems well.
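
The Windows side is nothing exotic; the PowerShell equivalent of what I'm doing is roughly this (cluster name and IP are placeholders):

# Validate and build the cluster from either node
Import-Module FailoverClusters
Test-Cluster -Node "VM-A", "VM-B"
New-Cluster -Name "CLU01" -Node "VM-A", "VM-B" -StaticAddress "10.0.0.50"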

 

Then I reboot a node. The drives fail over, and that goes well. Then I notice they bump to offline; the quorum disk (the 2GB LUN) comes back online, but the 4TB LUN goes to Failed. Manually restarting it works every time.
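
By "manually restarting" I just mean bringing the failed disk resource back online, something like this (the resource name is an example; it's whatever the cluster called the 4TB disk):

# Show the physical disk resources, then bring the failed one back online
Get-ClusterResource |
    Where-Object { $_.ResourceType -like "Physical Disk" } |
    Format-Table Name, State, OwnerNode
Start-ClusterResource -Name "Cluster Disk 2"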

 

So I thought maybe it was a failover thing. With all resources owned by VM-B, I reboot VM-A. The drives all get bumped offline. The quorum drive tends to come back almost immediately; the 4TB disk fails quickly and never comes back. If I shut down instead of reboot, same behavior. Once VM-A is powered off, I can manually start any disk that failed and it'll run forever. If I turn VM-A back on while everything is running on VM-B, the drives fail almost immediately during VM-A's POST. Again, I have to bring them online manually.

 

On Windows 2012, I've tried using Scale-Out File Server, thinking maybe I just need to get both heads reading/writing at the same time to maintain some kind of connection. It works great, by the way, but still fails in the same way. 2008 R2 does the same thing in exactly the same ways.
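
(The SOFS attempt was just the standard role add, something like the following, with a placeholder role name:)

# Add the Scale-Out File Server role on the 2012 cluster
Add-ClusterScaleOutFileServerRole -Name "SOFS01"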

 

Have I missed something obvious here?

 

We aren't talking about a SCSI timeout setting here. This isn't a timeout; it's almost like a lock conflict happens and the last one in immediately loses. Timeout values are set to 60, but this is not a timeout. As soon as I shut down, or power on, a VM, the other guy flips out almost immediately.
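
To be clear, the 60 I mean is the usual guest disk timeout (assuming the standard Disk\TimeOutValue registry setting is the relevant one here), checked on both nodes with:

# Check the guest SCSI disk timeout on each node (should show 60)
Get-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Disk" -Name "TimeOutValue"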

 

I've had no issues at all with storage on any of these hosts with this SAN setup; it's pretty damn fast, too. I can sling data and LUNs around and everything sees and connects to whatever I have, wherever I put it, with never an issue with pathing, ownership, zoning, etc.

 

Anyone ever see anything like this?

