Weird one. Well, weird to me.
I have a 2 node MS cluster. Tried both 2008 R2 and now 2012.
ESXi 5.1 on 2 hosts; 5 others are still on 4.1, but the 4.1 hosts are out of scope for now (pretty sure!).
Two separate Windows cluster nodes: VM-A on Host-A and VM-B on Host-B, both on the 5.1 hosts. The OS drives each live on their own thick-provisioned VMFS5 datastore. The RDMs are set to physical compatibility mode; I have 2 presented and attached.
The SAN is an Oracle Openstorage 7320, a 2-node storage cluster, ALUA type. Each node has a pool of storage it serves as the optimized/primary owner. All paths show correctly in vCenter.
The primary reason for using RDMs is that I need a 4TB file partition for a medium-sized roaming-profile store in a Citrix implementation. I'd prefer not to get crazy with DFS and multiple volumes, because of the high intra-breeding (ha!) of people across departments, locations and functions. It's very challenging to find a logical structure that sticks for long, so I'd rather go with one large store.
Both RDM LUNs are presented to all hosts, and each device has the perennially-reserved flag set to true.
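For reference, here's roughly how I set that flag - a minimal Python sketch that just spits out the esxcli commands I run on each 5.1 host (the NAA IDs below are placeholders, not my real device IDs):

# Sketch: print the esxcli commands to flag the two RDM LUNs as perennially
# reserved on a host. NAA IDs are placeholders for my actual devices.
RDM_DEVICES = [
    "naa.600144f00000000000000000000000a1",  # placeholder: 2GB quorum LUN
    "naa.600144f00000000000000000000000a2",  # placeholder: 4TB profile LUN
]

for naa in RDM_DEVICES:
    # set the flag on the host
    print("esxcli storage core device setconfig -d %s --perennially-reserved=true" % naa)
    # verify - the output should show "Is Perennially Reserved: true"
    print("esxcli storage core device list -d %s" % naa)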
On VM-A I create the raw mapping to the 2GB LUN meant for quorum. The device attaches just fine, compatibility mode set to Physical, SCSI bus sharing set to Physical, placed at 1:0. Then VM-A gets the RDM to the 4TB LUN, same deal, at 2:0. Each mapping file is stored with the VM (I've also tried placing them on a dedicated "RDM mapping" LUN, which didn't help - same problem as described below).
Over on VM-B, I map to the existing drives: physical, physical, 1:0... yadda yadda.
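To sanity-check the mappings on both nodes, I've been dumping the disk/controller config with a quick pyVmomi script - rough sketch below, untested as pasted here; the vCenter hostname, credentials and VM names are placeholders for my environment:

# Rough sketch: list RDM backings and SCSI controller sharing for both cluster
# nodes, to confirm physical compatibility mode and physical bus sharing on the
# shared controllers. Connection details and VM names are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.local", user="admin", pwd="secret",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)

for vm in view.view:
    if vm.name not in ("VM-A", "VM-B"):
        continue
    for dev in vm.config.hardware.device:
        if isinstance(dev, vim.vm.device.VirtualSCSIController):
            # expect sharedBus == 'physicalSharing' on controllers 1 and 2
            print(vm.name, "controller", dev.busNumber, "sharing:", dev.sharedBus)
        elif isinstance(dev, vim.vm.device.VirtualDisk) and isinstance(
                dev.backing, vim.vm.device.VirtualDisk.RawDiskMappingVer1BackingInfo):
            # expect compatibilityMode == 'physicalMode' for both RDMs
            print(vm.name, dev.deviceInfo.label, dev.backing.compatibilityMode,
                  dev.backing.deviceName)

Disconnect(si)

Both VMs show exactly what I'd expect from that: two RDMs in physical mode, controllers 1 and 2 set to physical sharing.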
Turn on VM-A: drives seen, great. Turn on VM-B: drives there, great.
Create the cluster. Validation passes everything 100%. Set up the cluster, and all seems well.
Then I reboot a node. The drives fail over, and that part goes well. But then I notice they get bumped offline: the quorum disk (the 2GB LUN) comes back online, while the 4TB LUN goes to Failed. Manually restarting it works every time.
So I thought maybe it was a failover thing. With all resources owned by VM-B, I reboot VM-A. The drives all get bumped offline. The quorum drive tends to come back almost immediately; the 4TB disk fails quickly and never comes back on its own. If I shut down instead of rebooting, same behavior. Once VM-A is powered off, I can manually bring any failed disk online and it'll run forever. If I turn VM-A back on while everything is running on VM-B, the drives fail almost immediately during VM-A's POST. Again, I have to bring them online manually.
On Windows 2012, I've tried using Scale-Out File Server, thinking maybe I just need both heads reading/writing at the same time to maintain some kind of connection. It works great, by the way - but it still fails in the same way. 2008 R2: same thing, in exactly the same ways.
Have I missed something obvious here?
We aren't talking about a SCSI timeout setting here. This isn't a timeout; it's almost like a lock conflict happens and the last one in immediately loses. The timeout values are set to 60, but the failure is instant: as soon as I shut down or power on a VM, the other guy flips out almost immediately.
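To try to catch it in the act, I've been grepping the hosts' vmkernel logs for reservation chatter around the moment a VM powers on or off - just a throwaway Python sketch run against a copied log file, nothing clever (the filename and naa prefix are placeholders for my setup):

# Throwaway sketch: scan a copied /var/log/vmkernel.log for SCSI reservation /
# conflict messages mentioning the RDM LUNs. Filename and naa prefix are
# placeholders for my environment.
import re

LOG_FILE = "vmkernel.log"   # copied off the ESXi host
pattern = re.compile(r"reservation|conflict|naa\.600144f0", re.IGNORECASE)

with open(LOG_FILE) as log:
    for line in log:
        if pattern.search(line):
            print(line.rstrip())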
I've had no storage issues at all with any of these hosts on this SAN, and it's pretty damn fast, too. I can sling data and LUNs around, and everything sees and connects to whatever I present, wherever I put it - never an issue with pathing, ownership, zoning, etc.
Anyone ever see anything like this?