Where do I begin.
I feel like I am always a newbie with VMware, despite having worked with it for a few years. We are running a vSAN environment on 6.5 and performing backups with Veeam. About 2 weeks ago, some of the VMs in our backup jobs started throwing errors. Due to some events outside of my control, I only started looking at this today. Veeam support said the error was caused by a corrupt VMX file. Their recommended solution was to shut down the machine, remove it from inventory, create a new machine using the existing disks, and bring it back up. We performed this on a non-critical machine, and it worked great. Did it to a semi-critical machine, and it worked great again. Did it to our Exchange server and... it wasn't great.
The server came back up, but after a few hours of operation a large number of people reported missing about 2 weeks of email. We had the machine up for about 5 hours, poking around in logs, before I shut it down to focus on the VMware side of things. After a ton of digging in the guest as well as in the host environment, I figured out the root cause: despite there being no snapshots in the snapshot manager, the system had been running off a snapshot left behind by the failed backup. I made the mistake of attaching the original vmdk files when building the new machine rather than the 000001.vmdk files. That was my own mistake of making assumptions; I thought those files were somehow orphaned since the snapshot manager listed no snapshots. The machines I had rebuilt successfully before either didn't have a snapshot file, or historical data didn't matter on those guests.
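For anyone wanting to check this before rebuilding a machine, here is a minimal sketch of how the disk entries in a .vmx file could be scanned for snapshot deltas. It only reads text and touches nothing on the datastore. The function name, the sample lines, and the device-to-disk mapping (scsi0:0, scsi0:1) are hypothetical and not taken from my actual VMX; the only convention it relies on is that snapshot descriptors end in -NNNNNN.vmdk, as the 000001 files here do.

```python
import re

# Disk entries in a .vmx look like: scsi0:0.fileName = "SomeDisk.vmdk"
DISK_LINE = re.compile(
    r'^\s*((?:scsi|sata|ide|nvme)\d+:\d+)\.fileName\s*=\s*"([^"]+)"',
    re.IGNORECASE,
)
# Snapshot delta descriptors conventionally end in -NNNNNN.vmdk
DELTA_NAME = re.compile(r'-\d{6}\.vmdk$', re.IGNORECASE)


def disks_in_vmx(vmx_text):
    """Return {device: (file_name, is_snapshot_delta)} for each .vmdk entry."""
    disks = {}
    for line in vmx_text.splitlines():
        match = DISK_LINE.match(line)
        if match and match.group(2).lower().endswith(".vmdk"):
            device, name = match.group(1), match.group(2)
            disks[device] = (name, bool(DELTA_NAME.search(name)))
    return disks


if __name__ == "__main__":
    # Hypothetical sample; a real run would read the old CAKEXK01.vmx instead.
    sample = (
        'scsi0:0.present = "TRUE"\n'
        'scsi0:0.fileName = "CAKEXK01_3-000001.vmdk"\n'
        'scsi0:1.fileName = "CAKEXK01_4.vmdk"\n'
    )
    for device, (name, is_delta) in disks_in_vmx(sample).items():
        note = "snapshot delta, attach THIS one" if is_delta else "base disk"
        print(device, name, note)
```

If the old VMX shows a disk pointing at a -000001.vmdk file, the VM was running on a snapshot regardless of what the snapshot manager lists.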
After talking with VMware support, they basically said that since the machine was booted from the original vmdks, the damage is done and I should consider the data lost. They did say I can try to remove the drives from the guest and re-add the snapshot versions, but they had little faith that it would work, and warned of a high chance of corrupting both the vmdk and the snapshot vmdk. Since the last shutdown, I've kept the server powered off and have been looking for any option to get this machine back to life with its current data, and have run into a brick wall every time. I'm being cautious about every step from this point on because of the corruption warnings. To reduce the risk of further corruption, I've copied all files except the snapshot files from their original location on the datastore to a different location. The snapshot files, however, simply will not budge. Web client copy, SSH copy, vmkfstools -i: nothing will get those files somewhere else at their original size (though I can download what looks to be the header with WinSCP).
I'm desperately trying to safeguard the snapshot data before doing something that may corrupt the whole guest, and then get this thing back into an up-to-date, running condition. Since this is an Exchange server, the files are quite large; just copying out the files took 3 hours. I'm now attempting a clone, as I've read that a clone may merge snapshot files automatically, with the hope that it won't impact the original files. If the clone doesn't work, my last resort would be to try booting off the snapshots, knowing I may lose everything. Finally I've landed here, having seen some users get success thanks to some of you truly amazing experts. The final kick in the rear is that our management is getting ready to accept the data loss just to get the server back on and email flowing, so their patience is thin. Casting out a bottle into the sea here, hoping it comes back with some much-needed help in time. Attaching relevant info that I've seen requested in other posts:
Directory ls -lh of original files:
-rw-r--r-- 1 root root 92 Oct 24 2018 CAKEXK01-8d4db6ef.hlog
-rw------- 1 root root 32.6K Nov 15 08:02 CAKEXK01-Snapshot557.vmsn
-rw-r--r-- 1 root root 13 May 8 2019 CAKEXK01-aux.xml
-rw------- 1 root root 8.5K Nov 14 08:12 CAKEXK01.nvram
-rw------- 1 root root 45 Nov 14 08:12 CAKEXK01.vmsd
-rwx------ 1 root root 4.6K Dec 6 21:22 CAKEXK01.vmx
-rw------- 1 root root 3.3K May 17 2018 CAKEXK01.vmxf
-rw------- 1 root root 5.0M Dec 6 21:22 CAKEXK01_3-000001-ctk.vmdk
-rw------- 1 root root 408 Nov 15 08:02 CAKEXK01_3-000001.vmdk
-rw------- 1 root root 600 Dec 7 04:12 CAKEXK01_3.vmdk
-rw------- 1 root root 5.9M Dec 6 21:22 CAKEXK01_4-000001-ctk.vmdk
-rw------- 1 root root 409 Nov 15 08:02 CAKEXK01_4-000001.vmdk
-rw------- 1 root root 576 Dec 7 04:12 CAKEXK01_4.vmdk
-rw------- 1 root root 2.0M Dec 6 21:22 CAKEXK01_5-000001-ctk.vmdk
-rw------- 1 root root 407 Nov 15 08:09 CAKEXK01_5-000001.vmdk
-rw------- 1 root root 598 Dec 7 04:12 CAKEXK01_5.vmdk
drwxr-xr-x 1 root root 280 Dec 7 06:38 bak
-rw------- 1 root root 299.5K May 17 2018 vmware-3.log
-rw------- 1 root root 15.2M Sep 21 2018 vmware-4.log
-rw------- 1 root root 3.0M Oct 18 2018 vmware-5.log
-rw------- 1 root root 393.2K Oct 22 2018 vmware-6.log
-rw------- 1 root root 467.3K Oct 24 2018 vmware-7.log
-rw------- 1 root root 244.0K Oct 24 2018 vmware-8.log
-rw------- 1 root root 45.4M Dec 6 21:22 vmware.log
Directory ls -lh of newly created machine that is pointing to the above vmdk's:
-rw-r--r-- 1 root root 295 Dec 6 21:35 CAKEXK01-35be335f.hlog
-rw------- 1 root root 8.5K Dec 7 05:25 CAKEXK01.nvram
-rw-r--r-- 1 root root 0 Dec 6 21:35 CAKEXK01.vmsd
-rwxr-xr-x 1 root root 3.8K Dec 7 05:25 CAKEXK01.vmx
-rw------- 1 root root 3.1K Dec 6 21:45 CAKEXK01.vmxf
-rw-r--r-- 1 root root 1.0M Dec 7 03:08 vmware-1.log
-rw-r--r-- 1 root root 322.3K Dec 7 05:25 vmware.log
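The small .vmdk files in the listings above (for example CAKEXK01_3.vmdk at 600 bytes and CAKEXK01_3-000001.vmdk at 408 bytes) are text descriptors, so they can be compared without touching the disk data. Below is a minimal sketch of that comparison. It assumes the usual descriptor fields CID, parentCID and parentFileNameHint; the function names and the sample descriptor contents are made up for illustration and are not copied from my files. As I understand it, writing to a base disk while a snapshot exists normally changes the base CID, so a mismatch here would be consistent with what VMware support described.

```python
import re

# Descriptor lines of interest look like: CID=1a2b3c4d or parentFileNameHint="x.vmdk"
FIELD = re.compile(
    r'^\s*(CID|parentCID|parentFileNameHint)\s*=\s*"?([^"\r\n]*)"?\s*$'
)


def read_descriptor(text):
    """Pull CID, parentCID and parentFileNameHint out of descriptor text."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def chain_is_consistent(child_text, parent_text):
    """True when the child's parentCID equals the parent's current CID."""
    child_parent_cid = read_descriptor(child_text).get("parentCID")
    parent_cid = read_descriptor(parent_text).get("CID")
    if child_parent_cid is None or parent_cid is None:
        return False
    return child_parent_cid.lower() == parent_cid.lower()


if __name__ == "__main__":
    # Hypothetical descriptor contents, for illustration only.
    base = "# Disk DescriptorFile\nversion=1\nCID=1a2b3c4d\nparentCID=ffffffff\n"
    delta = (
        "# Disk DescriptorFile\nversion=1\nCID=deadbeef\nparentCID=1a2b3c4d\n"
        'parentFileNameHint="CAKEXK01_3.vmdk"\n'
    )
    print(read_descriptor(delta))
    print("chain consistent:", chain_is_consistent(delta, base))
```

A matching CID pair only says the descriptors still line up; it says nothing about whether the data inside the base disk and the snapshot are still consistent with each other.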