Hi all -- wanted to describe progress on an upgrade of my former Threadripper system.
Starting point: 4 VMs on a Threadripper 1950X, each with GPU passthrough (1 x 2080, 3 x 2070), 64 GB RAM, ESXi 6.7 U3. The system was quite stable (see prior thread).
Target: Threadripper 3970X (double the cores), 128 GB RAM, on an ASRock TRX40 Creator motherboard.
I started by validating the new hardware under a temporary (non-virtualized) Windows build. Everything worked.
BIOS settings used: defaults, except:
Changed some fan settings to make them quieter
Turned on XMP
Turned on SR-IOV
Left PBO off (that's the default behavior, but I explicitly set it to Disabled -- PBO sucks up huge amounts of power for little performance benefit, to say nothing of validation!)
Used the current release BIOS, not the beta for the 3990X.
ESXi installation: I reused my previous installation. This had passthru.map entries for AMD and NVIDIA as detailed in my last post. It also had the previously recommended EPYC-specific configuration change (which I removed) and the Aquantia NIC driver preinstalled.
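(For anyone who wants the format: lines in /etc/vmware/passthru.map are four columns -- vendor ID, device ID, reset method, fptShareable. I won't swear this is character-for-character what I posted last time, but the commonly used NVIDIA entry looks like the sketch below. Check your own IDs with "esxcli hardware pci list" and reboot the host after editing:)

  # /etc/vmware/passthru.map -- vendor-id  device-id  resetMethod  fptShareable
  # NVIDIA (vendor 10de); ffff matches any device ID from that vendor
  10de  ffff  bridge   false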
Moved the 2 x M.2 SSDs from the old system into the new one. The system booted nicely into ESXi. All hardware passthrough settings vanished, as expected. Of note, *neither* of the NICs on this board has a native driver. I used the Aquantia driver and live off the 10 Gb Aquantia port. I have no idea if there is a Realtek Dragon 2.5G driver out there.
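(If you need to add the Aquantia driver yourself, it's the usual async-driver dance -- the bundle path and filename below are placeholders for whatever you downloaded:)

  # copy the driver bundle to a datastore, then install it (path/filename are placeholders):
  esxcli software vib install -d /vmfs/volumes/datastore1/aquantia-driver-bundle.zip
  # community-supported drivers may also need:
  #   esxcli software acceptance set --level=CommunitySupported
  # reboot the host afterwards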
Redid the hardware passthrough. All GPUs passed back through to their VMs, yay. Only two of them would boot, boo; the others kept crashing immediately upon booting Windows. Eventually, after much gnashing of teeth, I remade three VMs from scratch.
This was interesting. ESXi would report that the GPUs had violated memory access and advised adding a passthru.map entry, which didn't fix the problem. Changing the host BIOS to disable CSM and enable above-4G decoding, and enabling 64-bit MMIO in the VM, didn't fix it either. A new VM with a fresh Windows install worked.
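(By 64-bit MMIO in the VM, I mean the usual .vmx additions for GPUs with large BARs -- the size below is just an example, make it big enough to cover everything you pass through:)

  pciPassthru.use64bitMMIO = "TRUE"
  pciPassthru.64bitMMIOSizeGB = "64"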
There were several other interesting changes from the previous system:
Disabling MSI on the GPUs made them keep crashing, unlike on the old system, where it fixed stuttering
No CPU pinning or NUMA settings were used or needed
The mystical cpuid.hypervisor setting remains required to avoid error 43 (the usual .vmx line is below)
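(For anyone who hasn't run into this: the setting I'm referring to is the standard way of hiding the hypervisor from the guest so the GeForce driver doesn't throw error 43 -- add it to the VM's .vmx:)

  hypervisor.cpuid.v0 = "FALSE"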
With these caveats, I got 4 bootable VMs, each using the NVIDIA card's own USB-C connector for keyboard/mouse, with 8 CPUs per VM. Which led to the next problem, which I haven't been able to solve:
The mice/keyboards would all intermittently freeze for moments to minutes, and sometimes not come back. Lots of testing inside Windows showed no cause. Interestingly, the problem was 1) worse with a high-end G502 mouse, and 2) much worse inside the Windows UI -- and it never happened, for example, in demanding real-time full-screen apps. I was sure it was going to be some bizarre Windows problem. Rarely (every few hours) a system would crash completely (while idle!) with the same memory access violation. Also, rebooting one of the VMs would make the other VMs momentarily stutter. None of this ever happened on the 1950X system, where these controllers were reliable.
I eventually worked around the problem by using the motherboard's own USB controllers. There are 5: 2 x Matisse, 2 x Starship, and 1 x ASMedia. The Matisse ones are lumped into the same IOMMU group and won't pass through (they are perpetually "reboot needed"). The ASMedia chip worked with no problems (USB-C port on the back of the motherboard). The Starship USB 3.0 controllers both worked IF you had a passthru.map entry moving them to the d3d0 reset method (example below). Otherwise, booting a VM with one of these controllers failed AND crashed a different VM with a GPU memory access violation, and the controller then permanently disappeared until the system was powered down (not just rebooted). Wow, talk about bad crashes.
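(Roughly what that entry looks like -- 1022 is AMD's vendor ID, and the device ID below is a placeholder; look up the Starship USB 3.0 controller's actual ID on your board with "esxcli hardware pci list", then reboot the host:)

  # /etc/vmware/passthru.map
  # AMD Starship USB 3.0 controller -> force the d3d0 reset method
  # (xxxx = your controller's device ID, found via esxcli hardware pci list)
  1022  xxxx  d3d0     default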
Using these 3 motherboard controllers on 3 VMs appears rock stable (I haven't tested the fourth yet). One of them has 64-bit MMIO enabled, which probably isn't needed.
Things I haven't gotten around to testing yet:
1. Does isolating the VM to one CCX fix anything?
2. If only one VM is running, does the NVIDIA USB-C controller become reliable?
3. Does turning off XMP or using the latest beta BIOS change anything?
Other advice -- I'm obviously waaay off the HCL here -- but don't even try DRAM-less SSDs. The datastore *vanishes* under high load. Bad. The same thing happened with my OEM Samsung until I updated the firmware, but that's another story, well documented elsewhere.
I'm really puzzled by the NVIDIA USB-C thing. It would also be nice if the Matisse controllers worked. Otherwise I'm mostly pleased -- many of the kludges needed on older ESXi versions and on the 1950X with its wacky NUMA configuration are no longer necessary, and the new system is *much* faster.
Hope this helps someone else. If anyone can tell me what's going on (or at least that it's not just me), it would be much appreciated. I suspect a BIOS bug.
Thanks LT