We’re bringing up a new Veeam environment on new hardware and aligning our backup strategy with best/recommended practices. Since the hardware is new and we are not under a particular go-live schedule, I have a good opportunity to test and tune. Testing of the new backup server has uncovered a storage bottleneck that I don’t understand, so I’m looking for some coaching. There is some Veeam-heavy content, but it is relevant, so please bear with me.
At this location we’ve got three HP DL380 G9s running ESXi 6.0 Update 3d Express Patch 12 (initial Spectre patch) and a 6-node HP StoreVirtual/LeftHand P4300 SAN. Networking for all of the above is 10Gbps via dual Cisco 4500-X switches and SFP+ twinax cabling.
This new backup server is physical: an HP DL380 G10, dual-connected to the same C4500-Xs, running Windows Server 2016 with 64GB RAM. Jumbo frames are not configured in the environment, as both the backup server and the VMware hosts have trunked network adapters, and jumbos would apply to all L2 frames on those trunks.
We are using software iSCSI initiators both on the ESXi hosts (to the HP/LeftHand storage) and on the physical backup server. When we deployed this generation of the environment, there was a bug in the firmware/driver for the Emulex/HP 556 10Gb adapter that caused decreased performance and eventually disconnects when using the hardware iSCSI HBA mode. (I suppose I could test HBA mode on the new physical backup server and its Intel/HP 10Gbps card, but I have not.) The Microsoft iSCSI initiator is set up on the physical backup host, not in multipath mode; the HP/LeftHand LUNs are presented read-only and appear in Disk Management.
Veeam got installed and a test job configured. When I did this, I missed that the trial license enables the storage snapshot integration option by default. I hadn’t even looked at that tab, as I ‘knew’ we weren’t going to use it. It turned out to be enabled, and it revealed great performance by bypassing ESXi altogether. No real surprises yet.
Just for reference, the Veeam throughput and the job stats showed:
Overall processing rate: 907 MB/s, Load: Source 60% > Proxy 71% > Network 82% > Target 29%
I pretty quickly figured out it had done the storage snapshot, so I unchecked the job’s storage integration box and re-ran the exact same job as an active full in SAN mode.
This is the part I don’t understand:
Overall processing rate: 431 MB/s, Load: Source 99% > Proxy 16% > Network 6% > Target 1%
Seeing that the processing rate was cut in half and the load pointed at Source, I set about trying to find the bottleneck.
We already know from the storage-snapshot job that the backup server reading directly from storage is fast. I cannot measure that path with iperf, but that first Veeam run shows a good rate outside of ESXi: 907 MB/s is about 7.2 Gbps.
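For anyone checking my unit math, the MB/s-to-Gbps conversion works out like this (I’m assuming Veeam’s MB/s is decimal megabytes; if it’s actually MiB/s the Gbps figures would be ~5% higher):

```python
# Sanity-check the MB/s -> Gbps conversions used in this post.
# Assumes decimal units: 1 MB = 10^6 bytes, 1 Gbps = 10^9 bits/s.

def mb_s_to_gbps(mb_s: float) -> float:
    """Convert a throughput in MB/s to Gbps."""
    return mb_s * 8 / 1000

print(round(mb_s_to_gbps(907), 2))  # storage-snapshot job -> 7.26
print(round(mb_s_to_gbps(431), 2))  # SAN-mode job -> 3.45
```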
iperf2 (included with ESXi) from the backup server’s storage-network NIC to the VMware host’s storage vmkernel port IP shows ~7.5Gbps. I ran this with 4, 8, and 20 streams to get the 7.5 number; all runs were pretty close together. 7.5Gbps is in the range of that initial Veeam job, so now we know both ends of the backup-server-to-ESXi connection are good.
iperf3 from a VMware guest to the backup server shows ~8Gbps. Now we know that trunking, general network access, and NIC performance are good from ESXi.
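In case anyone wants to repeat these measurements without eyeballing the console output: iperf3 can emit JSON with `-J`, and a few lines of Python pull out the receiver-side rate. The field names below follow the iperf3 TCP JSON schema as I understand it, and the sample value is a placeholder, not one of my actual runs:

```python
import json

def received_gbps(iperf3_json: str) -> float:
    """Extract receiver-side throughput in Gbps from `iperf3 -J` output."""
    result = json.loads(iperf3_json)
    bps = result["end"]["sum_received"]["bits_per_second"]
    return bps / 1e9

# Trimmed-down sample of what `iperf3 -c <server> -J` returns
# (placeholder number, not a real measurement):
sample = '{"end": {"sum_received": {"bits_per_second": 8.0e9}}}'
print(received_gbps(sample))  # 8.0
```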
iperf3 from the old backup server to the new one shows high-8Gbps. This is mostly irrelevant; I just ran it to reinforce that the environment is capable.
So then I ran diskspd from inside a guest to test disk access speed to the storage. It came out at 410 MB/s, about the same rate as the SAN-mode Veeam job with ESXi involved.
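For reference, diskspd’s totals (bytes moved over the test duration) reduce to the same units as Veeam’s rate like this. The byte count and duration below are illustrative placeholders that roughly match my run, not exact diskspd output:

```python
# Reduce diskspd-style totals (bytes read over test duration) to MB/s
# so they compare directly with Veeam's reported processing rate.
# Placeholder inputs, not my actual diskspd output.

def mb_per_s(total_bytes: int, seconds: float) -> float:
    """Throughput in decimal MB/s from total bytes and elapsed seconds."""
    return total_bytes / seconds / 1e6

print(round(mb_per_s(12_300_000_000, 30.0)))  # -> 410
```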
If the rate from the Veeam backup server to storage is good, and Veeam to the ESXi host’s storage NIC is good, and ESXi to the general network is good…
What concept am I missing that explains the decreased rate from ESXi to the storage?