I just solved a long-standing storage performance issue when using cheap consumer SATA disks for ESXi 5.5.0 datastores through an LSI 9201-16i SAS HBA. Hopefully this helps somebody else.
Symptoms:
- Sudden, extreme disk latency during I/O heavy operations, like:
- Copying a large file in a VM with a freshly created thick-lazy_zero VMDK
- Creating thick-eager_zero VMDKs
- Creating storage pools in Server 2012's "Storage Spaces" feature
- SMB shares under heavy write load would "disappear"
- Windows resource monitor reporting 100% Disk Active Time but zero MB/sec
- Using SSH/SCP to copy files to datastores
- Disk I/O errors and degraded performance messages in /var/log/vmkernel.log
- Disks will "disappear" completely from ESXi during high I/O, then eventually re-appear when the I/O stops
- Only occurs with cheap SATA spinning disks (not SSDs or enterprise SAS)
- The same disks work fine when connected to onboard AHCI SATA (e.g. Intel ICH), but choke when connected via the LSI HBA.
- Controller and disks work fine with a non-ESXi OS (e.g. Windows Server) on the bare metal.
Finally after a lot of pain I discovered how to fix it. As with so many things in IT, when you find the root cause it's very satisfying.
Root Cause:
- The VAAI (vStorage APIs for Array Integration) storage acceleration feature in ESXi uses a special SCSI command 0x93 WRITE_SAME.
- Cheap SATA disks often do not support WRITE_SAME.
- When the 0x93 WRITE_SAME command hits the SATA disk, it hiccups, flushes its buffer, and causes a huge latency spike.
- For whatever reason, SAS HBAs pass the 0x93s through to the disks (which then start choking) but AHCI SATA controllers do not. (Not sure why)
- If the 0x93s come fast & heavy, the disk will "disappear" momentarily from ESXi, and eventually be discovered again when the 0x93s stop.
- I suspect if the disks were connected with hardware RAID, the controller might think they're "failed" and start populating a hot spare.
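If you want to check whether you're hitting the same thing, a rough sketch like the one below counts log lines that mention opcode 0x93. It assumes failed commands are logged in /var/log/vmkernel.log with their opcode in hex; the function name count_write_same is just something made up for this example.

```shell
# Count vmkernel log lines that mention SCSI opcode 0x93 (WRITE_SAME).
# Assumption: failed commands show up in the log with the opcode in hex.
# count_write_same is a hypothetical helper name, not an ESXi tool.
count_write_same() {
    log="${1:-/var/log/vmkernel.log}"
    # -w matches 0x93 as a whole word, so longer hex values like 0x9312 are skipped
    grep -c -w '0x93' "$log"
}
```

Run count_write_same during one of the heavy I/O operations above; a count that keeps climbing would be consistent with this root cause. `esxcli storage core device vaai status get` also lists what ESXi thinks each device supports.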
Solution:
Disable VAAI on the ESXi host - Configuration -> Advanced Settings:
- Set DataMover -> DataMover.HardwareAcceleratedInit = 0
- Set DataMover -> DataMover.HardwareAcceleratedMove = 0
- Set VMFS3 -> VMFS3.HardwareAcceleratedLocking = 0
Depending on your setup, you may only need to change some of these settings. A reboot of the host is not required. See KB1033665.
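The same three settings can also be changed from the ESXi shell over SSH. Here's a minimal sketch using esxcli; the disable_vaai function name and the ESXCLI override variable are made up for this example (set ESXCLI=echo to preview the commands without touching the host).

```shell
# Sketch: set the three VAAI options to 0 from the ESXi shell.
# disable_vaai and the ESXCLI variable are hypothetical names for this example;
# ESXCLI=echo prints the commands instead of running them.
disable_vaai() {
    for opt in /DataMover/HardwareAcceleratedInit \
               /DataMover/HardwareAcceleratedMove \
               /VMFS3/HardwareAcceleratedLocking; do
        # 0 disables the primitive; stop at the first failure
        "${ESXCLI:-esxcli}" system settings advanced set --int-value 0 --option "$opt" || return 1
    done
}
```

Call disable_vaai to apply. To check a value afterwards, use `esxcli system settings advanced list --option /DataMover/HardwareAcceleratedInit`.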
Hopefully somebody will find this useful - Or perhaps somebody will tell me something I missed...
Regards,
Dave