We have been working to get a real-time, latency-sensitive communications application to run without errors in a virtual environment under the ESXi 6.5.0 hypervisor. Unfortunately, to date we have had limited success in that endeavor. The communications system consists of two main parts:
1) A Linux (RHEL6.8) comms/audio process we wrote that runs on a dedicated CPU. This is our RT CPU; it runs at a 1,000 Hz frame rate and also talks to the Ethernet interface (#2 below). (A rough sketch of this kind of in-guest setup follows the list.)
2) A Gigabit Ethernet interface to one or more audio distribution devices, running on a closed LAN that is very time-sensitive (i.e. a SYNC'd clock network).
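For context, here is a minimal sketch of the kind of in-guest setup the RT process in #1 performs. The names and values are illustrative, not our actual source; it assumes only the standard Linux affinity/real-time APIs available on RHEL6.8.

    /* Illustrative sketch only -- not our production source. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    static void rt_setup(int rt_cpu)
    {
        cpu_set_t set;
        struct sched_param sp;

        /* Pin this process to the dedicated RT vCPU. */
        CPU_ZERO(&set);
        CPU_SET(rt_cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            exit(1);
        }

        /* Run under the real-time FIFO scheduler at high priority. */
        sp.sched_priority = 80;
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");
            exit(1);
        }

        /* Keep all pages resident so the 1 kHz loop never page-faults. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");
    }

    int main(void)
    {
        rt_setup(1);  /* vCPU 1 plays the RT-CPU role in this sketch */
        /* ... the 1,000 Hz frame loop and NIC I/O would run here ... */
        return 0;
    }

The point being: inside the guest the process never moves off its vCPU; the issue described below is about the vCPU moving between PCPUs underneath it.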
Our customer base has historically used dedicated hardware for each instance of the above: a quad-core or better system that runs RHEL6.8 natively plus our software. We have no audio issues in that configuration.
Our Host for Virtualization
We have a Supermicro SYS-7038A-I platform with an Intel Xeon E5-2660 v3 CPU (2.6 GHz, 10 cores), 32 GB RAM, and 2 Gigabit NICs. The NIC for item #2 above is set up in passthrough mode for the best results, also based on some recommendations in your white paper on latency-sensitive applications.
VM Instance Definition
Note: For all the info that follows we are running only a single VM instance of our application, so there is no interference from 'other' VMs on the same host. Also, for all of the results below, Latency Sensitivity is set to HIGH and full CPU resources are reserved.
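For reference, the relevant .vmx entries, as we understand them, look roughly like this (values illustrative; the device-specific passthrough keys are omitted):

    sched.cpu.latencySensitivity = "high"
    sched.cpu.min = "<full MHz reservation for all vCPUs>"
    sched.cpu.affinity = "all"
    pciPassthru0.present = "TRUE"

As far as we can tell, the GUI affinity box writes sched.cpu.affinity, and leaving it 'undefined' appears equivalent to "all".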
VM Working Configuration
We are able to successfully get our comms application to work in a VM instance, but only when Affinity is either 'undefined' or set to all CPUs, which is '0-19' in this case since hyper-threading is enabled. When this is working you will see a chart like the one in:
ESXi_MONITOR_PCPU_CONSTANT.png
If you look at this chart you will see that PCPU#7 is the CPU running our real-time core process (RT CPU) and PCPU#10 is running our non-real-time core (NRT CPU) process. The important part here is that once I start our software (power on the server), the RT CPU stays on PCPU#7 FOREVER and it never moves.
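(For anyone reproducing this: the per-PCPU usage can also be captured from the host shell with esxtop in batch mode, e.g.:)

    esxtop -b -d 2 -n 300 > pcpu_usage.csv

Here -b is batch mode, -d the sample delay in seconds, and -n the number of samples.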
VM Non-Working Configuration
This is the same as the working configuration except that instead of 'undefined' or '0-19' we set the affinity to '0-17'. Since it is still the only VM running, it still has full access to those 18 CPUs (as opposed to 20), yet the RT CPU moves around all the time, as can be seen in the chart:
ESXi_MONITOR_PCPU_MOVING#.png
It is this movement of our RT CPU process from PCPU to PCPU that 'may' cause audio break-up. I say 'may' because the act of the transition does not guarantee our system will break up; however, it has to date been the only cause of break-up. Hence if I can lock our RT CPU to a PCPU I can get this all working!!! I have tried other affinity settings like '0-3' or '10, 12, 14, 15' or …. with the same basic results.
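To make "break-up" concrete: the symptom shows up as the 1 kHz frame loop waking late and overrunning its 1 ms slot. Below is a minimal sketch of the kind of check that flags it (illustrative, not our actual source; the threshold is made up; link with -lrt on RHEL6):

    /* Illustrative 1,000 Hz wakeup-jitter check -- not our production source. */
    #include <stdio.h>
    #include <time.h>

    #define FRAME_NS        1000000L   /* 1 ms frame at 1,000 Hz */
    #define JITTER_LIMIT_NS  200000L   /* illustrative overrun threshold */

    int main(void)
    {
        struct timespec next, now;
        long late_ns;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (;;) {
            /* Advance the absolute deadline by exactly one frame. */
            next.tv_nsec += FRAME_NS;
            if (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec++;
            }
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

            /* Measure how late the wakeup actually was. */
            clock_gettime(CLOCK_MONOTONIC, &now);
            late_ns = (now.tv_sec - next.tv_sec) * 1000000000L
                    + (now.tv_nsec - next.tv_nsec);
            if (late_ns > JITTER_LIMIT_NS)
                fprintf(stderr, "frame overrun: %ld ns late\n", late_ns);

            /* ... the frame's audio work would run here ... */
        }
        return 0;
    }

A check like this is how one would correlate break-ups with the PCPU transitions visible in the chart.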
So I guess that gets me to my more specific question:
If I have a VM instance defined with, say, 2 vCPUs, and one of those vCPUs will be running a time-sensitive process, is there a mechanism to lock that RT CPU to a PCPU for the duration of that VM instance's uptime (i.e. until power-down)?
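To illustrate the intent (this key is hypothetical; we have not found anything like it documented), what we are after is effectively:

    sched.vcpu0.affinity = "7"

i.e. pin vCPU 0 to PCPU 7 at power-on and keep it there until power-off.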
Thanks
Paul