For some time, we have had performance problems with one database server. Under higher load, latency goes up and I/O starts queueing. We tweaked queue settings in the guest OS and in ESXi with no results. Esxtop showed high KAVG/cmd.
For testing purposes, I created a Windows VM with Iometer and tried to replicate the I/O pattern of the database. Workload pattern: 64K I/O size, 65% read/35% write, 50% random. Since I did not have a previous baseline, I didn’t know what to make of the numbers. I saw about 6000 IOPS, about 550 MB/sec throughput, and around 40 ms read and write latency. After some testing, I migrated my test VM to a different host with an Emulex HBA. I noticed the difference as soon as I started the test: 40000+ IOPS, around 2300 MB/sec throughput, and around 5 ms latency. I ran several other tests on both hosts, and the maximum throughput on the problematic host never went above 600 MB/sec.
We opened a case with the vendor, and support pointed out that although we were running an officially supported qedf driver version, 220.127.116.11, it is prone to performance issues under load. We upgraded some of our hosts to driver version 18.104.22.168 and some to driver version 22.214.171.124. We saw a dramatic improvement in performance in both cases: around 45000 IOPS, about 3000 MB/sec throughput, and around 5 ms latency.
I also found a host with a much older driver (126.96.36.199). The performance issue was present on that host also.
According to the qedf release notes, there is a performance-related fix in driver version 188.8.131.52. My guess is that if you are running a driver version below 184.108.40.206, you are most likely affected by this performance issue.
After upgrading to ESXi 7 U2 (17630552), some of my hosts started failing after running for some time. All the affected hosts had one thing in common: ESXi is installed on an SD card. Hosts where ESXi is installed on an SSD do not seem to have this issue.
06.05.2021 update: VMware ESXi 7.0.2 build 17867351 also seems to be affected by the same problem.
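Checking many hosts against a "fixed in" version is easy to get wrong if the versions are compared as plain strings. The following is a minimal sketch of a numeric comparison of dotted version strings. The function names and the version numbers in the example are hypothetical and only illustrate the idea; they are not real qedf versions.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '1.2.3.4' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def is_older_than(installed: str, fixed: str) -> bool:
    """Return True if the installed version sorts before the fixed version."""
    return parse_version(installed) < parse_version(fixed)


# Hypothetical version numbers, used only to show the comparison.
# A plain string comparison would get this wrong because "8" > "5".
print(is_older_than("2.2.8.0", "2.2.52.1"))  # True
```

How the installed driver version is collected from each host is left out here; any inventory export that yields a dotted version string would work with this check.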
I saw the following error messages:
Bootbank cannot be found at path ‘/bootbank’
hostd performance has degraded due to high system latency
A lot of NMP warnings about vmhba32 and vmhba33
As of now, I have rolled those hosts back to ESXi 7 U1. I will have to see if this error is related to U2 or I have bad SD cards.
I was patching some hosts to VMware ESXi 7.0.2 build 17630552 with Lifecycle Manager, and some of the hosts failed to boot. I was seeing the message “Failed to load crypto64.efi. Fatal error: 15 (Not found)”. The issue happened with HPE Gen9 and Dell servers.
In my lab I discovered a host whose root password was something other than what I usually use, and I was not able to figure it out. Since the host was connected to vCenter, I used a PowerShell script to change it. NB! Make sure the new password meets the complexity requirements.
We were looking into the amount of ESXi logs we were collecting, and we discovered that two applications in ESXi were at the verbose logging level although we had set “config.HostAgent.log.level” to info. Those applications were rhttpproxy and fdm. They were generating millions of lines per day.
To reduce the rhttpproxy log level, edit /etc/vmware/rhttpproxy/config.xml and replace the “verbose” value in the log level section with, for example, “info”. After this, restart the rhttpproxy service.
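If the same edit has to be made on many hosts, the replacement can be scripted. Below is a rough sketch that rewrites the log level in the text of the config file. It assumes the level is stored as a `<level>` element inside a `<log>` element; that layout and the function name are my assumptions, so check the actual file before using anything like this.

```python
import re

# Assumed shape of the relevant part of config.xml (the real file may differ):
#   <log> ... <level>verbose</level> ... </log>
LEVEL_PATTERN = re.compile(r"(<log>.*?<level>)\s*\w+\s*(</level>)", re.DOTALL)


def set_log_level(config_text: str, new_level: str = "info") -> str:
    """Return config_text with the first <level> inside <log> set to new_level."""
    return LEVEL_PATTERN.sub(
        lambda m: m.group(1) + new_level + m.group(2), config_text, count=1
    )


sample = """<config>
  <log>
    <name>rhttpproxy</name>
    <level>verbose</level>
  </log>
</config>
"""
print(set_log_level(sample))
```

The sketch works on a string and leaves reading and writing the file to the caller, so the original can be backed up first. The service still has to be restarted afterwards, as described above.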
fdm (HA agent)
To change the HA agent log level, modify the cluster advanced settings and add the option “das.config.log.level” with the value “info”. After this, disable High Availability on the cluster and then re-enable it.
PowerCLI lines to do this:
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name 'das.config.log.level' -Value info
Set-Cluster -Cluster $cluster -HAEnabled:$false
Set-Cluster -Cluster $cluster -HAEnabled:$true
We noticed that the NTP service does not start after ESXi 7 patching, even though it is configured to “Start and stop with host”. According to a VMware KB article (link), there is no fix for this issue at the moment.
I wrote a small script which I run every 5 minutes to check and start NTP service if it’s stopped.
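The script itself is not reproduced here. As an illustration of the check-and-start logic only, here is a minimal sketch that would run on the host itself rather than through vCenter. It assumes the NTP init script lives at /etc/init.d/ntpd and prints a line containing "not running" when the service is stopped. Both are assumptions, and this is not the script I actually used.

```python
import subprocess

NTPD_INIT = "/etc/init.d/ntpd"  # assumed location of the NTP init script on the host


def ensure_ntpd_running(run=subprocess.run) -> bool:
    """Start ntpd if the status check says it is not running.

    Returns True if a start was attempted, False if ntpd was already running.
    The 'run' parameter exists so the logic can be exercised without a host.
    """
    status = run([NTPD_INIT, "status"], capture_output=True, text=True)
    # Assumption: the init script prints "... is not running" when stopped.
    if "not running" in status.stdout:
        run([NTPD_INIT, "start"], capture_output=True, text=True)
        return True
    return False


# On a host, the function would simply be called on a schedule:
#   ensure_ntpd_running()
```

Scheduling is left out on purpose. Any mechanism that calls the function every few minutes gives the same effect as the 5-minute check described above.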
Recently, while we were solving the USB/SD-card boot issue (link), we noticed that “Scan entity” tasks on some hosts took more than 15 minutes. In the VMware Community there is a thread describing a similar issue: https://communities.vmware.com/thread/470949. One post there mentioned that the problem only exists on hosts with FC storage.
Based on that information we ran a test, with the following results. The scan entity task took about 16 minutes on a host with 100+ FC LUNs. It took about 1 minute on the same host with all FC LUNs disconnected (FC ports disabled). We also tested on a host with about 10 LUNs, where the scan entity task took about 3 minutes. I have opened a case with VMware to understand how the LUN count is related to scanning for patches.
Update: I never got any good solution from VMware support. The problem has disappeared for now.
We had several Dell servers with ESXi installed on an SD card that still showed missing patches after installing 7.0.1 build 16850804. When we restarted the ESXi host, it reverted to 7.0.0. After some investigation we noticed issues with /bootbank and /altbootbank. We also noticed this issue on a freshly installed ESXi 7.0.1 build 16850804.
We ran into an issue where two of our VMs froze, and after investigation we discovered it was an issue with the VMFS6 heap size on ESXi 7. The error on the ESXi host is “WARNING: Heap: 3651: Heap vmfs3 already at its maximum size. Cannot expand.”
Error from VM side: There is no more space for virtual disk ‘vm_1.vmdk’. You might be able to continue this session by freeing disk space on the relevant volume, and clicking Retry. Click Cancel to terminate this session.
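To find out whether other hosts are hitting the same limit, the collected logs can be searched for the ESXi warning quoted above. This is a small sketch that pulls the heap name out of matching lines. The pattern is built only from that one message, so other variants of the warning may not match, and the function name is made up for the example.

```python
import re

# Matches warnings of the form quoted above and captures the heap name.
HEAP_FULL = re.compile(r"WARNING: Heap: \d+: Heap (\S+) already at its maximum size")


def find_full_heaps(log_lines):
    """Return the set of heap names reported as being at maximum size."""
    heaps = set()
    for line in log_lines:
        match = HEAP_FULL.search(line)
        if match:
            heaps.add(match.group(1))
    return heaps


sample_lines = [
    "WARNING: Heap: 3651: Heap vmfs3 already at its maximum size. Cannot expand.",
    "some unrelated log line",
]
print(find_full_heaps(sample_lines))  # {'vmfs3'}
```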