Reducing the amount of ESXi logs

We were looking into the amount of ESXi logs we were collecting and discovered that two applications on ESXi were logging at the verbose level, although we had set "config.HostAgent.log.level" to info. Those applications were rhttpproxy and fdm, and they were generating millions of log lines per day.

rhttpproxy

To reduce the rhttpproxy log level, edit /etc/vmware/rhttpproxy/config.xml and replace the "verbose" value in the log level section with "info", for example. After this, restart the rhttpproxy service.
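As a rough illustration (the exact element layout can vary between ESXi builds, so treat this as a sketch rather than the literal file contents), the change boils down to something like this in config.xml, followed by a service restart from the ESXi shell:

<log>
  <level>info</level>  <!-- was: verbose -->
</log>

/etc/init.d/rhttpproxy restart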

fdm (HA agent)

To change the HA agent log level, modify the cluster advanced settings and add the option "das.config.log.level" with the value "info". After this, disable High Availability on the cluster and then re-enable it.

PowerCLI lines to do this:
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name 'das.config.log.level' -Value 'info'
Set-Cluster -Cluster $cluster -HAEnabled:$false
Set-Cluster -Cluster $cluster -HAEnabled:$true

NTP service not starting on ESXi 7 after restart.

We noticed that the NTP service does not start after ESXi 7 patching, although it is configured to "Start and stop with host". According to a VMware KB article (link), there is no fix for this issue at the moment.

I wrote a small script, which I run every 5 minutes, that checks the NTP service and starts it if it is stopped.

$esx_hosts = Get-VMHost -State Maintenance,Connected
$esx_hosts | Get-VMHostService | Where-Object {$_.Key -eq "ntpd" -and $_.Running -ne $true} | Start-VMHostService

Scan entity (Check Compliance) task takes a long time.

Recently, when we were solving the USB/SD-card boot issue (link), we noticed that "Scan entity" tasks on some hosts took more than 15 minutes. There is a thread on VMware Communities describing a similar issue – https://communities.vmware.com/thread/470949. One post there mentions that the problem only exists on hosts with FC storage.

Based on that information we ran a test, with the following results. The Scan entity task took about 16 minutes on a host with 100+ FC LUNs, and about 1 minute on the same host with all FC LUNs disconnected (FC ports disabled). We also tested on a host with about 10 LUNs, where the Scan entity task took about 3 minutes. I have opened a case with VMware to understand how the LUN count is related to scanning for patches.

ESXi 7.0.1 loses access to USB/SD-card.

We had several Dell servers with ESXi installed on an SD card that still showed missing patches after installing the 7.0.1 build 16850804. When we restarted such an ESXi host, it reverted back to 7.0.0. After some investigation we noticed issues with /bootbank and /altbootbank. We also noticed this issue on a freshly installed ESXi 7.0.1 build 16850804.

Some links about the issue:
https://www.reddit.com/r/vmware/comments/j92b40/fix_for_usb_booted_esxi_7_hosts_losing_access_to/
https://kb.vmware.com/s/article/2149444

The fix for the issue is to add a new parameter to the kernelopt line in boot.cfg. The parameter is devListStabilityCount=10. We also added this parameter on 6.7.0 before upgrading to 7.0.1.
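For illustration only, the parameter is appended to whatever is already on the kernelopt line (the placeholder below stands for the existing options, which differ from host to host):

kernelopt=<existing options> devListStabilityCount=10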

VMFS6 heap size issue on ESXi 7 affecting running VMs – Updated

We have run into an issue where two of our VMs froze, and after investigation we discovered it was an issue with the VMFS6 heap size on ESXi 7. The error on the ESXi side is "WARNING: Heap: 3651: Heap vmfs3 already at its maximum size. Cannot expand."

Error from the VM side: "There is no more space for virtual disk 'vm_1.vmdk'. You might be able to continue this session by freeing disk space on the relevant volume, and clicking Retry. Click Cancel to terminate this session."

VMware has a KB article about it: VMFS-6 heap memory exhaustion on vSphere 7.0 ESXi hosts (80188)

This issue is fixed in ESXi 7 Update 1.

To work around this issue, follow the steps below (from the KB article):

1. Create an eager zeroed thick disk on all of the mounted VMFS6 datastores.
vmkfstools -c 10M -d eagerzeroedthick /vmfs/volumes/datastore/eztDisk

2. Delete the eager zeroed thick disk created in step 1.
vmkfstools -U /vmfs/volumes/datastore/eztDisk

The workaround has to be performed for each datastore on each host.
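Because the workaround is per datastore and per host, it helps to know which VMFS6 datastores each host actually sees. A minimal PowerCLI sketch along these lines can produce that list (it only reports, it does not run the workaround itself):

# List VMFS6 datastores visible to each connected host
foreach ($esx_host in Get-VMHost -State Maintenance,Connected) {
    $vmfs6 = $esx_host | Get-Datastore | Where-Object { $_.Type -eq "VMFS" -and $_.FileSystemVersion -like "6*" }
    Write-Host "$esx_host : $($vmfs6.Name -join ', ')"
}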

Checking ESXi firewall status via PowerCLI

I discovered that some hosts in our environment did not have the firewall enabled. So I wrote a small PowerShell script to check the firewall status and enable the firewall if it is not enabled.

The script:

$esx_hosts = Get-VMHost -State Maintenance,Connected

foreach ($esx_host in $esx_hosts)
{
    Write-Host "$esx_host checking"
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $fw_status = ($esxcli.network.firewall.get.Invoke()).Enabled
    Write-Host "$esx_host - $fw_status"

    if ($fw_status -eq "false") {
        Write-Host "Enabling FW" -ForegroundColor Green
        $arguments = $esxcli.network.firewall.set.CreateArgs()
        $arguments.enabled = "true"
        $esxcli.network.firewall.set.Invoke($arguments)
    }
}

Upgrading to ESXi 7.0 – The upgrade has VIBS that are missing dependencies

I was upgrading a cluster with HPE servers from ESXi 6.7 to ESXi 7.0, and one host was complaining about missing VIB dependencies – "The upgrade has VIBS that are missing dependencies". That is normal when you upgrade ESXi hosts on which 3rd-party tools and drivers are installed. I encountered similar issues while upgrading my home lab (blog post about it). The easy fix is to remove those VIBs. In this case, what threw me off was that the VIBs reported to have missing dependencies were dell-configuration-vib and dellemc_osname_idrac.

Since it was an HPE server, the Dell VIBs made no sense. I was not able to remove them: they did not appear when I listed the installed VIBs, and the remove command also failed.
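For reference, a quick way to double-check what a host reports as installed is to query the VIB list through PowerCLI; this is just a sketch (the host variable is a placeholder), and in a case like this the Dell VIBs simply do not show up in the output:

$esxcli = Get-EsxCli -VMHost $esx_host -V2
$esxcli.software.vib.list.Invoke() | Where-Object { $_.Name -like "*dell*" }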

What had happened with this server is that I had mistakenly installed it using the custom ISO for Dell servers instead of the custom ISO for HPE servers.

To fix the issue I am going to perform a clean installation of ESXi 7.0, this time using the custom HPE ISO.

Migration option grayed out for VM

A couple of days ago I had to move several VMs from one datastore to another. For two of the VMs the migration button was grayed out. I had seen this before: backup software disables vMotion/Storage vMotion during a backup and sometimes does not remove the lock after the backup is finished. Until now I have always gone into the database and deleted those "locks" from the vpx_disabled_methods table, but it seems it is also possible to clear them using the MOB (Managed Object Browser). I followed the instructions in the KB article and removed the "locks" from the database – https://kb.vmware.com/s/article/1029926.
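Before going to the database or the MOB, it can be useful to confirm which methods are actually disabled on the affected VM. A minimal PowerCLI sketch (the VM name is a placeholder); for a blocked migration you would expect entries such as MigrateVM_Task or RelocateVM_Task in the output:

# Show the API methods currently disabled on the VM
(Get-VM 'problem_vm' | Get-View).DisabledMethod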

ESXi host fails to enter maintenance mode

Recently I was doing some host patching, and several hosts in different clusters caused me issues. On closer look I noticed that vMotion did not migrate VMs away from the host. The vMotion tasks failed with the message: A general system error occurred: Invalid state

The solution was to disconnect the host from vCenter and re-connect it. After that vMotion worked and I was able to patch the host.
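For reference, the disconnect/reconnect can also be done from PowerCLI. A minimal sketch, assuming $esx_host already holds the affected host:

Set-VMHost -VMHost $esx_host -State Disconnected -Confirm:$false
Set-VMHost -VMHost $esx_host -State Connected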

Home lab upgraded to vSphere 7 … almost. Updated!!

I have upgraded my home lab to vSphere 7, with the exception of one host. Currently I have the following hardware in my home lab – two HPE Proliant DL380 Gen8 (E5-2600 v2 series CPUs), a SuperMicro SYS-2028R-C1R4+ (E5-2600 v3 series CPU) and an HPE Proliant DL380 G7 (X5600 series CPU). I used the original VMware ISO to perform the upgrades.

Supermicro SYS-2028R-C1R4+

I started with the Supermicro. It was complaining about unsupported devices -> "Unsupported devices [8086:10a7 15d9:10a7] [8086:10a7 15d9:10a7] found on the host." During remediation I checked "Ignore warnings about unsupported hardware devices" and after some time the host was upgraded.

HPE Proliant DL380 Gen8

The HPE Proliant DL380 Gen8 servers also had unsupported devices detected -> “Unsupported devices [8086:105e 103c:7044] [8086:105e 103c:7044] found on the host.”

They also had some VIBs installed that were missing dependencies:

QLC_bootbank_qfle3f_1.0.68.0-1OEM.670.0.0.8169922
HPE_bootbank_scsi-hpdsa_5.5.0.68-1OEM.550.0.0.1331820
QLC_bootbank_qedi_2.10.15.0-1OEM.670.0.0.8169922
HPE_bootbank_scsi-hpdsa_5.5.0.68-1OEM.550.0.0.1331820
QLC_bootbank_qedf_1.3.36.0-1OEM.600.0.0.2768847
QLC_bootbank_qedf_1.3.36.0-1OEM.600.0.0.2768847
QLC_bootbank_qedf_1.3.36.0-1OEM.600.0.0.2768847
QLC_bootbank_qedi_2.10.15.0-1OEM.670.0.0.8169922

I used the following commands to remove them:

esxcli software vib remove --vibname qedf
esxcli software vib remove --vibname qedi
esxcli software vib remove --vibname qfle3f
esxcli software vib remove --vibname scsi-hpdsa

After this I upgraded the hosts, again checking the "Ignore warnings about unsupported hardware devices" option.

HPE Proliant DL380 G7

The HPE Proliant DL380 G7 has an unsupported X5650 CPU and at first I was not able to upgrade it. I guessed it would need to be replaced with something newer.

Update: I used the "AllowLegacyCPU=true" option to enable the upgrade on the HPE DL380 G7 with the X5650 CPU. More info – https://www.virtuallyghetto.com/2020/04/quick-tip-allow-unsupported-cpus-when-upgrading-to-esxi-7-0.html
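If I recall the linked post correctly, the option is supplied as a kernel boot option for the installer/upgrade image, for example by pressing Shift+O at the installer boot prompt and appending it to the existing boot options (the placeholder stands for whatever options are already there):

<existing boot options> allowLegacyCPU=true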