Reduce the amount of ESXi logs

We were looking into the amount of ESXi logs we were collecting and discovered that two applications in ESXi were logging at the verbose level, even though we had set "config.HostAgent.log.level" to info. Those applications were rhttpproxy and fdm. They were generating millions of log lines per day.
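For reference, the host agent log level itself can be checked and corrected with PowerCLI. A minimal sketch, assuming an existing vCenter session:

### Script start
$esx_hosts = Get-VMHost -State Maintenance,Connected

foreach ($esx_host in $esx_hosts) {
    # Read the current host agent log level
    $setting = Get-AdvancedSetting -Entity $esx_host -Name "config.HostAgent.log.level"
    Write-Host "$esx_host : $($setting.Value)"
    if ($setting.Value -ne "info") {
        $setting | Set-AdvancedSetting -Value "info" -Confirm:$false
    }
}
### Script end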

rhttpproxy

To reduce the rhttpproxy log level, edit /etc/vmware/rhttpproxy/config.xml and replace the "verbose" value in the log level section with "info", for example. After this, restart the rhttpproxy service.
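A minimal sketch of the relevant section in config.xml after the change (surrounding elements omitted), followed by the service restart from the ESXi shell:

<log>
  <level>info</level>
</log>

/etc/init.d/rhttpproxy restart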

fdm (HA agent)

To change the HA agent log level, add the option "das.config.log.level" with value "info" to the cluster advanced settings. After this, disable High Availability on the cluster and then re-enable it.

PowerCLI lines to do this ($cluster is the target cluster; the name below is an example):

$cluster = Get-Cluster -Name "Cluster01"
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name 'das.config.log.level' -Value info -Confirm:$false
Set-Cluster -Cluster $cluster -HAEnabled:$false -Confirm:$false
Set-Cluster -Cluster $cluster -HAEnabled:$true -Confirm:$false

NTP service not starting on ESXi 7 after restart

We noticed that the NTP service does not start after patching ESXi 7, although it is configured to "Start and stop with host". According to a VMware KB article (link), there is no fix for this issue at the moment.

I wrote a small script, which I run every 5 minutes, to check the NTP service and start it if it's stopped.

$esx_hosts = Get-VMHost -State Maintenance,Connected
$esx_hosts | Get-VMHostService | Where-Object {$_.Key -eq "ntpd" -and -not $_.Running} | Start-VMHostService
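When run unattended (for example from a scheduler), the script needs its own vCenter session around it. A sketch, with the vCenter name as a placeholder:

Connect-VIServer -Server "vcenter.example.com"
# ... the two lines above ...
Disconnect-VIServer -Confirm:$false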

ESXi 7.0.1 loses access to USB/SD-card

We had several Dell servers with ESXi installed on an SD card that still showed missing patches after installing 7.0.1 build 16850804. When we restarted such an ESXi host, it reverted back to 7.0.0. After some investigation we noticed issues with /bootbank and /altbootbank. We also saw this issue on a freshly installed ESXi 7.0.1 build 16850804.

Some links about the issue:
https://www.reddit.com/r/vmware/comments/j92b40/fix_for_usb_booted_esxi_7_hosts_losing_access_to/
https://kb.vmware.com/s/article/2149444

The fix is to add a new parameter, devListStabilityCount=10, to the kernelopt line in boot.cfg. We also added this parameter on 6.7.0 hosts before upgrading them to 7.0.1.
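For illustration, the kernelopt line in /bootbank/boot.cfg ends up looking something like this (keep whatever options are already on the line):

kernelopt=<existing options> devListStabilityCount=10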

VMFS6 heap size issue on ESXi 7 affecting running VMs – Updated

We ran into an issue where two of our VMs froze, and after investigation we discovered it was a VMFS6 heap size issue on ESXi 7. The error on the ESXi host is "WARNING: Heap: 3651: Heap vmfs3 already at its maximum size. Cannot expand."

Error from the VM side: There is no more space for virtual disk 'vm_1.vmdk'. You might be able to continue this session by freeing disk space on the relevant volume, and clicking Retry. Click Cancel to terminate this session.

VMware has a KB article about it: VMFS-6 heap memory exhaustion on vSphere 7.0 ESXi hosts (80188)

This issue is fixed in ESXi 7 Update 1.

To work around this issue, follow the steps below (from the KB article):

1. Create an eager zeroed thick disk on each mounted VMFS6 datastore.
vmkfstools -c 10M -d eagerzeroedthick /vmfs/volumes/datastore/eztDisk

2. Delete the eager zeroed thick disk created in step 1.
vmkfstools -U /vmfs/volumes/datastore/eztDisk

The workaround has to be performed for each datastore on each host.
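A sketch of scripting the workaround from the ESXi shell on each host; the parsing of the esxcli output is an assumption, so verify the mount points it returns before running:

### Script start
# Create and then delete a small eagerzeroedthick disk on every mounted VMFS-6 volume.
esxcli storage filesystem list | grep 'VMFS-6' | awk '{print $1}' | while read vol; do
  vmkfstools -c 10M -d eagerzeroedthick "$vol/eztDisk"
  vmkfstools -U "$vol/eztDisk"
done
### Script end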

Checking ESXi firewall status via PowerCLI

I discovered that some hosts in our environment did not have the firewall enabled. So I wrote a small PowerShell script to check the firewall status and enable the firewall if it is not enabled.

The script:

$esx_hosts = Get-VMHost -State Maintenance,Connected

foreach ($esx_host in $esx_hosts)
{
    Write-Host "$esx_host checking"
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $fw_status = ($esxcli.network.firewall.get.invoke()).Enabled
    Write-Host "$esx_host - $fw_status"

    if ($fw_status -eq "false") {
        Write-Host "Enabling FW" -ForegroundColor Green
        $arguments = $esxcli.network.firewall.set.CreateArgs()
        $arguments.enabled = "true"
        $esxcli.network.firewall.set.invoke($arguments)
    }
}

Upgrading to ESXi 7.0 – The upgrade has VIBS that are missing dependencies

I was upgrading a cluster of HPE servers from ESXi 6.7 to ESXi 7.0, and one host complained about missing VIB dependencies: "The upgrade has VIBS that are missing dependencies". That's normal when you upgrade ESXi hosts that have third-party tools and drivers installed. I encountered similar issues while upgrading my home lab (blog post about it). The easy fix is to remove those VIBs. In this case, what threw me off was that the VIBs reported to have missing dependencies were dell-configuration-vib and dellemc_osname_idrac.
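For reference, listing and removing a VIB from the ESXi shell normally looks like this; in this case the VIBs did not show up in the list and the removal failed, as described below:

esxcli software vib list | grep -i dell
esxcli software vib remove -n dell-configuration-vib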

Since it was an HPE server, the Dell VIBs made no sense. I was not able to remove them: they did not appear in the list of installed VIBs, and the remove command also failed.

What had happened with this server is that I had mistakenly installed it using the custom ISO for Dell servers instead of the custom ISO for HPE servers.

To fix the issue, I'm going to perform a clean installation of ESXi 7.0, this time using the custom HPE ISO.

ESXi host fails to enter maintenance mode

Recently I was doing some host patching, and several hosts in different clusters caused issues for me. On closer look I noticed that vMotion did not migrate VMs away from the host. The vMotion tasks failed with the message: A general system error occurred: Invalid state

The solution was to disconnect the host from vCenter and reconnect it. After that, vMotion worked and I was able to patch the host.
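The same disconnect/reconnect can be done with PowerCLI. A sketch, with the host name as a placeholder:

$esx_host = Get-VMHost -Name "esxi1.example.com"
Set-VMHost -VMHost $esx_host -State Disconnected -Confirm:$false
Set-VMHost -VMHost $esx_host -State Connected -Confirm:$false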

HPE ProLiant Gen9 servers lose connection to SD-card

In recent months we have had several issues with different HPE ProLiant BL460c Gen9 servers, where ESXi reports errors when it needs to access the OS disk, which in this case is an SD card. In some cases, after we restarted ESXi, the server no longer booted, since the OS SD card was no longer visible to the BIOS. Initially we thought our SD cards were dead, but when we replaced some of them and checked the failed cards, they appeared to be OK. So the next time we had a failed SD card, we did an e-fuse reset for the server through the Onboard Administrator, and it booted up correctly. The SD card was again visible to the BIOS and ESXi booted correctly.

Command to perform an e-fuse reset from the Onboard Administrator: server reset <bay_number>

Change LUN queue depth in ESXi 6.7

Some time ago I had to change the default queue depth for all LUNs in a cluster.

First I needed to determine which driver (module) my HBA is using. I used the following script for that.

### Script start
$esx_hosts = Get-VMHost esxi1*

foreach ($esx_host in $esx_hosts) {
    Write-Host $esx_host
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $esxcli.storage.core.adapter.list.invoke() | Select-Object HBAName, Driver, Description
}
### Script end

The output shows each HBA with its driver (module) name and description.

To change the LUN queue depth parameter, I used the following script.

### Script start
$esx_hosts = Get-VMHost esx1*

foreach ($esx_host in $esx_hosts) {
    Write-Host $esx_host
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $args1 = $esxcli.system.module.parameters.set.CreateArgs()
    $args1.parameterstring = "lpfc_lun_queue_depth=128"
    $args1.module = "brcmfcoe"
    $esxcli.system.module.parameters.set.invoke($args1)
}
### Script end

After running this, you need to restart the ESXi host for the module parameter to take effect.
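A sketch of the restart with PowerCLI, assuming DRS can evacuate the host (the host name is a placeholder):

$esx_host = Get-VMHost -Name "esx1.example.com"
Set-VMHost -VMHost $esx_host -State Maintenance
Restart-VMHost -VMHost $esx_host -Confirm:$false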

After that, I used the following script to set "Maximum Outstanding Disk Requests for virtual machines".

### Script start
$esx_hosts = Get-VMHost esx1*

foreach ($esx_host in $esx_hosts)
{
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $devices = $esxcli.storage.core.device.list.invoke()
    foreach ($device in $devices)
    {
        if ($device.device -imatch "naa.6")
        {
            $arguments3 = $esxcli.storage.core.device.set.CreateArgs()
            $arguments3.device = $device.device
            $arguments3.schednumreqoutstanding = 128
            Write-Host $device.Device
            $esxcli.storage.core.device.set.invoke($arguments3)
        }
    }
}
### Script end

To check the LUN queue depth, I use the following script.

### Script start
$esx_hosts = Get-VMHost esx1*

foreach ($esx_host in $esx_hosts) {
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $ds_list = $esxcli.storage.core.device.list.invoke()

    foreach ($ds1 in $ds_list) {
        $arguments3 = $esxcli.storage.core.device.list.CreateArgs()
        $arguments3.device = $ds1.device
        $esxcli.storage.core.device.list.Invoke($arguments3) | Select-Object Device, DeviceMaxQueueDepth, NoofoutstandingIOswithcompetingworlds
    }
}
### Script end

ESXi 6.7 Update 3 (14320388) repeating Alarm ‘Host hardware sensor state’

Update: This issue has been fixed in ESXi 6.7 build 15018017.

After upgrading some of our ESXi hosts to ESXi 6.7 U3, we started seeing a lot of repeating 'Host hardware sensor state' alarms.

Example:

Alarm 'Host hardware sensor state' on <esxi_hostname> triggered by event 13762875 'Sensor -1 type , Description Intel Corporation Sky Lake-E Ubox Registers #8 state assert for . Part Name/Number N/A N/A Manufacturer N/A'
Alarm 'Host hardware sensor state' on <esxi_hostname> triggered by event 13762874 'Sensor -1 type , Description Intel Corporation Sky Lake-E IOAPIC #5 state assert for . Part Name/Number N/A N/A Manufacturer N/A'
Alarm 'Host hardware sensor state' on <esxi_hostname> triggered by event 13762873 'Sensor -1 type , Description Intel Corporation Sky Lake-E RAS #5 state assert for . Part Name/Number N/A N/A Manufacturer N/A'

We see this issue on HPE Gen9 and Gen10 servers. Others have reported that it also occurs on other hardware (Reddit thread). For now, we have disabled the alarm, since it was spamming our events and our syslog.
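Disabling the alarm can also be done with PowerCLI. A minimal sketch, assuming the default alarm definition name:

Get-AlarmDefinition -Name "Host hardware sensor state" | Set-AlarmDefinition -Enabled:$false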