Reduce the amount of ESXi logs

We were looking into the amount of ESXi logs we were collecting and discovered that two applications in ESXi were logging at the verbose level even though we had set “config.HostAgent.log.level” to info. Those applications were rhttpproxy and fdm, and they were generating millions of log lines per day.

rhttpproxy

To reduce the rhttpproxy log level, edit /etc/vmware/rhttpproxy/config.xml and replace the “verbose” value in the log level section with, for example, “info”. After this, restart the rhttpproxy service.
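The relevant part of config.xml looks roughly like this (a sketch; the exact layout of the file can vary between ESXi builds, only the level element needs to change):

<log>
   ...
   <level>info</level>   <!-- was: verbose -->
   ...
</log>

After saving the file the service can be restarted from the ESXi shell with /etc/init.d/rhttpproxy restart.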

fdm (HA agent)

To change the HA agent log level you need to modify the cluster advanced settings and add the option “das.config.log.level” with the value “info”. After this, disable High Availability on the cluster and then re-enable it.

PowerCLI lines to do this:
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name 'das.config.log.level' -Value info
Set-Cluster -Cluster $cluster -HAEnabled:$false
Set-Cluster -Cluster $cluster -HAEnabled:$true
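To confirm the setting is in place before toggling HA, Get-AdvancedSetting can be used ($cluster is the cluster object; the name below is just an example):

# $cluster = Get-Cluster -Name 'Cluster01'
Get-AdvancedSetting -Entity $cluster -Name 'das.config.log.level'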

NTP service not starting on ESXi 7 after restart.

We noticed that the NTP service does not start after ESXi 7 patching even though it is configured to “Start and stop with host”. According to a VMware KB article (link) there is no fix for this issue at the moment.

I wrote a small script which I run every 5 minutes to check the NTP service and start it if it is stopped.

$esx_hosts = Get-VMHost -State Maintenance,Connected
$esx_hosts | Get-VMHostService | Where-Object {$_.Key -eq "ntpd" -and -not $_.Running} | Start-VMHostService
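As an extra check, the same pipeline can show the service startup policy; Policy should report “on” when the service is set to “Start and stop with host”:

$esx_hosts | Get-VMHostService | Where-Object {$_.Key -eq "ntpd"} | Select-Object VMHost, Policy, Running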

ESXi 7.0.1 loses access to USB/SD-card.

We had several Dell servers with ESXi installed on an SD card that still showed missing patches after installing the 7.0.1 build 16850804. When we restarted such an ESXi host it reverted back to 7.0.0. After some investigation we noticed issues with /bootbank and /altbootbank. We also saw the issue on a freshly installed ESXi 7.0.1 build 16850804.

Some links about the issue:
https://www.reddit.com/r/vmware/comments/j92b40/fix_for_usb_booted_esxi_7_hosts_losing_access_to/
https://kb.vmware.com/s/article/2149444

The fix is to add a new parameter to the kernelopt line in boot.cfg. The parameter is devListStabilityCount=10. We also added this parameter on 6.7.0 hosts before upgrading them to 7.0.1.
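The kernelopt line in /bootbank/boot.cfg should end up looking something like this (the existing options differ from host to host, only the appended parameter matters):

kernelopt=<existing options> devListStabilityCount=10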

VMFS6 heap size issue on ESXi 7 affecting running VMs – Updated

We ran into an issue where two of our VMs froze, and after investigation we discovered it was an issue with the VMFS6 heap size on ESXi 7. The error on the ESXi side is “WARNING: Heap: 3651: Heap vmfs3 already at its maximum size. Cannot expand.”

Error from VM side: There is no more space for virtual disk ‘vm_1.vmdk’. You might be able to continue this session by freeing disk space on the relevant volume, and clicking Retry. Click Cancel to terminate this session.

VMware has a KB article about it: VMFS-6 heap memory exhaustion on vSphere 7.0 ESXi hosts (80188)

This issue is fixed in ESXi 7 Update 1.

To work around this issue, follow the steps below (from the KB article):

1. Create an eager-zeroed thick disk on each mounted VMFS6 datastore:
vmkfstools -c 10M -d eagerzeroedthick /vmfs/volumes/datastore/eztDisk

2. Delete the eager-zeroed thick disk created in step 1:
vmkfstools -U /vmfs/volumes/datastore/eztDisk

The workaround has to be done for each datastore on each host.
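To list which datastores need the workaround on a given host, a small read-only PowerCLI sketch like this can help (the host name is just an example):

$esx_host = Get-VMHost -Name "esx1.example.local"
Get-Datastore -VMHost $esx_host | Where-Object {$_.Type -eq "VMFS" -and $_.FileSystemVersion -like "6*"} | Select-Object Name, FileSystemVersion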

Checking ESXi firewall status via PowerCLI

I discovered that some hosts in our environment did not have the firewall enabled. So I wrote a small PowerShell script to check the firewall status and enable the firewall where it is not enabled.

The script:

$esx_hosts = Get-VMHost -State Maintenance,Connected

foreach ($esx_host in $esx_hosts)
{
    Write-Host $esx_host checking
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $fw_status = ($esxcli.network.firewall.get.invoke()).Enabled
    Write-Host $esx_host - $fw_status

    if ($fw_status -eq "false") {
        Write-Host Enabling FW -ForegroundColor Green
        $arguments = $esxcli.network.firewall.set.CreateArgs()
        $arguments.enabled = "true"
        $esxcli.network.firewall.set.invoke($arguments)
    }
}

Migration option grayed out for VM

A couple of days ago I had to move several VMs from one datastore to another. For two of the VMs the migration option was grayed out. I had seen this before: the backup software disables vMotion/Storage vMotion during a backup and sometimes does not remove the lock after the backup is finished. Until now I had always gone into the database and deleted those “locks” from the vpx_disabled_methods table, but it seems it is also possible to clear them using the MOB (Managed Object Browser). I followed the instructions in the KB article and removed the “locks” from the database – https://kb.vmware.com/s/article/1029926.

ESXi host fails to enter maintenance mode

Recently I was doing some host patching and several hosts in different clusters gave me trouble. On closer inspection I noticed that vMotion did not migrate VMs away from the host. The vMotion tasks failed with the message: A general system error occurred: Invalid state

The fix was to disconnect the host from vCenter and re-connect it. After that vMotion worked again and I was able to patch the host.
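The disconnect/re-connect can also be done with PowerCLI, roughly like this (the host name is just an example):

$esx_host = Get-VMHost -Name "esx1.example.local"
Set-VMHost -VMHost $esx_host -State Disconnected
Set-VMHost -VMHost $esx_host -State Connected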

HPE ProLiant Gen9 servers lose connection to SD-card

In recent months we have had several issues with different HPE ProLiant BL460c Gen9 servers where ESXi reports errors when it needs to access the OS disk, which in this case is an SD-card. In some cases, after we restarted ESXi the server no longer booted, because the OS SD-card was no longer visible to the BIOS. Initially we thought that our SD-cards were dead, but when we replaced some of them and checked the failed cards they appeared to be OK. So the next time we had a “failed” SD-card we did an E-fuse reset for the server through the Onboard Administrator and it booted up correctly. The SD-card was again visible to the BIOS and ESXi booted correctly.

Command to perform the e-fuse reset from the Onboard Administrator: server reset <bay_number>

Invalid CPU reservation for the latency-sensitive VM

Recently some VMs went down during regular patching. When I checked, they were not powered on, and when I tried to power them on I got an error – “Invalid CPU reservation for the latency-sensitive VM, (sched.cpu.min) should be at least 6990 MHz.”.

What had happened is that someone had changed the VM’s latency sensitivity to “High” without making the corresponding CPU and RAM reservations. This would not have been a problem during a normal restart of the VM, but I had set the advanced setting “vmx.reboot.PowerCycle” to TRUE because I needed the VM to pick up some new CPU features. This setting causes the VM to power cycle during a normal OS reboot, and since the reservations were not properly set, the VM failed to power on. After fixing the reservations the VM powered on successfully. The corresponding error message about the RAM reservation looks like this – “Invalid memory setting: memory reservation (sched.mem.min) should be equal to memsize(4096)”.
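The reservations can be fixed with PowerCLI before powering the VM on. A sketch, assuming the VM from the errors above needs the full 6990 MHz of CPU and 4096 MB of memory reserved (the values come from the error messages, the VM name is just an example):

$vm = Get-VM -Name "vm_1"
$vm | Get-VMResourceConfiguration | Set-VMResourceConfiguration -CpuReservationMhz 6990 -MemReservationGB 4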

To check VM latency sensitivity using PowerCLI I use the Virten.net.VimAutomation module written by Florian Grehl (his blog) – https://www.powershellgallery.com/packages/Virten.net.VimAutomation/1.3.0. The module has three useful cmdlets for viewing and changing VM latency sensitivity – Get-VMLatencySensitivity, Get-VMLatencySensitivityBulk and Set-VMLatencySensitivity.
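A rough usage sketch (the exact parameter names and pipeline support may differ between module versions, so check Get-Help for each cmdlet):

Install-Module -Name Virten.net.VimAutomation
Get-VMLatencySensitivityBulk
Get-VM -Name "vm_1" | Get-VMLatencySensitivity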

Change LUN queue depth in ESXi 6.7

Some time ago I had to change the default queue depth for all LUNs in a cluster.

First I needed to determine which driver (module) my HBAs were using. I used the following script for that.

### Script start
$esx_hosts = Get-VMHost esxi1*

foreach ($esx_host in $esx_hosts) {
    Write-Host $esx_host
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $esxcli.storage.core.adapter.list.invoke() | Select-Object HBAName, Driver, Description
}
### Script end

The output lists the HBA name, the driver (module) name and a description for each adapter; in our case the HBAs were using the brcmfcoe driver.

To change the LUN queue depth parameter I used the following script:

### Script start
$esx_hosts = Get-VMHost esx1*

foreach ($esx_host in $esx_hosts) {
    Write-Host $esx_host
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $args1 = $esxcli.system.module.parameters.set.CreateArgs()
    $args1.parameterstring = "lpfc_lun_queue_depth=128"
    $args1.module = "brcmfcoe"
    $esxcli.system.module.parameters.set.invoke($args1)
}
### Script end

After running this you need to restart the ESXi host for the module parameter to take effect.
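After the reboot the parameter can be verified on a given host through the same esxcli interface, for example like this (assuming the brcmfcoe module and parameter from above):

$esxcli = Get-EsxCli -VMHost $esx_host -V2
$arguments = $esxcli.system.module.parameters.list.CreateArgs()
$arguments.module = "brcmfcoe"
$esxcli.system.module.parameters.list.invoke($arguments) | Where-Object {$_.Name -eq "lpfc_lun_queue_depth"}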

After that I used the following script to set “Maximum Outstanding Disk Requests for virtual machines”:

### Script start
$esx_hosts = Get-VMHost esx1*

foreach ($esx_host in $esx_hosts)
{
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $devices = $esxcli.storage.core.device.list.invoke()
    foreach ($device in $devices)
    {
        if ($device.device -imatch "naa.6")
        {
            $arguments3 = $esxcli.storage.core.device.set.CreateArgs()
            $arguments3.device = $device.device
            $arguments3.schednumreqoutstanding = 128
            Write-Host $device.Device
            $esxcli.storage.core.device.set.invoke($arguments3)
        }
    }
}
### Script end

To check the LUN queue depth I use the following script:

### Script start
$esx_hosts = Get-VMHost esx1*

foreach ($esx_host in $esx_hosts) {
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $ds_list = $esxcli.storage.core.device.list.invoke()

    foreach ($ds1 in $ds_list) {
        $arguments3 = $esxcli.storage.core.device.list.CreateArgs()
        $arguments3.device = $ds1.device
        $esxcli.storage.core.device.list.Invoke($arguments3) | Select-Object Device, DeviceMaxQueueDepth, NoofoutstandingIOswithcompetingworlds
    }
}
### Script end