VMFS6 heap size issue on ESXi 7 affecting running VMs – Updated

We ran into an issue where two of our VMs froze, and after investigation we discovered it was a problem with the VMFS6 heap size on ESXi 7. The error on the ESXi host is “WARNING: Heap: 3651: Heap vmfs3 already at its maximum size. Cannot expand.”

Error from VM side: There is no more space for virtual disk ‘vm_1.vmdk’. You might be able to continue this session by freeing disk space on the relevant volume, and clicking Retry. Click Cancel to terminate this session.

VMware has a KB article about it: VMFS-6 heap memory exhaustion on vSphere 7.0 ESXi hosts (80188)

This issue is fixed in ESXi 7 Update 1.

To work around this issue, follow the steps below (from the KB article):

1. Create an eager zeroed thick disk on each of the mounted VMFS6 datastores.
vmkfstools -c 10M -d eagerzeroedthick /vmfs/volumes/datastore/eztDisk

2. Delete the eager zeroed thick disk created in step 1.
vmkfstools -U /vmfs/volumes/datastore/eztDisk

The workaround has to be done for each datastore on each host.
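
Since the two commands have to be repeated for every datastore, a small shell helper can save some typing. This is only a sketch, not something from the KB article: the function name vmfs6_heap_workaround and the RUN variable are my own invention, and the datastore paths are passed in by hand rather than discovered automatically. By default it only prints the commands so they can be reviewed first.

```shell
# Hypothetical helper: applies the create/delete workaround to each
# datastore path given as an argument.
# RUN is an assumption of this sketch: when unset, commands are only
# printed (dry run); when set to an empty value, they are executed.
vmfs6_heap_workaround() {
    for ds in "$@"; do
        disk="$ds/eztDisk"
        # Step 1: create a small eager zeroed thick disk
        ${RUN-echo} vmkfstools -c 10M -d eagerzeroedthick "$disk" || return 1
        # Step 2: delete it again
        ${RUN-echo} vmkfstools -U "$disk" || return 1
    done
}
```

Calling vmfs6_heap_workaround /vmfs/volumes/datastore1 /vmfs/volumes/datastore2 prints the four commands; prefixing the call with RUN= (an empty value) runs them for real. It still has to be run on every host.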

Checking ESXi firewall status via PowerCLI

I discovered that some hosts in our environment did not have the firewall enabled. So I wrote a small PowerShell script to check the firewall status and enable the firewall where it is not enabled.

The script:

# Check every host that is connected or in maintenance mode
$esx_hosts = Get-VMHost -State Maintenance,Connected

foreach ($esx_host in $esx_hosts)
{
    Write-Host "$esx_host checking"
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $fw_status = ($esxcli.network.firewall.get.Invoke()).Enabled
    Write-Host "$esx_host - $fw_status"

    # The Enabled value is compared as the string "false"
    if ($fw_status -eq "false") {
        Write-Host "Enabling FW" -ForegroundColor Green
        $arguments = $esxcli.network.firewall.set.CreateArgs()
        $arguments.enabled = $true
        $esxcli.network.firewall.set.Invoke($arguments)
    }
}
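
For a single host the same check can be done directly in the ESXi shell with esxcli network firewall get and esxcli network firewall set. The sketch below is not part of the script above: the helper name fw_enabled is made up, and it assumes the get command prints a line of the form "Enabled: true" or "Enabled: false".

```shell
# Hypothetical helper: reads the output of "esxcli network firewall get"
# on stdin and succeeds (exit 0) only if it contains "Enabled: true".
# The output format is an assumption of this sketch.
fw_enabled() {
    grep -Eq '^[[:space:]]*Enabled:[[:space:]]*true[[:space:]]*$'
}
```

On a host it could be used like this: esxcli network firewall get | fw_enabled || esxcli network firewall set --enabled true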

Upgrading to ESXi 7.0 – The upgrade has VIBS that are missing dependencies

I was upgrading a cluster of HPE servers from ESXi 6.7 to ESXi 7.0 and one host was complaining about missing VIB dependencies – “The upgrade has VIBS that are missing dependencies”. That’s normal when you upgrade ESXi hosts that have 3rd party tools and drivers installed. I encountered similar issues while upgrading my home lab (blog post about it). The easy fix is to remove those VIBs. In this case what threw me off was that the VIBs reported to have missing dependencies were dell-configuration-vib and dellemc_osname_idrac.

Since it was an HPE server the Dell vibs made no sense. I was not able to remove them as they did not appear in the list when I listed installed VIBs and the remove command also failed.

What had happened is that I had mistakenly installed this server using the custom ISO for Dell servers and not the custom ISO for HPE servers.

To fix the issue I’m going to perform a clean installation of ESXi 7.0, using the custom HPE ISO this time.

Migration option grayed out for VM

A couple of days ago I had to move several VMs from one datastore to another. For two VMs the migration option was grayed out. I had seen this before: backup software disables vMotion/Storage vMotion during a backup and does not remove the lock after the backup has finished. Until now I always went into the database and deleted those “locks” from the vpx_disabled_methods table, but it seems that it is also possible to clear them using the MOB (Managed Object Browser). I followed the instructions in the KB article and removed the “locks” from the database – https://kb.vmware.com/s/article/1029926.

ESXi host fails to enter maintenance mode

Recently I was doing some host patching and several hosts in different clusters gave me trouble. On closer look I noticed that vMotion did not migrate the VMs away from the host. The vMotion tasks failed with the message: A general system error occurred: Invalid state

The fix was to disconnect the host from vCenter and reconnect it. After that vMotion worked and I was able to patch the host.

Home lab upgraded to vSphere 7 … almost. Updated!!

I have updated my home lab to vSphere 7 with the exception of one host. Currently I have the following hardware in my home lab – two HPE Proliant DL380 Gen8 (E5-2600 v2 series CPU), a SuperMicro SYS-2028R-C1R4+ (E5-2600 v3 series CPU) and an HPE Proliant DL380 G7 (X5600 series CPU). I used the original VMware ISO to perform the upgrades.

Supermicro SYS-2028R-C1R4+

I started with the Supermicro. It was complaining about unsupported devices -> “Unsupported devices [8086:10a7 15d9:10a7] [8086:10a7 15d9:10a7] found on the host.”. During remediation I checked “Ignore warnings about unsupported hardware devices” and after some time the host was upgraded.

HPE Proliant DL380 Gen8

The HPE Proliant DL380 Gen8 servers also had unsupported devices detected -> “Unsupported devices [8086:105e 103c:7044] [8086:105e 103c:7044] found on the host.”

They also had some VIBs installed that were missing dependencies:

QLC_bootbank_qfle3f_1.0.68.0-1OEM.670.0.0.8169922
HPE_bootbank_scsi-hpdsa_5.5.0.68-1OEM.550.0.0.1331820
QLC_bootbank_qedi_2.10.15.0-1OEM.670.0.0.8169922
HPE_bootbank_scsi-hpdsa_5.5.0.68-1OEM.550.0.0.1331820
QLC_bootbank_qedf_1.3.36.0-1OEM.600.0.0.2768847
QLC_bootbank_qedf_1.3.36.0-1OEM.600.0.0.2768847
QLC_bootbank_qedf_1.3.36.0-1OEM.600.0.0.2768847
QLC_bootbank_qedi_2.10.15.0-1OEM.670.0.0.8169922

I used the following commands to remove them:

esxcli software vib remove --vibname qedf
esxcli software vib remove --vibname qedi
esxcli software vib remove --vibname qfle3f
esxcli software vib remove --vibname scsi-hpdsa

After this I upgraded the hosts while again checking the “Ignore warnings about unsupported hardware devices” option.
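
The reported list contains duplicates, so it helps to reduce it to unique VIB names before removing anything. Here is a rough sketch of how that could be scripted. The function name vib_remove_cmds is made up, and it assumes every identifier has the form vendor_bootbank_name_version with no underscore inside the name itself. It only prints the remove commands and does not run them.

```shell
# Hypothetical helper: reads VIB identifiers (one per line) on stdin and
# prints one "esxcli software vib remove" command per unique VIB name.
# Assumes the name is the third underscore-separated field.
vib_remove_cmds() {
    cut -d_ -f3 | sort -u | while read -r name; do
        if [ -n "$name" ]; then
            echo "esxcli software vib remove --vibname $name"
        fi
    done
}
```

Feeding it the list above yields the same four commands I ran by hand, one each for qedf, qedi, qfle3f and scsi-hpdsa.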

HPE Proliant DL380 G7

The HPE Proliant DL380 G7 has an unsupported X5650 CPU and at first I was not able to upgrade it. I guessed it would need to be replaced with something newer.

Update: I used the “AllowLegacyCPU=true” option to enable the upgrade on the HPE DL380 G7 with the X5650 CPU. More info – https://www.virtuallyghetto.com/2020/04/quick-tip-allow-unsupported-cpus-when-upgrading-to-esxi-7-0.html

HPE ProLiant Gen9 servers lose connection to SD-card

In recent months we have had several issues with different HPE ProLiant BL460c Gen9 servers where ESXi reported errors when it needed to access the OS disk, which in this case is an SD-card. In some cases, after we restarted ESXi, the server no longer booted because the OS SD-card was no longer visible to the BIOS. Initially we thought that our SD-cards were dead, but when we replaced some of them and checked the failed cards they appeared to be OK. So the next time we had a failed SD-card we did an e-fuse reset of the server through Onboard Administrator and it booted up correctly. The SD-card was visible to the BIOS again and ESXi booted correctly.

Command to perform an e-fuse reset from Onboard Administrator -> reset server <bay_number>

Invalid CPU reservation for the latency-sensitive VM

Recently some VMs went down during regular patching. When I checked they were not powered on and when I tried to power them on I got an error – “Invalid CPU reservation for the latency-sensitive VM, (sched.cpu.min) should be at least 6990 MHz.”.

What had happened is that someone had changed the latency sensitivity of this VM to “High” without setting the proper CPU and RAM reservations. That would not have been a problem during a normal restart of the VM, but I had set the advanced setting “vmx.reboot.PowerCycle” to TRUE since I needed the VM to pick up some new CPU features. This setting causes the VM to power cycle during a normal OS reboot. Since the reservations were not properly set, the VM failed to power on. After fixing the reservations the VM powered on successfully. The error message about the RAM reservation looks like this – “Invalid memory setting: memory reservation (sched.mem.min) should be equal to memsize(4096)”.
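
To get a feel for where the MHz figure in the CPU error comes from, here is a small calculation sketch. It rests on my assumption that a latency-sensitive VM needs a full CPU reservation, that is, the number of vCPUs multiplied by the clock speed of a host core. The function name and the numbers in the example are purely illustrative and are not taken from the affected VMs.

```shell
# Hypothetical helper: full CPU reservation in MHz for a VM with
# latency sensitivity set to High.
# Assumption: required reservation = vCPU count * host core clock (MHz).
full_cpu_reservation_mhz() {
    vcpus=$1
    core_mhz=$2
    echo $(( vcpus * core_mhz ))
}
```

For example, full_cpu_reservation_mhz 4 2400 prints 9600, so under this assumption a 4 vCPU VM on 2400 MHz cores would need a reservation of 9600 MHz. The memory side is simpler: the error message says the reservation has to equal the configured memory size.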

To check the VM latency sensitivity using PowerCLI I use the Virten.net.VimAutomation module written by Florian Grehl (his blog) – https://www.powershellgallery.com/packages/Virten.net.VimAutomation/1.3.0. This module has 3 useful commands for viewing and changing VM latency sensitivity – Get-VMLatencySensitivity, Get-VMLatencySensitivityBulk and Set-VMLatencySensitivity.

Change LUN queue depth in ESXi 6.7

Some time ago I had to change the default queue depth for all LUNs in a cluster.

First I needed to determine which driver (module) my HBAs were using. I used the following script for that.

### Script start 
$esx_hosts = get-vmhost esxi1*

foreach ($esx_host in $esx_hosts) {
Write-Host $esx_host
$esxcli = Get-EsxCli -VMhost $esx_host -V2
$esxcli.storage.core.adapter.list.invoke() |select HBAName, Driver, Description
}
### Script end

The output lists the HBA name, driver and description of each adapter.

To change the LUN queue depth parameter I used the following script

### Script start
$esx_hosts = Get-VMHost esx1*

foreach ($esx_host in $esx_hosts) {
    Write-Host $esx_host
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    # Set the queue depth parameter on the HBA driver module found above
    $args1 = $esxcli.system.module.parameters.set.CreateArgs()
    $args1.parameterstring = "lpfc_lun_queue_depth=128"
    $args1.module = "brcmfcoe"
    $esxcli.system.module.parameters.set.Invoke($args1)
}
### Script end

After running this you need to restart the ESXi host.

After that I used the following script to set “Maximum Outstanding Disk Requests for virtual machines”

### Script start
$esx_hosts = Get-VMHost esx1*

foreach ($esx_host in $esx_hosts)
{
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $devices = $esxcli.storage.core.device.list.Invoke()
    foreach ($device in $devices)
    {
        # Only change devices whose identifier contains "naa.6"
        if ($device.Device -imatch "naa\.6")
        {
            $arguments3 = $esxcli.storage.core.device.set.CreateArgs()
            $arguments3.device = $device.Device
            $arguments3.schednumreqoutstanding = 128
            Write-Host $device.Device
            $esxcli.storage.core.device.set.Invoke($arguments3)
        }
    }
}
### Script end
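
The same setting can also be applied from the ESXi shell of a single host, where esxcli storage core device set takes the device with -d and the outstanding requests value with -O. The helper below is only a sketch and not what I ran: the name sched_num_req_cmds is made up, and it assumes that esxcli storage core device list prints each device identifier on its own unindented line. It prints the commands instead of executing them.

```shell
# Hypothetical helper: reads "esxcli storage core device list" output on
# stdin and prints one set command per device starting with "naa.6".
# $1 is the value for "Maximum Outstanding Disk Requests".
sched_num_req_cmds() {
    depth=$1
    grep '^naa\.6' | while read -r dev; do
        echo "esxcli storage core device set -d $dev -O $depth"
    done
}
```

Usage on a host would be: esxcli storage core device list | sched_num_req_cmds 128. After reviewing the printed commands, the output can be piped to sh to apply them.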

To check the LUN queue depth I use the following script

### Script start
$esx_hosts = Get-VMHost esx1*
foreach($esx_host in $esx_hosts){
$esxcli=get-esxcli -VMHost $esx_host -V2
$ds_list = $esxcli.storage.core.device.list.invoke()

foreach ($ds1 in $ds_list) {
 $arguments3 = $esxcli.storage.core.device.list.CreateArgs()
 $arguments3.device = $ds1.device
 $esxcli.storage.core.device.list.Invoke($arguments3) | select Device,DeviceMaxQueueDepth,NoofoutstandingIOswithcompetingworlds
 }
}
### Script end

vCenter 6.7 Update 3 issue with rsyslog

Update: VMware has released a patch for vCenter which fixes the issue – release notes.

Recently we noticed that logs from vCenter did not reach our log servers. When I restarted rsyslog on the vCenter appliance the logs started to arrive again, but after some time they stopped again. There is a thread in the VMware community about this issue – https://communities.vmware.com/thread/618178. In short, the fix will most likely be available in the next patch for vCenter. The thread also describes a workaround: manually updating the rsyslog package on the appliance.