Migration option grayed out for VM

A couple of days ago I had to move several VMs from one datastore to another. For two of the VMs the migration option was grayed out. I had seen this before: backup software disables vMotion/Storage vMotion during a backup and sometimes does not remove the lock after the backup has finished. Until now I had always gone into the database and deleted those “locks” from the vpx_disabled_methods table, but it is also possible to clear them using the MOB (Managed Object Browser). I followed the instructions in the KB article and removed the “locks” from the database – https://kb.vmware.com/s/article/1029926.
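To see whether such a lock is still in place, the disabled methods can be read straight from the vSphere API with PowerCLI. A minimal sketch – the VM name is a placeholder:

### Script start
# Sketch: list the API methods currently disabled on a VM.
# 'MyVM' is a placeholder; migration-related methods (e.g. MigrateVM_Task,
# RelocateVM_Task) showing up here long after the backup has finished
# would indicate a stale lock.
$vm = Get-VM -Name 'MyVM'
$vm.ExtensionData.DisabledMethod | Sort-Object
### Script end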

ESXi host fails to enter maintenance mode

Recently I was doing some host patching and several hosts in different clusters caused issues. On closer look I noticed that vMotion did not migrate the VMs away from the host. The vMotion tasks failed with the message: A general system error occurred: Invalid state

The fix was to disconnect the host from vCenter and reconnect it. After that vMotion worked and I was able to patch the host.
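The same disconnect/reconnect can also be done with PowerCLI – a small sketch, with the host name as a placeholder:

### Script start
# Sketch: disconnect and reconnect a host to clear the stuck vMotion state.
# 'esx1.domain.local' is a placeholder host name.
$vmhost = Get-VMHost -Name 'esx1.domain.local'
Set-VMHost -VMHost $vmhost -State Disconnected -Confirm:$false
Set-VMHost -VMHost $vmhost -State Connected -Confirm:$false
### Script end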

Home lab upgraded to vSphere 7 … almost.

I have updated my home lab to vSphere 7, with the exception of one host. Currently I have the following hardware in my home lab – two HPE ProLiant DL380 Gen8 servers (E5-2600 v2 series CPUs), a SuperMicro SYS-2028R-C1R4+ (E5-2600 v3 series CPU) and an HPE ProLiant DL380 G7 (X5600 series CPU). I used the original VMware ISO to perform the upgrades.

Supermicro SYS-2028R-C1R4+

I started with the Supermicro. It complained about unsupported devices -> “Unsupported devices [8086:10a7 15d9:10a7] [8086:10a7 15d9:10a7] found on the host.” During remediation I checked “Ignore warnings about unsupported hardware devices” and after some time the host was upgraded.

HPE ProLiant DL380 Gen8

The HPE ProLiant DL380 Gen8 servers also had unsupported devices detected -> “Unsupported devices [8086:105e 103c:7044] [8086:105e 103c:7044] found on the host.”

They also had some VIBs installed that were missing dependencies:

QLC_bootbank_qfle3f_1.0.68.0-1OEM.670.0.0.8169922
HPE_bootbank_scsi-hpdsa_5.5.0.68-1OEM.550.0.0.1331820
QLC_bootbank_qedi_2.10.15.0-1OEM.670.0.0.8169922
HPE_bootbank_scsi-hpdsa_5.5.0.68-1OEM.550.0.0.1331820
QLC_bootbank_qedf_1.3.36.0-1OEM.600.0.0.2768847
QLC_bootbank_qedf_1.3.36.0-1OEM.600.0.0.2768847
QLC_bootbank_qedf_1.3.36.0-1OEM.600.0.0.2768847
QLC_bootbank_qedi_2.10.15.0-1OEM.670.0.0.8169922

I used the following commands to remove them:

esxcli software vib remove --vibname=qedf
esxcli software vib remove --vibname=qedi
esxcli software vib remove --vibname=qfle3f
esxcli software vib remove --vibname=scsi-hpdsa
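If the same VIBs have to be removed from more than a couple of hosts, the removal can also be scripted through Get-EsxCli -V2, similar to the scripts later in this post. A sketch – the host name pattern is a placeholder, the VIB names are the ones listed above:

### Script start
# Sketch: remove the conflicting VIBs from all matching hosts via esxcli (V2).
# Note: Invoke() throws an error if a VIB is not present on a host.
$vib_names = 'qedf', 'qedi', 'qfle3f', 'scsi-hpdsa'
foreach ($esx_host in Get-VMHost esx*) {
    Write-Host $esx_host
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    foreach ($vib in $vib_names) {
        $args1 = $esxcli.software.vib.remove.CreateArgs()
        $args1.vibname = $vib
        $esxcli.software.vib.remove.Invoke($args1)
    }
}
### Script end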

After this I upgraded the hosts while again checking the “Ignore warnings about unsupported hardware devices” option.

HPE ProLiant DL380 G7

The HPE ProLiant DL380 G7 has an unsupported X5650 CPU, so I was not able to upgrade it. I guess it needs to be replaced with something newer.

HPE ProLiant Gen9 servers lose connection to SD card

In recent months we have had several issues with different HPE ProLiant BL460c Gen9 servers where ESXi logs errors when it needs to access the OS disk, which in this case is an SD card. In some cases, after we restarted ESXi the server no longer booted, because the OS SD card was no longer visible to the BIOS. Initially we thought that our SD cards were dead, but when we replaced some of them and checked the failed cards, they appeared to be OK. So the next time we had a failed SD card we did an e-fuse reset of the server through Onboard Administrator, and it booted up correctly. The SD card was again visible to the BIOS and ESXi booted correctly.

Command to perform e-fuse reset from Onboard Administrator -> server reset <bay_number>

Invalid CPU reservation for the latency-sensitive VM

Recently some VMs went down during regular patching. When I checked, they were not powered on, and when I tried to power them on I got an error – “Invalid CPU reservation for the latency-sensitive VM, (sched.cpu.min) should be at least 6990 MHz.”

What had happened was that someone had changed the VM latency sensitivity to “High” without setting the required CPU and RAM reservations. That would not have been a problem during a normal restart of the VM, but I had set the advanced setting “vmx.reboot.PowerCycle” to TRUE because I needed the VM to pick up some new CPU features. This setting causes the VM to power cycle during a normal OS reboot, and since the reservations were not properly set, the VM failed to power on. After fixing the reservations the VM powered on successfully. The corresponding error message for the RAM reservation looks like this – “Invalid memory setting: memory reservation (sched.mem.min) should be equal to memsize(4096)”.

To check VM latency sensitivity using PowerCLI I use the Virten.net.VimAutomation module written by Florian Grehl (his blog) – https://www.powershellgallery.com/packages/Virten.net.VimAutomation/1.3.0. The module has three useful cmdlets for viewing and changing VM latency sensitivity – Get-VMLatencySensitivity, Get-VMLatencySensitivityBulk and Set-VMLatencySensitivity.
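The latency sensitivity level can also be read directly from the vSphere API, and the reservations fixed with the standard resource configuration cmdlets. A sketch – the VM name is a placeholder, and the 6990 MHz and 4 GB values are taken from the error messages above and would of course differ per VM:

### Script start
# Sketch: check the latency sensitivity level and set full reservations.
# 'MyVM' is a placeholder; reservation values must match the VM's CPU/memory.
$vm = Get-VM -Name 'MyVM'
$vm.ExtensionData.Config.LatencySensitivity.Level

Get-VMResourceConfiguration -VM $vm |
    Set-VMResourceConfiguration -CpuReservationMhz 6990 -MemReservationGB 4
### Script end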

Change LUN queue depth in ESXi 6.7

Some time ago I had to change the default queue depth for all LUNs in a cluster.

First I needed to determine which driver (module) the HBAs are using. I used the following script for that.

### Script start
$esx_hosts = Get-VMHost esxi1*

foreach ($esx_host in $esx_hosts) {
    Write-Host $esx_host
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $esxcli.storage.core.adapter.list.Invoke() | Select-Object HBAName, Driver, Description
}
### Script end

The output shows each HBA with the driver (module) it is using and a description.

To change the LUN queue depth parameter I used the following script (in this case the HBAs were using the brcmfcoe module, so the parameter to set is lpfc_lun_queue_depth):

### Script start
$esx_hosts = Get-VMHost esx1*

foreach ($esx_host in $esx_hosts) {
    Write-Host $esx_host
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    # Set the LUN queue depth module parameter for the brcmfcoe driver
    $args1 = $esxcli.system.module.parameters.set.CreateArgs()
    $args1.parameterstring = "lpfc_lun_queue_depth=128"
    $args1.module = "brcmfcoe"
    $esxcli.system.module.parameters.set.Invoke($args1)
}
### Script end

After running this you need to restart the ESXi host for the new module parameter to take effect.
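After the reboot the parameter value can be verified with something along these lines – a sketch using the same Get-EsxCli -V2 approach and the same host name pattern:

### Script start
# Sketch: verify the brcmfcoe module parameter after the reboot.
foreach ($esx_host in Get-VMHost esx1*) {
    Write-Host $esx_host
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $esxcli.system.module.parameters.list.Invoke(@{module = 'brcmfcoe'}) |
        Where-Object { $_.Name -eq 'lpfc_lun_queue_depth' } |
        Select-Object Name, Value
}
### Script end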

After that I used the following script to set “Maximum Outstanding Disk Requests for virtual machines” (schednumreqoutstanding) on the devices:

### Script start
$esx_hosts = Get-VMHost esx1*

foreach ($esx_host in $esx_hosts) {
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $devices = $esxcli.storage.core.device.list.Invoke()
    foreach ($device in $devices) {
        # Only change SAN LUNs (NAA IDs starting with naa.6)
        if ($device.Device -imatch "naa.6") {
            $arguments3 = $esxcli.storage.core.device.set.CreateArgs()
            $arguments3.device = $device.Device
            $arguments3.schednumreqoutstanding = 128
            Write-Host $device.Device
            $esxcli.storage.core.device.set.Invoke($arguments3)
        }
    }
}
### Script end

To check the LUN queue depth I use the following script:

### Script start
$esx_hosts = Get-VMHost esx1*

foreach ($esx_host in $esx_hosts) {
    $esxcli = Get-EsxCli -VMHost $esx_host -V2
    $ds_list = $esxcli.storage.core.device.list.Invoke()

    foreach ($ds1 in $ds_list) {
        $arguments3 = $esxcli.storage.core.device.list.CreateArgs()
        $arguments3.device = $ds1.Device
        $esxcli.storage.core.device.list.Invoke($arguments3) | Select-Object Device, DeviceMaxQueueDepth, NoofoutstandingIOswithcompetingworlds
    }
}
### Script end

vCenter 6.7 Update 3 issue with rsyslog

Update: VMware has released a patch for vCenter which fixes the issue – release notes.

Recently we noticed that logs from vCenter did not reach our log servers. When I restarted rsyslog on the vCenter appliance the logs started flowing again, but after some time they stopped again. There is a thread in the VMware community about this issue – https://communities.vmware.com/thread/618178. In short, the fix will most likely be available in the next patch for vCenter. The thread also describes a workaround for manually updating the rsyslog package on the appliance.

ESXi 6.7 Update 3 (14320388) repeating Alarm ‘Host hardware sensor state’

Update: This issue has been fixed in ESXi 6.7 build 15018017.

After upgrading some of our ESXi hosts to ESXi 6.7 U3 we started seeing a lot of repeating ‘Host hardware sensor state’ alarms.

Example:

Alarm ‘Host hardware sensor state’ on <esxi_hostname> triggered by event 13762875 ‘Sensor -1 type , Description Intel Corporation Sky Lake-E Ubox Registers #8 state assert for . Part Name/Number N/A N/A Manufacturer N/A’
Alarm ‘Host hardware sensor state’ on <esxi_hostname> triggered by event 13762874 ‘Sensor -1 type , Description Intel Corporation Sky Lake-E IOAPIC #5 state assert for . Part Name/Number N/A N/A Manufacturer N/A’
Alarm ‘Host hardware sensor state’ on <esxi_hostname> triggered by event 13762873 ‘Sensor -1 type , Description Intel Corporation Sky Lake-E RAS #5 state assert for . Part Name/Number N/A N/A Manufacturer N/A’

We see this issue on HPE Gen9 and Gen10 servers. Others have reported that it also affects other hardware (Reddit thread). For now we have disabled the alarm since it was spamming our events and also our syslog.
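Disabling the alarm can be done in the vSphere Client, or with PowerCLI along these lines – a sketch using the alarm name from the events above:

### Script start
# Sketch: disable the 'Host hardware sensor state' alarm definition in vCenter.
Get-AlarmDefinition -Name 'Host hardware sensor state' |
    Set-AlarmDefinition -Enabled:$false
### Script end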

vMotion fails with error – “Failed to receive migration”

At some point I noticed that vMotion for several VMs failed with the message “Failed to receive migration”.

After some investigation I discovered that the VM advanced setting “mks.enable3d” had been changed from TRUE to FALSE without powering off the VM. After power cycling the VM, vMotion started working again. For the VMs I was not able to power cycle, I changed the mks.enable3d value back to TRUE and vMotion then started working for them as well.
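Checking and flipping the setting back can be done with the standard PowerCLI advanced-setting cmdlets. A sketch, assuming the desired value is TRUE as in the original configuration:

### Script start
# Sketch: find powered-on VMs where mks.enable3d is FALSE and set it back to TRUE.
$vms = Get-VM | Where-Object { $_.PowerState -eq 'PoweredOn' }
foreach ($vm in $vms) {
    $setting = Get-AdvancedSetting -Entity $vm -Name 'mks.enable3d'
    if ($setting -and "$($setting.Value)" -match 'FALSE') {
        Set-AdvancedSetting -AdvancedSetting $setting -Value 'TRUE' -Confirm:$false
    }
}
### Script end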

I guess that’s why you should change advanced settings on powered-off VMs instead of powered-on ones.

VM disk consolidation fails – “Unable to access file since it is locked”

A couple of times per month I see errors during backups where a VM has orphaned snapshots that are locked and prevent new backups from being performed. Under Tasks I see several failed “Consolidate virtual machine disk files” tasks with the status “Unable to access file since it is locked”.

To unlock the file I usually restart the management agents, from the console, on the host where the VM was located when the error occurred.

I have written about this type of issue before, when it happened to me on ESXi 5.5 – VM DISK CONSOLIDATION FAILURES