Monitor and troubleshoot VMware ESXi through logs

I recently worked with Graylog and Grafana to collect, analyze and visualize VMware ESXi logs. I’ve started a new page where I will collect all the search strings I have found useful.

Some examples of Grafana charts

I will update that page with new strings as I upgrade from vSphere 5.5 to vSphere 6.5.

Incompatible device backing specified for device ’13’

I was doing some shared-nothing live migrations between two VMware clusters (version 5.5) and got the following error at 25% of the migration: “Incompatible device backing specified for device ’13’”. Searching the internet pointed to network adapter issues, but in this case the network adapter was not the cause.

The issue in this case was a raw device mapping (RDM) that had a different LUN ID in the destination cluster.
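A mismatch like this can be spotted by comparing the LUN ID each cluster reports for the same device. The following is a minimal sketch, assuming the device-to-LUN-ID mappings have already been exported from both clusters by some other means. The function name, the dictionary layout and the `naa.example…` identifiers are hypothetical and only illustrate the comparison.

```python
def find_lun_id_mismatches(source_luns, destination_luns):
    """Return devices whose LUN ID differs between two clusters.

    Both arguments map a device identifier (for example an naa.* canonical
    name) to the LUN ID under which that device is presented.
    """
    mismatches = {}
    for device, source_id in source_luns.items():
        destination_id = destination_luns.get(device)
        # Only report devices visible on both sides with different IDs
        if destination_id is not None and destination_id != source_id:
            mismatches[device] = (source_id, destination_id)
    return mismatches


# Hypothetical inventory data, not taken from a real environment
source = {"naa.example0001": 5, "naa.example0002": 6}
destination = {"naa.example0001": 5, "naa.example0002": 12}
print(find_lun_id_mismatches(source, destination))
# prints {'naa.example0002': (6, 12)}
```

Any device listed in the result is a candidate for the “Incompatible device backing” error if it backs an RDM.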

vMotion between the clusters worked for the VM once the datastore was made visible to all the hosts. Storage vMotion did not work in the destination cluster; it failed with the same error.

The solution for me was to present the destination datastore to the original hosts, perform the Storage vMotion in the original location and then perform a vMotion to the destination cluster.

Another solution I tested:

  • Shut down the VM
  • Remove the RDM
  • Perform migration to destination cluster
  • Reattach the RDM

Snapshot fails for VM with running Docker container

Recently I noticed that some Linux VM backups were failing, and sometimes even crashing, with the following errors:
An error occurred while taking a snapshot: msg.snapshot.error-QUIESCINGERROR.
An error occurred while saving the snapshot: msg.snapshot.error-QUIESCINGERROR.

On closer inspection, another error was visible in the hostd.log file: Error when enabling the sync provider.
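To check other hosts for the same message, a copy of hostd.log can be scanned for the error string. This is a minimal sketch, assuming the log has been copied somewhere Python can read it. The function name and the default search string constant are my own, not part of any VMware tooling.

```python
from pathlib import Path

# Error text as it appeared in hostd.log
SYNC_ERROR = "Error when enabling the sync provider"


def find_log_lines(log_path, needle=SYNC_ERROR):
    """Return (line_number, text) pairs for lines containing needle."""
    matches = []
    # errors="replace" keeps the scan going if the log has odd bytes
    with Path(log_path).open(encoding="utf-8", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if needle in line:
                matches.append((number, line.rstrip("\n")))
    return matches
```

Calling `find_log_lines("hostd.log")` on a local copy returns an empty list when the error is not present.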

All of these VMs had one thing in common: they were running Docker containers.
I was not able to figure out why it happened, but I found a workaround: disable the VMware sync driver.

Copied from the Veritas KB article: https://www.veritas.com/support/en_US/article.000021419

Steps to Disable VMware vmsync driver
To prevent the vmsync driver from being called during the quiesce phase of a VMware snapshot, edit the VMware Tools configuration file as follows:

1) Open a console session to the Redhat Linux virtual machine.
2) Navigate to the /etc/vmware-tools directory
3) Using a text editor, modify the tools.conf file with the following entry

[vmbackup]
enableSyncDriver = false

Note: If the tools.conf file does not exist, create a new empty file and add the above parameters.
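The edit above can also be scripted when many VMs need the same change. This is a minimal sketch, assuming tools.conf is a simple INI-style file that Python's `configparser` can parse. The function name is hypothetical. Note that `configparser` rewrites the whole file, so any comments in an existing tools.conf would be lost.

```python
import configparser
from pathlib import Path


def disable_sync_driver(conf_path="/etc/vmware-tools/tools.conf"):
    """Set enableSyncDriver = false in the [vmbackup] section of tools.conf."""
    path = Path(conf_path)
    parser = configparser.ConfigParser(interpolation=None)
    # Keep option names case-sensitive; the default would lowercase them
    parser.optionxform = str
    if path.exists():
        parser.read(path, encoding="utf-8")
    if not parser.has_section("vmbackup"):
        parser.add_section("vmbackup")
    parser.set("vmbackup", "enableSyncDriver", "false")
    # Creates the file if it does not exist yet
    with path.open("w", encoding="utf-8") as conf:
        parser.write(conf)
```

Running `disable_sync_driver()` as root inside the guest writes the same two lines shown above. Passing another path makes it easy to try the function on a copy first.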

 

ESXi will not resume syslog log sending when destination has been down for some time

Recently I was playing with ESXi syslog and Logstash + Graylog. For some reason the Logstash instance died. After restarting Logstash, only some ESXi hosts resumed sending logs. A quick Google search revealed that this is a known issue and that the solution is to reload syslog on the host. After I ran the following script in PowerCLI against my vCenter, log sending resumed.

# Reload the syslog service on every host known to the connected vCenter
foreach ($vihost in Get-VMHost) {
    $esxcli = Get-EsxCli -VMHost $vihost
    $esxcli.system.syslog.reload()
}
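Before reloading syslog on the hosts, it can help to confirm that the destination is accepting connections again. This is a minimal sketch, assuming the hosts send syslog over TCP; a UDP destination cannot be probed this way. The function name is my own, and the host and port in the usage note are placeholders.

```python
import socket


def destination_is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out or unreachable: treat all as "not listening"
        return False
```

For example, `destination_is_listening("<logstash-host>", 514)` would be called with the actual address and port of the Logstash input.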

Good information about PowerCLI and ESXCLI:
http://www.virten.net/2016/11/how-to-use-esxcli-v2-commands-in-powercli/
http://www.virten.net/2014/02/howto-use-esxcli-in-powercli/

 

Firmware update fails on HPE server when Serial Number and Product ID are missing

Recently I was having issues updating an HPE ProLiant BL460c G7 with the latest SPP (2016.10). The firmware update just stopped at Step 1. The HPE custom ESXi ISO also failed to work.

After some digging around I discovered that the server's Serial Number and Product ID were missing. I went into the BIOS and filled in the correct Serial Number and Product ID; after that the firmware update worked and I was also able to install the HPE custom ESXi.

I suspect that the Serial Number and Product ID were lost when this blade server was removed from one Virtual Connect infrastructure and placed in another.

Error joining ESXi host to Active Directory

I was trying to join an ESXi host to Active Directory using PowerCLI:

Get-VMHost <vmhost> | Get-VMHostAuthentication | Set-VMHostAuthentication -JoinDomain -Domain "domain.com/Servers/ESXi" -User "<username>" -Password "<password>"

The command returned an error:

Active directory authentication store is not supported for VMHost <hostname>

I tried to join it via the old C# client; that failed too.

The AD join finally worked via the VMware Host Client running on the host. The VMware Host Client can be accessed via a web browser at https://<hostname>/ui/#/login.

Error during vMotion: The source detected that the destination failed to resume.

During a vMotion one of my VMs refused to move and vCenter gave the following error:

The VM failed to resume on the destination during early power on.
Module DiskEarly power on failed.
Cannot open the disk ‘/vmfs/volumes/…./…./???????.vmdk’ or one of the snapshot disks it depends on.
Could not open/create change tracking file

I turned the server off and tried to delete the change tracking file. Got an error:

Cannot delete file [<Datastore>] VMFolder/VM-ctk.vmdk

I migrated the VM to another host and tried to power it on. Got an error:

An error was received from the ESXi host while powering on VM VMName.
Failed to start the virtual machine.
Cannot open the disk ‘/vmfs/volumes/5783364b-08fc763e-1389-00215a9b0098/lx61261.sbcore.net/lx61261.sbcore.net.vmdk’ or one of the snapshot disks it depends on.
Could not open/create change tracking file

Next I rebooted the ESXi host on which the problematic VM had initially been running. After that I was able to delete the *-ctk.vmdk file and power on the VM. It seems that for some reason there were file locks on the change tracking files, which prevented operations on the VM.
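To see which change tracking files a VM folder contains before deleting anything, the folder can be listed by file name pattern. This is a minimal sketch, assuming the VM folder is reachable as an ordinary directory from wherever the script runs. The function name is hypothetical, and it only lists files; it does not delete them or check for locks.

```python
from pathlib import Path


def find_change_tracking_files(vm_folder):
    """Return the *-ctk.vmdk files in vm_folder, sorted by name."""
    return sorted(Path(vm_folder).glob("*-ctk.vmdk"))
```

Each returned entry is a `Path` object, so `path.name` gives the file name to look for in the datastore browser.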