Virtual Machine network latency issue

Recently we were debugging an issue where network latency on some VMs rose well above normal as soon as their vCPUs were under load. With the CPU loaded we saw ping response times of up to 30 ms within the same VLAN; the normal value is usually below 0.5 ms. After several failed attempts, one of my colleagues found a thread on the SUSE Forums that described the same issue. The thread pointed us to a VM advanced setting called “sched.cpu.latencySensitivity”, which was set to “low” on the problematic VMs. That was exactly the case in our environment as well: all problematic VMs had this setting set to “low”. We shut down the VMs, changed the value of “sched.cpu.latencySensitivity” to “normal”, and the issue was fixed. The latency is now consistently below 0.5 ms.

To check the value for an individual VM you can use the Web Client or the following command:
Get-VM -Name <VMNAME> | Get-AdvancedSetting -Name sched.cpu.latencySensitivity

If the response is empty, the setting does not exist and I guess the “normal” value is used.

To check this setting on all VMs I used this script (developed from a script I got from this page):
Get-VM | Select-Object Name, @{N="CPU Scheduler priority";E={
    ($_.ExtensionData.Config.ExtraConfig |
        Where-Object {$_.Key -eq "sched.cpu.latencySensitivity"}).Value}}

To fix the setting through PowerCLI I used this script (developed from a script I got from this page):
Get-VM -Name "VMNAME" | Get-View | ForEach-Object {
    # Build a reconfigure spec that carries only the one advanced setting
    $ConfigSpec = New-Object VMware.Vim.VirtualMachineConfigSpec
    $OValue = New-Object VMware.Vim.OptionValue
    $OValue.Key = "sched.cpu.latencySensitivity"
    $OValue.Value = "normal"
    $ConfigSpec.ExtraConfig += $OValue
    # ReconfigVM_Task starts the reconfiguration and returns a task reference
    $task = $_.ReconfigVM_Task($ConfigSpec)
    Write-Output "$($_.Name) - changed"
}

We found several other VMs where this setting was set to “low”. We currently have no idea why some VMs had it set to “low”. There is a VMware Community thread where at least two other people report similar issues with this setting.

Corrupted server profile in HP blade server after firmware upgrade

Recently we were applying SPP 2016.04 to some of our blade servers. After the upgrade, one of the servers had no network connectivity. From the ESXi console everything looked OK. Tried a cold boot: nothing. Tried downgrading the Emulex CNA firmware: nothing. Tried the latest Emulex firmware again: nothing. Finally I turned off the server, went to VCEM (Virtual Connect Enterprise Manager), and edited the faulty profile by just clicking Edit and then saving the profile again. After powering up the server, everything was OK. I guess the firmware update somehow damaged the profile, and re-applying it through VCEM fixed it.

Change tracking target file already exists

After upgrading to VMware ESXi 5.5 U3 we started seeing random snapshot errors during backups with the following message: “An error occurred while saving the snapshot: Change tracking target file already exists.” The issue is caused by a leftover CBT file that is not deleted when the backup software removes the snapshot.

After we submitted several logs and traces, VMware acknowledged that the issue exists and said it will be fixed for ESXi 5.5 in the June patch release and for ESXi 6.0 in the July patch release.

For now, when we detect a problematic VM, we browse the datastore and delete the leftover CBT file.

ScaleIO 2.0 now available

EMC released ScaleIO 2.0 a couple of days ago.

Some new features (source: ScaleIO 2.0 release notes):

  • Extended MDM cluster – introduces the option of a 5-node MDM cluster, which is able to withstand two points of failure.
  • Read Flash Cache (RFcache) – use PCI flash cards and/or SSDs for caching of the HDDs in the SDS.
  • User authentication using Active Directory (AD) over LDAP.
  • The multiple SDS feature – allows the installation of multiple SDSs on a single Linux or VMware-based server.
  • Oscillating failure handling – provides the ability to handle error situations, and to reduce their impact on normal system operation. This feature detects and reports various oscillating failures, in cases when components fail repeatedly and cause unnecessary failovers.
  • Instant maintenance mode – allows you to restart a server that hosts an SDS, without initiating data migration or exposing the system to the danger of having only a single copy of data.
  • Communication between the ScaleIO system and ESRS (EMC Secure Remote Support) servers is now supported – this feature replaces the call-home mechanism. It allows authorized access to the ScaleIO system for support sessions.
  • Authenticate communication between the ScaleIO MDM and SDS components, and between the MDM and external components, using a Public and Private Key (Key-Pair) associated with a certificate – this will allow strong authentication of components associated with a given ScaleIO system. A Certificate Authority certificate or self-signed certificate can be used.
  • In-flight checksum protection provided for data reads and writes – this feature addresses errors that change the payload during the transit through the ScaleIO system.
  • Performance profiles – predefined settings that affect system performance.

ScaleIO can be downloaded from the EMC website. ScaleIO 2.0 supports VMware (5.5 and 6.0), Linux and Windows.

More information about ScaleIO 2.0 can be found on Chad Sakac's blog.

Check out all of my posts about ScaleIO from here.

Storage vMotion fails on ESXi 5.5 for VMs with large memory reservation

I was doing some storage migrations, and in one of the clusters Storage vMotions failed with the message: Failed to resume destination VM: msg.vmk.status.VMK_MEM_ADMIT_FAILED.

Relocate virtual machine failed

After a closer look I found that all the VMs that failed had large memory reservations (192 GB and 256 GB), while the ESXi 5.5 host had only 512 GB of RAM. I will not go into why those reservations were set, but to Storage vMotion those VMs successfully I temporarily removed the reservations. After that, Storage vMotion succeeded. My guess is that during Storage vMotion VMware checks whether the destination host has enough resources for the VM (even though in this case the destination host was the same as the source) and fails the migration if it does not.
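To put numbers on that guess, here is a minimal sketch. It assumes the host has to hold the reservation of the source VM and of its destination copy at the same time, and that both reserved VMs run on the same 512 GB host. The function name, the VM names and the simple sum are my own illustration, not VMware's documented admission logic, and hypervisor and per-VM overhead are ignored.

```python
def can_admit_destination_copy(host_memory_gb, reserved_gb_by_vm, moving_vm):
    """Return True if a second copy of moving_vm's reservation still fits.

    Assumption: during the move the host must back the reservations of all
    running VMs plus one extra copy of the moving VM's reservation.
    """
    already_reserved = sum(reserved_gb_by_vm.values())
    needed_for_copy = reserved_gb_by_vm[moving_vm]
    return already_reserved + needed_for_copy <= host_memory_gb


# Hypothetical layout: both reserved VMs on one 512 GB host.
reservations = {"vm-a": 192, "vm-b": 256}
print(can_admit_destination_copy(512, reservations, "vm-a"))  # 448 + 192 > 512
print(can_admit_destination_copy(512, reservations, "vm-b"))  # 448 + 256 > 512

# With the moving VM's reservation temporarily removed, the copy fits.
reservations["vm-a"] = 0
print(can_admit_destination_copy(512, reservations, "vm-a"))  # 256 + 0 <= 512
```

If the real check works anything like this, it would explain why removing the reservation for the duration of the migration was enough.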

VM showing 0 bytes used from old datastore after Storage vMotion

Recently I migrated some VMs to new datastores and ran into a problem. Two VMs were shown as still using the old datastore, although the Used Space column showed 0 B used. This sometimes happens when a VM's CD-ROM or floppy drive is configured to use an image file, but this time that was not the case.


After some digging around I found that it was the VM log file (vmware.log) that had not moved during Storage vMotion.

To fix the issue I did the following:

  1. Shut down the VM
  2. Downloaded the VMX file
  3. Edited the VMX file parameter “vmx.log.filename”, replacing the existing wrong full path with “./vmware.log”
  4. Uploaded the VMX file back to the VM folder
  5. Started the VM
  6. Executed another Storage vMotion operation

After the Storage vMotion the old datastore was released and I was able to decommission it.
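The edit in step 3 can also be scripted on the downloaded copy of the file. The following is only a sketch: it assumes the VMX file is plain text with one `key = "value"` entry per line, and the function name and the sample path are made up for illustration.

```python
import re

# Matches a line such as: vmx.log.filename = "/some/full/path/vmware.log"
# Group 1 keeps the key and the "=" exactly as they were written.
LOG_LINE = re.compile(r'^(\s*vmx\.log\.filename\s*=\s*)".*"\s*$', re.IGNORECASE)


def reset_log_path(vmx_text, new_value="./vmware.log"):
    """Return vmx_text with every vmx.log.filename entry set to new_value."""
    fixed_lines = []
    for line in vmx_text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            line = f'{match.group(1)}"{new_value}"'
        fixed_lines.append(line)
    return "\n".join(fixed_lines) + "\n"


# Hypothetical VMX fragment with a log path that still points at the old datastore.
sample = (
    'displayName = "vm01"\n'
    'vmx.log.filename = "/vmfs/volumes/old-datastore/vm01/vmware.log"\n'
)
print(reset_log_path(sample), end="")
```

All other lines are passed through untouched, so the result can be uploaded back to the VM folder as in step 4. Keep a copy of the original VMX file before overwriting it.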

SSD caching could decrease performance – part 2

In the second part of “SSD caching could decrease performance” I will cover how the IO read/write ratio and IO size affect SSD performance. Part 1 is accessible here.

Read IO and write IO ratio

Most real workloads are mixed IO workloads, with both disk reads and disk writes. The read/write ratio describes how the IO is split between the two. In many cases enterprise SSD disks have equally good read and write performance. Lately, however, MLC and especially TLC drives have made their way into the enterprise market, and with some of them read and write performance is not equal. In addition, SSD disks may become slower over time when only a small number of free blocks remain. To mitigate the free block issue, SSD vendors build extra capacity into the disks; for example, the Intel DC S3700 SSD has about 32% extra capacity.

SSD disks usually handle reads better than writes. Before selecting your SSD disk vendor and model I recommend doing some research. If possible, purchase a few different models that would suit your needs and test them in real-life scenarios.

IO size

When it comes to performance, IO size matters. Large IOs could overload a small number of SSD disks and thereby hurt performance. I would avoid caching workloads with an IO size above 128 KB. In my personal experience I have seen a database where SSD caching hindered performance because of the database's multi-block reads.
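One way to check a workload against the 128 KB rule of thumb is to look at what share of its IOs are larger than that. The sketch below assumes you already have a list of observed IO sizes in KB from whatever monitoring tool you use; the function name and the sample values are made up.

```python
def large_io_share(io_sizes_kb, threshold_kb=128):
    """Return the fraction of IOs whose size is above threshold_kb."""
    if not io_sizes_kb:
        return 0.0
    large = sum(1 for size in io_sizes_kb if size > threshold_kb)
    return large / len(io_sizes_kb)


# Made-up sample: mostly small IOs plus a few large multi-block reads.
sizes_kb = [8, 8, 16, 64, 256, 512, 1024, 8]
share = large_io_share(sizes_kb)
print(f"{share:.1%} of IOs are larger than 128 KB")
```

How large a share is too large is a judgment call; the point is only to look at the distribution before enabling caching rather than at the average IO size.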

My recommendations for a more successful SSD caching project

  • Determine which VMs would benefit from SSD caching: VMs doing a certain amount of IO or more.
  • Analyze the IO workloads: the read/write ratio (there is no point in read caching when the server is only writing) and the IO size.
  • Check your hardware: controller speeds and capabilities. There is no point in connecting a fast SSD to a crappy controller.
  • Find suitable SSD disks for caching, weighing price against performance.
  • Talk with the storage admins: it might be that SSD in the array makes more sense than SSD in the server.
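The first two recommendations can be turned into a rough first-pass filter. This is a sketch under stated assumptions: the per-VM read and write IOPS come from your own monitoring, and the class name, the function name and both thresholds (500 total IOPS, 50% reads) are hypothetical placeholders to tune for your environment.

```python
from dataclasses import dataclass


@dataclass
class VmIoStats:
    name: str
    read_iops: float
    write_iops: float


def read_cache_candidates(stats, min_total_iops=500, min_read_share=0.5):
    """Return VMs that do enough IO and read enough to gain from a read cache."""
    candidates = []
    for vm in stats:
        total = vm.read_iops + vm.write_iops
        if total < min_total_iops:
            continue  # too little IO for caching to matter
        if vm.read_iops / total < min_read_share:
            continue  # mostly writes, a read cache would not help
        candidates.append(vm)
    return candidates


# Made-up sample data.
vms = [
    VmIoStats("db01", read_iops=1800, write_iops=600),
    VmIoStats("log01", read_iops=20, write_iops=900),
    VmIoStats("idle01", read_iops=5, write_iops=5),
]
for vm in read_cache_candidates(vms):
    print(vm.name)
```

A filter like this only narrows the list; IO size, controller capabilities and the storage admins' view still need to be checked per VM.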