Snapshot fails for VM with running Docker container

Recently I noticed that some Linux VM backups were failing, and sometimes even crashing, with the following errors:
An error occurred while taking a snapshot: msg.snapshot.error-QUIESCINGERROR.
An error occurred while saving the snapshot: msg.snapshot.error-QUIESCINGERROR.

On closer inspection another error was visible in the hostd.log file – Error when enabling the sync provider.

All of these VMs had one thing in common – they were running Docker containers.
I was not able to figure out why this happened, but I did find a workaround – disable the VMware sync driver.

Copied from the Veritas KB article – https://www.veritas.com/support/en_US/article.000021419

Steps to Disable VMware vmsync driver
To prevent the vmsync driver from being called during the quiesce phase of a VMware snapshot, edit the VMware Tools configuration file as follows:

1) Open a console session to the Red Hat Linux virtual machine.
2) Navigate to the /etc/vmware-tools directory.
3) Using a text editor, modify the tools.conf file with the following entry:

[vmbackup]
enableSyncDriver = false

Note: If the tools.conf file does not exist, create a new empty file and add the above parameters.
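
If you have to apply this change to more than a couple of VMs, it is easy to script. Below is a minimal Python sketch of my own (not part of the Veritas article) that ensures the setting is present; it assumes the stock /etc/vmware-tools/tools.conf path and Python 3 inside the guest, so adjust as needed.

#!/usr/bin/env python3
# Hypothetical helper: ensure [vmbackup] enableSyncDriver = false in tools.conf.
# Run as root; the path below is an assumption and may differ on your distribution.
import configparser
import os

TOOLS_CONF = "/etc/vmware-tools/tools.conf"

config = configparser.ConfigParser()
config.optionxform = str  # preserve the key's case (enableSyncDriver)

if os.path.exists(TOOLS_CONF):
    config.read(TOOLS_CONF)

if not config.has_section("vmbackup"):
    config.add_section("vmbackup")
config.set("vmbackup", "enableSyncDriver", "false")

with open(TOOLS_CONF, "w") as handle:
    config.write(handle)

print("enableSyncDriver set to false in", TOOLS_CONF)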


Change tracking target file already exists

After upgrading to VMware ESXi 5.5 U3 we started seeing random snapshot errors during backups with the following message – “An error occurred while saving the snapshot: Change tracking target file already exists.” The issue is caused by a leftover CBT file that is not deleted when the snapshot is removed by the backup software.

After we submitted several logs and traces to VMware, they acknowledged that the issue exists and that it will be fixed for ESXi 5.5 in the June patch release and for ESXi 6.0 in the July patch release.

For now, when we detect a problematic VM, we browse the datastore and delete the leftover CBT file.
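
Browsing datastores by hand gets tedious, so a quick script can flag candidates. The sketch below is my own illustration, not a VMware tool: it assumes the datastore is reachable as a local path (for example /vmfs/volumes/<datastore> in an ESXi shell) and that the leftover change-tracking files follow the usual <vmname>-000001-ctk.vmdk naming of snapshot disks.

#!/usr/bin/env python3
# Hypothetical helper: list snapshot-style CBT files found under a datastore path.
import os
import re
import sys

# Snapshot delta disks are normally named <vm>-000001.vmdk and their CBT
# companions <vm>-000001-ctk.vmdk. A file matching this pattern on a VM that
# currently has no snapshots is a candidate leftover -- verify before deleting.
SNAPSHOT_CTK = re.compile(r"-\d{6}-ctk\.vmdk$")

root_path = sys.argv[1] if len(sys.argv) > 1 else "/vmfs/volumes"

for root, _dirs, files in os.walk(root_path):
    for name in files:
        if SNAPSHOT_CTK.search(name):
            print(os.path.join(root, name))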

Finding backup jobs that dedupe badly in EMC Data Domain

Some time ago we suddenly saw a significant drop in free space on our EMC Data Domain appliances, so we started to investigate what was causing it.

Our setup is Symantec NetBackup with a Linux-based media server, DD Boost, and Data Domain appliances.

We checked the usual things:

  1. Were any new backups added recently? … No
  2. Any unusual changes in the size of the backups? … No

None of the usual things panned out. The next thing we struggled with was how to see how much space each backup takes up in Data Domain and which data has the worst dedupe ratio. We could not find much in the manuals, and not much about it was available online.

Finally we found a place in Data Domain System Manager where you can download a “Compression Details” file, which lists each backup job, its time, the size of the backup, the post-dedupe-and-compression size, and the dedupe ratio.

How to find it? Go to Data Domain System Manager -> Data Management -> Storage Units -> select a Storage Unit -> Download Compression Details. After some number crunching you get a TSV file.

(Screenshot: Data Domain dedupe results from the Compression Details file)


After saving the TSV file it was easy to open it in Excel and sort the data into different columns. The columns we were interested in were post_lc and bytes/post_lc: post_lc contains the number of bytes actually saved to Data Domain, and bytes/post_lc is the dedupe-compression ratio. Using the values in post_lc we were able to identify the backup that was causing us to run out of free space on the Data Domain appliance.
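
If you would rather not crunch it in Excel, the same sorting is easy to script. The sketch below is only an illustration under the assumption that the export is tab-separated and that its headers match the post_lc and bytes/post_lc columns mentioned above; check the actual header names in your file first.

#!/usr/bin/env python3
# Hypothetical helper: list the backups that consume the most post-dedupe space.
import csv
import sys

SIZE_COL = "post_lc"         # assumed header: bytes actually stored on the Data Domain
RATIO_COL = "bytes/post_lc"  # assumed header: dedupe-compression ratio

path = sys.argv[1] if len(sys.argv) > 1 else "compression_details.tsv"

with open(path, newline="") as handle:
    rows = list(csv.DictReader(handle, delimiter="\t"))

# Sort by stored size, largest first, and show the top 10 offenders.
rows.sort(key=lambda row: float(row[SIZE_COL] or 0), reverse=True)
for row in rows[:10]:
    # Print size and ratio; add whichever identifying columns your export has.
    print(row[SIZE_COL], row[RATIO_COL])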

Using the Compression Details file it is possible to see which backups dedupe nicely and which do not.

Just for clarification, I'm not sure whether this approach is valid for anything other than Symantec NetBackup with a Linux media server, DD Boost, and EMC Data Domain.