Deduplication and compression with ScaleIO

ScaleIO does not currently support deduplication or compression natively. Since ScaleIO can use almost any disk device, I decided to test ScaleIO combined with QUADStor storage virtualization software, which adds deduplication and compression.

For the test I built a small setup: three CentOS 7 servers, each with a 200GB local disk, running QUADStor and the ScaleIO SDS software, plus one ScaleIO MDM server and a ScaleIO client based on Windows Server 2012 R2. On each CentOS server QUADStor was used to create a 150GB disk with compression and deduplication enabled, and that same 150GB disk was given to ScaleIO SDS as its storage device.

To the client machine I presented one 200GB disk. To test the deduplication I copied some ISO files to that disk. As the screenshot below shows, my test data resulted in an almost 2x deduplication ratio. The ratio is affected by the way ScaleIO works: it distributes data across several nodes. For example, block “A” from “dataset1” may end up on servers “One” and “Two”, while block “A” from “dataset2” ends up on servers “One” and “Three”. On server “One” block “A” will be deduplicated, since that server already holds the block, but on server “Three” block “A” will not be deduplicated, since it is unique to that server.

[Screenshot: QuadStor deduplication stats]
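
To make this placement effect concrete, here is a minimal Python sketch. It is my own illustration, not ScaleIO or QUADStor code: the node names, the two-copy random placement, and the tiny datasets are all assumptions made for the demo. It simulates writing two identical datasets, mirrored to two of three nodes, and prints the deduplication ratio each node sees locally.

```python
import random
from collections import defaultdict

NODES = ["One", "Two", "Three"]

def place_blocks(datasets, seed=42):
    """Mimic two-copy placement: each logical block is written to two
    randomly chosen nodes. Returns the list of blocks each node received."""
    rng = random.Random(seed)
    node_blocks = defaultdict(list)
    for blocks in datasets:
        for block in blocks:
            for node in rng.sample(NODES, 2):  # two copies per block
                node_blocks[node].append(block)
    return node_blocks

def dedupe_ratio(blocks):
    """Blocks written to the node vs. unique blocks it has to store."""
    return len(blocks) / len(set(blocks))

# Two datasets with identical content, so a single global deduplicator
# would store every block exactly once (a clean 2x ratio).
dataset1 = [f"block-{c}" for c in "ABCDEFGHIJ"]
dataset2 = list(dataset1)

for node, blocks in sorted(place_blocks([dataset1, dataset2]).items()):
    print(f"server {node}: {len(blocks)} writes, {len(set(blocks))} unique, "
          f"ratio {dedupe_ratio(blocks):.2f}")
```

Even though a global deduplicator would see a clean 2x ratio here, the per-node ratios land somewhere between 1x and 2x, which matches the behaviour described above.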

I did not perform any performance tests, since my test systems were all running on a single host and a single SSD drive.

Conclusion

In conclusion: using third-party software it is possible to add features to ScaleIO, such as deduplication and tiering. Mixing and matching different software adds complexity, but sometimes the added value makes it worthwhile.

Related posts 

Enabling data deduplication in Linux with QUADStor

Speeding up writes for ScaleIO with DRAM

Automatic storage tiering with ScaleIO

Links

QUADStor homepage

How to estimate deduplication and compression ratios?

In my opinion the honest answer is that you can't. There is no good, accurate way to estimate deduplication and compression ratios, because too many variables affect them. Estimation tools are available from different vendors, but you will get the most accurate numbers by testing different solutions with your actual data. I have tested several solutions, and the deduplication and/or compression ratio has varied between 2.5x and 8x.
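
If you just want a rough, solution-agnostic baseline before trying any product, you can chunk and hash your own data. Below is a minimal Python sketch of that idea; the fixed-size 4K chunks and SHA-256 hashes are assumptions on my part, and real products use their own chunking schemes and add compression on top, so treat the output only as a ballpark figure.

```python
import hashlib
import os
import sys

def estimate_dedupe_ratio(path, chunk_size=4096):
    """Walk a directory, split every file into fixed-size chunks and
    hash them. Ratio = total chunks / unique chunks. Only a rough
    baseline: real products may use variable-size chunking, different
    chunk sizes and compression, so actual results will differ."""
    seen = set()
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    total += 1
                    seen.add(hashlib.sha256(chunk).digest())
    return total / len(seen) if seen else 1.0

if __name__ == "__main__":
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"estimated dedupe ratio: {estimate_dedupe_ratio(directory):.2f}x")
```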

Testing

  • Test different solutions, to understand how well each one works. For example, solution #1 may have a dedupe ratio of 4, which looks good, until solution #2 shows a dedupe ratio of 8 on the same data.
  • Try to use the same test data during the different tests, so the results are comparable.

Solutions

  • Microsoft Windows Server 2012 R2 – built-in post-process deduplication engine. Check this page for more information.
  • QuadStor software – inline deduplication and compression. Check this page for more information.
  • Nutanix Community Edition – has both deduplication and compression options.
  • All Flash Arrays – most AFAs include deduplication and/or compression for data reduction. If you are interested in AFAs, most vendors can hook you up with POC equipment which you can use to test the solution. AFA vendors to check: EMC, Pure Storage, Kaminario, SolidFire, etc.

Results

Results will vary between solutions. Deduplication works well for similar data (VDI, server OS disks), while compression works better for databases (Oracle, MSSQL). The deduplication ratio is also affected by the deduplication chunk size – 512 bytes, 4K, 8K, 16K, etc. Usually a smaller chunk size results in a better ratio.
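
To see the chunk-size effect for yourself, the same chunk-and-hash approach can be run at several sizes. The sketch below is again my own illustration: the fixed-size chunking and the synthetic "lightly modified duplicate" dataset are assumptions, but they show why smaller chunks recover more duplicates when copies are not byte-identical.

```python
import hashlib
import random

def dedupe_ratio(data: bytes, chunk_size: int) -> float:
    """Fixed-size chunking: total chunks divided by unique chunks."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return len(chunks) / len({hashlib.sha256(c).digest() for c in chunks})

rng = random.Random(0)
base = rng.randbytes(1 << 20)  # 1 MiB of random data (Python 3.9+)

# A second copy with one byte flipped every 64 KiB, mimicking lightly
# modified duplicates such as patched OS images.
copy = bytearray(base)
for off in range(0, len(copy), 1 << 16):
    copy[off] ^= 0xFF

data = base + bytes(copy)
for size in (512, 4096, 8192, 16384):
    print(f"{size:>6} B chunks: {dedupe_ratio(data, size):.2f}x")
```

Each flipped byte invalidates exactly one chunk of the copy, so larger chunks lose more deduplicatable data per modification: roughly 1.98x at 512 bytes versus 1.60x at 16K in this synthetic case.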

More info

More information about this topic can be found at the following links.

Deduplication ratios – What should be included in the reported ratio?

Understanding Data Deduplication (SNIA)