Getting more out of vSphere Metro Storage Cluster (vMSC)

I have been thinking about vMSCs for some time now, and I hate the idea of leaving 50% of cluster memory free for failover in case of a data center failure. Why memory and not CPU? In my 9 years of working with virtualization, memory has almost always been the bottleneck. So I was looking for ways to get more out of metro clusters, and I came up with the following deployment scenario – mix test, development and production workloads in the same cluster, and abandon the test and development workloads in case of a site failure. NB! I have not tested this scenario – it is purely theoretical.

Deployment scenario

  • Two data centers – let’s call them Site A and Site B.
  • Six hosts with 512GB RAM – 3 in each data center. Let’s call them host-a1, host-a2, host-a3, host-b1, host-b2 and host-b3.
  • Two datastores for production machines that are available in both data centers. Let’s call them DS-A1-PROD and DS-B1-PROD.
  • A number of local datastores for test and development, available only to the ESXi hosts within the local data center. Let’s call them DS-A1-TEST, DS-A2-TEST, DS-B1-TEST and DS-B2-TEST.
  • Workloads will be divided into four groups – Site A production, Site A test and dev, Site B production and Site B test and dev.
    • “Site A production” will be allowed to consume 12% of cluster resources. Workload will be running in Site A.
    • “Site A test and dev” will be allowed to consume 22% of cluster resources. Workload will be running in Site A.
    • “Site B production” will be allowed to consume 12% of cluster resources. Workload will be running in Site B.
    • “Site B test and dev” will be allowed to consume 22% of cluster resources. Workload will be running in Site B.
  • In total, 68% of cluster memory resources will be consumed (12% + 22% + 12% + 22%).
Stretched cluster memory resource usage
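
To make the percentages above concrete, here is a minimal sketch of the arithmetic in Python. It assumes that each of the six hosts has 512GB RAM and that the shares are fractions of total cluster memory; the variable names are made up for illustration and are not part of any VMware tooling.

```python
# Assumption: 512GB RAM per host, 3 hosts per site, 2 sites.
HOST_RAM_GB = 512
HOSTS_PER_SITE = 3
SITES = ("A", "B")

# Share of total cluster memory each workload group may consume,
# taken from the deployment scenario list.
SHARES = {
    ("A", "production"): 0.12,
    ("A", "test_dev"): 0.22,
    ("B", "production"): 0.12,
    ("B", "test_dev"): 0.22,
}

cluster_ram_gb = HOST_RAM_GB * HOSTS_PER_SITE * len(SITES)

# Memory budget per workload group in GB.
budget_gb = {group: share * cluster_ram_gb for group, share in SHARES.items()}

# Overall share of cluster memory consumed in normal operation.
total_share = sum(SHARES.values())

for (site, kind), gb in budget_gb.items():
    print(f"Site {site} {kind}: {gb:.2f} GB")
print(f"Cluster total: {cluster_ram_gb} GB, consumed: {total_share:.0%}")
```

Under these assumptions the cluster has 3072GB of RAM in total, and the four groups together consume 68% of it.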

Site failure scenario

  • Site A goes down.
  • Workload “Site A production” will be restarted by VMware HA in Site B.
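  • Workload “Site A test and dev” will be down until Site A is restored.
  • 92% of the remaining cluster memory resources will be consumed: both production workloads plus “Site B test and dev” add up to 46% of the original cluster memory, running on the surviving 50%.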
Stretched cluster memory load in case of site failure. Resources marked with red X are unavailable.

The site failure scenario for Site B is the same. Production workload will be restarted in Site A and test/dev workload will be unavailable until Site B is restored.
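
The failure case can be checked with a similar sketch. It assumes both sites have equal capacity, that all production workload is restarted on the surviving site, and that the failed site's test/dev workload stays down. The function name `load_after_site_failure` is hypothetical, chosen only for this example.

```python
# Share of total cluster memory per workload group (from the scenario).
SHARES = {
    ("A", "production"): 0.12,
    ("A", "test_dev"): 0.22,
    ("B", "production"): 0.12,
    ("B", "test_dev"): 0.22,
}
SITES = ("A", "B")


def load_after_site_failure(failed_site, shares=SHARES, sites=SITES):
    """Return memory load on the surviving site(s) as a fraction of
    their capacity. Assumes every site holds an equal part of the
    cluster memory."""
    surviving_sites = [s for s in sites if s != failed_site]
    surviving_capacity = len(surviving_sites) / len(sites)

    # Production from every site keeps running (restarted by HA);
    # test/dev only survives on sites that are still up.
    demand = sum(
        share
        for (site, kind), share in shares.items()
        if kind == "production" or site != failed_site
    )
    return demand / surviving_capacity


for site in SITES:
    print(f"Site {site} down: {load_after_site_failure(site):.0%} of remaining memory used")
```

With the shares from this scenario, both cases come out at 92%. Changing the numbers in `SHARES` gives a quick way to see how far the local workload can be pushed before the surviving site would be overcommitted.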

Conclusion 

With this type of scenario it is possible to push metro cluster memory usage from the usual 50% to 68%. Thinking about this further, the site “local” workload doesn’t always have to be test or development. It can also be production workload that does not require site failover, for example multi-instance workloads like web front-end servers, Active Directory servers, terminal servers, etc. Although I only looked at memory as a resource here, you should not forget CPU resources.

As my experience with metro clusters is mostly theoretical, I would appreciate it if someone with real experience would comment on this.
