I have been thinking about vSphere Metro Storage Clusters (vMSC) for some time now, and I hate the idea of leaving 50% of cluster memory free for failover in case of a data center failure. Why memory and not CPU? In my 9 years of working with virtualization, the bottleneck has almost always been memory. So I was thinking of different ways to get more out of metro clusters, and I came up with the following deployment scenario: mix test, development and production workloads in the same cluster, and abandon the test and development workloads in case of a site failure. NB! I have not tested this scenario; it is purely theoretical.
- Two data centers – let’s call them Site A and Site B.
- Six hosts with 512GB RAM each – 3 in each data center. Let’s call them host-a1, host-a2, host-a3, host-b1, host-b2 and host-b3.
- Two datastores for production machines that are available in both data centers. Let’s call them DS-A1-PROD and DS-B1-PROD
- A number of local datastores for test and development, available only to the ESXi hosts within the local data center. Let’s call them DS-A1-TEST, DS-A2-TEST, DS-B1-TEST and DS-B2-TEST.
- Workloads will be divided into four groups – Site A production, Site A test and dev, Site B production and Site B test and dev.
- “Site A production” will be allowed to consume 12% of cluster resources. Workload will be running in Site A.
- “Site A test and dev” will be allowed to consume 22% of cluster resources. Workload will be running in Site A.
- “Site B production” will be allowed to consume 12% of cluster resources. Workload will be running in Site B.
- “Site B test and dev” will be allowed to consume 22% of cluster resources. Workload will be running in Site B.
- 68% of cluster memory resources will be consumed.
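To make the percentages above more tangible, here is a minimal sketch that converts them into gigabytes. The host count, RAM size and shares come from the list above; the variable names are made up for illustration, and the calculation ignores hypervisor overhead and any HA admission control settings.

```python
# Hypothetical sizing helper: the numbers come from the scenario above,
# the names are illustrative only.
HOSTS = 6
RAM_PER_HOST_GB = 512

# Share of total cluster memory each workload group may consume, in percent.
SHARES_PCT = {
    "Site A production": 12,
    "Site A test and dev": 22,
    "Site B production": 12,
    "Site B test and dev": 22,
}

cluster_gb = HOSTS * RAM_PER_HOST_GB  # total cluster memory in GB
budget_gb = {name: cluster_gb * pct / 100 for name, pct in SHARES_PCT.items()}
total_pct = sum(SHARES_PCT.values())  # combined share of all four groups

for name, gb in budget_gb.items():
    print(f"{name}: {SHARES_PCT[name]}% = {gb:.0f} GB")
print(f"Total: {total_pct}% = {cluster_gb * total_pct / 100:.0f} GB of {cluster_gb} GB")
```

With 3072 GB in the cluster, each production group gets roughly 369 GB and each test and dev group roughly 676 GB.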
Site failure scenario
- Site A goes down.
- Workload “Site A production” will be restarted by VMware HA in Site B.
- Workload “Site A test and dev” will be down until Site A is restored.
- 92% of the remaining cluster memory resources will be consumed (workloads worth 46% of the original cluster capacity running on the surviving 50%).
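The failover arithmetic can be checked with a short sketch. It assumes, as in the scenario, that all production workloads restart on the surviving site and that the failed site's local workloads stay down. The function name and data layout are hypothetical, and memory overhead is not modelled.

```python
HOSTS_PER_SITE = 3
RAM_PER_HOST_GB = 512
SITES = ("A", "B")

# Percent of total cluster memory per (site, kind); "local" means test and dev.
SHARES_PCT = {
    ("A", "prod"): 12,
    ("A", "local"): 22,
    ("B", "prod"): 12,
    ("B", "local"): 22,
}


def utilization_after_failure(failed_site: str) -> float:
    """Return the memory utilization (0..1) of the surviving site."""
    cluster_gb = len(SITES) * HOSTS_PER_SITE * RAM_PER_HOST_GB
    surviving_gb = (len(SITES) - 1) * HOSTS_PER_SITE * RAM_PER_HOST_GB
    # All production workloads keep running; local workloads only survive
    # if their own site is still up.
    load_pct = sum(
        pct
        for (site, kind), pct in SHARES_PCT.items()
        if kind == "prod" or site != failed_site
    )
    return (cluster_gb * load_pct / 100) / surviving_gb


for site in SITES:
    print(f"Site {site} fails: {utilization_after_failure(site):.0%} of remaining memory used")
```

Changing the shares in `SHARES_PCT` shows quickly how much local workload fits before the surviving site would be overcommitted.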
The site failure scenario for Site B is the same. Production workload will be restarted in Site A, and test/dev workload will be unavailable until Site B is restored.
With this type of scenario it is possible to push the metro cluster memory usage from the usual 50% to 68%. On reflection, the site “local” workload doesn’t always have to be test or development. It can also be production workload that does not require site failover, for example multi-instance workloads like web front-end servers, Active Directory servers, terminal servers, etc. Although I only used memory as a resource here, you should not forget the CPU resources.
As my experience with metro clusters is mostly theoretical, I would appreciate it if someone with real experience would comment on this.