vSphere Cluster Sizing Guidelines


When determining the size and scope of your clusters, there are many considerations such as licensing, hardware generations and the type of workloads.

First, let's talk about the different types of clusters that can be deployed:

Management cluster

Used for managing the infrastructure. VMs like vCenter, vROps, vRA, SRM, Log Insight, etc. are in this cluster. In smaller environments, this cluster would also host the network functions that would otherwise run in a dedicated edge cluster. Minimum sizes are often 3-4 hosts. This size makes them very well suited for hyperconverged infrastructure, since that is also the minimum size it requires.

Payload or compute cluster

Used for workloads including production, dev, test and UAT. These hosts often do not have any local disks and will either boot from SAN or via PXE with Auto Deploy.

Island clusters

These are hosts that are segmented either because of differences in hardware (if EVC mode is used, then the CPU features are dumbed down to the lowest common denominator), or because of licensing. Oracle licensing is a perfect example of a reason to restrict a workload to specific hosts.

Storage clusters

These are hosts that are part of a hyperconverged storage cluster, such as VSAN.

Performance clusters

These are hosts that have been augmented with flash cards or SSDs and use server-side read and write acceleration. PernixData FVP is an example of a technology that provides this kind of acceleration.

Edge clusters

These are hosts that have VMs that are used with NSX to provide gateway services from the logical to physical networks. They are used in environments where the physical network equipment does not provide the gateway services.


Figure 1. A single rack with multiple types of clusters

The key to knowing when to use certain types of clusters comes down to business requirements, licensing and number of hosts. Sometimes, licensing limitations can be overcome by knowing the maximum CPU requirements of the hosts in the cluster. For instance, VSAN is licensed by CPU count and requires a minimum of 3 hosts. However, 4 or more hosts are really required if you want to rebuild components on another host after a failure. If your licensing is constrained by your budget, then you may be better served to have 4 x single CPU hosts, versus 3 x dual CPU hosts.
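The licensing arithmetic behind that last point can be sketched in a few lines of Python. This is only an illustration of per-socket counting for the two layouts mentioned above; the function name is made up for the example and no real price list is assumed.

```python
def licensed_sockets(hosts: int, sockets_per_host: int) -> int:
    """Total CPU sockets that need a per-socket license (illustrative helper)."""
    return hosts * sockets_per_host


# The two layouts compared in the text.
options = {
    "4 x single CPU hosts": licensed_sockets(4, 1),
    "3 x dual CPU hosts": licensed_sockets(3, 2),
}

for name, sockets in options.items():
    print(f"{name}: {sockets} socket licenses")
```

The four host layout needs fewer socket licenses and still leaves a spare host on which components can be rebuilt after a failure.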

Another consideration is the vCPU to pCPU ratio for the specific workload that you are running on the hosts. The workload will dictate the number of cores required and thus the number of hosts in a cluster.
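As a rough sketch of how the ratio drives host count, the calculation below turns a total vCPU demand into a number of hosts. The function name, the example figures (600 vCPUs, a 4:1 ratio, 32 cores per host) and the single spare host for failover are all assumptions chosen for illustration, not recommendations.

```python
import math


def hosts_required(total_vcpus: int, vcpu_per_pcore: float,
                   cores_per_host: int, spare_hosts: int = 1) -> int:
    """Estimate hosts needed for a target vCPU to pCPU ratio.

    spare_hosts is extra capacity held back for host failures (assumed N+1).
    """
    cores_needed = math.ceil(total_vcpus / vcpu_per_pcore)
    hosts_for_load = math.ceil(cores_needed / cores_per_host)
    return hosts_for_load + spare_hosts


# Hypothetical workload: 600 vCPUs at 4:1 on 32-core hosts.
print(hosts_required(total_vcpus=600, vcpu_per_pcore=4, cores_per_host=32))
```

A workload that only tolerates a lower ratio (for example 2:1) needs more cores for the same vCPU count, and therefore a larger cluster.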

Many smaller environments simply run all workloads on a single cluster, with HA and DRS taking care of resource contention and placement. However, as environments grow, it makes sense to start isolating workloads on purpose-built hardware. By doing so, you will get greater value, performance and manageability from the platform.

Here’s an example:

A single 8 host cluster with a single SAN. The SAN experiences a failure and all workloads and management access are lost. Now you can’t even troubleshoot because there is no management environment available to do so.

If the management environment were still running on a separate storage platform, then you would have access and means to troubleshoot and repair the issue. This limits the total downtime and effort in performing the repair and the financial costs to the organization.

Some organizations may employ a management SAN that is separate from the rest of the environment. This can be a lower cost SAN because performance is not as important as availability and risk isolation. As an additional measure of protection, key management VMs can also be replicated from one SAN to the other to ensure limited risk exposure.

Another cluster consideration is the concept of a fault domain. If a chassis or rack (dual PDU) were to fail, how many hosts would be affected? In VSAN, fault domains can be configured directly within the UI. In general, this form of segmentation to limit risk is done by host affinity or anti-affinity rules, which will ensure that certain VMs are together or apart. VMs may be placed together for performance gains by intra-host communication, or they may be separated to ensure application level clustering availability.

Another method of ensuring application availability is the use of FT (Fault Tolerance). However, if the hosts participating in FT are in the same fault domain, then there is still a risk of both the primary and secondary VM going down together.

Here are some guidelines for host to cluster association based on number of hosts.

Cluster sizing design guidelines

1> With fewer than 8 hosts, use a single cluster unless requirements dictate otherwise. Ensure management VMs are on a different disk pool or SAN from the compute VMs (if possible).

2> Segment management and compute clusters for 8-18 hosts

Use 2 hosts for management with a separate entry-level SAN. Replicate the VMs to the compute SAN for an additional measure of safety. Bringing a replicated LUN online and registering the VMs on a host may restore service more quickly than restoring from backup. It may even be beneficial to separate the compute architecture. If the compute nodes are blades, then have the management cluster as rackmount.

Use 6 to 16 hosts for the compute cluster. Boot from the SAN or use VMware Auto Deploy. The nodes should be stateless and easily replaced or expanded.

3> Implement HCI (Hyperconverged Infrastructure) for larger management clusters.

Once you have a larger management cluster footprint, an entry-level SAN may not be able to meet the requirements of the hosted VMs. Hyperconverged infrastructure is especially suited for this. Nutanix and VSAN are good options. 4 hosts are the sweet spot for this design, which will provide you with good performance, storage independence, reliability and availability.

4> Use single CPU nodes in clusters that have low processor requirements and high per-socket licensing costs.

5> If you stretch clusters across racks or data centers, ensure that resource utilization stays low enough for the remaining hosts to absorb the full loss of a fault domain. Ensure that HA admission control is also configured to allow for this.

6> When using NSX in a small environment, a single cluster can be used with two hosts that have NSX Edge services VMs.

7> When NSX is used in a medium-sized environment, the hosts with NSX Edge services VMs should be spread across multiple racks in the management cluster.

8> Large scale NSX deployments should have dedicated Edge clusters, multiple sites and multiple vCenters.
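
Guideline 5 comes down to simple arithmetic, sketched below in Python. The sketch assumes all hosts are identical and that capacity is measured as a single percentage, which is a simplification; the function names and the example rack layouts are hypothetical.

```python
def max_safe_utilization(hosts_per_domain: list) -> float:
    """Highest cluster utilization that still fits on the surviving hosts
    after the largest fault domain is lost (identical hosts assumed)."""
    total = sum(hosts_per_domain)
    largest = max(hosts_per_domain)
    return (total - largest) / total


def survives_domain_loss(hosts_per_domain: list, current_utilization: float) -> bool:
    """True if the current load would fit after losing the largest domain."""
    return current_utilization <= max_safe_utilization(hosts_per_domain)


# Hypothetical layouts: hosts per rack or site.
for layout in ([4, 4], [4, 4, 4], [6, 2]):
    ceiling = max_safe_utilization(layout)
    print(f"{layout}: stay at or below {ceiling:.0%}, "
          f"reserve {1 - ceiling:.0%} for failover")
```

The reserved share is the kind of figure you would carry into a percentage-based HA admission control policy. Note how an uneven layout such as 6 hosts in one rack and 2 in another leaves very little usable capacity.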
