9.1 Design a Kubernetes Cluster

Abstract

Designing a Kubernetes cluster starts with understanding the purpose, environment, workload type, traffic pattern, and storage requirements.

In production, the goal is not just to create a working cluster, but to design one that is highly available, scalable, secure, observable, and maintainable.


Key Questions Before Designing

Ask these questions before choosing the cluster architecture.

Area         | Questions
Purpose      | Is the cluster for education, development, testing, or production?
Platform     | Will it run on cloud or on-premises?
Workloads    | How many applications will run? What type?
Resources    | Are workloads CPU, memory, or storage intensive?
Traffic      | Is traffic steady, heavy, or bursty?
Availability | Does the cluster require high availability?
Operations   | Who will upgrade, monitor, and secure the cluster?

Question

Before creating a cluster, always ask: What problem is this cluster solving?


Purpose-Based Cluster Design

Education: use this type of cluster for learning Kubernetes basics.

Recommended options:

  • Minikube
  • Kind
  • Single-node kubeadm cluster
  • Small single-node cloud VM

Tip

For learning, keep it simple. You do not need a production-grade HA cluster.

Development and testing: use this type of cluster for application testing, CI/CD validation, and team development.

Recommended setup:

  • multi-node cluster
  • single control plane node
  • multiple worker nodes
  • kubeadm, GKE, EKS, or AKS

Note

Development clusters should be close enough to production to catch deployment and networking issues early.

Production: use this type of cluster for customer-facing or business-critical applications.

Recommended setup:

  • highly available multi-node cluster
  • multiple control plane nodes
  • multiple worker nodes
  • managed Kubernetes or well-automated self-managed setup
  • proper monitoring, backup, security, and upgrade strategy

Warning

Do not design production clusters like learning clusters. Production requires HA, security, monitoring, and disaster recovery.


Education Cluster

For education or practice:

Single Node
  ├── Control Plane
  └── Worker Components

Common tools:

minikube start
kind create cluster
kubeadm init

Best for:

  • Kubernetes basics
  • kubectl practice
  • YAML testing
  • local labs
  • CKA/CKAD practice

Failure

Do not use Minikube or single-node clusters for production workloads.


Development and Testing Cluster

A development cluster usually has one control plane node and multiple workers.

Control Plane Node
  ├── kube-apiserver
  ├── scheduler
  ├── controller-manager
  └── etcd

Worker Nodes
  ├── kubelet
  ├── kube-proxy
  └── application pods

Recommended minimum:

Component     | Recommendation
Control plane | 1 node
Workers       | 2 or more
Networking    | CNI plugin installed
Storage       | StorageClass configured
Monitoring    | Basic metrics and logs
CI/CD         | Test deployments before production

Tip

Use development clusters to validate deployments, Helm charts, manifests, network policies, and RBAC before production.


Production Cluster Design

Production clusters should be highly available and resilient.

Recommended architecture:

Load Balancer
Multiple Control Plane Nodes
Multiple Worker Nodes
Application Workloads

Production goals:

  • high availability
  • fault tolerance
  • secure access
  • controlled upgrades
  • proper backup and restore
  • centralized logging
  • monitoring and alerting
  • capacity planning
  • autoscaling

Success

A good production cluster is designed for failure. Nodes, pods, zones, and disks can fail without taking the application down.
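To make "designed for failure" concrete, the following minimal sketch estimates how many replicas of an application survive the loss of one zone. It assumes replicas are spread as evenly as possible across zones (for example through anti-affinity or topology spread rules); the function name `replicas_after_zone_loss` is hypothetical and used only for illustration.

```python
import math


def replicas_after_zone_loss(replicas: int, zones: int) -> int:
    """Replicas left after losing the most heavily loaded zone.

    Assumes replicas are spread as evenly as possible across zones,
    so the largest zone holds ceil(replicas / zones) of them.
    """
    if replicas < 0 or zones < 1:
        raise ValueError("replicas must be >= 0 and zones must be >= 1")
    largest_zone_share = math.ceil(replicas / zones)
    return replicas - largest_zone_share


print(replicas_after_zone_loss(6, 3))  # 4
print(replicas_after_zone_loss(2, 3))  # 1
print(replicas_after_zone_loss(1, 3))  # 0
```

A single replica gives no protection against a zone failure regardless of how many zones the cluster spans, which is why replicated workloads appear alongside multi-zone nodes in the production goals.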


Production Scale Limits

Kubernetes can support large clusters when designed correctly.

Common scale reference:

Resource      | Upper limit
Nodes         | Up to 5,000
Pods          | Up to 150,000
Containers    | Up to 300,000
Pods per node | Up to 110

Note

These are the upper limits Kubernetes is designed to support, not sizing targets. Real production sizing depends on workload type, node size, networking, storage, API server load, and operational maturity.
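As a rough illustration, a planned cluster can be checked against these limits before it is built. The sketch below is not an official tool: `ClusterPlan` and `limit_violations` are hypothetical names, the per-node figure of 110 follows the current upstream Kubernetes guidance for large clusters, and the pods-per-node check uses a cluster-wide average, which can hide uneven placement.

```python
from dataclasses import dataclass
from typing import Dict, List

# Upper limits from upstream Kubernetes guidance for large clusters.
UPSTREAM_LIMITS: Dict[str, int] = {
    "nodes": 5000,
    "pods": 150000,
    "containers": 300000,
    "pods_per_node": 110,
}


@dataclass(frozen=True)
class ClusterPlan:
    """Hypothetical description of a planned cluster."""
    nodes: int
    pods: int
    containers: int


def limit_violations(plan: ClusterPlan,
                     limits: Dict[str, int] = UPSTREAM_LIMITS) -> List[str]:
    """Return the names of the limits that the plan exceeds."""
    violations = []
    if plan.nodes > limits["nodes"]:
        violations.append("nodes")
    if plan.pods > limits["pods"]:
        violations.append("pods")
    if plan.containers > limits["containers"]:
        violations.append("containers")
    # Average density only; individual nodes may be more crowded.
    if plan.nodes > 0 and plan.pods / plan.nodes > limits["pods_per_node"]:
        violations.append("pods_per_node")
    return violations


print(limit_violations(ClusterPlan(nodes=200, pods=30000, containers=45000)))
# ['pods_per_node']
print(limit_violations(ClusterPlan(nodes=300, pods=30000, containers=45000)))
# []
```

In the first plan the totals are well within the cluster-wide limits, but 30,000 pods on 200 nodes averages 150 pods per node, so more nodes are needed.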


Example Node Sizing

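Cloud providers can recommend node sizes automatically. As a rough guide, the following table pairs cluster size (number of nodes) with example instance types. The values come from older Kubernetes guidance on sizing the control plane node, and the instance families shown are previous-generation.

Cluster size (nodes) | GCP example    | vCPU / memory    | AWS example | vCPU / memory
1–5                  | n1-standard-1  | 1 vCPU / 3.75 GB | m3.medium   | 1 vCPU / 3.75 GB
6–10                 | n1-standard-2  | 2 vCPU / 7.5 GB  | m3.large    | 2 vCPU / 7.5 GB
11–100               | n1-standard-4  | 4 vCPU / 15 GB   | m3.xlarge   | 4 vCPU / 15 GB
101–250              | n1-standard-8  | 8 vCPU / 30 GB   | m3.2xlarge  | 8 vCPU / 30 GB
251–500              | n1-standard-16 | 16 vCPU / 60 GB  | c4.4xlarge  | 16 vCPU / 30 GB
More than 500        | n1-standard-32 | 32 vCPU / 120 GB | c4.8xlarge  | 36 vCPU / 60 GB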

Warning

Do not memorize instance sizes. Use them as starting points and validate using real workload metrics.
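One way to validate a starting point against workload metrics is to estimate how many worker nodes a set of pods needs from their resource requests. The sketch below is a simplified back-of-the-envelope calculation, not a scheduler simulation: the function name `workers_needed`, the 20% headroom, and the example figures are all assumptions, and it treats every pod as having the same requests.

```python
import math


def workers_needed(pod_count: int,
                   cpu_request_m: int,
                   mem_request_mib: int,
                   node_alloc_cpu_m: int,
                   node_alloc_mem_mib: int,
                   headroom_pct: int = 20,
                   max_pods_per_node: int = 110) -> int:
    """Estimate worker nodes for identical pods.

    CPU is in millicores and memory in MiB. headroom_pct is the share of
    each node's allocatable resources kept free for spikes and system pods.
    """
    usable_cpu = node_alloc_cpu_m * (100 - headroom_pct) // 100
    usable_mem = node_alloc_mem_mib * (100 - headroom_pct) // 100
    pods_by_cpu = usable_cpu // cpu_request_m
    pods_by_mem = usable_mem // mem_request_mib
    # The tightest constraint decides how many pods fit on one node.
    pods_per_node = min(pods_by_cpu, pods_by_mem, max_pods_per_node)
    if pods_per_node < 1:
        raise ValueError("a single pod does not fit on this node size")
    return math.ceil(pod_count / pods_per_node)


# 120 pods requesting 250m CPU and 512 MiB each, on nodes with
# 3500m CPU and 13000 MiB allocatable (hypothetical figures).
print(workers_needed(120, 250, 512, 3500, 13000))  # 11
```

Here CPU is the limiting resource (11 pods per node by CPU versus 20 by memory), which suggests a CPU-heavier node type might be a better fit for this workload.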


Cloud or On-Premises?

Choose the platform based on operational control, team skill, compliance, cost, and cloud strategy.

Platform         | Recommended Option
On-premises      | kubeadm
GCP              | GKE
AWS              | EKS
Azure            | AKS
AWS self-managed | kOps or kubeadm
Hybrid           | Depends on networking and compliance requirements

Tip

Managed Kubernetes reduces control plane management overhead and is usually preferred unless there is a strong reason to self-manage.


Managed Kubernetes Options

GKE: good for Google Cloud environments.

Benefits:

  • managed control plane
  • simplified upgrades
  • integrated load balancing
  • integrated storage
  • strong autoscaling support

EKS: good for AWS environments.

Benefits:

  • AWS IAM integration
  • ELB integration
  • EBS/EFS CSI support
  • managed control plane
  • node group support

AKS: good for Azure environments.

Benefits:

  • Microsoft Entra ID (formerly Azure AD) integration
  • Azure Disk/File integration
  • managed control plane
  • simple cluster provisioning

kubeadm: good for on-premises or custom environments.

Benefits:

  • full control
  • useful for learning internals
  • works on physical or virtual machines

Tradeoff:

  • you manage upgrades, certificates, etcd backups, HA setup, and monitoring

Workload Planning

Workload type affects node size, autoscaling, storage, and networking.

Workload Type | Design Consideration
Web apps      | autoscaling, ingress, load balancing
APIs          | service discovery, HPA, rate limiting
Big data      | storage throughput, CPU, memory
Analytics     | memory and CPU sizing
Databases     | persistent storage, backups, anti-affinity
Batch jobs    | job scheduling, quotas, autoscaling
ML workloads  | GPU nodes, node selectors, taints

Question

Are workloads stateless, stateful, CPU-heavy, memory-heavy, or storage-heavy?


Traffic Planning

Traffic pattern affects autoscaling and ingress design.

Traffic Type     | Cluster Design Need
Light traffic    | small nodes, basic autoscaling
Heavy traffic    | larger node pools, HPA, load balancing
Burst traffic    | cluster autoscaler, HPA, buffer capacity
Internal traffic | service mesh or NetworkPolicies
Public traffic   | Ingress/Gateway API, TLS, WAF

Tip

For burst traffic, configure both Horizontal Pod Autoscaler and Cluster Autoscaler.
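To see how the Horizontal Pod Autoscaler reacts to a burst, it helps to work through its documented scaling rule: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. The sketch below is a simplified model of that rule, not the controller's actual code. The function name `desired_replicas` is hypothetical, and real HPA behavior also involves stabilization windows, pod readiness, and scaling policies that are left out here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Simplified model of the HPA scaling rule.

    desired = ceil(current_replicas * current_metric / target_metric),
    skipped when the ratio is within the tolerance of 1.0, then clamped
    to the configured minimum and maximum.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # Close enough to target: no scaling.
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, current_metric=90, target_metric=60))   # 6
print(desired_replicas(4, current_metric=200, target_metric=50))  # 10
print(desired_replicas(4, current_metric=63, target_metric=60))   # 4
```

The second call shows why the Cluster Autoscaler matters for bursts: the rule asks for 16 replicas but is capped at `max_replicas`, and even uncapped replicas stay Pending unless the cluster can add nodes to hold them.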


Node Design

Kubernetes nodes can be physical or virtual machines.

Recommended production approach:

  • use 64-bit Linux nodes
  • keep node images consistent
  • separate control plane and worker nodes
  • use multiple worker nodes
  • use multiple availability zones where possible
  • use node pools for different workload types

Note

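Control plane nodes (formerly called master nodes) can technically run workloads, but production clusters should usually keep workloads on worker nodes. kubeadm taints control plane nodes by default so that ordinary pods are not scheduled on them.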


Control Plane vs Worker Nodes

Node Type            | Purpose
Control Plane        | Runs API server, scheduler, controller manager, etcd
Worker Node          | Runs application workloads
Dedicated etcd Nodes | Optional for large HA clusters

Production recommendation:

Control Plane Nodes
  - no application workloads
  - protected and monitored
  - highly available

Worker Nodes
  - application workloads
  - autoscaling enabled
  - grouped by workload type

Warning

Avoid scheduling normal workloads on control plane nodes in production.


Minimum Node Count

For production, avoid single-node designs.

Recommended baseline:

Environment      | Suggested Nodes
Learning         | 1 node
Development      | 1 control plane + 2 workers
Small production | 3 control plane + 3 workers
HA production    | 3 or 5 control plane + multiple workers
Large production | dedicated etcd + multiple control plane + multiple worker pools

Success

A minimum of 3 control plane nodes is common for high availability.


Storage Design

Storage design depends on workload requirements.

Key storage considerations:

Requirement                 | Recommendation
High performance            | SSD-backed storage
Shared access               | network-based storage
Stateful workloads          | PersistentVolumes and PVCs
Different performance tiers | StorageClasses
Specific disk types         | node labels and node selectors
Backup needs                | snapshots and backup tooling

Tip

Use StorageClasses to provide different storage tiers such as silver, gold, and platinum.


Storage Best Practices

Recommended

  • Use CSI drivers for production storage
  • Use StorageClasses for dynamic provisioning
  • Use SSD-backed disks for high-performance workloads
  • Use backup and snapshot policies
  • Use node selectors for workloads needing specific disk types
  • Avoid hostPath for production application data

Danger

hostPath is not recommended for production storage because data is tied to a specific node.


Network Design

Production networking must support:

  • pod-to-pod communication
  • service networking
  • ingress or Gateway API
  • DNS resolution
  • NetworkPolicies
  • CNI plugin support
  • load balancer integration

Common CNI options:

  • Calico
  • Cilium
  • Flannel
  • Weave Net
  • cloud provider CNI plugins

Warning

Choose a CNI that supports NetworkPolicies if you need production network security.


High Availability Design

Production HA requires removing single points of failure.

Recommended:

  • multiple control plane nodes
  • external load balancer for API server
  • HA etcd
  • multiple worker nodes
  • multiple zones
  • replicated workloads
  • PodDisruptionBudgets
  • anti-affinity rules
  • backup and restore plan

Tip

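Use an odd number of etcd members, such as 3 or 5. Quorum requires a majority of members, so adding a fourth member to a 3-member cluster does not increase the number of failures it can tolerate.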
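The quorum arithmetic behind this tip is short enough to work through directly. etcd needs a majority of members to agree on a write, so quorum is floor(n / 2) + 1 and the cluster tolerates n minus quorum failures. The helper names below are hypothetical and exist only to illustrate the calculation.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with the given member count."""
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """Members that can fail while the cluster still has quorum."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(n, quorum(n), failure_tolerance(n))
# 1 1 0
# 2 2 0
# 3 2 1
# 4 3 1
# 5 3 2
```

Going from 3 to 4 members raises the quorum without raising the failure tolerance, so the extra member adds cost and replication traffic but no resilience.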


Cluster Provisioning Options

kubeadm is good for:

  • learning
  • on-premises clusters
  • custom self-managed clusters

You manage:

  • upgrades
  • certificates
  • etcd backups
  • HA setup
  • monitoring

Managed Kubernetes (GKE, EKS, AKS) is good for:

  • production cloud workloads
  • faster setup
  • reduced operational burden

Cloud provider manages:

  • control plane
  • some upgrades
  • API availability
  • cloud integrations

kOps is good for:

  • self-managed AWS clusters
  • infrastructure automation
  • repeatable cluster creation

Production Best Practices

Recommended

  • Use managed Kubernetes where possible
  • Use multiple control plane nodes for HA
  • Keep workloads off control plane nodes
  • Use separate node pools for different workloads
  • Enable autoscaling
  • Use production-ready CNI
  • Use NetworkPolicies
  • Configure centralized logging and monitoring
  • Back up etcd
  • Use RBAC and least privilege
  • Use namespaces and resource quotas
  • Use StorageClasses and CSI drivers
  • Test upgrades before production rollout

Do's

  • Define the cluster purpose first
  • Size nodes based on workload requirements
  • Use HA for production control plane
  • Use separate node pools for different workloads
  • Use managed Kubernetes when operational simplicity matters
  • Use SSD storage for high-performance workloads
  • Use network storage for shared persistent data
  • Use monitoring, logging, and alerting
  • Use namespaces, quotas, and RBAC
  • Plan backup and disaster recovery

Don'ts

  • Don't use single-node clusters for production
  • Don't run production workloads on control plane nodes
  • Don't use hostPath for production persistence
  • Don't ignore CNI and NetworkPolicy support
  • Don't use one node pool for every workload type
  • Don't skip etcd backups
  • Don't over-provision capacity without monitoring actual usage
  • Don't expose the API server publicly without strict controls
  • Don't deploy without resource requests and limits

Common Design Mistakes

Mistake                            | Impact
Single control plane in production | control plane outage risk
No etcd backup                     | cluster recovery risk
Wrong CNI choice                   | networking/security limitations
No resource limits                 | noisy neighbor issues
No node separation                 | workload interference
No monitoring                      | slow incident response
No autoscaling                     | poor traffic handling
No storage planning                | stateful workload failures
No upgrade strategy                | version drift and security risk

Bug

Many production failures come from missing operational planning, not from Kubernetes YAML syntax.


Quick Design Checklist

Before building the cluster, confirm:

  • What is the cluster purpose?
  • Is it production or non-production?
  • Cloud or on-premises?
  • Managed or self-managed?
  • How many workloads?
  • What workload types?
  • What traffic pattern?
  • What storage requirements?
  • What CNI plugin?
  • Is NetworkPolicy required?
  • How many control plane nodes?
  • How many worker nodes?
  • Is etcd backed up?
  • Is monitoring configured?
  • Is RBAC configured?
  • Is the upgrade plan defined?

Summary

Quote

  • Cluster design starts with purpose and workload requirements
  • Learning clusters can be simple and single-node
  • Development clusters should be multi-node
  • Production clusters require HA, security, monitoring, and backups
  • Managed Kubernetes is preferred when possible
  • Storage, networking, and node sizing must match workload needs
  • Keep control plane nodes dedicated in production
  • Use production-ready CNI, CSI, RBAC, and observability