9.1 Design a Kubernetes Cluster

Abstract

Designing a Kubernetes cluster starts with understanding the purpose, environment, workload type, traffic pattern, and storage requirements.

In production, the goal is not just to create a working cluster, but to design one that is highly available, scalable, secure, observable, and maintainable.

Key Questions Before Designing

Ask these questions before choosing the cluster architecture.

Area	Questions
Purpose	Is the cluster for education, development, testing, or production?
Platform	Will it run on cloud or on-premises?
Workloads	How many applications will run? What type?
Resources	Are workloads CPU, memory, or storage intensive?
Traffic	Is traffic steady, heavy, or bursty?
Availability	Does the cluster require high availability?
Operations	Who will upgrade, monitor, and secure the cluster?

Question

Before creating a cluster, always ask: What problem is this cluster solving?

Purpose-Based Cluster Design

Education

Use this for learning Kubernetes basics.

Recommended options:

Minikube
Kind
Single-node kubeadm cluster
Small single-node cloud VM

Tip

For learning, keep it simple. You do not need a production-grade HA cluster.

Development & Testing

Use this for application testing, CI/CD validation, and team development.

Recommended setup:

multi-node cluster
single control plane node
multiple worker nodes
kubeadm, GKE, EKS, or AKS

Note

Development clusters should be close enough to production to catch deployment and networking issues early.

Production

Use this for customer-facing or business-critical applications.

Recommended setup:

highly available multi-node cluster
multiple control plane nodes
multiple worker nodes
managed Kubernetes or well-automated self-managed setup
proper monitoring, backup, security, and upgrade strategy

Warning

Do not design production clusters like learning clusters. Production requires HA, security, monitoring, and disaster recovery.

Education Cluster

For education or practice:

Single Node
  ├── Control Plane
  └── Worker Components

Common tools:

minikube start

kind create cluster

kubeadm init

Best for:

Kubernetes basics
kubectl practice
YAML testing
local labs
CKA/CKAD practice

Failure

Do not use Minikube or single-node clusters for production workloads.

Development and Testing Cluster

A development cluster usually has one control plane node and multiple workers.

Control Plane Node
  ├── kube-apiserver
  ├── kube-scheduler
  ├── kube-controller-manager
  └── etcd

Worker Nodes
  ├── kubelet
  ├── kube-proxy
  └── application pods

Recommended minimum:

Component	Recommendation
Control plane	1 node
Workers	2 or more
Networking	CNI plugin installed
Storage	StorageClass configured
Monitoring	Basic metrics and logs
CI/CD	Test deployments before production

Tip

Use development clusters to validate deployments, Helm charts, manifests, network policies, and RBAC before production.

Production Cluster Design

Production clusters should be highly available and resilient.

Recommended architecture:

Load Balancer
  ↓
Multiple Control Plane Nodes
  ↓
Multiple Worker Nodes
  ↓
Application Workloads

Production goals:

high availability
fault tolerance
secure access
controlled upgrades
proper backup and restore
centralized logging
monitoring and alerting
capacity planning
autoscaling

Success

A good production cluster is designed for failure. Nodes, pods, zones, and disks can fail without taking the application down.
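Designing for failure can be turned into simple capacity arithmetic. The sketch below assumes worker nodes are spread evenly across zones and that the workload needs a fixed number of same-sized nodes to run; both are simplifying assumptions, and the function names are illustrative. It estimates how many nodes each zone needs so that losing one whole zone still leaves enough capacity.

```python
import math


def nodes_per_zone(nodes_needed: int, zones: int) -> int:
    """Nodes to place in each zone so the remaining zones can carry the load.

    If one zone fails, (zones - 1) zones must still provide nodes_needed nodes.
    """
    if nodes_needed < 1:
        raise ValueError("nodes_needed must be at least 1")
    if zones < 2:
        raise ValueError("at least two zones are needed to survive a zone loss")
    return math.ceil(nodes_needed / (zones - 1))


def total_nodes(nodes_needed: int, zones: int) -> int:
    """Total worker nodes to provision across all zones."""
    return nodes_per_zone(nodes_needed, zones) * zones


if __name__ == "__main__":
    # Example: the workload needs 6 nodes of capacity, spread over 3 zones.
    print(nodes_per_zone(6, 3), total_nodes(6, 3))
```

The gap between the total and the capacity actually needed is the price of zone-level fault tolerance; more zones make that overhead proportionally smaller.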

Production Scale Limits

Kubernetes can support large clusters when designed correctly.

Common scale reference:

Resource	Scale
Nodes	Up to 5,000
Pods	Up to 150,000
Containers	Up to 300,000
Pods per node	Up to 110

Note

These are upper limits. Real production sizing depends on workload type, node size, networking, storage, API server load, and operational maturity.
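A planned cluster can be compared against these upper limits with a few lines of code. The sketch below is illustrative only: the `ClusterPlan` type and the average of two containers per pod are assumptions, and the per-node constant uses 110, which is the kubelet's default maximum and may differ in your environment.

```python
from dataclasses import dataclass

# Upper bounds for a single cluster; re-check them against the
# Kubernetes release you actually run.
MAX_NODES = 5_000
MAX_PODS = 150_000
MAX_CONTAINERS = 300_000
MAX_PODS_PER_NODE = 110  # kubelet default; configurable per node


@dataclass
class ClusterPlan:
    nodes: int
    pods: int
    containers_per_pod: float = 2.0  # hypothetical average (app + sidecar)


def scale_violations(plan: ClusterPlan) -> list[str]:
    """Return a list of limits that the plan would exceed."""
    problems = []
    if plan.nodes > MAX_NODES:
        problems.append(f"nodes: {plan.nodes} > {MAX_NODES}")
    if plan.pods > MAX_PODS:
        problems.append(f"pods: {plan.pods} > {MAX_PODS}")
    containers = plan.pods * plan.containers_per_pod
    if containers > MAX_CONTAINERS:
        problems.append(f"containers: {containers:.0f} > {MAX_CONTAINERS}")
    if plan.nodes > 0 and plan.pods / plan.nodes > MAX_PODS_PER_NODE:
        problems.append(
            f"pods per node: {plan.pods / plan.nodes:.0f} > {MAX_PODS_PER_NODE}"
        )
    return problems


if __name__ == "__main__":
    for problem in scale_violations(ClusterPlan(nodes=10, pods=2_000)):
        print(problem)
```

An empty result only means the plan is inside the documented envelope; it says nothing about whether the API server, network, or storage can carry the actual workload.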

Example Node Sizing

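The following table gives a rough idea of control plane node sizes for a given number of worker nodes, based on the sizes Kubernetes setup tooling has historically chosen on GCP and AWS. The instance families shown are older generations, so read the vCPU and memory columns rather than the instance names.

Worker nodes	GCP Example	GCP vCPU / Memory	AWS Example	AWS vCPU / Memory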
1–5	`n1-standard-1`	1 vCPU / 3.75 GB	`m3.medium`	1 vCPU / 3.75 GB
6–10	`n1-standard-2`	2 vCPU / 7.5 GB	`m3.large`	2 vCPU / 7.5 GB
11–100	`n1-standard-4`	4 vCPU / 15 GB	`m3.xlarge`	4 vCPU / 15 GB
101–250	`n1-standard-8`	8 vCPU / 30 GB	`m3.2xlarge`	8 vCPU / 30 GB
251–500	`n1-standard-16`	16 vCPU / 60 GB	`c4.4xlarge`	16 vCPU / 30 GB
500+	`n1-standard-32`	32 vCPU / 120 GB	`c4.8xlarge`	36 vCPU / 60 GB

Warning

Do not memorize instance sizes. Use them as starting points and validate using real workload metrics.
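If the table is used as a starting point in automation, it can be encoded as a small lookup. This is a minimal sketch: the function name is hypothetical, the ranges and instance names are copied from the table above, and the result should still be validated against real metrics.

```python
from bisect import bisect_left

# Inclusive upper bound of each node-count range in the table.
_UPPER_BOUNDS = [5, 10, 100, 250, 500]

# (GCP example, AWS example) per range; the last entry covers 500+.
_SIZES = [
    ("n1-standard-1", "m3.medium"),
    ("n1-standard-2", "m3.large"),
    ("n1-standard-4", "m3.xlarge"),
    ("n1-standard-8", "m3.2xlarge"),
    ("n1-standard-16", "c4.4xlarge"),
    ("n1-standard-32", "c4.8xlarge"),
]


def starting_size(node_count: int) -> tuple[str, str]:
    """Return the (GCP, AWS) example instance types for a node count."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # bisect_left finds the first range whose upper bound is >= node_count.
    return _SIZES[bisect_left(_UPPER_BOUNDS, node_count)]


if __name__ == "__main__":
    for count in (3, 50, 800):
        print(count, starting_size(count))
```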

Cloud or On-Premises?

Choose the platform based on operational control, team skill, compliance, cost, and cloud strategy.

Platform	Recommended Option
On-premises	kubeadm
GCP	GKE
AWS	EKS
Azure	AKS
AWS self-managed	kOps or kubeadm
Hybrid	Depends on networking and compliance requirements

Tip

Managed Kubernetes reduces control plane management overhead and is usually preferred unless there is a strong reason to self-manage.

Managed Kubernetes Options

GKE

Good for Google Cloud environments.

Benefits:

managed control plane
simplified upgrades
integrated load balancing
integrated storage
strong autoscaling support

EKS

Good for AWS environments.

Benefits:

AWS IAM integration
ELB integration
EBS/EFS CSI support
managed control plane
node group support

AKS

Good for Azure environments.

Benefits:

Microsoft Entra ID (formerly Azure AD) integration
Azure Disk/File integration
managed control plane
simple cluster provisioning

kubeadm

Good for on-premises or custom environments.

Benefits:

full control
useful for learning internals
works on physical or virtual machines

Tradeoff:

you manage upgrades, certificates, etcd backups, HA setup, and monitoring

Workload Planning

Workload type affects node size, autoscaling, storage, and networking.

Workload Type	Design Consideration
Web apps	autoscaling, ingress, load balancing
APIs	service discovery, HPA, rate limiting
Big data	storage throughput, CPU, memory
Analytics	memory and CPU sizing
Databases	persistent storage, backups, anti-affinity
Batch jobs	job scheduling, quotas, autoscaling
ML workloads	GPU nodes, node selectors, taints

Question

Are workloads stateless, stateful, CPU-heavy, memory-heavy, or storage-heavy?
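One way to answer this question with numbers is to check which resource runs out first when a workload is packed onto a node. The sketch below is a rough estimate under stated assumptions: resources are given as plain integers (CPU in millicores, memory in MiB), every pod has identical requests, and the default of 110 pods per node is the kubelet default. The function name is illustrative.

```python
def pods_that_fit(
    alloc_cpu_m: int,
    alloc_mem_mib: int,
    req_cpu_m: int,
    req_mem_mib: int,
    max_pods: int = 110,
) -> tuple[int, str]:
    """Return (pod count, limiting resource) for one node.

    alloc_* are the node's allocatable resources; req_* are per-pod requests.
    """
    if req_cpu_m <= 0 or req_mem_mib <= 0:
        raise ValueError("requests must be positive")
    limits = {
        "cpu": alloc_cpu_m // req_cpu_m,
        "memory": alloc_mem_mib // req_mem_mib,
        "max-pods": max_pods,
    }
    # The smallest of the three numbers decides how many pods fit.
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck


if __name__ == "__main__":
    # A node with roughly 4 vCPU / 15 GB after system reservations.
    print(pods_that_fit(3800, 14000, req_cpu_m=500, req_mem_mib=1024))
    print(pods_that_fit(3800, 14000, req_cpu_m=100, req_mem_mib=4096))
```

A CPU-bound result suggests compute-optimized nodes, while a memory-bound result suggests memory-optimized nodes or smaller per-pod requests.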

Traffic Planning

Traffic pattern affects autoscaling and ingress design.

Traffic Type	Cluster Design Need
Light traffic	small nodes, basic autoscaling
Heavy traffic	larger node pools, HPA, load balancing
Burst traffic	cluster autoscaler, HPA, buffer capacity
Internal traffic	service mesh or NetworkPolicies
Public traffic	Ingress/Gateway API, TLS, WAF

Tip

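For burst traffic, configure both Horizontal Pod Autoscaler and Cluster Autoscaler: HPA adds pod replicas as load rises, and Cluster Autoscaler adds nodes when those pods cannot be scheduled on existing capacity.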
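To see how pod autoscaling reacts to a burst, it helps to work through the replica calculation documented for the Horizontal Pod Autoscaler: the desired count is the current count scaled by the ratio of the observed metric to its target, rounded up. The sketch below is a simplified model; it ignores stabilization windows, unready pods, and missing metrics, and it does not model the Cluster Autoscaler at all. The function name and default bounds are illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA calculation: ceil(current * currentMetric / targetMetric)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no scaling action.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 replicas at 90% average CPU with a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

When the desired count exceeds what the existing nodes can hold, the extra pods stay pending until more nodes are added, which is why the two autoscalers are paired for bursty traffic.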

Node Design

Kubernetes nodes can be physical or virtual machines.

Recommended production approach:

use 64-bit Linux nodes
keep node images consistent
separate control plane and worker nodes
use multiple worker nodes
use multiple availability zones where possible
use node pools for different workload types

Note

Control plane nodes (formerly called master nodes) can technically run workloads, and kubeadm taints them by default to prevent it. Production clusters should usually keep workloads on worker nodes.

Control Plane vs Worker Nodes

Node Type	Purpose
Control Plane	Runs API server, scheduler, controller manager, etcd
Worker Node	Runs application workloads
Dedicated etcd Nodes	Optional for large HA clusters

Production recommendation:

Control Plane Nodes
  - no application workloads
  - protected and monitored
  - highly available

Worker Nodes
  - application workloads
  - autoscaling enabled
  - grouped by workload type

Warning

Avoid scheduling normal workloads on control plane nodes in production.
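Workloads are normally kept off control plane nodes through taints: a pod is only scheduled onto a tainted node if it carries a matching toleration. The sketch below models that matching rule in a simplified form. The `Taint` and `Toleration` classes are illustrative, not a Kubernetes client API, and the real scheduler evaluates many more constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with "Exists" matches every key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """True if this toleration matches this taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        return toleration.key == "" or toleration.key == taint.key
    return toleration.key == taint.key and toleration.value == taint.value


def schedulable(tolerations: list[Toleration], taints: list[Taint]) -> bool:
    """True if every hard taint on the node is tolerated by the pod."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in hard)


# The taint kubeadm applies to control plane nodes by default.
CONTROL_PLANE = Taint("node-role.kubernetes.io/control-plane", "NoSchedule")

if __name__ == "__main__":
    print(schedulable([], [CONTROL_PLANE]))  # ordinary pod, control plane node
```

An ordinary application pod has no such toleration, so it is never placed on a control plane node unless someone removes the taint or adds the toleration deliberately.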

Minimum Node Count

For production, avoid single-node designs.

Recommended baseline:

Environment	Suggested Nodes
Learning	1 node
Development	1 control plane + 2 workers
Small production	3 control plane + 3 workers
HA production	3 or 5 control plane + multiple workers
Large production	dedicated etcd + multiple control plane + multiple worker pools

Success

Three control plane nodes is the common minimum for high availability: with etcd running on those nodes, the cluster keeps quorum if one of them fails.

Storage Design

Storage design depends on workload requirements.

Key storage considerations:

Requirement	Recommendation
High performance	SSD-backed storage
Shared access	network-based storage
Stateful workloads	PersistentVolumes and PVCs
Different performance tiers	StorageClasses
Specific disk types	node labels and node selectors
Backup needs	snapshots and backup tooling

Tip

Use StorageClasses to provide different storage tiers such as silver, gold, and platinum.
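The tier idea can be sketched by generating one StorageClass manifest per tier. Everything vendor-specific here is a placeholder: the provisioner name `csi.example.com` and the `type` parameter values are hypothetical and must be replaced with the options of the CSI driver you actually use. The output is JSON, which kubectl accepts in the same way as YAML.

```python
import json

# Hypothetical tiers: parameter keys and values depend on the real CSI driver.
TIERS = {
    "silver": {"type": "hdd"},
    "gold": {"type": "ssd"},
    "platinum": {"type": "nvme"},
}


def storage_class(name: str, parameters: dict[str, str],
                  provisioner: str = "csi.example.com") -> dict:
    """Build a StorageClass manifest as a plain dictionary."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,  # placeholder driver name
        "parameters": parameters,
        "reclaimPolicy": "Delete",
        # Delay volume creation until a pod is scheduled, so the volume
        # is provisioned where the pod will actually run.
        "volumeBindingMode": "WaitForFirstConsumer",
        "allowVolumeExpansion": True,
    }


manifests = [storage_class(name, params) for name, params in TIERS.items()]

if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List", "items": manifests}
    print(json.dumps(bundle, indent=2))
```

A PersistentVolumeClaim then selects a tier by setting `storageClassName` to one of these names.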

Storage Best Practices

Recommended

Use CSI drivers for production storage
Use StorageClasses for dynamic provisioning
Use SSD-backed disks for high-performance workloads
Use backup and snapshot policies
Use node selectors for workloads needing specific disk types
Avoid hostPath for production application data

Danger

hostPath is not recommended for production storage because data is tied to a specific node: a pod rescheduled to another node no longer sees it.

Network Design

Production networking must support:

pod-to-pod communication
service networking
ingress or Gateway API
DNS resolution
NetworkPolicies
CNI plugin support
load balancer integration

Common CNI options:

Calico
Cilium
Flannel
Weave Net
cloud provider CNI plugins

Warning

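NetworkPolicy objects are enforced by the CNI plugin, not by Kubernetes itself. If you need production network security, choose a plugin that implements them, such as Calico or Cilium; Flannel on its own does not.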
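A common first policy on a NetworkPolicy-capable CNI is a default deny for ingress in each namespace. The sketch below builds that manifest as JSON; the function name and the namespace `team-a` are illustrative assumptions, and the policy has no effect on a cluster whose CNI does not enforce NetworkPolicies.

```python
import json


def default_deny_ingress(namespace: str) -> dict:
    """Build a NetworkPolicy that blocks all ingress to pods in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            # Listing "Ingress" with no ingress rules denies all inbound traffic.
            "policyTypes": ["Ingress"],
        },
    }


if __name__ == "__main__":
    # "team-a" is a hypothetical namespace name.
    print(json.dumps(default_deny_ingress("team-a"), indent=2))
```

Additional policies then allow only the traffic each application needs, since NetworkPolicies are additive.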

High Availability Design

Production HA requires removing single points of failure.

Recommended:

multiple control plane nodes
external load balancer for API server
HA etcd
multiple worker nodes
multiple zones
replicated workloads
PodDisruptionBudgets
anti-affinity rules
backup and restore plan

Tip

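Use an odd number of etcd members, such as 3 or 5. etcd needs a majority of members (quorum) to accept writes, so a 4-member cluster tolerates only one failure, the same as a 3-member cluster.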
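The reasoning behind odd member counts is plain majority arithmetic, shown in the short sketch below. The function names are illustrative.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can fail while a majority remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"members={n} quorum={quorum(n)} tolerated={failures_tolerated(n)}")
```

Going from an odd count to the next even count raises the quorum size without raising the number of tolerated failures, so the extra member adds cost and replication traffic but no resilience.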

Cluster Provisioning Options

kubeadm

Good for:

learning
on-premises clusters
custom self-managed clusters

You manage:

upgrades
certificates
etcd backups
HA setup
monitoring

Managed Kubernetes

Good for:

production cloud workloads
faster setup
reduced operational burden

Cloud provider manages:

control plane
some upgrades
API availability
cloud integrations

kOps

Good for:

self-managed AWS clusters
infrastructure automation
repeatable cluster creation

Production Best Practices

Recommended

Use managed Kubernetes where possible
Use multiple control plane nodes for HA
Keep workloads off control plane nodes
Use separate node pools for different workloads
Enable autoscaling
Use production-ready CNI
Use NetworkPolicies
Configure centralized logging and monitoring
Back up etcd
Use RBAC and least privilege
Use namespaces and resource quotas
Use StorageClasses and CSI drivers
Test upgrades before production rollout

Do's

Define the cluster purpose first
Size nodes based on workload requirements
Use HA for production control plane
Use separate node pools for different workloads
Use managed Kubernetes when operational simplicity matters
Use SSD storage for high-performance workloads
Use network storage for shared persistent data
Use monitoring, logging, and alerting
Use namespaces, quotas, and RBAC
Plan backup and disaster recovery

Don'ts

Don't use single-node clusters for production
Don't run production workloads on control plane nodes
Don't use hostPath for production persistence
Don't ignore CNI and NetworkPolicy support
Don't use one node pool for every workload type
Don't skip etcd backups
Don't over-provision without monitoring
Don't expose the API server publicly without strict controls
Don't deploy without resource requests and limits

Common Design Mistakes

Mistake	Impact
Single control plane in production	control plane outage risk
No etcd backup	cluster recovery risk
Wrong CNI choice	networking/security limitations
No resource limits	noisy neighbor issues
No node separation	workload interference
No monitoring	slow incident response
No autoscaling	poor traffic handling
No storage planning	stateful workload failures
No upgrade strategy	version drift and security risk

Bug

Many production failures come from missing operational planning, not from Kubernetes YAML syntax.

Quick Design Checklist

Before building the cluster, confirm:

What is the cluster purpose?
Is it production or non-production?
Cloud or on-premises?
Managed or self-managed?
How many workloads?
What workload types?
What traffic pattern?
What storage requirements?
What CNI plugin?
Is NetworkPolicy required?
How many control plane nodes?
How many worker nodes?
Is etcd backed up?
Is monitoring configured?
Is RBAC configured?
Is the upgrade plan defined?
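Part of this checklist can be expressed as an automated design review. The sketch below is one possible encoding, not a standard tool: the `Design` fields and the rules are assumptions drawn from the recommendations in this section, checks apply only to production designs, and control plane and etcd checks are skipped for managed clusters where the provider operates them.

```python
from dataclasses import dataclass


@dataclass
class Design:
    production: bool
    managed: bool
    control_plane_nodes: int
    worker_nodes: int
    cni_supports_network_policy: bool
    etcd_backup: bool
    monitoring: bool
    rbac: bool
    upgrade_plan: bool


def review(design: Design) -> list[str]:
    """Return checklist findings for a production design (empty if none)."""
    if not design.production:
        return []
    findings = []
    if not design.managed:
        if design.control_plane_nodes < 3:
            findings.append("use at least 3 control plane nodes")
        if design.control_plane_nodes % 2 == 0:
            findings.append("use an odd number of control plane nodes")
        if not design.etcd_backup:
            findings.append("configure etcd backups")
    if design.worker_nodes < 2:
        findings.append("use multiple worker nodes")
    if not design.cni_supports_network_policy:
        findings.append("choose a CNI that enforces NetworkPolicies")
    if not design.monitoring:
        findings.append("configure monitoring")
    if not design.rbac:
        findings.append("configure RBAC")
    if not design.upgrade_plan:
        findings.append("define an upgrade plan")
    return findings


if __name__ == "__main__":
    draft = Design(True, False, 1, 1, False, False, False, True, False)
    for finding in review(draft):
        print("-", finding)
```

Questions such as cluster purpose, workload types, and traffic pattern still need a human answer; the code only covers the items that reduce to a yes or no.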

Summary

Cluster design starts with purpose and workload requirements
Learning clusters can be simple and single-node
Development clusters should be multi-node
Production clusters require HA, security, monitoring, and backups
Managed Kubernetes is preferred when possible
Storage, networking, and node sizing must match workload needs
Keep control plane nodes dedicated in production
Use production-ready CNI, CSI, RBAC, and observability