12.02 Control Plane Failure
Control plane failure means Kubernetes management components are unhealthy or unreachable. Applications may continue running on worker nodes for some time, but scheduling, scaling, updates, and kubectl access can fail.
What to check first
Start with the basic health chain:
- Node status
- Workload Pod status
- Control plane Pods in kube-system
- Control plane services on the host
- Kubelet and kube-proxy services
- Component logs
Control Plane Components
| Component | Purpose | Failure impact |
|---|---|---|
| kube-apiserver | Entry point for Kubernetes API | kubectl and controllers cannot talk to cluster |
| etcd | Cluster state database | API server may fail or become inconsistent |
| kube-controller-manager | Runs controllers | ReplicaSet, Node, Job, and endpoint controllers may stop reconciling |
| kube-scheduler | Assigns Pods to Nodes | New Pods remain Pending |
| kubelet | Runs Pods on each node | Node cannot manage containers properly |
| kube-proxy | Service networking rules | Services may stop routing traffic correctly |
| CNI Pods | Pod networking | Pod-to-Pod communication may fail |
Note
In kubeadm clusters, control plane components usually run as static Pods in the kube-system namespace. In some non-kubeadm setups, they may run as Linux services.
Troubleshooting Flow
Control plane issue
│
├── Check nodes
│ └── kubectl get nodes
│
├── Check cluster Pods
│ └── kubectl get pods -A
│
├── Check control plane Pods
│ └── kubectl get pods -n kube-system
│
├── Check services on master/worker nodes
│ └── systemctl status <service>
│
├── Check logs
│ ├── kubectl logs -n kube-system <pod>
│ └── journalctl -u <service>
│
└── Fix failed component and re-check cluster health
Tip
If kubectl itself is not working, troubleshoot directly on the control plane node using systemctl, journalctl, static Pod manifests, container runtime logs, and certificates.
Step 1: Check Node Status
Run kubectl get nodes.
Healthy nodes should show Ready in the STATUS column.
Node status symptoms
- NotReady: kubelet, container runtime, CNI, or node network issue
- Unknown: API server cannot get updates from kubelet
- missing node: node registration or API connectivity issue
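On a larger cluster it can help to filter the node list down to the entries that need attention. The following is a minimal sketch, not an official tool: the function name `not_ready_nodes` is hypothetical, and it assumes the default column layout of `kubectl get nodes --no-headers` (NAME, STATUS, ROLES, AGE, VERSION).

```shell
# Print the name and status of every node whose STATUS is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# A cordoned node shows "Ready,SchedulingDisabled" and is reported as well.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Usage on a live cluster:
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every node reported Ready at the time of the query.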
Step 2: Check Application Pods
Before blaming the control plane, confirm that workload Pods are healthy by running kubectl get pods -A.
Note
Existing Pods can continue running even when the control plane has issues. The problem usually appears when you create, delete, scale, reschedule, or update workloads.
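A quick way to see whether workloads are broadly healthy is to count Pods per status. This is a rough sketch under stated assumptions: the helper name `pod_status_summary` is hypothetical, and it relies on STATUS being the fourth column of `kubectl get pods -A --no-headers`.

```shell
# Count Pods per STATUS across all namespaces.
# Reads `kubectl get pods -A --no-headers` on stdin; column 4 is STATUS.
pod_status_summary() {
  awk '{ count[$4]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage on a live cluster:
#   kubectl get pods -A --no-headers | pod_status_summary
```

A growing Pending count while Running stays flat points toward scheduling rather than the workloads themselves.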
Step 3: Check Control Plane Pods
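For kubeadm-based clusters, list the Pods in kube-system with kubectl get pods -n kube-system.
Expected components include the following. Static Pod names end with the node name, so the names below assume a control plane node called master.
- coredns
- etcd-master
- kube-apiserver-master
- kube-controller-manager-master
- kube-scheduler-master
- kube-proxy
- CNI Pods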
Example:
NAME READY STATUS RESTARTS AGE
coredns-xxxxx 1/1 Running 0 1h
etcd-master 1/1 Running 0 1h
kube-apiserver-master 1/1 Running 0 1h
kube-controller-manager-master 1/1 Running 0 1h
kube-scheduler-master 1/1 Running 0 1h
kube-proxy-xxxxx 1/1 Running 0 1h
Healthy control plane Pods
Control plane Pods should be in the Running state with all containers ready (for example, READY 1/1). Restart counts should not keep increasing.
Common failure signs
- CrashLoopBackOff
- Error
- Pending
- high restart count
- ImagePullBackOff
- CreateContainerConfigError
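These signs can be picked out of the Pod listing automatically. The sketch below is illustrative only: `suspect_pods` and the `MAX_RESTARTS` variable are hypothetical names, the threshold of 5 is an arbitrary example rather than a Kubernetes default, and the column positions assume `kubectl get pods -n kube-system --no-headers`.

```shell
# Flag kube-system Pods that are not Running or have restarted too often.
# Reads `kubectl get pods -n kube-system --no-headers` on stdin:
# column 3 is STATUS, column 4 is the restart count.
# MAX_RESTARTS is an illustrative threshold (default 5), not a Kubernetes setting.
suspect_pods() {
  awk -v max="${MAX_RESTARTS:-5}" \
    '$3 != "Running" || $4 + 0 > max { print $1, $3, "restarts=" $4 }'
}

# Usage on a live cluster:
#   kubectl get pods -n kube-system --no-headers | suspect_pods
```

A single snapshot cannot show whether restarts are still increasing, so run it twice a few minutes apart and compare.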
Step 4: Check Control Plane Services
If control plane components run as system services, check them on the control plane node.
systemctl status kube-apiserver
systemctl status kube-controller-manager
systemctl status kube-scheduler
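Some older systems may use the service command instead, for example service kube-apiserver status.
Healthy output should show the service as active (running).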
Warning
If the API server is down, many kubectl commands will fail. Use node-level service checks instead.
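Where the components do run as host services, the per-unit checks can be wrapped in a small loop. This is a sketch, assuming systemd and the unit names used above; the function name `report_units` is hypothetical, and unit names may differ between distributions and installers.

```shell
# Report the systemd state of each unit named on the command line.
# Only meaningful where the components run as host services, not static Pods.
report_units() {
  for unit in "$@"; do
    if systemctl is-active --quiet "$unit"; then
      echo "$unit: active"
    else
      echo "$unit: NOT active"
    fi
  done
}

# Usage on a control plane node:
#   report_units kube-apiserver kube-controller-manager kube-scheduler
```

For any unit reported as not active, follow up with systemctl status and journalctl for that unit.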
Step 5: Check Kubelet and Kube Proxy
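On all nodes, especially affected nodes, run systemctl status kubelet and systemctl status kube-proxy, or service kubelet status and service kube-proxy status on older systems. In kubeadm clusters kube-proxy usually runs as a DaemonSet Pod in kube-system rather than as a host service, so check its Pod there instead.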
Note
kubelet keeps static control plane Pods running in kubeadm clusters. If kubelet fails on the control plane node, API server, scheduler, controller manager, and etcd Pods may also fail.
Step 6: Check Control Plane Pod Logs
For kubeadm/static Pod control plane components:
kubectl logs kube-apiserver-master -n kube-system
kubectl logs kube-controller-manager-master -n kube-system
kubectl logs kube-scheduler-master -n kube-system
kubectl logs etcd-master -n kube-system
For previous container logs, add --previous, for example kubectl logs kube-apiserver-master -n kube-system --previous.
Tip
Use --previous when a control plane Pod is restarting. Current logs may not show the original crash reason.
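The choice between previous and current logs can be scripted so that the crash output is tried first. A minimal sketch, assuming the Pod lives in kube-system; the helper name `crash_logs` is hypothetical.

```shell
# Print logs from the previous (crashed) container of a kube-system Pod,
# falling back to the current container when no previous instance exists.
crash_logs() {
  pod="$1"
  kubectl logs "$pod" -n kube-system --previous 2>/dev/null ||
    kubectl logs "$pod" -n kube-system
}

# Usage:
#   crash_logs kube-apiserver-master
```

The fallback matters for Pods that have never restarted, where kubectl reports an error for --previous.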
Step 7: Check Host Service Logs
If components are services on the host, use journalctl.
sudo journalctl -u kube-apiserver
sudo journalctl -u kube-controller-manager
sudo journalctl -u kube-scheduler
sudo journalctl -u kubelet
sudo journalctl -u kube-proxy
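Follow logs live with -f, for example sudo journalctl -u kubelet -f.
Limit output to recent entries with --since, for example sudo journalctl -u kubelet --since "30 minutes ago".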
API server logs
API server logs can reveal certificate errors, etcd connection failures, invalid flags, admission plugin issues, and port binding problems.
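Long logs are easier to read once narrowed to lines that match typical failure signatures. The filter below is a heuristic sketch: `scan_apiserver_logs` is a hypothetical name, and the pattern list is an assumption based on common error wording, not an exhaustive or official catalogue.

```shell
# Keep only log lines that match common startup-failure signatures.
# The pattern list is an illustrative assumption; extend it as needed.
scan_apiserver_logs() {
  grep -Ei 'x509|certificate|connection refused|address already in use|unknown flag|admission'
}

# Usage:
#   kubectl logs kube-apiserver-master -n kube-system --previous | scan_apiserver_logs
#   sudo journalctl -u kube-apiserver | scan_apiserver_logs
```

No output only means none of these patterns matched, so read the unfiltered log before ruling a cause out.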
Static Pod Manifest Checks
In kubeadm clusters, control plane static Pod manifests are usually stored in /etc/kubernetes/manifests/.
Check files:
ls -l /etc/kubernetes/manifests/
cat /etc/kubernetes/manifests/kube-apiserver.yaml
cat /etc/kubernetes/manifests/kube-controller-manager.yaml
cat /etc/kubernetes/manifests/kube-scheduler.yaml
cat /etc/kubernetes/manifests/etcd.yaml
Be careful
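Editing a static Pod manifest causes the kubelet to recreate the corresponding control plane Pod. Always back up the manifest before making changes, and keep the backup outside the manifests directory, because the kubelet reads every file placed there.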
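A timestamped copy makes it easy to roll back a bad edit. The helper below is a sketch: `backup_manifest` is a hypothetical name, and the backup directory in the usage line is an example path, not a kubeadm convention.

```shell
# Copy a static Pod manifest into a backup directory with a timestamp suffix
# and print the path of the copy. Choose a backup directory that is NOT the
# manifests directory itself.
backup_manifest() {
  src="$1"
  dest_dir="$2"
  mkdir -p "$dest_dir" || return 1
  dest="$dest_dir/$(basename "$src").$(date +%Y%m%d-%H%M%S).bak"
  cp -p "$src" "$dest" && echo "$dest"
}

# Usage as root (the backup path is a hypothetical example):
#   backup_manifest /etc/kubernetes/manifests/kube-apiserver.yaml /root/manifest-backups
```

To roll back, copy the saved file over the edited manifest and wait for the kubelet to recreate the Pod.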
Common Root Causes
| Symptom | Likely cause | What to check |
|---|---|---|
| kubectl cannot connect | API server down or kubeconfig issue | API server status, kubeconfig, port 6443 |
| API server Pod restarting | etcd/cert/config issue | API server logs, manifest flags |
| New Pods stuck Pending | scheduler issue | scheduler Pod/service logs |
| Deployments not self-healing | controller-manager issue | controller manager logs |
| Services not routing | kube-proxy issue | kube-proxy status/logs, iptables/ipvs |
| Nodes NotReady | kubelet/CNI/runtime issue | kubelet logs, container runtime, CNI Pods |
| DNS not working | CoreDNS issue | CoreDNS Pods/logs, kube-dns Service |
| API server cannot start | etcd unreachable | etcd status/logs/certs |
| Control plane Pod CrashLoopBackOff | bad manifest or certificate | --previous logs and manifest file |
Important Ports
| Component | Port | Purpose |
|---|---|---|
| kube-apiserver | 6443 | Kubernetes API |
| etcd client | 2379 | API server to etcd |
| etcd peer | 2380 | etcd member communication |
| kubelet | 10250 | kubelet API |
| kube-scheduler | 10259 | scheduler secure port |
| kube-controller-manager | 10257 | controller manager secure port |
Warning
In production, restrict these ports using security groups, firewalls, or network policies. Do not expose control plane ports publicly except the API server through a controlled endpoint.
Commands
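One way to confirm that the ports listed above are in use on a control plane node is to compare them against the listening sockets. This is a sketch under stated assumptions: `missing_ports` is a hypothetical helper, and it expects the output format of `ss -ltnH`, where the fourth column is the local address and port.

```shell
# Print each expected port that has no listening TCP socket.
# Reads `ss -ltnH` output on stdin; column 4 is the local address:port,
# for example 127.0.0.1:2379, *:6443, or [::]:10250.
missing_ports() {
  listening=$(awk '{ n = split($4, parts, ":"); print parts[n] }')
  for port in "$@"; do
    echo "$listening" | grep -qx "$port" || echo "$port"
  done
}

# Usage on a control plane node:
#   ss -ltnH | missing_ports 6443 2379 2380 10250 10259 10257
```

A missing port is only a problem on a node that should host that component; etcd ports, for instance, appear only on etcd members.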
Production Best Practices
Do
- Run multiple control plane nodes for production.
- Put a load balancer in front of API servers.
- Use an odd number of etcd members, usually 3 or 5.
- Monitor API server, etcd, scheduler, controller manager, kubelet, and kube-proxy.
- Back up etcd regularly.
- Protect /etc/kubernetes certificates and manifests.
- Keep control plane nodes dedicated for control plane workloads.
- Track restart counts for kube-system Pods.
- Use alerting for API server latency, etcd health, and node readiness.
Don't
- Do not run production with a single control plane node.
- Do not edit static Pod manifests without backup.
- Do not expose etcd to the public network.
- Do not ignore expired certificates.
- Do not restart control plane components randomly without checking logs.
- Do not schedule application workloads on control plane nodes unless intentionally designed.
- Do not forget to check kubelet when static control plane Pods are down.
Control Plane Failure Checklist
[ ] Can kubectl connect to the API server?
[ ] Are all nodes Ready?
[ ] Are kube-system Pods Running?
[ ] Is kube-apiserver healthy?
[ ] Is etcd healthy?
[ ] Is scheduler running?
[ ] Is controller-manager running?
[ ] Is kubelet running on all nodes?
[ ] Is kube-proxy running on all nodes?
[ ] Are control plane Pods restarting?
[ ] Do logs show certificate, etcd, port, or config errors?
[ ] Are static Pod manifests valid?
[ ] Are required ports open?
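Parts of this checklist can be run as commands that print a ticked or unticked box. The wrapper below is a minimal sketch: `check` is a hypothetical helper, and the commands in the usage lines are examples that cover only some of the items.

```shell
# Run one checklist item: print "[x]" if the command succeeds, "[ ]" otherwise.
# The first argument is the description; the remaining arguments are the command.
check() {
  desc="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "[x] $desc"
  else
    echo "[ ] $desc"
  fi
}

# Usage (run on a control plane node):
#   check "Can kubectl connect to the API server?" kubectl get --raw=/readyz
#   check "Is kubelet running on this node?" systemctl is-active --quiet kubelet
```

Items such as manifest validity and log contents still need a manual read.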
Quick Investigation Example
# 1. Check nodes
kubectl get nodes
# 2. Check all system Pods
kubectl get pods -n kube-system
# 3. Check API server Pod
kubectl describe pod kube-apiserver-master -n kube-system
kubectl logs kube-apiserver-master -n kube-system --previous
# 4. Check etcd
kubectl logs etcd-master -n kube-system
# 5. Check kubelet on the node
systemctl status kubelet
sudo journalctl -u kubelet --since "30 minutes ago"
# 6. Check static Pod manifests
ls -l /etc/kubernetes/manifests/
Quote
Control plane troubleshooting is about checking the Kubernetes brain first: API server, etcd, scheduler, controller manager, kubelet, and their logs.