12.02 Control Plane Failure

Control plane failure means Kubernetes management components are unhealthy or unreachable. Applications may continue running on worker nodes for some time, but scheduling, scaling, updates, and kubectl access can fail.

What to check first

Start with the basic health chain:

Node status
Workload Pod status
Control plane Pods in kube-system
Control plane services on the host
Kubelet and kube-proxy services
Component logs

Control Plane and Node Components

Component	Purpose	Failure impact
`kube-apiserver`	Entry point for Kubernetes API	`kubectl` and controllers cannot talk to cluster
`etcd`	Cluster state database	API server may fail or become inconsistent
`kube-controller-manager`	Runs controllers	ReplicaSet, Node, Job, and endpoint controllers may stop reconciling
`kube-scheduler`	Assigns Pods to Nodes	New Pods remain `Pending`
`kubelet`	Runs Pods on each node	Node cannot manage containers properly
`kube-proxy`	Service networking rules	Services may stop routing traffic correctly
CNI Pods	Pod networking	Pod-to-Pod communication may fail

Note

In kubeadm clusters, control plane components usually run as static Pods in the kube-system namespace. In some non-kubeadm setups, they may run as Linux services.

Troubleshooting Flow

Control plane issue
│
├── Check nodes
│   └── kubectl get nodes
│
├── Check cluster Pods
│   └── kubectl get pods -A
│
├── Check control plane Pods
│   └── kubectl get pods -n kube-system
│
├── Check services on master/worker nodes
│   └── systemctl status <service>
│
├── Check logs
│   ├── kubectl logs -n kube-system <pod>
│   └── journalctl -u <service>
│
└── Fix failed component and re-check cluster health

Tip

If kubectl itself is not working, troubleshoot directly on the control plane node using systemctl, journalctl, static Pod manifests, container runtime logs, and certificates.
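When the API server is down, kubectl cannot fetch logs, but the container runtime on the control plane node still holds them. The following is a minimal sketch, assuming a CRI runtime with `crictl` installed and configured. The function name `runtime_logs` is hypothetical, and it simply takes the first container ID that `crictl` lists for the given name.

```shell
# Print recent logs for a control plane container straight from the runtime.
# Usage (on the control plane node, usually as root): runtime_logs kube-apiserver
runtime_logs() {
  name="$1"
  # -a includes exited containers, -q prints only container IDs.
  id=$(crictl ps -a --name "$name" -q | head -n 1)
  if [ -z "$id" ]; then
    echo "no container found for $name" >&2
    return 1
  fi
  crictl logs --tail 50 "$id"
}
```

Including exited containers matters here: a crashed API server container is often the one whose logs explain the failure.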

Step 1: Check Node Status

Run:

kubectl get nodes

Example:

NAME       STATUS   ROLES    AGE   VERSION
worker-1   Ready    <none>   8d    v1.xx.0
worker-2   Ready    <none>   8d    v1.xx.0

Healthy nodes should show Ready.

Node status symptoms

NotReady: kubelet, container runtime, CNI, or node network issue
Unknown: the control plane is not receiving status updates from the node's kubelet
missing node: node registration or API connectivity issue
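On larger clusters it can help to list only the nodes that need attention. This is a small sketch that filters the plain-text output of `kubectl get nodes --no-headers`; the helper name `not_ready_nodes` is hypothetical, and it assumes the default column layout (NAME first, STATUS second).

```shell
# Reads `kubectl get nodes --no-headers` output on stdin and prints
# "<node> <status>" for every node whose STATUS does not start with Ready.
# Usage: kubectl get nodes --no-headers | not_ready_nodes
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1, $2 }'
}
```

Matching on the prefix keeps cordoned nodes (`Ready,SchedulingDisabled`) out of the list, since they are healthy but intentionally unschedulable.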

Step 2: Check Application Pods

Before blaming the control plane, confirm whether workload Pods are still running.

kubectl get pods

Example:

NAME           READY   STATUS    RESTARTS   AGE
mysql          1/1     Running   0          113m
webapp-mysql   1/1     Running   0          113m

Note

Existing Pods can continue running even when the control plane has issues. The problem usually appears when you create, delete, scale, reschedule, or update workloads.

Step 3: Check Control Plane Pods

For kubeadm-based clusters, check kube-system.

kubectl get pods -n kube-system

Expected components include:

coredns
etcd-master
kube-apiserver-master
kube-controller-manager-master
kube-scheduler-master
kube-proxy
cni pods

Example:

NAME                            READY   STATUS    RESTARTS   AGE
coredns-xxxxx                   1/1     Running   0          1h
etcd-master                     1/1     Running   0          1h
kube-apiserver-master           1/1     Running   0          1h
kube-controller-manager-master  1/1     Running   0          1h
kube-scheduler-master           1/1     Running   0          1h
kube-proxy-xxxxx                1/1     Running   0          1h

Healthy control plane Pods

Control plane Pods should be Running with all containers ready (for example, READY 1/1). Restart counts should not keep increasing.

Common failure signs

CrashLoopBackOff
Error
Pending
high restart count
ImagePullBackOff
CreateContainerConfigError
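The failure signs above can be scanned for automatically. The sketch below filters `kubectl get pods -n kube-system --no-headers` output; the helper name `unhealthy_pods` and the default restart threshold of 5 are assumptions chosen for illustration, not recommended values.

```shell
# Reads `kubectl get pods -n kube-system --no-headers` output on stdin.
# Prints "<pod> <reason>" for Pods that are not Running/Completed, not fully
# ready, or restarted more than the threshold (first argument, default 5).
# Usage: kubectl get pods -n kube-system --no-headers | unhealthy_pods 5
unhealthy_pods() {
  awk -v max="${1:-5}" '
    {
      split($2, ready, "/")   # READY column, for example 1/1
      if ($3 != "Running" && $3 != "Completed")
        print $1, $3
      else if ($3 == "Running" && ready[1] != ready[2])
        print $1, "ready=" $2
      else if ($4 + 0 > max + 0)
        print $1, "restarts=" $4
    }'
}
```

A high restart count on its own only says the Pod crashed at some point; compare two runs a few minutes apart to see whether the count is still climbing.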

Step 4: Check Control Plane Services

If control plane components run as system services, check them on the control plane node.

systemctl status kube-apiserver
systemctl status kube-controller-manager
systemctl status kube-scheduler

Some older systems may use:

service kube-apiserver status
service kube-controller-manager status
service kube-scheduler status

Healthy output should show:

Active: active (running)

Warning

If the API server is down, many kubectl commands will fail. Use node-level service checks instead.

Step 5: Check Kubelet and Kube Proxy

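On all nodes, especially affected nodes, check the host services. In kubeadm clusters kube-proxy normally runs as a DaemonSet Pod in kube-system rather than as a host service, so check its Pods there instead.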

systemctl status kubelet
systemctl status kube-proxy

Or:

service kubelet status
service kube-proxy status

Note

kubelet keeps static control plane Pods running in kubeadm clusters. If kubelet fails on the control plane node, API server, scheduler, controller manager, and etcd Pods may also fail.
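The service checks from Steps 4 and 5 can be run in one pass. This is a sketch assuming a systemd host; the function name `service_states` is hypothetical. On a kubeadm node only kubelet is expected to report `active`, because the other components run as Pods.

```shell
# Print "<service>: <state>" for each systemd unit name given as an argument.
# Usage: service_states kubelet kube-apiserver kube-controller-manager kube-scheduler
service_states() {
  for svc in "$@"; do
    # is-active prints the state (active, inactive, failed, ...) and exits
    # non-zero when the unit is not active, so the exit status is ignored.
    state=$(systemctl is-active "$svc" 2>/dev/null || true)
    echo "$svc: ${state:-unknown}"
  done
}
```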

Step 6: Check Control Plane Pod Logs

For kubeadm/static Pod control plane components (static Pod names end with the node name; these examples assume a node named master):

kubectl logs kube-apiserver-master -n kube-system
kubectl logs kube-controller-manager-master -n kube-system
kubectl logs kube-scheduler-master -n kube-system
kubectl logs etcd-master -n kube-system

For previous container logs:

kubectl logs kube-apiserver-master -n kube-system --previous

Tip

Use --previous when a control plane Pod is restarting. Current logs may not show the original crash reason.
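Because static Pod names include the node name, selecting by label avoids hard-coding them. The sketch below assumes the `component=<name>` label that kubeadm-generated manifests typically carry; verify it on your cluster with `kubectl get pods -n kube-system --show-labels`. The function name `component_logs` is hypothetical.

```shell
# Fetch recent logs for a control plane component by label instead of Pod name.
# Usage: component_logs kube-apiserver 100
component_logs() {
  component="$1"
  lines="${2:-50}"
  kubectl logs -n kube-system -l "component=$component" --tail="$lines"
}
```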

Step 7: Check Host Service Logs

If components are services on the host, use journalctl.

sudo journalctl -u kube-apiserver
sudo journalctl -u kube-controller-manager
sudo journalctl -u kube-scheduler
sudo journalctl -u kubelet
sudo journalctl -u kube-proxy

Follow logs live:

sudo journalctl -u kubelet -f

Limit output to recent logs:

sudo journalctl -u kube-apiserver --since "30 minutes ago"

API server logs

API server logs can reveal certificate errors, etcd connection failures, invalid flags, admission plugin issues, and port binding problems.
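Certificate errors are often caused by expiry, which can be checked directly with `openssl`. This is a minimal sketch: the function name `cert_valid_for` is hypothetical, and the path in the usage line assumes the kubeadm default certificate directory `/etc/kubernetes/pki`. On kubeadm clusters, `kubeadm certs check-expiration` gives a similar overview for all managed certificates.

```shell
# Succeeds if the certificate at $1 stays valid for at least $2 more seconds
# (default: 30 days). Requires openssl.
# Usage: cert_valid_for /etc/kubernetes/pki/apiserver.crt || echo "expiring soon"
cert_valid_for() {
  cert="$1"
  seconds="${2:-2592000}"
  # -checkend exits 0 if the certificate is still valid after that many seconds.
  openssl x509 -in "$cert" -noout -checkend "$seconds" >/dev/null
}
```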

Static Pod Manifest Checks

In kubeadm clusters, control plane static Pod manifests are usually stored here:

/etc/kubernetes/manifests/

Check files:

ls -l /etc/kubernetes/manifests/
cat /etc/kubernetes/manifests/kube-apiserver.yaml
cat /etc/kubernetes/manifests/kube-controller-manager.yaml
cat /etc/kubernetes/manifests/kube-scheduler.yaml
cat /etc/kubernetes/manifests/etcd.yaml

Be careful

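Editing a static Pod manifest causes kubelet to recreate that control plane Pod. Always back up the manifest before making changes, and keep the backup outside /etc/kubernetes/manifests/ so kubelet does not try to load it as another manifest.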

sudo cp /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml.bak
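For repeated edits, a timestamped backup avoids overwriting an earlier known-good copy. This is a sketch; the function name `backup_manifest` and the default destination `/tmp` are illustrative choices, and a persistent directory is a better destination on real systems.

```shell
# Copy a manifest to a timestamped backup in a separate directory and print
# the backup path. Run as a user that can read the manifest.
# Usage: backup_manifest /etc/kubernetes/manifests/kube-apiserver.yaml /root/manifest-backups
backup_manifest() {
  src="$1"
  dest_dir="${2:-/tmp}"
  dest="$dest_dir/$(basename "$src").$(date +%Y%m%d-%H%M%S).bak"
  # -p preserves mode and timestamps of the original file.
  cp -p "$src" "$dest" && echo "$dest"
}
```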

Common Root Causes

Symptom	Likely cause	What to check
`kubectl` cannot connect	API server down or kubeconfig issue	API server status, kubeconfig, port `6443`
API server Pod restarting	etcd/cert/config issue	API server logs, manifest flags
New Pods stuck `Pending`	scheduler issue	scheduler Pod/service logs
Deployments not self-healing	controller-manager issue	controller manager logs
Services not routing	kube-proxy issue	kube-proxy status/logs, iptables/ipvs
Nodes `NotReady`	kubelet/CNI/runtime issue	kubelet logs, container runtime, CNI Pods
DNS not working	CoreDNS issue	CoreDNS Pods/logs, kube-dns Service
API server cannot start	etcd unreachable	etcd status/logs/certs
Control plane Pod `CrashLoopBackOff`	bad manifest or certificate	`--previous` logs and manifest file

Important Ports

Component	Port	Purpose
kube-apiserver	`6443`	Kubernetes API
etcd client	`2379`	API server to etcd
etcd peer	`2380`	etcd member communication
kubelet	`10250`	kubelet API
kube-scheduler	`10259`	scheduler secure port
kube-controller-manager	`10257`	controller manager secure port

Warning

In production, restrict these ports using security groups, firewalls, or network policies. Do not expose control plane ports publicly except the API server through a controlled endpoint.
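To confirm that the expected ports are actually listening on a control plane node, the output of `ss` can be compared against the table above. This is a sketch assuming the `ss` tool from iproute2; the function name `missing_ports` is hypothetical.

```shell
# Reads `ss -tlnH` output on stdin and prints each expected port (given as
# arguments) that has no listening socket.
# Usage: ss -tlnH | missing_ports 6443 2379 2380 10250 10259 10257
missing_ports() {
  awk -v want="$*" '
    { port = $4; sub(/.*:/, "", port); seen[port] = 1 }  # $4 = local addr:port
    END {
      n = split(want, w, " ")
      for (i = 1; i <= n; i++) if (!(w[i] in seen)) print w[i]
    }'
}
```

This only shows whether a process is listening locally. Reachability from other nodes still depends on firewalls and security groups.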

Commands

Cluster overview

kubectl get nodes
kubectl get pods -A
kubectl get pods -n kube-system
kubectl cluster-info

Control plane Pods

kubectl get pods -n kube-system
kubectl describe pod kube-apiserver-master -n kube-system
kubectl logs kube-apiserver-master -n kube-system
kubectl logs kube-apiserver-master -n kube-system --previous

Host services

systemctl status kubelet
systemctl status kube-apiserver
systemctl status kube-controller-manager
systemctl status kube-scheduler
systemctl status kube-proxy

Journal logs

sudo journalctl -u kubelet -f
sudo journalctl -u kube-apiserver --since "30 minutes ago"
sudo journalctl -u kube-controller-manager --since "30 minutes ago"
sudo journalctl -u kube-scheduler --since "30 minutes ago"

Static manifests

ls -l /etc/kubernetes/manifests/
sudo vi /etc/kubernetes/manifests/kube-apiserver.yaml
sudo vi /etc/kubernetes/manifests/etcd.yaml

Production Best Practices

Do

Run multiple control plane nodes for production.
Put a load balancer in front of API servers.
Use an odd number of etcd members, usually 3 or 5.
Monitor API server, etcd, scheduler, controller manager, kubelet, and kube-proxy.
Back up etcd regularly.
Protect /etc/kubernetes certificates and manifests.
Keep control plane nodes dedicated for control plane workloads.
Track restart counts for kube-system Pods.
Use alerting for API server latency, etcd health, and node readiness.

Don't

Do not run production with a single control plane node.
Do not edit static Pod manifests without backup.
Do not expose etcd to the public network.
Do not ignore expired certificates.
Do not restart control plane components randomly without checking logs.
Do not schedule application workloads on control plane nodes unless intentionally designed.
Do not forget to check kubelet when static control plane Pods are down.

Control Plane Failure Checklist

[ ] Can kubectl connect to the API server?
[ ] Are all nodes Ready?
[ ] Are kube-system Pods Running?
[ ] Is kube-apiserver healthy?
[ ] Is etcd healthy?
[ ] Is scheduler running?
[ ] Is controller-manager running?
[ ] Is kubelet running on all nodes?
[ ] Is kube-proxy running on all nodes?
[ ] Are control plane Pods restarting?
[ ] Do logs show certificate, etcd, port, or config errors?
[ ] Are static Pod manifests valid?
[ ] Are required ports open?
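The API server and etcd items in the checklist can be probed through the API server's verbose health endpoint, which lists each internal check with a `[+]` or `[-]` prefix. This is a sketch assuming the API server listens on `127.0.0.1:6443` and allows unauthenticated access to `/readyz`, which is a common default but may be disabled on hardened clusters. The function name `failed_health_checks` is hypothetical.

```shell
# Reads verbose health output on stdin and prints only the failed checks,
# which the API server marks with a leading "[-]".
# Usage: curl -sk "https://127.0.0.1:6443/readyz?verbose" | failed_health_checks
failed_health_checks() {
  awk 'index($0, "[-]") == 1'
}
```

Empty output combined with a successful `curl` suggests every readiness check passed. If `curl` itself cannot connect, the API server is not listening and the node-level checks apply.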

Quick Investigation Example

# 1. Check nodes
kubectl get nodes

# 2. Check all system Pods
kubectl get pods -n kube-system

# 3. Check API server Pod
kubectl describe pod kube-apiserver-master -n kube-system
kubectl logs kube-apiserver-master -n kube-system --previous

# 4. Check etcd
kubectl logs etcd-master -n kube-system

# 5. Check kubelet on the node
systemctl status kubelet
sudo journalctl -u kubelet --since "30 minutes ago"

# 6. Check static Pod manifests
ls -l /etc/kubernetes/manifests/

Quote

Control plane troubleshooting is about checking the Kubernetes brain first: API server, etcd, scheduler, controller manager, kubelet, and their logs.