12.02 Control Plane Failure

Control plane failure means Kubernetes management components are unhealthy or unreachable. Applications may continue running on worker nodes for some time, but scheduling, scaling, updates, and kubectl access can fail.

What to check first

Start with the basic health chain:

  1. Node status
  2. Workload Pod status
  3. Control plane Pods in kube-system
  4. Control plane services on the host
  5. Kubelet and kube-proxy services
  6. Component logs

Control Plane Components

| Component | Purpose | Failure impact |
| --- | --- | --- |
| kube-apiserver | Entry point for Kubernetes API | kubectl and controllers cannot talk to cluster |
| etcd | Cluster state database | API server may fail or become inconsistent |
| kube-controller-manager | Runs controllers | ReplicaSet, Node, Job, and endpoint controllers may stop reconciling |
| kube-scheduler | Assigns Pods to Nodes | New Pods remain Pending |
| kubelet | Runs Pods on each node | Node cannot manage containers properly |
| kube-proxy | Service networking rules | Services may stop routing traffic correctly |
| CNI Pods | Pod networking | Pod-to-Pod communication may fail |

Note

In kubeadm clusters, control plane components usually run as static Pods in the kube-system namespace. In some non-kubeadm setups, they may run as Linux services.


Troubleshooting Flow

Control plane issue
├── Check nodes
│   └── kubectl get nodes
├── Check cluster Pods
│   └── kubectl get pods -A
├── Check control plane Pods
│   └── kubectl get pods -n kube-system
├── Check services on control plane/worker nodes
│   └── systemctl status <service>
├── Check logs
│   ├── kubectl logs -n kube-system <pod>
│   └── journalctl -u <service>
└── Fix failed component and re-check cluster health

Tip

If kubectl itself is not working, troubleshoot directly on the control plane node using systemctl, journalctl, static Pod manifests, container runtime logs, and certificates.


Step 1: Check Node Status

Run:

kubectl get nodes

Example:

NAME       STATUS   ROLES    AGE   VERSION
worker-1   Ready    <none>   8d    v1.xx.0
worker-2   Ready    <none>   8d    v1.xx.0

Healthy nodes should show Ready.

Node status symptoms

  • NotReady: kubelet, container runtime, CNI, or node network issue
  • Unknown: the control plane has stopped receiving status updates from the kubelet (node down, kubelet stopped, or network partition)
  • missing node: node registration or API connectivity issue
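
As a small sketch of how these symptoms can be picked out of a long node list, the helper below filters `kubectl get nodes --no-headers` output down to nodes whose STATUS is not exactly Ready. The function name `not_ready_nodes` is made up for this example, and a cordoned node (Ready,SchedulingDisabled) is reported as well.

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints "<name> <status>" for every node whose STATUS is not exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every node reported Ready at the time of the query.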

Step 2: Check Application Pods

Before blaming the control plane, confirm that workload Pods are still running.

kubectl get pods

Example:

NAME           READY   STATUS    RESTARTS   AGE
mysql          1/1     Running   0          113m
webapp-mysql   1/1     Running   0          113m

Note

Existing Pods can continue running even when the control plane has issues. The problem usually appears when you create, delete, scale, reschedule, or update workloads.


Step 3: Check Control Plane Pods

For kubeadm-based clusters, check kube-system.

kubectl get pods -n kube-system

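Expected components include the following. Static Pod names end with the node name; the examples here assume a control plane node named master: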

coredns
etcd-master
kube-apiserver-master
kube-controller-manager-master
kube-scheduler-master
kube-proxy
cni pods

Example:

NAME                            READY   STATUS    RESTARTS   AGE
coredns-xxxxx                   1/1     Running   0          1h
etcd-master                     1/1     Running   0          1h
kube-apiserver-master           1/1     Running   0          1h
kube-controller-manager-master  1/1     Running   0          1h
kube-scheduler-master           1/1     Running   0          1h
kube-proxy-xxxxx                1/1     Running   0          1h

Healthy control plane Pods

Control plane Pods should be Running with every container ready (for example, 1/1 in the READY column). Restart counts should not keep increasing.

Common failure signs

  • CrashLoopBackOff
  • Error
  • Pending
  • high restart count
  • ImagePullBackOff
  • CreateContainerConfigError
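
The failure signs above can be checked mechanically. The sketch below is one possible filter, not a standard tool: the function name `unhealthy_pods` and the default restart threshold of 5 are assumptions made for illustration. It reads `kubectl get pods --no-headers` output and prints Pods that are not Running, not fully ready, or above the restart threshold.

```shell
# Hypothetical helper: reads `kubectl get pods -n kube-system --no-headers`
# output on stdin (columns: NAME READY STATUS RESTARTS AGE).
# Prints Pods that are not Running, whose READY count is incomplete,
# or whose restart count exceeds the threshold in $1 (default 5, an assumption).
unhealthy_pods() {
  awk -v max="${1:-5}" '
    {
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2] || $4 + 0 > max)
        print $1, $2, $3, $4
    }'
}

# Usage against a live cluster:
#   kubectl get pods -n kube-system --no-headers | unhealthy_pods 5
```

A single high restart count is not proof of a current problem; compare two runs a few minutes apart to see whether the count is still rising.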

Step 4: Check Control Plane Services

If control plane components run as system services, check them on the control plane node.

systemctl status kube-apiserver
systemctl status kube-controller-manager
systemctl status kube-scheduler

Some older systems may use:

service kube-apiserver status
service kube-controller-manager status
service kube-scheduler status

Healthy output should show:

Active: active (running)

Warning

If the API server is down, many kubectl commands will fail. Use node-level service checks instead.


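Step 5: Check Kubelet and kube-proxy

On all nodes, especially affected nodes, check the node-level services. In kubeadm clusters kube-proxy usually runs as a DaemonSet Pod in kube-system rather than as a host service, so the kube-proxy commands below apply only when it is installed as a service: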

systemctl status kubelet
systemctl status kube-proxy

Or:

service kubelet status
service kube-proxy status

Note

kubelet keeps static control plane Pods running in kubeadm clusters. If kubelet fails on the control plane node, API server, scheduler, controller manager, and etcd Pods may also fail.
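
To check several node services in one pass, a loop over `systemctl is-active` is enough. This is a minimal sketch; the function name `unit_states` is hypothetical, and the unit names you pass depend on how the node was installed (containerd in the usage line is an assumption about the container runtime).

```shell
# Hypothetical helper: prints "<unit> <state>" for each systemd unit given.
# `systemctl is-active` prints states such as active, inactive, or failed.
unit_states() {
  for unit in "$@"; do
    echo "$unit $(systemctl is-active "$unit" 2>/dev/null)"
  done
}

# Usage on a node:
#   unit_states kubelet containerd
```

Any state other than active for kubelet on a control plane node is worth investigating before looking at the static Pods themselves.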


Step 6: Check Control Plane Pod Logs

For kubeadm/static Pod control plane components:

kubectl logs kube-apiserver-master -n kube-system
kubectl logs kube-controller-manager-master -n kube-system
kubectl logs kube-scheduler-master -n kube-system
kubectl logs etcd-master -n kube-system

For logs from the previous container instance:

kubectl logs kube-apiserver-master -n kube-system --previous

Tip

Use --previous when a control plane Pod is restarting. Current logs may not show the original crash reason.
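
When the API server itself is down, kubectl logs is unavailable and the same logs have to be read through the container runtime on the control plane node. The sketch below assumes a CRI runtime with crictl installed and assumes crictl lists the newest matching container first; the function name `runtime_logs` is hypothetical.

```shell
# Hypothetical helper: show the last 50 log lines of a control plane container
# through the container runtime, for use when kubectl cannot reach the API server.
# Assumes crictl is installed and lists the newest matching container first.
runtime_logs() {
  component="${1:-kube-apiserver}"
  # -a includes exited containers, which matters for crash loops.
  id=$(sudo crictl ps -a --name "$component" -q | head -n 1)
  if [ -z "$id" ]; then
    echo "no container found for $component" >&2
    return 1
  fi
  sudo crictl logs --tail 50 "$id"
}

# Usage on the control plane node:
#   runtime_logs kube-apiserver
#   runtime_logs etcd
```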


Step 7: Check Host Service Logs

If components are services on the host, use journalctl.

sudo journalctl -u kube-apiserver
sudo journalctl -u kube-controller-manager
sudo journalctl -u kube-scheduler
sudo journalctl -u kubelet
sudo journalctl -u kube-proxy

Follow logs live:

sudo journalctl -u kubelet -f

Limit output to recent logs:

sudo journalctl -u kube-apiserver --since "30 minutes ago"

API server logs

API server logs can reveal certificate errors, etcd connection failures, invalid flags, admission plugin issues, and port binding problems.
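
A quick way to surface those categories in a long log is a keyword filter. The sketch below is only a heuristic: the function name `apiserver_error_hints` and the keyword list are assumptions chosen for illustration, and real log messages vary by version.

```shell
# Hypothetical helper: keep only log lines that mention one of the failure
# categories above. The keyword list is a heuristic, not an exhaustive set.
# Exits non-zero when nothing matches, as grep does.
apiserver_error_hints() {
  grep -Ei 'x509|certificate|etcd|unknown flag|admission|address already in use'
}

# Usage:
#   sudo journalctl -u kube-apiserver --since "30 minutes ago" | apiserver_error_hints
#   kubectl logs kube-apiserver-master -n kube-system --previous | apiserver_error_hints
```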


Static Pod Manifest Checks

In kubeadm clusters, control plane static Pod manifests are usually stored here:

/etc/kubernetes/manifests/

Check files:

ls -l /etc/kubernetes/manifests/
cat /etc/kubernetes/manifests/kube-apiserver.yaml
cat /etc/kubernetes/manifests/kube-controller-manager.yaml
cat /etc/kubernetes/manifests/kube-scheduler.yaml
cat /etc/kubernetes/manifests/etcd.yaml

Be careful

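Editing a static Pod manifest causes the kubelet to recreate that control plane Pod as soon as the file is saved. Always back up the manifest first, and keep the backup outside /etc/kubernetes/manifests/ so the kubelet does not treat the copy as another static Pod.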

sudo cp /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml.bak
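
To back up every manifest at once, the single copy above can be generalized. This is a sketch under assumptions: the function name `backup_manifests` and the default backup location /root/manifest-backups are hypothetical, and the function must run as root to read the real manifests.

```shell
# Hypothetical helper: copy every manifest into a timestamped directory and
# print that directory. The default backup location is an assumption; it must
# live outside the manifests directory.
backup_manifests() {
  src="${1:-/etc/kubernetes/manifests}"
  dest="${2:-/root/manifest-backups}/$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest" || return 1
  # -p preserves mode and timestamps so a restored file matches the original.
  cp -p "$src"/*.yaml "$dest"/ || return 1
  echo "$dest"
}

# Usage as root on the control plane node:
#   backup_manifests
```

Restoring is then a plain copy of the saved file back into /etc/kubernetes/manifests/.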

Common Root Causes

| Symptom | Likely cause | What to check |
| --- | --- | --- |
| kubectl cannot connect | API server down or kubeconfig issue | API server status, kubeconfig, port 6443 |
| API server Pod restarting | etcd/cert/config issue | API server logs, manifest flags |
| New Pods stuck Pending | scheduler issue | scheduler Pod/service logs |
| Deployments not self-healing | controller-manager issue | controller manager logs |
| Services not routing | kube-proxy issue | kube-proxy status/logs, iptables/ipvs |
| Nodes NotReady | kubelet/CNI/runtime issue | kubelet logs, container runtime, CNI Pods |
| DNS not working | CoreDNS issue | CoreDNS Pods/logs, kube-dns Service |
| API server cannot start | etcd unreachable | etcd status/logs/certs |
| Control plane Pod CrashLoopBackOff | bad manifest or certificate | --previous logs and manifest file |

Important Ports

| Component | Port | Purpose |
| --- | --- | --- |
| kube-apiserver | 6443 | Kubernetes API |
| etcd client | 2379 | API server to etcd |
| etcd peer | 2380 | etcd member communication |
| kubelet | 10250 | kubelet API |
| kube-scheduler | 10259 | scheduler secure port |
| kube-controller-manager | 10257 | controller manager secure port |

Warning

In production, restrict these ports using security groups, firewalls, or network policies. Do not expose control plane ports publicly except the API server through a controlled endpoint.
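
To see which of these ports accept TCP connections, a small probe loop can help. The sketch below relies on bash's /dev/tcp redirection and the coreutils `timeout` command; the function names `port_state` and `check_control_plane_ports` are hypothetical. Some components may listen only on the loopback address, so a port can appear closed from a remote host while being healthy locally.

```shell
# Hypothetical helper: prints "open" if a TCP connection to host $1, port $2
# succeeds within 2 seconds, otherwise "closed". Requires bash and timeout.
port_state() {
  if timeout 2 bash -c 'exec 3<>/dev/tcp/"$1"/"$2"' _ "$1" "$2" 2>/dev/null; then
    echo open
  else
    echo closed
  fi
}

# Hypothetical helper: probes the control plane ports listed in this section
# on the given host (default 127.0.0.1) and prints "<port> <state>" per line.
check_control_plane_ports() {
  host="${1:-127.0.0.1}"
  for port in 6443 2379 2380 10250 10257 10259; do
    echo "$port $(port_state "$host" "$port")"
  done
}

# Usage on the control plane node:
#   check_control_plane_ports 127.0.0.1
```

An open port only shows that something is listening; it does not confirm that the component behind it is healthy.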


Commands

kubectl get nodes
kubectl get pods -A
kubectl get pods -n kube-system
kubectl cluster-info
kubectl describe pod kube-apiserver-master -n kube-system
kubectl logs kube-apiserver-master -n kube-system
kubectl logs kube-apiserver-master -n kube-system --previous
systemctl status kubelet
systemctl status kube-apiserver
systemctl status kube-controller-manager
systemctl status kube-scheduler
systemctl status kube-proxy
sudo journalctl -u kubelet -f
sudo journalctl -u kube-apiserver --since "30 minutes ago"
sudo journalctl -u kube-controller-manager --since "30 minutes ago"
sudo journalctl -u kube-scheduler --since "30 minutes ago"
ls -l /etc/kubernetes/manifests/
sudo vi /etc/kubernetes/manifests/kube-apiserver.yaml
sudo vi /etc/kubernetes/manifests/etcd.yaml

Production Best Practices

Do

  • Run multiple control plane nodes for production.
  • Put a load balancer in front of API servers.
  • Use an odd number of etcd members, usually 3 or 5.
  • Monitor API server, etcd, scheduler, controller manager, kubelet, and kube-proxy.
  • Back up etcd regularly.
  • Protect /etc/kubernetes certificates and manifests.
  • Keep control plane nodes dedicated to control plane workloads.
  • Track restart counts for kube-system Pods.
  • Use alerting for API server latency, etcd health, and node readiness.

Don't

  • Do not run production with a single control plane node.
  • Do not edit static Pod manifests without backup.
  • Do not expose etcd to the public network.
  • Do not ignore expired certificates.
  • Do not restart control plane components before checking their logs.
  • Do not schedule application workloads on control plane nodes unless intentionally designed.
  • Do not forget to check kubelet when static control plane Pods are down.
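
Certificate expiry, one of the items above, can be checked ahead of time with openssl. The sketch below uses `openssl x509 -checkend`, which takes a number of seconds; the function name `cert_expires_soon` and the 30-day default are assumptions. In kubeadm clusters the certificates normally live under /etc/kubernetes/pki.

```shell
# Hypothetical helper: exit 0 when the certificate at $1 expires within
# $2 days (default 30, an assumption); exit 1 when it does not;
# exit 2 when the file cannot be read.
cert_expires_soon() {
  [ -r "$1" ] || return 2
  days="${2:-30}"
  # -checkend N exits 0 if the certificate is still valid N seconds from now.
  ! openssl x509 -checkend "$((days * 86400))" -noout -in "$1" >/dev/null 2>&1
}

# Usage as root on the control plane node:
#   for crt in /etc/kubernetes/pki/*.crt; do
#     cert_expires_soon "$crt" 30 && echo "renew soon: $crt"
#   done
```

On kubeadm clusters, `kubeadm certs check-expiration` reports similar information for all managed certificates.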

Control Plane Failure Checklist

[ ] Can kubectl connect to the API server?
[ ] Are all nodes Ready?
[ ] Are kube-system Pods Running?
[ ] Is kube-apiserver healthy?
[ ] Is etcd healthy?
[ ] Is scheduler running?
[ ] Is controller-manager running?
[ ] Is kubelet running on all nodes?
[ ] Is kube-proxy running on all nodes?
[ ] Are control plane Pods restarting?
[ ] Do logs show certificate, etcd, port, or config errors?
[ ] Are static Pod manifests valid?
[ ] Are required ports open?
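
Parts of this checklist can be automated with a tiny wrapper that runs a command and prints a ticked or unticked box. This is a sketch only; the function name `run_check` is hypothetical, and the commands in the usage lines assume kubectl and systemd are available where the script runs.

```shell
# Hypothetical helper: run a command quietly and print a checklist line.
# $1 is the description; the remaining arguments are the command to run.
# Returns the check result so callers can count failures.
run_check() {
  desc="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "[x] $desc"
  else
    echo "[ ] $desc"
    return 1
  fi
}

# Usage:
#   run_check "kubectl can reach the API server" kubectl get --raw /readyz
#   run_check "kubelet is running" systemctl is-active --quiet kubelet
```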

Quick Investigation Example

# 1. Check nodes
kubectl get nodes

# 2. Check all system Pods
kubectl get pods -n kube-system

# 3. Check API server Pod
kubectl describe pod kube-apiserver-master -n kube-system
kubectl logs kube-apiserver-master -n kube-system --previous

# 4. Check etcd
kubectl logs etcd-master -n kube-system

# 5. Check kubelet on the node
systemctl status kubelet
sudo journalctl -u kubelet --since "30 minutes ago"

# 6. Check static Pod manifests
ls -l /etc/kubernetes/manifests/
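
Step 4 above reads etcd logs through kubectl, which requires a working API server. As a complementary sketch, etcd can be asked for its own health directly. This assumes etcdctl is installed on the control plane node and that the kubeadm default certificate paths are in use; the function name `etcd_health` is hypothetical.

```shell
# Hypothetical helper: query etcd health over the local client port.
# Assumes etcdctl is installed on the node; $1 overrides the certificate
# directory (default: the kubeadm location /etc/kubernetes/pki/etcd).
etcd_health() {
  pki="${1:-/etc/kubernetes/pki/etcd}"
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert="$pki/ca.crt" \
    --cert="$pki/server.crt" \
    --key="$pki/server.key" \
    endpoint health
}

# Usage as root on the control plane node:
#   etcd_health
```

Reading the key files requires root, so run the function from a root shell rather than prefixing it with sudo.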

Quote

Control plane troubleshooting is about checking the Kubernetes brain first: API server, etcd, scheduler, controller manager, kubelet, and their logs.