12.03 Worker Node Failure

Worker node failure means one or more worker nodes cannot properly run Pods or communicate with the control plane. The node may show NotReady, Unknown, or resource pressure conditions such as disk, memory, or PID pressure.

Goal

Troubleshoot worker node failures by checking:

  1. Node status
  2. Node conditions
  3. Node resource usage
  4. Kubelet status
  5. Kubelet logs
  6. Certificates
  7. Container runtime and network health

Worker Node Troubleshooting Flow

Worker node issue
├── Check node status
│   └── kubectl get nodes
├── Describe the affected node
│   └── kubectl describe node <node-name>
├── Check node resources
│   ├── top
│   └── df -h
├── Check kubelet
│   ├── systemctl status kubelet
│   └── journalctl -u kubelet
├── Check certificates
│   └── openssl x509 -in <cert-file> -text
├── Check container runtime
│   └── systemctl status containerd
└── Fix root cause and re-check node readiness

Tip

Always start from kubectl get nodes. It gives the fastest signal about whether the issue is cluster-wide or isolated to one worker node.


Step 1: Check Node Status

Run:

kubectl get nodes

Example:

NAME       STATUS                     ROLES    AGE   VERSION
worker-1   Ready                      <none>   8d    v1.13.0
worker-2   NotReady                   <none>   8d    v1.13.0
worker-3   Ready,SchedulingDisabled   <none>   8d    v1.13.0

Status               Meaning
------               -------
Ready                Node is healthy and can run Pods
NotReady             Node is not healthy or kubelet is not reporting correctly
Unknown              Control plane stopped receiving node heartbeats
SchedulingDisabled   Node is cordoned; no new Pods will be scheduled

Important

Ready,SchedulingDisabled does not mean the node is broken. It means the node is marked unschedulable, usually by kubectl cordon or kubectl drain. Pods already running on the node keep running; only new Pods are kept away.
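
On a cluster with many nodes it can help to filter the list down to the nodes that need attention. The following is a minimal bash sketch, not part of the original procedure: the function name unhealthy_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes, where STATUS is the second column.

```shell
# Read `kubectl get nodes` output on stdin and print every node whose
# STATUS column is anything other than exactly "Ready".
# Cordoned nodes (Ready,SchedulingDisabled) are listed too, so they can
# be reviewed, even though they are not necessarily broken.
unhealthy_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
}

# Usage on a live cluster:
#   kubectl get nodes | unhealthy_nodes
```

If the function prints nothing, every node reports plain Ready.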


Step 2: Describe the Node

Use kubectl describe node to inspect conditions, events, labels, capacity, and allocatable resources.

kubectl describe node worker-1

Focus on the Conditions section.

Conditions:
  Type             Status    Reason                       Message
  ----             ------    ------                       -------
  OutOfDisk        False     KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False     KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False     KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False     KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True      KubeletReady                 kubelet is posting ready status

Healthy condition pattern

A healthy node usually has:

  • OutOfDisk=False (reported only by older Kubernetes releases; newer kubelets rely on DiskPressure)
  • MemoryPressure=False
  • DiskPressure=False
  • PIDPressure=False
  • Ready=True

Node Conditions Explained

Condition        Healthy value   Problem value     Meaning when True
---------        -------------   -------------     -----------------
Ready            True            False / Unknown   Node is ready to run Pods
MemoryPressure   False           True / Unknown    Node is low on memory
DiskPressure     False           True / Unknown    Node is low on disk space
PIDPressure      False           True / Unknown    Too many processes are running on the node
OutOfDisk        False           True / Unknown    Node is out of disk space (older Kubernetes releases only)

Note

When kubelet stops posting node status, conditions may become Unknown. Check LastHeartbeatTime to estimate when the node stopped communicating.
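
The healthy pattern described above (Ready=True, every pressure condition False) can be checked mechanically. This is a rough bash sketch under stated assumptions: the function name problem_conditions is hypothetical, and its input is one Type=Status pair per line, which the kubectl JSONPath expression in the usage comment is intended to produce.

```shell
# Read "Type=Status" lines on stdin and print only the problem ones:
# Ready must be True, every other condition must be False.
problem_conditions() {
  awk -F= 'NF && (($1 == "Ready" && $2 != "True") ||
                  ($1 != "Ready" && $2 != "False"))'
}

# Usage on a live cluster (worker-1 is the example node name):
#   kubectl get node worker-1 \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | problem_conditions
```

An Unknown status is reported as a problem for every condition, which matches the table above.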


Step 3: Check Unknown Node Status

If a node shows Unknown, describe it:

kubectl describe node worker-1

Example:

Type             Status    Reason              Message
MemoryPressure   Unknown   NodeStatusUnknown   Kubelet stopped posting node status.
DiskPressure     Unknown   NodeStatusUnknown   Kubelet stopped posting node status.
PIDPressure      Unknown   NodeStatusUnknown   Kubelet stopped posting node status.
Ready            Unknown   NodeStatusUnknown   Kubelet stopped posting node status.

This usually means:

  • kubelet is stopped
  • node is powered off
  • node network is broken
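  • node cannot reach the API server (kubelet sends the heartbeats, so this direction matters most)
  • API server cannot reach the node (often a sign of the same network fault)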
  • certificate/authentication issue

Likely root cause

If all conditions are Unknown, do not start with Pod debugging. First verify the node itself is alive and kubelet is running.
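
To estimate how long the node has been silent, the LastHeartbeatTime of the Ready condition can be converted into an age in seconds. The sketch below is illustrative only: heartbeat_age_seconds is a hypothetical helper, and it assumes GNU date (the -d option). On recent releases the kubelet may refresh this field only every few minutes even on a healthy node, so treat the result as approximate.

```shell
# Print the number of seconds between a heartbeat timestamp and "now".
# Assumes GNU date. The optional second argument replaces the current
# time, so the arithmetic can be checked against a fixed reference.
heartbeat_age_seconds() {
  local heartbeat="$1"
  local now="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  echo $(( $(date -u -d "$now" +%s) - $(date -u -d "$heartbeat" +%s) ))
}

# Usage on a live cluster (worker-1 is the example node name):
#   hb="$(kubectl get node worker-1 \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}')"
#   heartbeat_age_seconds "$hb"
```

For example, heartbeat_age_seconds 2024-01-01T10:00:00Z 2024-01-01T10:05:30Z prints 330.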


Step 4: Check Node Health Directly

SSH into the affected worker node and check CPU, memory, process, and disk usage.

top

Look for:

  • very high load average
  • memory exhaustion
  • high CPU processes
  • stuck or zombie processes

Then check disk usage:

df -h

Look for full filesystems, especially:

  • /
  • /var
  • /var/lib/kubelet
  • /var/lib/containerd
  • /var/log

List the top memory and CPU consumers:

ps aux --sort=-%mem | head
ps aux --sort=-%cpu | head

Warning

Disk pressure is common in production when container logs, unused images, or /var/lib/containerd grow without cleanup.
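
Scanning df output by eye is error-prone when a node has many mounts. As a small bash sketch (the function name full_filesystems and the default threshold of 85 percent are assumptions, not values from this guide), the POSIX-format output of df -P can be filtered for filesystems above a usage limit:

```shell
# Read `df -P` output on stdin and print "mountpoint usage" for every
# filesystem at or above the given usage percentage (default 85).
# Mount points that contain spaces are not handled by this sketch.
full_filesystems() {
  local limit="${1:-85}"
  awk -v limit="$limit" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the percent sign from Capacity
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Usage on the worker node:
#   df -P | full_filesystems 85
```

The -P flag is used instead of -h because it keeps each filesystem on one line with a fixed column order, which makes the output safe to parse.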


Step 5: Check Kubelet Status

Kubelet is the main node agent. If kubelet is down, the node cannot report status or manage Pods correctly.

systemctl status kubelet

Older systems may use:

service kubelet status

Expected:

Active: active (running)

Restart kubelet if required:

sudo systemctl restart kubelet
sudo systemctl status kubelet

Danger

Do not blindly restart kubelet in production before checking logs. Restarting may hide the original error or disrupt workloads.
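
One way to follow this advice is to save the recent kubelet logs before restarting, so the original error survives. The following bash sketch is an assumption-laden illustration: the helper names run and restart_kubelet_with_snapshot, the DRY_RUN variable, and the default output path are all hypothetical. It must be run as root (or adapted to use sudo) when DRY_RUN is not set.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Save the last 30 minutes of kubelet logs, then restart and re-check.
restart_kubelet_with_snapshot() {
  local out="${1:-/tmp/kubelet-before-restart.log}"
  run sh -c "journalctl -u kubelet --no-pager --since '30 minutes ago' > '$out'"
  run systemctl restart kubelet
  run systemctl is-active kubelet
}

# Preview the commands without touching the node:
#   DRY_RUN=1 restart_kubelet_with_snapshot /tmp/kubelet.log
```

The dry-run mode prints each command prefixed with "+", which gives a chance to review the plan before anything is restarted.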


Step 6: Check Kubelet Logs

Use journalctl to inspect kubelet errors.

sudo journalctl -u kubelet

Follow logs live:

sudo journalctl -u kubelet -f

Check recent logs:

sudo journalctl -u kubelet --since "30 minutes ago"

Common log issues:

Log symptom                       Possible cause
-----------                       --------------
certificate expired               kubelet client certificate issue
connection refused                API server or network issue
failed to create pod sandbox      CNI or container runtime issue
image garbage collection failed   disk/runtime issue
node not found                    node registration issue
failed to update node status      kubelet cannot talk to API server

Tip

Kubelet logs are usually the most important source when a worker node is NotReady.
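
The symptom table above can be turned into a quick first-pass scan of the log. This is only a sketch: kubelet_log_symptoms is a hypothetical function, and the search phrases are copied from the table as shorthand. Real kubelet messages are worded differently between versions (an expired certificate, for instance, is typically reported as an x509 error), so the patterns should be adapted to what the logs actually contain.

```shell
# Read kubelet log text on stdin and print "count phrase" for every
# known symptom phrase that appears at least once (case-insensitive).
kubelet_log_symptoms() {
  local log pattern count
  log="$(cat)"
  for pattern in \
    "certificate expired" \
    "connection refused" \
    "failed to create pod sandbox" \
    "image garbage collection failed" \
    "node not found" \
    "failed to update node status"; do
    # grep -c prints 0 and exits non-zero when nothing matches.
    count="$(printf '%s\n' "$log" | grep -ciF -- "$pattern" || true)"
    if [ "$count" -gt 0 ]; then
      printf '%s %s\n' "$count" "$pattern"
    fi
  done
}

# Usage on the worker node:
#   sudo journalctl -u kubelet --no-pager --since "30 minutes ago" \
#     | kubelet_log_symptoms
```

The counts point to which row of the table to investigate first; they do not replace reading the surrounding log lines.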


Step 7: Check Certificates

Kubelet uses certificates to authenticate with the API server. Expired or wrong certificates can make the node fail to report status.

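Example command (the certificate path depends on how the cluster was installed; many clusters keep it at /var/lib/kubelet/pki/kubelet-client-current.pem):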

openssl x509 -in /var/lib/kubelet/worker-1.crt -text -noout

Check:

  • Not Before
  • Not After
  • Issuer
  • Subject
  • organization/group
  • whether it is signed by the correct Kubernetes CA

Example certificate details:

Issuer: CN = kubernetes
Subject: O = system:nodes, CN = system:node:worker-1
Not After : Apr 19 20:09:29 2019 GMT

Certificate checks

The kubelet certificate should usually identify the node as:

O = system:nodes
CN = system:node:<node-name>

If the node name does not match, kubelet authentication may fail.
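
The node-name check can be scripted by extracting the name from the certificate subject. The sketch below is a hedged illustration: cert_node_name is a hypothetical helper, and it assumes the subject follows the system:node:<node-name> convention shown above. It accepts both the "CN = ..." and "/CN=..." layouts, because the exact formatting differs between OpenSSL versions.

```shell
# Read certificate text (or `openssl x509 -subject` output) on stdin
# and print the node name from a "CN = system:node:<name>" subject.
cert_node_name() {
  sed -n 's|.*CN *= *system:node:\([^,/ ]*\).*|\1|p'
}

# Usage on the worker node (path taken from the command list in this guide;
# the registered node name may differ from the hostname if it was overridden):
#   name="$(openssl x509 -noout -subject \
#     -in /var/lib/kubelet/pki/kubelet-client-current.pem | cert_node_name)"
#   [ "$name" = "$(hostname)" ] || echo "cert is for '$name', host is '$(hostname)'"
```

For expiry, openssl x509 also has a -checkend option: openssl x509 -noout -checkend 604800 -in <cert-file> exits non-zero if the certificate expires within the next seven days.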


Step 8: Check Container Runtime

Most modern Kubernetes clusters use containerd.

systemctl status containerd
sudo journalctl -u containerd --since "30 minutes ago"

Restart if required:

sudo systemctl restart containerd
sudo systemctl restart kubelet

For Docker-based older clusters:

systemctl status docker
sudo journalctl -u docker --since "30 minutes ago"

Runtime failure symptoms

  • Pods stuck in ContainerCreating
  • kubelet logs show sandbox creation failures
  • images cannot be pulled
  • node has many stopped containers/images consuming disk

Step 9: Check CNI and Pod Networking

If the node is Ready but Pods cannot communicate, check CNI.

kubectl get pods -n kube-system
kubectl describe pod <cni-pod-name> -n kube-system
kubectl logs <cni-pod-name> -n kube-system

On the worker node:

ls -l /etc/cni/net.d/
ls -l /opt/cni/bin/
ip addr
ip route

Note

CNI issues often appear as failed to create pod sandbox in kubelet logs.
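
A frequent cause of sandbox failures is a missing CNI configuration file on the node. The following bash sketch lists candidate configuration files; the function name cni_config_files is hypothetical, and the set of file extensions (.conf, .conflist, .json) is an assumption about what the CNI loader reads, so verify it against the runtime in use.

```shell
# List CNI configuration files in the given directory
# (default /etc/cni/net.d), one path per line, sorted.
cni_config_files() {
  local dir="${1:-/etc/cni/net.d}"
  find "$dir" -maxdepth 1 -type f \
    \( -name '*.conf' -o -name '*.conflist' -o -name '*.json' \) | sort
}

# Usage on the worker node:
#   cni_config_files || echo "cannot read CNI config directory"
#   [ -n "$(cni_config_files)" ] || echo "no CNI configuration found"
```

An empty result usually means the CNI plugin's Pod never ran successfully on this node, which points back to the kube-system checks above.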


Common Worker Node Failure Causes

Symptom                        Likely cause                    Check
-------                        ------------                    -----
Node NotReady                  kubelet down                    systemctl status kubelet
Node Unknown                   node stopped reporting          heartbeat time, node reachability
DiskPressure=True              disk full                       df -h, container logs/images
MemoryPressure=True            memory exhausted                top, free -m
PIDPressure=True               too many processes              ps, process limits
Pods stuck ContainerCreating   CNI/runtime issue               kubelet logs, CNI Pods
Pods evicted                   resource pressure               kubectl describe pod
Kubelet auth errors            certificate issue               kubelet cert and logs
Services not working on node   kube-proxy issue                kube-proxy logs/status
Image pull failures            registry/network/secret issue   describe pod, runtime logs

Commands

kubectl get nodes
kubectl describe node worker-1
kubectl get pods -A -o wide
kubectl get events --sort-by=.metadata.creationTimestamp
top
df -h
free -m
ps aux --sort=-%cpu | head
ps aux --sort=-%mem | head
systemctl status kubelet
sudo journalctl -u kubelet --since "30 minutes ago"
sudo journalctl -u kubelet -f
sudo systemctl restart kubelet
systemctl status containerd
sudo journalctl -u containerd --since "30 minutes ago"
crictl ps
crictl images
openssl x509 -in /var/lib/kubelet/worker-1.crt -text -noout
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -text -noout
ip addr
ip route
ping <api-server-ip>
curl -k https://<api-server-ip>:6443/healthz

Production Best Practices

Do

  • Monitor node readiness, CPU, memory, disk, PID pressure, and kubelet health.
  • Set resource requests and limits for applications.
  • Configure log rotation for containers.
  • Keep enough disk space under /var/lib/kubelet and /var/lib/containerd.
  • Use node alerts for NotReady, Unknown, and pressure conditions.
  • Use multiple worker nodes across availability zones.
  • Drain nodes before planned maintenance.
  • Keep kubelet certificates renewed and monitored.
  • Run regular OS and Kubernetes patching through a controlled maintenance process.

Don't

  • Do not ignore DiskPressure or high restart counts.
  • Do not manually delete random files under /var/lib/kubelet.
  • Do not restart production nodes without checking running workloads.
  • Do not run critical workloads without replicas.
  • Do not let container logs grow without rotation.
  • Do not use worker nodes with mismatched hostnames, duplicate MAC addresses, or wrong certificates.
  • Do not troubleshoot only from kubectl if the node itself is unhealthy.

Worker Node Failure Checklist

[ ] Is the node Ready, NotReady, or Unknown?
[ ] What does kubectl describe node show?
[ ] Are node conditions showing disk, memory, or PID pressure?
[ ] What is the LastHeartbeatTime?
[ ] Is the node reachable over SSH?
[ ] Is CPU or memory exhausted?
[ ] Is disk full?
[ ] Is kubelet active?
[ ] What do kubelet logs show?
[ ] Is container runtime active?
[ ] Are CNI Pods healthy?
[ ] Are kubelet certificates valid?
[ ] Can the node reach the API server on port 6443?
[ ] Are Pods failing only on this node or cluster-wide?
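
The last checklist item can be answered with a per-node count of Pods that are not running. This is a rough bash sketch: failing_pods_per_node is a hypothetical helper, and it treats any phase other than Running or Succeeded as failing. A Pod that is crash-looping still reports the Running phase, so this count is a first signal rather than a full picture.

```shell
# Read "node phase" pairs on stdin and print "node count" for every
# node that has Pods in a phase other than Running or Succeeded.
failing_pods_per_node() {
  awk '$2 != "Running" && $2 != "Succeeded" { n[$1]++ }
       END { for (node in n) print node, n[node] }' | sort
}

# Usage on a live cluster:
#   kubectl get pods -A --no-headers \
#     -o custom-columns=NODE:.spec.nodeName,PHASE:.status.phase \
#     | failing_pods_per_node
```

If only one node appears in the output, the problem is likely isolated to that worker; if every node appears, look at the control plane or a cluster-wide dependency first.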

Quick Investigation Example

# 1. Check node status
kubectl get nodes

# 2. Describe affected node
kubectl describe node worker-2

# 3. SSH into node and check resources
top
df -h
free -m

# 4. Check kubelet
systemctl status kubelet
sudo journalctl -u kubelet --since "30 minutes ago"

# 5. Check runtime
systemctl status containerd
sudo journalctl -u containerd --since "30 minutes ago"

# 6. Check certificate
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -text -noout

Quote

Worker node troubleshooting is about finding why kubelet cannot keep the node healthy: resource pressure, kubelet failure, runtime failure, network failure, or certificate problems.