12.03 Worker Node Failure

Worker node failure means one or more worker nodes cannot properly run Pods or communicate with the control plane. The node may show NotReady, Unknown, or resource pressure conditions such as disk, memory, or PID pressure.

Goal

Troubleshoot worker node failures by checking:

Node status
Node conditions
Node resource usage
Kubelet status
Kubelet logs
Certificates
Container runtime and network health

Worker Node Troubleshooting Flow

Worker node issue
│
├── Check node status
│   └── kubectl get nodes
│
├── Describe the affected node
│   └── kubectl describe node <node-name>
│
├── Check node resources
│   ├── top
│   └── df -h
│
├── Check kubelet
│   ├── systemctl status kubelet
│   └── journalctl -u kubelet
│
├── Check certificates
│   └── openssl x509 -in <cert-file> -text
│
├── Check container runtime
│   └── systemctl status containerd
│
└── Fix root cause and re-check node readiness

Tip

Always start with kubectl get nodes. It is the fastest way to tell whether the problem is cluster-wide or isolated to a single worker node.

Step 1: Check Node Status

Run:

kubectl get nodes

Example:

NAME       STATUS                     ROLES    AGE   VERSION
worker-1   Ready                      <none>   8d    v1.13.0
worker-2   NotReady                   <none>   8d    v1.13.0
worker-3   Ready,SchedulingDisabled   <none>   8d    v1.13.0

Status	Meaning
`Ready`	Node is healthy and can run Pods
`NotReady`	Node is not healthy or kubelet is not reporting correctly
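`Unknown`	No Ready condition has been reported; a node that stops sending heartbeats is normally listed as `NotReady`, with its Ready condition set to `Unknown`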
`SchedulingDisabled`	Node is cordoned; no new Pods will be scheduled

Important

Ready,SchedulingDisabled does not mean the node is broken. The node is healthy but cordoned, usually by kubectl cordon or kubectl drain, so no new Pods are scheduled onto it.
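On a large cluster the STATUS column is easier to scan when it is filtered. The following is a minimal sketch rather than an official tool: the function name unhealthy_nodes is hypothetical, and it deliberately reports cordoned nodes as well, because their status is not exactly Ready.

```shell
# unhealthy_nodes: hypothetical helper, not part of kubectl.
# Reads `kubectl get nodes` output on stdin and prints "<name> <status>"
# for every node whose STATUS column is anything other than exactly "Ready".
unhealthy_nodes() {
  awk '$1 != "NAME" && NF >= 2 && $2 != "Ready" { print $1, $2 }'
}

# Usage (requires cluster access):
#   kubectl get nodes | unhealthy_nodes
```

An empty result means every node reports plain Ready.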

Step 2: Describe the Node

Use kubectl describe node to inspect conditions, events, labels, capacity, and allocatable resources.

kubectl describe node worker-1

Focus on the Conditions section.

Conditions:
  Type             Status    Reason                       Message
  ----             ------    ------                       -------
  OutOfDisk        False     KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False     KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False     KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False     KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True      KubeletReady                 kubelet is posting ready status

Healthy condition pattern

A healthy node usually has:

OutOfDisk=False
MemoryPressure=False
DiskPressure=False
PIDPressure=False
Ready=True
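The healthy pattern above can be checked mechanically. This is a sketch under stated assumptions: problem_conditions is a hypothetical helper name, and it expects one "<type> <status>" pair per line, which is the shape produced by the jsonpath expression in the usage comment.

```shell
# problem_conditions: hypothetical helper.
# Reads "<type> <status>" lines on stdin and prints every condition that
# deviates from the healthy pattern: Ready must be True, all others False.
problem_conditions() {
  awk '
    NF < 2 { next }
    $1 == "Ready" && $2 != "True"  { print $1 "=" $2; next }
    $1 != "Ready" && $2 != "False" { print $1 "=" $2 }
  '
}

# Usage (requires cluster access):
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{"\n"}{end}' \
#     | problem_conditions
```

No output means the node matches the healthy pattern.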

Node Conditions Explained

Condition	Healthy value	Problem value	Meaning
`Ready`	`True`	`False` / `Unknown`	Node is ready to run Pods
`MemoryPressure`	`False`	`True` / `Unknown`	Node is low on memory
`DiskPressure`	`False`	`True` / `Unknown`	Node has low disk space
`PIDPressure`	`False`	`True` / `Unknown`	Too many processes are running
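`OutOfDisk`	`False`	`True` / `Unknown`	Node is out of disk space (reported only by older Kubernetes releases; current releases signal low disk through `DiskPressure`)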

Note

When kubelet stops posting node status, conditions may become Unknown. Check LastHeartbeatTime to estimate when the node stopped communicating.
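To turn LastHeartbeatTime into "how long ago", the timestamp can be converted to an age in seconds. This is a rough sketch that assumes GNU date (the -d option) and an RFC 3339 UTC timestamp; heartbeat_age_seconds is a hypothetical helper name.

```shell
# heartbeat_age_seconds: hypothetical helper, assumes GNU date.
# $1: heartbeat timestamp, for example 2024-01-01T00:00:00Z
# $2: optional "now" in epoch seconds (defaults to the current time)
heartbeat_age_seconds() {
  local heartbeat_epoch now_epoch
  heartbeat_epoch=$(date -u -d "$1" +%s) || return 1
  now_epoch="${2:-$(date -u +%s)}"
  echo $(( now_epoch - heartbeat_epoch ))
}

# Usage (requires cluster access):
#   hb=$(kubectl get node <node-name> \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}')
#   heartbeat_age_seconds "$hb"
```

An age of many minutes on a node that should be running points to a stopped kubelet or lost connectivity rather than a Pod-level problem.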

Step 3: Check Unknown Node Status

If the node's Ready condition is Unknown (kubectl get nodes usually lists such a node as NotReady), describe it:

kubectl describe node worker-1

Example:

Type             Status    Reason              Message
MemoryPressure   Unknown   NodeStatusUnknown   Kubelet stopped posting node status.
DiskPressure     Unknown   NodeStatusUnknown   Kubelet stopped posting node status.
PIDPressure      Unknown   NodeStatusUnknown   Kubelet stopped posting node status.
Ready            Unknown   NodeStatusUnknown   Kubelet stopped posting node status.

This usually means:

kubelet is stopped
node is powered off
node network is broken
node cannot reach the API server (kubelet sends heartbeats to the API server, not the other way around)
certificate/authentication issue

Likely root cause

If all conditions are Unknown, do not start with Pod debugging. First verify the node itself is alive and kubelet is running.

Step 4: Check Node Health Directly

SSH into the affected worker node and check CPU, memory, process, and disk usage.

CPU and memory

top

Look for:

very high load average
memory exhaustion
high CPU processes
stuck or zombie processes

Disk usage

df -h

Look for full filesystems, especially:

/
/var
/var/lib/kubelet
/var/lib/containerd
/var/log

Processes

ps aux --sort=-%mem | head
ps aux --sort=-%cpu | head

Warning

Disk pressure is common in production when container logs, unused images, or /var/lib/containerd grow without cleanup.
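Nearly full filesystems can be picked out of the df listing automatically. The sketch below parses the portable df -P format; full_filesystems is a hypothetical helper, and the default threshold of 85 percent is an arbitrary illustration, not the kubelet's eviction threshold. Mount points that contain spaces are not handled.

```shell
# full_filesystems: hypothetical helper.
# Reads `df -P` output on stdin and prints "<mount point> <use%>" for every
# filesystem whose usage is at or above the threshold ($1, default 85).
full_filesystems() {
  awk -v limit="${1:-85}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= limit + 0) print $6, $5
    }
  '
}

# Usage on the worker node:
#   df -P | full_filesystems 90
```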

Step 5: Check Kubelet Status

Kubelet is the main node agent. If kubelet is down, the node cannot report status or manage Pods correctly.

systemctl status kubelet

Older systems may use:

service kubelet status

Expected:

Active: active (running)

Restart kubelet if required:

sudo systemctl restart kubelet
sudo systemctl status kubelet

Danger

Do not blindly restart kubelet in production before checking logs. Restarting may hide the original error or disrupt workloads.
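If a restart is still judged necessary, the recent logs can be saved first so that the original error is not lost. This is a minimal sketch, not a recommended procedure: save_logs_then_restart and the JOURNALCTL and SYSTEMCTL override variables are hypothetical names introduced here, and the commands normally need root privileges.

```shell
# save_logs_then_restart: hypothetical helper.
# $1: systemd unit name, for example kubelet
# $2: optional output file (defaults to a timestamped file under /tmp)
# JOURNALCTL and SYSTEMCTL may be overridden, for example with stubs.
save_logs_then_restart() {
  local unit="$1"
  local out="${2:-/tmp/${unit}-$(date +%Y%m%d-%H%M%S).log}"
  # Keep the last 30 minutes of logs before touching the service.
  "${JOURNALCTL:-journalctl}" -u "$unit" --since "30 minutes ago" --no-pager > "$out" || return 1
  "${SYSTEMCTL:-systemctl}" restart "$unit" || return 1
  echo "logs saved to $out"
}

# Usage on the worker node (as root):
#   save_logs_then_restart kubelet
```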

Step 6: Check Kubelet Logs

Use journalctl to inspect kubelet errors.

sudo journalctl -u kubelet

Follow logs live:

sudo journalctl -u kubelet -f

Check recent logs:

sudo journalctl -u kubelet --since "30 minutes ago"

Common log issues:

Log symptom	Possible cause
certificate expired	kubelet client certificate issue
connection refused	API server or network issue
failed to create pod sandbox	CNI or container runtime issue
image garbage collection failed	disk/runtime issue
node not found	node registration issue
failed to update node status	kubelet cannot talk to API server

Tip

Kubelet logs are usually the most important source when a worker node is NotReady.
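The symptom table above can be used as a quick filter over the kubelet journal. In this sketch the search phrases are copied from the table and matched case-insensitively; real kubelet messages vary between versions, so treat the phrases as assumptions to adjust. kubelet_log_symptoms is a hypothetical helper name.

```shell
# kubelet_log_symptoms: hypothetical helper.
# Reads log text on stdin and prints "<count><TAB><phrase>" for each
# phrase from the symptom table that occurs at least once.
kubelet_log_symptoms() {
  local log pattern count
  log=$(cat)
  for pattern in \
    "certificate expired" \
    "connection refused" \
    "failed to create pod sandbox" \
    "image garbage collection failed" \
    "node not found" \
    "failed to update node status"
  do
    # grep -c prints 0 and exits nonzero when nothing matches; ignore that status.
    count=$(printf '%s\n' "$log" | grep -ciF -- "$pattern" || true)
    if [ "$count" -gt 0 ]; then
      printf '%s\t%s\n' "$count" "$pattern"
    fi
  done
}

# Usage on the worker node:
#   sudo journalctl -u kubelet --since "30 minutes ago" --no-pager | kubelet_log_symptoms
```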

Step 7: Check Certificates

Kubelet uses certificates to authenticate with the API server. Expired or wrong certificates can make the node fail to report status.

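Example command (the certificate path depends on how the cluster was installed; kubeadm-based nodes typically use /var/lib/kubelet/pki/kubelet-client-current.pem):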

openssl x509 -in /var/lib/kubelet/worker-1.crt -text -noout

Check:

Not Before
Not After
Issuer
Subject
organization/group
whether it is signed by the correct Kubernetes CA

Example certificate details:

Issuer: CN = kubernetes
Subject: O = system:nodes, CN = system:node:worker-1
Not After : Apr 19 20:09:29 2019 GMT

Certificate checks

The kubelet certificate should usually identify the node as:

O = system:nodes
CN = system:node:<node-name>

If the node name does not match, kubelet authentication may fail.
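The node name embedded in the certificate can be extracted and compared with the name shown by kubectl get nodes. A minimal sketch, assuming the subject follows the system:node:<node-name> convention described above; cert_node_name is a hypothetical helper name.

```shell
# cert_node_name: hypothetical helper.
# Reads openssl subject text on stdin and prints the node name that follows
# "system:node:" in the CN field. Prints nothing if the CN has another form.
cert_node_name() {
  sed -n 's|.*CN *= *system:node:\([^,/ ]*\).*|\1|p'
}

# Usage on the worker node:
#   openssl x509 -in <cert-file> -noout -subject | cert_node_name
# Expiry can be checked without reading dates by eye; this exits nonzero
# if the certificate expires within the next 7 days (604800 seconds):
#   openssl x509 -in <cert-file> -noout -checkend 604800
```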

Step 8: Check Container Runtime

Most modern Kubernetes clusters use containerd.

systemctl status containerd
sudo journalctl -u containerd --since "30 minutes ago"

Restart if required:

sudo systemctl restart containerd
sudo systemctl restart kubelet

For older Docker-based clusters (Kubernetes removed its built-in dockershim in v1.24):

systemctl status docker
sudo journalctl -u docker --since "30 minutes ago"

Runtime failure symptoms

Pods stuck in ContainerCreating
kubelet logs show sandbox creation failures
images cannot be pulled
node has many stopped containers/images consuming disk

Step 9: Check CNI and Pod Networking

If the node is Ready but Pods cannot communicate, check CNI.

kubectl get pods -n kube-system
kubectl describe pod <cni-pod-name> -n kube-system
kubectl logs <cni-pod-name> -n kube-system

On the worker node:

ls -l /etc/cni/net.d/
ls -l /opt/cni/bin/
ip addr
ip route

Note

CNI issues often appear as failed to create pod sandbox in kubelet logs.
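The two directory listings above answer one question: are a CNI configuration and the plugin binaries actually present on the node? The sketch below automates that check. cni_files_present is a hypothetical helper, the default paths are the ones used in this section, and it only tests for presence, not for correctness of the configuration.

```shell
# cni_files_present: hypothetical helper.
# $1: CNI config directory (default /etc/cni/net.d)
# $2: CNI plugin directory (default /opt/cni/bin)
# Returns 0 if both a config file and an executable plugin exist, else 1.
cni_files_present() {
  local conf_dir="${1:-/etc/cni/net.d}"
  local bin_dir="${2:-/opt/cni/bin}"
  local status=0
  if [ -z "$(find "$conf_dir" -maxdepth 1 -type f \
        \( -name '*.conf' -o -name '*.conflist' -o -name '*.json' \) 2>/dev/null | head -n 1)" ]; then
    echo "no CNI configuration found in $conf_dir"
    status=1
  fi
  if [ -z "$(find "$bin_dir" -mindepth 1 -maxdepth 1 ! -type d -perm -100 2>/dev/null | head -n 1)" ]; then
    echo "no executable CNI plugins found in $bin_dir"
    status=1
  fi
  return "$status"
}

# Usage on the worker node:
#   cni_files_present
```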

Common Worker Node Failure Causes

Symptom	Likely cause	Check
Node `NotReady`	kubelet down	`systemctl status kubelet`
Node `Unknown`	node stopped reporting	heartbeat time, node reachability
`DiskPressure=True`	disk full	`df -h`, container logs/images
`MemoryPressure=True`	memory exhausted	`top`, `free -m`
`PIDPressure=True`	too many processes	`ps`, process limits
Pods stuck `ContainerCreating`	CNI/runtime issue	kubelet logs, CNI Pods
Pods evicted	resource pressure	`kubectl describe pod`
Kubelet auth errors	certificate issue	kubelet cert and logs
Services not working on node	kube-proxy issue	kube-proxy logs/status
Image pull failures	registry/network/secret issue	`describe pod`, runtime logs

Commands

Cluster-side checks

kubectl get nodes
kubectl describe node worker-1
kubectl get pods -A -o wide
kubectl get events --sort-by=.metadata.creationTimestamp

Node resource checks

top
df -h
free -m
ps aux --sort=-%cpu | head
ps aux --sort=-%mem | head

Kubelet checks

systemctl status kubelet
sudo journalctl -u kubelet --since "30 minutes ago"
sudo journalctl -u kubelet -f
sudo systemctl restart kubelet

Runtime checks

systemctl status containerd
sudo journalctl -u containerd --since "30 minutes ago"
crictl ps
crictl images

Certificate checks

openssl x509 -in /var/lib/kubelet/worker-1.crt -text -noout
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -text -noout

Network checks

ip addr
ip route
ping <api-server-ip>
curl -k https://<api-server-ip>:6443/healthz

Production Best Practices

Do

Monitor node readiness, CPU, memory, disk, PID pressure, and kubelet health.
Set resource requests and limits for applications.
Configure log rotation for containers.
Keep enough disk space under /var/lib/kubelet and /var/lib/containerd.
Use node alerts for NotReady, Unknown, and pressure conditions.
Use multiple worker nodes across availability zones.
Drain nodes before planned maintenance.
Keep kubelet certificates renewed and monitored.
Run regular OS and Kubernetes patching through a controlled maintenance process.

Don't

Do not ignore DiskPressure or high restart counts.
Do not manually delete random files under /var/lib/kubelet.
Do not restart production nodes without checking running workloads.
Do not run critical workloads without replicas.
Do not let container logs grow without rotation.
Do not use worker nodes with mismatched hostnames, duplicate MAC addresses, or wrong certificates.
Do not troubleshoot only from kubectl if the node itself is unhealthy.

Worker Node Failure Checklist

[ ] Is the node Ready, NotReady, or Unknown?
[ ] What does kubectl describe node show?
[ ] Are node conditions showing disk, memory, or PID pressure?
[ ] What is the LastHeartbeatTime?
[ ] Is the node reachable over SSH?
[ ] Is CPU or memory exhausted?
[ ] Is disk full?
[ ] Is kubelet active?
[ ] What do kubelet logs show?
[ ] Is container runtime active?
[ ] Are CNI Pods healthy?
[ ] Are kubelet certificates valid?
[ ] Can the node reach the API server on port 6443?
[ ] Are Pods failing only on this node or cluster-wide?
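The last checklist item can be answered with a per-node count of Pods that are not running. This is a rough sketch: failing_pods_per_node is a hypothetical helper, and it looks only at the Pod phase, so a Pod that is Running but crash-looping is not counted.

```shell
# failing_pods_per_node: hypothetical helper.
# Reads "<node> <phase>" lines on stdin and prints "<count> <node>" for Pods
# whose phase is neither Running nor Succeeded, highest count first.
failing_pods_per_node() {
  awk '
    NF >= 2 && $2 != "Running" && $2 != "Succeeded" { n[$1]++ }
    END { for (node in n) print n[node], node }
  ' | sort -rn
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers \
#     -o custom-columns=NODE:.spec.nodeName,PHASE:.status.phase \
#     | failing_pods_per_node
```

If one node dominates the list, the problem is likely local to that node; counts spread evenly across nodes suggest a cluster-wide cause.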

Quick Investigation Example

# 1. Check node status
kubectl get nodes

# 2. Describe affected node
kubectl describe node worker-2

# 3. SSH into node and check resources
top
df -h
free -m

# 4. Check kubelet
systemctl status kubelet
sudo journalctl -u kubelet --since "30 minutes ago"

# 5. Check runtime
systemctl status containerd
sudo journalctl -u containerd --since "30 minutes ago"

# 6. Check certificate
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -text -noout

Quote

Worker node troubleshooting is about finding why kubelet cannot keep the node healthy: resource pressure, kubelet failure, runtime failure, network failure, or certificate problems.