12.03 Worker Node Failure
Worker node failure means one or more worker nodes cannot properly run Pods or communicate with the control plane. The node may show NotReady, Unknown, or resource pressure conditions such as disk, memory, or PID pressure.
Goal
Troubleshoot worker node failures by checking:
- Node status
- Node conditions
- Node resource usage
- Kubelet status
- Kubelet logs
- Certificates
- Container runtime and network health
Worker Node Troubleshooting Flow
Worker node issue
│
├── Check node status
│ └── kubectl get nodes
│
├── Describe the affected node
│ └── kubectl describe node <node-name>
│
├── Check node resources
│ ├── top
│ └── df -h
│
├── Check kubelet
│ ├── systemctl status kubelet
│ └── journalctl -u kubelet
│
├── Check certificates
│ └── openssl x509 -in <cert-file> -text
│
├── Check container runtime
│ └── systemctl status containerd
│
└── Fix root cause and re-check node readiness
Tip
Always start from kubectl get nodes. It gives the fastest signal about whether the issue is cluster-wide or isolated to one worker node.
Step 1: Check Node Status
Run: kubectl get nodes
Example:
NAME STATUS ROLES AGE VERSION
worker-1 Ready <none> 8d v1.13.0
worker-2 NotReady <none> 8d v1.13.0
worker-3 Ready,SchedulingDisabled <none> 8d v1.13.0
| Status | Meaning |
|---|---|
| Ready | Node is healthy and can run Pods |
| NotReady | Node is not healthy or kubelet is not reporting correctly |
| Unknown | Control plane stopped receiving node heartbeats |
| SchedulingDisabled | Node is cordoned; no new Pods will be scheduled |
Important
Ready,SchedulingDisabled does not mean the node is broken. The node is healthy but cordoned, typically with kubectl cordon or as part of kubectl drain, so existing Pods keep running while no new Pods are scheduled.
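To pull out only the nodes that need attention, the STATUS column can be filtered. The following is a minimal sketch, assuming the default column layout of kubectl get nodes shown in the example output; the function name not_ready_nodes is illustrative and not part of kubectl.

```shell
# Print name and status of every node whose STATUS is not exactly "Ready".
# Reads `kubectl get nodes` output (including the header row) on stdin.
# not_ready_nodes is a hypothetical helper name, not a kubectl feature.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
}

# Usage on a machine with cluster access:
#   kubectl get nodes | not_ready_nodes
```

Cordoned nodes are listed on purpose: they are healthy, but they still explain why new Pods are not landing on them.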
Step 2: Describe the Node
Use kubectl describe node to inspect conditions, events, labels, capacity, and allocatable resources.
Focus on the Conditions section.
Conditions:
Type Status Reason Message
---- ------ ------ -------
OutOfDisk False KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False KubeletHasSufficientPID kubelet has sufficient PID available
Ready True KubeletReady kubelet is posting ready status
Healthy condition pattern
A healthy node usually has:
- OutOfDisk=False
- MemoryPressure=False
- DiskPressure=False
- PIDPressure=False
- Ready=True
Node Conditions Explained
| Condition | Healthy value | Problem value | Meaning |
|---|---|---|---|
| Ready | True | False / Unknown | Node is ready to run Pods |
| MemoryPressure | False | True / Unknown | Node is low on memory |
| DiskPressure | False | True / Unknown | Node has low disk space |
| PIDPressure | False | True / Unknown | Too many processes are running |
| OutOfDisk | False | True / Unknown | Node is out of disk space (legacy condition; newer Kubernetes releases report DiskPressure instead) |
Note
When kubelet stops posting node status, conditions may become Unknown. Check LastHeartbeatTime to estimate when the node stopped communicating.
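The healthy and problem values in the conditions table can be checked mechanically. This is a rough sketch, not an official tool: it assumes the conditions are supplied as one type/status pair per line, and the name unhealthy_conditions is made up for illustration.

```shell
# Print every node condition that is not at its healthy value.
# Input on stdin: one "<Type> <Status>" pair per line; extra columns are ignored.
# Healthy values follow the conditions table: Ready=True, everything else=False.
# unhealthy_conditions is an illustrative helper name.
unhealthy_conditions() {
  awk '
    NF < 2        { next }                                   # skip blank or short lines
    $1 == "Ready" { if ($2 != "True") print $1 "=" $2; next }
    $2 != "False" { print $1 "=" $2 }
  '
}

# Usage, with <node-name> as a placeholder:
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}' \
#     | unhealthy_conditions
```

Empty output means every reported condition is at its healthy value.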
Step 3: Check Unknown Node Status
If a node shows Unknown, describe it: kubectl describe node <node-name>
Example:
Type Status Reason Message
MemoryPressure Unknown NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown NodeStatusUnknown Kubelet stopped posting node status.
This usually means:
- kubelet is stopped
- node is powered off
- node network is broken
- API server cannot reach the node
- node cannot reach the API server
- certificate/authentication issue
Likely root cause
If all conditions are Unknown, do not start with Pod debugging. First verify the node itself is alive and kubelet is running.
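Since LastHeartbeatTime indicates when the node last reported, it can help to turn it into an age in seconds. The sketch below assumes GNU date (for the -d option) and an RFC 3339 timestamp as returned by the API; heartbeat_age is a hypothetical helper name.

```shell
# Print how many seconds have passed since a heartbeat timestamp.
# $1: RFC 3339 timestamp, for example 2019-04-19T20:09:29Z
# $2: optional "now" in epoch seconds (defaults to the current time)
# Assumes GNU date (-d); heartbeat_age is an illustrative helper name.
heartbeat_age() {
  hb_epoch="$(date -u -d "$1" +%s)" || return 1
  now_epoch="${2:-$(date -u +%s)}"
  echo $(( now_epoch - hb_epoch ))
}

# Usage, with <node-name> as a placeholder:
#   ts="$(kubectl get node <node-name> \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}')"
#   heartbeat_age "$ts"
```

A large age combined with Unknown conditions points at the node or its network rather than at individual Pods.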
Step 4: Check Node Health Directly
SSH into the affected worker node and check CPU, memory, process, and disk usage.
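The troubleshooting flow at the start of this section uses top and df -h for this step, and the quick investigation example adds free -m. As a sketch of how the disk check could be narrowed to filesystems that are nearly full, the helper below parses df -P output; the 85% default and the name mounts_over are illustrative choices, not values from this guide.

```shell
# Print mount points whose usage is at or above a threshold (default 85%).
# Reads `df -P` output on stdin; -P keeps each filesystem on a single line.
# The 85% default and the name mounts_over are illustrative choices.
mounts_over() {
  awk -v limit="${1:-85}" 'NR > 1 {
    use = $5; sub(/%/, "", use)               # strip the percent sign
    if (use + 0 >= limit + 0) print $6, $5    # mount point and usage
  }'
}

# Usage on the node:
#   top -b -n 1 | head -n 15    # one batch-mode snapshot of CPU and processes
#   free -m                     # memory in MiB
#   df -P | mounts_over 85      # filesystems at or above 85% usage
```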
Warning
Disk pressure is common in production when container logs, unused images, or /var/lib/containerd grow without cleanup.
Step 5: Check Kubelet Status
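Kubelet is the main node agent. If kubelet is down, the node cannot report status or manage Pods correctly. Check it with: systemctl status kubelet
Older systems may use: service kubelet status
Expected: the service is reported as active (running).
Restart kubelet if required: sudo systemctl restart kubelet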
Danger
Do not blindly restart kubelet in production before checking logs. Restarting may hide the original error or disrupt workloads.
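One way to follow the warning above is to make the status check script-friendly and look at logs before any restart. This is a small sketch that relies on systemctl is-active; the helper name unit_state is an assumption made for illustration.

```shell
# Print "<unit>: <state>" for a systemd unit and succeed only when it is active.
# Relies on `systemctl is-active`, which prints states such as active,
# inactive, or failed. unit_state is an illustrative helper name.
unit_state() {
  state="$(systemctl is-active "$1" 2>/dev/null)"
  echo "$1: ${state:-unknown}"
  [ "$state" = "active" ]
}

# Usage on the node: read the logs first if kubelet is not active.
#   unit_state kubelet || sudo journalctl -u kubelet --since "30 minutes ago"
```

The same helper works for any other unit name, such as the container runtime service.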
Step 6: Check Kubelet Logs
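Use journalctl to inspect kubelet errors: sudo journalctl -u kubelet
Follow logs live: sudo journalctl -u kubelet -f
Check recent logs: sudo journalctl -u kubelet --since "30 minutes ago"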
Common log issues:
| Log symptom | Possible cause |
|---|---|
| certificate expired | kubelet client certificate issue |
| connection refused | API server or network issue |
| failed to create pod sandbox | CNI or container runtime issue |
| image garbage collection failed | disk/runtime issue |
| node not found | node registration issue |
| failed to update node status | kubelet cannot talk to API server |
Tip
Kubelet logs are usually the most important source when a worker node is NotReady.
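The symptom table above can be turned into a quick first-pass filter over the journal. The sketch below uses simple case-insensitive patterns derived from that table; real kubelet messages differ between versions, so treat the matches as hints only. The name kubelet_log_hints is hypothetical.

```shell
# Map kubelet log lines to the possible causes listed in the symptom table.
# Reads log text on stdin and prints each matching cause once.
# Patterns are loose approximations; kubelet wording varies between versions.
# kubelet_log_hints is an illustrative helper name.
kubelet_log_hints() {
  awk '
    function hint(msg) { if (!(msg in seen)) { seen[msg] = 1; print msg } }
    { line = tolower($0) }
    line ~ /certificate.*expired/            { hint("kubelet client certificate issue") }
    line ~ /connection refused/              { hint("API server or network issue") }
    line ~ /failed to create pod sandbox/    { hint("CNI or container runtime issue") }
    line ~ /image garbage collection failed/ { hint("disk/runtime issue") }
    line ~ /node .*not found/                { hint("node registration issue") }
    line ~ /failed to update node status/    { hint("kubelet cannot talk to API server") }
  '
}

# Usage on the node:
#   sudo journalctl -u kubelet --since "30 minutes ago" --no-pager | kubelet_log_hints
```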
Step 7: Check Certificates
Kubelet uses certificates to authenticate with the API server. Expired or wrong certificates can make the node fail to report status.
Example command: openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -text -noout
Check:
- Not Before
- Not After
- Issuer
- Subject organization/group
- whether it is signed by the correct Kubernetes CA
Example certificate details:
Issuer: CN = kubernetes
Subject: O = system:nodes, CN = system:node:worker-1
Not After : Apr 19 20:09:29 2019 GMT
Certificate checks
The kubelet certificate should usually identify the node as: O = system:nodes, CN = system:node:<node-name>
If the node name does not match, kubelet authentication may fail.
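Comparing the name in the certificate with the node's registered name is easier when the name is extracted from the subject line. This is a minimal sketch that assumes the subject follows the system:node:<node-name> pattern shown in the example certificate details; cert_node_name is an illustrative helper name.

```shell
# Extract the node name from a certificate subject line.
# Accepts both "CN = system:node:x" and "CN=system:node:x" spellings.
# cert_node_name is an illustrative helper name.
cert_node_name() {
  sed -n 's|.*CN *= *system:node:\([^, ]*\).*|\1|p'
}

# Usage on the node (certificate path as used elsewhere in this guide):
#   cert=/var/lib/kubelet/pki/kubelet-client-current.pem
#   openssl x509 -in "$cert" -noout -subject | cert_node_name
#   openssl x509 -in "$cert" -noout -enddate
#   # -checkend exits non-zero if the certificate expires within N seconds
#   openssl x509 -in "$cert" -noout -checkend 604800 || echo "expires within 7 days"
```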
Step 8: Check Container Runtime
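Most modern Kubernetes clusters use containerd. Check it with: systemctl status containerd
Restart if required: sudo systemctl restart containerd
For Docker-based older clusters: systemctl status docker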
Runtime failure symptoms
- Pods stuck in ContainerCreating
- kubelet logs show sandbox creation failures
- images cannot be pulled
- node has many stopped containers/images consuming disk
Step 9: Check CNI and Pod Networking
If the node is Ready but Pods cannot communicate, check CNI.
kubectl get pods -n kube-system
kubectl describe pod <cni-pod-name> -n kube-system
kubectl logs <cni-pod-name> -n kube-system
On the worker node:
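One plausible node-side check is whether any CNI configuration is present at all. The sketch below assumes the configuration lives in /etc/cni/net.d, which is a common default but is not confirmed by this guide; the helper name cni_config_count is made up for illustration.

```shell
# Count CNI configuration files in a directory.
# /etc/cni/net.d is a common default location and is assumed here; adjust it
# if the runtime is configured differently. cni_config_count is illustrative.
cni_config_count() {
  dir="${1:-/etc/cni/net.d}"
  find "$dir" -maxdepth 1 -type f \
    \( -name '*.conf' -o -name '*.conflist' -o -name '*.json' \) 2>/dev/null \
    | wc -l | tr -d ' '
}

# Usage on the node:
#   [ "$(cni_config_count)" -gt 0 ] || echo "no CNI configuration found"
```

A count of zero usually means the CNI plugin's Pod never wrote its configuration on this node, which is worth correlating with the CNI Pod logs.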
Note
CNI issues often appear as failed to create pod sandbox in kubelet logs.
Common Worker Node Failure Causes
| Symptom | Likely cause | Check |
|---|---|---|
| Node NotReady | kubelet down | systemctl status kubelet |
| Node Unknown | node stopped reporting | heartbeat time, node reachability |
| DiskPressure=True | disk full | df -h, container logs/images |
| MemoryPressure=True | memory exhausted | top, free -m |
| PIDPressure=True | too many processes | ps, process limits |
| Pods stuck ContainerCreating | CNI/runtime issue | kubelet logs, CNI Pods |
| Pods evicted | resource pressure | kubectl describe pod |
| Kubelet auth errors | certificate issue | kubelet cert and logs |
| Services not working on node | kube-proxy issue | kube-proxy logs/status |
| Image pull failures | registry/network/secret issue | describe pod, runtime logs |
Commands
Production Best Practices
Do
- Monitor node readiness, CPU, memory, disk, PID pressure, and kubelet health.
- Set resource requests and limits for applications.
- Configure log rotation for containers.
- Keep enough disk space under /var/lib/kubelet and /var/lib/containerd.
- Use node alerts for NotReady, Unknown, and pressure conditions.
- Use multiple worker nodes across availability zones.
- Drain nodes before planned maintenance.
- Keep kubelet certificates renewed and monitored.
- Run regular OS and Kubernetes patching through a controlled maintenance process.
Don't
- Do not ignore DiskPressure or high restart counts.
- Do not manually delete random files under /var/lib/kubelet.
- Do not restart production nodes without checking running workloads.
- Do not run critical workloads without replicas.
- Do not let container logs grow without rotation.
- Do not use worker nodes with mismatched hostnames, duplicate MAC addresses, or wrong certificates.
- Do not troubleshoot only from kubectl if the node itself is unhealthy.
Worker Node Failure Checklist
[ ] Is the node Ready, NotReady, or Unknown?
[ ] What does kubectl describe node show?
[ ] Are node conditions showing disk, memory, or PID pressure?
[ ] What is the LastHeartbeatTime?
[ ] Is the node reachable over SSH?
[ ] Is CPU or memory exhausted?
[ ] Is disk full?
[ ] Is kubelet active?
[ ] What do kubelet logs show?
[ ] Is container runtime active?
[ ] Are CNI Pods healthy?
[ ] Are kubelet certificates valid?
[ ] Can the node reach the API server on port 6443?
[ ] Are Pods failing only on this node or cluster-wide?
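The port 6443 item in the checklist can be tested from the node without extra tooling. This is a rough sketch that assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are available; can_reach is a hypothetical helper name, and a successful TCP connection only shows that the port is open, not that authentication works.

```shell
# Succeed if a TCP connection to host:port can be opened within 3 seconds.
# Uses bash's /dev/tcp redirection and coreutils timeout; both are assumptions
# about the node's tooling. can_reach is an illustrative helper name.
# $1: host, $2: port (defaults to 6443)
can_reach() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "${2:-6443}" 2>/dev/null
}

# Usage on the node, with <api-server-host> as a placeholder:
#   can_reach <api-server-host> 6443 && echo reachable || echo unreachable
```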
Quick Investigation Example
# 1. Check node status
kubectl get nodes
# 2. Describe affected node
kubectl describe node worker-2
# 3. SSH into node and check resources
top
df -h
free -m
# 4. Check kubelet
systemctl status kubelet
sudo journalctl -u kubelet --since "30 minutes ago"
# 5. Check runtime
systemctl status containerd
sudo journalctl -u containerd --since "30 minutes ago"
# 6. Check certificate
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -text -noout
Quote
Worker node troubleshooting is about finding why kubelet cannot keep the node healthy: resource pressure, kubelet failure, runtime failure, network failure, or certificate problems.