5.1 OS Upgrades
Why OS Upgrades Are Required
In production Kubernetes clusters, nodes must be upgraded regularly for:
- OS security patches
- Kernel updates
- Critical CVE remediation
- Base image hardening
- Cloud provider maintenance
Success
Regular OS patching is mandatory for production-grade security and compliance.
What Happens When a Node Goes Down?
When a node becomes NotReady or Unreachable:
- Kubernetes marks the node unhealthy.
- A `NoExecute` taint is applied automatically.
- Pods remain running temporarily.
- After the pod eviction timeout (~5 minutes default), pods are evicted.
- If managed by a ReplicaSet/Deployment, they are recreated on healthy nodes.
Default Failure Flow
Node failure → Wait (~5 min) → Pods evicted → Controllers recreate pods elsewhere
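You can observe this taint directly. A quick check against a live cluster (the node name `node01` is an example):

```shell
# List nodes and inspect the taints on a failing one
kubectl get nodes
kubectl describe node node01 | grep -i taints
# An unreachable node typically shows:
# Taints: node.kubernetes.io/unreachable:NoExecute
```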
Warning
Standalone pods (not part of a controller) are permanently lost.
Pod Eviction Timeout
- Default: ~5 minutes
- Managed by the controller manager
- Implemented via taints and tolerations
Note
If a node returns before timeout, pods may recover. After timeout, pods are deleted and rescheduled.
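The ~5-minute window comes from default `NoExecute` tolerations that are injected into every pod. A pod spec can shorten or lengthen it with `tolerationSeconds`; a minimal sketch (the 120-second value is illustrative):

```yaml
# Per-pod override of the eviction window for an unreachable node
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 120   # evict after 2 minutes instead of the default 300
```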
Production-Safe OS Upgrade Strategy
- Never reboot nodes directly in production.
- Use controlled workload migration.
Cordon vs Drain vs Uncordon (Clear Comparison)
1️⃣ Cordon
- Marks node unschedulable
- No new pods scheduled
- Existing pods continue running
Tip
Use cordon when preparing a node for maintenance.
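Cordoning is a single command (node name is an example):

```shell
kubectl cordon node01
kubectl get node node01
# STATUS shows: Ready,SchedulingDisabled
```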
2️⃣ Drain (Safe Production Method)
What drain does:
- Marks node unschedulable
- Gracefully evicts running pods
- Controllers recreate pods on other nodes
The resulting flow:
- Workloads move off the node
- Traffic continues via replicas
- Node becomes empty and safe to reboot
Success
Drain is the recommended method for OS upgrades.
Warning
Drain may fail if:
- PodDisruptionBudgets block eviction
- Standalone pods exist
- The cluster lacks spare capacity
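A typical drain invocation looks like this (node name is an example):

```shell
kubectl drain node01 --ignore-daemonsets --delete-emptydir-data
# --ignore-daemonsets: DaemonSet pods cannot be rescheduled elsewhere, so skip them
# --delete-emptydir-data: allow evicting pods that use emptyDir volumes
# Add --force only if you accept permanently losing standalone (unmanaged) pods
```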
3️⃣ Perform the OS Upgrade
After draining:
- Apply OS patches
- Reboot the node
- Ensure kubelet reconnects
- Verify the node shows `Ready`
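A sketch of these steps, assuming a Debian/Ubuntu node named `node01`:

```shell
# On the drained node itself:
sudo apt-get update && sudo apt-get upgrade -y
sudo reboot

# After the reboot, on the node, confirm the kubelet came back:
systemctl status kubelet

# From a machine with cluster access, wait for the node to report Ready:
kubectl get node node01
```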
4️⃣ Uncordon
- Node becomes schedulable again
- New pods can be placed
- Existing pods do NOT automatically move back
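Uncordoning mirrors the cordon command (node name is an example):

```shell
kubectl uncordon node01
kubectl get node node01
# STATUS returns to plain Ready (schedulable again)
```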
Recommended Zero-Downtime Upgrade Workflow
Abstract
- Ensure replicas > 1
- Confirm cluster capacity
- Cordon node
- Drain node
- Upgrade OS & reboot
- Verify node Ready
- Uncordon node
Repeat node-by-node (never all at once).
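The whole workflow can be sketched as a one-node-at-a-time loop (node names are examples; the OS patching step is a placeholder):

```shell
for node in node01 node02 node03; do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data

  # ...apply OS patches and reboot the node here...

  # Block until the kubelet reconnects and the node reports Ready
  kubectl wait --for=condition=Ready "node/$node" --timeout=10m
  kubectl uncordon "$node"
done
```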
Production Best Practices
DO This
- Always run workloads with replicas > 1
- Use Deployments/ReplicaSets
- Configure PodDisruptionBudgets
- Maintain N+1 capacity
- Upgrade one node at a time
- Monitor eviction events
- Validate node health before uncordon
- Test upgrades in staging first
- Automate with maintenance pipelines
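A minimal PodDisruptionBudget sketch to protect workloads during drains (name and labels are illustrative, assuming a Deployment labeled `app: web` with at least 3 replicas):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2        # keep at least 2 replicas running during voluntary disruptions
  selector:
    matchLabels:
      app: web
```

With this in place, `kubectl drain` evicts pods only as fast as the budget allows.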
Production Do & Don't
✅ DO
Tip
- Drain before upgrading
- Monitor cluster utilization
- Keep workloads stateless when possible
- Use readiness & liveness probes
- Maintain spare capacity
❌ DON'T
Danger
- Don't reboot nodes without draining
- Don't upgrade all nodes simultaneously
- Don't run single-replica production apps
- Don't ignore PodDisruptionBudgets
- Don't assume a failed node will return before the ~5-minute eviction timeout
Special Considerations
Stateful Workloads
Warning
- Ensure persistent volumes are externalized
- Validate replication health before draining
- Coordinate with application teams
Cluster Autoscaler Interaction
If the cluster is near full capacity:
- Drain may fail due to insufficient scheduling space.
- Always ensure spare capacity before maintenance.
Tip
Maintain buffer capacity in production clusters.
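A rough pre-maintenance capacity check (node name is an example; `kubectl top` requires metrics-server to be installed):

```shell
kubectl top nodes
kubectl describe node node01 | grep -A 5 "Allocated resources"
```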
Quick Reference Table
| Command | New Pods Allowed | Existing Pods Removed | Use Case |
|---|---|---|---|
| Cordon | No | No | Prep maintenance |
| Drain | No | Yes | Safe OS upgrade |
| Uncordon | Yes | No | Resume scheduling |
Interview / Exam Memory
Question
- Node down → ~5 minute eviction
- Drain = safe node maintenance
- Cordon = stop scheduling only
- Uncordon = resume scheduling
- ReplicaSets recreate pods
Final Production Takeaway
Quote
Safe OS upgrades in Kubernetes require:
Cordon → Drain → Upgrade → Verify → Uncordon
Design your workloads assuming nodes will fail.