5.1 OS Upgrades

Why OS Upgrades Are Required

In production Kubernetes clusters, nodes must be upgraded regularly for:

OS security patches
Kernel updates
Critical CVE remediation
Base image hardening
Cloud provider maintenance

Success

Regular OS patching is mandatory for production-grade security and compliance.

What Happens When a Node Goes Down?

When a node becomes NotReady or Unreachable:

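Kubernetes marks the node unhealthy (its Ready condition becomes False or Unknown).
A NoExecute taint (node.kubernetes.io/not-ready or node.kubernetes.io/unreachable) is applied automatically.
Pods stay bound to the node for as long as they tolerate that taint.
After the default toleration period (about 5 minutes), pods are evicted.
If managed by a ReplicaSet/Deployment, replacement pods are created on healthy nodes.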

Default Failure Flow

Node failure → Wait (~5 min) → Pods evicted → Controllers recreate pods elsewhere
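
To observe this state on a live cluster, the Ready condition and the taints of a node can be read with kubectl. The following is a minimal sketch: `node_health` is an illustrative helper name (not a kubectl feature), and the node name passed to it is a placeholder.

```shell
# Sketch: print the Ready status and the taints of one node.
# Read-only; nothing here changes cluster state.
node_health() {
  node="$1"
  # Status of the Ready condition: True, False, or Unknown
  kubectl get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"\n"}'
  # One line per taint, formatted as key=effect
  kubectl get node "$node" \
    -o jsonpath='{range .spec.taints[*]}{.key}{"="}{.effect}{"\n"}{end}'
}
```

Calling `node_health worker-1` (a hypothetical node name) prints the Ready status first, then any taints. On a failed node you would typically expect a status other than True together with one of the node.kubernetes.io taints described above.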

Warning

Standalone pods (not part of a controller) are permanently lost.
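
One way to find such pods before maintenance is to look for pods without an owner reference. This is a sketch under assumptions: `standalone_pods` is an illustrative name, and it relies on kubectl printing `<none>` in a custom column when the field is missing.

```shell
# Sketch: list pods that have no owning controller.
# Output format: namespace/pod-name, one per line.
standalone_pods() {
  kubectl get pods --all-namespaces --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' |
    awk '$3 == "<none>" { print $1 "/" $2 }'
}
```

Any pod this reports would not be recreated after eviction, so it is worth moving it under a Deployment or another controller before touching the node.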

Pod Eviction Timeout

Default: about 5 minutes (tolerationSeconds: 300)
Enforced by the node lifecycle controller in the controller manager
Implemented via NoExecute taints and per-pod tolerations

Note

If the node returns before the timeout, its pods can continue on that node. After the timeout, the pods are deleted and their controllers create replacements on other nodes.
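
Because the timeout is expressed as tolerations on each pod, it can be inspected per pod. A minimal sketch follows; `pod_tolerations` is an illustrative helper name, and the namespace and pod arguments are placeholders.

```shell
# Sketch: print key, effect, and tolerationSeconds for each toleration of a pod.
pod_tolerations() {
  ns="$1"
  pod="$2"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.tolerations[*]}{.key}{"\t"}{.effect}{"\t"}{.tolerationSeconds}{"\n"}{end}'
}
```

With default cluster settings you would typically expect entries for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with a value of 300. A pod spec can set its own tolerationSeconds for these taints if a shorter or longer wait is needed.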

Production-Safe OS Upgrade Strategy

Never reboot nodes directly in production.
Use controlled workload migration.

Cordon vs Drain vs Uncordon (Clear Comparison)

1️⃣ Cordon

kubectl cordon <node-name>

Marks node unschedulable
No new pods scheduled
Existing pods continue running

Tip

Use cordon when preparing a node for maintenance.
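
To confirm that a cordon took effect, the node's `spec.unschedulable` field can be checked. The sketch below wraps both steps; `cordon_and_check` is an illustrative name, not a kubectl command.

```shell
# Sketch: cordon a node, then verify it is marked unschedulable.
# Returns 0 only if the node reports spec.unschedulable=true.
cordon_and_check() {
  node="$1"
  kubectl cordon "$node" || return 1
  state="$(kubectl get node "$node" -o jsonpath='{.spec.unschedulable}')"
  [ "$state" = "true" ]
}
```

A cordoned node also shows SchedulingDisabled in the STATUS column of `kubectl get nodes`.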

2️⃣ Drain (Safe Production Method)

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

What drain does:

Marks node unschedulable
Gracefully evicts running pods
Controllers recreate pods on other nodes

The resulting flow:

Workloads move off the node
Traffic continues via replicas
Node becomes empty and safe to reboot

Success

Drain is the recommended method for OS upgrades.

Warning

Drain may fail or stall if:

PodDisruptionBudgets block eviction
Standalone pods exist (drain refuses to delete them unless --force is given)
Cluster lacks spare capacity (evicted pods stay Pending)
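
To keep a blocked drain from hanging indefinitely, it can be run with a timeout after listing the PodDisruptionBudgets that might block it. This is a sketch, not a definitive procedure: `drain_with_timeout` is an illustrative name and the 300s default is an assumed value to adjust.

```shell
# Sketch: show PodDisruptionBudgets, then drain with an upper time limit.
# Second argument is optional; 300s is an arbitrary default.
drain_with_timeout() {
  node="$1"
  timeout="${2:-300s}"
  # The ALLOWED DISRUPTIONS column shows which budgets currently permit eviction
  kubectl get pdb --all-namespaces
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout="$timeout"
}
```

If the drain times out, the node stays cordoned, so the blocking budget or missing capacity can be investigated before retrying.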

3️⃣ Perform the OS Upgrade

After draining:

Apply OS patches
Reboot the node
Ensure kubelet reconnects
Verify node shows Ready
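
The final verification step can be scripted with `kubectl wait`. The sketch below assumes the OS patch and reboot have already been done on the node itself (the patch commands depend on the distribution and are not shown); `wait_node_ready` is an illustrative name and the 10m default is an assumption.

```shell
# Sketch: block until the node reports the Ready condition, or time out.
wait_node_ready() {
  node="$1"
  timeout="${2:-10m}"
  kubectl wait --for=condition=Ready "node/$node" --timeout="$timeout"
}
```

A non-zero exit status means the node did not become Ready in time, which usually points to a kubelet or network problem after the reboot.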

4️⃣ Uncordon

kubectl uncordon <node-name>

Node becomes schedulable again
New pods can be placed
Existing pods do NOT automatically move back

Recommended Zero-Downtime Upgrade Workflow

Ensure replicas > 1
Confirm cluster capacity
Cordon node
Drain node
Upgrade OS & reboot
Verify node Ready
Uncordon node

Repeat node-by-node (never all at once).
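
The workflow above could be sketched as a single function that stops at the first failing step. Everything named here is illustrative: `upgrade_node` is not a kubectl feature, `patch_and_reboot` is a hypothetical hook that you would replace with your own patching mechanism, and the timeouts are assumed values.

```shell
# HYPOTHETICAL hook: replace with your own OS patch and reboot step.
# It should return only after the node has actually gone down and restarted,
# otherwise the Ready check below may pass against the pre-reboot state.
patch_and_reboot() {
  echo "patch_and_reboot: not implemented for node $1" >&2
  return 1
}

# Sketch: cordon, drain, patch, verify, uncordon for ONE node.
upgrade_node() {
  node="$1"
  kubectl cordon "$node" || return 1
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s || return 1
  patch_and_reboot "$node" || return 1
  kubectl wait --for=condition=Ready "node/$node" --timeout=15m || return 1
  kubectl uncordon "$node"
}
```

Run it for one node, confirm that workloads are healthy, then move on to the next. If any step fails, the node is left cordoned so that no new pods land on it while the problem is investigated.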

Production Best Practices

DO This

Always run workloads with replicas > 1
Use Deployments/ReplicaSets
Configure PodDisruptionBudgets
Maintain N+1 capacity
Upgrade one node at a time
Monitor eviction events
Validate node health before uncordon
Test upgrades in staging first
Automate with maintenance pipelines
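
As an illustration of the PodDisruptionBudget item, a budget can be created directly from the command line. This is a sketch: `create_pdb` is an illustrative wrapper, and the budget name, label selector, and minimum of 1 are assumptions to adapt to your workload.

```shell
# Sketch: create a PodDisruptionBudget that keeps at least MIN pods available.
# Arguments: budget name, label selector, optional minimum (defaults to 1).
create_pdb() {
  name="$1"
  selector="$2"
  min="${3:-1}"
  kubectl create poddisruptionbudget "$name" \
    --selector="$selector" --min-available="$min"
}
```

For example, `create_pdb web-pdb app=web 2` would ask Kubernetes to keep two pods labeled app=web running during voluntary disruptions such as a drain. A budget that demands as many pods as there are replicas will block every drain, so leave room for at least one eviction.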

Production Do & Don't

✅ DO

Tip

Drain before upgrading
Monitor cluster utilization
Keep workloads stateless when possible
Use readiness & liveness probes
Maintain spare capacity

❌ DON'T

Danger

Don’t reboot nodes without draining
Don’t upgrade all nodes simultaneously
Don’t run single-replica production apps
Don’t ignore PodDisruptionBudgets
Don’t assume a node will recover within 5 minutes

Special Considerations

Stateful Workloads

Warning

Ensure data lives on persistent volumes that can reattach on another node (network or cloud storage, not node-local disks)
Validate replication health before draining
Coordinate with application teams
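
To know which teams to involve, it helps to list the pods on a node that mount a PersistentVolumeClaim. The sketch below is hedged: `pods_with_pvcs` is an illustrative name, and it assumes kubectl prints `<none>` in a custom column when a pod has no claim.

```shell
# Sketch: list pods on one node that reference a PersistentVolumeClaim.
# Output format: namespace/pod-name claim-name(s)
pods_with_pvcs() {
  node="$1"
  kubectl get pods --all-namespaces --no-headers \
    --field-selector="spec.nodeName=$node" \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,PVC:.spec.volumes[*].persistentVolumeClaim.claimName' |
    awk '$3 != "<none>" { print $1 "/" $2 " " $3 }'
}
```

An empty result suggests the node carries no claim-backed workloads; any listed pod deserves a replication and storage check before draining.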

Cluster Autoscaler Interaction

If the cluster is near full capacity:

Evicted pods may stay Pending, and a drain can stall, because no other node has room for them.
Always ensure spare capacity before maintenance.

Tip

Maintain buffer capacity in production clusters.
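
A quick signal that capacity is already tight is the number of pods stuck in Pending. The following sketch counts them; `pending_pod_count` is an illustrative name, and a count of zero is only a rough indicator, not proof of sufficient headroom.

```shell
# Sketch: count pods currently in the Pending phase across all namespaces.
pending_pod_count() {
  kubectl get pods --all-namespaces --no-headers \
    --field-selector=status.phase=Pending 2>/dev/null |
    awk 'END { print NR }'
}
```

If the count is already above zero before maintenance, draining another node is likely to make things worse, so add capacity first.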

Quick Reference Table

Command	New Pods Allowed	Existing Pods Removed	Use Case
Cordon	No	No	Prep maintenance
Drain	No	Yes	Safe OS upgrade
Uncordon	Yes	No	Resume scheduling

Interview / Exam Memory

Node down → ~5 minute eviction
Drain = safe node maintenance
Cordon = stop scheduling only
Uncordon = resume scheduling
ReplicaSets recreate pods

Final Production Takeaway

Safe OS upgrades in Kubernetes require:

Cordon → Drain → Upgrade → Verify → Uncordon

Design your workloads assuming nodes will fail.