5.1 OS Upgrades

Why OS Upgrades Are Required

In production Kubernetes clusters, nodes must be upgraded regularly for:

  • OS security patches
  • Kernel updates
  • Critical CVE remediation
  • Base image hardening
  • Cloud provider maintenance

Success

Regular OS patching is mandatory for production-grade security and compliance.


What Happens When a Node Goes Down?

When a node becomes NotReady or Unreachable:

  1. Kubernetes marks the node unhealthy.
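  2. A NoExecute taint (node.kubernetes.io/not-ready or node.kubernetes.io/unreachable) is applied automatically.
  3. Pods stay bound to the node for as long as they tolerate that taint.
  4. After the pod eviction timeout (~5 minutes by default), pods are evicted.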
  5. If managed by a ReplicaSet/Deployment, they are recreated on healthy nodes.

Default Failure Flow

Node failure → Wait (~5 min) → Pods evicted → Controllers recreate pods elsewhere
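
To observe this flow on a live cluster, the taints currently set on a node can be listed. This is a minimal sketch, assuming kubectl is already configured for the cluster; show_node_taints is a hypothetical helper name, not a kubectl feature.

```shell
# show_node_taints NODE  (hypothetical helper)
# Prints one "key=effect" line per taint on the node.
# A failed node is expected to carry a taint such as
# node.kubernetes.io/unreachable with effect NoExecute.
show_node_taints() {
  kubectl get node "$1" \
    -o jsonpath='{range .spec.taints[*]}{.key}={.effect}{"\n"}{end}'
}

# Usage (worker-1 is a placeholder node name):
#   show_node_taints worker-1
```

A healthy, schedulable node normally prints nothing here.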

Warning

Standalone pods (not part of a controller) are permanently lost.


Pod Eviction Timeout

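  • Default: ~5 minutes (300 seconds)
  • Managed by the controller manager, whose node lifecycle controller applies the taints
  • Implemented via taints and tolerations: each pod gets a default 300-second toleration (tolerationSeconds) for the not-ready and unreachable taints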

Note

If the node returns before the timeout expires, its pods are not evicted and may recover in place. After the timeout, pods are deleted and, if a controller manages them, rescheduled on other nodes.
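
The timeout that applies to a given pod can be read from its tolerations. A minimal sketch, assuming a cluster with default admission settings, where pods typically show tolerationSeconds of 300 for the not-ready and unreachable taints; show_pod_tolerations is a hypothetical helper name.

```shell
# show_pod_tolerations POD [NAMESPACE]  (hypothetical helper)
# Prints key, effect and tolerationSeconds for each toleration on the pod.
# NAMESPACE falls back to "default" when omitted.
show_pod_tolerations() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .spec.tolerations[*]}{.key}{"\t"}{.effect}{"\t"}{.tolerationSeconds}{"\n"}{end}'
}

# Usage (web-0 is a placeholder pod name):
#   show_pod_tolerations web-0
```

A pod that sets its own tolerationSeconds for these taints is evicted sooner or later than the default.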


Production-Safe OS Upgrade Strategy

  • Never reboot nodes directly in production.

  • Use controlled workload migration.


Cordon vs Drain vs Uncordon (Clear Comparison)

1️⃣ Cordon

kubectl cordon <node-name>
  • Marks node unschedulable
  • No new pods scheduled
  • Existing pods continue running

Tip

Use cordon when preparing a node for maintenance.
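
Whether a cordon took effect can be confirmed from the node's spec.unschedulable field. A minimal sketch, assuming kubectl access to the cluster; is_cordoned is a hypothetical helper name.

```shell
# is_cordoned NODE  (hypothetical helper)
# Exits 0 when the node's spec.unschedulable field is "true", non-zero otherwise.
is_cordoned() {
  [ "$(kubectl get node "$1" -o jsonpath='{.spec.unschedulable}')" = "true" ]
}

# Usage (worker-1 is a placeholder node name):
#   kubectl cordon worker-1
#   is_cordoned worker-1 && echo "worker-1 is cordoned"
```

The same state appears as SchedulingDisabled in the STATUS column of kubectl get nodes.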


2️⃣ Drain (Safe Production Method)

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

What drain does:

  • Marks node unschedulable
  • Gracefully evicts running pods
  • Controllers recreate pods on other nodes

The resulting flow:

  • Workloads move off the node
  • Traffic continues via replicas
  • Node becomes empty and safe to reboot

Success

Drain is the recommended method for OS upgrades.

Warning

Drain may fail if:

  • PodDisruptionBudgets block eviction
  • Standalone pods exist (drain refuses to delete them unless --force is passed)
  • Cluster lacks spare capacity
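
Two of these conditions can be checked before draining. The sketch below assumes kubectl access and relies on the custom-columns output printing <none> for a missing field; list_standalone_pods and list_pdbs are hypothetical helper names.

```shell
# list_standalone_pods NODE  (hypothetical helper)
# Prints "namespace name" for pods on NODE that have no owning controller.
# Assumption: custom-columns prints "<none>" when ownerReferences is absent.
list_standalone_pods() {
  kubectl get pods --all-namespaces \
    --field-selector "spec.nodeName=$1" --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' |
    awk '$NF == "<none>" { print $(NF-2), $(NF-1) }'
}

# list_pdbs  (hypothetical helper)
# Shows every PodDisruptionBudget, including its allowed disruptions.
list_pdbs() {
  kubectl get poddisruptionbudgets --all-namespaces
}

# Usage (worker-1 is a placeholder node name):
#   list_standalone_pods worker-1
#   list_pdbs
```

A PodDisruptionBudget that reports zero allowed disruptions is the one most likely to stall a drain.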


3️⃣ Perform the OS Upgrade

After draining:

  • Apply OS patches
  • Reboot the node
  • Ensure kubelet reconnects
  • Verify node shows Ready
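
The last two checks can be scripted with kubectl wait. A minimal sketch, assuming it runs once the node has started booting again; wait_node_ready is a hypothetical helper name and the 300s default is an arbitrary choice.

```shell
# wait_node_ready NODE [TIMEOUT]  (hypothetical helper)
# Blocks until the node reports condition Ready=True, or fails after TIMEOUT.
wait_node_ready() {
  kubectl wait --for=condition=Ready "node/$1" --timeout="${2:-300s}"
}

# Usage (worker-1 is a placeholder node name):
#   wait_node_ready worker-1 10m
```

If this runs immediately after the reboot is issued, the node may still report its last known Ready status, so a short delay beforehand is a reasonable precaution.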

4️⃣ Uncordon

kubectl uncordon <node-name>
  • Node becomes schedulable again
  • New pods can be placed
  • Existing pods do NOT automatically move back
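
Listing the pods bound to the node shows that nothing moves back on its own. A minimal sketch, assuming kubectl access; pods_on_node is a hypothetical helper name.

```shell
# pods_on_node NODE  (hypothetical helper)
# Lists every pod currently bound to NODE, across all namespaces.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}

# Usage (worker-1 is a placeholder node name):
#   kubectl uncordon worker-1
#   pods_on_node worker-1
```

Right after uncordon, the output is typically limited to DaemonSet pods until the scheduler places new pods there.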

Recommended Zero-Downtime Upgrade Workflow

Abstract

  1. Ensure replicas > 1
  2. Confirm cluster capacity
  3. Cordon node
  4. Drain node
  5. Upgrade OS & reboot
  6. Verify node Ready
  7. Uncordon node

Repeat node-by-node (never all at once).
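
Steps 3 to 7 can be sketched as a script that handles one node at a time. This is an illustration under assumptions, not a complete pipeline: os_upgrade_hook is a hypothetical placeholder for whatever patches and reboots nodes in your environment, and the timeouts are arbitrary.

```shell
# os_upgrade_hook NODE  (hypothetical placeholder)
# Replace with the mechanism that patches and reboots the node,
# for example an SSH call or a cloud provider API.
os_upgrade_hook() {
  echo "TODO: patch and reboot $1" >&2
}

# upgrade_node NODE
# Runs cordon -> drain -> upgrade -> verify -> uncordon for one node.
# Stops at the first failing step, leaving the node cordoned for inspection.
upgrade_node() {
  node=$1
  kubectl cordon "$node" &&
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m &&
    os_upgrade_hook "$node" &&
    kubectl wait --for=condition=Ready "node/$node" --timeout=15m &&
    kubectl uncordon "$node"
}

# upgrade_nodes NODE...
# Processes nodes strictly one at a time and aborts on the first failure.
upgrade_nodes() {
  for n in "$@"; do
    upgrade_node "$n" || { echo "upgrade failed on $n; stopping" >&2; return 1; }
  done
}

# Usage (node names are placeholders):
#   upgrade_nodes worker-1 worker-2 worker-3
```

Stopping on the first failure is deliberate: continuing to the next node while one is still drained would reduce capacity further.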


Production Best Practices

DO This

  • Always run workloads with replicas > 1
  • Use Deployments/ReplicaSets
  • Configure PodDisruptionBudgets
  • Maintain N+1 capacity
  • Upgrade one node at a time
  • Monitor eviction events
  • Validate node health before uncordon
  • Test upgrades in staging first
  • Automate with maintenance pipelines
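
A PodDisruptionBudget can be created directly from the command line. A minimal sketch, assuming a workload labelled app=web; create_pdb is a hypothetical helper name, and the budget name and label are placeholders.

```shell
# create_pdb NAME SELECTOR MIN_AVAILABLE [NAMESPACE]  (hypothetical helper)
# Creates a PodDisruptionBudget that keeps at least MIN_AVAILABLE matching
# pods running during voluntary disruptions such as a drain.
create_pdb() {
  kubectl create poddisruptionbudget "$1" \
    --selector="$2" --min-available="$3" -n "${4:-default}"
}

# Usage (web-pdb and app=web are placeholders):
#   create_pdb web-pdb app=web 2
```

Setting the minimum equal to the replica count leaves no room for eviction and blocks every drain, which is one more reason to run more than one replica.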

Production Do & Don't

✅ DO

Tip

  • Drain before upgrading
  • Monitor cluster utilization
  • Keep workloads stateless when possible
  • Use readiness & liveness probes
  • Maintain spare capacity

❌ DON'T

Danger

  • Don’t reboot nodes without draining
  • Don’t upgrade all nodes simultaneously
  • Don’t run single-replica production apps
  • Don’t ignore PodDisruptionBudgets
  • Don’t assume node recovery under 5 minutes

Special Considerations

Stateful Workloads

Warning

  • Ensure data lives on persistent volumes that can reattach on another node, not on node-local storage
  • Validate replication health before draining
  • Coordinate with application teams

Cluster Autoscaler Interaction

If the cluster is near full capacity:

  • Drain may fail due to insufficient scheduling space.
  • Always ensure spare capacity before maintenance.

Tip

Maintain buffer capacity in production clusters.
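
A rough first check before maintenance is how many nodes can still accept pods. The sketch below only counts nodes and does not measure CPU or memory headroom; count_schedulable_nodes is a hypothetical helper name.

```shell
# count_schedulable_nodes  (hypothetical helper)
# Counts nodes whose STATUS column is exactly "Ready". Cordoned nodes show
# "Ready,SchedulingDisabled" and NotReady nodes are excluded as well.
count_schedulable_nodes() {
  kubectl get nodes --no-headers | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage:
#   echo "schedulable nodes: $(count_schedulable_nodes)"
```

For actual headroom, the Allocated resources section of kubectl describe node gives per-node request totals.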


Quick Reference Table

Command  | New Pods Allowed | Existing Pods Removed | Use Case
Cordon   | No               | No                    | Prep maintenance
Drain    | No               | Yes                   | Safe OS upgrade
Uncordon | Yes              | No                    | Resume scheduling

Interview / Exam Memory

Question

  • Node down → ~5 minute eviction
  • Drain = safe node maintenance
  • Cordon = stop scheduling only
  • Uncordon = resume scheduling
  • ReplicaSets recreate pods

Final Production Takeaway

Quote

Safe OS upgrades in Kubernetes require:

Cordon → Drain → Upgrade → Verify → Uncordon

Design your workloads assuming nodes will fail.