
5.3 Kubernetes Backup & Restore

What Should You Back Up?

In production, you must protect three things that actually matter for recovery:

πŸ”Ή 1. Resource Configuration

  • Deployments
  • Services
  • ConfigMaps
  • Secrets
  • Ingress
  • RBAC objects

πŸ”Ή 2. Cluster State (etcd)

  • Nodes
  • Pods
  • All Kubernetes objects
  • Cluster metadata

πŸ”Ή 3. Persistent Volumes (Application Data)

Abstract

If the cluster is lost:

  • YAML files restore workloads
  • etcd restores cluster state
  • PV backups restore business data

Backup Resource Configuration

Declarative (Best Practice)

Store all YAML manifests in Git:

kubectl apply -f app/

Benefits:

  • Version controlled
  • Reusable
  • Team-friendly
  • Git becomes your backup

Success

GitOps-style configuration storage is the safest production approach.


Export Live Resources (Safety Net)

If objects were created imperatively:

kubectl get all --all-namespaces -o yaml > cluster-backup.yaml

This queries the API server and exports live resource definitions. Note that kubectl get all covers only a subset of resource types; ConfigMaps, Secrets, Ingress, and RBAC objects must be exported explicitly.
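As a hedged sketch of a fuller export, resources can be dumped per namespace so each file can be restored independently. The resource kinds listed and the backup directory name are illustrative; extend the list to match what your workloads use:

```shell
# Sketch: export selected resource kinds per namespace.
# Kinds and paths are illustrative assumptions, not a complete list.
BACKUP_DIR="backup-$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
if command -v kubectl >/dev/null 2>&1; then
  for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
    kubectl get deploy,svc,configmap,secret,ingress -n "$ns" -o yaml \
      > "$BACKUP_DIR/$ns.yaml"
  done
fi
```

Like the one-file export above, this still does not capture persistent volume data.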

Warning

This does NOT back up persistent volume data.


Tools for Resource Backup

  • Velero
  • Cloud-native backup operators
  • GitOps pipelines

Tip

In managed clusters (EKS/GKE/AKS), API-based backup is usually required because etcd access is restricted.


Backup etcd (Cluster State Backup)

etcd is the Kubernetes key-value database.

It stores:

  • All cluster objects
  • Secrets
  • Nodes
  • Cluster metadata

Instead of exporting YAML, you can snapshot etcd.


Working with ETCDCTL & ETCDUTL

Note

Always use ETCDCTL_API=3 for Kubernetes clusters.


Taking etcd Backup

Option 1 β€” Live Snapshot (etcdctl)

ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot.db

Required Flags

  • --endpoints β†’ etcd endpoint
  • --cacert β†’ CA certificate
  • --cert β†’ client certificate
  • --key β†’ client key

Verify snapshot:

etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table

Success

This creates a portable .db snapshot of the entire cluster state.


Option 2 β€” File-Level Backup (etcdutl)

etcdutl backup \
  --data-dir /var/lib/etcd \
  --backup-dir /backup/etcd-backup

This copies:

  • etcd backend database
  • WAL files

Note

etcdutl backup works even if etcd is not running.
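Either backup option is only useful if it runs on a schedule. A minimal sketch of automating the live-snapshot variant via cron follows; the 02:00 schedule, certificate paths, and /backup location are assumptions, and the generated file would be installed under /etc/cron.d on the control plane node (cron entries must be a single line, so the command is not wrapped):

```shell
# Sketch: generate a cron entry for nightly etcd snapshots.
# Schedule, cert paths, and the /backup directory are assumptions.
# The "\%" escape is required because "%" is special in crontab files.
cat > etcd-backup.cron <<'EOF'
0 2 * * * root ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /backup/etcd-$(date +\%F).db
EOF
# Then, as root on the control plane node:
#   cp etcd-backup.cron /etc/cron.d/etcd-backup
```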


Restoring etcd from snapshot

Step 1 β€” Stop kube-apiserver

mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 30

Note

Moving a static pod manifest out of /etc/kubernetes/manifests causes the kubelet to stop that component automatically.
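Instead of a fixed sleep, you can poll until the container is actually gone. This is a sketch assuming a kubeadm node where crictl is installed:

```shell
# Sketch: wait for the kube-apiserver container to disappear rather than
# sleeping a fixed 30 seconds (assumes crictl on a kubeadm control plane node).
if command -v crictl >/dev/null 2>&1; then
  while crictl ps --name kube-apiserver -q | grep -q .; do
    sleep 2
  done
fi
```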


Step 2 β€” Restore etcd Snapshot

etcdutl snapshot restore /backup/etcd-snapshot.db \
  --data-dir /var/lib/etcd-from-backup

This creates a new restored data directory:

/var/lib/etcd-from-backup

Step 3 β€” Update etcd Static Pod Config

vi /etc/kubernetes/manifests/etcd.yaml

VERY IMPORTANT: You must modify ONLY the hostPath of the etcd-data volume inside the volumes section.


πŸ”΄ OLD Configuration

volumes:
- hostPath:
    path: /etc/kubernetes/pki/etcd
    type: DirectoryOrCreate
  name: etcd-certs
- hostPath:
    path: /var/lib/etcd        # OLD directory
    type: DirectoryOrCreate
  name: etcd-data

🟒 NEW Configuration (After Restore)

volumes:
- hostPath:
    path: /etc/kubernetes/pki/etcd
    type: DirectoryOrCreate
  name: etcd-certs
- hostPath:
    path: /var/lib/etcd-from-backup   # NEW restored directory
    type: DirectoryOrCreate
  name: etcd-data
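One way to make this edit safely is a targeted substitution. This is a sketch: it anchors on the lowercase "path:" key, which does not match the capitalized "mountPath:" lines, so volumeMounts are left untouched. Verify the result before the kubelet restarts etcd:

```shell
# Sketch: update only the hostPath line in etcd.yaml.
# The lowercase "path:" anchor cannot match "mountPath:" (capital P),
# so the volumeMounts section is left alone. Test on a copy first.
MANIFEST=${MANIFEST:-/etc/kubernetes/manifests/etcd.yaml}
if [ -f "$MANIFEST" ]; then
  sed -i 's|path: /var/lib/etcd$|path: /var/lib/etcd-from-backup|' "$MANIFEST"
  grep -n 'path:' "$MANIFEST"   # eyeball: hostPath changed, mountPath intact
fi
```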

❗ Why Do We Change ONLY volumes.hostPath.path?

This is extremely important.

There are two different layers:

| Section | Meaning | Location |
| --- | --- | --- |
| hostPath.path | Directory on the node (host machine) | Physical storage |
| volumeMounts.mountPath | Directory inside the container | Container filesystem |

πŸ” Volume Mount Section (DO NOT CHANGE)

In the same file you will see:

volumeMounts:
- mountPath: /var/lib/etcd
  name: etcd-data
- mountPath: /etc/kubernetes/pki/etcd
  name: etcd-certs

🚨 This must NOT be changed.


🧠 Why Not Change mountPath?

Because the etcd container is started with:

--data-dir=/var/lib/etcd

That path is inside the container.

If you change mountPath, etcd will not find its database and will fail to start.


πŸ“¦ What Actually Happens

After restore:

Host machine:

/var/lib/etcd-from-backup

Container view:

/var/lib/etcd

Mapping:

Host: /var/lib/etcd-from-backup
        ↓ mounted as
Container: /var/lib/etcd

The container does NOT know the host path changed.

That is why:

βœ… Change hostPath.path
❌ Do NOT change mountPath


⚠️ What Happens If You Change mountPath?

If you modify:

mountPath: /var/lib/etcd-from-backup

Then:

  • etcd still expects /var/lib/etcd
  • Database not found
  • etcd crashes
  • Control plane fails

Success

Always change only the hostPath during restore.


Step 4 — Restart Control Plane Components

Restart kube-apiserver

mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

Wait ~60 seconds.

Restart controller-manager

mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
sleep 20
mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/

Restart scheduler

mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/
sleep 20
mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/

Restart kubelet

systemctl restart kubelet

Monitor Restore

watch crictl ps

Ensure:

  • All control plane pods = Running
  • etcd is healthy
  • kube-apiserver is Running
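To confirm etcd health directly rather than only watching container status, you can query its health endpoint. This is a sketch using the kubeadm default certificate paths; it skips gracefully on machines without etcdctl:

```shell
# Sketch: check etcd health after restore (kubeadm default cert paths assumed).
ENDPOINTS=https://127.0.0.1:2379
if command -v etcdctl >/dev/null 2>&1; then
  ETCDCTL_API=3 etcdctl \
    --endpoints="$ENDPOINTS" \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health
fi
```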

Verify Restore

kubectl get deployments,services --all-namespaces
kubectl get pods --all-namespaces
kubectl get nodes

Success

All resources should reflect snapshot time.


Persistent Volume Backup

etcd does NOT store application data.

For stateful workloads:

  • Use cloud disk snapshots (EBS, GCE PD, Azure Disk)
  • Use CSI VolumeSnapshots
  • Use storage-native backup tools
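For the CSI route, a VolumeSnapshot is a normal Kubernetes object. The sketch below shows the shape of one; the names, namespace, and snapshot class are assumptions, and it requires the external-snapshotter CRDs plus a CSI driver with snapshot support:

```shell
# Sketch: a CSI VolumeSnapshot for one PVC. The PVC name (app-data) and
# snapshot class (csi-snapclass) are hypothetical; substitute your own.
cat > pvc-snapshot.yaml <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: app-data-snap
  namespace: default
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: app-data
EOF
if command -v kubectl >/dev/null 2>&1; then
  kubectl apply -f pvc-snapshot.yaml
fi
```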

Danger

Without PV backups, application data loss is permanent.


Comparing Backup Methods

| Method | Protects | Best For |
| --- | --- | --- |
| Git/YAML | App definitions | GitOps |
| kubectl export | Resource configs | Quick backup |
| etcdctl snapshot | Full cluster state | Self-managed clusters |
| etcdutl backup | Raw etcd files | Advanced DR |
| PV snapshot | Application data | Stateful apps |

Production Best Practices

DO This

  • Store manifests in Git
  • Schedule automated etcd snapshots
  • Backup persistent volumes separately
  • Encrypt backup files
  • Store backups off-cluster
  • Test restore regularly
  • Document restore runbook
  • Monitor etcd health
  • Verify snapshot integrity
  • Maintain off-site copies
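Scheduled snapshots also need rotation, or the backup disk fills up. A minimal pruning sketch follows; the /backup directory and 7-day retention window are assumptions, and at least one recent off-cluster copy should always be kept as well:

```shell
# Sketch: delete local etcd snapshots older than 7 days.
# Directory and retention window are assumptions; keep off-site copies too.
prune_snapshots() {
  find "$1" -name 'etcd-*.db' -mtime +7 -delete
}
if [ -d /backup ]; then
  prune_snapshots /backup
fi
```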

❌ DON'T

Danger

  • Don’t rely only on YAML exports
  • Don’t skip PV backups
  • Don’t store backups on same node
  • Don’t skip restore testing
  • Don’t expose etcd without TLS
  • Don’t modify volumeMount path

Disaster Recovery Flow

Abstract

  1. Identify failure type (state vs data)
  2. Restore etcd snapshot (if control plane issue)
  3. Restore persistent volumes (if data issue)
  4. Reapply manifests if needed
  5. Validate cluster health
  6. Monitor workloads

🎯 Interview Memory

Question

  • etcd stores cluster state
  • YAML + Git = config backup
  • ETCDCTL_API=3 required
  • etcdctl snapshot save
  • Stop API server before restore
  • etcdutl snapshot restore
  • Update data-dir in etcd.yaml
  • Static pod restart behavior
  • PV backups are separate

Question

We change only hostPath.path because it defines where the data lives on the host. mountPath defines where that data appears inside the container. etcd expects /var/lib/etcd inside the container, so changing mountPath breaks etcd startup.


Final Production Takeaway

Quote

Backups are useless without restore testing.

Protect:

  • Configuration (Git)
  • Cluster State (etcd)
  • Application Data (Persistent Volumes)

Automate backups. Test restores. Protect production.

  • In static pod restore, only change the host path β€” never the container path.