5.3 Kubernetes Backup & Restore

1️⃣ What Should You Back Up?

In production, you must protect three things that actually matter for recovery:

🔹 1. Resource Configuration

  • Deployments
  • Services
  • ConfigMaps
  • Secrets
  • Ingress
  • RBAC objects

🔹 2. Cluster State (etcd)

  • Nodes
  • Pods
  • All Kubernetes objects
  • Cluster metadata

🔹 3. Persistent Volumes (Application Data)

Abstract

If the cluster is lost:

  • YAML files restore workloads
  • etcd restores cluster state
  • PV backups restore business data

2️⃣ Backup Method 1: Resource Configuration (Recommended)

Declarative (Best Practice)

Store all YAML manifests in Git and apply them declaratively:

kubectl apply -f app/

Benefits:

  • Version controlled
  • Reusable
  • Team-friendly
  • Git becomes your backup

Success

GitOps-style configuration storage is the safest production approach.
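
Git only works as a backup while it matches what is actually running. The sketch below is one way to check for drift between the manifests in Git and the live cluster, using the documented exit codes of kubectl diff (0 = no differences, 1 = differences, greater than 1 = error). The function name check_drift is an illustrative assumption, not part of any tool.

```shell
#!/bin/sh
# Sketch: report whether the manifests in a directory still match the cluster.
# check_drift is a hypothetical helper name.

check_drift() {
  rc=0
  # kubectl diff exits 0 when nothing differs, 1 when differences exist,
  # and with a higher code when kubectl or diff itself failed.
  kubectl diff -f "$1" > /dev/null || rc=$?
  case $rc in
    0) echo "in sync" ;;
    1) echo "drift detected"; return 1 ;;
    *) echo "kubectl diff failed (exit $rc)" >&2; return 2 ;;
  esac
}
```

Usage, assuming the manifests live in app/ as above: check_drift app/ || echo "re-export or re-apply before trusting Git as the backup".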


Export Live Resources (Safety Net)

If objects were created imperatively:

kubectl get all --all-namespaces -o yaml > cluster-backup.yaml

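This queries the API server and exports resource definitions. Note that "all" covers only a subset of kinds (such as Pods, Services, Deployments, and StatefulSets); it does not include ConfigMaps, Secrets, Ingress, or RBAC objects, so export those kinds explicitly.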

Warning

This does NOT back up persistent volume data.
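
Because kubectl get all skips several of the kinds listed in section 1, a per-kind export is a safer net. The following is a minimal sketch, assuming the kind list and output directory shown; both are illustrative and should be adapted to the cluster (CRDs and other kinds are not covered here).

```shell
#!/bin/sh
# Sketch: export selected resource kinds across all namespaces, one file per kind.
# BACKUP_KINDS and export_resources are illustrative names, not a standard tool.

# Kinds named earlier in this guide, including ones `kubectl get all` omits.
BACKUP_KINDS="deployments services configmaps secrets ingresses roles rolebindings clusterroles clusterrolebindings"

export_resources() (
  out_dir=$1
  mkdir -p "$out_dir" || exit 1
  for kind in $BACKUP_KINDS; do
    # --all-namespaces is also accepted for cluster-scoped kinds.
    if ! kubectl get "$kind" --all-namespaces -o yaml > "$out_dir/$kind.yaml"; then
      echo "export failed for $kind" >&2
      exit 1
    fi
  done
  # secrets.yaml holds base64-encoded (not encrypted) values: protect this directory.
)
```

Usage: export_resources /backup/resources-$(date +%Y%m%d). The exported YAML contains server-populated fields such as status and resourceVersion, so treat it as a safety net rather than a clean source of truth.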


Tools for Resource Backup

  • Velero
  • Cloud-native backup operators
  • GitOps pipelines

Tip

In managed clusters (EKS/GKE/AKS), API-based backup is usually required because etcd access is restricted.


3️⃣ Backup Method 2: etcd Snapshot (Cluster State Backup)

etcd is the key-value store that holds all Kubernetes cluster state.

It stores:

  • All cluster objects
  • Secrets
  • Nodes
  • Cluster metadata

Instead of exporting YAML, you can snapshot etcd.


4️⃣ Working with ETCDCTL & ETCDUTL

Verify etcd Version

etcdctl version

Example:

etcdctl version: 3.5.16
API version: 3.5

Note

Kubernetes uses the etcd v3 API. etcdctl 3.4 and later default to it; on older versions, set ETCDCTL_API=3 explicitly.


5️⃣ Taking etcd Backup

Option 1: Live Snapshot (etcdctl)

ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot.db

Required Flags

  • --endpoints → etcd endpoint
  • --cacert → CA certificate
  • --cert → client certificate
  • --key → client key

Verify snapshot:

etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table

Success

This creates a portable .db snapshot of the entire cluster state.
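
For scheduled backups, the save and verify steps can be wrapped so that each run produces a timestamped file and a snapshot that cannot be read is discarded. This is a minimal sketch using the endpoint and certificate paths from the command above; BACKUP_DIR and the function name take_etcd_snapshot are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: take a timestamped etcd snapshot and check it before keeping it.
# BACKUP_DIR and take_etcd_snapshot are illustrative names.

BACKUP_DIR=${BACKUP_DIR:-/backup}

take_etcd_snapshot() (
  snapshot="$BACKUP_DIR/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"

  # Tool output goes to stderr so that stdout carries only the snapshot path.
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    snapshot save "$snapshot" >&2 || exit 1

  # status prints hash, revision, key count and size; a non-zero exit
  # means the file could not be read as a snapshot, so it is removed.
  if ! etcdctl snapshot status "$snapshot" --write-out=table >&2; then
    rm -f "$snapshot"
    exit 1
  fi

  echo "$snapshot"
)
```

Usage: latest=$(take_etcd_snapshot) && echo "saved $latest". The function can then be called from a cron job or systemd timer on a control plane node.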


Option 2: File-Level Backup (etcdutl)

etcdutl backup \
  --data-dir /var/lib/etcd \
  --backup-dir /backup/etcd-backup

This copies:

  • etcd backend database
  • WAL files

Note

etcdutl backup reads the data directory directly instead of going through the etcd API, so it works even if etcd is not running.


6️⃣ Restoring etcd

etcdutl snapshot restore /backup/etcd-snapshot.db \
  --data-dir /var/lib/etcd-restored

This creates a new data directory.
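
Since the restore writes a fresh data directory, it is worth guarding against pointing it at a missing snapshot or at a directory that already exists. The wrapper below is a small sketch under that assumption; restore_etcd_snapshot is a hypothetical helper name, and the checks are an extra safety layer rather than behavior of etcdutl itself.

```shell
#!/bin/sh
# Sketch: restore a snapshot into a new data directory, refusing unsafe targets.
# restore_etcd_snapshot is an illustrative name.

restore_etcd_snapshot() (
  snapshot=$1
  data_dir=$2

  if [ ! -f "$snapshot" ]; then
    echo "snapshot not found: $snapshot" >&2
    exit 1
  fi

  # Never restore on top of an existing directory (for example the live /var/lib/etcd).
  if [ -e "$data_dir" ]; then
    echo "refusing to restore into existing path: $data_dir" >&2
    exit 1
  fi

  etcdutl snapshot restore "$snapshot" --data-dir "$data_dir"
)
```

Usage: restore_etcd_snapshot /backup/etcd-snapshot.db /var/lib/etcd-restored.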


Manual Restore Process

Step 1: Stop API Server

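systemctl stop kube-apiserver

On kubeadm clusters the API server runs as a static pod rather than a systemd unit; there, stop it by moving its manifest out of /etc/kubernetes/manifests.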

Step 2: Restore Snapshot

etcdutl snapshot restore /backup/etcd-snapshot.db \
  --data-dir /var/lib/etcd-from-backup

Step 3: Update etcd Config

Change:

--data-dir=/var/lib/etcd-from-backup

Step 4: Restart Services

systemctl daemon-reload
systemctl restart etcd
systemctl start kube-apiserver

Warning

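Always pass --endpoints and the TLS certificate flags whenever etcdctl talks to a running etcd server, for example for snapshot save. Offline commands such as snapshot restore read local files and do not need them.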


7️⃣ Persistent Volume Backup

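etcd does NOT store application data. It holds only the PersistentVolume and PersistentVolumeClaim objects, not the bytes written to the volumes.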

For stateful workloads:

  • Use cloud disk snapshots (EBS, GCE PD, Azure Disk)
  • Use CSI VolumeSnapshots
  • Use storage-native backup tools

Danger

Without PV backups, application data loss is permanent.
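
As an illustration of the CSI VolumeSnapshot option, the sketch below prints a VolumeSnapshot manifest for one PersistentVolumeClaim so it can be piped into kubectl. It assumes the cluster has the snapshot CRDs (snapshot.storage.k8s.io/v1) and a CSI driver that supports snapshots; the function name and the example PVC, namespace, and class names are hypothetical.

```shell
#!/bin/sh
# Sketch: print a CSI VolumeSnapshot manifest for a single PVC.
# Arguments: $1 = PVC name, $2 = namespace, $3 = VolumeSnapshotClass name.
# volume_snapshot_manifest is an illustrative helper name.

volume_snapshot_manifest() {
  cat <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: $1-snap-$(date +%Y%m%d)
  namespace: $2
spec:
  volumeSnapshotClassName: $3
  source:
    persistentVolumeClaimName: $1
EOF
}
```

Usage, with placeholder names: volume_snapshot_manifest data-pvc prod csi-snapclass | kubectl apply -f -. A snapshot usually lives in the same storage backend as the volume, so copy or export it elsewhere if it has to survive loss of that backend.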


8️⃣ Comparing Backup Methods

| Method | Protects | Best For |
|---|---|---|
| Git/YAML | App definitions | GitOps |
| kubectl export | Resource configs | Quick backup |
| etcdctl snapshot | Full cluster state | Self-managed clusters |
| etcdutl backup | Raw etcd files | Advanced DR |
| PV snapshot | Application data | Stateful apps |

9️⃣ Production Best Practices

DO This

  • Store manifests in Git
  • Schedule automated etcd snapshots
  • Backup persistent volumes separately
  • Encrypt backup files
  • Store backups off-cluster
  • Test restore regularly
  • Document recovery runbook
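
To make "encrypt backup files" concrete, here is a minimal sketch that encrypts a snapshot with a passphrase file before it is copied off-cluster. It assumes OpenSSL 1.1.1 or later (for the -pbkdf2 option); the function names and file layout are illustrative, and a KMS or gpg-based scheme may fit better in some environments.

```shell
#!/bin/sh
# Sketch: symmetric encryption of a backup file with OpenSSL.
# encrypt_backup / decrypt_backup are illustrative helper names.

encrypt_backup() {
  # $1 = plaintext backup file, $2 = file containing the passphrase.
  # Writes the ciphertext next to the original as "$1.enc".
  openssl enc -aes-256-cbc -pbkdf2 -salt -in "$1" -out "$1.enc" -pass "file:$2"
}

decrypt_backup() {
  # $1 = encrypted file, $2 = passphrase file, $3 = output path.
  openssl enc -d -aes-256-cbc -pbkdf2 -in "$1" -out "$3" -pass "file:$2"
}
```

Usage: encrypt_backup /backup/etcd-snapshot.db /root/backup.pass, then ship the .enc file off-cluster. Keep the passphrase file somewhere other than the backup location, and include decrypt_backup in restore tests.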

🔟 Production Do & Don’t

✅ DO

Tip

  • Verify snapshot integrity
  • Maintain off-site copies
  • Automate backup schedules
  • Monitor etcd health

❌ DON'T

Danger

  • Don’t rely only on YAML exports
  • Don’t ignore PV backups
  • Don’t store backups on the same node as etcd
  • Don’t skip restore testing
  • Don’t expose etcd without TLS

1️⃣1️⃣ Disaster Recovery Flow

Abstract

  1. Identify failure type (state vs data)
  2. Restore etcd snapshot (if control plane issue)
  3. Restore persistent volumes (if data issue)
  4. Reapply manifests if needed
  5. Validate cluster health
  6. Monitor workloads
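
Step 5 can be partly automated. The sketch below checks two things after a restore: that the API server reports ready on /readyz, and that no node has a status other than Ready. The function name check_cluster_health is an assumption for illustration, and real validation would normally also cover workloads and etcd itself.

```shell
#!/bin/sh
# Sketch: basic post-restore health check using kubectl only.
# check_cluster_health is an illustrative helper name.

check_cluster_health() (
  # The API server exposes its readiness on the /readyz endpoint.
  if ! kubectl get --raw='/readyz' > /dev/null; then
    echo "API server not ready" >&2
    exit 1
  fi

  # Column 2 of `kubectl get nodes` is STATUS; flag anything not starting with "Ready".
  bad_nodes=$(kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }')
  if [ -n "$bad_nodes" ]; then
    echo "nodes not ready: $bad_nodes" >&2
    exit 1
  fi

  echo "cluster healthy"
)
```

Usage: check_cluster_health || echo "investigate before declaring recovery complete".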

🎯 Interview Memory

Question

  • etcd stores cluster state
  • YAML + Git = config backup
  • etcdctl snapshot save
  • etcdutl snapshot restore
  • Stop API server before restore
  • PV backups are separate

Final Production Takeaway

Quote

Backups are useless without restore testing.

Protect:

  • Configuration (Git)
  • Cluster State (etcd)
  • Application Data (Persistent Volumes)

Automate backups. Test restores. Protect production.