9.4 etcd in High Availability
Abstract
etcd is the distributed key-value store used by Kubernetes to store all cluster state.
In a High Availability (HA) Kubernetes cluster, etcd must also be highly available because losing etcd means losing the source of truth for the cluster.
What is etcd?
etcd is a:
- distributed key-value store
- reliable datastore
- strongly consistent system
- secure and fast backend for Kubernetes state
Kubernetes stores critical cluster data in etcd, including:
| Data Type | Examples |
|---|---|
| Workloads | Pods, Deployments, ReplicaSets |
| Networking | Services, Endpoints, NetworkPolicies |
| Security | Secrets, ServiceAccounts, RBAC |
| Configuration | ConfigMaps, API objects |
| Cluster state | Nodes, leases, controllers, scheduler state |
Danger
If etcd data is lost and no backup exists, the Kubernetes cluster may not be recoverable.
Why etcd Needs HA
A single etcd node is a single point of failure.
If one etcd server stores all cluster state and it fails:
- Kubernetes API may stop working
- cluster state cannot be read or updated
- controllers cannot reconcile resources
- new scheduling decisions may fail
- recovery depends on backup availability
Success
In production, etcd should run as a cluster with multiple members.
Distributed etcd Cluster
In HA mode, etcd runs on multiple servers.
Each etcd member stores a copy of the same data.
All members cooperate to keep data consistent.
Note
You can read from any etcd member, but writes are coordinated through a leader.
Consistency in etcd
etcd provides strong consistency.
That means:
- all members agree on the same data
- writes are replicated before being committed
- clients get reliable cluster state
- Kubernetes does not see conflicting data
Client writes key=name value=John
↓
Leader accepts write
↓
Followers replicate write
↓
Write is committed after quorum
Tip
Strong consistency is important because Kubernetes controllers depend on accurate cluster state.
Read and Write Behavior
Reads
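Any etcd member can accept read requests. By default, etcd serves linearizable reads: the member first confirms through the leader that its data is current, so the client never sees stale state. Serializable reads skip that check and are answered from the member's local copy, which is faster but can be slightly stale.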
Writes
Writes are handled by the etcd leader.
If a write request reaches a follower:
- follower forwards the request to the leader
- leader processes the write
- leader replicates the write to followers
- write is committed after quorum is reached
Note
etcd may receive write requests on any member, but internally the leader coordinates the write.
Leader and Followers
An etcd cluster has:
| Role | Description |
|---|---|
| Leader | Handles writes and coordinates replication |
| Followers | Replicate data and participate in quorum |
Example (3-member cluster):
etcd-1: leader
etcd-2: follower
etcd-3: follower
If the leader fails, the remaining members elect a new leader.
Warning
If quorum is lost, etcd cannot safely process writes.
Raft Consensus Protocol
etcd uses the Raft consensus algorithm.
Raft is responsible for:
- leader election
- log replication
- quorum-based writes
- failover when the leader is unavailable
Leader election flow:
- members start without a leader
- each member starts an election timer
- one member requests votes
- majority votes elect a leader
- leader sends heartbeats
- if heartbeats stop, a new election starts
Abstract
Raft helps etcd members agree on one correct cluster state.
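The election flow above can be sketched in a few lines of shell. This is a simplified illustration, not etcd's implementation: real Raft randomizes the timers and also tracks terms and log freshness, which are left out here. The helper names `first_candidate` and `wins_election`, the member names, and the timer values are all hypothetical.

```shell
# first_candidate "name:timeout_ms ...": the member whose election timer
# expires first becomes the candidate.
first_candidate() {
  best_name=""
  best_ms=""
  for entry in $1; do
    name=${entry%%:*}
    ms=${entry##*:}
    if [ -z "$best_ms" ] || [ "$ms" -lt "$best_ms" ]; then
      best_name=$name
      best_ms=$ms
    fi
  done
  echo "$best_name"
}

# wins_election MEMBERS VOTES: a candidate needs votes from a majority of
# MEMBERS, counting its own vote.
wins_election() {
  [ "$2" -ge $(( $1 / 2 + 1 )) ]
}

# Hypothetical timers in milliseconds for a 3-member cluster.
timers="etcd-1:230 etcd-2:150 etcd-3:290"
candidate=$(first_candidate "$timers")
if wins_election 3 2; then
  echo "$candidate becomes leader with 2 of 3 votes"
fi
```

With these timers, etcd-2 times out first, requests votes, and wins once one other member votes for it.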
Write Replication Flow
When a write happens:
Write request
↓
Leader receives request
↓
Leader sends update to followers
↓
Majority confirms
↓
Write is committed
↓
Data becomes available consistently
Example
In a 3-member etcd cluster, a write is committed when at least 2 members agree.
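A minimal sketch of this flow, assuming the leader's own log entry counts as the first copy and followers acknowledge one at a time. The `replicate` function is a hypothetical helper for illustration only and does not talk to etcd.

```shell
# replicate MEMBERS REACHABLE_FOLLOWERS: walk through the write flow and
# report whether the entry reaches a majority of MEMBERS.
replicate() {
  members=$1
  reachable=$2
  need=$(( members / 2 + 1 ))
  acks=1   # the leader's own copy counts toward the majority
  echo "leader appended entry (copies: $acks, needed: $need)"
  i=1
  while [ "$i" -le "$reachable" ] && [ "$acks" -lt "$need" ]; do
    acks=$(( acks + 1 ))
    echo "follower $i acknowledged (copies: $acks, needed: $need)"
    i=$(( i + 1 ))
  done
  if [ "$acks" -ge "$need" ]; then
    echo "committed"
  else
    echo "not committed"
  fi
}

replicate 3 2   # one follower acknowledgement is enough in a 3-member cluster
replicate 3 0   # no reachable followers, so the write cannot commit
```

The sketch stops counting once a majority is reached, which mirrors the point that the leader does not need every follower to respond before committing.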
Quorum
Quorum is the minimum number of etcd members required for the cluster to make decisions.
Formula:
$$\text{Quorum} = \left\lfloor \frac{N}{2} \right\rfloor + 1$$
Where N is the number of etcd members.
| etcd Members | Quorum | Fault Tolerance |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 2 | 1 |
| 4 | 3 | 1 |
| 5 | 3 | 2 |
| 6 | 4 | 2 |
| 7 | 4 | 3 |
Note
Fault tolerance means how many members can fail while still keeping quorum.
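The table rows follow directly from the rule that quorum is floor(N / 2) + 1 and fault tolerance is N minus quorum. The sketch below recomputes them; `quorum` and `fault_tolerance` are hypothetical helper names, not etcd commands.

```shell
# quorum N: smallest majority of N members (shell integer division rounds down).
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: members that can fail while a majority still remains.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Recompute the table for 1 to 7 members.
for n in 1 2 3 4 5 6 7; do
  echo "$n members: quorum=$(quorum "$n") fault_tolerance=$(fault_tolerance "$n")"
done
```

Note that going from 3 to 4 members raises the quorum from 2 to 3 while the fault tolerance stays at 1.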
Why 2 etcd Members Is Not Enough
A 2-member etcd cluster has quorum of 2.
If one member fails:
Members remaining: 1
Quorum required: 2
Result:
- no quorum
- writes cannot be committed
- cluster becomes unhealthy
Failure
A 2-member etcd cluster provides no fault tolerance: losing either member breaks quorum, so it is no more available than a single member.
Odd Number of etcd Members
Odd numbers are preferred:
- 3 members
- 5 members
- 7 members
Why?
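Because adding one member to an odd-sized cluster raises the quorum without raising fault tolerance: 4 members tolerate 1 failure, the same as 3. An odd size also means a network partition can never split the cluster into two equal halves, so one side always keeps a majority.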
Success
Use an odd number of etcd members in production. Common choices are 3 or 5.
Odd vs Even Example
A 6-member cluster has quorum of 4.
If a network partition splits it into 3 + 3:
Neither side has quorum.
Result:
- cluster cannot make progress
- writes fail
A 7-member cluster has quorum of 4.
If split into 4 + 3:
Cluster can continue on the majority side.
Warning
An even-numbered etcd cluster loses quorum on both sides when a network partition splits it into equal halves.
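The 3 + 3 and 4 + 3 cases can be checked with a small sketch. `partition_outcome` is a hypothetical helper that takes the total member count followed by the size of each side of the partition; it only does arithmetic and does not query etcd.

```shell
# partition_outcome TOTAL SIZE...: report which side of a partition, if any,
# still holds a majority of the TOTAL members.
partition_outcome() {
  total=$1
  shift
  need=$(( total / 2 + 1 ))
  for size in "$@"; do
    if [ "$size" -ge "$need" ]; then
      echo "side with $size members keeps quorum ($need needed)"
      return 0
    fi
  done
  echo "no side has quorum ($need needed)"
}

partition_outcome 6 3 3   # no side has quorum (4 needed)
partition_outcome 7 4 3   # side with 4 members keeps quorum (4 needed)
```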
Recommended etcd Cluster Size
| Size | Recommendation |
|---|---|
| 1 member | Only for labs/dev |
| 2 members | Avoid |
| 3 members | Good production minimum |
| 5 members | Better fault tolerance |
| 7+ members | Usually unnecessary |
Tip
For most production Kubernetes clusters, 3 or 5 etcd members is enough.
etcd Topologies in Kubernetes
There are two common Kubernetes HA etcd topologies:
- Stacked etcd topology
- External etcd topology
Stacked etcd Topology
In stacked topology, etcd runs on the same nodes as the control plane.
master-1
├── kube-apiserver
├── controller-manager
├── scheduler
└── etcd
master-2
├── kube-apiserver
├── controller-manager
├── scheduler
└── etcd
master-3
├── kube-apiserver
├── controller-manager
├── scheduler
└── etcd
Advantages
- easier to set up
- fewer servers
- easier to manage
- common with kubeadm HA clusters
Disadvantages
- losing a control plane node also loses an etcd member
- lower failure isolation
- more risk during node failures
Warning
Stacked topology is simpler but couples control plane failure with etcd failure.
External etcd Topology
In external etcd topology, etcd runs on dedicated nodes separate from the control plane.
Control Plane Nodes
├── kube-apiserver
├── controller-manager
└── scheduler
External etcd Nodes
├── etcd-1
├── etcd-2
└── etcd-3
Advantages
- better failure isolation
- safer for critical production clusters
- a control plane node failure does not remove an etcd member
- easier to scale or maintain etcd separately
Disadvantages
- harder to set up
- more servers
- more certificates
- more networking and operational complexity
Success
External etcd is preferred for strict production environments where cluster state must be isolated and protected.
Stacked vs External etcd
| Area | Stacked etcd | External etcd |
|---|---|---|
| Complexity | Lower | Higher |
| Server count | Fewer | More |
| Failure isolation | Lower | Higher |
| Cost | Lower | Higher |
| Operational effort | Easier | Harder |
| Production safety | Good | Better |
etcd Ports
etcd uses two important ports:
| Port | Purpose |
|---|---|
| 2379 | Client communication, used by kube-apiserver |
| 2380 | Peer communication between etcd members |
Note
In HA etcd, both client and peer communication must be allowed through firewall rules.
Installing etcd Manually
Typical manual setup steps:
wget -q --https-only \
"https://github.com/etcd-io/etcd/releases/download/v3.3.9/etcd-v3.3.9-linux-amd64.tar.gz"
tar -xvf etcd-v3.3.9-linux-amd64.tar.gz
mv etcd-v3.3.9-linux-amd64/etcd* /usr/local/bin/
mkdir -p /etc/etcd /var/lib/etcd
cp ca.pem kubernetes-key.pem kubernetes.pem /etc/etcd/
Warning
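The v3.3.9 release above is only an example. Use an etcd version that your Kubernetes release supports, and do not upgrade etcd in production without checking compatibility and taking a backup first.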
etcd Service Configuration
Example systemd-style etcd configuration:
ExecStart=/usr/local/bin/etcd \
--name=${ETCD_NAME} \
--cert-file=/etc/etcd/kubernetes.pem \
--key-file=/etc/etcd/kubernetes-key.pem \
--peer-cert-file=/etc/etcd/kubernetes.pem \
--peer-key-file=/etc/etcd/kubernetes-key.pem \
--trusted-ca-file=/etc/etcd/ca.pem \
--peer-trusted-ca-file=/etc/etcd/ca.pem \
--peer-client-cert-auth \
--client-cert-auth \
--initial-advertise-peer-urls=https://${INTERNAL_IP}:2380 \
--listen-peer-urls=https://${INTERNAL_IP}:2380 \
--listen-client-urls=https://${INTERNAL_IP}:2379,https://127.0.0.1:2379 \
--advertise-client-urls=https://${INTERNAL_IP}:2379 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=master-1=https://${MASTER1_IP}:2380,master-2=https://${MASTER2_IP}:2380,master-3=https://${MASTER3_IP}:2380 \
--initial-cluster-state=new \
--data-dir=/var/lib/etcd
Note
The --initial-cluster flag lists the name and peer URL of every member in the initial cluster. Each name must match the --name value of the corresponding member.
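Writing the --initial-cluster value by hand is error-prone, so it can help to generate it. The sketch below assumes every member uses https and peer port 2380, as in the configuration above; `build_initial_cluster` is a hypothetical helper and the IP addresses are placeholders.

```shell
# build_initial_cluster "name=ip ...": join members into the comma-separated
# name=https://ip:2380 form used by --initial-cluster.
build_initial_cluster() {
  result=""
  for member in $1; do
    name=${member%%=*}
    ip=${member#*=}
    # Prefix a comma only when result already holds an earlier member.
    result="${result:+$result,}${name}=https://${ip}:2380"
  done
  echo "$result"
}

# Placeholder addresses; replace them with the real member IPs.
build_initial_cluster "master-1=10.240.0.10 master-2=10.240.0.11 master-3=10.240.0.12"
```

The same string must be passed to every member, which is why generating it once and reusing it is safer than typing it per node.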
Important etcd Flags
| Flag | Purpose |
|---|---|
| --name | Unique etcd member name |
| --data-dir | etcd data directory |
| --listen-client-urls | Client listener URL |
| --advertise-client-urls | Client URL advertised to clients |
| --listen-peer-urls | Peer listener URL |
| --initial-advertise-peer-urls | Peer URL advertised to members |
| --initial-cluster | List of all initial etcd members |
| --initial-cluster-state | new for new cluster, existing for joining |
| --cert-file | TLS certificate for client traffic |
| --peer-cert-file | TLS certificate for peer traffic |
Using etcdctl
etcdctl is the CLI tool used to interact with etcd.
Set API version:
export ETCDCTL_API=3
Put a key:
etcdctl put name John
Get a key:
etcdctl get name
List keys:
etcdctl get / --prefix --keys-only
Tip
In Kubernetes troubleshooting, etcdctl is commonly used for snapshots, member checks, and health checks.
Check etcd Cluster Health
Check endpoint health:
etcdctl endpoint health --cluster
Check endpoint status:
etcdctl endpoint status --cluster -w table
List members:
etcdctl member list
Note
In secured clusters, include --cacert, --cert, and --key options.
Example with certificates:
ETCDCTL_API=3 etcdctl endpoint health --cluster \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
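Repeating the TLS flags on every command gets tedious. etcdctl also reads its flags from ETCDCTL_-prefixed environment variables, so one option is to export them once per shell session. The paths below are the same ones used in the example above; the endpoint is an assumption for a member running on the local node.

```shell
# Export the connection settings once; etcdctl picks them up automatically.
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# With the variables set, later commands need no TLS flags, for example:
#   etcdctl endpoint health --cluster
#   etcdctl member list
```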
etcd Backup
Back up etcd regularly.
ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
Verify snapshot:
ETCDCTL_API=3 etcdctl snapshot status snapshot.db
Danger
etcd backups must be tested. A backup without a tested restore process is not a reliable recovery plan.
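A backup job might wrap the snapshot command so that every run produces a timestamped file and fails loudly when the file is missing or empty. This is a sketch under assumptions, not a complete backup solution: `backup_etcd` and the `ETCDCTL_BIN` override are hypothetical names, and the certificate paths are the ones used in the snapshot example above. It checks the snapshot file but does not replace a real restore test.

```shell
# ETCDCTL_BIN is a hypothetical override so the function can be tried
# against a stub; by default it runs the real etcdctl.
ETCDCTL_BIN=${ETCDCTL_BIN:-etcdctl}

# backup_etcd DIR: save a timestamped snapshot into DIR, check it, and
# print the snapshot path on success.
backup_etcd() {
  dir=$1
  file="$dir/etcd-$(date +%Y%m%d-%H%M%S).db"
  export ETCDCTL_API=3
  # Send etcdctl's own output to stderr so stdout carries only the path.
  "$ETCDCTL_BIN" snapshot save "$file" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key >&2 || return 1
  # A missing or empty file means the snapshot cannot be trusted.
  [ -s "$file" ] || return 1
  "$ETCDCTL_BIN" snapshot status "$file" >&2 || return 1
  echo "$file"
}
```

A call such as `backup_etcd /var/backups/etcd` (a hypothetical directory) could then be scheduled from cron or a systemd timer, with a non-zero exit status raising an alert.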
Kubernetes API Server and etcd
The kube-apiserver talks to etcd using the --etcd-servers option.
Example:
kube-apiserver \
--etcd-servers=https://10.240.0.10:2379,https://10.240.0.11:2379,https://10.240.0.12:2379 \
--etcd-cafile=/var/lib/kubernetes/ca.pem \
--etcd-certfile=/var/lib/kubernetes/apiserver-etcd-client.crt \
--etcd-keyfile=/var/lib/kubernetes/apiserver-etcd-client.key
Note
The kube-apiserver should be configured with multiple etcd endpoints for HA.
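To confirm which endpoints an API server is actually configured with, the --etcd-servers value can be pulled out of its command line or manifest. The sketch below is one way to do it with standard text tools; `etcd_endpoints` is a hypothetical helper, and it assumes the flag is written as --etcd-servers=... on a single line.

```shell
# etcd_endpoints FILE: print one etcd endpoint per line, taken from the
# first --etcd-servers flag found in FILE.
etcd_endpoints() {
  grep -o -- '--etcd-servers=[^ "]*' "$1" | head -n 1 | cut -d= -f2- | tr ',' '\n'
}
```

On a kubeadm control plane node this could be run as `etcd_endpoints /etc/kubernetes/manifests/kube-apiserver.yaml`, assuming the default manifest location. Piping the output through `wc -l` gives the number of configured endpoints.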
Production Best Practices
Recommended
- Use 3 or 5 etcd members
- Use an odd number of etcd members
- Enable TLS for client and peer communication
- Keep etcd behind private networking
- Do not expose etcd publicly
- Use fast disks, preferably SSD
- Monitor etcd latency, leader changes, and disk usage
- Take regular snapshots
- Test snapshot restore
- Keep etcd members close in network latency
- Avoid running heavy workloads on etcd nodes
- Use external etcd for critical production clusters when possible
Do's
- Use odd-numbered etcd clusters
- Use at least 3 members for HA
- Protect etcd with TLS
- Restrict etcd access to API servers only
- Monitor quorum and leader status
- Back up etcd before upgrades
- Test restore in a non-production environment
- Use reliable disks and stable networking
Don'ts
- Don't use 2 etcd members for production HA
- Don't expose port 2379 publicly
- Don't ignore quorum requirements
- Don't run etcd on slow or unstable disks
- Don't manually edit etcd data unless absolutely necessary
- Don't upgrade etcd without checking Kubernetes compatibility
- Don't skip backup validation
- Don't place all etcd members in the same failure domain
Failure
etcd HA depends on quorum. Adding more nodes does not help if they are poorly distributed or frequently partitioned.
Common Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| One etcd member fails in 3-node cluster | Cluster continues | Replace failed member quickly |
| Two etcd members fail in 3-node cluster | Quorum lost | Restore member or recover from backup |
| Network partition | Minority side stops processing writes | Use odd number and stable networking |
| Disk latency increases | API server may become slow | Use SSD and monitor disk I/O |
| etcd certificate expires | API server cannot connect | Monitor and rotate certificates |
| Backup missing | Recovery becomes difficult | Schedule and test snapshots |
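The first two rows of the table come down to comparing the number of healthy members with the quorum. A monitoring check could classify the cluster along these lines; `cluster_state` is a hypothetical helper and the three state names are illustrative labels, not etcd terminology.

```shell
# cluster_state TOTAL HEALTHY: classify an etcd cluster by comparing the
# number of healthy members with the quorum for TOTAL members.
cluster_state() {
  total=$1
  healthy=$2
  need=$(( total / 2 + 1 ))
  if [ "$healthy" -lt "$need" ]; then
    echo "quorum lost"
  elif [ "$healthy" -lt "$total" ]; then
    echo "degraded"
  else
    echo "healthy"
  fi
}

cluster_state 3 3   # healthy
cluster_state 3 2   # degraded
cluster_state 3 1   # quorum lost
```

A degraded cluster still serves writes but has less margin for a further failure, which is why the failed member should be replaced quickly.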
Troubleshooting Commands
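Check etcd pods in kubeadm clusters:
kubectl get pods -n kube-system -l component=etcd
Check etcd static pod manifest:
cat /etc/kubernetes/manifests/etcd.yaml
Check etcd logs:
kubectl logs -n kube-system -l component=etcd
Check members:
etcdctl member list
Check endpoint status:
etcdctl endpoint status --cluster -w table
Check endpoint health:
etcdctl endpoint health --cluster
Check kube-apiserver etcd config:
grep etcd /etc/kubernetes/manifests/kube-apiserver.yaml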
Exam and Practical Notes
CKA Focus
For certification-style tasks, focus on:
- identifying etcd endpoints
- checking etcd health
- taking etcd snapshots
- restoring from etcd snapshots
- understanding quorum
- knowing why an odd number of members is recommended
Quick Reference
| Task | Command |
|---|---|
| Put key | etcdctl put name John |
| Get key | etcdctl get name |
| List keys | etcdctl get / --prefix --keys-only |
| Member list | etcdctl member list |
| Endpoint health | etcdctl endpoint health --cluster |
| Endpoint status | etcdctl endpoint status --cluster -w table |
| Snapshot save | etcdctl snapshot save snapshot.db |
| Snapshot status | etcdctl snapshot status snapshot.db |
Summary
Quote
- etcd stores all Kubernetes cluster state
- etcd HA depends on Raft consensus and quorum
- Writes are coordinated by a leader
- A majority of members must agree before writes are committed
- Avoid 2-member etcd clusters
- Use odd-numbered clusters, usually 3 or 5 members
- Use stacked etcd for simpler setup
- Use external etcd for stronger production isolation
- Always secure, monitor, back up, and test restore for etcd