Unable to Replace an etcd Node in an RKE2 Cluster – “Unhealthy Cluster” Error
Article Number: 000022264
Environment
Rancher 2.x
RKE2
Situation
While attempting to replace an etcd node, the operation fails and the new node does not join the etcd cluster successfully. On the affected node, only the kube-proxy container is observed to be running, while other control plane components do not start.
The etcd logs show the following warning during the node replacement attempt:
{"level":"warn","caller":"etcdserver/server.go:1695","msg":"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"a05d36c92035f2b5","requested-member-add":"{ID:a6a5355c5b413436 RaftAttributes:{PeerURLs:[https://<peer>:2380] IsLearner:true} Attributes:{Name: ClientURLs:[]}}","error":"etcdserver: unhealthy cluster"}
This prevents the new etcd member from being added to the cluster.
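When scanning a large etcd log for this condition, a small filter can help. The following is a minimal sketch, assuming the etcd logs are available as one JSON object per line; the function name `is_member_add_rejection` is hypothetical and the sample line is a shortened copy of the warning above.

```python
import json


def is_member_add_rejection(log_line: str) -> bool:
    """Return True if the line is the 'unhealthy cluster' member-add rejection."""
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return False  # not a structured (JSON) log line
    if not isinstance(entry, dict):
        return False
    return (
        str(entry.get("msg", "")).startswith("rejecting member add request")
        and entry.get("error") == "etcdserver: unhealthy cluster"
    )


# Shortened copy of the warning shown above (some fields omitted).
sample = (
    '{"level":"warn","caller":"etcdserver/server.go:1695",'
    '"msg":"rejecting member add request; local member has not been '
    'connected to all peers, reconfigure breaks active quorum",'
    '"local-member-id":"a05d36c92035f2b5",'
    '"error":"etcdserver: unhealthy cluster"}'
)

if is_member_add_rejection(sample):
    print("member add was rejected: cluster reported as unhealthy")
```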
Cause
This issue occurs when one of the existing etcd members is not actively peered with all other members, even though the cluster appears healthy at first glance.
When checking the etcd metrics, the etcd_network_active_peers metric shows that one peer connection is inactive (value 0), while another peer connection is active (value 1):
# HELP etcd_network_active_peers The current number of active peer connections.
# TYPE etcd_network_active_peers gauge
etcd_network_active_peers{Local="a05d36c92035f2b5",Remote="169159dce6da316"} 0
etcd_network_active_peers{Local="a05d36c92035f2b5",Remote="84123c0970f97229"} 1
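The check described above can be sketched in a few lines. This is an illustrative example only: it assumes the metrics text has already been collected from the etcd metrics endpoint by whatever means is available, and the helper name `inactive_peers` is hypothetical.

```python
import re

# Matches lines such as:
# etcd_network_active_peers{Local="<id>",Remote="<id>"} 0
PEER_LINE = re.compile(
    r'^etcd_network_active_peers\{Local="(?P<local>[0-9a-f]+)",'
    r'Remote="(?P<remote>[0-9a-f]+)"\}\s+(?P<value>\S+)$'
)


def inactive_peers(metrics_text: str) -> list:
    """Return (local, remote) member ID pairs whose active peer count is 0."""
    pairs = []
    for line in metrics_text.splitlines():
        match = PEER_LINE.match(line.strip())
        if match and float(match.group("value")) == 0:
            pairs.append((match.group("local"), match.group("remote")))
    return pairs


# Metrics excerpt copied from the output above.
sample_metrics = """\
# HELP etcd_network_active_peers The current number of active peer connections.
# TYPE etcd_network_active_peers gauge
etcd_network_active_peers{Local="a05d36c92035f2b5",Remote="169159dce6da316"} 0
etcd_network_active_peers{Local="a05d36c92035f2b5",Remote="84123c0970f97229"} 1
"""

for local, remote in inactive_peers(sample_metrics):
    print(f"member {local} has no active connection to peer {remote}")
```

An empty result means every listed peer connection is active; any returned pair identifies the connection that blocks the member add.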
Although the etcd endpoint status shows all members present and in sync (output shown below), the missing active peer connection causes etcd to treat the cluster as unhealthy. As a result, etcd rejects any new member add request in order to protect the active quorum.
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
| https://10.69.x.0:2379 | 169159dce6da316 | 3.5.21 | 304 MB | false | false | 30 | 669745205 | 669745205 | |
| https://10.69.y.3:2379 | 84123c0970f97229 | 3.5.21 | 302 MB | false | false | 30 | 669745205 | 669745205 | |
| https://10.69.y.1:2379 | a05d36c92035f2b5 | 3.5.21 | 300 MB | true | false | 30 | 669745205 | 669745205 | |
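To illustrate why this output looks healthy even though a peer connection is down, the sketch below parses the rows and applies the usual "in sync" checks. It assumes the ten columns follow the etcdctl table layout (endpoint, ID, version, DB size, is leader, is learner, raft term, raft index, raft applied index, errors); the function names are hypothetical.

```python
def parse_status_rows(table_text: str) -> list:
    """Parse pipe-delimited endpoint status rows into dictionaries."""
    members = []
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip headers, borders, and anything that is not a 10-column data row.
        if len(cells) != 10 or not cells[0].startswith("https://"):
            continue
        members.append({
            "endpoint": cells[0],
            "id": cells[1],
            "is_leader": cells[4] == "true",
            "raft_index": int(cells[7]),
            "errors": cells[9],
        })
    return members


def looks_in_sync(members: list) -> bool:
    """True if there is exactly one leader, one raft index, and no errors."""
    leaders = sum(1 for m in members if m["is_leader"])
    indexes = {m["raft_index"] for m in members}
    has_errors = any(m["errors"] for m in members)
    return leaders == 1 and len(indexes) == 1 and not has_errors


# Rows copied from the output above.
sample_table = """\
| https://10.69.x.0:2379 | 169159dce6da316 | 3.5.21 | 304 MB | false | false | 30 | 669745205 | 669745205 | |
| https://10.69.y.3:2379 | 84123c0970f97229 | 3.5.21 | 302 MB | false | false | 30 | 669745205 | 669745205 | |
| https://10.69.y.1:2379 | a05d36c92035f2b5 | 3.5.21 | 300 MB | true | false | 30 | 669745205 | 669745205 | |
"""

members = parse_status_rows(sample_table)
print(len(members), "members, in sync:", looks_in_sync(members))
```

These checks pass for the output in this article, which is why the endpoint status alone is not enough to rule out the problem; the peer connection metric has to be checked as well.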
Resolution
Restart the etcd container on the affected member, that is, the member whose peer connection is reported as inactive. The restart restores the missing peer connection.
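One possible way to perform the restart is to stop the etcd container with crictl and let the kubelet recreate it from the static pod definition. The sketch below is not taken from this article: it assumes the default RKE2 locations for the crictl binary and its configuration file, and the helper names are hypothetical. Adjust the paths if the installation differs.

```python
import os
import subprocess

# Assumed default RKE2 paths (not stated in this article).
CRICTL = "/var/lib/rancher/rke2/bin/crictl"
CRI_CONFIG = "/var/lib/rancher/rke2/agent/etc/crictl.yaml"


def run_crictl(args: list) -> str:
    """Run crictl with the RKE2 configuration and return its stdout."""
    env = dict(os.environ, CRI_CONFIG_FILE=CRI_CONFIG)
    result = subprocess.run(
        [CRICTL, *args], env=env, check=True, capture_output=True, text=True
    )
    return result.stdout


def restart_etcd_container(run=run_crictl) -> list:
    """Stop running containers whose name matches 'etcd'; return their IDs.

    The kubelet is expected to start a fresh etcd container afterwards.
    """
    container_ids = run(["ps", "--name", "etcd", "--quiet"]).split()
    for container_id in container_ids:
        run(["stop", container_id])
    return container_ids
```

Calling `restart_etcd_container()` as root on the affected node would stop the running etcd container. The `run` parameter exists only so that the command sequence can be exercised without touching a real node.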
After the restart:
The etcd_network_active_peers metric reports active connections to all peers.
The etcd cluster is considered healthy.
The new etcd node is able to join successfully.
The node replacement operation completes as expected.