How to remove and replace an unresponsive control plane / etcd node in the local Rancher server cluster, provisioned by the Rancher Kubernetes Engine (RKE) CLI
This document (000020033) is provided subject to the disclaimer at the end of this document.
Environment
A Rancher Kubernetes Engine (RKE) CLI provisioned cluster
- A Highly Available control plane / etcd configuration, with an odd number of mixed role control plane / etcd nodes, commonly 3 or 5
- The cluster is quorate, i.e. with 3 control plane / etcd nodes only a single node is unresponsive, or with 5 control plane / etcd nodes up to two nodes are unresponsive
- The cluster configuration file (e.g. cluster.yml) and state file (e.g. cluster.rkestate)
- The RKE binary and SSH access to the nodes
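As a quick sanity check before starting, the following commands can be used to confirm these prerequisites are in place. The addresses and SSH user below are taken from the example cluster.yaml used throughout this article; substitute your own values, and note that the SSH check is expected to fail against the unresponsive node:
# Confirm the cluster configuration and state files are present
ls -l cluster.yaml cluster.rkestate
# Confirm the RKE binary is available
rke --version
# Confirm SSH and Docker access on each control plane / etcd node
for node in 1.2.3.1 1.2.3.2 1.2.3.3; do
  ssh ubuntu@${node} 'hostname && docker version --format "{{.Server.Version}}"'
done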
Situation
This article details how to remove and replace an unresponsive control plane / etcd node from a local Rancher server cluster, provisioned via the Rancher Kubernetes Engine (RKE) CLI.
Resolution
This operation is relatively simple, and uses the example cluster.yaml below for demonstration purposes.
N.B. Be sure to use your cluster.yaml and matching cluster.rkestate for the relevant cluster.
In this demonstration example, the node that is failing has the address 1.2.3.3:
nodes:
- address: 1.2.3.1
user: ubuntu
role:
- controlplane
- etcd
- address: 1.2.3.2
user: ubuntu
role:
- controlplane
- etcd
- address: 1.2.3.3
user: ubuntu
role:
- controlplane
- etcd
[...] # rest of cluster.yaml except control plane / etcd nodes redacted
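Optionally, before proceeding, you may wish to take a one-off etcd snapshot from one of the healthy nodes, in addition to the recurring snapshots RKE takes by default. A minimal sketch, run on a healthy node (1.2.3.1 in this example); the target path inside the etcd container is an assumption and may need adjusting for your environment:
# Take an ad-hoc etcd snapshot on a healthy control plane / etcd node (path is an assumption)
docker exec etcd etcdctl snapshot save /var/lib/rancher/etcd/pre-node-removal.db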
Step 1. Validate the cluster is quorate and confirm the unresponsive node
On the control plane / etcd nodes, run the following command, per the Rancher Troubleshooting Documentation, to determine etcd endpoint health:
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint health
On the unresponsive node the command may fail to execute; on the healthy nodes you should see output of the following format, indicating the health status of each etcd endpoint:
{"level":"warn","ts":"2020-12-31T12:11:41.840Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-c65a15b4-9646-4c71-914d-f3c892c04c2f/1.2.3.3:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 1.2.3.3:2379: connect: connection refused\""}
https://1.2.3.1:2379 is healthy: successfully committed proposal: took = 13.442336ms
https://1.2.3.2:2379 is healthy: successfully committed proposal: took = 18.227226ms
https://1.2.3.3:2379 is unhealthy: failed to commit proposal: context deadline exceeded
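To cross-reference the unhealthy endpoint against the etcd membership, the member list can also be printed from one of the healthy nodes; the member whose client URL matches the unresponsive address (1.2.3.3:2379 in this example) is the failed member:
docker exec etcd etcdctl member list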
Step 2. Remove the unresponsive node
Having confirmed which node is unresponsive in the cluster, remove it from the nodes block in the cluster configuration file (cluster.yaml), as in the example below where 1.2.3.3 has been removed:
nodes:
- address: 1.2.3.1
user: ubuntu
role:
- controlplane
- etcd
- address: 1.2.3.2
user: ubuntu
role:
- controlplane
- etcd
[...] # rest of cluster.yaml except control plane / etcd nodes redacted
After updating the cluster.yaml file, execute an rke up run to remove the node:
rke up --config cluster.yaml
The above action will remove the unresponsive control plane / etcd node from the cluster.
Step 3. Clean and add the removed node back to the cluster
Once the rke up invocation has completed without any errors, and you can see the node has been removed from the Rancher UI or kubectl get nodes output (see the example below), it is safe to move on to adding the node back in.
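For example, using the kubeconfig RKE generates alongside the cluster configuration (the file name below assumes the example cluster.yaml used in this article):
kubectl --kubeconfig kube_config_cluster.yaml get nodes
# The removed node (1.2.3.3 in this example) should no longer appear in the output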
First, clean the removed node (1.2.3.3 in our example) using the Extended Rancher 2 Cleanup script.
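A minimal sketch of the cleanup step is shown below. The repository path and script name are assumptions based on the Rancher support-tools project; verify the current location and review the script before running it, as it removes Kubernetes and Rancher components and data from the node:
# On the removed node (1.2.3.3 in this example) - script URL and name are assumptions, confirm before use
curl -LO https://raw.githubusercontent.com/rancherlabs/support-tools/master/extended-rancher-2-cleanup/extended-cleanup-rancher2.sh
sudo bash extended-cleanup-rancher2.sh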
After cleaning the node, add it back into the cluster configuration file (cluster.yaml):
nodes:
- address: 1.2.3.1
user: ubuntu
role:
- controlplane
- etcd
- address: 1.2.3.2
user: ubuntu
role:
- controlplane
- etcd
- address: 1.2.3.3
user: ubuntu
role:
- controlplane
- etcd
[...] # rest of cluster.yaml except control plane / etcd nodes redacted
Then run the rke up command again:
rke up --config cluster.yaml
Step 4. Validate final cluster state
Once the rke up command has completed without errors, you can verify the node is visible and Ready via kubectl get nodes and the Rancher UI.
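For example (as in Step 3, the kubeconfig file name assumes the example cluster.yaml):
kubectl --kubeconfig kube_config_cluster.yaml get nodes
# All three control plane / etcd nodes, including the re-added 1.2.3.3, should report a Ready status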
The etcd endpoint health commands on the control plane / etcd nodes should also show each endpoint as healthy, per the following example output:
https://1.2.3.1:2379 is healthy: successfully committed proposal: took = 13.442336ms
https://1.2.3.2:2379 is healthy: successfully committed proposal: took = 18.227226ms
https://1.2.3.3:2379 is healthy: successfully committed proposal: took = 22.065616ms
Status
Top Issue
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.