How to remove and replace an unresponsive control plane / etcd node in the local Rancher server cluster, provisioned by the Rancher Kubernetes Engine (RKE) CLI
This document (000020033) is provided subject to the disclaimer at the end of this document.
Environment
A Rancher Kubernetes Engine (RKE) CLI provisioned cluster
- A Highly Available control plane / etcd configuration, with an odd number of mixed role control plane / etcd nodes, commonly 3 or 5
- The cluster is quorate, i.e. with 3 control plane / etcd nodes only a single node is unresponsive, or with 5 control plane / etcd nodes up to two nodes are unresponsive
- The cluster configuration file (e.g. cluster.yml) and state file (e.g. cluster.rkestate)
- The RKE binary and SSH access to the nodes
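As a quick sanity check before starting, the following commands can be used to confirm these prerequisites are in place. The addresses and SSH user below are taken from the example cluster.yaml used throughout this article; substitute your own values, and note that the SSH check is expected to fail against the unresponsive node:
# Confirm the cluster configuration and state files are present
ls -l cluster.yaml cluster.rkestate
# Confirm the RKE binary is available
rke --version
# Confirm SSH and Docker access on each control plane / etcd node
for node in 1.2.3.1 1.2.3.2 1.2.3.3; do
  ssh ubuntu@${node} 'hostname && docker version --format "{{.Server.Version}}"'
done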
Situation
This article details how to remove and replace an unresponsive control plane / etcd node from a local Rancher server cluster, provisioned via the Rancher Kubernetes Engine (RKE) CLI.
Resolution
This operation is relatively simple, and uses the example cluster.yaml below for demonstration purposes.
N.B. Be sure to use your cluster.yaml and matching cluster.rkestate for the relevant cluster.
In this demonstration example, the node that is failing has the address 1.2.3.3:
nodes:
- address: 1.2.3.1
user: ubuntu
role:
- controlplane
- etcd
- address: 1.2.3.2
user: ubuntu
role:
- controlplane
- etcd
- address: 1.2.3.3
user: ubuntu
role:
- controlplane
- etcd
[...] # rest of cluster.yaml except control plane / etcd nodes redacted
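Optionally, before proceeding, you may wish to take a one-off etcd snapshot from one of the healthy nodes, in addition to the recurring snapshots RKE takes by default. A minimal sketch, run on a healthy node (1.2.3.1 in this example); the target path inside the etcd container is an assumption and may need adjusting for your environment:
# Take an ad-hoc etcd snapshot on a healthy control plane / etcd node (path is an assumption)
docker exec etcd etcdctl snapshot save /var/lib/rancher/etcd/pre-node-removal.db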
Step 1. Validate the cluster is quorate and confirm the unresponsive node
On the control plane / etcd nodes, run the following command, per the Rancher Troubleshooting Documentation, to determine etcd endpoint health:
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint health
On the unresponsive node the command may fail to execute; on the healthy nodes you should see output of the following format, indicating the health status of each etcd endpoint:
{"level":"warn","ts":"2020-12-31T12:11:41.840Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-c65a15b4-9646-4c71-914d-f3c892c04c2f/1.2.3.3:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 1.2.3.3:2379: connect: connection refused\""}
https://1.2.3.1:2379 is healthy: successfully committed proposal: took = 13.442336ms
https://1.2.3.2:2379 is healthy: successfully committed proposal: took = 18.227226ms
https://1.2.3.3:2379 is unhealthy: failed to commit proposal: context deadline exceeded
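To cross-reference the unhealthy endpoint against the etcd membership, the member list can also be printed from one of the healthy nodes; the member whose client URL matches the unresponsive address (1.2.3.3:2379 in this example) is the failed member:
docker exec etcd etcdctl member list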
Step 2. Remove the unresponsive node
Having confirmed which node is unresponsive in the cluster, remove it from the nodes block in the cluster configuration file (cluster.yaml), as in the example below where 1.2.3.3 has been removed:
nodes:
- address: 1.2.3.1
user: ubuntu
role:
- controlplane
- etcd
- address: 1.2.3.2
user: ubuntu
role:
- controlplane
- etcd
[...] # rest of cluster.yaml except control plane / etcd nodes redacted
After updating the cluster.yaml file, execute an rke up run to remove the node:
rke up --config cluster.yaml
The above action will remove the unresponsive control plane / etcd node from the cluster.
Step 3. Clean and add the removed node back to the cluster
Once the rke up invocation has completed without any errors, and you can see the node has been removed from the Rancher UI or kubectl get nodes output (see the example below), it is safe to move on to adding the node back in.
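For example, using the kubeconfig RKE generates alongside the cluster configuration (the file name below assumes the example cluster.yaml used in this article):
kubectl --kubeconfig kube_config_cluster.yaml get nodes
# The removed node (1.2.3.3 in this example) should no longer appear in the output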
First, clean the removed node (1.2.3.3 in our example) using the Extended Rancher 2 Cleanup script.
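A minimal sketch of the cleanup step is shown below. The repository path and script name are assumptions based on the Rancher support-tools project; verify the current location and review the script before running it, as it removes Kubernetes and Rancher components and data from the node:
# On the removed node (1.2.3.3 in this example) - script URL and name are assumptions, confirm before use
curl -LO https://raw.githubusercontent.com/rancherlabs/support-tools/master/extended-rancher-2-cleanup/extended-cleanup-rancher2.sh
sudo bash extended-cleanup-rancher2.sh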
After cleaning the node, add it back into the cluster configuration file (cluster.yaml):
nodes:
- address: 1.2.3.1
user: ubuntu
role:
- controlplane
- etcd
- address: 1.2.3.2
user: ubuntu
role:
- controlplane
- etcd
- address: 1.2.3.3
user: ubuntu
role:
- controlplane
- etcd
[...] # rest of cluster.yaml except control plane / etcd nodes redacted
Then run the rke up command again:
rke up --config cluster.yaml
Step 4. Validate final cluster state
Once the rke up command has completed without errors, you can verify the node is visible and Ready via kubectl get nodes and the Rancher UI.
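For example (as in Step 3, the kubeconfig file name assumes the example cluster.yaml):
kubectl --kubeconfig kube_config_cluster.yaml get nodes
# All three control plane / etcd nodes, including the re-added 1.2.3.3, should report a Ready status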
The etcd endpoint health commands on the control plane / etcd nodes should also show each endpoint as healthy, per the following example output:
https://1.2.3.1:2379 is healthy: successfully committed proposal: took = 13.442336ms
https://1.2.3.2:2379 is healthy: successfully committed proposal: took = 18.227226ms
https://1.2.3.3:2379 is healthy: successfully committed proposal: took = 22.065616ms
Status
Top Issue
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.