How to gracefully shutdown an RKE cluster

Article Number: 000020031

Environment

A Rancher-provisioned or standalone RKE cluster

Situation

If you need to shut down the infrastructure running an RKE Kubernetes cluster, for example during datacenter maintenance, this guide provides the steps, in the proper order, to ensure a safe cluster shutdown.

Resolution

1. Take a cluster snapshot

As with any cluster maintenance operation, it is strongly advised to take a cluster snapshot before performing this process.

  • For standalone clusters, the snapshot process can be found in the RKE documentation.
  • For Rancher-provisioned clusters, the process can be found in the Rancher documentation.

2. Drain nodes

Note on Longhorn: If the cluster in question is running SUSE Storage (Longhorn), please also reference the Longhorn "Node Maintenance and Kubernetes Upgrade Guide" documentation. Ensure that all workloads with volumes have been evicted or scaled down, and that all Longhorn volumes are in a detached state, before proceeding to stop Kubernetes.
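The Longhorn volume state can be checked with kubectl against the Longhorn Volume custom resource. The longhorn-system namespace and the .status.state field below reflect a default Longhorn installation, so treat this as a sketch and adjust for your environment:

```shell
# Sketch: list Longhorn volumes and their attachment state.
# Assumes Longhorn's default "longhorn-system" namespace.
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state
# Every volume should show "detached" before stopping Kubernetes.
```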

Iterate through all nodes in the cluster (starting with worker/agent nodes) to stop pods gracefully:

kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
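Draining each worker node in turn can be sketched as a loop. The node-role.kubernetes.io/worker=true label selector below is an assumption based on RKE's default node labels; verify it against your own cluster before use:

```shell
# Sketch: drain every worker node in turn before stopping services.
# The worker label is an assumption -- check with: kubectl get nodes --show-labels
for node in $(kubectl get nodes -l node-role.kubernetes.io/worker=true \
    -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done
```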

3. Stop worker nodes

Note on mixed roles: If a node combines the worker, control plane, and/or etcd roles, postpone the docker stop and shutdown operations on that node until the containers for all of its roles have been stopped; stop Docker only once, after the last role's containers are down.

On each worker role node:

  1. Open an SSH session to the worker node
  2. Stop kubelet and kube-proxy: sudo docker stop kubelet kube-proxy
  3. Stop Docker: sudo systemctl stop docker

4. Stop control plane nodes

For each control plane node:

  1. Open an SSH session to the control plane node
  2. Stop kubelet and kube-proxy: sudo docker stop kubelet kube-proxy
  3. Stop kube-scheduler and kube-controller-manager: sudo docker stop kube-scheduler kube-controller-manager
  4. Stop kube-apiserver: sudo docker stop kube-apiserver
  5. Stop Docker: sudo systemctl stop docker

5. Stop etcd nodes

For each etcd node:

  1. Open an SSH session to the etcd node
  2. Stop kubelet and kube-proxy: sudo docker stop kubelet kube-proxy
  3. Stop etcd: sudo docker stop etcd
  4. Stop Docker: sudo systemctl stop docker
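For nodes that combine roles (see the note in step 3), the full stop order can be sketched as a single sequence, stopping Docker only once at the end. This is a sketch of the commands above, not a separate procedure; skip any line whose containers are not present for the node's roles:

```shell
# Sketch: combined stop order for a mixed-role node (run over SSH).
sudo docker stop kubelet kube-proxy                                     # all roles
sudo docker stop kube-scheduler kube-controller-manager kube-apiserver  # control plane role
sudo docker stop etcd                                                   # etcd role
sudo systemctl stop docker                                              # last, once all role containers are stopped
```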

6. Shut down nodes

Once Docker has stopped on the nodes, you may safely power down the hosts:

sudo shutdown -h now

Note on Network Attached Storage: If you have network attached storage devices consumed as volumes, wait until the cluster itself is completely shut down before powering down the storage devices.

7. Start Kubernetes again after the shutdown

Kubernetes is resilient and typically requires little intervention during recovery, provided a specific power-on order is followed:

  1. Ensure any network attached storage devices are powered on first, if applicable.
  2. For each etcd node:
     1. Power on the system/start the instance
     2. Open an SSH session to the node
     3. Ensure Docker has started: sudo systemctl status docker
     4. Ensure the etcd and kubelet containers' status is Up in Docker: sudo docker ps
  3. For each control plane node:
     1. Power on the system/start the instance
     2. Open an SSH session to the node
     3. Ensure Docker has started: sudo systemctl status docker
     4. Ensure the kube-apiserver, kube-scheduler, kube-controller-manager, and kubelet containers' status is Up in Docker: sudo docker ps
  4. For each worker node:
     1. Power on the system/start the instance
     2. Open an SSH session to the node
     3. Ensure Docker has started: sudo systemctl status docker
     4. Ensure the kubelet container's status is Up in Docker: sudo docker ps
  5. Uncordon nodes to allow workloads to schedule: kubectl uncordon <NODE_NAME>
  6. Log into the Rancher UI (or use kubectl) to ensure workloads have started as expected. This may take several minutes depending on the number of workloads and your server capacity.
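The final verification can be sketched with standard kubectl commands; the healthy states noted in the comments assume the cluster was shut down cleanly as described above:

```shell
# Sketch: verify cluster health after the power-on sequence completes.
kubectl get nodes                   # every node should report a Ready status
kubectl get pods --all-namespaces   # workloads should settle into Running/Completed
```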