How to gracefully shut down an RKE cluster
Article Number: 000020031
Environment
A Rancher-provisioned or standalone RKE cluster
Situation
If you need to shut down the infrastructure running an RKE Kubernetes cluster, for example during datacenter maintenance, this guide provides the steps, in the proper order, to ensure a safe cluster shutdown.
Resolution
1. Take a cluster snapshot
As with any cluster maintenance operation, it is strongly advised to take a cluster snapshot before performing this process.
- For standalone clusters, the snapshot process can be found in the RKE documentation.
- For Rancher-provisioned clusters, the process can be found in the Rancher documentation.
2. Drain nodes
Note on Longhorn: If the cluster in question is running SUSE Storage (Longhorn), please also reference the Longhorn "Node Maintenance and Kubernetes Upgrade Guide" documentation. Ensure that all workloads with volumes have been evicted or scaled down, and that all Longhorn volumes are in a detached state, before proceeding to stop Kubernetes.
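As a quick pre-check, the volume states can be listed with kubectl. This is a hedged sketch assuming Longhorn's default longhorn-system namespace; the command is printed as a dry run (drop the echo to execute it), and every volume should report detached before you continue.

```shell
# Hedged sketch: list each Longhorn volume and its state; all volumes
# should report "detached" before Kubernetes is stopped.
# Assumes Longhorn's default namespace; printed as a dry run.
LONGHORN_NS="longhorn-system"

echo kubectl -n "$LONGHORN_NS" get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state
```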
Iterate through all nodes in the cluster (starting with worker/agent nodes) to stop pods gracefully:
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
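The drain step above can be sketched as a small loop. The node names below are placeholders (substitute the output of kubectl get nodes), and the commands are printed rather than executed; remove the echo to run them for real, draining worker/agent nodes before control plane and etcd nodes.

```shell
# Dry-run sketch of the drain loop; node names are placeholders.
WORKER_NODES="worker-1 worker-2"
OTHER_NODES="control-1"

# Drain workers first, then control plane/etcd nodes.
for node in $WORKER_NODES $OTHER_NODES; do
  echo kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done
```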
3. Stop worker nodes
Note on mixed roles: If a node combines the worker, control plane, and/or etcd roles, postpone the docker stop and shutdown operations for that node until the containers for all of its roles have been stopped.
On each worker role node:
- Open an SSH session to the worker node
- Stop kubelet and kube-proxy:
sudo docker stop kubelet kube-proxy
- Stop Docker:
sudo systemctl stop docker
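These per-node commands can be scripted over SSH. A minimal dry-run sketch, assuming placeholder hostnames and passwordless sudo on each worker node; drop the echo to execute.

```shell
# Dry-run sketch: stop the Kubernetes agent containers, then Docker,
# on each worker node over SSH. Hostnames are placeholders.
WORKER_NODES="worker-1 worker-2"

for host in $WORKER_NODES; do
  echo ssh "$host" "sudo docker stop kubelet kube-proxy && sudo systemctl stop docker"
done
```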
4. Stop control plane nodes
For each control plane node:
- Open an SSH session to the control plane node
- Stop kubelet and kube-proxy:
sudo docker stop kubelet kube-proxy
- Stop kube-scheduler and kube-controller-manager:
sudo docker stop kube-scheduler kube-controller-manager
- Stop kube-apiserver:
sudo docker stop kube-apiserver
- Stop Docker:
sudo systemctl stop docker
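The ordering above matters: the scheduler and controller-manager are stopped before the API server, and the Docker daemon goes last. A dry-run sketch for a single control plane node, assuming the placeholder hostname control-1:

```shell
# Dry-run sketch of the stop order on one control plane node.
# The hostname is a placeholder; drop the `echo` to execute.
CONTROL_HOST="control-1"

# One line per `docker stop` invocation, in the required order:
# agent containers, then scheduler/controller-manager, then the API server.
STOP_SEQUENCE="kubelet kube-proxy
kube-scheduler kube-controller-manager
kube-apiserver"

printf '%s\n' "$STOP_SEQUENCE" | while read -r containers; do
  echo ssh "$CONTROL_HOST" "sudo docker stop $containers"
done

# The Docker daemon itself is stopped last.
echo ssh "$CONTROL_HOST" "sudo systemctl stop docker"
```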
5. Stop etcd nodes
For each etcd node:
- Open an SSH session to the etcd node
- Stop kubelet and kube-proxy:
sudo docker stop kubelet kube-proxy
- Stop etcd:
sudo docker stop etcd
- Stop Docker:
sudo systemctl stop docker
6. Shut down nodes
Once Docker has stopped on the nodes, you may safely power down the hosts:
sudo shutdown -h now
Note on Network Attached Storage: If you have network attached storage devices consumed as volumes, wait until the cluster itself is completely shut down before powering down the storage devices.
7. Start Kubernetes again after the shutdown
Kubernetes is resilient and typically requires little intervention during recovery, provided a specific power-on order is followed:
- If applicable, ensure any network attached storage devices are powered on first.
For each etcd node:
- Power on the system/start the instance
- Open an SSH session to the node
- Ensure Docker has started:
sudo systemctl status docker
- Ensure the etcd and kubelet containers' status is Up in Docker:
sudo docker ps
For each control plane node:
- Power on the system/start the instance
- Open an SSH session to the node
- Ensure Docker has started:
sudo systemctl status docker
- Ensure the kube-apiserver, kube-scheduler, kube-controller-manager, and kubelet containers' status is Up in Docker:
sudo docker ps
For each worker node:
- Power on the system/start the instance
- Open an SSH session to the node
- Ensure Docker has started:
sudo systemctl status docker
- Ensure the kubelet container's status is Up in Docker:
sudo docker ps
- Uncordon the node to allow workloads to schedule:
kubectl uncordon <NODE_NAME>
- Log into the Rancher UI (or use kubectl) to ensure workloads have started as expected. This may take several minutes, depending on the number of workloads and your server capacity.
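The verification and uncordon steps can be sketched as follows, again with placeholder node names and the commands printed as a dry run (remove the echo to execute against the cluster):

```shell
# Dry-run sketch of post-startup verification; node names are placeholders.
WORKER_NODES="worker-1 worker-2"

# Confirm all nodes report Ready.
echo kubectl get nodes

# Uncordon each previously drained node so workloads can schedule again.
for node in $WORKER_NODES; do
  echo kubectl uncordon "$node"
done

# Spot-check that workloads are starting back up.
echo kubectl get pods --all-namespaces
```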