How to roll back the Kubernetes version of an RKE2 upstream/standalone cluster following an upgrade
This document (000021618) is provided subject to the disclaimer at the end of this document.
Environment
A standalone (i.e. not Rancher-provisioned) RKE2 cluster, such as an upstream Rancher local RKE2 cluster, that has been upgraded
Situation
Following a Kubernetes version upgrade to a standalone RKE2 cluster, it may be necessary to perform a version rollback, due to an issue encountered with the upgrade or new Kubernetes version.
This requires:
- An RKE2 etcd snapshot of the cluster prior to the version upgrade
- The previous RKE2 version
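To confirm that a suitable pre-upgrade snapshot exists, the snapshots known to a server node can be listed. A minimal check, assuming the default on-disk snapshot location, might look like this:

```bash
# List the etcd snapshots RKE2 knows about (run on a server node)
rke2 etcd-snapshot list

# Alternatively, inspect the default snapshot directory directly
ls -lh /var/lib/rancher/rke2/server/db/snapshots/
```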
Resolution
The rollback process is described below. For the purposes of this example, a rollback to RKE2 v1.30.9+rke2r1 is assumed, following an upgrade to a later version.
- If the cluster is running, with the Kubernetes API available, and you wish to gracefully stop workloads within the cluster before executing the rollback, first drain all of the nodes: `kubectl drain --ignore-daemonsets --delete-emptydir-data ...`
- On each node, run the `rke2-killall.sh` script to stop the rke2-server or rke2-agent service (depending on whether it is a server or agent node) and all running Pod processes.
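The script is installed alongside RKE2; on a script-based install it typically lands in /usr/local/bin, while RPM-based installs place it in /usr/bin (the exact path may vary by installation method):

```bash
# Stop the RKE2 service and all Pod processes on this node
/usr/local/bin/rke2-killall.sh
```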
- Manually roll back the RKE2 binary to the previous version.
- In clusters with internet access, this can be done via the RKE2 installation script, specifying the previous RKE2 Kubernetes version via the `INSTALL_RKE2_VERSION` variable:
  - On server nodes run: `curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION="v1.30.9+rke2r1" sh -`
  - On agent nodes run: `curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION="v1.30.9+rke2r1" INSTALL_RKE2_TYPE=agent sh -`
- In air-gapped clusters, this can be done by downloading the artefacts for the previous RKE2 version onto the nodes and invoking the install script locally, per the RKE2 air-gapped installation documentation, as sketched below.
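A minimal sketch of the air-gapped rollback install, assuming the tarball artifact method from the RKE2 air-gap documentation, an amd64 architecture and the example version v1.30.9+rke2r1 (adjust version, architecture and paths as needed):

```bash
# Download the artifacts for the target (previous) version on a machine with
# internet access, then copy them to /root/rke2-artifacts on each cluster node.
mkdir -p /root/rke2-artifacts && cd /root/rke2-artifacts
curl -OLs https://github.com/rancher/rke2/releases/download/v1.30.9%2Brke2r1/rke2-images.linux-amd64.tar.zst
curl -OLs https://github.com/rancher/rke2/releases/download/v1.30.9%2Brke2r1/rke2.linux-amd64.tar.gz
curl -OLs https://github.com/rancher/rke2/releases/download/v1.30.9%2Brke2r1/sha256sum-amd64.txt
curl -sfL https://get.rke2.io --output install.sh

# On each server node, run the install script against the local artifacts
INSTALL_RKE2_ARTIFACT_PATH=/root/rke2-artifacts sh install.sh

# On each agent node, additionally set the install type
INSTALL_RKE2_ARTIFACT_PATH=/root/rke2-artifacts INSTALL_RKE2_TYPE=agent sh install.sh
```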
With the previous RKE2 binary installed on all cluster nodes, the next step is to restore an etcd snapshot taken whilst the cluster was still running this previous RKE2 version (in this example v1.30.9+rke2r1).
- On the first server node (i.e. the node without a `server:` entry defined in its RKE2 config file), initiate the cluster restore, per step 2 of the RKE2 documentation on Restoring a Snapshot to Existing Nodes: `rke2 server --cluster-reset --cluster-reset-restore-path=<PATH-TO-SNAPSHOT>`
- Once the restore process is completed, proceed through steps 3-5 of the RKE2 documentation on Restoring a Snapshot to Existing Nodes:
- Start the rke2-server service on the first server node as follows: `systemctl start rke2-server`
- Remove the rke2 db directory on the other server nodes as follows: `rm -rf /var/lib/rancher/rke2/server/db`
- Start the rke2-server service on other server nodes with the following command: `systemctl start rke2-server`
- Finally, start the rke2-agent service on any agent nodes: `systemctl start rke2-agent`
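Once all services are up, a quick check that the rollback took effect might look like the following, assuming kubectl is run from a server node using the RKE2-managed kubeconfig:

```bash
# Verify all nodes are Ready and report the rolled-back Kubernetes version
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
kubectl get nodes -o wide

# If nodes were drained before the rollback, allow workloads to schedule again
kubectl uncordon <node-name>
```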
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.