Restore fails due to Rancher directories being unmounted
Article Number: 000021611
Environment
Rancher v2.8.5 and earlier
RKE2 v1.27 and earlier
Situation
When attempting to restore an RKE2 cluster, it fails due to Rancher directories being unmounted by the rke2-killall.sh script.
After the restore process is initiated, the restore job expects "/var/lib/rancher" to be mounted, but the rke2-killall.sh script explicitly unmounts it, because the unmount command is hardcoded within the script itself in Kubernetes v1.27.12.
The job then tries to run "[Applyinator] Command touch [/var/lib/rancher/rke2/server/db/etcd/tombstone]", which fails because the directory is no longer mounted.
This leaves the cluster in a broken state, and even performing a cluster reset will not help in this case.
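For context, a simplified illustration of the unmount logic inside rke2-killall.sh follows (an illustrative sketch only, not a verbatim copy of the shipped script): the do_unmount_and_remove function walks the mount table and force-unmounts every mount point under the path it is given, and in v1.27.12 it is called with the Rancher data directory.

# Simplified illustration of the unmount logic in rke2-killall.sh (not a verbatim copy)
do_unmount_and_remove() {
    # List every mount point under the given prefix, deepest first,
    # then force-unmount and remove each one
    awk -v prefix="$1" '$2 ~ "^" prefix { print $2 }' /proc/self/mounts | sort -r |
        while read -r mountpoint; do
            umount -f "$mountpoint" && rm -rf "$mountpoint"
        done
}

# The hardcoded call that leaves the restore job without its expected mount
do_unmount_and_remove '/var/lib/rancher/rke2'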
The following error messages (symptoms) can be seen in the logs:
level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3 2>/dev/null] finished with err: <nil> and exit code: 127
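The Applyinator messages come from the rancher-system-agent service, so its journal on the affected Control Plane nodes is the usual place to look for these symptoms (assuming a systemd-based installation):

# Search the rancher-system-agent journal for Applyinator messages
journalctl -u rancher-system-agent --no-pager | grep -i applyinator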
Cause
The rke2-killall.sh script unmounts the Rancher directories.
Resolution
It is strongly recommended to upgrade to at least Kubernetes v1.27.16, as the issue has been addressed starting from that version.
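To confirm which version the nodes are currently running (and whether the fixed version is already in place), the version reported by each node can be checked with kubectl, for example:

# The VERSION column shows the RKE2/Kubernetes version of each node
kubectl get nodes -o wide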
Alternatively, if you are still on v1.27.12, apply the following workaround steps in sequence when the restore first fails (a combined command sketch follows the steps):
1. On each Control Plane node, comment out the following single line in the rke2-killall.sh script (the script is usually located under /usr/local/bin):
do_unmount_and_remove '/var/lib/rancher/rke2'
2. execute "mount -a" on each Control Plane node (as this was removed by the script)
3. execute "systemctl restart rancher-system-agent" on each node.
This causes it to fetch the machine-plan, and use the already present script, to successfully run or proceed with the restore.
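The following is a minimal sketch of the workaround as a whole, run on each Control Plane node, assuming the script is located at /usr/local/bin/rke2-killall.sh (adjust the path and the mount check to your installation):

#!/bin/sh
# Workaround sketch for RKE2 v1.27.12, to be run on each Control Plane node
# Assumes rke2-killall.sh lives at /usr/local/bin/rke2-killall.sh

# 1. Comment out the line that unmounts /var/lib/rancher/rke2
sed -i "s|^do_unmount_and_remove '/var/lib/rancher/rke2'|#&|" /usr/local/bin/rke2-killall.sh

# 2. Re-mount everything from /etc/fstab and confirm the Rancher mount is back
#    (findmnt only reports it if /var/lib/rancher is a mount point in your setup)
mount -a
findmnt /var/lib/rancher

# 3. Restart the agent so it fetches the machine plan again and retries the restore
systemctl restart rancher-system-agent

# Optionally follow the agent logs to watch the restore proceed
journalctl -u rancher-system-agent -f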