
Unexpected node draining in node driver clusters when upgrading Rancher from v2.12.2

Article Number: 000022311

Environment

Situation

Upgrading Rancher from v2.12.2 to a higher release triggers a drain and subsequent uncordon of all nodes in node driver clusters where Drain Nodes is set to Yes in the cluster's Upgrade Strategy. All workloads running in the cluster are rescheduled as a result. This is unexpected, as no configuration change is made to the cluster during the Rancher upgrade.

Cause

Upgrading from v2.12.2 to any higher version changes the plan generated for nodes in node driver clusters. The rancher-system-agent on each node monitors the plan's hash to detect changes, so this change triggers a reconciliation in which the node is drained and then uncordoned, in accordance with the Drain Nodes setting in the Upgrade Strategy. A Rancher upgrade is not expected to trigger a downstream cluster reconciliation, and this behaviour is not observed when upgrading between other Rancher versions.
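The trigger can be illustrated with a minimal sketch of hash-based change detection. This is illustrative only, assuming a simplified model of the agent's behaviour; the function names and plan format are hypothetical and not the actual rancher-system-agent code:

```python
import hashlib


def plan_hash(plan: bytes) -> str:
    # Hash the serialized node plan; the agent compares hashes
    # rather than diffing plan contents.
    return hashlib.sha256(plan).hexdigest()


def needs_reconcile(applied_hash: str, desired_plan: bytes) -> bool:
    # Any change to the plan bytes changes the hash, so even a change
    # with no functional effect on the node causes a reconciliation -
    # and a drain/uncordon cycle when Drain Nodes is set to Yes.
    return plan_hash(desired_plan) != applied_hash


# Hypothetical plans: the upgrade alters the plan without changing intent.
old_plan = b'{"instructions": ["start rke2"]}'
new_plan = b'{"instructions": ["start rke2"], "probes": {}}'

applied = plan_hash(old_plan)
print(needs_reconcile(applied, old_plan))  # False: plan unchanged
print(needs_reconcile(applied, new_plan))  # True: hash differs, node is drained
```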

Resolution

To prevent node driver clusters from draining when upgrading Rancher from v2.12.2 - which would cause temporary workload unavailability due to rescheduling - set Drain Nodes to No in the Upgrade Strategy of the affected clusters before the Rancher upgrade:

  1. Navigate to Cluster Management in the Rancher UI and click Edit Config for the relevant RKE2/K3s node driver cluster.
  2. Under Cluster Configuration, select Upgrade Strategy.
  3. Set the Drain Nodes option to No for both Control Plane and Worker Nodes.
  4. Click Save to apply the changes.
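For clusters managed declaratively, the same settings live in the provisioning cluster object. A sketch of the relevant fields, assuming Rancher's `provisioning.cattle.io/v1` Cluster schema (the cluster name and namespace are placeholders):

```yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: example-cluster      # placeholder cluster name
  namespace: fleet-default
spec:
  rkeConfig:
    upgradeStrategy:
      controlPlaneDrainOptions:
        enabled: false       # Drain Nodes: No for control plane nodes
      workerDrainOptions:
        enabled: false       # Drain Nodes: No for worker nodes
```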

After the Rancher upgrade, you can revert this change and set Drain Nodes back to Yes.