The system-upgrade-controller is not updating tainted nodes
Article Number: 000021800
Environment
- Product: Standalone Rancher Kubernetes Engine (RKE2) / K3s (not Rancher-deployed clusters)
- Component: system-upgrade-controller
- Versions Affected: All versions of system-upgrade-controller
- Kubernetes Versions: 1.21.x and above
- Operating Systems: All supported Linux distributions
Situation
When attempting to perform an automated upgrade of a standalone RKE2 or K3s cluster using the system-upgrade-controller, nodes with taints may be skipped. The upgrade jobs are created, but their pods remain in a Pending state indefinitely on the tainted nodes. The system-upgrade-controller logs may show that the plans were applied, yet some nodes never receive the upgrade.
Users observe that:
- Non-tainted nodes upgrade successfully
- Tainted nodes (such as Longhorn storage nodes with node.longhorn.io/create-default-disk:NoSchedule) remain on the old version
- The upgrade plan shows as applied, but doesn't complete for all nodes
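To confirm the symptom, you can list upgrade pods stuck in Pending. The examples below assume the controller's default namespace, system-upgrade; adjust if yours differs:

```shell
# Upgrade pods that could not be scheduled onto a tainted node stay Pending
kubectl -n system-upgrade get pods --field-selector=status.phase=Pending
```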
Cause
This issue occurs because, by default, the pods that the system-upgrade-controller creates for its upgrade jobs do not tolerate node taints. They are ordinary Kubernetes pods subject to the normal scheduling rules, and since taints exist precisely to keep pods off a node, the scheduler cannot place the upgrade pods on tainted nodes unless the Plan explicitly adds matching tolerations.
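You can verify that a taint is the cause by inspecting the node and the pending pod's events. Here `<node-name>` and `<upgrade-pod>` are placeholders for your own resources:

```shell
# Show the taints on the node that is not upgrading
kubectl describe node <node-name> | grep -A3 Taints

# The pending pod's events typically include a FailedScheduling
# warning mentioning an untolerated taint
kubectl -n system-upgrade describe pod <upgrade-pod>
```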
Resolution
To resolve this issue, modify your Plan resource to include the appropriate tolerations for the tainted nodes. Here's how to do it:
Option 1: Add specific tolerations for Longhorn nodes
If you know the specific taints on your nodes (such as Longhorn storage nodes), you can add those specific tolerations:
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-upgrade
  namespace: system-upgrade
spec:
  concurrency: 1
  version: v1.32.3+k3s1
  nodeSelector:
    matchExpressions:
      - key: kubernetes.io/os
        operator: In
        values:
          - linux
  serviceAccountName: system-upgrade
  cordon: true
  # Tolerations for Longhorn-tainted nodes;
  # add any other specific tolerations your nodes might need
  tolerations:
    - key: "node.longhorn.io/create-default-disk"
      operator: "Exists"
      effect: "NoSchedule"
  upgrade:
    image: rancher/k3s-upgrade
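If you are unsure which taints your nodes carry, a standard kubectl jsonpath query (not specific to the system-upgrade-controller) lists them per node, so you can copy the keys and effects into the Plan's tolerations:

```shell
# Print each node's name followed by its taints
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
```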
Option 2: Tolerate all taints (recommended for mixed-workload clusters)
For clusters with several different taints, you can use a single blanket toleration that matches every taint:
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
  labels:
    rke2-upgrade: agent
spec:
  concurrency: 1
  cordon: true
  drain:
    force: true
  nodeSelector:
    matchExpressions:
      - key: kubernetes.io/os
        operator: In
        values:
          - linux
      - key: node-role.kubernetes.io/control-plane
        operator: NotIn
        values:
          - "true"
  prepare:
    args:
      - prepare
      - server-plan
    image: rancher/rke2-upgrade
  serviceAccountName: system-upgrade
  # This single toleration matches ALL taints
  tolerations:
    - operator: Exists
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.32.3+rke2r1
Apply the updated Plan:
kubectl apply -f your-plan.yaml
Verify that the upgrade jobs are now being scheduled on the tainted nodes:
kubectl get pods -n system-upgrade -w
If needed, you can check node versions after the upgrade completes:
kubectl get nodes -o wide
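For a narrower view, a custom-columns query shows only each node's name and kubelet version, which should match the Plan's version field once the upgrade has finished:

```shell
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
```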