cilium-operator fails to schedule on controlplane/etcd role nodes in RKE2 clusters running rke2-cilium v1.18.0 or v1.18.1
Article Number: 000022073
Environment
- A Rancher-provisioned RKE2 cluster running rke2-cilium v1.18.0 or v1.18.1 (affected RKE2 versions are v1.30.14+rke2r4, v1.31.12+rke2r1 - v1.31.13+rke2r1, v1.32.8+rke2r1 - v1.32.9+rke2r1, v1.33.4+rke2r1 - v1.33.5+rke2r1, and v1.34.1+rke2r1).
- Separate controlplane/etcd and worker role nodes.
Situation
In a Rancher-provisioned RKE2 cluster running rke2-cilium v1.18.0 or v1.18.1, with separate controlplane/etcd and worker role nodes, the cilium-operator Pods remain in a Pending state during cluster provisioning. The cilium-operator Pods show FailedScheduling events with the message "0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/etcd: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling." As a result, cluster provisioning fails to progress.
$ kubectl -n kube-system get pods -l app.kubernetes.io/name=cilium-operator
NAME READY STATUS
cilium-operator-59fcfc5dbb-2b5jm 0/1 Pending
cilium-operator-59fcfc5dbb-4cqm9 0/1 Pending
$ kubectl -n kube-system describe pod cilium-operator-59fcfc5dbb-2b5jm
[...]
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 11m default-scheduler 0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/etcd: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
Warning FailedScheduling 52s (x2 over 5m52s) default-scheduler 0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/etcd: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
Cause
The behaviour is caused by a change in the default cilium-operator tolerations in the upstream cilium Helm chart and was reported upstream in cilium/41921. The issue is tracked in rke2/8974 and is resolved in rke2-cilium chart v1.18.2+, whose default values include the required node-role.kubernetes.io/etcd toleration.
Resolution
To resolve the issue, upgrade to a later RKE2 release that ships rke2-cilium v1.18.2+.
To work around the issue in affected versions, add the toleration below to the cilium-operator in the rke2-cilium chart.
- key: node-role.kubernetes.io/etcd
  operator: Exists
To add this toleration:
- Navigate to Cluster Management within the Rancher UI.
- Click Edit Config for the affected cluster.
- Under Cluster Configuration, click Add-on: Cilium.
- Scroll down to the operator.tolerations block and add the node-role.kubernetes.io/etcd toleration:
[...]
operator:
  [...]
  tolerations:
    - key: node-role.kubernetes.io/etcd
      operator: Exists
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
    - key: node-role.kubernetes.io/master
      operator: Exists
    - key: node.kubernetes.io/not-ready
      operator: Exists
[...]
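Once the change has been saved and applied, it can be useful to confirm that the toleration reached the running workload. The sketch below is one possible way to check this and is not part of the documented procedure. The helper name has_etcd_toleration is hypothetical, and the usage comment assumes the operator runs as a Deployment named cilium-operator in kube-system, so adjust it if the cluster differs.

```shell
# Hypothetical helper: reads a tolerations list as JSON on stdin and exits 0
# only if it contains a toleration with key node-role.kubernetes.io/etcd.
has_etcd_toleration() {
  grep -q '"key":[[:space:]]*"node-role\.kubernetes\.io/etcd"'
}

# Usage against a live cluster (assumes a Deployment named cilium-operator):
#   kubectl -n kube-system get deployment cilium-operator \
#     -o jsonpath='{.spec.template.spec.tolerations}' | has_etcd_toleration \
#     && echo "etcd toleration present"
#
# The Pods should then leave the Pending state:
#   kubectl -n kube-system get pods -l app.kubernetes.io/name=cilium-operator
```

If the helper reports the toleration but the Pods remain Pending, the describe output shown in the Situation section should indicate whether a different taint is now blocking scheduling.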