cilium-operator fails to schedule on controlplane/etcd role nodes in RKE2 clusters running rke2-cilium v1.18.0 or v1.18.1

Article Number: 000022073

Environment

Situation

In a Rancher-provisioned RKE2 cluster running rke2-cilium v1.18.0 or v1.18.1, with separate controlplane/etcd and worker role nodes, the cilium-operator Pods remain in a Pending state during cluster provisioning. The Pods report FailedScheduling events with the message "0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/etcd: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling." As a result, cluster provisioning fails to progress.

$ kubectl -n kube-system get pods -l app.kubernetes.io/name=cilium-operator
NAME                                                      READY   STATUS              
cilium-operator-59fcfc5dbb-2b5jm                          0/1     Pending             
cilium-operator-59fcfc5dbb-4cqm9                          0/1     Pending    

$ kubectl -n kube-system describe pod cilium-operator-59fcfc5dbb-2b5jm
[...]
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  11m                  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/etcd: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  52s (x2 over 5m52s)  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/etcd: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

Cause

The behaviour is caused by a change in the default cilium-operator tolerations in the upstream Cilium Helm chart, reported upstream in cilium/41921. The issue is tracked in rke2/8974 and is resolved in rke2-cilium chart v1.18.2 and later, whose default values include the required node-role.kubernetes.io/etcd toleration.

Resolution

To resolve the issue, upgrade to an RKE2 release that includes rke2-cilium v1.18.2 or later.

To work around the issue in affected versions, add the toleration below to the cilium-operator tolerations in the rke2-cilium chart values.

 - key: node-role.kubernetes.io/etcd
   operator: Exists

To add this toleration:

  1. Navigate to Cluster Management within the Rancher UI.
  2. Click Edit Config for the affected cluster.
  3. Under Cluster Configuration, click Add-on: Cilium.
  4. Scroll down to the operator.tolerations block and add the node-role.kubernetes.io/etcd toleration:

[...]
operator:
  [...]
  tolerations:
    - key: node-role.kubernetes.io/etcd
      operator: Exists
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
    - key: node-role.kubernetes.io/master
      operator: Exists
    - key: node.kubernetes.io/not-ready
      operator: Exists
[...]
  5. Click Save to update rke2-cilium with the new toleration.
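
Once the change has been saved, it can be useful to confirm that the toleration reached the cilium-operator Deployment. The following is a minimal sketch rather than an official procedure: has_etcd_toleration is a hypothetical helper name, and it assumes that kubectl prints the tolerations list as compact JSON when queried with a jsonpath expression.

```shell
# Hypothetical helper (not part of kubectl or RKE2): reads a tolerations
# JSON array on stdin and succeeds only if it contains the etcd role key.
# Assumes compact JSON with no space after the colon.
has_etcd_toleration() {
  grep -qF '"key":"node-role.kubernetes.io/etcd"'
}

# Example against a live cluster (requires kubectl access to the affected cluster):
#   kubectl -n kube-system get deployment cilium-operator \
#     -o jsonpath='{.spec.template.spec.tolerations}' | has_etcd_toleration \
#     && echo "etcd toleration present"
```

If the toleration is present, re-running the kubectl get pods command from the Situation section should eventually show the cilium-operator Pods leaving the Pending state.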