Tigera-operator fails due to rancher-webhook denying access while upgrading the RKE2 cluster

Article Number: 000022161

Environment

SUSE Rancher: v2.10.5

RKE2 v1.31.7

PSA is enabled on the downstream cluster

Situation

During an RKE2 cluster upgrade from version 1.30.7 to 1.31.11, one of the four downstream clusters becomes stuck with the message 'Waiting for cluster agent to connect'. The cluster-agent pod fails to start because it cannot resolve the Rancher URL. The hostname resolution failure is caused by rke2-coredns failing to start with a Calico error, which in turn occurs because the calico-kube-controller deployment fails with the error message 'cannot find a qualified ippool'.

Cause

  • This issue is caused by a known bug mentioned here.
  • This appears to be a deadlock: the rancher-webhook pod cannot start because its Calico sandbox cannot be created, while Calico cannot be installed without the rancher-webhook. A race condition or timeout during the upgrade process, possibly related to the PSA template, may also contribute to the issue. The validating webhook blocks the creation of namespaces, which leads to the 'cannot find a qualified ippool' error.

Resolution

  • The downstream cluster is RKE2 v1.31.7 with PSA (Pod Security Admission) enabled.
  • The calico-kube-controller pod fails with the following error:
cannot find a qualified ippool
  • While investigating further, it was found that tigera-operator was failing because rancher-webhook denied access:
Error creating resource : admission webhook rancher.cattle.io.namespace.create-non-kubesystem denied the request : Unauthorized
  • The workaround below can be applied to fix the issue temporarily:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tigera-operator-psa
rules:
- apiGroups:
  - management.cattle.io
  resources:
  - projects
  verbs:
  - updatepsa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tigera-operator-psa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tigera-operator-psa
subjects:
- kind: ServiceAccount
  name: tigera-operator
  namespace: tigera-operator
  • After applying the workaround, verify that tigera-operator starts immediately and that the calico-kube-controller deployment rolls out successfully. The cluster-agent should then connect, and the cluster should become active.