Rancher-provisioned RKE2 clusters stuck "Waiting for kube-controller-manager/kube-scheduler probes" due to certificate expiration in Rancher < v2.7.5
Article Number: 000021371
Environment
- SUSE Rancher < v2.7.5
- A Rancher-provisioned RKE2 cluster
Situation
This article explains a problem where a Rancher-provisioned RKE2 cluster gets stuck in a "Waiting for probes: kube-controller-manager, kube-scheduler" state when running Rancher < v2.7.5. This happens because the certificates used by the kube-controller-manager and kube-scheduler components expire.
Cause
RKE2's regular certificate rotation doesn't manage the certificates for the kube-controller-manager and kube-scheduler in Rancher-provisioned RKE2 clusters. Because of an issue in Rancher versions before v2.7.5, these certificates were not automatically renewed when other cluster certificates were rotated. Once these certificates expire, communication between these components and the kube-apiserver can fail, leading to the "Waiting for probes" state.
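To confirm that expired certificates are the cause, the expiry dates can be checked directly on an RKE2 server node. The following is a minimal sketch rather than an official procedure. It assumes openssl is installed on the node and that the .crt files sit under the tls directories named in the Workaround section. The check_cert helper and the TLS_DIR variable are illustrative names, not part of RKE2.

```shell
#!/bin/sh
# Sketch: report the expiry of the kube-controller-manager and kube-scheduler
# certificates. TLS_DIR defaults to the path used in this article.
TLS_DIR="${TLS_DIR:-/var/lib/rancher/rke2/server/tls}"

# check_cert is a hypothetical helper: prints the notAfter date and a status line.
check_cert() {
    cert="$1"
    if [ ! -f "$cert" ]; then
        echo "missing: $cert"
        return 2
    fi
    # Print the certificate's notAfter date.
    openssl x509 -noout -enddate -in "$cert"
    # -checkend 0 exits non-zero if the certificate has already expired
    # (it also fails if the file cannot be parsed as a certificate).
    if openssl x509 -noout -checkend 0 -in "$cert" >/dev/null; then
        echo "valid: $cert"
    else
        echo "EXPIRED or unreadable: $cert"
        return 1
    fi
}

for component in kube-controller-manager kube-scheduler; do
    check_cert "$TLS_DIR/$component/$component.crt" || true
done
```

If either certificate is reported as expired, the Resolution or Workaround in this article applies.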
Resolution
This issue is fixed in Rancher version 2.7.5. Starting with this version, the kube-controller-manager and kube-scheduler certificates in Rancher-managed RKE2 clusters are part of the certificate rotation. When you use Rancher's Rotate Certificates feature, these certificates are also renewed, which prevents the problem described above.
Workaround
For Rancher versions older than 2.7.5, as well as in exceptional cases, the suggested workaround is as follows:
- Stop the RKE2 server:
    systemctl stop rke2-server
- Remove the .crt and .key files in the respective tls directories:
    rm /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.{crt,key}
    rm /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.{crt,key}
- Perform a certificate rotation:
    rke2 certificate rotate
- Restart the RKE2 server:
    systemctl start rke2-server
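The workaround steps above can be collected into a single script. This is a sketch under stated assumptions, not a supported tool: the DRY_RUN variable and the run and workaround helpers are illustrative names introduced here. By default the script only prints the commands from this article. It would need to be run as root on an RKE2 server node with DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: the workaround steps from this article in one script.
# Defaults to a dry run that only prints each command.
set -eu

DRY_RUN="${DRY_RUN:-1}"
TLS_DIR="/var/lib/rancher/rke2/server/tls"

# run is a hypothetical helper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

workaround() {
    run systemctl stop rke2-server
    # Paths are spelled out instead of using {crt,key}, since brace
    # expansion is not available in plain POSIX sh.
    for component in kube-controller-manager kube-scheduler; do
        run rm "$TLS_DIR/$component/$component.crt" \
               "$TLS_DIR/$component/$component.key"
    done
    run rke2 certificate rotate
    run systemctl start rke2-server
}

workaround
```

Running the script as-is previews the five commands in order. Reviewing that output before setting DRY_RUN=0 is a cheap safeguard, since the rm step is not reversible.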