RKE2 cluster provisioning fails with failing kube-apiserver health checks due to an inability to resolve localhost
Article Number: 000021462
Environment
- Rancher v2.6+
- A Rancher-provisioned RKE2 cluster
Situation
Cluster component Pods in the affected downstream RKE2 cluster show a high number of restarts:
NAMESPACE NAME READY STATUS RESTARTS
cattle-fleet-system fleet-agent-cc8c97f97-bvx78 1/1 Running 185
cattle-system cattle-cluster-agent-b1460cbd-8ct5c 1/1 Running 115
cattle-system cattle-cluster-agent-b1460cbd-l2l8l 1/1 Running 168
kube-system kube-apiserver-cluster-suse-cp-f777105c-2qgvh 0/1 Running 314
kube-system kube-controller-manager-cluster-suse-cp-5c-2qgvh 1/1 Running 491
kube-system cloud-controller-manager-cluster-suse-cp-5c-2qgvh 1/1 Running 501
The kube-apiserver Pod flaps between a Ready and NotReady status:
NAMESPACE NAME READY STATUS RESTARTS
kube-system kube-apiserver-cluster-suse-cp-f777105c-2qgvh 0/1 Running 314
The kubelet logs record failing liveness probes against the kube-apiserver.
Cause
The /etc/hosts file on the node was empty and contained no localhost entry, so name resolution of localhost failed for the kube-apiserver liveness probes.
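For reference, a minimal /etc/hosts that satisfies the probe's name resolution contains entries mapping localhost to the loopback addresses, for example:

127.0.0.1   localhost
::1         localhost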
Resolution
1. Enable kubelet debug logging:
   - Navigate to Cluster Management
   - Click Edit Config for the affected downstream RKE2 cluster
   - Click the Advanced tab in the Cluster Configuration form
   - Under Additional Kubelet Args, click Add Global Argument
   - In the new argument field, enter v=9
   - Click Save
2. Replicate the liveness probe and check the kubelet logs:
   - Open an SSH session to a master node in the affected downstream RKE2 cluster
   - Check the kubelet log for failing kube-apiserver liveness probes:

     tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log | grep kube-apiserver

   - Execute the following command to simulate the liveness probe for the kube-apiserver Pod, which should fail if encountering the issue:

     /var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock exec $(/var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps | grep kube-apiserver | awk '{print $1}') kubectl get --server=https://localhost:6443/ --client-certificate=/var/lib/rancher/rke2/server/tls/client-kube-apiserver.crt --client-key=/var/lib/rancher/rke2/server/tls/client-kube-apiserver.key --certificate-authority=/var/lib/rancher/rke2/server/tls/server-ca.crt --raw=/livez

3. Fix the host or host template to ensure a valid /etc/hosts file is present, with an entry mapping localhost to 127.0.0.1, as expected.
4. Perform the simulated liveness probe for the kube-apiserver again, replacing localhost with 127.0.0.1, which should succeed:

   /var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock exec $(/var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps | grep kube-apiserver | awk '{print $1}') kubectl get --server=https://127.0.0.1:6443/ --client-certificate=/var/lib/rancher/rke2/server/tls/client-kube-apiserver.crt --client-key=/var/lib/rancher/rke2/server/tls/client-kube-apiserver.key --certificate-authority=/var/lib/rancher/rke2/server/tls/server-ca.crt --raw=/livez
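After correcting /etc/hosts in step 3, the fix can be sanity-checked directly on the node. The sketch below, assuming a GNU/Linux host, checks a hosts file for the required localhost mapping; for illustration it writes a temporary sample file, so on an affected node you would point HOSTS_FILE at /etc/hosts instead:

```shell
# Minimal sketch: verify that a hosts file maps localhost to 127.0.0.1.
# HOSTS_FILE points at a temporary sample here; use /etc/hosts on a real node.
HOSTS_FILE=$(mktemp)
cat > "$HOSTS_FILE" <<'EOF'
127.0.0.1   localhost
::1         localhost
EOF

# The probe only needs localhost to resolve to the loopback address.
if grep -Eq '^127\.0\.0\.1[[:space:]]+.*localhost' "$HOSTS_FILE"; then
  result="OK: localhost entry present"
else
  result="MISSING: add '127.0.0.1 localhost' to the hosts file"
fi
echo "$result"
rm -f "$HOSTS_FILE"
```

On the node itself, `getent hosts localhost` is an equivalent one-line check, since it exercises the same files-based name resolution the liveness probe depends on.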