Troubleshooting etcd snapshot restore blocked by webhook configurations
Article Number: 000021977
Environment
- Rancher v2.x
- A Rancher-provisioned or standalone RKE2 or K3s cluster
- Policy Engines/Webhooks: Kyverno, Kubewarden, OPA Gatekeeper, NeuVector, etc.
Situation
When attempting to restore an etcd snapshot on an RKE2 or K3s cluster, the restoration process fails to complete or enters an infinite loop.
Symptoms may include:
- The Rancher UI showing the cluster in a perpetual "Restoring" or "Updating" state.
- kubelet or kube-apiserver logs showing timeouts or connection refused errors when attempting to contact webhooks.
Cause
This issue occurs due to a circular dependency (deadlock) involving ValidatingWebhookConfiguration or MutatingWebhookConfiguration resources:
- During a restore, after the etcd state itself is restored, the kubelet and kube-apiserver are started and begin reconciling Pods and the other cluster resources stored in the snapshot.
- If the snapshot includes webhook configurations, the kube-apiserver attempts to send validation/mutation requests to the associated webhook Pods (e.g., Kyverno or Gatekeeper).
- However, those Pods have not yet been started or reached a "Ready" state, because the cluster restoration is still in progress. This is particularly the case in a Disaster Recovery (DR) restore, where the cluster is started from scratch.
- The kube-apiserver, unable to reach the webhook, denies the creation of required system resources (based on the webhook's failurePolicy; see the inspection commands after this list), effectively blocking the cluster from ever reaching a healthy state.
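The deciding factor is each webhook's failurePolicy: Fail blocks matching API requests while the webhook endpoint is unreachable, whereas Ignore lets them through. As a quick inspection (a sketch using only standard kubectl output formatting; it assumes nothing beyond API access), the stored policies can be listed as follows:
# List each webhook configuration with the failurePolicy of its webhook entries.
# Entries set to "Fail" on an unreachable endpoint are the ones that can block a restore.
kubectl get validatingwebhookconfigurations \
  -o custom-columns='NAME:.metadata.name,FAILURE_POLICY:.webhooks[*].failurePolicy'
kubectl get mutatingwebhookconfigurations \
  -o custom-columns='NAME:.metadata.name,FAILURE_POLICY:.webhooks[*].failurePolicy'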
Resolution
To resolve this, you must temporarily remove the webhook configurations during the restore process so that the kubelet and kube-apiserver can successfully initialize the cluster components without being blocked.
1. Prepare Environment
Via SSH, access the control plane (server) node where the restore is being performed, and configure the environment:
RKE2:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
alias kubectl=/var/lib/rancher/rke2/bin/kubectl
K3s:
No action required to configure kubectl.
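Before moving on, a quick sanity check confirms that kubectl can reach the API server (the commands below assume the paths above; while the restore is blocked, these calls may be slow or fail intermittently):
kubectl get nodes          # should list the server node(s)
kubectl cluster-info       # prints the API server endpoint in use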
2. Backup Existing Webhook Configurations
Before the restore reaches the blocking stage (or immediately after starting it), capture the current configurations to a local directory:
kubectl get validatingwebhookconfiguration -o yaml > all-validating.yaml
kubectl get mutatingwebhookconfiguration -o yaml > all-mutating.yaml
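The two commands above store everything in a single List per kind. As an optional refinement (a sketch using only standard kubectl features), each configuration can also be written to its own file, which makes selective re-creation in step 4 easier:
# Optional: one file per webhook configuration, for selective restore later
mkdir -p webhook-backups
for kind in validatingwebhookconfiguration mutatingwebhookconfiguration; do
  for name in $(kubectl get "$kind" -o name | cut -d/ -f2); do
    kubectl get "$kind" "$name" -o yaml > "webhook-backups/${kind}-${name}.yaml"
  done
done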
3. Clear Webhooks During Restore
While it is possible to delete all webhooks to ensure the restore completes, a targeted deletion (Option A) is safer, as it preserves system-critical webhooks while removing only the blocking third-party policy engines.
Identifying the Blocking Webhooks
Check the kubelet and kube-apiserver logs on the server nodes for "failed calling webhook" or "connection refused" errors to identify which specific webhook configurations are causing the deadlock.
| Distribution | Component      | Log Location / Command                                     |
|--------------|----------------|------------------------------------------------------------|
| RKE2         | kube-apiserver | kubectl logs -n kube-system -l component=kube-apiserver -f |
| RKE2         | kubelet        | /var/lib/rancher/rke2/agent/logs/kubelet.log               |
| K3s          | combined       | journalctl -u k3s -f                                       |
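For example, the following searches use the log locations from the table (the exact error text can vary slightly between Kubernetes versions):
# RKE2: search the kubelet log for admission webhook failures
grep -i "failed calling webhook" /var/lib/rancher/rke2/agent/logs/kubelet.log
# K3s: search the combined service journal
journalctl -u k3s --no-pager | grep -i "failed calling webhook"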
Option A: Targeted Deletion (Recommended)
Modify the script below by replacing the placeholder names in the TARGET_WEBHOOKS array with the specific webhooks identified from your logs:
#!/bin/bash
# Add specific webhook names here (e.g., "kyverno-resource-validating-webhook-configuration")
TARGET_WEBHOOKS=("webhook-name-1" "webhook-name-2")
# Identify kubectl and KUBECONFIG path
if [ -f /etc/rancher/rke2/rke2.yaml ]; then
  export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
  KUBECTL=/var/lib/rancher/rke2/bin/kubectl
elif [ -f /etc/rancher/k3s/k3s.yaml ]; then
  export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
  KUBECTL="/usr/local/bin/k3s kubectl"
else
  KUBECTL=$(command -v kubectl)
fi
echo "Using kubectl at: $KUBECTL"
echo "Using KUBECONFIG at: $KUBECONFIG"
echo "Monitoring and removing specific blocking webhooks..."
while true; do
  for name in "${TARGET_WEBHOOKS[@]}"; do
    # Check and delete the validating configuration, if present
    if $KUBECTL get validatingwebhookconfiguration "$name" &>/dev/null; then
      echo "Deleting validating webhook: $name"
      $KUBECTL delete validatingwebhookconfiguration "$name" --timeout=5s
    fi
    # Check and delete the mutating configuration, if present
    if $KUBECTL get mutatingwebhookconfiguration "$name" &>/dev/null; then
      echo "Deleting mutating webhook: $name"
      $KUBECTL delete mutatingwebhookconfiguration "$name" --timeout=5s
    fi
  done
  sleep 2
done
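Save the script on the server node and run it while the restore is in progress, for example (the filename is illustrative):
chmod +x clear-webhooks.sh   # hypothetical filename
./clear-webhooks.sh
The deletion runs in a loop because the configurations can be re-created while the snapshot state is reconciled or when the policy engine controllers start; a one-shot delete is often not enough.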
Option B: Bulk Deletion
Use this script if the cluster is critically blocked and individual identification is not feasible:
#!/bin/bash
# Identify kubectl and KUBECONFIG path
if [ -f /etc/rancher/rke2/rke2.yaml ]; then
  export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
  KUBECTL=/var/lib/rancher/rke2/bin/kubectl
elif [ -f /etc/rancher/k3s/k3s.yaml ]; then
  export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
  KUBECTL="/usr/local/bin/k3s kubectl"
else
  KUBECTL=$(command -v kubectl)
fi
echo "Using kubectl at: $KUBECTL"
echo "Using KUBECONFIG at: $KUBECONFIG"
echo "Monitoring and removing webhooks to allow restoration..."
while true; do
  # Delete mutating webhook configurations
  for hook in $($KUBECTL get mutatingwebhookconfiguration -o name 2>/dev/null); do
    echo "Deleting $hook"
    $KUBECTL delete "$hook" --timeout=5s
  done
  # Delete validating webhook configurations
  for hook in $($KUBECTL get validatingwebhookconfiguration -o name 2>/dev/null); do
    echo "Deleting $hook"
    $KUBECTL delete "$hook" --timeout=5s
  done
  sleep 2
done
Note: Keep the script running until the server logs indicate that the restore has completed and nodes are "Ready". Press Ctrl+C to stop the script.
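A simple way to watch for that state from the server node (standard kubectl, using the environment configured in step 1):
kubectl get nodes -w   # interrupt once all nodes report Ready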
4. Restore Webhook Configurations
Once the cluster is healthy and the policy engine Pods are running, re-create the configurations from your backups. Some policy engines re-create their webhook configurations automatically when their Pods start; in that case kubectl create fails with an AlreadyExists error, which can be safely ignored. If a create is instead rejected because exported metadata fields such as resourceVersion are still set, remove those fields from the backup file first.
kubectl create -f all-validating.yaml
kubectl create -f all-mutating.yaml
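Finally, it is worth verifying that the configurations are back and that the policy engine Pods are serving again (the namespace below is an example; substitute the one your policy engine runs in):
# Confirm the webhook configurations exist again
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
# Confirm the policy engine Pods are Ready (example namespace: kyverno)
kubectl get pods -n kyverno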