
Troubleshooting etcd snapshot restore blocked by webhook configurations

Article Number: 000021977

Environment

  • Rancher v2.x
  • A Rancher-provisioned or standalone RKE2 or K3s cluster
  • Policy Engines/Webhooks: Kyverno, Kubewarden, OPA Gatekeeper, NeuVector, etc.

Situation

When attempting to restore an etcd snapshot on an RKE2 or K3s cluster, the restoration process fails to complete or enters an infinite loop.

Symptoms may include:

  • The Rancher UI showing the cluster in a perpetual "Restoring" or "Updating" state.
  • kubelet or kube-apiserver logs showing "failed calling webhook", timeout, or "connection refused" errors related to webhook calls.

Cause

This issue occurs due to a circular dependency (deadlock) involving ValidatingWebhookConfiguration or MutatingWebhookConfiguration resources:

  1. During a restore, after the etcd state itself has been restored, the kube-apiserver and kubelet are started and begin reconciling Pods and the other cluster resources stored in the snapshot.
  2. If the snapshot includes webhook configurations, the kube-apiserver sends validating/mutating admission requests to the associated webhook Pods (e.g., Kyverno or Gatekeeper) for every matching API request, including requests made by the kubelet and other system components.
  3. However, those Pods have not yet started or reached a "Ready" state, because the cluster restoration is still in progress. This is particularly the case for a Disaster Recovery (DR) restore, in which the cluster is started from scratch.
  4. Unable to reach the webhook, the kube-apiserver denies the creation or update of required system resources (depending on the webhook's failurePolicy), effectively blocking the cluster from ever reaching a healthy state. The check sketched below shows which webhooks enforce a blocking failure policy.
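
Whether a given webhook blocks requests in this situation depends on its failurePolicy: Fail rejects the request when the webhook is unreachable, while Ignore lets it through. A minimal check, assuming kubectl access to the cluster (for example before the restore, or on a healthy cluster), lists each configuration with its failure policy:

kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration \
  -o custom-columns='NAME:.metadata.name,FAILURE_POLICY:.webhooks[*].failurePolicy'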

Resolution

To resolve this, temporarily remove the webhook configurations during the restore process so that the kube-apiserver and kubelet can initialize the cluster components without being blocked.

1. Prepare Environment

SSH to the control plane (server) node where the restore is being performed and configure the environment:

RKE2:

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
alias kubectl=/var/lib/rancher/rke2/bin/kubectl

K3s:

No action is required; K3s symlinks kubectl into the PATH and uses /etc/rancher/k3s/k3s.yaml by default when run as root.
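
On either distribution, an optional sanity check confirms that kubectl can reach the API server; note that this may fail or time out while the restore is still in progress:

kubectl get --raw /readyz
kubectl get nodes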

2. Backup Existing Webhook Configurations

Before the restore reaches the blocking stage, or immediately after starting it, capture the current webhook configurations to a local directory:

kubectl get validatingwebhookconfiguration -o yaml > all-validating.yaml
kubectl get mutatingwebhookconfiguration -o yaml > all-mutating.yaml
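
As an optional sanity check, confirm the backup files are not empty, for example by counting the captured configurations:

grep -c 'kind: ValidatingWebhookConfiguration' all-validating.yaml
grep -c 'kind: MutatingWebhookConfiguration' all-mutating.yaml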

3. Clear Webhooks During Restore

While it is possible to delete all webhooks to ensure the restore completes (Option B below), a targeted deletion (Option A) is safer, as it preserves system-critical webhooks while removing only the blocking third-party policy-engine webhooks.

Identifying the Blocking Webhooks

Check the kubelet and kube-apiserver logs on the server nodes for "failed calling webhook" or "connection refused" errors to identify which specific webhook configurations are causing the deadlock.

Distribution    Component         Log Location / Command
RKE2            kube-apiserver    kubectl logs -n kube-system -l component=kube-apiserver -f
RKE2            kubelet           /var/lib/rancher/rke2/agent/logs/kubelet.log
K3s             combined          journalctl -u k3s -f
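
For example, the following searches (using the log locations above) surface the failing webhook names in the error messages; adjust the paths if your installation differs:

# RKE2: check the kubelet log on the server node
grep -i "failed calling webhook" /var/lib/rancher/rke2/agent/logs/kubelet.log
# K3s: check the combined service log
journalctl -u k3s --no-pager | grep -i "failed calling webhook"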

Option A: Targeted Deletion (Recommended)

Modify the script below by replacing the placeholder names in the TARGET_WEBHOOKS array with the specific webhooks identified from your logs:

#!/bin/bash
# Add specific webhook names here (e.g., "kyverno-resource-validating-webhook-configuration")
TARGET_WEBHOOKS=("webhook-name-1" "webhook-name-2")

# Identify kubectl and KUBECONFIG path
if [ -f /etc/rancher/rke2/rke2.yaml ]; then
    export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
    KUBECTL=/var/lib/rancher/rke2/bin/kubectl
elif [ -f /etc/rancher/k3s/k3s.yaml ]; then
    export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
    KUBECTL="/usr/local/bin/k3s kubectl"
else
    KUBECTL=$(command -v kubectl)
fi

echo "Using kubectl at: $KUBECTL"
echo "Using KUBECONFIG at: $KUBECONFIG"
echo "Monitoring and removing specific blocking webhooks..."

# $KUBECTL is left unquoted below so that the K3s value ("k3s kubectl") expands into two words
while true; do
  for name in "${TARGET_WEBHOOKS[@]}"; do
    # Check Validating
    if $KUBECTL get validatingwebhookconfiguration "$name" &>/dev/null; then
        echo "Deleting validating webhook: $name"
        $KUBECTL delete validatingwebhookconfiguration "$name" --timeout=5s
    fi
    # Check Mutating
    if $KUBECTL get mutatingwebhookconfiguration "$name" &>/dev/null; then
        echo "Deleting mutating webhook: $name"
        $KUBECTL delete mutatingwebhookconfiguration "$name" --timeout=5s
    fi
  done
  sleep 2
done
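
The script can be saved on the server node (the path below is only an example) and run in a second SSH session while the restore proceeds; it keeps looping until stopped with Ctrl+C:

chmod +x /tmp/remove-blocking-webhooks.sh
/tmp/remove-blocking-webhooks.sh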

Option B: Bulk Deletion

Use this script if the cluster is critically blocked and individual identification is not feasible:

#!/bin/bash

# Identify kubectl and KUBECONFIG path
if [ -f /etc/rancher/rke2/rke2.yaml ]; then
    export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
    KUBECTL=/var/lib/rancher/rke2/bin/kubectl
elif [ -f /etc/rancher/k3s/k3s.yaml ]; then
    export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
    KUBECTL="/usr/local/bin/k3s kubectl"
else
    KUBECTL=$(command -v kubectl)
fi

echo "Using kubectl at: $KUBECTL"
echo "Using KUBECONFIG at: $KUBECONFIG"
echo "Monitoring and removing webhooks to allow restoration..."

# $KUBECTL is left unquoted below so that the K3s value ("k3s kubectl") expands into two words
while true; do
  # Delete Mutating Webhooks
  for hook in $($KUBECTL get mutatingwebhookconfiguration -o name 2>/dev/null); do
    echo "Deleting $hook"
    $KUBECTL delete "$hook" --timeout=5s
  done

  # Delete Validating Webhooks
  for hook in $($KUBECTL get validatingwebhookconfiguration -o name 2>/dev/null); do
    echo "Deleting $hook"
    $KUBECTL delete "$hook" --timeout=5s
  done

  sleep 2
done

Note: Keep the script running until the server logs indicate that the restore has completed and nodes are "Ready". Press Ctrl+C to stop the script.

4. Restore Webhook Configurations

Once the cluster is healthy and the policy engine pods are running, re-create the configurations from your backups:

kubectl create -f all-validating.yaml
kubectl create -f all-mutating.yaml
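
To verify, list the configurations again and check that the policy engine pods are Ready; the namespace below is only an example and should be adjusted to the policy engine in use. Note that some policy engines (for example, Kyverno) manage their own webhook configurations and may have already re-created them, in which case the kubectl create commands above report AlreadyExists errors that can be safely ignored:

kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration
kubectl get pods -n kyverno    # example namespace; adjust for your policy engine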