Cluster Stuck in “Paused” State After Disaster Recovery (DR) Process
Article Number: 000021399
Environment
Rancher Server 2.7.6 and above
Situation
In certain cases, a downstream cluster may enter a broken state that requires a Disaster Recovery (DR) process to restore it to an active state.
However, the DR process may occasionally fail to complete successfully, becoming stuck indefinitely. When this happens, the cluster enters a “paused” state.
This condition can be verified by inspecting the clusters.cluster.x-k8s.io object in the fleet-default namespace of the local (upstream) cluster:
kubectl get clusters.cluster.x-k8s.io <CLUSTER_NAME> -n fleet-default -o yaml
In the output, the following field is set to true:
spec:
  paused: true
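For a quicker check than reading the full YAML, the field can be queried directly with a JSONPath expression. The following is a minimal sketch, assuming kubectl is pointed at the local (upstream) cluster; the function name is_cluster_paused is a hypothetical helper, not part of Rancher or kubectl.

```shell
# Hypothetical helper: prints the value of spec.paused for a Cluster API
# cluster object. Assumes the current kubeconfig context targets the
# local (upstream) Rancher cluster.
# Usage: is_cluster_paused <cluster-name> [namespace]
is_cluster_paused() {
  local name="$1"
  local namespace="${2:-fleet-default}"   # namespace named in this article
  kubectl get clusters.cluster.x-k8s.io "$name" -n "$namespace" \
    -o jsonpath='{.spec.paused}'
}
```

Calling is_cluster_paused with the cluster name prints true for a paused cluster. If the field is not set on the object, the JSONPath query prints nothing.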
Cause
The issue typically occurs due to one of the following:
- An unexpected incident (e.g., network interruption, OS failure, etc.) leading the cluster into a broken state.
- A complete outage rendering all Control Plane nodes unavailable.
Resolution
To recover the cluster from the paused state:
- Edit the clusters.cluster.x-k8s.io object in the fleet-default namespace on the local (upstream) cluster:
kubectl edit clusters.cluster.x-k8s.io <CLUSTER_NAME> -n fleet-default
- Locate the following field:
spec:
  paused: true
- Change paused to false, then save and exit the editor:
spec:
  paused: false
These steps will instruct Rancher to unpause the cluster, allowing the restore process to continue.
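Where an interactive editor is inconvenient (for example, in a runbook or script), the same field change can be applied with kubectl patch. This is a sketch of an alternative to the kubectl edit step, not a procedure taken from this article; the function name unpause_cluster is hypothetical, and it assumes kubectl targets the local (upstream) cluster.

```shell
# Hypothetical helper: sets spec.paused to false on a Cluster API cluster
# object using a JSON merge patch. Custom resources do not support
# strategic merge patches, so --type merge is specified explicitly.
# Usage: unpause_cluster <cluster-name> [namespace]
unpause_cluster() {
  local name="$1"
  local namespace="${2:-fleet-default}"   # namespace named in this article
  kubectl patch clusters.cluster.x-k8s.io "$name" -n "$namespace" \
    --type merge -p '{"spec":{"paused":false}}'
}
```

After running unpause_cluster with the cluster name, re-inspect the object to confirm that paused is no longer true.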
Once the cluster resumes activity, it is recommended to re-run the DR process to ensure the cluster is fully recovered.
For detailed guidance, refer to the official Rancher Manager Backup and Restore documentation for your specific distribution.