How to recover a Rancher-provisioned RKE2 or K3s cluster after misconfiguring agent proxy variables

Article Number: 000021951

Environment

A Rancher-provisioned RKE2 or K3s cluster, in which a proxy is required for the downstream cluster to connect to Rancher

Situation

In such a cluster, the rancher-system-agent on each node relies on proxy environment variables to communicate with the Rancher management server. These variables are typically configured in the "Agent Environment Variables" section of the cluster configuration in the Rancher UI.

If an incorrect or unreachable proxy is configured in this section after the cluster is already registered and operational, communication between Rancher and the downstream cluster breaks. Even if the correct proxy settings are later re-applied in the Rancher UI, the cluster remains disconnected and unmanageable through Rancher, because the rancher-system-agents are disconnected and cannot receive or apply the update.

Cause

Once a non-functional or invalid proxy is applied through the Agent Environment Variables setting, it is persisted in the local system environment of the rancher-system-agent on each node.

Even if the correct proxy settings are later configured in the Rancher UI, the rancher-system-agent processes cannot automatically reload or overwrite their existing environment file (/etc/systemd/system/rancher-system-agent.env), because they are disconnected from Rancher due to the incorrect proxy settings.

Resolution

To restore communication between Rancher and the downstream cluster:

1. Update Proxy Settings in the Rancher UI

Navigate to the affected cluster in the Cluster Management section of the Rancher UI and ensure that the correct proxy settings are configured under:

Cluster → Edit Config → Agent Environment Variables

Ensure the following variables are correctly defined:

HTTP_PROXY=http://<your-proxy>:<port>
HTTPS_PROXY=https://<your-proxy>:<port>
NO_PROXY=localhost,127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local,0.0.0.0,cattle-system.svc

2. Manually Fix Proxy Settings on One Control Plane Node

SSH into one of the control plane nodes of the affected downstream cluster, and update the environment file used by the rancher-system-agent.

a. Edit the file:

sudo vi /etc/systemd/system/rancher-system-agent.env

b. Add or update the following lines:

http_proxy=http://<your-correct-proxy>:<port>
https_proxy=http://<your-correct-proxy>:<port>
NO_PROXY=localhost,127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local,0.0.0.0,cattle-system.svc
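
Where interactive editing is inconvenient, the change in step b can also be scripted. The following is a minimal sketch, not part of Rancher's tooling: set_env_var is a hypothetical helper name, and it assumes the file consists of plain KEY=VALUE lines like those shown above. It keeps a .bak copy of the file, replaces any existing assignment of the given key, and leaves all other lines untouched.

```shell
# Hypothetical helper (not a Rancher tool): set KEY=VALUE in an env file.
# Usage: set_env_var FILE KEY VALUE
set_env_var() {
  file=$1
  key=$2
  value=$3
  # Create the file if it is missing so the steps below do not fail.
  [ -f "$file" ] || : > "$file"
  # Keep a backup of the previous contents next to the original.
  cp -p "$file" "$file.bak" || return 1
  tmp=$(mktemp) || return 1
  # Copy every line except an existing assignment of this key.
  # grep exits non-zero when nothing is left to print, which is fine here.
  grep -v "^${key}=" "$file" > "$tmp" || true
  # Append the new assignment.
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back in place so ownership and permissions are preserved.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

From a root shell, this could be called once per variable, for example set_env_var /etc/systemd/system/rancher-system-agent.env http_proxy "http://<your-correct-proxy>:<port>". Matching is case-sensitive, so http_proxy and HTTP_PROXY are treated as separate keys.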

c. Restart the agent:

sudo systemctl restart rancher-system-agent
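
After the restart, it can be useful to confirm that the running agent actually picked up the corrected values. The sketch below is one possible check, under the assumption that the systemd unit loads the environment file into the agent's process environment. show_proxy_env is a hypothetical helper name; it prints the proxy-related entries from a NUL-separated environment dump such as /proc/<pid>/environ on Linux.

```shell
# Hypothetical helper: print proxy-related variables from a NUL-separated
# environment dump such as /proc/<pid>/environ on Linux.
# Usage: show_proxy_env FILE
show_proxy_env() {
  # Turn NUL separators into newlines, then keep only the proxy variables.
  # The match is case-insensitive, so HTTP_PROXY and http_proxy both appear.
  tr '\0' '\n' < "$1" | grep -i -E '^(http_proxy|https_proxy|no_proxy)='
}
```

From a root shell, the agent's PID can be obtained with systemctl show -p MainPID --value rancher-system-agent and the helper pointed at /proc/<that PID>/environ. Following the service logs with journalctl -u rancher-system-agent -f should then show whether the agent reaches Rancher through the corrected proxy.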

Once the agent on this node can communicate with Rancher, it will allow Rancher to re-establish connectivity and trigger the appropriate upgrade plan via the system-upgrade-controller.

3. Automatic Rollout to Remaining Nodes

Once Rancher regains contact with one control plane node, it can deploy a system-upgrade-controller job to roll out the updated configuration to the remaining nodes. This works even while the rancher-system-agents on those nodes are still disconnected, because Rancher can now orchestrate the changes via Kubernetes Jobs.