How to recover a Rancher-provisioned RKE2 or K3s cluster after misconfiguring agent proxy variables

This document (000021951) is provided subject to the disclaimer at the end of this document.

Environment

A Rancher-provisioned RKE2 or K3s cluster, in which a proxy is required for the downstream cluster to connect to Rancher

Situation

In a Rancher-provisioned RKE2 or K3s cluster, in which a proxy is required for the downstream cluster to connect to Rancher, the rancher-system-agent relies on proxy environment variables to communicate with the Rancher management server. These variables are typically configured via the "Agent Environment Variables" section in the Rancher UI Cluster Configuration for a given cluster.

If an incorrect or unreachable proxy is configured in this section after the cluster is already registered and operational, communication between Rancher and the downstream cluster breaks. Even if the correct proxy settings are later re-applied in the Rancher UI, the cluster remains disconnected and unmanageable through Rancher, because the rancher-system-agent on each node is disconnected and cannot receive or apply the update.

Resolution

To restore communication between Rancher and the downstream cluster:

1. Update Proxy Settings in the Rancher UI

Navigate to the affected cluster in the Cluster Management section of the Rancher UI and ensure that the correct proxy settings are configured under:

Cluster > Edit Config > Agent Environment Variables

Ensure the following variables are correctly defined:

HTTP_PROXY=http://<your-proxy>:<port>
HTTPS_PROXY=https://<your-proxy>:<port>
NO_PROXY=localhost,127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local,0.0.0.0,cattle-system.svc

2. Manually Fix Proxy Settings on one Control Plane Node

SSH into one of the control plane nodes of the affected downstream cluster, and update the environment file used by the rancher-system-agent.

a. Edit the file:

sudo vi /etc/systemd/system/rancher-system-agent.env

b. Add or update the following lines:

http_proxy=http://<your-correct-proxy>:<port>
https_proxy=http://<your-correct-proxy>:<port>
NO_PROXY=localhost,127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local,0.0.0.0,cattle-system.svc

c. Restart the agent:

sudo systemctl restart rancher-system-agent

Once the agent on this node can communicate with Rancher, it will allow Rancher to re-establish connectivity and trigger the appropriate upgrade plan via the system-upgrade-controller.
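Steps a to c can also be scripted rather than edited by hand. The following is a minimal sketch, not an official SUSE tool: set_env_var is a hypothetical helper name, and it assumes the environment file uses plain KEY=VALUE lines as shown above. It replaces an existing line for a key, or appends one if the key is absent, and leaves all other lines untouched.

```shell
#!/bin/sh
# Hypothetical helper (not part of Rancher): set KEY=VALUE in an env file.
# Any existing line for KEY is dropped and the new assignment is appended.
set_env_var() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    # Copy every line except an existing assignment for this key.
    # The trailing "=" keeps http_proxy from also matching https_proxy.
    grep -v "^${key}=" "$file" > "$tmp" 2>/dev/null
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back through the original path so ownership and mode are kept.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example usage on the control plane node (run as root; the proxy URL is a
# placeholder to be replaced with the correct value):
#   set_env_var /etc/systemd/system/rancher-system-agent.env http_proxy "http://<your-correct-proxy>:<port>"
#   set_env_var /etc/systemd/system/rancher-system-agent.env https_proxy "http://<your-correct-proxy>:<port>"
#   systemctl restart rancher-system-agent
```

The updated key ends up as the last line of the file. Taking a backup copy of the file before modifying it is a sensible precaution.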

3. Automatic Rollout to Remaining Nodes

Once Rancher regains contact with one control plane node, it can deploy a system-upgrade-controller job to roll out the updated configuration to the remaining nodes. This happens even if the other nodes are temporarily disconnected, as Rancher can now orchestrate the changes via Kubernetes Jobs.

Cause

The root cause of this issue is that once a non-functional or invalid proxy is applied through the Agent Environment Variables setting, it is persisted in the local system environment of the rancher-system-agent on each node.

Even if the correct proxy settings are later configured in the Rancher UI, the rancher-system-agent processes cannot automatically reload or overwrite their existing environment file (/etc/systemd/system/rancher-system-agent.env), because they are disconnected from Rancher due to the incorrect proxy settings.
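To confirm which proxy values are actually persisted on a node, the environment file can be inspected directly. The sketch below assumes the file contains plain KEY=VALUE lines; show_proxy_vars is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the proxy-related assignments from an env file.
show_proxy_vars() {
    # -i matches both upper-case and lower-case variable names.
    grep -iE '^(http_proxy|https_proxy|no_proxy)=' "$1"
}

# Example usage on a node:
#   show_proxy_vars /etc/systemd/system/rancher-system-agent.env
```

If the output still shows the incorrect proxy after the settings were corrected in the Rancher UI, the agent on that node has not received the update.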

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.