Downstream clusters flapping between available and unavailable state

This document (000020416) is provided subject to the disclaimer at the end of this document.

Environment

Rancher version: v2.5.6

Management cluster K8S version: 1.21.5

Situation

After upgrading the Kubernetes version of the Rancher management cluster, the downstream cluster status in the WebUI flaps between the available and unavailable states.

Rancher pod logs show errors like the following:

Failed to connect to peer wss://x.x.x.x/v3/connect [local ID=y.y.y.y]: websocket: bad handshake
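To check how often this error occurs, the Rancher pod logs can be scanned for it. The following is a minimal sketch, not an official tool: the pattern is modelled only on the sample line above, so the exact log format is an assumption, and the function name is hypothetical.

```python
import re

# Pattern modelled on the sample log line above; the real log format
# may carry extra prefixes (timestamps, log level), which are ignored here.
PEER_HANDSHAKE_ERROR = re.compile(
    r"Failed to connect to peer (?P<peer>wss://\S+/v3/connect) "
    r"\[local ID=(?P<local_id>[^\]]+)\]: websocket: bad handshake"
)


def find_peer_handshake_errors(log_text):
    """Return a (peer URL, local ID) pair for every matching error in log_text."""
    return [
        (match.group("peer"), match.group("local_id"))
        for match in PEER_HANDSHAKE_ERROR.finditer(log_text)
    ]
```

The function takes the log text as a string, so it can be fed from a saved log file or from the captured output of a log command. Many distinct peer addresses in the result would suggest that several Rancher replicas are affected.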

Resolution

Upgrade Rancher to v2.6.x

As a workaround until Rancher is upgraded, reduce the Rancher deployment to a single replica.
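The workaround can be sketched as a small wrapper around kubectl. This is an illustration under assumptions: it presumes Rancher runs as a deployment named rancher in the cattle-system namespace (typical for a Helm installation, but verify this in your environment), that kubectl is installed, and that the current context points at the Rancher management cluster. The function names are hypothetical.

```python
import subprocess


def scale_rancher_command(replicas=1, namespace="cattle-system", deployment="rancher"):
    """Build the kubectl argument list that scales the Rancher deployment.

    The namespace and deployment defaults are assumptions; adjust them
    to match the actual installation.
    """
    return [
        "kubectl", "-n", namespace,
        "scale", "deployment", deployment,
        f"--replicas={replicas}",
    ]


def scale_rancher(replicas=1, **kwargs):
    """Run the scale command; raises CalledProcessError if kubectl fails."""
    subprocess.run(scale_rancher_command(replicas, **kwargs), check=True)
```

Calling scale_rancher() with the defaults is equivalent to running kubectl -n cattle-system scale deployment rancher --replicas=1 by hand. Building the argument list separately makes it possible to print and review the command before executing it.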

Cause

Rancher stores the service account token from the initial pod and then tries to reuse it on subsequent requests, even after that pod has been deleted.

As of Kubernetes v1.21, service account tokens are bound to a specific pod and are invalidated when that pod is deleted. Rancher therefore cannot authenticate with the stored token and cannot reach the other Rancher replica instances over WebSocket.
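The pod binding can be observed by decoding the payload of a service account token, which is a JSON Web Token. The sketch below is illustrative only: it does not verify the signature, the helper names are hypothetical, and it assumes the bound-token layout in which the pod reference sits under a "kubernetes.io" claim.

```python
import base64
import json


def decode_token_claims(token):
    """Decode the payload (second segment) of a JWT without verifying it."""
    payload_b64 = token.split(".")[1]
    # JWT segments omit base64 padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def bound_pod_name(claims):
    """Return the name of the pod the token is bound to, or None if unbound."""
    return claims.get("kubernetes.io", {}).get("pod", {}).get("name")
```

Inside a pod, the token is normally mounted at /var/run/secrets/kubernetes.io/serviceaccount/token. If bound_pod_name returns the name of a pod that no longer exists, the API server rejects the token, which matches the behavior described above.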

Additional Information

The issue is tracked in GitHub issue 26082.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.