Downstream clusters flapping between available and unavailable state
This document (000020416) is provided subject to the disclaimer at the end of this document.
Environment
Rancher version: v2.5.6
Management cluster K8S version: 1.21.5
Situation
After upgrading the Kubernetes version of the Rancher management cluster, the downstream cluster status in the WebUI flaps between the available and unavailable states.
The Rancher Pod logs show errors like the following:
Failed to connect to peer wss://x.x.x.x/v3/connect [local ID=y.y.y.y]: websocket: bad handshake
Resolution
Upgrade Rancher to v2.6.x
As a workaround until Rancher is upgraded, reduce the Rancher deployment to a single replica.
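The workaround can be sketched as a small Python wrapper around kubectl. The namespace "cattle-system" and the deployment name "rancher" are assumptions based on a default Helm installation and are not stated in this article; adjust them to match your environment. The helper names are hypothetical, and the script only prints the command unless dry_run is set to False.

```python
import shlex
import subprocess

# Assumptions (not stated in the article): Rancher was installed with the
# Helm chart defaults, i.e. namespace "cattle-system", deployment "rancher".
NAMESPACE = "cattle-system"
DEPLOYMENT = "rancher"


def scale_command(replicas, namespace=NAMESPACE, deployment=DEPLOYMENT):
    """Build the kubectl argv that sets the deployment's replica count."""
    return [
        "kubectl", "--namespace", namespace,
        "scale", "deployment", deployment,
        f"--replicas={replicas}",
    ]


def apply_workaround(dry_run=True):
    """Print the scale-to-one command; run it only when dry_run is False."""
    cmd = scale_command(1)
    print(shlex.join(cmd))
    if not dry_run:
        # Requires kubectl on PATH and a kubeconfig for the management cluster.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    apply_workaround(dry_run=True)
```

With a single replica there are no peer Rancher instances to connect to, so the failing peer handshake no longer occurs. Restore the original replica count once Rancher has been upgraded.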
Cause
Rancher stores the service account token from the initial Pod and then tries to reuse it on subsequent requests, even after that Pod has been deleted.
As of Kubernetes v1.21, service account tokens are Pod-specific and are invalidated when the Pod is deleted. Rancher can therefore no longer use the stored token and is unable to reach the other Rancher replica instances over WebSocket.
Additional Information
This issue is tracked in GitHub issue 26082.
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.