Node Removal After Changing Cloud Provider to vSphere in an Existing RKE2 Cluster

This document (000021870) is provided subject to the disclaimer at the end of this document.

Environment

Applies to all versions of:

RKE2
K3s

Situation

After the cloud provider configuration of an existing RKE2 cluster was changed from the default (no explicit cloud provider) to vsphere, some nodes became unavailable in the Rancher UI and appeared in the nodenotfound state.

The following log entries were recorded in the vSphere Cloud Provider Interface (CPI) logs:

E0514 06:31:06.975972   1 datacenter.go:128] Unable to find VM by UUID. VM UUID: rke2://<NODE_IDENTIFIER>
E0514 06:31:06.975992   1 search.go:181] Error while looking for vm=rke2://<NODE_IDENTIFIER>(byUUID) in vc=<VSPHERE_VCENTER> and datacenter=<VSPHERE_DATACENTER>: No VM found
I0514 06:31:06.976000   1 search.go:186] Did not find node rke2://<NODE_IDENTIFIER> in vc=<VSPHERE_VCENTER> and datacenter=<VSPHERE_DATACENTER>
I0514 06:31:06.983134   1 instances.go:177] instances.InstanceExistsByProviderID() NOT CACHED for node uid "rke2://<NODE_IDENTIFIER>"
I0514 06:31:06.983150   1 node_lifecycle_controller.go:164] deleting node since it is no longer present in cloud provider: <NODE_IDENTIFIER>
I0514 06:31:06.983268   1 event.go:389] "Event occurred" object="<NODE_IDENTIFIER>" fieldPath="" kind="Node" apiVersion="" type="Normal" reason="DeletingNode" message="Deleting node <NODE_IDENTIFIER> because it does not exist in the cloud provider"

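These messages indicate that the vSphere CPI tried to locate each node's virtual machine by the node's providerID, and that nodes whose providerID carries the rke2:// prefix could not be matched to any virtual machine in the vSphere environment. The CPI then deleted those nodes because, from its point of view, they do not exist in the cloud provider.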
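When many nodes are affected, it can help to pull the node names out of a saved copy of the CPI log. The following is a minimal Python sketch, not a supported tool. It assumes the log text has already been saved, and it matches only the node_lifecycle_controller message shown in the excerpt above. The function name and the sample node name worker-1 are hypothetical.

```python
import re

# Matches the "deleting node" message from node_lifecycle_controller.go
# exactly as it appears in the log excerpt above.
DELETE_RE = re.compile(
    r"deleting node since it is no longer present in cloud provider: (?P<node>\S+)"
)


def nodes_marked_for_deletion(log_text: str) -> list[str]:
    """Return the node names the CPI reported deleting, in first-seen order."""
    seen: list[str] = []
    for line in log_text.splitlines():
        match = DELETE_RE.search(line)
        if match and match.group("node") not in seen:
            seen.append(match.group("node"))
    return seen


# Hypothetical sample: one line in the same shape as the excerpt above,
# with a made-up node name in place of <NODE_IDENTIFIER>.
sample_log = (
    "I0514 06:31:06.983150   1 node_lifecycle_controller.go:164] "
    "deleting node since it is no longer present in cloud provider: worker-1\n"
)
print(nodes_marked_for_deletion(sample_log))
```

Each name is returned only once, even if the CPI logged the deletion message for it repeatedly.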

Resolution

Changing the cloud provider on an existing RKE2 cluster is not supported. The cloud-provider parameter must be defined during the initial provisioning of the cluster. Changing it afterward can cause existing nodes to be deleted from the cluster.

When the cloud provider is changed to vsphere post-deployment:

The vSphere CPI only recognizes and manages nodes with a providerID that begins with vsphere://.
Nodes previously registered with providerID: rke2:// are not recognized by the vSphere CPI.
The Kubernetes Node Lifecycle Controller, informed by the CPI, treats these nodes as non-existent in the cloud infrastructure and removes them from the cluster.
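The prefix rule in the points above can be illustrated with a short Python sketch that lists the nodes the vSphere CPI would not recognize. It is an illustration under assumptions rather than a supported procedure: the input is assumed to be the parsed output of kubectl get nodes -o json, the function name is hypothetical, and the sample node names and IDs are invented. Nodes with no providerID at all are also flagged, which is a choice made for this sketch.

```python
import json

VSPHERE_PREFIX = "vsphere://"


def nodes_without_vsphere_provider_id(node_list: dict) -> list[tuple[str, str]]:
    """Return (node name, providerID) pairs whose providerID lacks the vsphere:// prefix."""
    flagged = []
    for item in node_list.get("items", []):
        name = item["metadata"]["name"]
        # spec.providerID may be absent; treat that as an empty string.
        provider_id = item.get("spec", {}).get("providerID", "")
        if not provider_id.startswith(VSPHERE_PREFIX):
            flagged.append((name, provider_id))
    return flagged


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
sample_nodes = json.loads("""
{
  "items": [
    {"metadata": {"name": "server-1"}, "spec": {"providerID": "rke2://server-1"}},
    {"metadata": {"name": "worker-1"}, "spec": {"providerID": "vsphere://example-vm-uuid"}}
  ]
}
""")

for name, provider_id in nodes_without_vsphere_provider_id(sample_nodes):
    print(f"{name}: {provider_id!r} would not be recognized by the vSphere CPI")
```

Running a check like this before any cloud provider change shows which nodes would be affected. It does not make the change supported.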

To restore cluster health, affected nodes must be replaced with newly provisioned nodes that are configured with the correct cloud provider from the outset. These new nodes will be registered with the correct providerID format and managed successfully by the vSphere CPI.

Cause

The vSphere CPI expects each node to be registered with a providerID that begins with vsphere://. Nodes originally registered with the RKE2 default behavior (i.e., without a cloud provider) have a providerID that begins with rke2://. These identifiers do not match any VM in vSphere. As a result, the CPI reports the nodes as non-existent, and the Kubernetes Node Lifecycle Controller deletes them from the cluster.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.