RKE2 upgrades causing data loss in applications deployed using Helm charts

Article Number: 000020726

Environment

  • On RKE2 versions below 1.21.12

Situation

Applications managed by Helm are uninstalled and reinstalled if they are in a broken state during an RKE2 upgrade. This behavior occurs on all upgrades of RKE2 versions below 1.21.12.

Cause

RKE2 upgrades its packaged components using bundled HelmChart manifests. These resources trigger Jobs that wrap the Helm CLI tool. Because all packaged components must be upgraded to ensure a functional system, any Helm releases that are stuck in an invalid state (Failed, Pending, etc.) at the time of the upgrade are uninstalled and reinstalled to reset the system to a known-good state.

If user-provided HelmChart manifests are used to deploy stateful applications where uninstallation of the Helm chart may cause data loss, this behavior may not be desired. For example, when Longhorn is deployed using a HelmChart manifest, an uninstall of the release will also delete all the Longhorn Custom Resources, potentially causing data loss. The actual volume content is not deleted, but Longhorn will lose the data mapping the content to Persistent Volumes.

Resolution

Recent releases of RKE2 allow customization of the Helm job behavior to reduce the probability of data loss when deploying stateful applications. Users may:

  • Set failurePolicy: abort on the HelmChart spec to tell Helm to leave the release in a failed state if the upgrade does not succeed.
  • Set the helmcharts.helm.cattle.io/unmanaged annotation on the HelmChart resource to prevent the Helm controller from acting on the chart at all, so that the HelmChart resource may be removed from the cluster without triggering uninstallation of the Helm release.
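
As a minimal sketch of the first option, a user-provided HelmChart manifest for a stateful application might look like the following. The file name, chart name, repository URL, and target namespace are illustrative assumptions and not taken from this article; only the failurePolicy field is the setting described above. The path shown assumes the default RKE2 manifests directory.

```yaml
# Hypothetical file: /var/lib/rancher/rke2/server/manifests/longhorn.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: longhorn                       # illustrative release name
  namespace: kube-system
spec:
  repo: https://charts.longhorn.io     # assumed chart repository
  chart: longhorn
  targetNamespace: longhorn-system     # assumed target namespace
  # Leave the release in a failed state instead of
  # uninstalling and reinstalling it when an upgrade fails.
  failurePolicy: abort
```

With this setting, a failed upgrade requires manual intervention, so it is best suited to charts where an automatic uninstall would be more harmful than a temporarily failed release.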

If you are currently experiencing data loss during upgrades, it may be necessary to perform a manual upgrade of the RKE2 cluster and coordinate the upgrade with changes to the HelmChart manifests to take advantage of the new features. To do so, perform the following steps in order.

NOTE: If you are not confident in following these steps, please open a ticket with the Rancher Support team to involve the engineering team for further assistance.

  1. Stop the rke2-server service on all server nodes.
  2. Upgrade the RKE2 binary or package to the latest patch release available for your current Kubernetes minor version.
  3. Update the affected manifests to add the new fields as necessary to obtain the desired behavior, on any nodes where the manifests are present. If no nodes contain the manifests, pick one node to deploy the manifests and place them on disk so that they are applied immediately during system startup. Details of the fields are explained below.
  4. Start the rke2-server service on all server nodes.

New Fields:

  • helmcharts.helm.cattle.io/unmanaged annotation on the HelmChart Custom Resource: HelmChart resources with this annotation present will not be processed by the Helm controller. Add this annotation if you plan to remove the HelmChart resources and begin managing the application via another method.
  • spec.failurePolicy on the HelmChart and HelmChartConfig Custom Resources: When the HelmChart or its corresponding HelmChartConfig sets the failurePolicy field to abort, a failed operation leaves the Helm release in a failed state. The administrator is expected to manually assess the failure and restore the release to a functional state, using commonly available Helm CLI tools.
  • spec.repoCA on the HelmChart Custom Resource: This new field allows the use of a private CA for the Helm repository. Use it when hosting charts on a server whose certificate is not issued by a public CA, in order to avoid certificate errors when installing or upgrading the chart.
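
The three fields above could be used roughly as follows. This is a sketch under stated assumptions, not a definitive configuration: the resource names, repository URLs, and certificate placeholder are hypothetical, the annotation value "true" is an assumption (the article only states that the annotation must be present), and repoCA is assumed to take the PEM-encoded CA certificate inline. The rke2-ingress-nginx name is used only as an example of a packaged chart that a HelmChartConfig might target.

```yaml
# 1. HelmChart that the Helm controller should no longer process.
#    The annotation's presence is what matters; the value is an assumption.
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: example-app                    # hypothetical name
  namespace: kube-system
  annotations:
    helmcharts.helm.cattle.io/unmanaged: "true"
spec:
  repo: https://charts.example.com     # hypothetical repository
  chart: example-app
---
# 2. HelmChartConfig that sets failurePolicy for an existing HelmChart
#    with the same name and namespace.
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-ingress-nginx             # example packaged chart
  namespace: kube-system
spec:
  failurePolicy: abort
---
# 3. HelmChart served from a repository that uses a private CA.
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: example-private-app            # hypothetical name
  namespace: kube-system
spec:
  repo: https://charts.internal.example.com   # hypothetical repository
  chart: example-private-app
  repoCA: |
    -----BEGIN CERTIFICATE-----
    <PEM-encoded CA certificate goes here>
    -----END CERTIFICATE-----
```

The unmanaged annotation and failurePolicy serve different goals: the first hands the release over to another management method entirely, while the second keeps the Helm controller in charge but stops it from reinstalling a failed release.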