etcd snapshots failing to save in RKE2 <v1.25.15, <v1.26.10, <v1.27.7 and <v1.28.3 due to large ConfigMap error
Article Number: 000021272
Environment
RKE2 <v1.25.15, <v1.26.10, <v1.27.7 and <v1.28.3
Situation
At some point, etcd snapshots may start failing to complete. When this happens, the rke2-server.service logs show an error like:
level=error msg="failed to save local snapshot data to configmap: ConfigMap \"rke2-etcd-snapshots\" is invalid: []: Too long: must have at most 1048576 bytes"
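To check whether a node is hitting this error, the service logs can be filtered for the message above. The following is a minimal sketch; the helper name count_snapshot_errors is hypothetical and not part of RKE2.

```shell
# Count log lines on stdin that contain the ConfigMap save error.
# grep -c exits non-zero when nothing matches, so "|| true" keeps the
# helper usable in scripts that run with "set -e".
count_snapshot_errors() {
  grep -c 'failed to save local snapshot data to configmap' || true
}
```

On a server node it could be fed from the journal, for example: journalctl -u rke2-server.service --no-pager | count_snapshot_errors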
Cause
If the combination of etcd node count and snapshot retention count is too high, the rke2-etcd-snapshots ConfigMap grows beyond the 1 MiB (1048576 bytes) limit shown in the error, and the rke2-server process can no longer save it.
This issue was tracked in https://github.com/rancher/rke2/issues/4495
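To see how close the ConfigMap is to the limit, its serialized size can be compared against the 1048576 bytes from the error message. This is a rough sketch: the helper name report_configmap_size is hypothetical, and the JSON output of kubectl includes metadata and formatting, so the figure only approximates what the API server validates.

```shell
# Limit taken from the error message in the Situation section.
LIMIT=1048576

# Read a serialized ConfigMap on stdin, print its size next to the limit,
# and return non-zero when the size has reached the limit.
report_configmap_size() {
  size=$(wc -c | tr -d ' ')
  echo "size=${size} bytes, limit=${LIMIT} bytes"
  [ "$size" -lt "$LIMIT" ]
}
```

Example usage against a cluster: kubectl get configmap rke2-etcd-snapshots -n kube-system -o json | report_configmap_size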
Resolution
This issue is fixed in v1.28.3, and the fix has been backported to v1.25.15, v1.26.10, and v1.27.7.
If an upgrade is not possible, the following steps can be taken to manually clean up the ConfigMap:
- Save copies of local etcd snapshots to another folder as a precaution.
- Reduce the etcd snapshot retention count in the downstream cluster configuration and temporarily disable S3 backups.
- Edit the 'rke2-etcd-snapshots' ConfigMap in the 'kube-system' namespace on the downstream cluster and remove the values beneath the data field:
kubectl edit ConfigMap -n kube-system rke2-etcd-snapshots
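The backup and cleanup steps above could also be scripted. The sketch below rests on assumptions: both function names are hypothetical, and the kubectl patch call is a non-interactive alternative to kubectl edit that removes the entire data field rather than individual values.

```shell
# Copy every file from a snapshot directory into a backup directory.
backup_snapshots() {
  src=$1
  dest=$2
  mkdir -p "$dest" && cp -a "$src"/. "$dest"/
}

# Non-interactive alternative to "kubectl edit": drop the whole data field.
# Assumption: removing /data has the same effect as deleting the values by hand.
# The patch fails if the ConfigMap has no data field.
clear_snapshot_configmap() {
  kubectl patch configmap rke2-etcd-snapshots -n kube-system \
    --type=json -p '[{"op":"remove","path":"/data"}]'
}
```

A possible invocation on a server node is backup_snapshots /var/lib/rancher/rke2/server/db/snapshots /root/etcd-snapshot-backup followed by clear_snapshot_configmap. The source path assumes the default RKE2 snapshot directory, and the destination path is only an example; adjust both to the environment.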