Increasing the RKE2 etcd snapshot S3 timeout
Article Number: 000022328
Environment
RKE2 with recurring S3 snapshots
Situation
The rke2-server journal (viewed with journalctl) shows "context deadline exceeded" errors when RKE2 attempts to reconcile snapshots against the S3 endpoint.
Jan 27 15:01:46 xxxx rke2[3120757]: time="2026-01-27T15:01:46+08:00" level=warning msg="Failed to get object metadata: Head \"https://xxx.yy.org/rancher-downstream/prod-cluster/.metadata/etcd-snapshot-xxxx.yy.org-1753833604.zip\": context deadline exceeded"
Jan 27 15:01:46 xxxx rke2[3120757]: time="2026-01-27T15:01:46+08:00" level=warning msg="Failed to get object metadata: context deadline exceeded"
Jan 27 15:01:46 xxxx rke2[3120757]: time="2026-01-27T15:01:46+08:00" level=error msg="Error retrieving S3 snapshots for reconciliation: context deadline exceeded"
When network latency is high, or the number of objects in the bucket slows the requests down, the S3 snapshot handling fails to complete within the timeout. For a cluster provisioned by Rancher, the dashboard may then not show all of the snapshots available in S3.
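To gauge how often the timeout is being hit, the matching journal lines can be counted. The following is a minimal sketch: count_s3_deadline_errors is a hypothetical helper name, not part of RKE2, and it simply counts lines containing the message text shown in the log excerpt above.

```shell
# Count "context deadline exceeded" messages in log text read from stdin.
# count_s3_deadline_errors is a hypothetical helper name, not an RKE2 command.
count_s3_deadline_errors() {
    # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
    # keeps the helper's exit status at 0 in that case.
    grep -c 'context deadline exceeded' || true
}

# Example, run on an rke2-server node:
#   journalctl -u rke2-server --since "1 hour ago" --no-pager | count_s3_deadline_errors
```

A count that stays above zero across several snapshot reconciliation cycles suggests the timeout is being reached repeatedly rather than as a one-off.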
Cause
The issue is caused by a network timeout when retrieving S3 snapshots. Increasing the etcd-s3-timeout parameter in the RKE2 configuration resolves the problem.
If a metadata error such as "Failed to get object metadata" appears among the errors, it relates to the small .zip or .json metadata file that RKE2 stores alongside each snapshot.
- The Timeout: When RKE2 scans the bucket, it tries to read the metadata for each snapshot file. If the S3 provider (or the network path) does not respond quickly enough, the whole operation is aborted, leaving an empty or outdated snapshot list.
- The Cause: Identifying the underlying cause may need further investigation. For example, the S3 endpoint may intermittently respond slowly or time out, or network congestion or the underlying hardware configuration may introduce packet loss. Persistent performance issues need to be investigated and resolved at their source.
Verification
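Once the timeout has been increased as described under Resolution and the rke2-server service has been restarted on each server node:
- Run the manual list command again and confirm that it completes without "context deadline exceeded" errors: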
rke2 etcd-snapshot list
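Before re-running the list command, it can help to confirm that the setting is present in the configuration files RKE2 reads. This is a minimal sketch: show_s3_timeout is a hypothetical helper name, and it assumes the key is written at the start of a line in config.yaml or in a .yaml file under config.yaml.d/.

```shell
# Print every etcd-s3-timeout setting found in the RKE2 configuration.
# show_s3_timeout is a hypothetical helper; the optional argument overrides
# the configuration directory (default: /etc/rancher/rke2).
show_s3_timeout() {
    conf_dir="${1:-/etc/rancher/rke2}"
    for f in "$conf_dir"/config.yaml "$conf_dir"/config.yaml.d/*.yaml; do
        # Skip paths that do not exist (including an unmatched glob)
        [ -f "$f" ] || continue
        # -H prefixes each match with the file name it was found in
        grep -H '^etcd-s3-timeout:' "$f" || true
    done
}

# Example, run on each rke2-server node:
#   show_s3_timeout
```

No output means the key was not found in any of those files on that node.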
Resolution
To work around the latency, the S3 operation timeout can be increased to allow RKE2 enough time to complete the API calls. At the time of writing, this timeout is not configurable in the Rancher dashboard, so it is set in configuration files on the rke2-server nodes, or supplied when running commands with the rke2 binary.
The default timeout is 5 minutes; the steps below use a 10 minute timeout.
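For a one-off command run with the rke2 binary, the timeout can be supplied as a flag. The sketch below assumes the etcd-snapshot subcommand accepts --etcd-s3-timeout in the installed version (worth confirming with rke2 etcd-snapshot list --help) and that the node's S3 settings are already present in its RKE2 configuration. The S3_TIMEOUT variable is only an illustration.

```shell
# One-off snapshot listing with a longer S3 timeout (10 minutes here).
# The flag affects this invocation only; it does not change the timeout
# used by the running rke2-server service.
S3_TIMEOUT="10m0s"
if command -v rke2 >/dev/null 2>&1; then
    rke2 etcd-snapshot list --etcd-s3-timeout "$S3_TIMEOUT"
else
    echo "rke2 not found in PATH; run this on an rke2-server node" >&2
fi
```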
RKE2 standalone cluster
Manage this in the node's configuration file: edit /etc/rancher/rke2/config.yaml and add the timeout configuration:
etcd-s3-timeout: 10m0s
Note: the configuration must be set on all rke2-server nodes. After saving the changes, restart the rke2-server service on each node to put the change into effect:
systemctl restart rke2-server
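When the same change has to be repeated on several server nodes, the edit can be scripted so that re-running it does not add duplicate keys. This is a minimal sketch, assuming GNU sed (for sed -i) and a top-level etcd-s3-timeout key; set_s3_timeout is a hypothetical helper name.

```shell
# Set etcd-s3-timeout in an RKE2 config file, replacing any existing value.
# set_s3_timeout is a hypothetical helper; arguments: FILE [TIMEOUT].
set_s3_timeout() {
    conf_file="$1"
    timeout="${2:-10m0s}"
    touch "$conf_file"
    if grep -q '^etcd-s3-timeout:' "$conf_file"; then
        # Key already present: rewrite that line in place
        sed -i "s/^etcd-s3-timeout:.*/etcd-s3-timeout: ${timeout}/" "$conf_file"
    else
        # Make sure the file ends with a newline before appending
        if [ -n "$(tail -c1 "$conf_file")" ]; then
            echo >> "$conf_file"
        fi
        printf 'etcd-s3-timeout: %s\n' "$timeout" >> "$conf_file"
    fi
}

# Example, run as root on each rke2-server node:
#   set_s3_timeout /etc/rancher/rke2/config.yaml 10m0s && systemctl restart rke2-server
```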
RKE2 cluster provisioned by Rancher
Add the configuration to a second file, 51-rancher.yaml, under /etc/rancher/rke2/config.yaml.d/:
cat <<EOF > /etc/rancher/rke2/config.yaml.d/51-rancher.yaml
etcd-s3-timeout: 10m0s
EOF
Note: the configuration must be set on all rke2-server nodes. After saving the changes, restart the rke2-server service on each node to put the change into effect:
systemctl restart rke2-server
These same steps can be adapted for K3s server nodes, using the equivalent configuration paths under /etc/rancher/k3s/ and the k3s service.