
Increasing the RKE2 etcd snapshot S3 timeout

Article Number: 000022328

Environment

RKE2 with recurring S3 snapshots

Situation

In the rke2-server journalctl logs, "context deadline exceeded" errors are seen when RKE2 attempts to reconcile snapshots with the S3 endpoint.

Jan 27 15:01:46 xxxx rke2[3120757]: time="2026-01-27T15:01:46+08:00" level=warning msg="Failed to get object metadata: Head \"https://xxx.yy.org/rancher-downstream/prod-cluster/.metadata/etcd-snapshot-xxxx.yy.org-1753833604.zip\": context deadline exceeded"
Jan 27 15:01:46 xxxx rke2[3120757]: time="2026-01-27T15:01:46+08:00" level=warning msg="Failed to get object metadata: context deadline exceeded"
Jan 27 15:01:46 xxxx rke2[3120757]: time="2026-01-27T15:01:46+08:00" level=error msg="Error retrieving S3 snapshots for reconciliation: context deadline exceeded"

If network latency is high, or the number of objects in the bucket introduces delays, the S3 snapshot handling fails to complete within the timeout. For a cluster provisioned by Rancher, the dashboard may not show all of the snapshots available in S3.

Cause

The issue is caused by a network timeout when retrieving S3 snapshots. Increasing the etcd-s3-timeout parameter in the RKE2 configuration resolves the problem.

If the "failed to read metadata" is seen in the errors, this relates to RKE2 storing a small .zip or .json metadata file alongside each snapshot.

  • The timeout: when RKE2 scans the bucket, it tries to read the metadata for each snapshot file. If the S3 provider (or the network path to it) does not respond quickly enough, the whole process exits, leaving an empty or outdated snapshot list.
  • The cause: this may need further investigation to understand. For example, the S3 endpoint may intermittently respond slowly or time out, or network congestion or the underlying hardware configuration may introduce packet loss. Persistent performance issues need to be investigated and resolved at their source; a quick latency check is shown after this list.
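
As a rough check of the network path, the time taken by a single request to the S3 endpoint can be measured with curl. The URL below is a placeholder based on the log output above; substitute your own endpoint and bucket.

# Measure how long one HTTPS request to the S3 endpoint takes (placeholder URL)
curl -o /dev/null -s -w 'HTTP %{http_code} in %{time_total}s\n' https://xxx.yy.org/rancher-downstream/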

Verification

Once the timeout has been increased and the rke2-server service has been restarted on all server nodes:

  1. Run the manual list command again:

rke2 etcd-snapshot list


  2. Check the Rancher dashboard; the snapshots should reflect the current state of the snapshot files in S3 once Rancher successfully synchronizes the list from the RKE2 API.
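
Optionally, check the rke2-server journal to confirm that the warnings shown in the Situation section no longer appear after the restart. This is a simple illustrative check:

# Search recent rke2-server logs for the previous S3 timeout warnings
journalctl -u rke2-server --since "1 hour ago" | grep -i "deadline exceeded"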

Resolution

To work around the latency, the S3 operation timeout can be increased to allow RKE2 enough time to complete the API calls. At the time of writing this timeout is not configurable in the Rancher dashboard, so the configuration is added to files on the rke2-server nodes; these files are also read when using the rke2 binary to run commands.

The default timeout is 5 minutes; in the steps below a 10 minute timeout is used.
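
The current default can be confirmed from the server help output, assuming the installed RKE2 version exposes the --etcd-s3-timeout flag:

# Show the etcd-s3-timeout flag and its default value
rke2 server --help 2>&1 | grep "etcd-s3-timeout"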

RKE2 standalone cluster

Manage this using the node's configuration file: edit /etc/rancher/rke2/config.yaml and add the timeout configuration:


etcd-s3-timeout: 10m0s

Note: the configuration must be set on all rke2-server nodes. After saving the changes, restart the rke2-server service on each node to put the change into effect: systemctl restart rke2-server
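
For context, the sketch below shows where the timeout sits alongside typical recurring S3 snapshot settings in /etc/rancher/rke2/config.yaml. The endpoint, bucket, folder and schedule values are placeholders; keep your existing snapshot configuration and only add the timeout line.

# Existing recurring S3 snapshot settings (placeholder values)
etcd-s3: true
etcd-s3-endpoint: xxx.yy.org
etcd-s3-bucket: rancher-downstream
etcd-s3-folder: prod-cluster
etcd-snapshot-schedule-cron: "0 */6 * * *"
# Increased S3 operation timeout
etcd-s3-timeout: 10m0s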

RKE2 cluster provisioned by Rancher

Add the configuration in a separate file, 51-rancher.yaml, under /etc/rancher/rke2/config.yaml.d/:

cat <<EOF > /etc/rancher/rke2/config.yaml.d/51-rancher.yaml
etcd-s3-timeout: 10m0s
EOF

Note: the configuration must be set on all rke2-server nodes. After saving the changes, restart the rke2-server service on each node to put the change into effect: systemctl restart rke2-server
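
As a quick check on each server node, confirm the drop-in file contains the setting and that the service returns to an active state after the restart; this is an illustrative check only.

# Confirm the timeout setting is present in the RKE2 configuration directory
grep -R "etcd-s3-timeout" /etc/rancher/rke2/
# Restart the service and confirm it comes back up
systemctl restart rke2-server && systemctl is-active rke2-server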

These same steps can be adapted for k3s server nodes (using /etc/rancher/k3s/config.yaml and the k3s service).