Troubleshooting etcd Time Synchronisation Issues

Article Number: 000021851

Environment

An RKE or RKE2 cluster with multiple etcd nodes.

Situation

etcd is a critical component in Kubernetes environments, serving as the distributed reliable key-value store that holds the cluster's state. Its proper functioning is highly dependent on accurate time synchronisation across all etcd nodes. Even a small clock drift can lead to significant issues, impacting cluster stability and performance.

Time synchronisation issues in etcd can manifest in various ways, often leading to a cascade of errors throughout your cluster. Common symptoms include:

  • High Clock Drift errors in etcd logs: You'll see warnings like "prober found high clock drift" (an example of searching the logs for these messages follows this list).
  • Slow etcd requests: Messages such as "apply request took too long" or "waiting for ReadIndex response took too long" indicate that etcd operations are being delayed, often due to inconsistencies between nodes. These messages can also point to other etcd issues, but they commonly appear alongside a time synchronisation problem.
  • Kubernetes API Server timeouts: The kube-apiserver might struggle to connect to etcd, resulting in http: Handler timeout errors.
  • Unhealthy etcd nodes: etcd nodes are reported as unhealthy or flap (repeatedly lose and regain connection).
  • Raft misalignment: Inconsistent Raft indexes among etcd members.
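
To confirm these symptoms, you can search the etcd logs directly on an affected node. A minimal sketch, assuming an RKE cluster where etcd runs as a Docker container named etcd, and an RKE2 cluster where etcd runs as a static pod managed by containerd (container names and socket paths may differ in your environment):

# RKE: etcd runs as a Docker container named "etcd"
docker logs etcd 2>&1 | grep -iE "clock drift|took too long"

# RKE2: list the etcd container via crictl, then grep its logs
# (crictl may need to be pointed at /run/k3s/containerd/containerd.sock)
crictl ps --name etcd
crictl logs <etcd-container-id> 2>&1 | grep -iE "clock drift|took too long"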

Cause

The primary cause of time synchronisation issues on etcd nodes is often a misconfigured or non-functional NTP (Network Time Protocol) client, such as chrony or ntpd. Specific problems can include:

  • Incorrect NTP Server Configuration: The chrony.conf or ntp.conf file might point to incorrect or unreachable NTP servers (a sketch of typical server directives follows this list).
  • Firewall Rules: Necessary firewall rules (e.g., UDP port 123 for NTP) might be missing, preventing the nodes from reaching the configured time servers.
  • Network Connectivity Issues: General network problems can also prevent NTP synchronisation.
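
For reference, a minimal sketch of working time-source directives in /etc/chrony.conf; the pool and server names below are placeholders and should be replaced with NTP sources that are reachable from your environment:

# /etc/chrony.conf (excerpt): "pool" or "server" lines define the time sources
pool pool.ntp.org iburst
# or explicit servers:
# server ntp1.example.com iburst
# server ntp2.example.com iburst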

Resolution

To resolve a time synchronisation issue, follow these steps:

Check the current time and synchronisation status on all your etcd nodes:

  • Check current time:
date +"%T.%N"

Execute this command simultaneously on all etcd nodes to quickly spot any significant differences; even a difference of a few seconds can be critical for etcd (a sketch for running it across all nodes over SSH follows the NTP status commands below).

  • Check the status of your NTP client, e.g. for chrony:

timedatectl
chronyc sources -v
chronyc tracking
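
As noted above, comparing clocks is easiest when the date command runs on all etcd nodes at almost the same moment. A minimal sketch using SSH in a loop; the hostnames are placeholders and key-based SSH access to the nodes is assumed:

# Print the sub-second time of each etcd node in quick succession
for node in etcd-node1 etcd-node2 etcd-node3; do
  echo -n "$node: "; ssh "$node" 'date +"%T.%N"'
done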

Based on this verification, identify any clock misalignment and fix the underlying issue:

  • Review your NTP configuration: Examine /etc/chrony.conf (for chrony) or /etc/ntp.conf (for ntpd) on each etcd node. Ensure that the configured NTP servers are correct and reachable. It's recommended to use reliable and accessible NTP sources.
  • Check Firewall Rules: Verify that UDP port 123 is open in your firewall configuration on all etcd nodes to allow NTP traffic to and from your configured time servers.
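
A quick way to verify both points is to check whether chrony can actually reach its configured sources and whether NTP traffic is allowed by the host firewall. A minimal sketch, assuming chrony and firewalld; adapt the commands to your NTP client and firewall tooling:

# A "Reach" value of 0 (or a source marked "?") means the server is not being reached
chronyc sources

# Check whether NTP (UDP port 123) is allowed, and add the rule if it is missing
sudo firewall-cmd --list-all
sudo firewall-cmd --permanent --add-service=ntp
sudo firewall-cmd --reload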

Once you've corrected the configuration, restart the NTP client to force it to resynchronise the time, e.g. for chrony:

sudo systemctl restart chronyd
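
If the clock is far off, chrony may correct it only gradually after the restart. A short sketch for forcing an immediate correction with chrony and confirming the node is back in sync; use makestep with care, since it steps the clock and is exactly the kind of jump you want to apply to one node at a time:

# Step the clock immediately instead of slewing it gradually
sudo chronyc makestep

# Confirm synchronisation and check the remaining offset
chronyc tracking
timedatectl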

Important: When restarting the time synchronisation service, it's generally safer to do it one etcd node at a time. A temporary clock adjustment on a single node is usually tolerable, but restarting the service on all nodes simultaneously could introduce further instability if a significant time jump occurs across the entire etcd cluster at once.

After ensuring that time synchronisation is correct on all etcd nodes, verify the health of your etcd cluster and monitor the etcd logs for any recurring "high clock drift" or "took too long" errors. These should disappear once the time synchronisation is stable.
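
A minimal sketch for this final check, run from a machine with kubectl access to the cluster; the log command assumes the RKE container name used earlier:

# Ask the kube-apiserver whether its etcd backend is healthy
kubectl get --raw /healthz/etcd

# Confirm nodes are Ready and check recent etcd logs for recurring errors
kubectl get nodes
docker logs --since 10m etcd 2>&1 | grep -iE "clock drift|took too long"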