
Slow etcd performance (performance testing and optimization)

Article Number: 000020100

Environment

A Rancher-provisioned or standalone RKE2 or K3s cluster

Situation

If the etcd logs in a cluster contain messages of the following format, this indicates that the backing storage is too slow or that the server is too highly loaded for etcd queries to complete in a timely manner:

2019-08-11 23:27:04.344948 W | etcdserver: read-only range request "key:\"/registry/services/specs/default/kubernetes\" " with result "range_response_count:1 size:293" took too long (1.530802357s) to execute

etcd queries are expected to complete in a low (sub-10 ms) timeframe. Where requests take longer than 100 ms, etcd logs "request took too long" messages of the format above. Slow etcd queries will result in slow cluster controlplane operation and, if slow enough, kube-apiserver request timeouts and a degraded controlplane.
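To check whether these messages are present, search the etcd logs on each etcd node. Below is a minimal sketch, assuming the default RKE2 static-pod and K3s embedded-etcd setups; the placeholder <node-name> and the exact pod and unit names may differ in your environment:

# RKE2: etcd runs as a static pod named etcd-<node-name> in the kube-system namespace
kubectl -n kube-system logs etcd-<node-name> | grep "took too long"

# K3s: etcd runs embedded in the k3s service, so its messages appear in the service journal
journalctl -u k3s | grep "took too long"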

This article outlines how to verify etcd I/O performance at a single point in time, as well as how to use rancher-monitoring to monitor this over time. It also includes recommendations on how to increase etcd I/O performance.

Testing etcd performance (point-in-time test with fio)

  1. Install fio using the package manager on your distribution, e.g. apt install fio, yum install fio or zypper install fio.
  2. Test the storage by creating a directory on the device you want to test and running the fio command shown below. The --fdatasync=1 and --bs=2300 options approximate etcd's write pattern when it syncs entries to its write-ahead log (WAL):

mkdir test-data
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=100m --bs=2300 --name=mytest
  3. Below is example output from a node with all roles (etcd, controlplane and worker) in the Rancher local RKE2 cluster, running on an AWS EC2 instance of type t2.large:

[root@ip-172-31-14-184 ~]# fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=100m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.15-23-g937e
Starting 1 process
mytest: Laying out IO file (1 file / 100MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=2684KiB/s][w=1195 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=21203: Sun Aug 11 23:47:30 2019
  write: IOPS=1196, BW=2687KiB/s (2752kB/s)(99.0MiB/38105msec)
    clat (nsec): min=2840, max=99026, avg=8551.56, stdev=3187.53
      lat (nsec): min=3337, max=99664, avg=9191.92, stdev=3285.92
    clat percentiles (nsec):
      |  1.00th=[ 4640],  5.00th=[ 5536], 10.00th=[ 5728], 20.00th=[ 6176],
      | 30.00th=[ 6624], 40.00th=[ 7264], 50.00th=[ 7968], 60.00th=[ 8768],
      | 70.00th=[ 9408], 80.00th=[10304], 90.00th=[11840], 95.00th=[13760],
      | 99.00th=[19328], 99.50th=[23168], 99.90th=[35584], 99.95th=[44288],
      | 99.99th=[63744]
    bw (  KiB/s): min= 2398, max= 2852, per=99.95%, avg=2685.79, stdev=104.84, samples=76
    iops        : min= 1068, max= 1270, avg=1195.96, stdev=46.66, samples=76
  lat (usec)   : 4=0.52%, 10=76.28%, 20=22.34%, 50=0.82%, 100=0.04%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=352, max=21253, avg=822.36, stdev=652.94
    sync percentiles (usec):
      |  1.00th=[  400],  5.00th=[  420], 10.00th=[  437], 20.00th=[  457],
      | 30.00th=[  478], 40.00th=[  529], 50.00th=[  906], 60.00th=[  947],
      | 70.00th=[  988], 80.00th=[ 1020], 90.00th=[ 1090], 95.00th=[ 1156],
      | 99.00th=[ 2245], 99.50th=[ 5932], 99.90th=[ 8717], 99.95th=[11600],
      | 99.99th=[16581]
  cpu          : usr=0.79%, sys=7.38%, ctx=119920, majf=0, minf=35
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      issued rwts: total=0,45590,0,0 short=45590,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
  WRITE: bw=2687KiB/s (2752kB/s), 2687KiB/s-2687KiB/s (2752kB/s-2752kB/s), io=99.0MiB (105MB), run=38105-38105msec
Disk stats (read/write):
  xvda: ios=0/96829, merge=0/3, ticks=0/47440, in_queue=47432, util=92.25%

In the fsync/fdatasync/sync_file_range section of the output, you can see that the 99th (99.00th) percentile of sync latency is 2,245 microseconds, or approximately 2.2 milliseconds. The etcd documentation suggests that, for the storage to be considered fast enough for etcd, the 99th percentile of fdatasync invocations when writing to the WAL file must be less than 10 ms.
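To be sure you are measuring the device that actually backs etcd, rather than the root filesystem, point fio at a directory on the same volume as the etcd data directory. The following is a sketch only, assuming the default RKE2 and K3s data directory locations; adjust the paths if a custom data directory is configured:

# RKE2: default etcd data lives under /var/lib/rancher/rke2/server/db (assumed default path)
mkdir -p /var/lib/rancher/rke2/server/db/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/rancher/rke2/server/db/fio-test --size=100m --bs=2300 --name=mytest
rm -rf /var/lib/rancher/rke2/server/db/fio-test

# K3s: default etcd data lives under /var/lib/rancher/k3s/server/db (assumed default path)
mkdir -p /var/lib/rancher/k3s/server/db/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/rancher/k3s/server/db/fio-test --size=100m --bs=2300 --name=mytest
rm -rf /var/lib/rancher/k3s/server/db/fio-test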

Whilst this value might suggest that the storage is fast enough for etcd, it is important to note that fio measures the latency at the single point in time when the test is run. If higher disk latency is intermittent - for example, due to a regularly scheduled task running on the same host, or another host in the case of shared storage - the fio results may indicate lower latency because the storage was not heavily loaded at the time of testing. To better understand the latency over time, you can use rancher-monitoring to monitor etcd.

Monitoring etcd performance (over time with rancher-monitoring)

  1. Install rancher-monitoring into the affected cluster, per the Rancher documentation.
  2. Within the Rancher UI, navigate to the affected cluster and open the rancher-monitoring Grafana UI by selecting Monitoring -> Grafana.
  3. Within the Grafana UI, navigate to Dashboards and click on etcd to load the etcd dashboard.
  4. The Disk Sync Duration panel contains the relevant data. Mouse over this panel and click the three-dot menu that appears in the top-right corner of the panel, then click View to expand it.
  5. You can adjust the time range in the top-right corner of the window to view data over a longer period, or to inspect periods in which etcd performance issues or warnings are known to have occurred. The graph displays the 99th percentile WAL fsync and DB fsync values per etcd node. The WAL fsync value should be below 10 ms. If 10 ms is consistently exceeded, or if there are periods where the WAL fsync value spikes substantially above 10 ms, you should look to increase the storage performance.
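If you prefer to query the underlying metric directly rather than through Grafana, the Disk Sync Duration panel is driven by etcd's etcd_disk_wal_fsync_duration_seconds histogram. The sketch below uses the Prometheus HTTP API; the service name and namespace (rancher-monitoring-prometheus in cattle-monitoring-system) are assumptions based on the rancher-monitoring chart defaults and may differ in your installation:

# Port-forward to the rancher-monitoring Prometheus (assumed default service name and namespace)
kubectl -n cattle-monitoring-system port-forward svc/rancher-monitoring-prometheus 9090:9090 &

# 99th percentile WAL fsync duration per etcd member over the last 5 minutes (values are in seconds)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))'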

Resolution

If the testing above indicates that the storage I/O performance is not sufficient for etcd, the simple solution is to upgrade the storage to provide higher I/O capacity. If performance is close to acceptable, there are several steps you can take to optimize storage:

  • Always use SSDs for etcd storage, whether on dedicated hardware or in the cloud.
  • Do not run etcd on a node with other roles. As a general rule, never run the worker role on the same node as etcd. Many environments have etcd and controlplane roles on the same node; if this is the case in your environment and etcd performance is insufficient, consider separating etcd and controlplane roles onto different nodes.
  • If the etcd and the controlplane roles have already been separated and etcd is still not performant, mount a separate volume for etcd so that storage I/O operations from other processes on the node do not impact etcd performance. This is mostly applicable to cloud-hosted nodes, where each volume mounted has its own allocated set of resources.
  • If you are using dedicated servers and want to separate etcd storage I/O operations from the rest of the system, install a separate storage device for the etcd data directory, as sketched below.
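As an illustration of the dedicated-device approach, the sketch below assumes a spare block device /dev/sdb and the default RKE2 database directory; for K3s, substitute /var/lib/rancher/k3s/server/db. The device name and paths are assumptions for illustration only:

# Format the dedicated device reserved for etcd (destroys any existing data on /dev/sdb)
mkfs.ext4 /dev/sdb

# Mount it at the etcd database directory and persist the mount across reboots
mkdir -p /var/lib/rancher/rke2/server/db
mount /dev/sdb /var/lib/rancher/rke2/server/db
echo '/dev/sdb /var/lib/rancher/rke2/server/db ext4 defaults 0 2' >> /etc/fstab

# Do this before the node is provisioned, or stop the service and copy the existing contents of the directory onto the new volume first.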