ETCD poor latency performance and failure under load
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Jim Gauld
Bug Description
Brief Description
-----------------
The etcd tasks suffer from both CPU-scheduling and disk-IO-scheduling problems. The etcd server is a critical "interactive" process that requires low latency.
This is a critical process that interacts closely with kube-apiserver. When the problem manifests, kubectl commands and the kube-apiserver become extremely unresponsive. When etcd errors occur, pods fail, clients are unable to renew their leases, and applications fail to apply. This creates a negative feedback loop: the system cannot retry or recover.
The problem occurs even on hardware with an SSD root disk. Scheduling is poor, but does not fail outright, on fast disk systems whose root disk uses HW-RAID or NVMe.
The etcd server error logs in kern.log contain messages such as 'timed out waiting for read index response', 'request timed out, took too long to execute', 'wal: sync duration X, expected less than 1s', and 'context canceled, took too long to execute'. etcd exceeds both its heartbeat interval and its election timeout. This occurs under load: when applying applications, and especially when there is additional disk stress such as a 'dd' writer to the root disk.
i.e., exceeding the 100 ms heartbeat interval and the 1000 ms election timeout.
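The 100 ms and 1000 ms figures match etcd's upstream defaults. As a sketch (values taken from etcd's documented defaults, not read from this lab's configuration), the corresponding server flags are:

```
etcd --heartbeat-interval=100 --election-timeout=1000
```

When disk fsync latency exceeds these windows, the server reports the timeout and election errors listed in the logs below.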
Severity
--------
Critical: applications fail to apply; overall system responsiveness is poor and failing.
Steps to Reproduce
------------------
Apply a large application.
Run 'dd' writer to root disk that does 'fsync'.
dd if=/dev/zero of=./test.dd bs=200K count=20000 conv=fsync
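The 'dd ... conv=fsync' stress above makes the root disk's sync latency visible to etcd. As a hypothetical companion probe (not part of the report), this Python sketch measures worst-case fsync latency on the current filesystem, a rough stand-in for etcd's WAL "sync duration ... expected less than 1s" check:

```python
import os
import tempfile
import time

def worst_fsync_ms(size_bytes=200 * 1024, iterations=5):
    """Return the worst observed fsync latency in milliseconds."""
    worst = 0.0
    with tempfile.NamedTemporaryFile(dir=".") as f:
        block = b"\0" * size_bytes
        for _ in range(iterations):
            f.write(block)
            f.flush()
            start = time.monotonic()
            os.fsync(f.fileno())  # same barrier that dd's conv=fsync issues
            worst = max(worst, (time.monotonic() - start) * 1000.0)
    return worst

print(f"worst fsync latency: {worst_fsync_ms():.1f} ms")
```

Running this alongside the 'dd' writer shows latencies climbing well past 1000 ms on the affected SSD labs.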
Expected Behavior
------------------
It should be possible to apply applications successfully.
etcd should be protected against stress.
The system should behave under stress (e.g., reasonable latency) and not fail.
Actual Behavior
----------------
Apps fail to apply.
Pods fail and clients are unable to renew leases.
Terminal response is poor. kubectl response is poor or absent.
Reproducibility
---------------
100% reproducible in labs with an SSD root disk.
Not seen with HW-RAID or NVMe drives.
System Configuration
--------------------
AIO-SX, AIO-DX. 2 platform cpus.
Branch/Pull Time/Commit
-----------------------
Recent load.
Last Pass
---------
Intermittent. Fails depending on disk type, application, and stress.
Timestamp/Logs
--------------
Types of logs in kern.log:
1. etcdserver: timed out waiting for read index response (local node might have slow network)
2. "error:etcdserver: request timed out" took too long (7.002480373s) to execute
3. wal: sync duration of 7.125655597s, expected less than 1s
4. "error:context canceled" took too long (1.999580305s) to execute
5. took too long (5.292740636s) to execute
Test Activity
-------------
Sanity.
Workaround
----------
Modify the etcd service configuration file:
* Tune the IO scheduler: set it to 'cfq' on the root disk (e.g., /sys/block/
* Tune etcd ionice settings: policy=best-effort, priority=0, Nice=-19
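The second tuning maps directly onto systemd's service directives. A minimal sketch of a drop-in (the drop-in path is an assumption for illustration, not taken from the actual fix):

```
# /etc/systemd/system/etcd.service.d/io-tuning.conf  (hypothetical path)
[Service]
IOSchedulingClass=best-effort
IOSchedulingPriority=0
Nice=-19
```

The scheduler change is applied per block device, e.g. 'echo cfq > /sys/block/<root-disk>/queue/scheduler' (device name depends on the system).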
Changed in starlingx:
  assignee: nobody → Jim Gauld (jgauld)
  status: New → Confirmed
  status: Confirmed → In Progress
  tags: added: stx.containers
Changed in starlingx:
  importance: Undecided → Medium
  tags: added: stx.6.0
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/utilities/+/790094