ETCD poor latency performance and failure under load

Bug #1927515 reported by Jim Gauld
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Jim Gauld

Bug Description

Brief Description
-----------------
The etcd tasks are suffering from both CPU scheduling and disk IO scheduling problem. The etcd server is a critical "interactive" process that requires low-latency.

This is a critical process. There is interaction of this with kube-apiserver. When the problem manifests, observed terrible responsiveness of kubectl commands and kube-apiserver. When etcd errors occur, this causes pods failing, the clients are unable to renew lease, applications fail to apply. This causes negative feedback -- it cannot retry/recover.

Problem seems to occur even when HW using SSD on root disk. Scheduling poor, but not failing on fast disk systems that have root disk using HW-RAID or NVMe.

See ETCD server error logs indicating 'timed out wating for read index response', 'request timed out took too long to execute', 'wal: sync duration X, expected less than 1s', 'context cancelled - took too long to execute' type messages in kern.log. The ETCD exceeds both heartbeat interval and election timeouts. This occurs under load: when applying apps and especially worse when there is additional disk stress such as a 'dd' writer to the root disk.
i.e., exceeding 100ms heartbeat timeout, exceeding 1000ms election timeout.

Severity
--------
Provide the severity of the defect.
Critical: Applications fail to apply. Overall system responsiveness poor and failing.

Steps to Reproduce
------------------
Apply a large application.
Run 'dd' writer to root disk that does 'fsync'.
dd if=/dev/zero of=./test.dd bs=200K count=20000 conv=fsync

Expected Behavior
------------------
Should be able to apply applications successfully.
ETCD should be protected against stress.
System should behave under stress, eg, reasonable latency, not fail.

Actual Behavior
----------------
Apps fail to apply.
Pods fail and clients unable to renew lease.
Poor terminal response. Poor kubectl response, or no response.

Reproducibility
---------------
100% on SSD root disk labs
Not seen with HW-RAID, or NVMe drives.

System Configuration
--------------------
AIO-SX, AIO-DX. 2 platform cpus.

Branch/Pull Time/Commit
-----------------------
Recent load.

Last Pass
---------
Intermittent. Fails depending on disk type, application, and stress.

Timestamp/Logs
--------------
Types of logs in kern.log:
1. etcdserver: timed out waiting for read index response (local node might have slow network)
2. "error:etcdserver: request timed out" took too long (7.002480373s) to execute
3. wal: sync duration of 7.125655597s, expected less than 1s
4. "error:context canceled" took too long (1.999580305s) to execute
5. took too long (5.292740636s) to execute

Test Activity
-------------
Sanity.

Workaround
----------
Modify etcd service configuration file: change
* tuned IO-scheduler: set to 'cfq' on root disk (eg, /sys/block/sda/queue/scheduler)
* tuned ETCD ionice policy=best-effort, priority=0, Nice=-19

CVE References

Jim Gauld (jgauld)
Changed in starlingx:
assignee: nobody → Jim Gauld (jgauld)
status: New → Confirmed
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to utilities (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/utilities/+/790094

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config-files (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/790132

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to utilities (master)

Reviewed: https://review.opendev.org/c/starlingx/utilities/+/790094
Committed: https://opendev.org/starlingx/utilities/commit/6045b1b8a0d8ed6a94d06cdfc994bf1a5fa9dbb5
Submitter: "Zuul (22348)"
Branch: master

commit 6045b1b8a0d8ed6a94d06cdfc994bf1a5fa9dbb5
Author: Jim Gauld <email address hidden>
Date: Thu May 6 11:58:34 2021 -0400

    Provide utility script is-rootdisk-device.sh

    This provides a utility script to determine which disk contains the root
    filesystem. This can also be used as a helper function for io-scheduler
    udev rules that require specific configuration for root disk.

    Example usage:
    /usr/local/bin/is-rootdisk-device.sh
    ROOTDISK_DEVICE=sda

    /usr/local/bin/is-rootdisk-device.sh /dev/sda
    ROOTDISK_DEVICE=sda

    /usr/local/bin/is-rootdisk-device.sh /dev/sdb
    (i.e., no output)

    Partial-Bug: 1927515
    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: Ib0d4a161a407b08d294c5ff9aa0b7590961e18c9

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config-files (master)

Reviewed: https://review.opendev.org/c/starlingx/config-files/+/790098
Committed: https://opendev.org/starlingx/config-files/commit/e82d1b9e70dd50fbec76db7cfc51e433c5b6bf9e
Submitter: "Zuul (22348)"
Branch: master

commit e82d1b9e70dd50fbec76db7cfc51e433c5b6bf9e
Author: Jim Gauld <email address hidden>
Date: Thu May 6 12:14:39 2021 -0400

    Configure io-scheduler udev rules for ETCD and HW-RAID

    This configures io-scheduler udev rules for etcd and hw-raid
    performance.

    This sets the io-scheduler to 'cfq' tuned parameters for 'controller'
    nodetype with root file-system disk.

    This sets io-scheduler to 'noop' for HW-RAID Dell PowerEdge R720,
    this was a missing commit from pre-starlingx.

    Partial-Bug: 1927515
    Depends-On: https://review.opendev.org/c/starlingx/utilities/+/790094
    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: Iaf1de8d962d1e8d253c72e680370666a2aed8c8e

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/790132
Committed: https://opendev.org/starlingx/stx-puppet/commit/0b429c7cb0c16e34755c1b1e146ebb8b006d44dc
Submitter: "Zuul (22348)"
Branch: master

commit 0b429c7cb0c16e34755c1b1e146ebb8b006d44dc
Author: Jim Gauld <email address hidden>
Date: Thu May 6 12:33:24 2021 -0400

    Configure etcd service critical process nice and ionice

    The etcd server is a critical "interactive" process that requires
    low-latency. This process has many etcd threads, each worker does
    minimal work and wakes up frequently. The threads do small amount of
    writes to commit.

    The etcd server will start exceeding heartbeat interval of 100ms and
    the election timeout of 1000ms under load and independent disk stress,
    if not properly tuned as a critical process. This cascades into many
    failures.

    This requires io-scheduler 'cfq' to take advantage of io-nice policy
    and priority. This bumps up to best-effort/0 from best-effort/4.

    This sets nice -19 from nice 0. This helps tremendously with
    interactive processes for linux CFS (completely-fair-scheduler).

    With tuned settings, under application load and additional disk stress,
    we see a dramatic reduction of 'blocked_max' and no more kern.log
    etcdserver related errors for exceeding the timeouts.
    We see dramatic improvement to system responsiveness for kubectl,
    kube-apiserver. This prevents pods from failing when clients they
    cannot renew lease.

    Note that 'blocked_max' scheduler stats for this process represents
    involuntary wait for disk related delay, scheduling delay, etc.

    Testing coverage:
    - various root disk HW: RAID, NVMe, SSD, VBox
    - sanity on multiple labs: R730_1 with RAID, WFP13_14

    Configuration change used in testing:
    - baseline: deadline, best-effort/4,
    - system under test: cfq, best-effort/0, nice -19
    - dd stress was single writer to root disk:
      while true; do
        dd if=/dev/zero of=./test.dd bs=200K count=20000 conv=fsync
      done

    Compared results and observe system behaviour:
    - watch kern.log for etcserver 'took too long', and 'wal: sync'
    - watch fm alarms
    - watch kubectl pod status
    - observe performance with: iotop, schedtop, iostat

    Tests performed:
    - DRBD resync with and without dd writer stress
    - swact with and without dd stress
    - large application apply + dd writer stress
    - launch large number of pods (eg, scale nginx with 80 pods),
      watch systemctl status commands using strace to check for hang
    - copy very large files, create big tarballs, write mkisofs iso
    - host install

    Closes-Bug: 1927515
    Depends-On: https://review.opendev.org/c/starlingx/config-files/+/790098
    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: Ieeeba5c1375d8d99401f839c7409a9de356fda87

Ghada Khalil (gkhalil)
tags: added: stx.containers
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.6.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792029

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to utilities (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/utilities/+/792213

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to utilities (f/centos8)
Download full text (29.4 KiB)

Reviewed: https://review.opendev.org/c/starlingx/utilities/+/792213
Committed: https://opendev.org/starlingx/utilities/commit/c4d042615e6fe8944a4628fa1a29e86e012a9bf5
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 557cada006fd5a3bd81ad5af387c37657801f8c5
Author: Fernando Theirs <email address hidden>
Date: Thu May 13 16:21:47 2021 -0300

    Collect is missing etcdctl output

    When the collect tool is run, it does not include the contents
    of the etcd database. Fixes have been made for this to dump the
    contents in "etcd_database.dump" file.

    Verify if etcd access is secured. In that case, certificates
    will be used.

    Closes-Bug: 1911935

    Signed-off-by: Fernando Theirs <email address hidden>
    Change-Id: Idbc60edffa978a7a6bead939a4eb54f4abae29a6

commit 6045b1b8a0d8ed6a94d06cdfc994bf1a5fa9dbb5
Author: Jim Gauld <email address hidden>
Date: Thu May 6 11:58:34 2021 -0400

    Provide utility script is-rootdisk-device.sh

    This provides a utility script to determine which disk contains the root
    filesystem. This can also be used as a helper function for io-scheduler
    udev rules that require specific configuration for root disk.

    Example usage:
    /usr/local/bin/is-rootdisk-device.sh
    ROOTDISK_DEVICE=sda

    /usr/local/bin/is-rootdisk-device.sh /dev/sda
    ROOTDISK_DEVICE=sda

    /usr/local/bin/is-rootdisk-device.sh /dev/sdb
    (i.e., no output)

    Partial-Bug: 1927515
    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: Ib0d4a161a407b08d294c5ff9aa0b7590961e18c9

commit 88a678f142cfe86c58b6405aae6babbc08de0e8f
Author: Chen, Haochuan Z <email address hidden>
Date: Fri Mar 26 09:09:41 2021 +0800

    Add packages to stx-ceph-manager image

    This update installs ceph-mgr, ceph-mon, ceph-osd packages as part
    of stx-ceph-manager image.

    Partial-Bug: 1920882

    Change-Id: I4afde8b1476e14453fac8561f1edde7360b8ee96
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 09b3542fcc6cc0300a9cae0d302225e6977780f3
Author: Scott Little <email address hidden>
Date: Thu Mar 25 11:49:49 2021 -0400

    Set SW_VERSION 21.05

    Prep for the StarlingX 5.0 release.
    SW_VERSION, also known as PLATFORM_RELEASE, uses YY.MM format.

    Story: 2008055
    Task: 42115
    Signed-off-by: Scott Little <email address hidden>
    Change-Id: If7c91a2b523358269ae4850961cf4189ffcd7a75

commit ae4cefd0e2a0001476782c31e1003810da2b4838
Author: Chris Friesen <email address hidden>
Date: Thu Mar 4 18:04:12 2021 -0500

    add dcmanager-audit-worker to patch restart script

    Need to add the new process to the patch restart script.

    Story: 2007267
    Task: 41999
    Signed-off-by: Chris Friesen <email address hidden>
    Change-Id: If5faa806bd0d52ddbf1343b064959f4207cf975a

commit 27fce5a52321f3014fa8ae9181d344bc774289da
Author: Enzo Candotti <email address hidden>
Date: Mon Feb 1 12:47:38 2021 -0300

    Add resource CPU and memory info in collect

    This adds commands to collect more data to debug
    resource allocations and...

tags: added: in-f-centos8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config-files (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config-files/+/793634

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config-files (f/centos8)
Download full text (4.3 KiB)

Reviewed: https://review.opendev.org/c/starlingx/config-files/+/793634
Committed: https://opendev.org/starlingx/config-files/commit/03c3f68b2a1477da3dbc7d351e8bf9e2cff2acf1
Submitter: "Zuul (22348)"
Branch: f/centos8

commit e82d1b9e70dd50fbec76db7cfc51e433c5b6bf9e
Author: Jim Gauld <email address hidden>
Date: Thu May 6 12:14:39 2021 -0400

    Configure io-scheduler udev rules for ETCD and HW-RAID

    This configures io-scheduler udev rules for etcd and hw-raid
    performance.

    This sets the io-scheduler to 'cfq' tuned parameters for 'controller'
    nodetype with root file-system disk.

    This sets io-scheduler to 'noop' for HW-RAID Dell PowerEdge R720,
    this was a missing commit from pre-starlingx.

    Partial-Bug: 1927515
    Depends-On: https://review.opendev.org/c/starlingx/utilities/+/790094
    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: Iaf1de8d962d1e8d253c72e680370666a2aed8c8e

commit efb718e03171580c43702a01f7c103e590832ab7
Author: Li Zhou <email address hidden>
Date: Tue Apr 13 04:48:46 2021 -0400

    systemd: Upgrade to version 219-78.el7_9.3

    Change the BuildRequires to refer to the new systemd version.

    Depends-On: https://review.opendev.org/c/starlingx/tools/+/786601
    Closes-Bug: #1924691
    Signed-off-by: Li Zhou <email address hidden>
    Change-Id: I76169b7fd85069e26cfb37de8889cea006c57238

commit 7877dbc6baec4e3214a12ac0ae44db5491a22e9d
Author: Andy Ning <email address hidden>
Date: Fri Apr 16 10:46:13 2021 -0400

    Enforce "cannot reuse the last 2 passwords" password rule

    Currently the "remember" attribute in pam_pwhistory configuration
    is set to "2", which enforces "cannot reuse the last 1 passwords"
    in history instead of "cannot reuse the last 2 passwords" stated
    in security document.

    This update changed "remember" attribute to "3" so that the rule
    complies with the document.

    Closes-Bug: 1924772
    Signed-off-by: Andy Ning <email address hidden>
    Change-Id: I340152f8b8a572bc1e86f1eb4a14eb8e392f6334

commit e87383f6c328efeab2a9407daa33076a85739b96
Author: Eric MacDonald <email address hidden>
Date: Tue Apr 6 08:44:26 2021 -0400

    Comment out 'dateext' setting in logrotate.conf file

    This update comments out the 'dateext' setting to avoid
    log files being rotated with date as a default.

    Test Plan:

    PASS: Verify log rotation config files that don't
          specifically set dateext option are rotated
          by number rather than date.
    PASS: Verify system install

    Partial-Bug: 1918979
    Signed-off-by: Eric MacDonald <email address hidden>
    Change-Id: Ib68d86d1ec3f15abedce4c4059c3a8ec34b7d196

commit 35160afbdada2efe0ff567dd94ca1419903c87ad
Author: Nicolas Alvarez <email address hidden>
Date: Tue Dec 8 17:26:18 2020 -0300

    Disable SNMP Host-Based from config-files repo.

    Due to SNMP is going to be containerized, we disable
    it from starlingx/config-files repo.

    Story: 2008132
    Task: 41381
    Depends-On: https://review.opendev.org/765381
    Signed-off-by: Nicolas Alvarez <nic...

Read more...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)
Download full text (48.0 KiB)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/792029
Committed: https://opendev.org/starlingx/stx-puppet/commit/2b026190a3cb6d561b6ec4a46dfb3add67f1fa69
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 3e3940824dfb830ebd39fd93265b983c6a22fc51
Author: Dan Voiculeasa <email address hidden>
Date: Thu May 13 18:03:45 2021 +0300

    Enable kubelet support for pod pid limit

    Enable limiting the number of pids inside of pods.

    Add a default value to protect against a missing value.
    Default to 750 pids limit to align with service parameter default
    value for most resource consuming StarlingX optional app (openstack).
    In fact any value above service parameter minimum value is good for the
    default.

    Closes-Bug: 1928353
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I10c1684fe3145e0a46b011f8e87f7a23557ddd4a

commit 0c16d288fbc483103b7ba5dad7782e97f59f4e17
Author: Jessica Castelino <email address hidden>
Date: Tue May 11 10:21:57 2021 -0400

    Safe restart of the etcd SM service in etcd upgrade runtime class

    While upgrading the central cloud of a DC system, activation failed
    because there was an unexpected SWACT to controller-1. This was due
    to the etcd upgrade script. Part of this script runs the etcd
    manifest. This triggers a reload/restart of the etcd service. As this
    is done outside of the sm, sm saw the process failure and triggered
    the SWACT.

    This commit modifies platform::etcd::upgrade::runtime puppet class
    to do a safe restart of the etcd SM service and thus, solve the
    issue.

    Change-Id: I3381b6976114c77ee96028d7d96a00302ad865ec
    Signed-off-by: Jessica Castelino <email address hidden>
    Closes-Bug: 1928135

commit eec3008f600aeeb69a42338ed44332228a862d11
Author: Mihnea Saracin <email address hidden>
Date: Mon May 10 13:09:52 2021 +0300

    Serialize updates to global_filter in the AIO manifest

    Right now, looking at the aio manifest:
    https://review.opendev.org/c/starlingx/stx-puppet/+/780600/15/puppet-manifests/src/manifests/aio.pp
    there are 3 classes that update
    in parallel the lvm global_filter:
    - include ::platform::lvm::controller
    - include ::platform::worker::storage
    - include ::platform::lvm::compute
    And this generates some errors.

    We fix this by adding dependencies between the above classes
    in order to update the global_filter in a serial mode.

    Closes-Bug: 1927762
    Signed-off-by: Mihnea Saracin <email address hidden>
    Change-Id: If6971e520454cdef41138b2f29998c036d8307ff

commit 97371409b9b2ae3f0db6a6a0acaeabd74927160e
Author: Steven Webster <email address hidden>
Date: Fri May 7 15:33:43 2021 -0400

    Add SR-IOV rate-limit dependency

    Currently, the binding of an SR-IOV virtual function (VF) to a
    driver has a dependency on platform::networking. This is needed
    to ensure that SR-IOV is enabled (VFs created) before actually
    doing the bind.

    This dependency does not exist for configuring the VF rate-limits
    however. There is a cha...

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.