dbmon timeouts are too low

Bug #1837919 reported by Bart Wensley
Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   High         Bin Qian

Bug Description

Brief Description
-----------------
The dbmon OCF script timeouts are too low, resulting in unnecessary failures on a heavily loaded system (e.g. an AIO-DX system during a swact).

Severity
--------
Major: I believe this can delay failure recovery (e.g. in spontaneous controller reboots).

Steps to Reproduce
------------------
1. Install an AIO-DX (two node) system.
2. Launch a good number of instances (e.g. at least 8).
3. Perform controller maintenance actions (e.g. lock/unlock, force reboot).

Expected Behavior
------------------
The dbmon OCF script (and associated SM service) should only report failures when there is an actual failure. The fix may involve changes to the dbmon OCF script and possibly changes to the timeouts configured in SM for dbmon.

Actual Behavior
----------------
Due to the low timeouts in the dbmon OCF script (e.g. 5s for a kubectl command to complete), errors are sometimes incorrectly reported.
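
As an illustration of the failure mode (not the actual dbmon script, which is an OCF shell script), here is a minimal Python sketch that runs a command under a timeout and keeps "timed out" separate from "ran and failed". The helper name run_with_timeout and the three result strings are assumptions made for this example only.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the result as 'ok', 'failed', or 'timeout'.

    Hypothetical helper for illustration; it is not part of dbmon.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        # The command may still have been healthy, just slow.
        return "timeout"
    except OSError:
        # The command could not be started at all (e.g. binary missing).
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in for a slow kubectl call: a child process that sleeps 1 second.
    slow_cmd = [sys.executable, "-c", "import time; time.sleep(1)"]
    print(run_with_timeout(slow_cmd, timeout_s=0.2))
    print(run_with_timeout(slow_cmd, timeout_s=5))
```

With a tight limit the slow command is classified as a timeout even though it would have succeeded, which is the kind of false error described above.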

Reproducibility
---------------
Intermittent

System Configuration
--------------------
AIO-DX (two node system)

Branch/Pull Time/Commit
-----------------------
Designer load:
BUILD_DATE="2019-07-24 14:49:52 -0500"

Last Pass
---------
Unknown

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Developer testing

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as high priority / stx.2.0 gating as this results in delayed recovery.

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
tags: added: stx.2.0 stx.ha
Revision history for this message
Ghada Khalil (gkhalil) wrote :

To determine the acceptable timeouts, a characterization exercise is needed on AIO-DX to measure how long the typical controller actions are taking. See steps to reproduce in the bug description.
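
One possible shape for such a characterization, sketched in Python under assumptions: time the same command repeatedly while the maintenance action is in progress, then look at the high percentiles rather than the average. The function names measure_durations and percentile are hypothetical, and the command to measure (for instance whichever kubectl command dbmon issues) has to be supplied by the tester.

```python
import subprocess
import sys
import time


def measure_durations(cmd, runs):
    """Run cmd `runs` times and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.monotonic() - start)
    return durations


def percentile(samples, pct):
    """Nearest-rank percentile of samples; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Placeholder command; substitute the command under investigation.
    cmd = [sys.executable, "-c", "pass"]
    samples = measure_durations(cmd, runs=20)
    print("p95 (s):", percentile(samples, 95))
    print("max (s):", max(samples))
```

A timeout chosen comfortably above the worst duration observed under load would be less likely to fire on a healthy but busy system.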

Changed in starlingx:
assignee: nobody → wanghao (wanghao749)
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: wanghao (wanghao749) → Bin Qian (bqian20)
Revision history for this message
Bin Qian (bqian20) wrote :

dbmon is a minor service. A dbmon failure will not directly cause a system failover or panic. Although a dbmon failure raises a "false" warning on a busy system as described above, it does not block a host from going active or standby. Because dbmon is not part of mariadb cluster creation, it should recover once the system calms down, and the warning will then be cleared.

Changed in starlingx:
assignee: Bin Qian (bqian20) → David Sullivan (dsullivanwr)
Bin Qian (bqian20)
Changed in starlingx:
assignee: David Sullivan (dsullivanwr) → Bin Qian (bqian20)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to upstream (master)

Fix proposed to branch: master
Review: https://review.opendev.org/679774

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on upstream (master)

Change abandoned by Bin Qian (<email address hidden>) on branch: master
Review: https://review.opendev.org/679774
Reason: the target file has been relocated

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/680499

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/680499
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=cfd3686d8ec191053be061aacff8187dcef329c0
Submitter: Zuul
Branch: master

commit cfd3686d8ec191053be061aacff8187dcef329c0
Author: Bin Qian <email address hidden>
Date: Thu Sep 5 15:32:39 2019 -0400

    Extend timeout for the kubectl cmds in dbmon

    In AIO-DX, during a swact, the kubectl commands issued by dbmon
    respond more slowly than expected. dbmon reports an error when a
    kubectl command does not respond within 5 seconds; the 5-second
    timeout is too short.

    Extend the timeout to 10 seconds to avoid reporting unnecessary errors.

    Change-Id: Ie07c84e0a53c00ac78970bf6b06e6cf0b19479e1
    Closes-Bug: 1837919
    Signed-off-by: Bin Qian <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to upstream (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/681255

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to upstream (r/stx.2.0)

Reviewed: https://review.opendev.org/681255
Committed: https://git.openstack.org/cgit/starlingx/upstream/commit/?id=2b42ddf7d176f6d92ab08cd1d586c8eccc494f5f
Submitter: Zuul
Branch: r/stx.2.0

commit 2b42ddf7d176f6d92ab08cd1d586c8eccc494f5f
Author: Bin Qian <email address hidden>
Date: Tue Sep 10 09:23:42 2019 -0400

    Extend timeout for the kubectl cmds in dbmon

    In AIO-DX, during a swact, the kubectl commands issued by dbmon
    respond more slowly than expected. dbmon reports an error when a
    kubectl command does not respond within 5 seconds; the 5-second
    timeout is too short.

    Extend the timeout to 10 seconds to avoid reporting unnecessary errors.

    This change was originally merged into master from the review below.
    https://review.opendev.org/#/c/680499/
    Note that the file has been relocated in the master branch.

    Change-Id: I243a751437e7e2910f0fb3ae31b2facc4b7b74ab
    Closes-Bug: 1837919
    Signed-off-by: Bin Qian <email address hidden>

Ghada Khalil (gkhalil)
tags: added: in-r-stx20