dbmon timeouts are too low
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | High | Bin Qian |
Bug Description
Brief Description
-----------------
The dbmon OCF script timeouts are too low, resulting in unnecessary failures on a heavily loaded system (e.g. on an AIO-DX system during a swact).
Severity
--------
Major: I believe this can delay failure recovery (e.g. in spontaneous controller reboots).
Steps to Reproduce
------------------
1. Install an AIO-DX (two node) system.
2. Launch a good number of instances (e.g. at least 8).
3. Perform controller maintenance actions (e.g. lock/unlock, force reboot, etc.).
Expected Behavior
------------------
The dbmon OCF script (and associated SM service) should only report failures when there is an actual failure. The fix may involve changes to the dbmon OCF script and possibly changes to the timeouts configured in SM for dbmon.
Actual Behavior
----------------
Due to the low timeouts in the dbmon OCF script (e.g. 5s for a kubectl command to complete), errors are sometimes incorrectly reported.
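As an illustration of the kind of change that could help, here is a minimal shell sketch of a wrapper that runs a command under a configurable timeout and retries only when the command timed out. It is not taken from the dbmon OCF script: the names DBMON_CMD_TIMEOUT, DBMON_CMD_RETRIES and run_with_timeout, and the default values, are assumptions made for this example. It relies on the GNU coreutils `timeout` command, which exits with status 124 when the time limit is reached.

```shell
#!/bin/bash
# Hypothetical settings (not from the dbmon script): seconds allowed per
# attempt and the number of attempts before a failure is reported.
DBMON_CMD_TIMEOUT="${DBMON_CMD_TIMEOUT:-20}"
DBMON_CMD_RETRIES="${DBMON_CMD_RETRIES:-3}"

# Run "$@" under a time limit. Retry only if the attempt timed out;
# a command that fails on its own is reported immediately.
run_with_timeout() {
    local attempt rc
    for attempt in $(seq 1 "$DBMON_CMD_RETRIES"); do
        rc=0
        timeout "$DBMON_CMD_TIMEOUT" "$@" || rc=$?
        if [ "$rc" -eq 0 ]; then
            return 0
        fi
        # GNU timeout exits with 124 when the command ran out of time.
        if [ "$rc" -ne 124 ]; then
            return "$rc"
        fi
        # Brief pause before the next attempt on a loaded system.
        sleep 1
    done
    return "$rc"
}
```

A monitor function could then call, for example, `run_with_timeout kubectl get pods` and treat only a non-zero return as a failure. Whether retrying, a longer single timeout, or both is appropriate depends on the monitor interval and timeout that SM configures for dbmon.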
Reproducibility
---------------
Intermittent
System Configuration
-------
AIO-DX (two node system)
Branch/Pull Time/Commit
-------
Designer load:
BUILD_DATE=
Last Pass
---------
Unknown
Timestamp/Logs
--------------
N/A
Test Activity
-------------
Developer testing
Changed in starlingx:
assignee: wanghao (wanghao749) → Bin Qian (bqian20)
Changed in starlingx:
assignee: Bin Qian (bqian20) → David Sullivan (dsullivanwr)
Changed in starlingx:
assignee: David Sullivan (dsullivanwr) → Bin Qian (bqian20)
tags: added: in-r-stx20
Marking as high priority / stx.2.0 gating as this results in delayed recovery.