Distributed Cloud: host swact failed as the controller services group could not be disabled for the node to go standby
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | High | Matt Peters |
Bug Description
Brief Description
-----------------
Controller swact failed because SM (service management) could not disable the controller services group for the node to go standby.
Severity
--------
Critical - following the swact failure, the system got into a stalemate: platform services and system commands were unavailable until SM issued a node reboot roughly 50 minutes later.
Steps to Reproduce
------------------
1. Configure a distributed cloud with stx-monitor app applied on the SystemController
2. Lock the standby controller
3. Change platform memory allocation and wait for the platform-integ-apps reapply to complete
4. Unlock the standby controller
5. Perform host-swact
6. Lock the new standby controller to update its memory then unlock
7. Perform host-swact
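The lock/unlock/swact sequence above maps onto the StarlingX `system` CLI roughly as follows. This is a hedged sketch: it assumes controller-0 is initially active and controller-1 is standby, and it prints the commands rather than executing them unless `RUN=1` is set.

```shell
# Sketch of the repro sequence (assumes controller-0 active, controller-1 standby).
# Commands are echoed by default; set RUN=1 to actually execute them.
run() { echo "+ $*"; if [ "${RUN:-0}" = "1" ]; then "$@"; fi; }

run system host-lock controller-1     # step 2: lock the standby controller
run system host-unlock controller-1   # step 4: unlock after the memory change
run system host-swact controller-0    # step 5: swact away from controller-0
run system host-lock controller-0     # step 6: lock the new standby
run system host-unlock controller-0
run system host-swact controller-1    # step 7: second swact
```

The memory change in steps 3 and 6 is done via `system host-memory-modify` while the host is locked; the exact parameters depend on the deployment and are omitted here.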
Expected Behavior
------------------
Host swact is successful
Actual Behavior
----------------
Swact would fail at step 5, and the system took close to 1 hour to self-recover. Swact would then fail again at step 7.
Snippet of sm-customer.log
| 2020-03-
| 2020-03-
| 2020-03-
| 2020-03-
| 2020-03-
| 2020-03-
first host-swact command issued against controller-0:
2020-03-
first node reboot initiated by sm against controller-1 to recover the system:
2020-03-
second host-swact command issued against controller-1:
2020-03-
second node reboot initiated by sm against controller-0 to recover the system:
2020-03-
Reproducibility
---------------
Highly reproducible
System Configuration
--------------------
IPv6 distributed cloud with stx-monitor
Branch/Pull Time/Commit
-----------------------
Feb 22 master code
Last Pass
---------
Uncertain whether host-swact was previously verified in the configuration described above.
Timestamp/Logs
--------------
See attached for more logs
Test Activity
-------------
Evaluation
Workaround
----------
None documented.
Changed in starlingx:
assignee: Dariush Eslimi (deslimi) → Al Bailey (albailey1974)
Changed in starlingx:
assignee: nobody → Kevin Smith (kevin.smith.wrs)
Changed in starlingx:
assignee: Kevin Smith (kevin.smith.wrs) → Matt Peters (mpeters-wrs)
Changed in starlingx:
status: Triaged → In Progress
Key kernel logs indicate this is a "device is held open by someone" type issue. Sample logs at the time of the failure:
2020-03-04T12:44:54.819 controller-0 kernel: err [157361.484752] block drbd8: State change failed: Device is held open by someone
2020-03-04T12:44:54.819 controller-0 kernel: err [157361.487387] block drbd6: State change failed: Device is held open by someone
2020-03-04T12:44:55.373 controller-0 kernel: err [157361.986267] block drbd9: State change failed: Device is held open by someone
2020-03-04T12:44:55.373 controller-0 kernel: err [157361.987222] block drbd7: State change failed: Device is held open by someone
2020-03-04T12:44:55.373 controller-0 kernel: err [157361.987475] block drbd5: State change failed: Device is held open by someone
2020-03-04T12:45:09.006 controller-0 kernel: err [157375.714197] block drbd2: State change failed: Device is held open by someone
2020-03-04T12:45:09.472 controller-0 kernel: err [157376.178824] block drbd0: State change failed: Device is held open by someone
2020-03-04T12:45:10.991 controller-0 kernel: err [157377.694175] block drbd1: State change failed: Device is held open by someone
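When triaging this class of failure, it helps to pull out exactly which DRBD devices refused the state change. A minimal sketch that parses kernel log lines of the shape shown above (the two sample lines are reconstructed from the bug's snippet; real logs would be read from a file):

```python
import re

# Sample kernel log lines, reconstructed from the bug's snippet.
LOG = """\
2020-03-04T12:44:54.819 controller-0 kernel: err [157361.484752] block drbd8: State change failed: Device is held open by someone
2020-03-04T12:45:09.006 controller-0 kernel: err [157375.714197] block drbd2: State change failed: Device is held open by someone
"""

# Match the device name and the failure reason in a "State change failed" line.
PATTERN = re.compile(r"block (drbd\d+): State change failed: (.+)")

def failed_devices(log_text):
    """Return (device, reason) tuples for each failed DRBD state change."""
    return [m.groups() for line in log_text.splitlines()
            if (m := PATTERN.search(line))]

print(failed_devices(LOG))
```

On a live node, the holder of an open DRBD device can then be identified with standard tools such as `fuser -v /dev/drbdN`.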