Active controller became degraded after lock/unlock compute node

Bug #1856064 reported by Peng Peng on 2019-12-11
This bug affects 1 person
Affects: StarlingX
Importance: High
Assigned to: Dan Voiculeasa

Bug Description

Brief Description
-----------------
After locking and unlocking one compute node, the active controller became degraded and alarm 200.006 was raised.
After a force reboot of the active controller, the system recovered and the alarm was cleared.
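
A quick way to confirm the recovery described above (the same CLI used throughout this report, run from a keystone_admin session; the --os-* credential options are omitted):

    system host-list                # all hosts should return to unlocked/enabled/available
    fm alarm-list --nowrap --uuid   # alarm 200.006 should no longer be listed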

Severity
--------
Major

Steps to Reproduce
------------------
Lock and then unlock one compute node on a multi-node system, as in the Brief Description; a command sketch follows the TC-name below.

TC-name: mtc/test_lock_unlock_host.py::test_lock_unlock_host[compute]
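
A minimal command sketch of the scenario, using the same CLI calls captured in the logs below (host names match the WCP_3-6 layout; the --os-* credential options are omitted here, assuming a keystone_admin session):

    system host-lock compute-0
    system host-list                 # wait for compute-0: locked/disabled/online
    system host-unlock compute-0
    system host-list                 # controller-0 unexpectedly shows 'degraded'
    fm alarm-list --nowrap --uuid    # look for alarm 200.006 (host=controller-0.process=ceph)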

Expected Behavior
------------------
The lock/unlock completes and all hosts, including both controllers, remain unlocked/enabled/available with no alarms raised.

Actual Behavior
----------------
The active controller (controller-0) became degraded and alarm 200.006 was raised against its 'ceph' process.

Reproducibility
---------------
Unknown; first time this has been seen in sanity, will monitor.

System Configuration
--------------------
Multi-node system
IPv4

Lab-name: WCP_3-6

Branch/Pull Time/Commit
-----------------------
2019-12-10_20-00-00

Last Pass
---------
2019-12-10_20-00-00 on WP_8-12

Timestamp/Logs
--------------
[2019-12-11 08:58:20,124] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-12-11 08:58:21,300] 433 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2019-12-11 08:58:22,661] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-0'

[2019-12-11 08:59:40,320] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-0'

[2019-12-11 09:05:59,264] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-12-11 09:06:00,442] 433 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$

[2019-12-11 09:11:08,717] 311 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-12-11 09:11:09,693] 433 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------+--------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------+--------------------------------+----------+----------------------------+
| 26e10dab-15dd-45ee-b5ac-4ae73bb5db8d | 200.006 | controller-0 is degraded due to the failure of its 'ceph' process. Auto recovery of this major process is in progress. | host=controller-0.process=ceph | major | 2019-12-11T09:00:12.697608 |
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------+--------------------------------+----------+----------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$
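
As a hedged starting point for investigating the failed 'ceph' process behind alarm 200.006 (standard Ceph/Linux commands, not from the original report; the log path is the Ceph default and may differ by release):

    ceph -s                                                   # overall cluster health on the degraded controller
    ps -ef | grep ceph-mon                                    # is the monitor process actually running?
    sudo tail -n 50 /var/log/ceph/ceph-mon.controller-0.log   # recent monitor errors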

Test Activity
-------------
Sanity

Yang Liu (yliu12) on 2019-12-11
description: updated
Ghada Khalil (gkhalil) wrote :

Waiting for triage by Dan to understand whether this issue was introduced by recent code changes related to: https://review.opendev.org/#/c/695917/

Changed in starlingx:
status: New → Triaged
tags: added: stx.config stx.storage
Changed in starlingx:
assignee: nobody → Dan Voiculeasa (dvoicule)
status: Triaged → New
tags: removed: stx.config
Yang Liu (yliu12) on 2019-12-12
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

As per Frank Miller, this was introduced by https://review.opendev.org/#/c/695917/
Given that this change is in stx.3.0, we need this LP to be fixed in the next stx.3.0 maintenance release.

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
tags: added: stx.3.0
Peng Peng (ppeng) wrote :

Issue appears to have reproduced on:
Lab: WCP_71_75
Load: 2019-12-22_20-00-00

After a compute node force reboot, the active controller became degraded.

[2019-12-23 09:13:41,521] 166 INFO MainThread host_helper.reboot_hosts:: Rebooting compute-0
[2019-12-23 09:13:41,521] 311 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-12-23 09:15:51,899] 476 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-12-23 09:15:51,899] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-12-23 09:15:53,064] 433 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | disabled | offline |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | compute-2 | worker | unlocked | enabled | available |
| 5 | controller-1 | controller | unlocked | enabled | degraded |
+----+--------------+-------------+----------------+-------------+--------------+
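
A simple polling loop that would catch this transition after the forced reboot (sketch only, assuming a keystone_admin session; not part of the original test):

    # poll until any host reports 'degraded' availability
    while ! system host-list | grep -q degraded; do
        sleep 10
    done
    system host-list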

Frank Miller (sensfan22) wrote :

Dan has a proposed fix in stx-ceph:
https://github.com/starlingx-staging/stx-ceph/pull/36
