Secondary controller is administratively locked

Bug #1832269 reported by Cristopher Lemus
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Al Bailey

Bug Description

Brief Description
-----------------
During StarlingX provisioning with the 20190609T233000Z load, the secondary controller is automatically administratively locked.

Severity
--------
Critical. stx-openstack cannot be applied unless all nodes are online.

Steps to Reproduce
------------------
Follow the wiki setup. During the stx-openstack apply, controller-1 (the secondary controller) is automatically administratively locked.
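
For reference, a minimal sketch of the flow, assuming the stx-openstack tarball has already been uploaded and using the standard system CLI (command names only; exact options may vary by load):

# All hosts should be unlocked/enabled/available before the apply
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
# Start the apply and poll its progress
[wrsroot@controller-0 ~(keystone_admin)]$ system application-apply stx-openstack
[wrsroot@controller-0 ~(keystone_admin)]$ system application-list
# While the apply runs, controller-1 unexpectedly flips to locked/disabled/online
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list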

Expected Behavior
------------------
controller-1 is unlocked/enabled/available during the stx-openstack apply.

Actual Behavior
----------------
controller-1 is automatically (administratively) locked and ends up locked/disabled/online. The stx-openstack apply fails.
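
The 200.001 alarm shown further below proposes administratively unlocking the host to recover; a hedged sketch of that step, using the standard host commands:

# Confirm the lock state, then request an unlock of controller-1
[wrsroot@controller-0 ~(keystone_admin)]$ system host-show controller-1 | grep -E 'administrative|operational|availability'
[wrsroot@controller-0 ~(keystone_admin)]$ system host-unlock controller-1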

Reproducibility
---------------
100%

System Configuration
--------------------
Duplex and Controller storage

Branch/Pull Time/Commit
-----------------------
20190609T233000Z

Last Pass
---------
This didn't happen with the CENGN ISO from 20190604T144018Z.

Timestamp/Logs
--------------
A full collect log from a standard configuration is attached. I couldn't find any relevant error message stating why controller-1 was automatically locked. Here are the messages:

System host-list:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | compute-0    | worker      | unlocked       | enabled     | available    |
| 3  | compute-1    | worker      | unlocked       | enabled     | available    |
| 4  | controller-1 | controller  | locked         | disabled    | online       |
+----+--------------+-------------+----------------+-------------+--------------+

From fm alarm-list:
+----------+-------------------------------------------------------------------------------------------------------+--------------------------------------+----------+-------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------------------------------------------------------------------------------------------------+--------------------------------------+----------+-------------------+
| 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster=410ccc89-653a- | major | 2019-06-10T02:56: |
| | | 44b6-84c3-1d2341c9c6c9.peergroup= | | 59.627582 |
| | | group-0.host=controller-1 | | |
| | | | | |
| 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' | cluster=410ccc89-653a- | warning | 2019-06-10T02:56: |
| | for more details. | 44b6-84c3-1d2341c9c6c9 | | 59.347853 |
| | | | | |
| 400.002 | Service group oam-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major | 2019-06-10T02:56: |
| | available | service_group=oam-services | | 18.800418 |
| | | | | |
| 400.002 | Service group controller-services loss of redundancy; expected 1 standby member but no standby | service_domain=controller. | major | 2019-06-10T02:56: |
| | members available | service_group=controller-services | | 18.639414 |
| | | | | |
| 400.002 | Service group cloud-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major | 2019-06-10T02:56: |
| | available | service_group=cloud-services | | 18.476499 |
| | | | | |
| 400.002 | Service group vim-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major | 2019-06-10T02:56: |
| | available | service_group=vim-services | | 18.314474 |
| | | | | |
| 400.002 | Service group patching-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major | 2019-06-10T02:56: |
| | available | service_group=patching-services | | 18.153427 |
| | | | | |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but only 1 active | service_domain=controller. | major | 2019-06-10T02:56: |
| | member available | service_group=directory-services | | 17.989413 |
| | | | | |
| 200.001 | controller-1 was administratively locked to take it out-of-service. | host=controller-1 | warning | 2019-06-10T02:56: |
| | | | | 17.981421 |
| | | | | |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 active member | service_domain=controller. | major | 2019-06-10T02:56: |
| | available | service_group=web-services | | 17.829433 |
| | | | | |
| 400.002 | Service group storage-services loss of redundancy; expected 2 active members but only 1 active member | service_domain=controller. | major | 2019-06-10T02:56: |
| | available | service_group=storage-services | | 17.668428 |
| | | | | |
| 400.002 | Service group storage-monitoring-services loss of redundancy; expected 1 standby member but no | service_domain=controller. | major | 2019-06-10T02:56: |
| | standby members available | service_group=storage-monitoring- | | 17.546452 |
| | | services | | |
| | | | | |
| 250.001 | controller-1 Configuration is out-of-date. | host=controller-1 | major | 2019-06-10T02:00: |
| | | | | 37.098627 |
| | | | | |
| 250.001 | controller-0 Configuration is out-of-date. | host=controller-0 | major | 2019-06-10T02:00: |
| | | | | 37.035417 |
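
The 800.001 alarm above suggests checking 'ceph -s'; a hedged sketch of confirming the replication loss while controller-1 is locked, using the standard Ceph CLI:

# Cluster health; expect HEALTH_WARN with degraded/undersized PGs while one OSD is down
[wrsroot@controller-0 ~(keystone_admin)]$ ceph -s
# Per-OSD view; the OSD hosted on controller-1 should show as down
[wrsroot@controller-0 ~(keystone_admin)]$ ceph osd tree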

The fm alarm-show output for the 200.001 lock alarm is as follows:
[wrsroot@controller-0 ~(keystone_admin)]$ fm alarm-show 07fb0079-cf12-4117-b939-95c520a99eec
+------------------------+---------------------------------------------------------------------+
| Property | Value |
+------------------------+---------------------------------------------------------------------+
| alarm_id | 200.001 |
| alarm_state | set |
| alarm_type | operational-violation |
| degrade_affecting | False |
| entity_instance_id | host=controller-1 |
| entity_type_id | system.host |
| mgmt_affecting | True |
| probable_cause | out-of-service |
| proposed_repair_action | Administratively unlock Host to bring it back in-service. |
| reason_text | controller-1 was administratively locked to take it out-of-service. |
| service_affecting | True |
| severity | warning |
| suppression | False |
| suppression_status | unsuppressed |
| timestamp | 2019-06-10T02:56:17.981421 |
| uuid | 07fb0079-cf12-4117-b939-95c520a99eec |
+------------------------+---------------------------------------------------------------------+
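
Since no relevant error message was spotted in the collect logs, a hedged sketch of where one might grep for the origin of the lock (log file names are assumed from a typical StarlingX controller; the same files appear in the collect tarball):

# Maintenance and inventory logs normally record who requested the lock
[wrsroot@controller-0 ~(keystone_admin)]$ sudo grep -i 'controller-1' /var/log/mtcAgent.log | grep -i lock
[wrsroot@controller-0 ~(keystone_admin)]$ sudo grep -iE 'host-lock|administratively locked' /var/log/sysinv.log
# The VIM log shows whether the lock came from an orchestration action
[wrsroot@controller-0 ~(keystone_admin)]$ sudo grep -i 'lock' /var/log/nfv-vim.log | grep -i controller-1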

Test Activity
-------------
Sanity

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

According to the alarms, Ceph might be related. Another thing I noticed on the locked controller-1 is that the Calico interfaces are not created; is this expected because the host is locked?
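
A hedged sketch of how the Calico observation could be checked, assuming kubectl access from the active controller and Calico running in the kube-system namespace:

# calico-node must be running on a node before per-pod cali* interfaces are created there
[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get pods -n kube-system -o wide | grep calico
[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get nodes
# On controller-1 itself, list any Calico-created interfaces
[wrsroot@controller-1 ~]$ ip -o link show | grep -i cali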

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Please re-test with a newer load that includes the fix for https://bugs.launchpad.net/starlingx/+bug/1832237
The code (https://review.opendev.org/#/c/664263/) merged on June 10.

It's possible that the initial unlock didn't go through during the initial setup due to the bug above.
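
A hedged way to confirm that the load under test actually contains the June 10 fix, assuming the standard /etc/build.info location on the controller:

# BUILD_ID / BUILD_DATE identify the CENGN load; it should be from 20190610 or later
[wrsroot@controller-0 ~(keystone_admin)]$ cat /etc/build.info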

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

This was tested again with 20190611T000451Z.

Tested on all affected configurations; the issue was not reproduced. I think we can close this one.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Pretty sure this is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1832237
Cristopher also confirmed that it's not reproducible in loads that include the fix.

Changed in starlingx:
status: Incomplete → Fix Released
importance: Undecided → Critical
tags: added: stx.2.0 stx.containers
Changed in starlingx:
assignee: nobody → Al Bailey (albailey1974)