Secondary controller is administratively locked

Bug #1832269 reported by Cristopher Lemus
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Al Bailey

Bug Description

Brief Description
-----------------
During StarlingX provisioning with the 20190609T233000Z load, the secondary controller is automatically administratively locked.

Severity
--------
Critical. stx-openstack cannot be applied unless all nodes are online.

Steps to Reproduce
------------------
Follow the wiki setup. During the stx-openstack apply, controller-1 (the secondary controller) is automatically administratively locked.
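
For reference, a minimal sketch of the flow, assuming the stx-openstack tarball has already been uploaded and using the standard system CLI (command names only; exact options may vary by load):

# All hosts should be unlocked/enabled/available before the apply
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
# Start the apply and poll its progress
[wrsroot@controller-0 ~(keystone_admin)]$ system application-apply stx-openstack
[wrsroot@controller-0 ~(keystone_admin)]$ system application-list
# While the apply runs, controller-1 unexpectedly flips to locked/disabled/online
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list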

Expected Behavior
------------------
controller-1 is unlocked/enabled/available during the stx-openstack apply.

Actual Behavior
----------------
controller-1 is automatically (administratively) locked and ends up locked/disabled/online. The stx-openstack apply fails.
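
The 200.001 alarm shown further below proposes administratively unlocking the host to recover; a hedged sketch of that step, using the standard host commands:

# Confirm the lock state, then request an unlock of controller-1
[wrsroot@controller-0 ~(keystone_admin)]$ system host-show controller-1 | grep -E 'administrative|operational|availability'
[wrsroot@controller-0 ~(keystone_admin)]$ system host-unlock controller-1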

Reproducibility
---------------
100%

System Configuration
--------------------
Duplex and Controller storage

Branch/Pull Time/Commit
-----------------------
20190609T233000Z

Last Pass
---------
This didn't happen with the CENGN ISO from 20190604T144018Z.

Timestamp/Logs
--------------
A full collect log from a standard configuration is attached. I couldn't find any relevant error message stating why controller-1 was automatically locked. Here are the messages:

System host-list:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | compute-0    | worker      | unlocked       | enabled     | available    |
| 3  | compute-1    | worker      | unlocked       | enabled     | available    |
| 4  | controller-1 | controller  | locked         | disabled    | online       |
+----+--------------+-------------+----------------+-------------+--------------+

From fm alarm-list:
+----------+-------------------------------------------------------------------------------------------------------+--------------------------------------+----------+-------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------------------------------------------------------------------------------------------------+--------------------------------------+----------+-------------------+
| 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster=410ccc89-653a- | major | 2019-06-10T02:56: |
| | | 44b6-84c3-1d2341c9c6c9.peergroup= | | 59.627582 |
| | | group-0.host=controller-1 | | |
| | | | | |
| 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' | cluster=410ccc89-653a- | warning | 2019-06-10T02:56: |
| | for more details. | 44b6-84c3-1d2341c9c6c9 | | 59.347853 |
| | | | | |
| 400.002 | Service group oam-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major | 2019-06-10T02:56: |
| | available | service_group=oam-services | | 18.800418 |
| | | | | |
| 400.002 | Service group controller-services loss of redundancy; expected 1 standby member but no standby | service_domain=controller. | major | 2019-06-10T02:56: |
| | members available | service_group=controller-services | | 18.639414 |
| | | | | |
| 400.002 | Service group cloud-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major | 2019-06-10T02:56: |
| | available | service_group=cloud-services | | 18.476499 |
| | | | | |
| 400.002 | Service group vim-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major | 2019-06-10T02:56: |
| | available | service_group=vim-services | | 18.314474 |
| | | | | |
| 400.002 | Service group patching-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major | 2019-06-10T02:56: |
| | available | service_group=patching-services | | 18.153427 |
| | | | | |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but only 1 active | service_domain=controller. | major | 2019-06-10T02:56: |
| | member available | service_group=directory-services | | 17.989413 |
| | | | | |
| 200.001 | controller-1 was administratively locked to take it out-of-service. | host=controller-1 | warning | 2019-06-10T02:56: |
| | | | | 17.981421 |
| | | | | |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 active member | service_domain=controller. | major | 2019-06-10T02:56: |
| | available | service_group=web-services | | 17.829433 |
| | | | | |
| 400.002 | Service group storage-services loss of redundancy; expected 2 active members but only 1 active member | service_domain=controller. | major | 2019-06-10T02:56: |
| | available | service_group=storage-services | | 17.668428 |
| | | | | |
| 400.002 | Service group storage-monitoring-services loss of redundancy; expected 1 standby member but no | service_domain=controller. | major | 2019-06-10T02:56: |
| | standby members available | service_group=storage-monitoring- | | 17.546452 |
| | | services | | |
| | | | | |
| 250.001 | controller-1 Configuration is out-of-date. | host=controller-1 | major | 2019-06-10T02:00: |
| | | | | 37.098627 |
| | | | | |
| 250.001 | controller-0 Configuration is out-of-date. | host=controller-0 | major | 2019-06-10T02:00: |
| | | | | 37.035417 |
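
The 800.001 alarm above suggests checking 'ceph -s'; a hedged sketch of confirming the replication loss while controller-1 is locked, using the standard Ceph CLI:

# Cluster health; expect HEALTH_WARN with degraded/undersized PGs while one OSD is down
[wrsroot@controller-0 ~(keystone_admin)]$ ceph -s
# Per-OSD view; the OSD hosted on controller-1 should show as down
[wrsroot@controller-0 ~(keystone_admin)]$ ceph osd tree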

The fm alarm-show output for the 200.001 lock alarm is as follows:
[wrsroot@controller-0 ~(keystone_admin)]$ fm alarm-show 07fb0079-cf12-4117-b939-95c520a99eec
+------------------------+---------------------------------------------------------------------+
| Property | Value |
+------------------------+---------------------------------------------------------------------+
| alarm_id | 200.001 |
| alarm_state | set |
| alarm_type | operational-violation |
| degrade_affecting | False |
| entity_instance_id | host=controller-1 |
| entity_type_id | system.host |
| mgmt_affecting | True |
| probable_cause | out-of-service |
| proposed_repair_action | Administratively unlock Host to bring it back in-service. |
| reason_text | controller-1 was administratively locked to take it out-of-service. |
| service_affecting | True |
| severity | warning |
| suppression | False |
| suppression_status | unsuppressed |
| timestamp | 2019-06-10T02:56:17.981421 |
| uuid | 07fb0079-cf12-4117-b939-95c520a99eec |
+------------------------+---------------------------------------------------------------------+
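
Since no relevant error message was spotted in the collect logs, a hedged sketch of where one might grep for the origin of the lock (log file names are assumed from a typical StarlingX controller; the same files appear in the collect tarball):

# Maintenance and inventory logs normally record who requested the lock
[wrsroot@controller-0 ~(keystone_admin)]$ sudo grep -i 'controller-1' /var/log/mtcAgent.log | grep -i lock
[wrsroot@controller-0 ~(keystone_admin)]$ sudo grep -iE 'host-lock|administratively locked' /var/log/sysinv.log
# The VIM log shows whether the lock came from an orchestration action
[wrsroot@controller-0 ~(keystone_admin)]$ sudo grep -i 'lock' /var/log/nfv-vim.log | grep -i controller-1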

Test Activity
-------------
Sanity

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

According to the alarms, Ceph might be related. Another thing I noticed on the locked controller-1 is that the Calico interfaces are not created; is this expected because the host is locked?
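
A hedged sketch of how the Calico observation could be checked, assuming kubectl access from the active controller and Calico running in the kube-system namespace:

# calico-node must be running on a node before per-pod cali* interfaces are created there
[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get pods -n kube-system -o wide | grep calico
[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get nodes
# On controller-1 itself, list any Calico-created interfaces
[wrsroot@controller-1 ~]$ ip -o link show | grep -i cali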

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Please re-test with a newer load that includes the fix for https://bugs.launchpad.net/starlingx/+bug/1832237
The code (https://review.opendev.org/#/c/664263/) merged on June 10.

It's possible that the initial unlock didn't go through during the initial setup due to the bug above.
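
A hedged way to confirm that the load under test actually contains the June 10 fix, assuming the standard /etc/build.info location on the controller:

# BUILD_ID / BUILD_DATE identify the CENGN load; it should be from 20190610 or later
[wrsroot@controller-0 ~(keystone_admin)]$ cat /etc/build.info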

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

This was tested again with 20190611T000451Z.

Tested on all affected configurations; the issue was not reproduced. I think we can close this one.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Pretty sure this is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1832237
Cristopher also confirmed that it's not reproducible in loads that include the fix.

Changed in starlingx:
status: Incomplete → Fix Released
importance: Undecided → Critical
tags: added: stx.2.0 stx.containers
Changed in starlingx:
assignee: nobody → Al Bailey (albailey1974)