IPv6 Distributed Cloud: system controller controller-1 in reboot loop after lock/unlock

Bug #1844832 reported by Peng Peng on 2019-09-20
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Low
Assigned to: Ghada Khalil

Bug Description

Brief Description
-----------------
An IPv6 DC system controller was installed and a subcloud was added. To clear the system controller alarm "controller-1 Configuration is out-of-date.", controller-1 was locked and unlocked. After the unlock, controller-1 entered a reboot loop.

Severity
--------
Critical

Steps to Reproduce
------------------
1. Install the IPv6 DC system controller
2. Add a subcloud
3. Lock system controller controller-1
4. Unlock controller-1
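
The lock/unlock steps map to CLI commands; a minimal sketch, using the host names from this report:

[sysadmin@controller-0 ~(keystone_admin)]$ system host-lock controller-1
[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock controller-1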

TC-name: DC installation

Expected Behavior
------------------
After controller-1 is unlocked, it reboots once and comes back online, and the alarm "controller-1 Configuration is out-of-date." is cleared.
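
One way to verify this outcome after the unlock (a sketch; alarm ID 250.001 is taken from the alarm listing below, and an empty grep result would mean the alarm cleared):

[sysadmin@controller-0 ~(keystone_admin)]$ system host-show controller-1 | grep -E 'administrative|availability|config_status'
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list | grep 250.001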

Actual Behavior
----------------
controller-1 in reboot loop

Reproducibility
---------------
Seen once

System Configuration
--------------------
Distributed cloud
IPv6

Lab-name: DC
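
The configuration can be confirmed from the active controller; a sketch, assuming the usual 'system show' fields (names may differ by release):

[sysadmin@controller-0 ~(keystone_admin)]$ system show | grep -E 'system_mode|distributed_cloud_role'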

Branch/Pull Time/Commit
-----------------------
stx master as of "2019-09-17_20-00-00"

Last Pass
---------

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ system host-lock controller-1
+---------------------+--------------------------------------------+
| Property | Value |
+---------------------+--------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | offline |
| bm_ip | None |
| bm_type | None |
| bm_username | None |
| boot_device | /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0 |
| capabilities | {u'stor_function': u'monitor'} |
| config_applied | 5306f693-3216-4471-8e46-e0636023d0e5 |
| config_status | Config out-of-date |
| config_target | 423eebaf-0888-49ba-9366-74ea518abd03 |
| console | ttyS0,115200n8 |
| created_at | 2019-09-19T14:51:26.324167+00:00 |
| hostname | controller-1 |
| id | 2 |
| install_output | text |
| install_state | completed |
| install_state_info | None |
| inv_state | inventoried |
| invprovision | provisioned |
| location | {} |
| mgmt_ip | fd01:1::4 |
| mgmt_mac | 90:e2:ba:b0:e9:f4 |
| operational | disabled |
| personality | controller |
| reserved | False |
| rootfs_device | /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0 |
| serialid | None |
| software_load | 19.10 |
| subfunction_avail | failed |
| subfunction_oper | disabled |
| subfunctions | controller,worker |
| task | Locking |
| tboot | false |
| ttys_dcd | None |
| updated_at | 2019-09-20T13:53:06.342010+00:00 |
| uptime | 0 |
| uuid | 4c149540-883f-4925-abd9-e41e39eeb288 |
| vim_progress_status | services-disabled |
+---------------------+--------------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+--------------------------------------------------------------------------------------------------------+---------------------------------------+--------
| Alarm ID | Reason Text | Entity ID | Severit
+----------+--------------------------------------------------------------------------------------------------------+---------------------------------------+--------
| 200.001 | controller-1 was administratively locked to take it out-of-service. | host=controller-1 | warning
| | | |
| | | |
| 400.002 | Service group oam-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=oam-services |
| | | |
| 400.002 | Service group controller-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=controller-services |
| | | |
| 400.002 | Service group cloud-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=cloud-services |
| | | |
| 400.002 | Service group vim-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=vim-services |
| | | |
| 400.002 | Service group patching-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=patching-services |
| | | |
| 400.002 | Service group storage-monitoring-services loss of redundancy; expected 1 standby member but no standby | service_domain=controller. | major
| | members available | service_group=storage-monitoring- |
| | | services |
| | | |
| 400.002 | Service group distributed-cloud-services loss of redundancy; expected 1 standby member but no standby | service_domain=controller. | major
| | members available | service_group=distributed-cloud- |
| | | services |
| | | |
| 400.005 | Communication failure detected with peer over port ens801f0.133 on host controller-0 | host=controller-0.network=mgmt | major
| | | |
| | | |
| 400.005 | Communication failure detected with peer over port eno1 on host controller-0 | host=controller-0.network=oam | major
| | | |
| | | |
| 400.005 | Communication failure detected with peer over port ens801f0.134 on host controller-0 | host=controller-0.network=cluster- | major
| | | host |
| | | |
| 250.001 | controller-1 Configuration is out-of-date. | host=controller-1 | major
| | | |
| | | |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but only 1 active | service_domain=controller. | major
| | member available | service_group=directory-services |
| | | |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 active member | service_domain=controller. | major
| | available | service_group=web-services |
| | | |
| 400.002 | Service group storage-services loss of redundancy; expected 2 active members but only 1 active member | service_domain=controller. | major
| | available | service_group=storage-services |
| | | |
| 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' | cluster= | warning
| | for more details. | cdba6cc6-74c7-4586-a865-d57caa744c62 |
| | | |
| 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster= | major
| | | cdba6cc6-74c7-4586-a865-d57caa744c62. |
| | | peergroup=group-0.host=controller-1 |
| | | |
+----------+--------------------------------------------------------------------------------------------------------+---------------------------------------+--------
[sysadmin@controller-0 ~(keystone_admin)]$

collect log attached

Test Activity
-------------
DC installation

Eric MacDonald (rocksolidmtce) wrote :

Controller-1 failed configuration. Networking issue, I think.

2019-09-20T14:16:45.408 [15547.00049] controller-1 mtcClient com nodeUtil.cpp (1214) get_node_health :Error : controller-1 is UnHealthy

controller-1:~$ ls /var/run/.
./ .ceph_osd_service .node_locked
../ .ceph_osd_status .rpmdb_cleaned
.ceph_mon_service .config_fail .sm_watchdog_heartbeat
.ceph_mon_status .goenabled
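
As noted in the comments below, the presence of /var/run/.config_fail is the flag marking the failed configuration. A quick check on the node (a sketch):

controller-1:~$ test -f /var/run/.config_fail && echo "config_fail flag set"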

Controller-1 cannot reach off the host; sudo stalls out.

From user.log:

2019-09-20T14:16:41.000 controller-1 root: notice Error: Unable to contact active controller (controller-platform-nfs) from management address
2019-09-20T14:16:47.000 controller-1 root: notice Worker is not configured

controller-1:/var/log/puppet# cat /var/log/daemon.log | grep 14:16:4
2019-09-20T14:16:41.992 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.001 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.011 controller-1 worker_config[15544]: info Unable to contact active controller (controller-platform-nfs) from management address
2019-09-20T14:16:42.024 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.034 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.044 controller-1 worker_config[15544]: info Pausing for 5 seconds...
2019-09-20T14:16:46.994 controller-1 systemd[1]: notice workerconfig.service: main process exited, code=exited, status=1/FAILURE
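
The "Unable to contact active controller (controller-platform-nfs)" failure can be probed directly from controller-1; a sketch, with the hostname taken from the log above:

controller-1:~$ getent hosts controller-platform-nfs
controller-1:~$ ping -6 -c 2 controller-platform-nfs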

Eric MacDonald (rocksolidmtce) wrote :

Can't ping between controllers.

[sysadmin@controller-0 ~(keystone_admin)]$ ping -6 fd01:1::3
PING fd01:1::3(fd01:1::3) 56 data bytes
64 bytes from fd01:1::3: icmp_seq=1 ttl=64 time=0.033 ms
64 bytes from fd01:1::3: icmp_seq=2 ttl=64 time=0.036 ms

[sysadmin@controller-0 ~(keystone_admin)]$ ping -6 fd01:1::4
PING fd01:1::4(fd01:1::4) 56 data bytes

Same from the controller-1 viewpoint.
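
To narrow down where IPv6 management connectivity breaks, the address and neighbor state of the management interface are worth a look; a sketch, not from the original report, with the interface name ens801f0.133 taken from the 400.005 alarm above:

[sysadmin@controller-0 ~(keystone_admin)]$ ip -6 addr show dev ens801f0.133
[sysadmin@controller-0 ~(keystone_admin)]$ ip -6 neigh show dev ens801f0.133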

Eric MacDonald (rocksolidmtce) wrote :

Controller-1 failed configuration (/var/run/.config_fail is present) and cannot reach off the host; sudo stalls out. The same user.log and daemon.log entries quoted above show that it cannot contact the active controller (controller-platform-nfs) from the management address.

Networking issue, I think.

Ghada Khalil (gkhalil) on 2019-09-23
tags: added: stx.distcloud
Ghada Khalil (gkhalil) wrote :

This looks like a lab configuration/networking issue; waiting for further information from the lab team.

Changed in starlingx:
status: New → Incomplete
Ghada Khalil (gkhalil) wrote :

Confirmed that this was a temporary lab networking issue; closing as this doesn't appear to be a software issue.

Changed in starlingx:
importance: Undecided → Low
status: Incomplete → Invalid
assignee: nobody → Ghada Khalil (gkhalil)