IPv6 Distributed Cloud: system controller controller-1 in reboot loop after lock/unlock

Bug #1844832 reported by Peng Peng
Affects: StarlingX
Status: Invalid
Importance: Low
Assigned to: Ghada Khalil
Milestone: (none)

Bug Description

Brief Description
-----------------
An IPv6 Distributed Cloud system controller was installed and a subcloud was added. To clear the system controller alarm "controller-1 Configuration is out-of-date.", controller-1 was locked and then unlocked. After the unlock, controller-1 entered a reboot loop instead of coming back online.

Severity
--------
Critical

Steps to Reproduce
------------------
Install the IPv6 DC system controller
Add a subcloud
Lock system controller controller-1
Unlock controller-1 (a command-line sketch of this sequence follows below)

TC-name: DC installation
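
For reference, a minimal command-line sketch of this sequence, assuming the standard StarlingX system CLI and admin credentials sourced from /etc/platform/openrc (paths and wait intervals are illustrative, adjust to the lab):

source /etc/platform/openrc         # load keystone_admin credentials
system host-lock controller-1       # lock the standby system controller
system host-show controller-1 | grep administrative   # wait until it reports 'locked'
system host-unlock controller-1     # unlock; the host should reboot once and come back
# watch the host recover after the unlock
watch -n 30 'system host-show controller-1 | grep -E "administrative|operational|availability"'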

Expected Behavior
------------------
After the unlock, controller-1 should reboot and come back online, and the alarm "controller-1 Configuration is out-of-date." should clear.
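
A quick check of the expected end state could look like this (a sketch; the alarm text and the config_status field are taken from the outputs below):

# once controller-1 is back online, the 250.001 alarm should no longer be listed
fm alarm-list | grep -F 'Configuration is out-of-date' || echo "config out-of-date alarm cleared"
system host-show controller-1 | grep -E 'config_status|operational|availability'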

Actual Behavior
----------------
controller-1 in reboot loop

Reproducibility
---------------
Seen once

System Configuration
--------------------
Distributed cloud
IPv6

Lab-name: DC

Branch/Pull Time/Commit
-----------------------
stx master as of "2019-09-17_20-00-00"

Last Pass
---------

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ system host-lock controller-1
+---------------------+--------------------------------------------+
| Property | Value |
+---------------------+--------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | offline |
| bm_ip | None |
| bm_type | None |
| bm_username | None |
| boot_device | /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0 |
| capabilities | {u'stor_function': u'monitor'} |
| config_applied | 5306f693-3216-4471-8e46-e0636023d0e5 |
| config_status | Config out-of-date |
| config_target | 423eebaf-0888-49ba-9366-74ea518abd03 |
| console | ttyS0,115200n8 |
| created_at | 2019-09-19T14:51:26.324167+00:00 |
| hostname | controller-1 |
| id | 2 |
| install_output | text |
| install_state | completed |
| install_state_info | None |
| inv_state | inventoried |
| invprovision | provisioned |
| location | {} |
| mgmt_ip | fd01:1::4 |
| mgmt_mac | 90:e2:ba:b0:e9:f4 |
| operational | disabled |
| personality | controller |
| reserved | False |
| rootfs_device | /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0 |
| serialid | None |
| software_load | 19.10 |
| subfunction_avail | failed |
| subfunction_oper | disabled |
| subfunctions | controller,worker |
| task | Locking |
| tboot | false |
| ttys_dcd | None |
| updated_at | 2019-09-20T13:53:06.342010+00:00 |
| uptime | 0 |
| uuid | 4c149540-883f-4925-abd9-e41e39eeb288 |
| vim_progress_status | services-disabled |
+---------------------+--------------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+--------------------------------------------------------------------------------------------------------+---------------------------------------+--------
| Alarm ID | Reason Text | Entity ID | Severit
+----------+--------------------------------------------------------------------------------------------------------+---------------------------------------+--------
| 200.001 | controller-1 was administratively locked to take it out-of-service. | host=controller-1 | warning
| | | |
| | | |
| 400.002 | Service group oam-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=oam-services |
| | | |
| 400.002 | Service group controller-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=controller-services |
| | | |
| 400.002 | Service group cloud-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=cloud-services |
| | | |
| 400.002 | Service group vim-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=vim-services |
| | | |
| 400.002 | Service group patching-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=patching-services |
| | | |
| 400.002 | Service group storage-monitoring-services loss of redundancy; expected 1 standby member but no standby | service_domain=controller. | major
| | members available | service_group=storage-monitoring- |
| | | services |
| | | |
| 400.002 | Service group distributed-cloud-services loss of redundancy; expected 1 standby member but no standby | service_domain=controller. | major
| | members available | service_group=distributed-cloud- |
| | | services |
| | | |
| 400.005 | Communication failure detected with peer over port ens801f0.133 on host controller-0 | host=controller-0.network=mgmt | major
| | | |
| | | |
| 400.005 | Communication failure detected with peer over port eno1 on host controller-0 | host=controller-0.network=oam | major
| | | |
| | | |
| 400.005 | Communication failure detected with peer over port ens801f0.134 on host controller-0 | host=controller-0.network=cluster- | major
| | | host |
| | | |
| 250.001 | controller-1 Configuration is out-of-date. | host=controller-1 | major
| | | |
| | | |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but only 1 active | service_domain=controller. | major
| | member available | service_group=directory-services |
| | | |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 active member | service_domain=controller. | major
| | available | service_group=web-services |
| | | |
| 400.002 | Service group storage-services loss of redundancy; expected 2 active members but only 1 active member | service_domain=controller. | major
| | available | service_group=storage-services |
| | | |
| 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' | cluster= | warning
| | for more details. | cdba6cc6-74c7-4586-a865-d57caa744c62 |
| | | |
| 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster= | major
| | | cdba6cc6-74c7-4586-a865-d57caa744c62. |
| | | peergroup=group-0.host=controller-1 |
| | | |
+----------+--------------------------------------------------------------------------------------------------------+---------------------------------------+--------
[sysadmin@controller-0 ~(keystone_admin)]$
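
To narrow the listing above to the alarms most relevant to this report, a plain grep over the same command is enough (a sketch; the alarm IDs are taken from the table above):

# 250.001 = configuration out-of-date, 400.005 = communication failure with peer
fm alarm-list | grep -E '250\.001|400\.005'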

Collect logs are attached.

Test Activity
-------------
DC installation

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Controller-1 failed configuration. Networking issue I think.

2019-09-20T14:16:45.408 [15547.00049] controller-1 mtcClient com nodeUtil.cpp (1214) get_node_health :Error : controller-1 is UnHealthy

controller-1:~$ ls /var/run/.
./ .ceph_osd_service .node_locked
../ .ceph_osd_status .rpmdb_cleaned
.ceph_mon_service .config_fail .sm_watchdog_heartbeat
.ceph_mon_status .goenabled

Controller-1 cannot reach off the host; sudo stalls out.

From User.log

2019-09-20T14:16:41.000 controller-1 root: notice Error: Unable to contact active controller (controller-platform-nfs) from management address
2019-09-20T14:16:47.000 controller-1 root: notice Worker is not configured

controller-1:/var/log/puppet# cat /var/log/daemon.log | grep 14:16:4
2019-09-20T14:16:41.992 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.001 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.011 controller-1 worker_config[15544]: info Unable to contact active controller (controller-platform-nfs) from management address
2019-09-20T14:16:42.024 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.034 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.044 controller-1 worker_config[15544]: info Pausing for 5 seconds...
2019-09-20T14:16:46.994 controller-1 systemd[1]: notice workerconfig.service: main process exited, code=exited, status=1/FAILURE
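
A rough sketch of the checks behind these observations, run on controller-1 (the flag-file names come from the /var/run listing above; controller-platform-nfs is the management alias worker_config reports it cannot reach):

# was the configuration-failure flag raised?
ls -a /var/run | grep -E '^\.(config_fail|node_locked|goenabled)$'

# can the active controller's platform-nfs alias be resolved and reached over IPv6?
getent hosts controller-platform-nfs
ping -6 -c 3 controller-platform-nfs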

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Can't ping between controllers.

[sysadmin@controller-0 ~(keystone_admin)]$ ping -6 fd01:1::3
PING fd01:1::3(fd01:1::3) 56 data bytes
64 bytes from fd01:1::3: icmp_seq=1 ttl=64 time=0.033 ms
64 bytes from fd01:1::3: icmp_seq=2 ttl=64 time=0.036 ms

[sysadmin@controller-0 ~(keystone_admin)]$ ping -6 fd01:1::4
PING fd01:1::4(fd01:1::4) 56 data bytes

The same result is seen from the controller-1 side.
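
A minimal follow-up connectivity check over the management addresses shown above (a sketch using standard iproute2/iputils tools; addresses taken from the ping output):

# from controller-0: neighbor state and reachability of controller-1's mgmt address
ip -6 neigh show | grep 'fd01:1::4'
ping -6 -c 3 fd01:1::4

# from controller-1: confirm the mgmt address is actually configured
ip -6 addr show | grep 'fd01:1::'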

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Controller-1 failed configuration (/var/run/.config_fail present)

Controller-1 cannot reach off the host; sudo stalls out.

user.log

2019-09-20T14:16:41.000 controller-1 root: notice Error: Unable to contact active controller (controller-platform-nfs) from management address
2019-09-20T14:16:47.000 controller-1 root: notice Worker is not configured

daemon.log shows that it cannot reach the peer controller:

controller-1:/var/log/puppet# cat /var/log/daemon.log | grep 14:16:4
2019-09-20T14:16:41.992 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.001 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.011 controller-1 worker_config[15544]: info Unable to contact active controller (controller-platform-nfs) from management address
2019-09-20T14:16:42.024 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.034 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.044 controller-1 worker_config[15544]: info Pausing for 5 seconds...
2019-09-20T14:16:46.994 controller-1 systemd[1]: notice workerconfig.service: main process exited, code=exited, status=1/FAILURE

Networking issue I think.
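
Given the 400.005 alarms against ports ens801f0.133 (mgmt) and ens801f0.134 (cluster-host), a link-level check along these lines would help separate a cabling/VLAN problem from a software one (a sketch; interface names are taken from the alarm text):

# is the management VLAN interface up and addressed on both controllers?
ip link show ens801f0.133
ip -6 addr show dev ens801f0.133
ethtool ens801f0 | grep -E 'Link detected|Speed'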

Ghada Khalil (gkhalil)
tags: added: stx.distcloud
Revision history for this message
Ghada Khalil (gkhalil) wrote :

This looks like a lab configuration/networking issue; waiting for further information from the lab team.

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Confirmed that this was a temporary lab networking issue; closing as this doesn't appear to be a software issue.

Changed in starlingx:
importance: Undecided → Low
status: Incomplete → Invalid
assignee: nobody → Ghada Khalil (gkhalil)