IPv6 Distributed Cloud: system controller controller-1 in reboot loop after lock/unlock

Bug #1844832 reported by Peng Peng on 2019-09-20
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Low
Assigned to: Ghada Khalil

Bug Description

Brief Description
-----------------
An IPv6 DC system controller was installed and a subcloud was added. To clear the system controller alarm "controller-1 Configuration is out-of-date.", controller-1 was locked and unlocked. After the unlock, controller-1 entered a reboot loop.

Severity
--------
Critical

Steps to Reproduce
------------------
1. Install the IPv6 DC system controller
2. Add a subcloud
3. Lock system controller controller-1
4. Unlock controller-1
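
The lock/unlock steps map to CLI commands; a minimal sketch, using the host names from this report:

[sysadmin@controller-0 ~(keystone_admin)]$ system host-lock controller-1
[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock controller-1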

TC-name: DC installation

Expected Behavior
------------------
After controller-1 is unlocked, it reboots once and comes back online, and the alarm "controller-1 Configuration is out-of-date." is cleared.
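
One way to verify this outcome after the unlock (a sketch; alarm ID 250.001 is taken from the alarm listing below, and an empty grep result would mean the alarm cleared):

[sysadmin@controller-0 ~(keystone_admin)]$ system host-show controller-1 | grep -E 'administrative|availability|config_status'
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list | grep 250.001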

Actual Behavior
----------------
controller-1 in reboot loop

Reproducibility
---------------
Seen once

System Configuration
--------------------
Distributed cloud
IPv6

Lab-name: DC
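
The configuration can be confirmed from the active controller; a sketch, assuming the usual 'system show' fields (names may differ by release):

[sysadmin@controller-0 ~(keystone_admin)]$ system show | grep -E 'system_mode|distributed_cloud_role'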

Branch/Pull Time/Commit
-----------------------
stx master as of "2019-09-17_20-00-00"

Last Pass
---------

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ system host-lock controller-1
+---------------------+--------------------------------------------+
| Property | Value |
+---------------------+--------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | offline |
| bm_ip | None |
| bm_type | None |
| bm_username | None |
| boot_device | /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0 |
| capabilities | {u'stor_function': u'monitor'} |
| config_applied | 5306f693-3216-4471-8e46-e0636023d0e5 |
| config_status | Config out-of-date |
| config_target | 423eebaf-0888-49ba-9366-74ea518abd03 |
| console | ttyS0,115200n8 |
| created_at | 2019-09-19T14:51:26.324167+00:00 |
| hostname | controller-1 |
| id | 2 |
| install_output | text |
| install_state | completed |
| install_state_info | None |
| inv_state | inventoried |
| invprovision | provisioned |
| location | {} |
| mgmt_ip | fd01:1::4 |
| mgmt_mac | 90:e2:ba:b0:e9:f4 |
| operational | disabled |
| personality | controller |
| reserved | False |
| rootfs_device | /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0 |
| serialid | None |
| software_load | 19.10 |
| subfunction_avail | failed |
| subfunction_oper | disabled |
| subfunctions | controller,worker |
| task | Locking |
| tboot | false |
| ttys_dcd | None |
| updated_at | 2019-09-20T13:53:06.342010+00:00 |
| uptime | 0 |
| uuid | 4c149540-883f-4925-abd9-e41e39eeb288 |
| vim_progress_status | services-disabled |
+---------------------+--------------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+--------------------------------------------------------------------------------------------------------+---------------------------------------+--------
| Alarm ID | Reason Text | Entity ID | Severit
+----------+--------------------------------------------------------------------------------------------------------+---------------------------------------+--------
| 200.001 | controller-1 was administratively locked to take it out-of-service. | host=controller-1 | warning
| | | |
| | | |
| 400.002 | Service group oam-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=oam-services |
| | | |
| 400.002 | Service group controller-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=controller-services |
| | | |
| 400.002 | Service group cloud-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=cloud-services |
| | | |
| 400.002 | Service group vim-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=vim-services |
| | | |
| 400.002 | Service group patching-services loss of redundancy; expected 1 standby member but no standby members | service_domain=controller. | major
| | available | service_group=patching-services |
| | | |
| 400.002 | Service group storage-monitoring-services loss of redundancy; expected 1 standby member but no standby | service_domain=controller. | major
| | members available | service_group=storage-monitoring- |
| | | services |
| | | |
| 400.002 | Service group distributed-cloud-services loss of redundancy; expected 1 standby member but no standby | service_domain=controller. | major
| | members available | service_group=distributed-cloud- |
| | | services |
| | | |
| 400.005 | Communication failure detected with peer over port ens801f0.133 on host controller-0 | host=controller-0.network=mgmt | major
| | | |
| | | |
| 400.005 | Communication failure detected with peer over port eno1 on host controller-0 | host=controller-0.network=oam | major
| | | |
| | | |
| 400.005 | Communication failure detected with peer over port ens801f0.134 on host controller-0 | host=controller-0.network=cluster- | major
| | | host |
| | | |
| 250.001 | controller-1 Configuration is out-of-date. | host=controller-1 | major
| | | |
| | | |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but only 1 active | service_domain=controller. | major
| | member available | service_group=directory-services |
| | | |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 active member | service_domain=controller. | major
| | available | service_group=web-services |
| | | |
| 400.002 | Service group storage-services loss of redundancy; expected 2 active members but only 1 active member | service_domain=controller. | major
| | available | service_group=storage-services |
| | | |
| 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' | cluster= | warning
| | for more details. | cdba6cc6-74c7-4586-a865-d57caa744c62 |
| | | |
| 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster= | major
| | | cdba6cc6-74c7-4586-a865-d57caa744c62. |
| | | peergroup=group-0.host=controller-1 |
| | | |
+----------+--------------------------------------------------------------------------------------------------------+---------------------------------------+--------
[sysadmin@controller-0 ~(keystone_admin)]$

collect log attached

Test Activity
-------------
DC installation

Eric MacDonald (rocksolidmtce) wrote :

Controller-1 failed configuration. Networking issue, I think.

2019-09-20T14:16:45.408 [15547.00049] controller-1 mtcClient com nodeUtil.cpp (1214) get_node_health :Error : controller-1 is UnHealthy

controller-1:~$ ls /var/run/.
./ .ceph_osd_service .node_locked
../ .ceph_osd_status .rpmdb_cleaned
.ceph_mon_service .config_fail .sm_watchdog_heartbeat
.ceph_mon_status .goenabled
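
As noted in the comments below, the presence of /var/run/.config_fail is the flag marking the failed configuration. A quick check on the node (a sketch):

controller-1:~$ test -f /var/run/.config_fail && echo "config_fail flag set"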

Controller-1 cannot reach off the host; sudo stalls out.

From user.log:

2019-09-20T14:16:41.000 controller-1 root: notice Error: Unable to contact active controller (controller-platform-nfs) from management address
2019-09-20T14:16:47.000 controller-1 root: notice Worker is not configured

controller-1:/var/log/puppet# cat /var/log/daemon.log | grep 14:16:4
2019-09-20T14:16:41.992 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.001 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.011 controller-1 worker_config[15544]: info Unable to contact active controller (controller-platform-nfs) from management address
2019-09-20T14:16:42.024 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.034 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.044 controller-1 worker_config[15544]: info Pausing for 5 seconds...
2019-09-20T14:16:46.994 controller-1 systemd[1]: notice workerconfig.service: main process exited, code=exited, status=1/FAILURE
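
The "Unable to contact active controller (controller-platform-nfs)" failure can be probed directly from controller-1; a sketch, with the hostname taken from the log above:

controller-1:~$ getent hosts controller-platform-nfs
controller-1:~$ ping -6 -c 2 controller-platform-nfs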

Eric MacDonald (rocksolidmtce) wrote :

Can't ping between controllers.

[sysadmin@controller-0 ~(keystone_admin)]$ ping -6 fd01:1::3
PING fd01:1::3(fd01:1::3) 56 data bytes
64 bytes from fd01:1::3: icmp_seq=1 ttl=64 time=0.033 ms
64 bytes from fd01:1::3: icmp_seq=2 ttl=64 time=0.036 ms

[sysadmin@controller-0 ~(keystone_admin)]$ ping -6 fd01:1::4
PING fd01:1::4(fd01:1::4) 56 data bytes

Same from the controller-1 viewpoint.
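
To narrow down where IPv6 management connectivity breaks, the address and neighbor state of the management interface are worth a look; a sketch, not from the original report, with the interface name ens801f0.133 taken from the 400.005 alarm above:

[sysadmin@controller-0 ~(keystone_admin)]$ ip -6 addr show dev ens801f0.133
[sysadmin@controller-0 ~(keystone_admin)]$ ip -6 neigh show dev ens801f0.133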

Eric MacDonald (rocksolidmtce) wrote :

Controller-1 failed configuration (/var/run/.config_fail is present) and cannot reach off the host; sudo stalls out. The same user.log and daemon.log entries quoted above show that it cannot contact the active controller (controller-platform-nfs) from the management address.

Networking issue, I think.

Ghada Khalil (gkhalil) on 2019-09-23
tags: added: stx.distcloud
Ghada Khalil (gkhalil) wrote :

This looks like a lab configuration/networking issue; waiting for further information from the lab team.

Changed in starlingx:
status: New → Incomplete
Ghada Khalil (gkhalil) wrote :

Confirmed that this was a temporary lab networking issue; closing as this doesn't appear to be a software issue.

Changed in starlingx:
importance: Undecided → Low
status: Incomplete → Invalid
assignee: nobody → Ghada Khalil (gkhalil)