IPv6 Distributed Cloud: system controller controller-1 in reboot loop after lock/unlock
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Invalid | Low | Ghada Khalil |
Bug Description
Brief Description
-----------------
An IPv6 Distributed Cloud system controller was installed and a subcloud was added. To clear the alarm "controller-1 Configuration is out-of-date." on the system controller, controller-1 was locked and unlocked. After the unlock, controller-1 entered a reboot loop.
Severity
--------
Critical
Steps to Reproduce
------------------
IPv6 DC system controller installed
subcloud added
lock system controller controller-1
unlock controller-1
TC-name: DC installation
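The lock/unlock in the steps above is done with the stock sysinv CLI; a minimal transcript, assuming an admin session on the active system controller (`fm alarm-list` is the same command whose output appears in the logs below):

```shell
# Run on the active system controller after the subcloud has been added.
source /etc/platform/openrc        # load admin credentials
fm alarm-list                      # confirm the 250.001 "Configuration is out-of-date" alarm on controller-1
system host-lock controller-1      # administratively lock controller-1 (raises 200.001)
system host-unlock controller-1    # unlock; the host reboots to apply the updated configuration
```

These commands need a live StarlingX deployment, so the transcript is illustrative rather than runnable here.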
Expected Behavior
------------------
After controller-1 is unlocked, it should reboot and come back online, and the alarm "controller-1 Configuration is out-of-date." should clear.
Actual Behavior
----------------
controller-1 in reboot loop
Reproducibility
---------------
Seen once
System Configuration
--------------------
Distributed cloud
IPv6
Lab-name: DC
Branch/Pull Time/Commit
-----------------------
stx master as of "2019-09-
Last Pass
---------
Timestamp/Logs
--------------
[sysadmin@
+------
| Property | Value |
+------
| action | none |
| administrative | unlocked |
| availability | offline |
| bm_ip | None |
| bm_type | None |
| bm_username | None |
| boot_device | /dev/disk/
| capabilities | {u'stor_function': u'monitor'} |
| config_applied | 5306f693-
| config_status | Config out-of-date |
| config_target | 423eebaf-
| console | ttyS0,115200n8 |
| created_at | 2019-09-
| hostname | controller-1 |
| id | 2 |
| install_output | text |
| install_state | completed |
| install_state_info | None |
| inv_state | inventoried |
| invprovision | provisioned |
| location | {} |
| mgmt_ip | fd01:1::4 |
| mgmt_mac | 90:e2:ba:b0:e9:f4 |
| operational | disabled |
| personality | controller |
| reserved | False |
| rootfs_device | /dev/disk/
| serialid | None |
| software_load | 19.10 |
| subfunction_avail | failed |
| subfunction_oper | disabled |
| subfunctions | controller,worker |
| task | Locking |
| tboot | false |
| ttys_dcd | None |
| updated_at | 2019-09-
| uptime | 0 |
| uuid | 4c149540-
| vim_progress_status | services-disabled |
+------
[sysadmin@
+------
| Alarm ID | Reason Text | Entity ID | Severity |
+------
| 200.001 | controller-1 was administratively locked to take it out-of-service. | host=controller-1 | warning |
| 400.002 | Service group oam-services loss of redundancy; expected 1 standby member but no standby members available | service_ | |
| 400.002 | Service group controller-services loss of redundancy; expected 1 standby member but no standby members available | service_ | |
| 400.002 | Service group cloud-services loss of redundancy; expected 1 standby member but no standby members available | service_ | |
| 400.002 | Service group vim-services loss of redundancy; expected 1 standby member but no standby members available | service_ | |
| 400.002 | Service group patching-services loss of redundancy; expected 1 standby member but no standby members available | service_ | |
| 400.002 | Service group storage- … members available | service_ | |
| 400.002 | Service group distributed- … members available | service_ | |
| 400.005 | Communication failure detected with peer over port ens801f0.133 on host controller-0 | host=controller | |
| 400.005 | Communication failure detected with peer over port eno1 on host controller-0 | host=controller | |
| 400.005 | Communication failure detected with peer over port ens801f0.134 on host controller-0 | host=controller | |
| 250.001 | controller-1 Configuration is out-of-date. | host=controller-1 | major |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but only 1 active member available | service_ | |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 active member available | service_ | |
| 400.002 | Service group storage-services loss of redundancy; expected 2 active members but only 1 active member available | service_ | |
| 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details. | cluster=cdba6cc6- | warning |
| 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster=cdba6cc6- peergroup= | major |
+------
[sysadmin@
collect log attached
Test Activity
-------------
DC installation
tags: added: stx.distcloud
Controller-1 failed configuration. This looks like a networking issue.
2019-09-20T14:16:45.408 [15547.00049] controller-1 mtcClient com nodeUtil.cpp (1214) get_node_health :Error : controller-1 is UnHealthy
controller-1:~$ ls /var/run/
./                 .ceph_osd_service  .heartbeat
../                .ceph_osd_status   .node_locked
.ceph_mon_service  .config_fail       .rpmdb_cleaned
.ceph_mon_status   .goenabled         .sm_watchdog_
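The `.config_fail` flag in that listing is what drives the UnHealthy verdict in the mtcClient log line above. A minimal sketch of that flag-file convention (not the actual mtce code; a temp directory stands in for `/var/run`, and the flag names are taken from the listing):

```shell
# Emulate the volatile-flag health check: a node that has .config_fail set
# is reported UnHealthy even if .goenabled is also present.
RUNDIR=$(mktemp -d)
touch "$RUNDIR/.goenabled"      # goenabled tests passed at some point
touch "$RUNDIR/.config_fail"    # but manifest application failed

if [ -f "$RUNDIR/.config_fail" ]; then
    HEALTH="UnHealthy"          # matches "controller-1 is UnHealthy"
elif [ -f "$RUNDIR/.goenabled" ]; then
    HEALTH="Healthy"
else
    HEALTH="Unknown"
fi
echo "controller-1 is $HEALTH"
rm -rf "$RUNDIR"
```

With `.config_fail` present the node stays unhealthy regardless of `.goenabled`, which matches the `get_node_health` error at 14:16:45.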
Controller-1 cannot reach off the host; sudo stalls out.
From user.log:

2019-09-20T14:16:41.000 controller-1 root: notice Error: Unable to contact active controller (controller-platform-nfs) from management address
2019-09-20T14:16:47.000 controller-1 root: notice Worker is not configured

controller-1:/var/log/puppet# cat /var/log/daemon.log | grep 14:16:4
2019-09-20T14:16:41.992 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.001 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.011 controller-1 worker_config[15544]: info Unable to contact active controller (controller-platform-nfs) from management address
2019-09-20T14:16:42.024 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.034 controller-1 worker_config[15544]: info *****************************************************
2019-09-20T14:16:42.044 controller-1 worker_config[15544]: info Pausing for 5 seconds...
2019-09-20T14:16:46.994 controller-1 systemd[1]: notice workerconfig.service: main process exited, code=exited, status=1/FAILURE
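The daemon.log excerpt shows the pattern behind the reboot loop: worker_config cannot reach controller-platform-nfs over the IPv6 management network, pauses, and the workerconfig unit eventually exits with status 1, so the node fails configuration and cycles again. A hedged sketch of that retry-then-fail shape (`check_controller`, the attempt cap, and the zero-length sleep are illustrative stand-ins, not the real script):

```shell
# Stand-in for the real reachability test against controller-platform-nfs;
# forced to fail here to reproduce the failure path from the log.
check_controller() {
    return 1
}

attempts=0
max_attempts=3          # the real script retries for much longer
exit_status=0
until check_controller; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge "$max_attempts" ]; then
        echo "Unable to contact active controller (controller-platform-nfs)"
        exit_status=1   # systemd sees status=1/FAILURE and the node fails config
        break
    fi
    echo "Pausing for 5 seconds..."
    sleep 0             # real script sleeps 5; shortened so the sketch runs fast
done
```

Once workerconfig fails, the host is flagged with `.config_fail` and maintenance reboots it, which repeats until the management-network reachability problem is fixed.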