controller-0 experienced a configuration failure after active controller reboot

Bug #1891658 reported by Peng Peng
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Paul-Ionut Vaduva

Bug Description

Brief Description
-----------------
In an SX system, after a force reboot of the active controller, the controller status became "available" for a very short period, then became stuck at "degraded", and alarm 200.011 "controller-0 experienced a configuration failure" was raised.

Severity
--------
Major

Steps to Reproduce
------------------
Force reboot the active controller: sudo reboot -f (see the verification sketch below)

TC-name: mtc/test_ungraceful_reboot.py::test_force_reboot_host[controller]
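
A minimal reproduction/verification sketch using the same commands seen in the logs below; sourcing /etc/platform/openrc for admin credentials is an assumption here:

# Assumes admin credentials are sourced, e.g. source /etc/platform/openrc
sudo reboot -f                                    # force reboot the active controller

# After the node comes back, poll its state and check for alarm 200.011
system host-show controller-0 | grep -E 'administrative|operational|availability'
fm alarm-list --nowrap --uuid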

Expected Behavior
------------------
controller status becomes "available" after reboot

Actual Behavior
----------------
controller status is stuck at "degraded" after reboot

Reproducibility
---------------
Unknown - first time this is seen in sanity

System Configuration
--------------------
One node system

Lab-name: WCP_112

Branch/Pull Time/Commit
-----------------------
2020-08-13_20-00-00

Last Pass
---------
2020-08-11_20-00-00

Timestamp/Logs
--------------
[2020-08-14 04:10:07,782] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-0'
[2020-08-14 04:10:09,370] 436 DEBUG MainThread ssh.expect :: Output:
+-----------------------+----------------------------------------------------------------------+
| Property | Value |
+-----------------------+----------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |

[2020-08-14 04:16:41,485] 314 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2020-08-14 04:38:42,016] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne show'
[2020-08-14 04:38:44,912] 436 DEBUG MainThread ssh.expect :: Output:
[Errno 111] Connection refused

[2020-08-14 04:39:09,370] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-0'
[2020-08-14 04:39:10,896] 436 DEBUG MainThread ssh.expect :: Output:
+-----------------------+----------------------------------------------------------------------+
| Property | Value |
+-----------------------+----------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |

[2020-08-14 04:39:22,645] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-0'
[2020-08-14 04:39:24,156] 436 DEBUG MainThread ssh.expect :: Output:
+-----------------------+----------------------------------------------------------------------+
| Property | Value |
+-----------------------+----------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | degraded |

[2020-08-14 05:50:37,247] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-0'
[2020-08-14 05:50:38,869] 436 DEBUG MainThread ssh.expect :: Output:
+-----------------------+----------------------------------------------------------------------+
| Property | Value |
+-----------------------+----------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | degraded |

[2020-08-14 05:56:50,214] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2020-08-14 05:56:51,834] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+----------------------------------------------------+-------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+----------------------------------------------------+-------------------+----------+----------------------------+
| bbdf4e6d-2329-44f6-8a1c-72abf1208cca | 200.011 | controller-0 experienced a configuration failure. | host=controller-0 | critical | 2020-08-14T04:39:22.763330 |
+--------------------------------------+----------+----------------------------------------------------+-------------------+----------+----------------------------+
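
For reference, the full alarm record can be queried by UUID; a sketch assuming the fm CLI's alarm-show subcommand, with the UUID taken from the listing above:

fm alarm-show bbdf4e6d-2329-44f6-8a1c-72abf1208cca   # show full details of the 200.011 alarm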

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
tags: added: stx.retestneeded
Revision history for this message
Difu Hu (difuhu) wrote :

A similar issue occurred on AIO-DX wp_13_14, build 2020-08-27_20-00-00.
After the initial installation, controller-1 was in "degraded" status, and alarm 200.011 "controller-1 experienced a configuration failure" was raised.
log: https://files.starlingx.kube.cengn.ca/launchpad/1891658/wp_13_14_ALL_NODES_20200828.133937.tar

+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | degraded |
+----+--------------+-------------+----------------+-------------+--------------+

+--------------------------------------+----------+----------------------------------------------------+-------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+----------------------------------------------------+-------------------+----------+----------------------------+
| deba472a-6437-480c-86a3-0846993e675f | 200.011 | controller-1 experienced a configuration failure. | host=controller-1 | critical | 2020-08-28T03:44:03.261544 |
+--------------------------------------+----------+----------------------------------------------------+-------------------+----------+----------------------------+

Ghada Khalil (gkhalil)
tags: added: stx.config
Revision history for this message
Ghada Khalil (gkhalil) wrote :

This is likely a failure in puppet/sysinv. Since the initial occurrences in August, it is not clear whether this has been seen again.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.5.0 / medium priority - logs should be reviewed to determine what failed in the two cases above

tags: added: stx.5.0
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Paul-Ionut Vaduva (pvaduva)
Revision history for this message
Paul-Ionut Vaduva (pvaduva) wrote :

Boot
2020-08-14T03:01:25.255 localhost kernel: notice [ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-1127.el7.4.tis.x86_64 root=UUID=170af5d4-98ab-4909-b0da-dd83f3da53dc ro module_blacklist=integrity,ima audit=0 tboot=false crashkernel=auto console=ttyS0,115200n8 iommu=pt usbcore.autosuspend=-1 hugepagesz=2M hugepages=0 default_hugepagesz=2M rcu_nocbs=2-43 kthread_cpus=0,1 irqaffinity=0,1 selinux=0 enforcing=0 nmi_watchdog=panic,1 softlockup_panic=1 intel_iommu=on biosdevname=0 kvm-intel.eptad=0 user_namespace.enable=1

Runtime manifests
2020-08-14 03:11:16.861 82634 INFO sysinv.conductor.manager [-] applying runtime manifest config_uuid=674e7e14-de1c-426b-82dc-43ba718d555b
2020-08-14 03:11:16.882 82634 INFO sysinv.agent.rpcapi [-] config_apply_runtime_manifest: fanout_cast: sending config 674e7e14-de1c-426b-82dc-43ba718d555b {'force': False, 'personalities': ['worker', 'storage']} to agent
2020-08-14 03:11:16.893 82634 INFO sysinv.conductor.manager [-] applying runtime manifest config_uuid=70559c72-155b-47f0-8217-83aa7e8d5bd1, classes: ['openstack::horizon::runtime']
2020-08-14 03:11:16.912 82634 INFO sysinv.agent.rpcapi [-] config_apply_runtime_manifest: fanout_cast: sending config 70559c72-155b-47f0-8217-83aa7e8d5bd1 {'classes': ['openstack::horizon::runtime'], 'force': False, 'personalities': ['controller']} to agent
2020-08-14 03:11:29.353 82634 INFO sysinv.conductor.manager [-] applying runtime manifest config_uuid=28b15c54-9267-4edd-8c54-e3cd212e8dea, classes: ['platform::compute::grub::runtime', 'platform::compute::config::runtime']
2020-08-14 03:11:29.372 82634 INFO sysinv.agent.rpcapi [-] config_apply_runtime_manifest: fanout_cast: sending config 28b15c54-9267-4edd-8c54-e3cd212e8dea {'classes': ['platform::compute::grub::runtime', 'platform::compute::config::runtime'], 'force': False, 'personalities': ['controller', 'worker'], 'host_uuids': [u'535a16f2-525f-44ca-b6d6-27a52165dc12']} to agent
2020-08-14 03:11:33.847 82634 INFO sysinv.conductor.manager [-] applying runtime manifest config_uuid=1621013d-1648-49fa-9293-1d2561a8d419, classes: ['openstack::keystone::endpoint::runtime', 'openstack::barbican::runtime']
2020-08-14 03:11:36.530 82634 INFO sysinv.agent.rpcapi [-] config_apply_runtime_manifest: fanout_cast: sending config 1621013d-1648-49fa-9293-1d2561a8d419 {'classes': ['openstack::keystone::endpoint::runtime', 'openstack::barbican::runtime'], 'force': True, 'personalities': ['controller'], 'host_uuids': [u'535a16f2-525f-44ca-b6d6-27a52165dc12']} to agent
2020-08-14 03:11:36.535 75323 INFO sysinv.agent.manager [-] config_apply_runtime_manifest: 1621013d-1648-49fa-9293-1d2561a8d419 {u'classes': [u'openstack::keystone::endpoint::runtime', u'openstack::barbican::runtime'], u'force': True, u'personalities': [u'controller'], u'host_uuids': [u'535a16f2-525f-44ca-b6d6-27a52165dc12']} controller
2020-08-14 03:11:36.536 75323 INFO sysinv.agent.manager [-] _apply_runtime_manifest with hieradata_path = '/opt/platform/puppet/20.06/hieradata'
2020-08-14 03:14:24.214 75323 INFO sysinv.agent.manager [-] Runtime manifest apply completed for classes [u'openstack::keystone::endpoint::runtime', u'openstack::barbican::runtime'].
...
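
To narrow down which apply failed around the reboot, the sysinv and puppet logs can be filtered; a sketch, assuming the standard StarlingX log locations /var/log/sysinv.log and /var/log/puppet/:

# List runtime manifest applies and their completion messages in the sysinv log
grep -E 'config_apply_runtime_manifest|Runtime manifest apply' /var/log/sysinv.log

# Scan the puppet apply logs for errors around the same timestamps
grep -ri 'error' /var/log/puppet/ | tail -n 50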

Revision history for this message
Paul-Ionut Vaduva (pvaduva) wrote :

The second issue, on the DX, is due to a Puppet manifest failure: tiller failed to restart.

var/log/puppet/2020-08-28-03-39-41_controller/puppet.log

-----
2020-08-28T03:40:46.947 Notice: 2020-08-28 03:40:46 +0000 /Stage[main]/Platform::Helm::Tiller::Config/Exec[restart tiller for helm]/returns: Job for tiller.service failed because the control process exited with error code.
2020-08-28T03:40:46.949 Error: 2020-08-28 03:40:46 +0000 /Stage[main]/Platform::Helm::Tiller::Config/Exec[restart tiller for helm]: Failed to call refresh: systemctl restart tiller.service returned 1 instead of one of [0]
2020-08-28T03:40:46.952 Error: 2020-08-28 03:40:46 +0000 /Stage[main]/Platform::Helm::Tiller::Config/Exec[restart tiller for helm]: systemctl restart tiller.service returned 1 instead of one of [0]
2020-08-28T03:40:46.954 /usr/share/ruby/vendor_ruby/puppet/util/errors.rb:106:in `fail'
-----
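
A sketch of how the tiller.service failure could be investigated further on the affected controller, using standard systemd tooling (the unit name is taken from the puppet log above):

sudo systemctl status tiller.service                       # current unit state and last exit code
sudo journalctl -u tiller.service --no-pager | tail -n 50  # recent service logs
sudo systemctl restart tiller.service                      # retry the restart the puppet Exec attempted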

Revision history for this message
Paul-Ionut Vaduva (pvaduva) wrote :

We have not seen this in the last 1-2 months, so I suggest opening another issue if it is seen in the future.

Changed in starlingx:
status: Triaged → Invalid
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded