AIO-DX Host controller compute services failed to get openstack token from keystone after reboot

Bug #1830421 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Austin Sun

Bug Description

Brief Description
-----------------
In a DX system, after a forced reboot of the active controller, alarm "270.001 Host controller-0 compute services failure, failed to get openstack token from keystone" was raised and never cleared.

Severity
--------
Major

Steps to Reproduce
------------------
sudo reboot -f

TC-name: mtc/test_evacuate.py::TestTisGuest::test_evacuate_vms

Expected Behavior
------------------
The 270.001 alarm should eventually be cleared.
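
For reference, the clearance check could be sketched as below. This is a hypothetical helper, not the actual test-suite code; the `cmd` parameter is made injectable only so the sketch can be exercised without a live `fm` CLI:

```python
import subprocess
import time

def wait_alarm_cleared(alarm_id="270.001", timeout=600, interval=30,
                       cmd=("fm", "alarm-list", "--nowrap")):
    """Poll the alarm list until alarm_id disappears or the timeout expires.

    Returns True if the alarm is gone, False if it was still present when
    the deadline passed. `cmd` is the command whose stdout is searched.
    """
    deadline = time.monotonic() + timeout
    while True:
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        if alarm_id not in out:
            return True   # alarm cleared
        if time.monotonic() >= deadline:
            return False  # alarm never cleared within the timeout
        time.sleep(interval)
```

On a real system this would wrap the authenticated `fm alarm-list` invocation shown in the logs below.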

Actual Behavior
----------------
Alarm 270.001 remained raised after the reboot and was never cleared.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Two node system

Lab-name: WP_1-2

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-05-23_18-37-00

Last Pass
---------
2019-05-18_06-36-50

Timestamp/Logs
--------------

[2019-05-24 14:57:16,230] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-05-24 14:57:18,716] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
| 4332dfc7-86f4-4de8-a5bf-bad51b6838a1 | 400.001 | Service group cloud-services warning; dbmon(enabled-active, ) | service_domain=controller.service_group=cloud-services.host=controller-0 | minor | 2019-05-24T14:57:12.373192 |
| a13d78a1-0f5e-4603-a94f-e6f739edbeef | 400.001 | Service group cloud-services warning; dbmon(enabled-standby, ) | service_domain=controller.service_group=cloud-services.host=controller-1 | minor | 2019-05-24T14:54:59.746219 |
| 95f73f81-cc23-4cf3-83c5-06a0ad586a06 | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0.ntp | major | 2019-05-24T14:14:18.881022 |
| e6b48246-2c3b-472a-8285-71fcfe615d7a | 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-05-24T14:08:47.419154 |
+--------------------------------------+----------+------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
[wrsroot@controller-0 ~(keystone_admin)]$

[wrsroot@controller-0 ~(keystone_admin)]$
[2019-05-24 15:06:15,644] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-05-24 15:06:15,644] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-05-24 15:13:02,193] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-05-24 15:13:04,267] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------------------------+------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------+------------------------------------+----------+----------------------------+
| 51afc9b4-91ef-4d28-a29e-24c39b0b5145 | 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-05-24T15:08:34.204058 |
| 2bc24217-3f96-46fd-9687-5f88f17b6aaa | 270.001 | Host controller-0 compute services failure, failed to get openstack token from keystone | host=controller-0.services=compute | critical | 2019-05-24T15:07:30.646680 |
| 95f73f81-cc23-4cf3-83c5-06a0ad586a06 | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0.ntp | major | 2019-05-24T14:14:18.881022 |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------+------------------------------------+----------+----------------------------+
controller-1:~$

[2019-05-24 16:04:45,891] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-05-24 16:04:48,281] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
| 8c9b9e65-c341-428d-bf74-fb6601ef6618 | 400.001 | Service group cloud-services warning; dbmon(enabled-active, ) | service_domain=controller.service_group=cloud-services.host=controller-1 | minor | 2019-05-24T16:04:23.328927 |
| 51afc9b4-91ef-4d28-a29e-24c39b0b5145 | 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-05-24T15:08:34.204058 |
| 2bc24217-3f96-46fd-9687-5f88f17b6aaa | 270.001 | Host controller-0 compute services failure, failed to get openstack token from keystone | host=controller-0.services=compute | critical | 2019-05-24T15:07:30.646680 |
| 95f73f81-cc23-4cf3-83c5-06a0ad586a06 | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0.ntp | major | 2019-05-24T14:14:18.881022 |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
controller-1:~$

Test Activity
-------------
Sanity

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Is this a case of a stale alarm or is the compute service in a failed state?

Changed in starlingx:
status: New → Incomplete
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Peng Peng (ppeng) wrote :

Before the reboot, both hypervisors were up.

[2019-05-24 15:05:52,384] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-24 15:05:54,574] 387 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 3 | controller-0 | QEMU | 192.168.206.3 | up |
| 5 | controller-1 | QEMU | 192.168.206.4 | up |
+----+---------------------+-----------------+---------------+-------+

After the reboot, the hypervisor list command was not working due to another issue:
https://bugs.launchpad.net/starlingx/+bug/1829931 - Standby controller not up in hypervisor list in 15 mins after host-unlock

Another 270.001 alarm,
| 270.001 | Host controller-1 compute services failure, failed to disable nova services
also popped up in the alarm list sometimes, but it was eventually cleared.
Only the alarm "270.001 Host controller-0 compute services failure, failed to get openstack token from keystone" stayed in the list until the end of the entire test suite.
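
A quick way to tell which alarm IDs remain in table output like the listings above is to parse the Alarm ID column. This is a hypothetical parsing helper, assuming the '|'-delimited row format of `fm alarm-list --nowrap --uuid` shown in the logs:

```python
import re

def alarm_ids(fm_table: str) -> set[str]:
    """Collect the Alarm ID column from 'fm alarm-list --nowrap --uuid' output.

    Data rows look like: | <uuid> | 270.001 | <reason> | ... , so the alarm
    ID is the third '|'-separated field. Header and border rows are skipped
    because they do not match the NNN.NNN alarm-ID pattern.
    """
    ids = set()
    for line in fm_table.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) > 2 and re.fullmatch(r"\d{3}\.\d{3}", cols[2]):
            ids.add(cols[2])
    return ids
```

With this, `"270.001" in alarm_ids(output)` distinguishes runs where the alarm persisted from runs where it cleared.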

Revision history for this message
Ghada Khalil (gkhalil) wrote :

It appears that the alarm indicates an issue with nova compute; assigning to the distro.openstack team to triage

tags: added: stx.2.0 stx.distro.openstack
Changed in starlingx:
importance: Undecided → Medium
status: Incomplete → Triaged
assignee: nobody → Bruce Jones (brucej)
Ghada Khalil (gkhalil)
summary: - DX Host controller compute services failed to get openstack token from
- keystone after reboot
+ AIO-DX Host controller compute services failed to get openstack token
+ from keystone after reboot
Revision history for this message
Bruce Jones (brucej) wrote :

Cindy, please have someone reproduce and root cause this issue.

Changed in starlingx:
assignee: Bruce Jones (brucej) → Cindy Xie (xxie1)
Austin Sun (sunausti)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Austin Sun (sunausti)
Revision history for this message
Austin Sun (sunausti) wrote :

With CENGN build "20190604T144018Z", once "reboot -f" was issued, the active host switched to controller-1, but some pods were abnormal. Further checking is needed to determine why the pods are in the Error state.

[wrsroot@controller-1 ~(keystone_admin)]$ kubectl get pods -n openstack | grep Error
aodh-api-65d447f66d-ct77l 0/1 Error 0 54m
aodh-evaluator-77d464bb74-6x92j 0/1 Error 0 54m
aodh-listener-9dfb8f677-6zjsm 0/1 Error 0 54m
aodh-notifier-5cd9f88587-fd5zr 0/1 Error 0 54m
barbican-api-7fdd595748-fp89v 0/1 Error 0 74m
cinder-backup-786c58bd79-hh6hh 0/1 Error 0 57m
cinder-scheduler-7c5ddf9976-zjffc 0/1 Error 0 57m
cinder-volume-b6cbbb476-ksjkl 0/1 Error 0 57m
glance-api-99c864d6f-x298r 0/1 Error 0 73m
gnocchi-api-55798d5b5b-xgzwd 0/1 Error 0 52m
gnocchi-metricd-dxnv9 0/1 Error 0 52m
heat-api-858945db9f-g9cpk 0/1 Error 0 61m
heat-engine-56bd4c9bc9-xwrhc 0/1 Error 0 61m
keystone-api-bf59bb9f6-bnbqw 0/1 Error 0 77m
libvirt-libvirt-default-2nhj5 0/1 Error 0 70m
neutron-dhcp-agent-controller-0-9626473e-mh9z8 0/1 Error 0 70m
neutron-metadata-agent-controller-0-a762cb46-x9f65 0/1 Error 0 70m
neutron-server-5cfdcd846c-j7vbm 0/1 Error 0 70m
nova-api-metadata-6d9b66759b-g2sbm 0/1 Error 1 70m
nova-api-osapi-6db69588f5-6gcdg 0/1 Error 0 70m
nova-api-proxy-6ff4c8f7c7-xqjzx 0/1 Error 0 70m
nova-compute-controller-0-dec13249-cp46z 0/2 Error 0 70m
nova-conductor-7bffdff5dd-s4trs 0/1 Error 0 70m
nova-novncproxy-5fb4d6958d-tvwqs 0/1 Error 0 70m
nova-placement-api-799489447d-v8prf 0/1 Error 0 70m
openvswitch-db-ngc5r 0/1 Error 0 70m
openvswitch-vswitchd-t9ztn 0/1 Error 0 70m
osh-openstack-rabbitmq-rabbitmq-0 0/1 Error 0 79m
panko-api-5f66468d49-pxfgx 0/1 Error 0 50m

After a while, the pods returned to normal, and

controller-1:~$ openstack server list
+--------------------------------------+------+--------+---------------------+--------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------+--------+---------------------+--------+----------+
| df91f4c5-7209-4a3b-bde4-a2b878e...

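
The Error-state pods in a listing like the one above could be flagged programmatically. A minimal sketch (hypothetical helper, assuming the default `kubectl get pods` column layout of NAME READY STATUS RESTARTS AGE):

```python
def failing_pods(kubectl_output: str) -> list[str]:
    """Return the names of pods whose STATUS column is not Running or Completed.

    Assumes the default 'kubectl get pods' columns:
    NAME  READY  STATUS  RESTARTS  AGE
    """
    bad = []
    for line in kubectl_output.splitlines():
        parts = line.split()
        if len(parts) >= 5 and parts[0] != "NAME":  # skip the header row
            name, status = parts[0], parts[2]
            if status not in ("Running", "Completed"):
                bad.append(name)
    return bad
```

An empty result from `failing_pods(...)` would correspond to the "pods get normal" state described above.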

Revision history for this message
Austin Sun (sunausti) wrote :

And after 5 minutes, the fm alarm was cleaned up too.
[wrsroot@controller-1 ~(keystone_admin)]$ fm alarm-list
+----------+---------------------------------------------------------------------+----------------------+----------+---------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+---------------------------------------------------------------------+----------------------+----------+---------------+
| 400.001 | Service group cloud-services warning; dbmon(enabled-go-active, ) | service_domain= | minor | 2019-06-11T09 |
| | | controller. | | :11:42.572391 |
| | | service_group=cloud- | | |
| | | services.host= | | |
| | | controller-1 | | |
| | | | | |
| 400.001 | Service group cloud-services warning; dbmon(enabled-standby, ) | service_domain= | minor | 2019-06-11T09 |
| | | controller. | | :11:36.519572 |
| | | service_group=cloud- | | |
| | | services.host= | | |
| | | controller-0 | | |
| | | | | |
| 100.114 | NTP address 104.236.52.16 is not a valid or a reachable NTP server. | host=controller-0. | minor | 2019-06-11T09 |
| | | ntp=104.236.52.16 | | :11:15.763393 |
| | | | | |
+----------+---------------------------------------------------------------------+----------------------+----------+---------------+

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Austin Sun (sunausti) wrote :

Could you try to reproduce on the latest version to see if this issue is gone?

Revision history for this message
yong hu (yhu6) wrote :

@Austin, please check this on the latest build, if the issue is gone, we can close this issue.

Revision history for this message
Austin Sun (sunausti) wrote :

With CENGN build 20190707T233000Z, this issue cannot be reproduced.

Revision history for this message
yong hu (yhu6) wrote :

Keep monitoring and trying to reproduce for a few more days.

Revision history for this message
Peng Peng (ppeng) wrote :

We have not observed this issue recently.

tags: removed: stx.retestneeded
Revision history for this message
yong hu (yhu6) wrote :

There was a similar issue on multi-node systems (https://bugs.launchpad.net/starlingx/+bug/1823375), which was not seen on the 7/18 and 7/24 CENGN builds.
We will keep monitoring this until RC1.

Revision history for this message
Austin Sun (sunausti) wrote :

Tested reboot 10 times on an AIO-DX system; did not reproduce this issue.
SW_VERSION="19.01"
BUILD_TARGET="Unknown"
BUILD_TYPE="Informal"
BUILD_ID="n/a"

JOB="n/a"
BUILD_BY="builder"
BUILD_NUMBER="n/a"
BUILD_HOST="jenkins-starlingx-stx-daily-build-106-8gdx9-vlrrh"
BUILD_DATE="2019-08-05 07:21:20 +0000"

BUILD_DIR="/"
WRS_SRC_DIR="/localdisk/designer/builder/starlingx/cgcs-root"
WRS_GIT_BRANCH="HEAD"
CGCS_SRC_DIR="/localdisk/designer/builder/starlingx/cgcs-root/stx"
CGCS_GIT_BRANCH="HEAD"

Revision history for this message
yong hu (yhu6) wrote :

This LP is more about stability related to "reboot" and "helm-chart re-apply", and recently there were several fixes in this area.
So I will set this issue to "Fix Released", based on Austin's latest verification result.

Changed in starlingx:
status: Incomplete → Fix Released