AIO-DX Host controller compute services failed to get openstack token from keystone after reboot

Bug #1830421 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Austin Sun

Bug Description

Brief Description
-----------------
In a DX system, after a forced reboot of the active controller, alarm "270.001 Host controller-0 compute services failure, failed to get openstack token from keystone" was raised and never cleared.

Severity
--------
Major

Steps to Reproduce
------------------
sudo reboot -f

TC-name: mtc/test_evacuate.py::TestTisGuest::test_evacuate_vms

Expected Behavior
------------------
The 270.001 alarm should eventually be cleared.
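
For reference, the clearance check could be sketched as below. This is a hypothetical helper, not the actual test-suite code; the `cmd` parameter is made injectable only so the sketch can be exercised without a live `fm` CLI:

```python
import subprocess
import time

def wait_alarm_cleared(alarm_id="270.001", timeout=600, interval=30,
                       cmd=("fm", "alarm-list", "--nowrap")):
    """Poll the alarm list until alarm_id disappears or the timeout expires.

    Returns True if the alarm is gone, False if it was still present when
    the deadline passed. `cmd` is the command whose stdout is searched.
    """
    deadline = time.monotonic() + timeout
    while True:
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        if alarm_id not in out:
            return True   # alarm cleared
        if time.monotonic() >= deadline:
            return False  # alarm never cleared within the timeout
        time.sleep(interval)
```

On a real system this would wrap the authenticated `fm alarm-list` invocation shown in the logs below.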

Actual Behavior
----------------
Alarm 270.001 remained raised after the reboot and was never cleared.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Two node system

Lab-name: WP_1-2

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-05-23_18-37-00

Last Pass
---------
2019-05-18_06-36-50

Timestamp/Logs
--------------

[2019-05-24 14:57:16,230] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-05-24 14:57:18,716] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
| 4332dfc7-86f4-4de8-a5bf-bad51b6838a1 | 400.001 | Service group cloud-services warning; dbmon(enabled-active, ) | service_domain=controller.service_group=cloud-services.host=controller-0 | minor | 2019-05-24T14:57:12.373192 |
| a13d78a1-0f5e-4603-a94f-e6f739edbeef | 400.001 | Service group cloud-services warning; dbmon(enabled-standby, ) | service_domain=controller.service_group=cloud-services.host=controller-1 | minor | 2019-05-24T14:54:59.746219 |
| 95f73f81-cc23-4cf3-83c5-06a0ad586a06 | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0.ntp | major | 2019-05-24T14:14:18.881022 |
| e6b48246-2c3b-472a-8285-71fcfe615d7a | 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-05-24T14:08:47.419154 |
+--------------------------------------+----------+------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
[wrsroot@controller-0 ~(keystone_admin)]$

[wrsroot@controller-0 ~(keystone_admin)]$
[2019-05-24 15:06:15,644] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-05-24 15:06:15,644] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-05-24 15:13:02,193] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-05-24 15:13:04,267] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------------------------+------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------+------------------------------------+----------+----------------------------+
| 51afc9b4-91ef-4d28-a29e-24c39b0b5145 | 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-05-24T15:08:34.204058 |
| 2bc24217-3f96-46fd-9687-5f88f17b6aaa | 270.001 | Host controller-0 compute services failure, failed to get openstack token from keystone | host=controller-0.services=compute | critical | 2019-05-24T15:07:30.646680 |
| 95f73f81-cc23-4cf3-83c5-06a0ad586a06 | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0.ntp | major | 2019-05-24T14:14:18.881022 |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------+------------------------------------+----------+----------------------------+
controller-1:~$

[2019-05-24 16:04:45,891] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-05-24 16:04:48,281] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
| 8c9b9e65-c341-428d-bf74-fb6601ef6618 | 400.001 | Service group cloud-services warning; dbmon(enabled-active, ) | service_domain=controller.service_group=cloud-services.host=controller-1 | minor | 2019-05-24T16:04:23.328927 |
| 51afc9b4-91ef-4d28-a29e-24c39b0b5145 | 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-05-24T15:08:34.204058 |
| 2bc24217-3f96-46fd-9687-5f88f17b6aaa | 270.001 | Host controller-0 compute services failure, failed to get openstack token from keystone | host=controller-0.services=compute | critical | 2019-05-24T15:07:30.646680 |
| 95f73f81-cc23-4cf3-83c5-06a0ad586a06 | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0.ntp | major | 2019-05-24T14:14:18.881022 |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
controller-1:~$

Test Activity
-------------
Sanity

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Is this a case of a stale alarm or is the compute service in a failed state?

Changed in starlingx:
status: New → Incomplete
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Peng Peng (ppeng) wrote :

Before the reboot, both hypervisors were up.

[2019-05-24 15:05:52,384] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-24 15:05:54,574] 387 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 3 | controller-0 | QEMU | 192.168.206.3 | up |
| 5 | controller-1 | QEMU | 192.168.206.4 | up |
+----+---------------------+-----------------+---------------+-------+

After the reboot, the hypervisor list command was not working due to another issue:
https://bugs.launchpad.net/starlingx/+bug/1829931 - Standby controller not up in hypervisor list in 15 mins after host-unlock

Another 270.001 alarm,
| 270.001 | Host controller-1 compute services failure, failed to disable nova services
also popped up in the alarm list sometimes, but it was eventually cleared.
Only the alarm "270.001 Host controller-0 compute services failure, failed to get openstack token from keystone" stayed in the list until the end of the entire test suite.
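
A quick way to tell which alarm IDs remain in table output like the listings above is to parse the Alarm ID column. This is a hypothetical parsing helper, assuming the '|'-delimited row format of `fm alarm-list --nowrap --uuid` shown in the logs:

```python
import re

def alarm_ids(fm_table: str) -> set[str]:
    """Collect the Alarm ID column from 'fm alarm-list --nowrap --uuid' output.

    Data rows look like: | <uuid> | 270.001 | <reason> | ... , so the alarm
    ID is the third '|'-separated field. Header and border rows are skipped
    because they do not match the NNN.NNN alarm-ID pattern.
    """
    ids = set()
    for line in fm_table.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) > 2 and re.fullmatch(r"\d{3}\.\d{3}", cols[2]):
            ids.add(cols[2])
    return ids
```

With this, `"270.001" in alarm_ids(output)` distinguishes runs where the alarm persisted from runs where it cleared.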

Revision history for this message
Ghada Khalil (gkhalil) wrote :

It appears that the alarm indicates an issue with nova compute; assigning to the distro.openstack team to triage

tags: added: stx.2.0 stx.distro.openstack
Changed in starlingx:
importance: Undecided → Medium
status: Incomplete → Triaged
assignee: nobody → Bruce Jones (brucej)
Ghada Khalil (gkhalil)
summary: - DX Host controller compute services failed to get openstack token from
- keystone after reboot
+ AIO-DX Host controller compute services failed to get openstack token
+ from keystone after reboot
Revision history for this message
Bruce Jones (brucej) wrote :

Cindy, please have someone reproduce and root cause this issue.

Changed in starlingx:
assignee: Bruce Jones (brucej) → Cindy Xie (xxie1)
Austin Sun (sunausti)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Austin Sun (sunausti)
Revision history for this message
Austin Sun (sunausti) wrote :

With CENGN build "20190604T144018Z", once "reboot -f" was issued, the active host switched to controller-1, but some pods were abnormal. Further checking is needed to determine why the pods are in the Error state.

[wrsroot@controller-1 ~(keystone_admin)]$ kubectl get pods -n openstack | grep Error
aodh-api-65d447f66d-ct77l 0/1 Error 0 54m
aodh-evaluator-77d464bb74-6x92j 0/1 Error 0 54m
aodh-listener-9dfb8f677-6zjsm 0/1 Error 0 54m
aodh-notifier-5cd9f88587-fd5zr 0/1 Error 0 54m
barbican-api-7fdd595748-fp89v 0/1 Error 0 74m
cinder-backup-786c58bd79-hh6hh 0/1 Error 0 57m
cinder-scheduler-7c5ddf9976-zjffc 0/1 Error 0 57m
cinder-volume-b6cbbb476-ksjkl 0/1 Error 0 57m
glance-api-99c864d6f-x298r 0/1 Error 0 73m
gnocchi-api-55798d5b5b-xgzwd 0/1 Error 0 52m
gnocchi-metricd-dxnv9 0/1 Error 0 52m
heat-api-858945db9f-g9cpk 0/1 Error 0 61m
heat-engine-56bd4c9bc9-xwrhc 0/1 Error 0 61m
keystone-api-bf59bb9f6-bnbqw 0/1 Error 0 77m
libvirt-libvirt-default-2nhj5 0/1 Error 0 70m
neutron-dhcp-agent-controller-0-9626473e-mh9z8 0/1 Error 0 70m
neutron-metadata-agent-controller-0-a762cb46-x9f65 0/1 Error 0 70m
neutron-server-5cfdcd846c-j7vbm 0/1 Error 0 70m
nova-api-metadata-6d9b66759b-g2sbm 0/1 Error 1 70m
nova-api-osapi-6db69588f5-6gcdg 0/1 Error 0 70m
nova-api-proxy-6ff4c8f7c7-xqjzx 0/1 Error 0 70m
nova-compute-controller-0-dec13249-cp46z 0/2 Error 0 70m
nova-conductor-7bffdff5dd-s4trs 0/1 Error 0 70m
nova-novncproxy-5fb4d6958d-tvwqs 0/1 Error 0 70m
nova-placement-api-799489447d-v8prf 0/1 Error 0 70m
openvswitch-db-ngc5r 0/1 Error 0 70m
openvswitch-vswitchd-t9ztn 0/1 Error 0 70m
osh-openstack-rabbitmq-rabbitmq-0 0/1 Error 0 79m
panko-api-5f66468d49-pxfgx 0/1 Error 0 50m

After a while, the pods returned to normal, and

controller-1:~$ openstack server list
+--------------------------------------+------+--------+---------------------+--------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------+--------+---------------------+--------+----------+
| df91f4c5-7209-4a3b-bde4-a2b878e...

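
The Error-state pods in a listing like the one above could be flagged programmatically. A minimal sketch (hypothetical helper, assuming the default `kubectl get pods` column layout of NAME READY STATUS RESTARTS AGE):

```python
def failing_pods(kubectl_output: str) -> list[str]:
    """Return the names of pods whose STATUS column is not Running or Completed.

    Assumes the default 'kubectl get pods' columns:
    NAME  READY  STATUS  RESTARTS  AGE
    """
    bad = []
    for line in kubectl_output.splitlines():
        parts = line.split()
        if len(parts) >= 5 and parts[0] != "NAME":  # skip the header row
            name, status = parts[0], parts[2]
            if status not in ("Running", "Completed"):
                bad.append(name)
    return bad
```

An empty result from `failing_pods(...)` would correspond to the "pods get normal" state described above.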

Revision history for this message
Austin Sun (sunausti) wrote :

And after 5 minutes, the fm alarm was cleaned up too.
[wrsroot@controller-1 ~(keystone_admin)]$ fm alarm-list
+----------+---------------------------------------------------------------------+----------------------+----------+---------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+---------------------------------------------------------------------+----------------------+----------+---------------+
| 400.001 | Service group cloud-services warning; dbmon(enabled-go-active, ) | service_domain= | minor | 2019-06-11T09 |
| | | controller. | | :11:42.572391 |
| | | service_group=cloud- | | |
| | | services.host= | | |
| | | controller-1 | | |
| | | | | |
| 400.001 | Service group cloud-services warning; dbmon(enabled-standby, ) | service_domain= | minor | 2019-06-11T09 |
| | | controller. | | :11:36.519572 |
| | | service_group=cloud- | | |
| | | services.host= | | |
| | | controller-0 | | |
| | | | | |
| 100.114 | NTP address 104.236.52.16 is not a valid or a reachable NTP server. | host=controller-0. | minor | 2019-06-11T09 |
| | | ntp=104.236.52.16 | | :11:15.763393 |
| | | | | |
+----------+---------------------------------------------------------------------+----------------------+----------+---------------+

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Austin Sun (sunausti) wrote :

Could you try to reproduce on the latest version to see if this issue is gone?

Revision history for this message
yong hu (yhu6) wrote :

@Austin, please check this on the latest build, if the issue is gone, we can close this issue.

Revision history for this message
Austin Sun (sunausti) wrote :

With CENGN build 20190707T233000Z, this issue cannot be reproduced.

Revision history for this message
yong hu (yhu6) wrote :

Keep monitoring and trying to reproduce for a few more days.

Revision history for this message
Peng Peng (ppeng) wrote :

We have not observed this issue recently.

tags: removed: stx.retestneeded
Revision history for this message
yong hu (yhu6) wrote :

There was a similar issue on multi-node systems (https://bugs.launchpad.net/starlingx/+bug/1823375), which was not seen on the 7/18 and 7/24 CENGN builds.
We will keep monitoring this until RC1.

Revision history for this message
Austin Sun (sunausti) wrote :

Tested reboot 10 times on an AIO-DX system; did not reproduce this issue.
SW_VERSION="19.01"
BUILD_TARGET="Unknown"
BUILD_TYPE="Informal"
BUILD_ID="n/a"

JOB="n/a"
BUILD_BY="builder"
BUILD_NUMBER="n/a"
BUILD_HOST="jenkins-starlingx-stx-daily-build-106-8gdx9-vlrrh"
BUILD_DATE="2019-08-05 07:21:20 +0000"

BUILD_DIR="/"
WRS_SRC_DIR="/localdisk/designer/builder/starlingx/cgcs-root"
WRS_GIT_BRANCH="HEAD"
CGCS_SRC_DIR="/localdisk/designer/builder/starlingx/cgcs-root/stx"
CGCS_GIT_BRANCH="HEAD"

Revision history for this message
yong hu (yhu6) wrote :

This LP is more about stability related to "reboot" and "helm-chart re-apply", and recently there were several fixes in this area.
So I will set this issue to "Fix Released", based on Austin's latest verification result.

Changed in starlingx:
status: Incomplete → Fix Released