stx-openstack fails to come back up after controllers reboot

Bug #1881899 reported by Ovidiu Poncea
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: zhipeng liu
Milestone: (none)

Bug Description

Brief Description
-----------------
Rebooted both controllers, but OpenStack fails to come back up.

[root@controller-1 sysadmin(keystone_admin)]# openstack endpoint list
Failed to discover available identity versions when contacting http://keystone.openstack.svc.cluster.local/v3. Attempting to parse version from URL.
Service Unavailable (HTTP 503)

Node status:
[root@controller-1 sysadmin(keystone_admin)]# system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | degraded     |
| 2  | controller-1 | controller  | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+
[root@controller-1 sysadmin(keystone_admin)]# fm alarm-list
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
| 400.001 | Service group cloud-services warning; dbmon(enabled-active, ) | service_domain=controller.service_group=cloud-services.host=controller-1 | minor | 2020-06-03T13:02:13.385662 |
| 400.001 | Service group cloud-services warning; dbmon(enabled-standby, ) | service_domain=controller.service_group=cloud-services.host=controller-0 | minor | 2020-06-03T13:01:13.112434 |
| 200.006 | controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=controller-0.process=pci-irq-affinity-agent | major | 2020-06-03T12:46:13.918380 |

[root@controller-1 sysadmin(keystone_admin)]# kubectl get pods -o wide -n openstack | grep -v Running | grep -v Completed
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cinder-api-59979594ff-25hrl 0/1 Init:0/2 0 26m 172.16.192.109 controller-0 <none> <none>
cinder-backup-6dd95fc9dd-svp5r 0/1 Init:0/4 0 26m 172.16.192.123 controller-0 <none> <none>
cinder-scheduler-76c65f6979-5tmt5 0/1 Init:0/2 0 26m 172.16.192.82 controller-0 <none> <none>
cinder-volume-b7dfbb7b9-f47bk 0/1 Init:0/4 0 26m 172.16.192.90 controller-0 <none> <none>
cinder-volume-b7dfbb7b9-mtkz8 0/1 Init:3/4 7 9h 172.16.166.185 controller-1 <none> <none>
cinder-volume-usage-audit-1591187700-g64xf 0/1 Init:0/1 0 27m 172.16.166.157 controller-1 <none> <none>
fm-rest-api-8b5b97bf8-qdlbx 0/1 Init:0/1 0 26m 172.16.192.67 controller-0 <none> <none>
fm-rest-api-8b5b97bf8-v5wz9 0/1 CrashLoopBackOff 9 8h 172.16.166.140 controller-1 <none> <none>
glance-api-6b74f659d-w9t4g 0/1 Init:0/3 0 26m 172.16.192.101 controller-0 <none> <none>
heat-api-846d848bd9-hd46z 0/1 Init:0/1 0 26m 172.16.192.108 controller-0 <none> <none>
heat-cfn-9d6f7ffc5-rvb4d 0/1 Init:0/1 0 26m 172.16.192.73 controller-0 <none> <none>
heat-engine-6487ff65c6-zk4n7 0/1 Init:0/1 0 26m 172.16.192.80 controller-0 <none> <none>
heat-engine-cleaner-1591187700-kd2pn 0/1 Init:0/1 0 27m 172.16.166.156 controller-1 <none> <none>
horizon-65d4b5bdcf-ltms2 0/1 Init:0/1 0 21m 172.16.192.83 controller-0 <none> <none>
keystone-api-6c76774bf7-l7c4d 0/1 Init:0/1 0 26m 172.16.192.113 controller-0 <none> <none>
libvirt-libvirt-default-4mf4v 0/1 Init:0/3 1 9h 192.168.204.2 controller-0 <none> <none>
mariadb-server-0 0/1 CrashLoopBackOff 8 9h 172.16.166.158 controller-1 <none> <none>
neutron-dhcp-agent-controller-0-937646f6-r5skk 0/1 Init:0/1 1 9h 192.168.204.2 controller-0 <none> <none>
neutron-l3-agent-controller-0-937646f6-fk5hc 0/1 Init:0/1 1 9h 192.168.204.2 controller-0 <none> <none>
neutron-metadata-agent-controller-0-937646f6-nzkxz 0/1 Init:0/2 1 9h 192.168.204.2 controller-0 <none> <none>
neutron-ovs-agent-controller-0-937646f6-dbfc2 0/1 Init:0/3 1 9h 192.168.204.2 controller-0 <none> <none>
neutron-server-7c9678cf58-dq52p 0/1 Init:0/1 0 26m 172.16.192.74 controller-0 <none> <none>
neutron-server-7c9678cf58-s85dg 0/1 CrashLoopBackOff 8 9h 172.16.166.191 controller-1 <none> <none>
neutron-sriov-agent-controller-0-937646f6-qxt9g 0/1 Init:0/2 1 9h 192.168.204.2 controller-0 <none> <none>
nova-api-metadata-b9b4fdb9b-d2gr6 0/1 CrashLoopBackOff 8 9h 172.16.166.177 controller-1 <none> <none>
nova-api-metadata-b9b4fdb9b-kg859 0/1 Init:0/2 0 26m 172.16.192.110 controller-0 <none> <none>
nova-api-osapi-856679d49f-4ljnl 0/1 Init:0/1 0 26m 172.16.192.117 controller-0 <none> <none>
nova-compute-controller-0-937646f6-9lrqs 0/2 Init:0/6 1 9h 192.168.204.2 controller-0 <none> <none>
nova-conductor-6cbc75dd89-nxvwc 0/1 Init:0/1 0 26m 172.16.192.66 controller-0 <none> <none>
nova-novncproxy-5bd676cfc4-82r8x 0/1 Init:0/3 0 26m 172.16.192.72 controller-0 <none> <none>
nova-scheduler-7fbf5cdd4-ckmkd 0/1 CrashLoopBackOff 6 9h 172.16.166.172 controller-1 <none> <none>
nova-scheduler-7fbf5cdd4-h65j5 0/1 Init:0/1 0 26m 172.16.192.121 controller-0 <none> <none>
nova-service-cleaner-1591189200-927h5 0/1 Init:0/1 0 6m56s 172.16.192.92 controller-0 <none> <none>

Severity
--------
Critical: OpenStack is unusable

Steps to Reproduce
------------------
1. Reboot both controllers with reboot -f
2. Wait for them to come back up (see the command sketch below)
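
A minimal command sketch of the reproduction and the post-reboot check (host names and the openstack namespace are taken from the transcripts above; the interval between the two reboots was not recorded):

# On each controller, force an ungraceful reboot:
sudo reboot -f

# Once both nodes are unlocked/enabled again, list the pods that are
# still unhealthy (same filter as used in the description):
kubectl get pods -o wide -n openstack | grep -v Running | grep -v Completed

# With keystone_admin credentials loaded, as in the prompts above:
openstack endpoint list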

Expected Behavior
------------------
'openstack endpoint list' should work

Actual Behavior
----------------
[root@controller-1 sysadmin(keystone_admin)]# openstack endpoint list
Failed to discover available identity versions when contacting http://keystone.openstack.svc.cluster.local/v3. Attempting to parse version from URL.
Service Unavailable (HTTP 503)

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
AIO-DX, IPv4

Branch/Pull Time/Commit
-----------------------
master

Test Activity
-------------
Developer Testing

Workaround
----------
None

Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

The issue seems to be caused by MariaDB not recovering, which leaves all OpenStack services unresponsive.
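
A minimal triage sketch for this theory (the pod name is taken from the listing above; /var/lib/mysql/grastate.dat is the usual Galera state file, and its presence at that path inside this image is an assumption):

# Check why the mariadb pod is crash-looping:
kubectl -n openstack logs mariadb-server-0 --previous | tail -n 50

# After an ungraceful stop of all Galera members, typically no node is
# left with safe_to_bootstrap: 1, which blocks automatic recovery
# (exec only works while the container is momentarily up):
kubectl -n openstack exec mariadb-server-0 -- cat /var/lib/mysql/grastate.dat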

description: updated
Ghada Khalil (gkhalil)
tags: added: stx.distro.openstack
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / high priority - stability issue w/ openstack

tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → yong hu (yhu6)
zhipeng liu (zhipengs)
Changed in starlingx:
assignee: yong hu (yhu6) → zhipeng liu (zhipengs)
Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi all,

I could not reproduce this issue with my Ussuri upgrade EB.
However, I could reproduce it with the 20200516T080009Z daily build.
From the error logs, this is an old issue that Chris Friesen analyzed early last year:
https://bugs.launchpad.net/starlingx/+bug/1816842/comments/3

In the Ussuri upgrade EB, we rebased openstack-helm-infra/mariadb.
It includes the two patches below, which fix this stability issue:
https://review.opendev.org/#/c/704034/ (Prevent splitbrain during full Galera restart)
https://review.opendev.org/#/c/708071/ (mariadb: avoid state management thread death)

So we'd better rebase openstack-helm/openstack-helm-infra, as upstream has already fixed many stability issues.
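
For anyone re-testing after the rebase, a quick Galera health check sketch (how the chart exposes the root credentials is an assumption; adjust the mysql invocation to match the deployment):

# Every member should report a Primary component once recovery works
# ($MYSQL_ROOT_PASSWORD is a placeholder for however the chart stores it):
kubectl -n openstack exec mariadb-server-0 -- mysql -u root -p"$MYSQL_ROOT_PASSWORD" \
  -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_cluster_status','wsrep_cluster_size')"

Expected after a healthy restart: wsrep_cluster_status is Primary and wsrep_cluster_size matches the number of mariadb-server replicas.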

Thanks!
Zhipeng

zhipeng liu (zhipengs)
Changed in starlingx:
status: Triaged → Confirmed
Revision history for this message
yong hu (yhu6) wrote :

"So, we'd better rebase openstack-helm/openstack-helm-infra, as it already fixed many" the upgrade patch actually took the "March" version, which included Mariadb fixes.

Revision history for this message
yong hu (yhu6) wrote :

It's worth re-testing with the latest build (with "U", i.e. the Ussuri rebase) once it is available.

yong hu (yhu6)
tags: added: stx.retestneeded
Revision history for this message
zhipeng liu (zhipengs) wrote :

Test passed with the Ussuri build.
Engineering build: stx.master-20200701T120139Z + https://review.opendev.org/#/c/739046

Zhipeng

Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Ovidiu,

Could you help confirm whether we can close this ticket now?

Thanks!
Zhipeng

yong hu (yhu6)
Changed in starlingx:
status: Confirmed → Fix Released
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded