stx-openstack fails to come back up after controllers reboot

Bug #1881899 reported by Ovidiu Poncea
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: zhipeng liu
Milestone: (none)

Bug Description

Brief Description
-----------------
Rebooted both controllers, but OpenStack fails to come back up.

[root@controller-1 sysadmin(keystone_admin)]# openstack endpoint list
Failed to discover available identity versions when contacting http://keystone.openstack.svc.cluster.local/v3. Attempting to parse version from URL.
Service Unavailable (HTTP 503)

Node status:
[root@controller-1 sysadmin(keystone_admin)]# system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | degraded     |
| 2  | controller-1 | controller  | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+
[root@controller-1 sysadmin(keystone_admin)]# fm alarm-list
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
| 400.001 | Service group cloud-services warning; dbmon(enabled-active, ) | service_domain=controller.service_group=cloud-services.host=controller-1 | minor | 2020-06-03T13:02:13.385662 |
| 400.001 | Service group cloud-services warning; dbmon(enabled-standby, ) | service_domain=controller.service_group=cloud-services.host=controller-0 | minor | 2020-06-03T13:01:13.112434 |
| 200.006 | controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=controller-0.process=pci-irq-affinity-agent | major | 2020-06-03T12:46:13.918380 |

[root@controller-1 sysadmin(keystone_admin)]# kubectl get pods -o wide -n openstack | grep -v Running | grep -v Completed
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cinder-api-59979594ff-25hrl 0/1 Init:0/2 0 26m 172.16.192.109 controller-0 <none> <none>
cinder-backup-6dd95fc9dd-svp5r 0/1 Init:0/4 0 26m 172.16.192.123 controller-0 <none> <none>
cinder-scheduler-76c65f6979-5tmt5 0/1 Init:0/2 0 26m 172.16.192.82 controller-0 <none> <none>
cinder-volume-b7dfbb7b9-f47bk 0/1 Init:0/4 0 26m 172.16.192.90 controller-0 <none> <none>
cinder-volume-b7dfbb7b9-mtkz8 0/1 Init:3/4 7 9h 172.16.166.185 controller-1 <none> <none>
cinder-volume-usage-audit-1591187700-g64xf 0/1 Init:0/1 0 27m 172.16.166.157 controller-1 <none> <none>
fm-rest-api-8b5b97bf8-qdlbx 0/1 Init:0/1 0 26m 172.16.192.67 controller-0 <none> <none>
fm-rest-api-8b5b97bf8-v5wz9 0/1 CrashLoopBackOff 9 8h 172.16.166.140 controller-1 <none> <none>
glance-api-6b74f659d-w9t4g 0/1 Init:0/3 0 26m 172.16.192.101 controller-0 <none> <none>
heat-api-846d848bd9-hd46z 0/1 Init:0/1 0 26m 172.16.192.108 controller-0 <none> <none>
heat-cfn-9d6f7ffc5-rvb4d 0/1 Init:0/1 0 26m 172.16.192.73 controller-0 <none> <none>
heat-engine-6487ff65c6-zk4n7 0/1 Init:0/1 0 26m 172.16.192.80 controller-0 <none> <none>
heat-engine-cleaner-1591187700-kd2pn 0/1 Init:0/1 0 27m 172.16.166.156 controller-1 <none> <none>
horizon-65d4b5bdcf-ltms2 0/1 Init:0/1 0 21m 172.16.192.83 controller-0 <none> <none>
keystone-api-6c76774bf7-l7c4d 0/1 Init:0/1 0 26m 172.16.192.113 controller-0 <none> <none>
libvirt-libvirt-default-4mf4v 0/1 Init:0/3 1 9h 192.168.204.2 controller-0 <none> <none>
mariadb-server-0 0/1 CrashLoopBackOff 8 9h 172.16.166.158 controller-1 <none> <none>
neutron-dhcp-agent-controller-0-937646f6-r5skk 0/1 Init:0/1 1 9h 192.168.204.2 controller-0 <none> <none>
neutron-l3-agent-controller-0-937646f6-fk5hc 0/1 Init:0/1 1 9h 192.168.204.2 controller-0 <none> <none>
neutron-metadata-agent-controller-0-937646f6-nzkxz 0/1 Init:0/2 1 9h 192.168.204.2 controller-0 <none> <none>
neutron-ovs-agent-controller-0-937646f6-dbfc2 0/1 Init:0/3 1 9h 192.168.204.2 controller-0 <none> <none>
neutron-server-7c9678cf58-dq52p 0/1 Init:0/1 0 26m 172.16.192.74 controller-0 <none> <none>
neutron-server-7c9678cf58-s85dg 0/1 CrashLoopBackOff 8 9h 172.16.166.191 controller-1 <none> <none>
neutron-sriov-agent-controller-0-937646f6-qxt9g 0/1 Init:0/2 1 9h 192.168.204.2 controller-0 <none> <none>
nova-api-metadata-b9b4fdb9b-d2gr6 0/1 CrashLoopBackOff 8 9h 172.16.166.177 controller-1 <none> <none>
nova-api-metadata-b9b4fdb9b-kg859 0/1 Init:0/2 0 26m 172.16.192.110 controller-0 <none> <none>
nova-api-osapi-856679d49f-4ljnl 0/1 Init:0/1 0 26m 172.16.192.117 controller-0 <none> <none>
nova-compute-controller-0-937646f6-9lrqs 0/2 Init:0/6 1 9h 192.168.204.2 controller-0 <none> <none>
nova-conductor-6cbc75dd89-nxvwc 0/1 Init:0/1 0 26m 172.16.192.66 controller-0 <none> <none>
nova-novncproxy-5bd676cfc4-82r8x 0/1 Init:0/3 0 26m 172.16.192.72 controller-0 <none> <none>
nova-scheduler-7fbf5cdd4-ckmkd 0/1 CrashLoopBackOff 6 9h 172.16.166.172 controller-1 <none> <none>
nova-scheduler-7fbf5cdd4-h65j5 0/1 Init:0/1 0 26m 172.16.192.121 controller-0 <none> <none>
nova-service-cleaner-1591189200-927h5 0/1 Init:0/1 0 6m56s 172.16.192.92 controller-0 <none> <none>

Severity
--------
Critical: OpenStack is unusable

Steps to Reproduce
------------------
1. Reboot both controllers with reboot -f
2. Wait for them to come back up (see the command sketch below)
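
A minimal command sketch of the reproduction and the post-reboot check (host names and the openstack namespace are taken from the transcripts above; the interval between the two reboots was not recorded):

# On each controller, force an ungraceful reboot:
sudo reboot -f

# Once both nodes are unlocked/enabled again, list the pods that are
# still unhealthy (same filter as used in the description):
kubectl get pods -o wide -n openstack | grep -v Running | grep -v Completed

# With keystone_admin credentials loaded, as in the prompts above:
openstack endpoint list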

Expected Behavior
------------------
'openstack endpoint list' should work

Actual Behavior
----------------
[root@controller-1 sysadmin(keystone_admin)]# openstack endpoint list
Failed to discover available identity versions when contacting http://keystone.openstack.svc.cluster.local/v3. Attempting to parse version from URL.
Service Unavailable (HTTP 503)

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
AIO-DX, IPv4

Branch/Pull Time/Commit
-----------------------
master

Test Activity
-------------
Developer Testing

Workaround
----------
None

Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

The issue seems to be caused by MariaDB not recovering, which leaves all OpenStack services unresponsive.
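
A minimal triage sketch for this theory (the pod name is taken from the listing above; /var/lib/mysql/grastate.dat is the usual Galera state file, and its presence at that path inside this image is an assumption):

# Check why the mariadb pod is crash-looping:
kubectl -n openstack logs mariadb-server-0 --previous | tail -n 50

# After an ungraceful stop of all Galera members, typically no node is
# left with safe_to_bootstrap: 1, which blocks automatic recovery
# (exec only works while the container is momentarily up):
kubectl -n openstack exec mariadb-server-0 -- cat /var/lib/mysql/grastate.dat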

description: updated
Ghada Khalil (gkhalil)
tags: added: stx.distro.openstack
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / high priority - stability issue w/ openstack

tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → yong hu (yhu6)
zhipeng liu (zhipengs)
Changed in starlingx:
assignee: yong hu (yhu6) → zhipeng liu (zhipengs)
Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi all,

I could not reproduce this issue with my Ussuri upgrade EB.
However, I could reproduce it with the 20200516T080009Z daily build.
From the error logs, this is an old issue that Chris Friesen analyzed early last year:
https://bugs.launchpad.net/starlingx/+bug/1816842/comments/3

In the Ussuri upgrade EB, we rebased openstack-helm-infra/mariadb.
It includes the two patches below, which fix this stability issue:
https://review.opendev.org/#/c/704034/ (Prevent splitbrain during full Galera restart)
https://review.opendev.org/#/c/708071/ (mariadb: avoid state management thread death)

So we'd better rebase openstack-helm/openstack-helm-infra, as upstream has already fixed many stability issues.
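
For anyone re-testing after the rebase, a quick Galera health check sketch (how the chart exposes the root credentials is an assumption; adjust the mysql invocation to match the deployment):

# Every member should report a Primary component once recovery works
# ($MYSQL_ROOT_PASSWORD is a placeholder for however the chart stores it):
kubectl -n openstack exec mariadb-server-0 -- mysql -u root -p"$MYSQL_ROOT_PASSWORD" \
  -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_cluster_status','wsrep_cluster_size')"

Expected after a healthy restart: wsrep_cluster_status is Primary and wsrep_cluster_size matches the number of mariadb-server replicas.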

Thanks!
Zhipeng

zhipeng liu (zhipengs)
Changed in starlingx:
status: Triaged → Confirmed
Revision history for this message
yong hu (yhu6) wrote :

"So, we'd better rebase openstack-helm/openstack-helm-infra, as it already fixed many" the upgrade patch actually took the "March" version, which included Mariadb fixes.

Revision history for this message
yong hu (yhu6) wrote :

It's worth re-testing with the latest build (with "U", i.e. the Ussuri rebase) once it is available.

yong hu (yhu6)
tags: added: stx.retestneeded
Revision history for this message
zhipeng liu (zhipengs) wrote :

Test passed with the Ussuri build.
Engineering build: stx.master-20200701T120139Z + https://review.opendev.org/#/c/739046

Zhipeng

Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Ovidiu,

Could you help confirm whether we can close this ticket now?

Thanks!
Zhipeng

yong hu (yhu6)
Changed in starlingx:
status: Confirmed → Fix Released
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded