OpenStack pods not recovered soon after force-rebooting the active controller

Bug #1881722 reported by zhipeng liu
Affects: StarlingX | Status: Invalid | Importance: High | Assigned to: zhipeng liu

Bug Description

Brief Description
-----------------
It usually takes around 10 minutes for all pods to become ready again after force-rebooting the active/standby controller.

Severity
--------
Major

Steps to Reproduce
------------------
- Install and configure the system, apply the stx-openstack application
- Lock/unlock the standby controller
- Reset (i.e. reboot -f) the standby controller
- Reset (i.e. reboot -f) the active controller
- Reapply stx-openstack after each of the above scenarios
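The recovery-time measurement behind this report can be sketched as a small shell helper that polls a check command and prints how long it took to pass. This is only a sketch: the `openstack` namespace and the kubectl filter in the comment are assumptions, not taken from the report.

```shell
#!/bin/sh
# wait_and_time CMD: poll CMD until it succeeds, then print elapsed seconds.
# A 5s interval is an arbitrary polling cadence; tune as needed.
wait_and_time() {
  start=$(date +%s)
  until eval "$1"; do
    sleep 5
  done
  echo "recovered after $(( $(date +%s) - start ))s"
}

# On a real duplex lab the check might be (assumption, not from the report):
#   wait_and_time 'test -z "$(kubectl get pods -n openstack --no-headers | grep -Ev "Running|Completed")"'
# Stub so the sketch runs anywhere:
wait_and_time true
```

Started right after issuing `reboot -f`, this gives the wall-clock recovery time discussed in the comments below.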

Expected Behavior
------------------
- All OpenStack pods recover to the Running or Completed state shortly after the reboot.

Actual Behavior
----------------
- From issuing "reboot -f" to the time all pods are ready again takes around 10 minutes.

Reproducibility
---------------
100% (performance issue?)

System Configuration
--------------------
Duplex

Branch/Pull Time/Commit
-----------------------
stx master daily build 20200530T013359Z

zhipeng liu (zhipengs)
Changed in starlingx:
assignee: nobody → zhipeng liu (zhipengs)
status: New → In Progress
Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi all,

I have run many tests on 4 different setups, including daily build 0530.
Recovery usually takes 8~10 min.
The mariadb pod and ovs-db pod take some time to become ready.

Below are the time statistics.
4 min for the host to restart and become ready.
3 min for mariadb terminating, initializing, and becoming ready (then the configmap sync is ready).
2 min for ovs-db to become ready (reducing the liveness/readiness probe timers can improve this a little, as the probe can then retry the connection to ovs-vsctl: unix:/var/run/openvswitch/db.sock more quickly).
1 min for the other pods, like neutron-ovs-agent, which depends on ovs-db.

Thanks!
Zhipeng
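The probe-timer reduction mentioned above for ovs-db could be expressed as a Helm override. This is only a sketch: the chart name (openvswitch), the pod.probes value layout, and the timer values are assumptions following openstack-helm-infra conventions, not confirmed against the stx-openstack chart.

```yaml
# Hypothetical override: shorten the ovs-db liveness/readiness probe timers so
# the pod retries the db.sock connection sooner after a controller reboot.
# Value paths assume an openstack-helm-infra style "pod.probes" layout.
pod:
  probes:
    ovs_db:
      ovs_db:
        liveness:
          params:
            initialDelaySeconds: 5
            periodSeconds: 10
        readiness:
          params:
            initialDelaySeconds: 5
            periodSeconds: 10
```

On StarlingX such an override would be applied with `system helm-override-update` before reapplying stx-openstack; the exact arguments depend on the release.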

Ghada Khalil (gkhalil)
tags: added: stx.distro.openstack
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / high priority - As per PTG discussion, openstack stability is a priority

tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → High
zhipeng liu (zhipengs)
Changed in starlingx:
status: In Progress → Confirmed
Revision history for this message
zhipeng liu (zhipengs) wrote :

No obvious issue found.
If we have a specific performance requirement, we can do some further analysis to see whether the current recovery time is reasonable and close the gap.

Thanks!
Zhipeng

Revision history for this message
Akshay (akshay346) wrote :

Hi zhipeng,

Can you please tell me the specs of your host machine (RAM, cores, etc.)?

Revision history for this message
yong hu (yhu6) wrote :

@zhipeng: please run a test on the latest build. While the active controller is rebooting, the OpenStack services should remain accessible, because another set of services is running on the standby controller.

Revision history for this message
zhipeng liu (zhipengs) wrote :

Could not reproduce it with the latest master build (Ussuri),
BUILD_ID="20200712T080011Z".
The command "openstack endpoint list" works before all pods are ready.
All pods recover after force-rebooting both controllers.

Zhipeng

Changed in starlingx:
status: Confirmed → Invalid
Revision history for this message
Frank Miller (sensfan22) wrote :

How did you test this? Did you have openstack commands running on the standby controller in a loop and then reset the active controller?

And how long were the commands not working when the active controller was reset? It is expected they will stop working until the standby controller becomes active. But they should not take 10 minutes before they start to work.
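The measurement Frank describes (running OpenStack commands in a loop on the standby controller and timing the outage) could be scripted along these lines. A sketch only: the command, iteration count, and cadence are placeholders, not from the report.

```shell
#!/bin/sh
# probe_loop CMD N: run CMD N times, printing a timestamped line each time
# the result flips between up and down. The gap between a "down" line and
# the next "up" line is the observed API outage during the controller reset.
probe_loop() {
  cmd=$1; n=${2:-10}; prev=""; i=0
  while [ "$i" -lt "$n" ]; do
    if eval "$cmd" >/dev/null 2>&1; then state=up; else state=down; fi
    [ "$state" != "$prev" ] && echo "$(date +%s) $state"
    prev=$state
    i=$((i + 1))
  done
}

# Real usage on the standby controller (assumption, per the discussion above),
# ideally with a sleep between iterations:
#   probe_loop 'openstack endpoint list' 1000
# Stub so the sketch runs anywhere:
probe_loop true 3
```

Run before issuing `reboot -f` on the active controller, the log directly answers how long the commands stopped working.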

Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Frank,

If only the active controller is reset, it takes no more than 3 min before "openstack endpoint list" works again on the standby controller.

If both controllers are reset, it takes around 10 min before "openstack endpoint list" works again on the controller.

Thanks
Zhipeng
