StarlingX

OpenStack Horizon not available on floating IP/active node's IP for 4-5 mins during reboot/shutdown of standby node while testing HA

Bug #1852395 reported by Akshay on 2019-11-13

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Won't Fix	Low	chen haochuan

Bug Description

Brief Description
-----------------
Setup: I have deployed StarlingX R2 on bare metal in duplex mode.

Test Case: While testing HA, I rebooted/switched off the standby node.

Issue: When I tried to access OpenStack Horizon on the floating IP or the active node's IP, Horizon was unavailable for 4-5 mins, starting as soon as I rebooted/switched off the standby node.

I tried this case many times with the same result.
Is this the expected behavior? If not, please guide me to find the real issue.

Severity
--------

Critical

Steps to Reproduce
------------------
1. Deploy StarlingX R2 on bare metal in duplex mode.
2. Reboot/switch off the standby node.
3. Access OpenStack Horizon.

Expected Behavior
------------------
Horizon should always be available, at least on the floating IP.

Actual Behavior
----------------
Horizon becomes unavailable for 4-5 mins.
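
One rough way to put a number on the outage is to poll the Horizon URL at a fixed interval while the standby node is rebooted, then collapse the results into outage windows. This is only a minimal sketch, not part of the original report: the function names, the polling interval, and the rule that any HTTP status below 500 counts as "available" are all assumptions, and the URL passed to watch() is a placeholder for the floating IP or the active node's IP.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe(url, timeout=3.0):
    """Return True if the URL answers with an HTTP status below 500.

    Assumption: any non-5xx answer means Horizon is reachable.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as exc:
        # HTTPError is raised for 4xx/5xx answers; the server did respond.
        return exc.code < 500
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, reset, and similar failures.
        return False


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (start, end) outage windows.

    An outage starts at the first failed sample and ends at the next
    successful one. An outage still open at the end of the samples is
    closed at the last sample's timestamp.
    """
    windows = []
    start = None
    last_ts = None
    for ts, ok in samples:
        last_ts = ts
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, last_ts))
    return windows


def watch(url, duration=600, interval=5):
    """Poll url every `interval` seconds for `duration` seconds."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.time(), probe(url)))
        time.sleep(interval)
    return outage_windows(samples)
```

Starting watch() on a third machine just before rebooting the standby node would return a list of (start, end) timestamps, so the 4-5 min figure can be checked against measured values.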

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Two node system

Last Pass
---------
NO

Tags:

Revision history for this message

Akshay (yadavakshay58) wrote on 2019-11-15:

Also, if this is the expected behavior, why does it take this much time? Do all the pods/containers for the OpenStack services get re-initialized, or is it something else?

ANIRUDH GUPTA (anyrude10) wrote on 2019-11-19:

Please find below the behavior I have observed for this issue.

When both controller nodes are up and running, I can see 2 pods of each service in my system:

controller-0:~$ kubectl get po -n openstack | grep glance
glance-api-55fd4664c5-8jgpn 1/1 Running 0 23h
glance-api-55fd4664c5-rklv6 1/1 Running 0 89m

Case 1: When one controller is rebooted

One of the pods remains in the Running state.
The pod on the node that goes down enters the Terminating state, and a replacement pod starts being created, which is currently in the Pending state.

controller-0:~$ kubectl get po -n openstack | grep glance
glance-api-55fd4664c5-8jgpn 1/1 Running 0 23h
glance-api-55fd4664c5-d42kg 0/1 Pending 0 49s
glance-api-55fd4664c5-rklv6 1/1 Terminating 0 92m

Once the node reboots successfully, there are again only 2 Running pods.

Case 2: When one controller is powered off

One of the pods remains in the Running state.
The pod on the node that goes down enters the Terminating state, and a replacement pod starts being created, which is currently in the Pending state.

controller-0:~$ kubectl get po -n openstack | grep glance
glance-api-55fd4664c5-8jgpn 1/1 Running 0 23h
glance-api-55fd4664c5-d42kg 0/1 Pending 0 49s
glance-api-55fd4664c5-rklv6 1/1 Terminating 0 92m

The pods stay in this state.

In both cases, all the services on the running node are also inaccessible for around 4-5 mins.
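
To track these pod states over time instead of reading them by eye, the kubectl output can be tallied per status. The following is a minimal sketch, not something from the original thread: the helper name is hypothetical, and it assumes the default whitespace-separated column layout (NAME READY STATUS RESTARTS AGE) shown in the listings above.

```python
from collections import Counter


def pod_status_counts(kubectl_output, name_prefix=""):
    """Count pods per STATUS in `kubectl get po` text output.

    Assumes the default columns NAME READY STATUS RESTARTS AGE.
    Only pods whose name starts with name_prefix are counted.
    """
    counts = Counter()
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and the header row.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        if fields[0].startswith(name_prefix):
            counts[fields[2]] += 1
    return counts


# Listing copied from the observation above.
SAMPLE = """\
glance-api-55fd4664c5-8jgpn 1/1 Running 0 23h
glance-api-55fd4664c5-d42kg 0/1 Pending 0 49s
glance-api-55fd4664c5-rklv6 1/1 Terminating 0 92m
"""

if __name__ == "__main__":
    print(dict(pod_status_counts(SAMPLE, "glance-api")))
```

Feeding it the output of "kubectl get po -n openstack" every few seconds during the reboot would show how long each service keeps a pod in Pending or Terminating.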

yong hu (yhu6) on 2019-12-04

tags:	added: stx.2.0
tags:	added: stx.distro.openstack

ANIRUDH GUPTA (anyrude10) wrote on 2019-12-04:

controller-0_20191204.035130.tar (44.4 MiB, application/x-tar)

Hi Yong,

As discussed in yesterday's distro OpenStack call, I am sharing the "collect" logs of Controller-0, which is currently active, from when the standby controller was rebooted.

Test Scenario:

Controller-0 is active and Controller-1 is standby.

When the standby node Controller-1 is rebooted, the services on the active controller still stop for around 4-5 mins.
I am unable to access any OpenStack service, and Horizon is not accessible from either the Controller-0 OAM IP or the floating IP.

chen haochuan (martin1982) on 2019-12-05

Changed in starlingx:
assignee:	nobody → chen haochuan (martin1982)

chen haochuan (martin1982) wrote on 2019-12-11:

Not reproduced on the latest image. Will check the R2 release.

yong hu (yhu6) on 2019-12-19

Changed in starlingx:
importance:	Undecided → Low

yong hu (yhu6) wrote on 2019-12-19:

This LP is similar to https://bugs.launchpad.net/starlingx/+bug/1855474
The root cause was that the "mariadb-server" pods were not recovered in a timely manner.
Until the solution is worked out for the 3.x maintenance release, please do NOT run these "rebooting controller" tests.

Ghada Khalil (gkhalil) on 2020-01-31