Inactive controller stays in booting after mgmtnetwork down

Bug #1796751 reported by Jose Perez Carranza on 2018-10-08
Affects: StarlingX
Status: Invalid
Importance: High
Assigned to: Eric MacDonald
Milestone: —

Bug Description

Title
-----
Inactive controller host stays in booting after management network is admin down

Brief Description
-----------------
After the management network is brought down on the standby controller (unlocked/enabled), the controller is rebooted; after that its status remains 'booting' indefinitely (2 hrs observed in the test) and the interface stays down the whole time.

Severity
--------
Major

Steps to Reproduce
------------------
1. ssh to the inactive (aka standby) controller host
2. Disable the management network interface (eno2):
     $ sudo ifconfig eth1 down
3. Verify that the controller starts rebooting
4. Verify that the controller reboots and becomes active (a status check is sketched after this list)
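
To monitor the standby host during this test, its state can be checked from the active controller. A minimal sketch, assuming the standby host is controller-1 and admin credentials are sourced (as in the ~(keystone_admin)$ prompts later in this report):

     ~(keystone_admin)$ system host-list
     ~(keystone_admin)$ system host-show controller-1 | grep -E 'task|availability|operational'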

Expected Behavior
------------------
The controller reboots and becomes active.

Actual Behavior
----------------
The controller remains in 'booting' for a long time (at least 2 hrs).

Reproducibility
---------------
100%

System Configuration
--------------------
AIO-Duplex (two-node redundant system)

Branch/Pull Time/Commit
-----------------------
Branch: r-2018.10

Timestamp/Logs
--------------
Logs are attached; controller console output is here: http://paste.openstack.org/show/731710/

Jose Perez Carranza (jgperezc) wrote :

The issue can be reproduced in a virtual multinode local-storage configuration on all hosts.

Ada Cabrales (acabrale) on 2018-10-09
tags: added: stx.2018.10

Eric MacDonald (rocksolidmtce) wrote :

Was this reproduced on real hardware or only in VirtualBox?

Was a separate infrastructure network provisioned?

If on real hardware, is a BMC (baseboard management controller) provisioned?

Please note that if no BMC is provisioned and there is no separate infrastructure network, then downing the only connection to the inactive controller isolates it from the active controller.
While it is isolated, there is no recovery method maintenance can use to reboot/reset that host.

Try bringing the management interface back up on that failed controller and see if maintenance is able to recover it.
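
For reference, a minimal sketch of that recovery attempt on the failed controller's console, assuming eth1 is the management interface as in the reproduction steps:

     $ sudo ifconfig eth1 up
     # or, equivalently, with iproute2:
     $ sudo ip link set eth1 up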

Jose Perez Carranza (jgperezc) wrote :

- This was reproduced on real hardware.

- I don't think there is a separate infra network; I used the default steps in config_controller.

- A BMC is set up, but it seems the OAM network cannot reach the BMC network.

- If I manually bring the mgmt interface up, the system recovers correctly.

- Can you elaborate on the steps and the correct configuration to provision a separate infrastructure network?

Ghada Khalil (gkhalil) on 2018-10-09
tags: added: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Eric MacDonald (rocksolidmtce) wrote :

Thank you for your response.

The failure to auto-recover this host after you downed the management interface is due to the complete isolation of that server from the active controller.

An infrastructure network is optional and easily configured in config_controller, but it is not necessary for your test case.

Once you fix access to the BMC over your OAM interface, maintenance will be able to recover the host by BMC reset.
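
A quick way to verify that access, sketched with placeholder values; whether the BMC answers IPMI-over-LAN depends on the hardware and the BMC configuration:

     # From the active controller:
     $ ping -c 3 <bmc_ip>
     # Optional IPMI check (requires ipmitool and a LAN-enabled BMC):
     $ ipmitool -I lanplus -H <bmc_ip> -U <bmc_user> -P <bmc_password> chassis power status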

The system is behaving as expected.

Ghada Khalil (gkhalil) on 2018-10-10
Changed in starlingx:
importance: Undecided → High
Changed in starlingx:
status: New → Invalid
Jose Perez Carranza (jgperezc) wrote :

Is there any extra step that I need to do on the controller? Right now OAM and BMC are in the same VLAN and they can actually reach each other. On the controller I did the steps below:

================
~(keystone_admin)$ system host-update hostname bm_username=user_name \
bm_password=password bm_type=bmc

~(keystone_admin)$ system host-update hostname bm_ip=ip_address
================

But I'm still unable to ping the BMC from the controller. Do I need to manually add a route rule for this, or should the system add the rule automatically?

Eric MacDonald (rocksolidmtce) wrote :

No rule is needed.

You should be able to `ping <bmc_ip>` from the active controller.
There is no point provisioning the BMC if that does not work.

Feel free to add the /var/log/mtcAgent.log entries that pertain to the provisioning process. Tail this log while you provision the BMC.
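
For example, something along these lines from the active controller (a sketch; <bmc_ip> is the address you provisioned):

     # Check basic reachability and the route the controller would use:
     $ ping -c 3 <bmc_ip>
     $ ip route get <bmc_ip>
     # Watch maintenance react while (re)provisioning the BMC:
     $ tail -f /var/log/mtcAgent.log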

Eric MacDonald (rocksolidmtce) wrote :

Oh, you mention a VLAN.
This might be your issue.
There is no provisioning for a VLAN ID, so if the BMC is on a VLAN then it likely won't work.
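
If the BMC does sit on a tagged VLAN, one hypothetical diagnostic (not a supported provisioning step) is to bring up a temporary VLAN subinterface on the controller and ping through it; the interface name, VLAN ID, and addresses below are placeholders:

     $ sudo ip link add link <oam_if> name <oam_if>.<vlan_id> type vlan id <vlan_id>
     $ sudo ip addr add <test_ip>/<prefix> dev <oam_if>.<vlan_id>
     $ sudo ip link set <oam_if>.<vlan_id> up
     $ ping -c 3 <bmc_ip>
     # Clean up afterwards:
     $ sudo ip link del <oam_if>.<vlan_id>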

Jose Perez Carranza (jgperezc) wrote :

We just made some changes to the infrastructure and now the BMC and OAM are on the same network. The controller is recovered correctly, so I am confirming this as Invalid.

Ken Young (kenyis) on 2019-04-06
tags: added: stx.1.0
removed: stx.2018.10