Inactive controller stays in booting after mgmtnetwork down

Bug #1796751 reported by Jose Perez Carranza
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: High
Assigned to: Eric MacDonald

Bug Description

Title
-----
Inactive controller host stays in booting after management network is admin down

Brief Description
-----------------
After the management network is brought down on the standby controller (unlocked/enabled), the controller reboots and then stays in the 'booting' state indefinitely (2 hrs during the test). The interface remains down the whole time.

Severity
--------
Major

Steps to Reproduce
------------------
1. ssh to inactive (aka standby) controller host
2. disable management network (eno2)
     $ sudo ifconfig eth1 down
3. Verify that the controller starts rebooting
4. Verify that the controller finishes rebooting and becomes active

Expected Behavior
------------------
The controller reboots and becomes active.

Actual Behavior
----------------
The controller remains in 'booting' for a long time (at least 2 hrs).

Reproducibility
---------------
100%

System Configuration
--------------------
AIO-DUPLEX Two node system redundant

Branch/Pull Time/Commit
-----------------------
Branch: r-2018.10

Timestamp/Logs
--------------
Logs are attached; the controller output is available at http://paste.openstack.org/show/731710/

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :
Ada Cabrales (acabrale)
tags: added: stx.2018.10
Revision history for this message
Juan Carlos Alonso (juancarlosa) wrote :

The issue also reproduces on a Virtual Multinode Local Storage configuration, on all hosts.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Was this reproduced on real hardware or only in VirtualBox?

Was there a separate infrastructure network provisioned?

If on real hardware, is there a BMC (board management controller) provisioned?

Please note that if there is no BMC provisioned and there is no separate Infrastructure network provisioned then downing the only connection to the inactive controller isolates it from the active controller.
While isolated there is no recovery method maintenance can use to reboot/reset that host.

Try bringing the management interface back up on the failed controller and see whether maintenance is able to recover it.
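
The reasoning above can be sketched as a small decision function. This is only an illustration of the comment's logic, not the StarlingX maintenance implementation; the function name, the yes/no flags, and the returned labels are all hypothetical.

```shell
#!/bin/sh
# Illustrative only: models the reachability reasoning described above.
# This is NOT how StarlingX maintenance (mtce) is implemented.
# Arguments are hypothetical "yes"/"no" flags:
#   $1 mgmt_up  - management link to the inactive controller is up
#   $2 infra_up - a separate infrastructure network is provisioned and up
#   $3 bmc_ok   - a BMC is provisioned and reachable from the active controller
recovery_path() {
    mgmt_up=$1
    infra_up=$2
    bmc_ok=$3
    if [ "$mgmt_up" = yes ] || [ "$infra_up" = yes ]; then
        echo "network-reboot"   # host is still reachable over a network
    elif [ "$bmc_ok" = yes ]; then
        echo "bmc-reset"        # out-of-band reset through the BMC
    else
        echo "isolated"         # no path left; manual intervention needed
    fi
}
```

For the scenario in this report (management down, no infrastructure network, BMC unreachable), `recovery_path no no no` prints `isolated`, which matches the host staying in 'booting' until the interface is brought back up by hand.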

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

- This was reproduced on real HW

- I think there is no separate infra network; I used the default steps in config_controller.

- The BMC is set up, but it seems the OAM network cannot reach the BMC network

- If I manually bring up the mgmt interface, the system recovers correctly

- Are you able to elaborate more on the steps and correct configuration to provision a separate infrastructure network?

Ghada Khalil (gkhalil)
tags: added: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Thank you for your response.

The failure to auto-recover this host after you downed the management interface is due to complete isolation of that server from the active controller.

An infrastructure network is optional and easily configured in config_controller but not necessary for your test case.

Once you fix access to the BMC over your OAM interface, maintenance will be able to recover the host by BMC reset.

The system is behaving as expected.

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
Changed in starlingx:
status: New → Invalid
Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

Is there any extra step that I need to do on the controller? Right now OAM and BMC are in the same vlan and they can actually see each other. On the controller I ran the steps below:

================
~(keystone_admin)$ system host-update hostname bm_username=user_name \
bm_password=password bm_type=bmc

~(keystone_admin)$ system host-update hostname bm_ip=ip_address
===================

But I'm still unable to ping the BMC from the controller. Do I need to add a route rule manually, or should the system add it automatically?

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

There is no rule that is needed.

You should be able to `ping <bmc_ip>` from the active controller.
No point provisioning the BMC if that does not work.

Feel free to add the /var/log/mtcAgent.log entries that pertain to the provisioning process. Tail this log while you provision the BMC.
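
A minimal sketch of that reachability check, to be run on the active controller before provisioning. The helper name `check_bmc_reachable` and its messages are assumptions made for illustration; the `-c` and `-W` options are those of the Linux iputils `ping`.

```shell
#!/bin/sh
# Hypothetical helper; name and messages are illustrative assumptions.
# Usage: check_bmc_reachable <bmc_ip>
check_bmc_reachable() {
    bmc_ip=$1
    if [ -z "$bmc_ip" ]; then
        echo "usage: check_bmc_reachable <bmc_ip>" >&2
        return 2
    fi
    # -c 3: send three echo requests; -W 2: wait up to 2 s for each reply
    if ping -c 3 -W 2 "$bmc_ip" >/dev/null 2>&1; then
        echo "BMC $bmc_ip is reachable"
    else
        echo "BMC $bmc_ip is NOT reachable; fix connectivity before provisioning"
        return 1
    fi
}

# In a second terminal, follow the maintenance log while provisioning the BMC:
#   tail -f /var/log/mtcAgent.log
```

Call it as `check_bmc_reachable <bmc_ip>` with the address that was passed to `system host-update`. A non-zero return means the BMC cannot be pinged, so provisioning it will not help recovery yet.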

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Oh, you mention a vlan.
This might be your issue.
There is no provisioning for the vlan id so if the BMC is on a vlan then it likely won't work.

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

We just made some changes to the infrastructure, and now the BMC and OAM are on the same network. The controller is recovered correctly, so I am confirming this as Invalid.

Ken Young (kenyis)
tags: added: stx.1.0
removed: stx.2018.10
Revision history for this message
Jose Perez Carranza (jgperezc) wrote : Re: GDC sync up

Hi Victor / Saul

Cristopher, Fernando, JC and I from the QA team won’t be able to attend the meeting due to a conflict with a training. If something is needed from our side, please let us know.

Regards
Jose
--

From: "Rodriguez Bahena, Victor" <email address hidden>
Date: Tuesday, August 27, 2019 at 8:00 AM
To: OTC Edge GDC <email address hidden>, "Wold, Saul" <email address hidden>, "Troyer, Dean" <email address hidden>
Cc: "Martinez Monroy, Elio" <email address hidden>
Subject: GDC sync up

When: Tuesday, August 27, 2019 8:00 AM-9:00 AM. (UTC-08:00) Pacific Time (US & Canada)

Where: Online Meeting

*~*~*~*~*~*~*~*~*~*
Hi GDC team, I will be in training all week; Saul will be leading the meeting.

Thanks


Revision history for this message
Jose Perez Carranza (jgperezc) wrote :
  • unnamed (2.8 KiB, text/calendar; charset="utf-8"; method=REQUEST)

When: Tuesday, October 01, 2019 3:00 PM-3:30 PM. (UTC-06:00) Guadalajara, Mexico City, Monterrey
Where: CR-ZPN1-007-6


We will review how to send patches to the external repo using “git format-patch” and “git am”.

Revision history for this message
Jose Perez Carranza (jgperezc) wrote : Canceled: How to send patches to the outside repo
  • unnamed (2.5 KiB, text/calendar; charset="utf-8"; method=CANCEL)

When: Tuesday, October 01, 2019 3:00 PM-3:30 PM. (UTC-06:00) Guadalajara, Mexico City, Monterrey
Where: CR-ZPN1-007-6


We will review how to send patches to the external repo using “git format-patch” and “git am”.
