host-list shows incorrect state for controller after management and infra cables are pulled and put back

Bug #1815513 reported by Anujeyan Manokeran
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Yi Wang

Bug Description

After the management and infra cables were pulled on the active controller (controller-1), a swact occurred. The cables were put back within 30 seconds, but controller-1 remained in the offline state for 15 to 20 minutes and then got rebooted. While system host-list was showing controller-1 as offline, it was still possible to ping and ssh to controller-1 from controller-0.

2019-02-11T13:45:55.886 [8316.00249] controller-0 mtcAgent |-| nodeClass.cpp (2473) start_offline_handler : Info : controller-1 starting offline handler (unlocked-disabled-failed) (stage:0)
2019-02-11T13:45:55.886 [8316.00250] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1903) recovery_handler : Warn : controller-1 cannot issue Reset
2019-02-11T13:45:55.886 [8316.00251] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1904) recovery_handler : Warn : controller-1 ... board management not provisioned or accessible
2019-02-11T13:45:55.886 [8316.00252] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1925) recovery_handler : Info : controller-1 Graceful Recovery Wait (1200 secs) (uptime was 0)

system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 2 | controller-1 | controller | unlocked | disabled | offline |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | enabled | available |
| 8 | compute-3 | worker | unlocked | enabled | available |
| 9 | compute-4 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

system service-parameter-list
+--------------------------------------+----------+-------------+-----------------------------+-------+-------------+----------+
| uuid | service | section | name | value | personality | resource |
+--------------------------------------+----------+-------------+-----------------------------+-------+-------------+----------+
| ba7a62bc-a38f-40a9-bbd9-b4566b33b87d | identity | assignment | driver | sql | None | None |
| 71fc8993-3c5a-4f2b-95d8-129ffb9c6945 | horizon | auth | lockout_retries | 3 | None | None |
| e91fabbc-ec67-4000-8c34-2b95bd9c3eaf | horizon | auth | lockout_seconds | 300 | None | None |
| 1ad073ae-3cff-4ea7-a792-7cef6fced0e6 | swift | config | fs_size_mb | 25 | None | None |
| 778b7be0-41eb-484e-922a-ebe241f0e275 | http | config | http_port | 8080 | None | None |
| 8fce8fa0-8b45-4a55-839c-91d5368fc027 | http | config | https_port | 8443 | None | None |
| 5112802a-005d-414d-9222-c95709130081 | swift | config | service_enabled | false | None | None |
| 493f0d08-b88a-4bf7-abf6-117af08fc6c0 | identity | config | token_expiration | 3600 | None | None |
| c401d443-7694-4927-9525-84f49741e949 | aodh | database | alarm_history_time_to_live | 86400 | None | None |
| c083339b-fd3e-4000-920f-a724b20b392e | panko | database | event_time_to_live | 86400 | None | None |
| 456d240f-d251-4457-a16d-2ab60f01e372 | cinder | emc_vnx | enabled | false | None | None |
| 20569cc2-dd68-46a7-a6d3-176cd2443e8e | cinder | hpe3par | enabled | false | None | None |
| 6dcd8ec9-6df3-4055-a9e3-ca2b6c894aae | cinder | hpe3par10 | enabled | false | None | None |
| 55af06dc-db63-4109-a20f-44a84551f6b9 | cinder | hpe3par11 | enabled | false | None | None |
| fe880c3d-23cb-4b35-9848-c8f40aa86f5d | cinder | hpe3par12 | enabled | false | None | None |
| dd1e2864-7367-4fce-b03c-02c29f20e4bb | cinder | hpe3par2 | enabled | false | None | None |
| 3c5b0081-cb53-4ee9-8b7a-64b3f7ce80f0 | cinder | hpe3par3 | enabled | false | None | None |
| 539cfcc4-51fb-439a-9b5a-0e6f2a302924 | cinder | hpe3par4 | enabled | false | None | None |
| 29fca92a-e954-45b5-bdb4-1d290e8ddf49 | cinder | hpe3par5 | enabled | false | None | None |
| e911b5b9-9fdd-496e-9a69-f4df9a33b416 | cinder | hpe3par6 | enabled | false | None | None |
| 14c3fe2c-017c-4901-a568-d2b81278d9d2 | cinder | hpe3par7 | enabled | false | None | None |
| 3a15fc60-e54f-4c34-bead-3d26fad00e7d | cinder | hpe3par8 | enabled | false | None | None |
| 65711f82-3b09-4a03-83dc-469200227c5c | cinder | hpe3par9 | enabled | false | None | None |
| 1392c4a3-ea35-4e1c-af3b-86d6da51be71 | cinder | hpelefthand | enabled | false | None | None |
| e234748c-0c00-4cb6-b937-fda03694de09 | identity | identity | driver | sql | None | None |
| c68b8e30-8b28-4497-9b98-a15eada95ed7 | platform | maintenance | controller_boot_timeout | 1200 | None | None |
| dfc1b687-0dd8-4302-a50f-c5468956679d | platform | maintenance | heartbeat_degrade_threshold | 6 | None | None |
| 98f371fc-433e-48c8-b39b-7d55894c0f46 | platform | maintenance | heartbeat_failure_action | fail | None | None |
| f1d330df-bfe7-426c-8f99-1f5711886431 | platform | maintenance | heartbeat_failure_threshold | 10 | None | None |
| c3e4e458-f564-4d06-9ab1-8ea75a39aaf1 | platform | maintenance | heartbeat_period | 100 | None | None |
| bba069d0-f3f7-4036-a490-c49752a2119e | platform | maintenance | mnfa_threshold | 2 | None | None |
| 07d05bf3-352e-47c4-9036-c3f7fbbc5ce5 | platform | maintenance | mnfa_timeout | 0 | None | None |
| b9e1be7c-536e-4d2d-9de8-16e247439e36 | platform | maintenance | worker_boot_timeout | 720 | None | None |
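
For reference, the maintenance parameters above drive the behavior seen in the mtcAgent log: heartbeat_period (100, in milliseconds) times heartbeat_failure_threshold (10) gives roughly a one-second heartbeat failure detection window, and controller_boot_timeout (1200 s) matches the "Graceful Recovery Wait (1200 secs)" log entry. A minimal sketch for inspecting and tuning these values, assuming the standard StarlingX service-parameter CLI and admin credentials at /etc/platform/openrc (the threshold value shown is only an illustration, not a recommendation):

source /etc/platform/openrc
# show the platform maintenance parameters referenced above
system service-parameter-list | grep maintenance
# illustrative change: widen the heartbeat failure window, then apply it
system service-parameter-modify platform maintenance heartbeat_failure_threshold=20
system service-parameter-apply platform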

Severity
--------
Major

Steps to Reproduce
------------------
1. Pull the management cable on the active controller (controller-1), which has a VLAN-shared management and infra network.
2. Verify the swact to controller-0.
3. Verify the controller-1 host state in system host-list: as described above, it was reported as offline even though it could still be pinged and reached over ssh (see the verification sketch below).
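
A minimal verification sketch for step 3, run from controller-0. It assumes admin credentials can be loaded from /etc/platform/openrc and that controller-1 resolves over the management network; the grep filter is only a convenience for reading the host-list output.

source /etc/platform/openrc
# reported state: expect "offline" in this row even though the node is reachable
system host-list | grep controller-1
# reachability checks over the management network: both should succeed
ping -c 3 controller-1
ssh controller-1 hostname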

Expected Behavior
------------------
The controller reboots after the cable is put back and returns to the correct state.

Actual Behavior
----------------
As per the description, controller-1 is shown in the wrong state in system host-list.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
storage system

Branch/Pull Time/Commit
-----------------------
StarlingX_Upstream_build release branch build as of 2019-02-10_20-18-00

Timestamp/Logs
--------------
Feb 11 13:46:41 UTC 2019

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; requires further investigation

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.05 stx.metal
Ken Young (kenyis)
Changed in starlingx:
assignee: Eric MacDonald (rocksolidmtce) → Cindy Xie (xxie1)
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Yi Wang (wangyi4) wrote :

With the 4.17 codebase, duplex configuration, virtual environment, I did not reproduce this issue.

Detailed steps:
1. Built an ISO with the 4.17 codebase and deployed it with a duplex configuration in a virtual environment.

2. Swact from controller-0 to controller-1:
system host-swact controller-0

3. Unplug and plug controller-1's interface:
(1) sudo brctl delif virbr2 vnet5
After a while, "system host-list" on controller-1 shows that the availability of controller-0 is "failed"; then controller-0 reboots.
The availability of controller-0 becomes "offline".
(2) sudo brctl addif virbr2 vnet1 (less than 30 s between (1) and (2))
(3) sudo ifconfig vnet1 up
After controller-0 reboots, "system host-list" on controller-1 shows controller-0 as online.

4. We can't swact from controller-1 to controller-0; it shows controller-0 has config not applied and suggests a lock/unlock of controller-0.
(1) system host-lock -f controller-0
Controller-0 reboots; wait for controller-0 to come up.
(2) system host-unlock controller-0
Wait for controller-0 to come up again.

5. Swact from controller-1 to controller-0:
system host-swact controller-1

6. Unplug and plug controller-1's interface:
(1) sudo brctl delif virbr2 vnet5
(2) sudo brctl addif virbr2 vnet1
(3) sudo ifconfig vnet1 up
Controller-1 reboots. When it is up, "system host-list" on controller-0 shows controller-1 as online.

7. Performing similar operations on controller-0 (unplugging and plugging its interface when it is/isn't the active controller) gives the same result (see the unplug/replug sketch below).
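
A hedged automation sketch of the unplug/replug sequence above, run on the virtualization host (not on a controller). The bridge and tap names (virbr2, vnet5) are environment-specific assumptions; pick the tap attached to the controller's management interface, e.g. from "virsh domiflist <domain>".

BRIDGE=virbr2
TAP=vnet5
sudo brctl delif "$BRIDGE" "$TAP"   # "unplug" the management interface
sleep 20                            # keep it unplugged for less than 30 s
sudo brctl addif "$BRIDGE" "$TAP"   # "plug" it back in
sudo ifconfig "$TAP" up             # ensure the tap device is up again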

Revision history for this message
Yi Wang (wangyi4) wrote :

With the 4.17 codebase, duplex configuration, bare metal environment, I reproduced this issue.

I unplugged the management interface cable of controller-1 (active) for a while and plugged it back.
Controller-0 became the active controller. On controller-0, "system host-list" showed controller-1 as offline.
Controller-1 still seemed to be an active controller. On controller-1, "system host-list" showed controller-0 as offline.

The connection between controller-0 and controller-1 was okay; I could ssh from one controller to the other. But the status reported by "system host-list" was incorrect.

Locking controller-0 from controller-1, restarting controller-0, and then unlocking controller-0 from controller-1 made the status correct again (see the recovery sketch below).
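
A minimal sketch of that lock/unlock recovery, run from the healthy active controller (controller-1 here), assuming admin credentials at /etc/platform/openrc; the grep patterns are assumptions about the host-show output layout.

source /etc/platform/openrc
system host-lock -f controller-0      # force-lock the stuck controller; it reboots
# wait until the locked host reports online again
until system host-show controller-0 | grep -q 'availability.*online'; do sleep 30; done
system host-unlock controller-0       # unlock; the host reboots again and rejoins
until system host-show controller-0 | grep -q 'availability.*available'; do sleep 30; done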

Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Yi Wang (wangyi4)
Revision history for this message
Yi Wang (wangyi4) wrote :

@Anujeyan, as I commented before, with a duplex configuration I saw different behavior: after lock/unlock operations, the system recovered.

I made a multi-node deployment to check this issue again. I don't have as many machines as you; my configuration is 2 controllers + 1 storage + 1 compute. The codebase is from early May. Below are my steps.

1. Swact from controller-0 to controller-1; verify controller-1 becomes active.
2. Unplug controller-1's management cable, wait ~20 s, plug it back.
3. Verify controller-0 becomes active.
4. Check host status from the active controller (controller-0):
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 3 | controller-1 | controller | unlocked | enabled | degraded |
| 4 | storage-0 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

After ~20 minutes, the DRBD sync finished. Controller-1 became the active controller, and the other nodes restarted. After a while:

[wrsroot@controller-1 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | disabled | intest |
| 3 | controller-1 | controller | unlocked | enabled | available |
| 4 | storage-0 | storage | unlocked | disabled | intest |
| 5 | compute-0 | worker | unlocked | disabled | intest |
+----+--------------+-------------+----------------+-------------+--------------+

[wrsroot@controller-1 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | enabled | degraded |
| 4 | storage-0 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

After a while, controller-0 became the active controller.

[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked ...


Revision history for this message
Yi Wang (wangyi4) wrote :

@Anujeyan, I didn't see the same phenomenon as you did. Could you check whether you can reproduce this issue?

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Yi Wang (wangyi4) wrote :

@Anujeyan, I did one more round of testing since I didn't see the same phenomenon as what you reported. Below are my steps. I made some changes to step 2 to try to reproduce your result.

1. Swact from controller-0 to controller-1; verify controller-1 becomes active.
2. Unplug controller-1's management cable. Verify controller-0 becomes active. Wait until controller-1 becomes offline on controller-0, as shown below (in your steps, you only waited less than 30 s), then plug the cable back.

[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 3 | controller-1 | controller | unlocked | disabled | offline |
| 4 | storage-0 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

3. Controller-0's availability changed from "degraded" to "available" after a while:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | disabled | offline |
| 4 | storage-0 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

At that time, I could connect to controller-1 from controller-0, but the status shown was offline.

4. After several minutes, controller-1 rebooted and finally recovered. This is different from what you reported (staying offline even after rebooting).

[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | enabled | available |
| 4 | storage-0 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

Details can be found in the attached mtcAgent.log from controller-0.

So the whole behavior is expected. Could you check it again?

Revision history for this message
Yi Wang (wangyi4) wrote :

@Anujeyan, any updates from your side?

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

I will try when I get a similar lab. In the above test I see only one storage node; I would suggest having a similar configuration with 2 storage nodes.

Revision history for this message
Yi Wang (wangyi4) wrote :

@Anujeyan Thanks for your help! I don't have enough machines; that's why I chose the 2+1+1 configuration. Is the number of storage nodes related to this issue?

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

I am not sure whether the issue is related to the storage nodes, but this is just to make sure we test in the same configuration.
I was able to test in a similar configuration. When I pulled the controller-0 management cable for less than 30 seconds, as described above, controller-0 got rebooted and came up in the failed state shown below. Then, after a long wait of ~15 minutes, it got rebooted again to recover from the failed state. This was observed for cable pulls on both the active and the standby controller. However, if the cable is pulled for more than 30 seconds, the host recovers as available with a single reboot.

Below is the controller state after the first reboot, when the cable pull was within 30 seconds.
 system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | disabled | failed |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | enabled | available |
| 8 | compute-3 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-1 ~(keystone_admin)]$ system host-list

Revision history for this message
Yi Wang (wangyi4) wrote :

@Anujeyan, it seems we both see the same phenomenon when the cable is pulled for more than 30 s. That whole behavior is expected.

For the case of pulling the cable for less than 30 s, in my test we could use a lock/unlock operation to recover the system. Do you think that is acceptable behavior?

Revision history for this message
Yi Wang (wangyi4) wrote :

Per Eric's analysis, this LP may be a duplicate of LP 1815969 which depends on LP 1835268. I will retest this issue once the other two LPs are resolved.

Revision history for this message
Yi Wang (wangyi4) wrote :

LP 1835268 has just been fixed. I will use tomorrow's daily build to re-verify this issue.

Revision history for this message
Yi Wang (wangyi4) wrote :

Re-verified this issue with the 20190721T013000Z build on a bare metal 2+1+1 configuration. Tried three times and did not reproduce the issue, so I think it has been resolved by https://review.opendev.org/#/c/670615/. Will set the bug to "Fix Released".

Changed in starlingx:
status: Incomplete → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As confirmed above by Yi Wang, this is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1835268

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This was verified in load 2019-08-12 21:00:17

tags: removed: stx.retestneeded