host-list shows incorrect state for controller after management and infra cables are pulled and put back

Bug #1815513 reported by Anujeyan Manokeran
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Yi Wang

Bug Description

After the management and infra cables were pulled on the active controller (controller-1), a swact occurred. The cables were put back within 30 seconds, but controller-1 remained in the offline state for 15 to 20 minutes and then got rebooted. While system host-list was showing controller-1 as offline, it was still possible to ping and ssh to controller-1 from controller-0.

2019-02-11T13:45:55.886 [8316.00249] controller-0 mtcAgent |-| nodeClass.cpp (2473) start_offline_handler : Info : controller-1 starting offline handler (unlocked-disabled-failed) (stage:0)
2019-02-11T13:45:55.886 [8316.00250] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1903) recovery_handler : Warn : controller-1 cannot issue Reset
2019-02-11T13:45:55.886 [8316.00251] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1904) recovery_handler : Warn : controller-1 ... board management not provisioned or accessible
2019-02-11T13:45:55.886 [8316.00252] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1925) recovery_handler : Info : controller-1 Graceful Recovery Wait (1200 secs) (uptime was 0)

system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 2 | controller-1 | controller | unlocked | disabled | offline |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | enabled | available |
| 8 | compute-3 | worker | unlocked | enabled | available |
| 9 | compute-4 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

system service-parameter-list
+--------------------------------------+----------+-------------+-----------------------------+-------+-------------+----------+
| uuid | service | section | name | value | personality | resource |
+--------------------------------------+----------+-------------+-----------------------------+-------+-------------+----------+
| ba7a62bc-a38f-40a9-bbd9-b4566b33b87d | identity | assignment | driver | sql | None | None |
| 71fc8993-3c5a-4f2b-95d8-129ffb9c6945 | horizon | auth | lockout_retries | 3 | None | None |
| e91fabbc-ec67-4000-8c34-2b95bd9c3eaf | horizon | auth | lockout_seconds | 300 | None | None |
| 1ad073ae-3cff-4ea7-a792-7cef6fced0e6 | swift | config | fs_size_mb | 25 | None | None |
| 778b7be0-41eb-484e-922a-ebe241f0e275 | http | config | http_port | 8080 | None | None |
| 8fce8fa0-8b45-4a55-839c-91d5368fc027 | http | config | https_port | 8443 | None | None |
| 5112802a-005d-414d-9222-c95709130081 | swift | config | service_enabled | false | None | None |
| 493f0d08-b88a-4bf7-abf6-117af08fc6c0 | identity | config | token_expiration | 3600 | None | None |
| c401d443-7694-4927-9525-84f49741e949 | aodh | database | alarm_history_time_to_live | 86400 | None | None |
| c083339b-fd3e-4000-920f-a724b20b392e | panko | database | event_time_to_live | 86400 | None | None |
| 456d240f-d251-4457-a16d-2ab60f01e372 | cinder | emc_vnx | enabled | false | None | None |
| 20569cc2-dd68-46a7-a6d3-176cd2443e8e | cinder | hpe3par | enabled | false | None | None |
| 6dcd8ec9-6df3-4055-a9e3-ca2b6c894aae | cinder | hpe3par10 | enabled | false | None | None |
| 55af06dc-db63-4109-a20f-44a84551f6b9 | cinder | hpe3par11 | enabled | false | None | None |
| fe880c3d-23cb-4b35-9848-c8f40aa86f5d | cinder | hpe3par12 | enabled | false | None | None |
| dd1e2864-7367-4fce-b03c-02c29f20e4bb | cinder | hpe3par2 | enabled | false | None | None |
| 3c5b0081-cb53-4ee9-8b7a-64b3f7ce80f0 | cinder | hpe3par3 | enabled | false | None | None |
| 539cfcc4-51fb-439a-9b5a-0e6f2a302924 | cinder | hpe3par4 | enabled | false | None | None |
| 29fca92a-e954-45b5-bdb4-1d290e8ddf49 | cinder | hpe3par5 | enabled | false | None | None |
| e911b5b9-9fdd-496e-9a69-f4df9a33b416 | cinder | hpe3par6 | enabled | false | None | None |
| 14c3fe2c-017c-4901-a568-d2b81278d9d2 | cinder | hpe3par7 | enabled | false | None | None |
| 3a15fc60-e54f-4c34-bead-3d26fad00e7d | cinder | hpe3par8 | enabled | false | None | None |
| 65711f82-3b09-4a03-83dc-469200227c5c | cinder | hpe3par9 | enabled | false | None | None |
| 1392c4a3-ea35-4e1c-af3b-86d6da51be71 | cinder | hpelefthand | enabled | false | None | None |
| e234748c-0c00-4cb6-b937-fda03694de09 | identity | identity | driver | sql | None | None |
| c68b8e30-8b28-4497-9b98-a15eada95ed7 | platform | maintenance | controller_boot_timeout | 1200 | None | None |
| dfc1b687-0dd8-4302-a50f-c5468956679d | platform | maintenance | heartbeat_degrade_threshold | 6 | None | None |
| 98f371fc-433e-48c8-b39b-7d55894c0f46 | platform | maintenance | heartbeat_failure_action | fail | None | None |
| f1d330df-bfe7-426c-8f99-1f5711886431 | platform | maintenance | heartbeat_failure_threshold | 10 | None | None |
| c3e4e458-f564-4d06-9ab1-8ea75a39aaf1 | platform | maintenance | heartbeat_period | 100 | None | None |
| bba069d0-f3f7-4036-a490-c49752a2119e | platform | maintenance | mnfa_threshold | 2 | None | None |
| 07d05bf3-352e-47c4-9036-c3f7fbbc5ce5 | platform | maintenance | mnfa_timeout | 0 | None | None |
| b9e1be7c-536e-4d2d-9de8-16e247439e36 | platform | maintenance | worker_boot_timeout | 720 | None | None |
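
For reference, the maintenance parameters above drive the behavior seen in the mtcAgent log: heartbeat_period (100, in milliseconds) times heartbeat_failure_threshold (10) gives roughly a one-second heartbeat failure detection window, and controller_boot_timeout (1200 s) matches the "Graceful Recovery Wait (1200 secs)" log entry. A minimal sketch for inspecting and tuning these values, assuming the standard StarlingX service-parameter CLI and admin credentials at /etc/platform/openrc (the threshold value shown is only an illustration, not a recommendation):

source /etc/platform/openrc
# show the platform maintenance parameters referenced above
system service-parameter-list | grep maintenance
# illustrative change: widen the heartbeat failure window, then apply it
system service-parameter-modify platform maintenance heartbeat_failure_threshold=20
system service-parameter-apply platform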

Severity
--------
Major

Steps to Reproduce
------------------
1. Pull the management cable on the active controller (controller-1), which has a VLAN-shared management and infra network.
2. Verify the swact to controller-0.
3. Verify the controller-1 host state in system host-list: as described above, it was reported as offline even though it could still be pinged and reached over ssh (see the verification sketch below).
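
A minimal verification sketch for step 3, run from controller-0. It assumes admin credentials can be loaded from /etc/platform/openrc and that controller-1 resolves over the management network; the grep filter is only a convenience for reading the host-list output.

source /etc/platform/openrc
# reported state: expect "offline" in this row even though the node is reachable
system host-list | grep controller-1
# reachability checks over the management network: both should succeed
ping -c 3 controller-1
ssh controller-1 hostname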

Expected Behavior
------------------
The controller reboots after the cable is put back and returns to the correct state.

Actual Behavior
----------------
As per the description, controller-1 is shown in the wrong state in system host-list.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
storage system

Branch/Pull Time/Commit
-----------------------
StarlingX_Upstream_build release branch build as of 2019-02-10_20-18-00

Timestamp/Logs
--------------
Feb 11 13:46:41 UTC 2019

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; requires further investigation

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.05 stx.metal
Ken Young (kenyis)
Changed in starlingx:
assignee: Eric MacDonald (rocksolidmtce) → Cindy Xie (xxie1)
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Yi Wang (wangyi4) wrote :

With the 4.17 codebase, duplex configuration, virtual environment, I did not reproduce this issue.

Detailed steps:
1. Built an ISO with the 4.17 codebase and deployed it with a duplex configuration in a virtual environment.

2. Swact from controller-0 to controller-1:
system host-swact controller-0

3. Unplug and plug controller-1's interface:
(1) sudo brctl delif virbr2 vnet5
After a while, "system host-list" on controller-1 shows that the availability of controller-0 is "failed"; then controller-0 reboots.
The availability of controller-0 becomes "offline".
(2) sudo brctl addif virbr2 vnet1 (less than 30 s between (1) and (2))
(3) sudo ifconfig vnet1 up
After controller-0 reboots, "system host-list" on controller-1 shows controller-0 as online.

4. We can't swact from controller-1 to controller-0; it shows controller-0 has config not applied and suggests a lock/unlock of controller-0.
(1) system host-lock -f controller-0
Controller-0 reboots; wait for controller-0 to come up.
(2) system host-unlock controller-0
Wait for controller-0 to come up again.

5. Swact from controller-1 to controller-0:
system host-swact controller-1

6. Unplug and plug controller-1's interface:
(1) sudo brctl delif virbr2 vnet5
(2) sudo brctl addif virbr2 vnet1
(3) sudo ifconfig vnet1 up
Controller-1 reboots. When it is up, "system host-list" on controller-0 shows controller-1 as online.

7. Performing similar operations on controller-0 (unplugging and plugging its interface when it is/isn't the active controller) gives the same result (see the unplug/replug sketch below).
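
A hedged automation sketch of the unplug/replug sequence above, run on the virtualization host (not on a controller). The bridge and tap names (virbr2, vnet5) are environment-specific assumptions; pick the tap attached to the controller's management interface, e.g. from "virsh domiflist <domain>".

BRIDGE=virbr2
TAP=vnet5
sudo brctl delif "$BRIDGE" "$TAP"   # "unplug" the management interface
sleep 20                            # keep it unplugged for less than 30 s
sudo brctl addif "$BRIDGE" "$TAP"   # "plug" it back in
sudo ifconfig "$TAP" up             # ensure the tap device is up again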

Revision history for this message
Yi Wang (wangyi4) wrote :

With the 4.17 codebase, duplex configuration, bare metal environment, I reproduced this issue.

I unplugged the management interface cable of controller-1 (active) for a while and plugged it back.
Controller-0 became the active controller. On controller-0, "system host-list" showed controller-1 as offline.
Controller-1 still seemed to be an active controller. On controller-1, "system host-list" showed controller-0 as offline.

The connection between controller-0 and controller-1 was okay; I could ssh from one controller to the other. But the status reported by "system host-list" was incorrect.

Locking controller-0 from controller-1, restarting controller-0, and then unlocking controller-0 from controller-1 made the status correct again (see the recovery sketch below).
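
A minimal sketch of that lock/unlock recovery, run from the healthy active controller (controller-1 here), assuming admin credentials at /etc/platform/openrc; the grep patterns are assumptions about the host-show output layout.

source /etc/platform/openrc
system host-lock -f controller-0      # force-lock the stuck controller; it reboots
# wait until the locked host reports online again
until system host-show controller-0 | grep -q 'availability.*online'; do sleep 30; done
system host-unlock controller-0       # unlock; the host reboots again and rejoins
until system host-show controller-0 | grep -q 'availability.*available'; do sleep 30; done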

Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Yi Wang (wangyi4)
Revision history for this message
Yi Wang (wangyi4) wrote :

@Anujeyan, as I commented before, with a duplex configuration I saw different behavior: after lock/unlock operations, the system recovered.

I made a multi-node deployment to check this issue again. I don't have as many machines as you; my configuration is 2 controllers + 1 storage + 1 compute. The codebase is from early May. Below are my steps.

1. Swact from controller-0 to controller-1; verify controller-1 becomes active.
2. Unplug controller-1's management cable, wait ~20 s, plug it back.
3. Verify controller-0 becomes active.
4. Check host status from the active controller (controller-0):
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 3 | controller-1 | controller | unlocked | enabled | degraded |
| 4 | storage-0 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

After ~20 minutes, the DRBD sync finished. Controller-1 became the active controller, and the other nodes restarted. After a while:

[wrsroot@controller-1 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | disabled | intest |
| 3 | controller-1 | controller | unlocked | enabled | available |
| 4 | storage-0 | storage | unlocked | disabled | intest |
| 5 | compute-0 | worker | unlocked | disabled | intest |
+----+--------------+-------------+----------------+-------------+--------------+

[wrsroot@controller-1 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | enabled | degraded |
| 4 | storage-0 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

After a while, controller-0 became the active controller.

[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked ...


Revision history for this message
Yi Wang (wangyi4) wrote :

@Anujeyan, I didn't see the same phenomenon as you did. Could you check whether you can reproduce this issue?

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Yi Wang (wangyi4) wrote :

@Anujeyan, I did one more round of testing since I didn't see the same phenomenon as what you reported. Below are my steps. I made some changes to step 2 to try to reproduce your result.

1. Swact from controller-0 to controller-1; verify controller-1 becomes active.
2. Unplug controller-1's management cable. Verify controller-0 becomes active. Wait until controller-1 becomes offline on controller-0, as shown below (in your steps, you only waited less than 30 s), then plug the cable back.

[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 3 | controller-1 | controller | unlocked | disabled | offline |
| 4 | storage-0 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

3. Controller-0's availability changed from "degraded" to "available" after a while:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | disabled | offline |
| 4 | storage-0 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

At that time, I could connect to controller-1 from controller-0, but the status shown was offline.

4. After several minutes, controller-1 rebooted and finally recovered. This is different from what you reported (staying offline even after rebooting).

[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | enabled | available |
| 4 | storage-0 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

Details can be found in the attached mtcAgent.log from controller-0.

So the whole behavior is expected. Could you check it again?

Revision history for this message
Yi Wang (wangyi4) wrote :

@Anujeyan, any updates from your side?

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

I will try when I get a similar lab. In the above test I see only one storage node; I would suggest having a similar configuration with 2 storage nodes.

Revision history for this message
Yi Wang (wangyi4) wrote :

@Anujeyan Thanks for your help! I don't have enough machines; that's why I chose the 2+1+1 configuration. Is the number of storage nodes related to this issue?

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

I am not sure whether the issue is related to the storage nodes, but this is just to make sure we test in the same configuration.
I was able to test in a similar configuration. When I pulled the controller-0 management cable for less than 30 seconds, as described above, controller-0 got rebooted and came up in the failed state shown below. Then, after a long wait of ~15 minutes, it got rebooted again to recover from the failed state. This was observed for cable pulls on both the active and the standby controller. However, if the cable is pulled for more than 30 seconds, the host recovers as available with a single reboot.

Below is the controller state after the first reboot, when the cable pull was within 30 seconds.
 system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | disabled | failed |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | enabled | available |
| 8 | compute-3 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-1 ~(keystone_admin)]$ system host-list

Revision history for this message
Yi Wang (wangyi4) wrote :

@Anujeyan, it seems we both see the same phenomenon when the cable is pulled for more than 30 s. That whole behavior is expected.

For the case of pulling the cable for less than 30 s, in my test we could use a lock/unlock operation to recover the system. Do you think that is acceptable behavior?

Revision history for this message
Yi Wang (wangyi4) wrote :

Per Eric's analysis, this LP may be a duplicate of LP 1815969 which depends on LP 1835268. I will retest this issue once the other two LPs are resolved.

Revision history for this message
Yi Wang (wangyi4) wrote :

LP 1835268 has just been fixed. I will use tomorrow's daily build to re-verify this issue.

Revision history for this message
Yi Wang (wangyi4) wrote :

Re-verified this issue with the 20190721T013000Z build on a bare metal 2+1+1 configuration. Tried three times and did not reproduce the issue, so I think it has been resolved by https://review.opendev.org/#/c/670615/. Will set the bug to "Fix Released".

Changed in starlingx:
status: Incomplete → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As confirmed above by Yi Wang, this is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1835268

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This was verified in load 2019-08-12 21:00:17

tags: removed: stx.retestneeded