StarlingX

AIO-DX: Cable pull test on Mgt+cluster causes immediate reboot on standby controller

Bug #1847657 reported by Anujeyan Manokeran on 2019-10-10

This bug affects 1 person

Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Medium      Eric MacDonald

Bug Description

Brief Description
-----------------
A cable pull (MGT + cluster) test on AIO-DX causes the standby controller (controller-1) to reboot immediately, and SM on the active controller is disabled. This leaves both controllers out of service until the standby reboot completes.
The interfaces are not configured with VLANs.

2019-10-10T18:42:44.846 controller-0 kernel: info [ 2222.460593] tg3 0000:01:00.1 eno2: Link is down

Active controller sm-dump soon after the cable pull

-Service_Groups-----------------------------------------------------------------
oam-services disabled disabled
controller-services disabled disabled
cloud-services disabled disabled
patching-services disabled disabled
directory-services disabled disabled
web-services disabled disabled
storage-services disabled disabling
storage-monitoring-services disabled disabled
vim-services disabled disabled
--------------------------------------------------------------------------------

-Services-----------------------------------------------------------------------
oam-ip disabled disabled
management-ip disabled disabled
drbd-pg disabled disabled
drbd-rabbit disabled disabled
drbd-platform disabled disabled
pg-fs disabled disabled
rabbit-fs disabled disabled
nfs-mgmt disabled disabled
platform-fs disabled disabled
postgres disabled disabled
rabbit disabled disabled
platform-export-fs disabled disabled
platform-nfs-ip disabled disabled
sysinv-inv disabled disabled
sysinv-conductor disabled disabled
mtc-agent disabled disabled
hw-mon disabled disabled
dnsmasq disabled disabled
fm-mgr disabled disabled
keystone disabled disabled
open-ldap disabled disabled
snmp disabled disabled
lighttpd disabled disabled
horizon disabled disabled
patch-alarm-manager disabled disabled
mgr-restful-plugin disabled disabled
ceph-manager disabled disabled
vim disabled disabled
vim-api disabled disabled
vim-webserver disabled disabled
haproxy disabled disabled
pxeboot-ip disabled disabled
drbd-extension disabled disabled
extension-fs disabled disabled
extension-export-fs disabled disabled
etcd disabled disabled
drbd-etcd disabled disabled
etcd-fs disabled disabled
barbican-api disabled disabled
barbican-keystone-listener disabled disabled
barbican-worker disabled disabled
cluster-host-ip disabled disabled
docker-distribution disabled disabled
dockerdistribution-fs disabled disabled
drbd-dockerdistribution disabled disabled
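
The dump above can be scanned mechanically for entries whose two state columns disagree (here, storage-services shows disabled / disabling). The following is a minimal Python sketch, not part of SM: the function names parse_sm_dump and mismatched_entries are hypothetical, and it assumes each entry is a name followed by two whitespace-separated state columns, with section titles embedded in dashed rulers as shown above.

```python
from typing import Dict, List, Tuple

# Hypothetical helper, not part of SM: maps section title -> entry name -> (col1, col2).
Sections = Dict[str, Dict[str, Tuple[str, str]]]


def parse_sm_dump(text: str) -> Sections:
    """Parse sm-dump style text, assuming 'name state state' entry lines."""
    sections: Sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("-"):
            # Ruler line: '-Services-----' opens a section, '--------' closes it.
            title = line.strip("-")
            current = title or None
            if current is not None:
                sections.setdefault(current, {})
            continue
        parts = line.split()
        if current is not None and len(parts) >= 3:
            sections[current][parts[0]] = (parts[1], parts[2])
    return sections


def mismatched_entries(sections: Sections) -> List[str]:
    """Return entry names whose two state columns differ."""
    return [
        name
        for entries in sections.values()
        for name, (first, second) in entries.items()
        if first != second
    ]


# A few lines copied from the dump above.
SAMPLE = """\
-Service_Groups-----------------------------------------------------------------
oam-services disabled disabled
storage-services disabled disabling
--------------------------------------------------------------------------------

-Services-----------------------------------------------------------------------
oam-ip disabled disabled
management-ip disabled disabled
"""

if __name__ == "__main__":
    print(mismatched_entries(parse_sm_dump(SAMPLE)))
```

Run against the full dump, this kind of check would single out storage-services as the only entry still in transition at the time the dump was taken.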

Severity
--------
Major

Steps to Reproduce
------------------
1. Pull the cable on the active controller (c-0) Mgt and cluster networks. Timestamp: 2019-10-10T18:42:44.846
2. The standby controller rebooted immediately.

Expected Behavior
-----------------
No reboot on the standby controller. The active controller is supposed to swact.

Actual Behavior
---------------
As described above: the standby controller reboots immediately.

Reproducibility
---------------
Always reproducible

System Configuration
--------------------
AIO-DX system

Branch/Pull Time/Commit
-----------------------
BUILD_DATE= 2019-10-08_20-00-00

Last Pass
---------

Timestamp/Logs
--------------
2019-10-10T18:42:44.846

Test Activity
-------------
Regression test

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-10-11:

Marking as stx.3.0 / medium priority - system doesn't behave properly on a failure condition

tags:	added: stx.3.0 stx.ha
Changed in starlingx:
importance:	Undecided → Medium
status:	New → Triaged
assignee:	nobody → Bin Qian (bqian20)

Yang Liu (yliu12) on 2019-10-12

tags:	added: stx.retestneeded

Bin Qian (bqian20) wrote on 2019-10-15:

SM behaved correctly in disabling the active controller. The standby controller was reset because of a heartbeat failure. Eric may want to take a look at it from the hbs side.

Eric MacDonald (rocksolidmtce) on 2019-10-15

Changed in starlingx:
assignee:	Bin Qian (bqian20) → Eric MacDonald (rocksolidmtce)

Eric MacDonald (rocksolidmtce) wrote on 2019-10-15:

Mtce needs to delay host heartbeat loss failure declaration to give SM time to shut down mtce before that declaration.

Patch is implemented and testing of it is in progress.

Anujeyan Manokeran (anujeyan) wrote on 2019-10-15:

This LP was found while verifying LP https://bugs.launchpad.net/starlingx/+bug/1844717 without VLAN.

OpenStack Infra (hudson-openstack) wrote on 2019-10-15: Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/688790

Changed in starlingx:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2019-10-16: Fix merged to metal (master)

Reviewed: https://review.opendev.org/688790
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=675f49d5566e8e92e2ed713941bedff904227287
Submitter: Zuul
Branch: master

commit 675f49d5566e8e92e2ed713941bedff904227287
Author: Eric MacDonald <email address hidden>
Date: Tue Oct 15 15:16:22 2019 -0400

Add mtcAgent support for sm_node_unhealthy condition

    When heartbeat over both networks fail, mtcAgent
    provides a 5 second grace period for heartbeat to
    recover before failing the node.

    However, when heartbeat fails over only one of the
    networks (management or cluster) the mtcAgent does
    not honour that 5 second grace period ; a bug.

    When it comes to peer controller heartbeat failure
    handling, SM needs that 5 second grace period to handle
    swact before mtcAgent declares the peer controller as
    failed, resets the node and updates the database.

    This update implements a change that forces a 2 second
    wait time between each fast enable and fixes the fast
    enable threshold count to be the intended 3 retries.
    This ensures that at least 5 seconds, actually 6 in
    the case of single network heartbeat loss, passes
    before declaring the node as failed.

    In addition to that, a special condition is added to
    detect and stop work if the active controller is
    sm_node_unhealthy. We don't want mtcAgent to make
    any database updates while in this failure mode.
    This gives SM the time to handle the failure
    according to the system's controllers' high
    availability handling feature.

    Test Plan:

    PASS: Verify mtcAgent behavior on set and clear of
          SM node unhealthy state.
    PASS: Verify SM has at least 5 seconds to shut down
          mtcAgent when heartbeat to peer controller fails
          for one or both networks.
    PASS: Test real case scenario with link pull.
    PASS: Verify logging in presence of real failure condition.

    Change-Id: I8f8d6688040fe899aff6fc40aadda37894c2d5e9
    Closes-Bug: 1847657
    Signed-off-by: Eric MacDonald <email address hidden>
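
The timing described in the commit message can be checked with a little arithmetic. The sketch below is a toy Python model, not the mtcAgent code; the constant, class, and function names are hypothetical, and it assumes the delay before failure is simply the forced wait multiplied by the retry threshold, which matches the 6 seconds quoted for single-network heartbeat loss.

```python
from dataclasses import dataclass

# Values quoted in the commit message; the names are hypothetical.
FAST_ENABLE_WAIT_SECS = 2   # forced wait between fast-enable attempts
FAST_ENABLE_THRESHOLD = 3   # intended number of fast-enable retries
SM_GRACE_PERIOD_SECS = 5    # time SM needs to handle the swact


def seconds_before_failure(
    wait_secs: int = FAST_ENABLE_WAIT_SECS,
    threshold: int = FAST_ENABLE_THRESHOLD,
) -> int:
    """Assumed model: total delay = wait per attempt * number of attempts."""
    return wait_secs * threshold


@dataclass
class PeerState:
    """Toy view of what the agent knows about the peer controller."""
    fast_enable_attempts: int = 0
    active_sm_node_unhealthy: bool = False


def next_action(state: PeerState) -> str:
    """Decide the next step for a peer that lost heartbeat.

    'hold'  : active controller is sm_node_unhealthy, so stop work and
              make no database updates, leaving the failure to SM.
    'retry' : wait and attempt another fast enable.
    'fail'  : retry threshold reached, declare the node failed.
    """
    if state.active_sm_node_unhealthy:
        return "hold"
    if state.fast_enable_attempts < FAST_ENABLE_THRESHOLD:
        return "retry"
    return "fail"


if __name__ == "__main__":
    total = seconds_before_failure()
    print(f"delay before failure: {total}s (grace needed: {SM_GRACE_PERIOD_SECS}s)")
    print(next_action(PeerState(fast_enable_attempts=1)))
    print(next_action(PeerState(active_sm_node_unhealthy=True)))
```

Under these assumptions the delay exceeds the grace period by one second, and the unhealthy check takes priority over the retry count so that no failure is declared while SM is handling the condition.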

Changed in starlingx:
status:	In Progress → Fix Released

Anujeyan Manokeran (anujeyan) wrote on 2019-10-18:

Verified; LP can be closed.

tags:	removed: stx.retestneeded
