MNFA times out immediately with timeout value of 0

Bug #1858216 reported by Anujeyan Manokeran
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
StarlingX | Fix Released | High | Eric MacDonald |

Bug Description

Brief Description
-----------------
A cable pull test on a lab with cluster and mgmt configured on the same interface caused a multi-node failure. The standby controller (controller-0) and the worker nodes rebooted immediately after the cable was pulled on the active controller. The Multi Node Failure Avoidance (MNFA) timeout was set to 0 in this lab. The multi-node failure was not avoided, despite the parameter values shown below.

** Parameter values set in this lab

| 5c9b64bf-769c-4949-96df-d21fda360bf7 | platform | maintenance | mnfa_threshold | 2 | None | None |
| 2d6feb56-b664-4d61-9468-8fba97c88a79 | platform | maintenance | mnfa_timeout | 0 | None | None |
| 37d3fc48-54cf-408e-b497-796a9c25cf4b | platform | maintenance | worker_boot_timeout | 720 | None | None |
| c2ab1f

** Cable pull time
2020-01-03T14:37:10.135 controller-1 kernel: info [19603.064914] i40e 0000:18:00.0 enp24s0f0: NIC Link is Down
2020-01-03T14:37:11.271 controller-1 kernel: info [19604.196161] i40e 0000:18:00.0 enp24s0f0: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
2020-01-03T14:37:11.365 controller-1 kernel: info [19604.292773] i40e 0000:18:00.0 enp24s0f0: NIC Link is Down

** Interface configuration for controller-1
$ system host-if-list 1
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-------------------------+---------------------------+
| uuid                                 | name     | class    | type     | vlan id | ports           | uses i/f      | used by i/f             | attributes                |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-------------------------+---------------------------+
| 117e4f43-7f4c-4f1d-8edc-02cfb6033d79 | data0    | data     | ethernet | None    | [u'enp175s0f0'] | []            | []                      | MTU=1500,accelerated=True |
| 5c6daf5d-8f78-4143-9193-0aaec5ca7924 | oam0     | platform | ethernet | None    | [u'eno1']       | []            | []                      | MTU=1500                  |
| 8c1f6b74-38a1-4b3d-ae0d-b47c2ec72390 | cluster0 | platform | vlan     | 187     | []              | [u'pxeboot0'] | []                      | MTU=1500                  |
| a016d64b-a825-4c09-b54f-569fa551f4ce | pxeboot0 | platform | ethernet | None    | [u'enp24s0f0']  | []            | [u'mgmt0', u'cluster0'] | MTU=9216                  |
| a3d1f954-a9e2-482b-ad80-5d2015ad62e3 | mgmt0    | platform | vlan     | 186     | []              | [u'pxeboot0'] | []                      | MTU=1500                  |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-------------------------+---------------------------+
** Host states after the cable pull (all nodes other than controller-1 rebooted)
$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | disabled    | intest       |
| 2  | compute-0    | worker      | unlocked       | disabled    | intest       |
| 3  | compute-1    | worker      | unlocked       | disabled    | intest       |
| 4  | compute-2    | worker      | unlocked       | disabled    | intest       |
| 5  | controller-1 | controller  | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+

Severity
--------
Major

Steps to Reproduce
------------------

1. Use an AIO+ lab with cluster and mgmt provisioned on the same interface.
2. On the active controller, pull the cable of the interface that carries both the cluster and mgmt VLANs.
3. Verify the host states after the cable pull.

Expected Behavior
------------------
None of the nodes should reboot.

Actual Behavior
----------------
All the nodes rebooted on the cable pull. MNFA was not triggered.

Reproducibility
---------------
Tested once in this load

System Configuration
--------------------
AIO+ wolfpass 8-12

Branch/Pull Time/Commit
-----------------------
2020-01-02 20:04:12 -0500

Last Pass
---------
Last tested on load 2019-12-13 19:04:39; that run hit a different issue: https://bugs.launchpad.net/starlingx/+bug/1856614

Timestamp/Logs
--------------
2020-01-03T14:37:10.135

Test Activity
-------------
Regression

description: updated
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Anujeyan Manokeran (anujeyan) wrote :
summary: - Cable pull test on active controller cause multi node failure lab with
- cluster and mgmt configured on a same interface
+ MNFA times out immediately with timeout value of 0
Eric MacDonald (rocksolidmtce) wrote :

A timer module change made in https://review.opendev.org/#/c/698311/ introduced this bug.

Fix is implemented and tested.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/701057

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil) wrote :

Marking as stx.4.0 / medium priority - this is only an issue when the user specifies a timeout value of 0. Even though this issue exists in stx.3.0, this doesn't seem to be a likely scenario.

tags: added: stx.metal
tags: added: stx.3.0 stx.4.0
Changed in starlingx:
importance: Undecided → High
tags: removed: stx.3.0
Changed in starlingx:
importance: High → Medium
importance: Medium → High
tags: added: stx.3.0
Ghada Khalil (gkhalil) wrote :

Correction: Marking as stx.3.0 & stx.4.0 gating / high priority - As per Eric MacDonald, the default value is set to zero, so users will hit this if they do a cable pull.

The fix will need to be cherry-picked to r/stx.3.0 after it merges in master.

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/701057
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=e86d0b915896b74cb6956d93e99b5dd308271e80
Submitter: Zuul
Branch: master

commit e86d0b915896b74cb6956d93e99b5dd308271e80
Author: Eric MacDonald <email address hidden>
Date: Fri Jan 3 14:28:28 2020 -0500

    Fix MNFA timer timeout condition check

    A timer module change made in https://review.opendev.org/#/c/698311
    introduced a change that makes all unstarted/stopped timers appear
    as expired/rung.

    A MNFA (Multi Node Failure Avoidance) timeout of zero represents no
    timeout and is implemented by not starting a timer for that condition.

    However, due to the recent change, that makes the MNFA timer expiry
    check succeed immediately causing MNFA to exit prematurely causing the
    issue reported by the Bug reference.

    The fix is to condition the timer expiry check with a non-zero MNFA
    timeout value.

    Test Plan:

    PASS: Verify MNFA handling with and without timeout.
    PASS: Verify 3 node MNFA handling due to node power cycle
    PASS: Verify 2 node MNFA handling and recovery due to cable pull

    Change-Id: I97363cd309f786b3d41288667d4378b91e4a0d23
    Closes-Bug: 1858216
    Signed-off-by: Eric MacDonald <email address hidden>
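
The condition described in the commit message above can be illustrated with a minimal Python sketch. This is not the mtce code: the Timer class, its fields, and the function names are hypothetical stand-ins, and the only behaviour taken from the commit message is that an unstarted/stopped timer reports as expired, that a timeout of zero means no timer is started, and that the fix guards the expiry check with a non-zero timeout.

```python
class Timer:
    """Hypothetical stand-in for a maintenance timer (not the real timer API)."""

    def __init__(self):
        self.started = False
        self.ring = False  # set when a running timer fires

    def start(self, seconds):
        self.started = True
        self.ring = False

    def fire(self):
        # Simulates a running timer reaching its interval.
        if self.started:
            self.ring = True
            self.started = False

    def expired(self):
        # Behaviour described in the commit message: after the timer module
        # change, an unstarted/stopped timer also appears expired/rung.
        return self.ring or not self.started


def mnfa_enter(timer, mnfa_timeout):
    # A timeout of zero means "no timeout", so no timer is started.
    if mnfa_timeout != 0:
        timer.start(mnfa_timeout)


def mnfa_timed_out_buggy(timer, mnfa_timeout):
    # Unconditional expiry check: true immediately when no timer was started.
    return timer.expired()


def mnfa_timed_out_fixed(timer, mnfa_timeout):
    # Fix: only consult the timer when a non-zero timeout is configured.
    return mnfa_timeout != 0 and timer.expired()
```

With mnfa_timeout set to 0, the unguarded check reports a timeout as soon as MNFA is entered, while the guarded check never does; with a non-zero timeout, both behave the same once the timer is running.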

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (r/stx.3.0)

Fix proposed to branch: r/stx.3.0
Review: https://review.opendev.org/702033

Yang Liu (yliu12)
tags: added: stx.retestneeded
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (r/stx.3.0)

Reviewed: https://review.opendev.org/702033
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=76addb37b281fccf1e39bbebcd6bec88bfb4c362
Submitter: Zuul
Branch: r/stx.3.0

commit 76addb37b281fccf1e39bbebcd6bec88bfb4c362
Author: Eric MacDonald <email address hidden>
Date: Fri Jan 3 14:28:28 2020 -0500

    Fix MNFA timer timeout condition check

    A timer module change made in https://review.opendev.org/#/c/698311
    introduced a change that makes all unstarted/stopped timers appear
    as expired/rung.

    A MNFA (Multi Node Failure Avoidance) timeout of zero represents no
    timeout and is implemented by not starting a timer for that condition.

    However, due to the recent change, that makes the MNFA timer expiry
    check succeed immediately causing MNFA to exit prematurely causing the
    issue reported by the Bug reference.

    The fix is to condition the timer expiry check with a non-zero MNFA
    timeout value.

    Test Plan:

    PASS: Verify MNFA handling with and without timeout.
    PASS: Verify 3 node MNFA handling due to node power cycle
    PASS: Verify 2 node MNFA handling and recovery due to cable pull

    Change-Id: I97363cd309f786b3d41288667d4378b91e4a0d23
    Closes-Bug: 1858216
    Signed-off-by: Eric MacDonald <email address hidden>
    (cherry picked from commit e86d0b915896b74cb6956d93e99b5dd308271e80)

Ghada Khalil (gkhalil)
tags: added: in-r-stx30
Anujeyan Manokeran (anujeyan) wrote :

verified "2020-01-15 00:15:16"

Yang Liu (yliu12)
tags: removed: stx.retestneeded
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/705848

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (f/centos8)

Reviewed: https://review.opendev.org/705848
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=3e2ecfca815619a0aea2a7b8169ad297c09aaec8
Submitter: Zuul
Branch: f/centos8

commit 1f0706ee30f44c666714ea16739c346326074fcb
Author: Don Penney <email address hidden>
Date: Mon Feb 3 12:48:03 2020 -0500

    Drop clone.py references from boot menu comment blocks

    As part of recent code cleanup activities, the controllerconfig
    clone.py module was dropped. This update removes references to this
    file from the comment blocks of the boot menu files grub.cfg and
    centos.syslinux.cfg.

    Change-Id: If2260c74f58a61481cf3ed7c7fbbe5ebb5292b00
    Partial-Bug: 1834218
    Signed-off-by: Don Penney <email address hidden>

commit 91f488af02e2bc27007690450a3ee63826e67c86
Author: Jim Somerville <email address hidden>
Date: Mon Jan 27 17:18:47 2020 -0500

    Security: Handle nospectre_v1 in the bootargs

    Most of the v1 mitigation is baked into the kernel and not
    optional. The swapgs barriers are, however, optional.
    They have a negative performance impact so we disable them
    by using the nospectre_v1 kernel bootarg.

    Partial-Bug: 1860193
    Depends-On: https://review.opendev.org/#/c/704406
    Change-Id: Id11232fe113293ed04b2802aaf038e2eedf9d797
    Signed-off-by: Jim Somerville <email address hidden>

commit eff0d663776587a6ccca6c30a0433baf8663aa09
Author: Angie Wang <email address hidden>
Date: Mon Jan 20 15:09:10 2020 -0500

    Remove unused post_clone_iso_ks.cfg

    Change-Id: I3be9384b94473cc6e0f6efbc1e404c5878856ffc
    Partial-Bug: 1834218
    Depends-On: https://review.opendev.org/#/c/703516/
    Signed-off-by: Angie Wang <email address hidden>

commit 4609dacc5f2ffa54012eaffc60196488a6c589b6
Author: Eric MacDonald <email address hidden>
Date: Mon Jan 20 11:34:07 2020 -0500

    Fix rvmc container build

    The current rvmc container build has a conflict over the
    python3-pip install.

    The CentOS repo has a prior version of python3-pip that
    requires python3 3.6 but the CENGN repo requires 3.7 due
    to the Titanium base image.

    The issue is resolved by updating this dockerfile with an
    older version of python3-pip.

    Also, Since python3-pip provides pip3 the explicit pip3
    install is no longer required.

    Change-Id: Ic0cf0d070eaa8f437a043ac52dfa7ecf0e42f957
    Story: 2006980
    Task: 37775
    Signed-off-by: Eric MacDonald <email address hidden>

commit a9a2ca64bc409dd74fc24639e2ece334324c4b8d
Author: Saul Wold <email address hidden>
Date: Wed Jan 15 19:22:09 2020 -0800

    rvmc: remove un-used build data

    The error below reported in the build logs because rvmc is not
    setup to be built as an RPM package.

    ERROR: build_dir (425): Neither srpm_path nor .spec file not found
    in '/localdisk/designer/swold/stx/cgcs-root/stx/metal/tools/rvmc/centos'

    Closes-Bug: 1859893

    Change-Id: I9b2788bb227afbdf49e2faa5f05628331719233e
    Signed-off-by: Saul Wold <email address hidden>

commit d59ba5fdc21a89581cdc4e3fad038645b9d20754
Author: Al Baile...


tags: added: in-f-centos8