StarlingX

MNFA times out immediately with timeout value of 0

Bug #1858216 reported by Anujeyan Manokeran on 2020-01-03

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	High	Eric MacDonald

Bug Description

Brief Description
-----------------
During a cable pull test in a lab where the cluster and mgmt networks are configured on the same interface, pulling the cable on the active controller caused a multi-node failure. The standby controller (c-0) and the worker nodes rebooted immediately after the cable pull. The Multi Node Failure Avoidance (MNFA) timeout was set to 0 in this lab, yet the multi-node failure was not avoided as the parameter setting below requires.

**Below set parameter value

** cable pull time
2020-01-03T14:37:10.135 controller-1 kernel: info [19603.064914] i40e 0000:18:00.0 enp24s0f0: NIC Link is Down
2020-01-03T14:37:11.271 controller-1 kernel: info [19604.196161] i40e 0000:18:00.0 enp24s0f0: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
2020-01-03T14:37:11.365 controller-1 kernel: info [19604.292773] i40e 0000:18:00.0 enp24s0f0: NIC Link is Down
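
To read the link transitions out of kernel log lines like the ones above, a small parser is enough. This is only a sketch: the regular expression and the `link_events` helper are illustrative names written for this report's log format, not part of any StarlingX tooling.

```python
import re
from datetime import datetime

# The three kernel log lines quoted in this report.
LOG = """\
2020-01-03T14:37:10.135 controller-1 kernel: info [19603.064914] i40e 0000:18:00.0 enp24s0f0: NIC Link is Down
2020-01-03T14:37:11.271 controller-1 kernel: info [19604.196161] i40e 0000:18:00.0 enp24s0f0: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
2020-01-03T14:37:11.365 controller-1 kernel: info [19604.292773] i40e 0000:18:00.0 enp24s0f0: NIC Link is Down
"""

# timestamp, host, interface name, link state
LINK_RE = re.compile(r"^(\S+) (\S+) kernel: .* (\S+): NIC Link is (Up|Down)")


def link_events(text):
    """Return (timestamp, host, interface, state) for each link event line."""
    events = []
    for line in text.splitlines():
        match = LINK_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            events.append((stamp, match.group(2), match.group(3), match.group(4)))
    return events


if __name__ == "__main__":
    previous = None
    for stamp, host, iface, state in link_events(LOG):
        gap = "" if previous is None else f" (+{(stamp - previous).total_seconds():.3f}s)"
        print(f"{stamp.time()} {host} {iface} {state}{gap}")
        previous = stamp
```

For these lines the link reports Up 1.136 s after the first Down and goes Down again 0.094 s later, so the pull is visible as a brief flap followed by a sustained link loss.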

Severity
--------
Major

Steps to Reproduce
------------------

1. Have an AIO+ lab with the cluster and mgmt networks provisioned on the same network.
2. On the active controller, pull the cable that carries the cluster and mgmt networks (same VLAN).
3. Verify host states after the cable pull.

Expected Behavior
------------------
None of the nodes should reboot.

Actual Behavior
----------------
All the nodes rebooted on the cable pull. MNFA was not triggered.

Reproducibility
---------------
Tested once in this load

System Configuration
--------------------
AIO+ wolfpass 8-12

Branch/Pull Time/Commit
-----------------------
2020-01-02 20:04:12 -0500

Last Pass
---------
Last tested on load 2019-12-13 19:04:39; that run hit a different issue: https://bugs.launchpad.net/starlingx/+bug/1856614

Timestamp/Logs
--------------
2020-01-03T14:37:10.135

Test Activity
-------------
Regression

Anujeyan Manokeran (anujeyan) on 2020-01-03

description:

updated

Eric MacDonald (rocksolidmtce) on 2020-01-03

Changed in starlingx:
assignee:	nobody → Eric MacDonald (rocksolidmtce)

Anujeyan Manokeran (anujeyan) wrote on 2020-01-03:

collect logs (82.8 MiB, application/x-tar)

Eric MacDonald (rocksolidmtce) on 2020-01-03

summary:

- Cable pull test on active controller cause multi node failure lab with
- cluster and mgmt configured on a same interface
+ MNFA times out immediately with timeout value of 0

Eric MacDonald (rocksolidmtce) wrote on 2020-01-03:

A timer module change made in https://review.opendev.org/#/c/698311/ introduced this bug.

Fix is implemented and tested.

OpenStack Infra (hudson-openstack) wrote on 2020-01-03: Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/701057

Changed in starlingx:
status:	New → In Progress

Ghada Khalil (gkhalil) wrote on 2020-01-03:

Marking as stx.4.0 / medium priority - this is only an issue when the user specifies a timeout value of 0. Even though this issue exists in stx.3.0, this doesn't seem to be a likely scenario.

tags:	added: stx.metal
tags:	added: stx.3.0 stx.4.0
Changed in starlingx:
importance:	Undecided → High
tags:	removed: stx.3.0
Changed in starlingx:
importance:	High → Medium
importance:	Medium → High
tags:	added: stx.3.0

Ghada Khalil (gkhalil) wrote on 2020-01-03:

Correction: Marking as stx.3.0 & stx.4.0 gating / high priority - As per Eric MacDonald, the default value is set to zero, so users will hit this if they do a cable pull.

The fix will need to be cherry-picked into r/stx.3.0 after it merges in master.

OpenStack Infra (hudson-openstack) wrote on 2020-01-08: Fix merged to metal (master)

Reviewed: https://review.opendev.org/701057
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=e86d0b915896b74cb6956d93e99b5dd308271e80
Submitter: Zuul
Branch: master

commit e86d0b915896b74cb6956d93e99b5dd308271e80
Author: Eric MacDonald <email address hidden>
Date: Fri Jan 3 14:28:28 2020 -0500

Fix MNFA timer timeout condition check

    A timer module change made in https://review.opendev.org/#/c/698311
    introduced a change that makes all unstarted/stopped timers appear
    as expired/rung.

    A MNFA (Multi Node Failure Avoidance) timeout of zero represents no
    timeout and is implemented by not starting a timer for that condition.

    However, due to the recent change, that makes the MNFA timer expiry
    check succeed immediately causing MNFA to exit prematurely causing the
    issue reported by the Bug reference.

    The fix is to condition the timer expiry check with a non-zero MNFA
    timeout value.

    Test Plan:

    PASS: Verify MNFA handling with and without timeout.
    PASS: Verify 3 node MNFA handling due to node power cycle
    PASS: Verify 2 node MNFA handling and recovery due to cable pull

    Change-Id: I97363cd309f786b3d41288667d4378b91e4a0d23
    Closes-Bug: 1858216
    Signed-off-by: Eric MacDonald <email address hidden>
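
The failure mode and fix described in the commit message above can be illustrated with a toy model. This is a minimal sketch, not the mtce implementation: the `Timer` class, `enter_mnfa`, and the two `mnfa_should_exit_*` functions are hypothetical names, and the only behaviour taken from the commit message is that an unstarted timer reports as expired, that a timeout of zero means no timer is started, and that the fix gates the expiry check on a non-zero timeout.

```python
class Timer:
    """Toy timer (hypothetical; stands in for the real timer module)."""

    def __init__(self):
        self.started = False
        self.deadline = None

    def start(self, now, seconds):
        self.started = True
        self.deadline = now + seconds

    def expired(self, now):
        # Behaviour after the timer module change, per the commit message:
        # a timer that was never started (or was stopped) appears expired.
        if not self.started:
            return True
        return now >= self.deadline


def enter_mnfa(mnfa_timeout, now):
    """Enter MNFA; a timeout of zero means 'no timeout', so no timer runs."""
    timer = Timer()
    if mnfa_timeout != 0:
        timer.start(now, mnfa_timeout)
    return timer


def mnfa_should_exit_buggy(timer, mnfa_timeout, now):
    # Unconditional expiry check: with timeout 0 the unstarted timer
    # reads as expired, so MNFA exits immediately.
    return timer.expired(now)


def mnfa_should_exit_fixed(timer, mnfa_timeout, now):
    # Only consult the timer when a non-zero timeout was configured.
    return mnfa_timeout != 0 and timer.expired(now)


if __name__ == "__main__":
    timer = enter_mnfa(mnfa_timeout=0, now=100.0)
    print("buggy exit:", mnfa_should_exit_buggy(timer, 0, 100.0))
    print("fixed exit:", mnfa_should_exit_fixed(timer, 0, 100.0))
```

In this model the zero-timeout case exits at once under the unconditional check and never exits on timer expiry under the gated check, while a non-zero timeout behaves the same in both.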

Changed in starlingx:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2020-01-10: Fix proposed to metal (r/stx.3.0)

Fix proposed to branch: r/stx.3.0
Review: https://review.opendev.org/702033

Yang Liu (yliu12) on 2020-01-12

tags:

added: stx.retestneeded

OpenStack Infra (hudson-openstack) wrote on 2020-01-15: Fix merged to metal (r/stx.3.0)

Reviewed: https://review.opendev.org/702033
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=76addb37b281fccf1e39bbebcd6bec88bfb4c362
Submitter: Zuul
Branch: r/stx.3.0

commit 76addb37b281fccf1e39bbebcd6bec88bfb4c362
Author: Eric MacDonald <email address hidden>
Date: Fri Jan 3 14:28:28 2020 -0500

Fix MNFA timer timeout condition check

    A timer module change made in https://review.opendev.org/#/c/698311
    introduced a change that makes all unstarted/stopped timers appear
    as expired/rung.

    A MNFA (Multi Node Failure Avoidance) timeout of zero represents no
    timeout and is implemented by not starting a timer for that condition.

    However, due to the recent change, that makes the MNFA timer expiry
    check succeed immediately causing MNFA to exit prematurely causing the
    issue reported by the Bug reference.

    The fix is to condition the timer expiry check with a non-zero MNFA
    timeout value.

    Test Plan:

    PASS: Verify MNFA handling with and without timeout.
    PASS: Verify 3 node MNFA handling due to node power cycle
    PASS: Verify 2 node MNFA handling and recovery due to cable pull

    Change-Id: I97363cd309f786b3d41288667d4378b91e4a0d23
    Closes-Bug: 1858216
    Signed-off-by: Eric MacDonald <email address hidden>
    (cherry picked from commit e86d0b915896b74cb6956d93e99b5dd308271e80)

Ghada Khalil (gkhalil) on 2020-01-16

tags:

added: in-r-stx30

Anujeyan Manokeran (anujeyan) wrote on 2020-01-16:

#10

verified "2020-01-15 00:15:16"

Yang Liu (yliu12) on 2020-01-27

tags:

removed: stx.retestneeded

OpenStack Infra (hudson-openstack) wrote on 2020-02-04: Fix proposed to metal (f/centos8)

#11

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/705848

OpenStack Infra (hudson-openstack) wrote on 2020-02-05: Fix merged to metal (f/centos8)

#12

Reviewed:  https://review.opendev.org/705848
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=3e2ecfca815619a0aea2a7b8169ad297c09aaec8
Submitter: Zuul
Branch:    f/centos8

commit 1f0706ee30f44c666714ea16739c346326074fcb
Author: Don Penney <don.penney@windriver.com>
Date:   Mon Feb 3 12:48:03 2020 -0500

Drop clone.py references from boot menu comment blocks
    
    As part of recent code cleanup activities, the controllerconfig
    clone.py module was dropped. This update removes references to this
    file from the comment blocks of the boot menu files grub.cfg and
    centos.syslinux.cfg.
    
    Change-Id: If2260c74f58a61481cf3ed7c7fbbe5ebb5292b00
    Partial-Bug: 1834218
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit 91f488af02e2bc27007690450a3ee63826e67c86
Author: Jim Somerville <Jim.Somerville@windriver.com>
Date:   Mon Jan 27 17:18:47 2020 -0500

Security: Handle nospectre_v1 in the bootargs
    
    Most of the v1 mitigation is baked into the kernel and not
    optional.  The swapgs barriers are, however, optional.
    They have a negative performance impact so we disable them
    by using the nospectre_v1 kernel bootarg.
    
    Partial-Bug: 1860193
    Depends-On: https://review.opendev.org/#/c/704406
    Change-Id: Id11232fe113293ed04b2802aaf038e2eedf9d797
    Signed-off-by: Jim Somerville <Jim.Somerville@windriver.com>

commit eff0d663776587a6ccca6c30a0433baf8663aa09
Author: Angie Wang <angie.wang@windriver.com>
Date:   Mon Jan 20 15:09:10 2020 -0500

Remove unused post_clone_iso_ks.cfg
    
    Change-Id: I3be9384b94473cc6e0f6efbc1e404c5878856ffc
    Partial-Bug: 1834218
    Depends-On: https://review.opendev.org/#/c/703516/
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit 4609dacc5f2ffa54012eaffc60196488a6c589b6
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Mon Jan 20 11:34:07 2020 -0500

Fix rvmc container build
    
    The current rvmc container build has a conflict over the
    python3-pip install.
    
    The CentOS repo has a prior version of python3-pip that
    requires python3 3.6 but the CENGN repo requires 3.7 due
    to the Titanium base image.
    
    The issue is resolved by updating this dockerfile with an
    older version of python3-pip.
    
    Also, Since python3-pip provides pip3 the explicit pip3
    install is no longer required.
    
    Change-Id: Ic0cf0d070eaa8f437a043ac52dfa7ecf0e42f957
    Story: 2006980
    Task: 37775
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit a9a2ca64bc409dd74fc24639e2ece334324c4b8d
Author: Saul Wold <sgw@linux.intel.com>
Date:   Wed Jan 15 19:22:09 2020 -0800

rvmc: remove un-used build data
    
    The error below is reported in the build logs because rvmc is not
    set up to be built as an RPM package.
    
    ERROR: build_dir (425): Neither srpm_path nor .spec file not found
    in '/localdisk/designer/swold/stx/cgcs-root/stx/metal/tools/rvmc/centos'
    
    Closes-Bug: 1859893
    
    Change-Id: I9b2788bb227afbdf49e2faa5f05628331719233e
    Signed-off-by: Saul Wold <sgw@linux.intel.com>

commit d59ba5fdc21a89581cdc4e3fad038645b9d20754
Author: Al Bailey <Al.Bailey@windriver.com>
Date:   Tue Jan 7 08:41:36 2020 -0600

Remove unused inventory and python-inventoryclient
    
    Neither of these components were maintained or used, and so are
    being abandoned.
    
     - inventory was an old fork of the sysinv code
     - python-inventoryclient was an old fork of the cgts-client code
    
    The devstack commands, although currently disabled, have also
    been updated.
    
    Change-Id: If6a109edbc70eb1bd92012f4261dec4a2c58fbd1
    Story: 2004515
    Task: 37538
    Depends-On: https://review.opendev.org/701591
    Signed-off-by: Al Bailey <Al.Bailey@windriver.com>

commit e86d0b915896b74cb6956d93e99b5dd308271e80
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Fri Jan 3 14:28:28 2020 -0500

Fix MNFA timer timeout condition check
    
    A timer module change made in https://review.opendev.org/#/c/698311
    introduced a change that makes all unstarted/stopped timers appear
    as expired/rung.
    
    A MNFA (Multi Node Failure Avoidance) timeout of zero represents no
    timeout and is implemented by not starting a timer for that condition.
    
    However, due to the recent change, that makes the MNFA timer expiry
    check succeed immediately causing MNFA to exit prematurely causing the
    issue reported by the Bug reference.
    
    The fix is to condition the timer expiry check with a non-zero MNFA
    timeout value.
    
    Test Plan:
    
    PASS: Verify MNFA handling with and without timeout.
    PASS: Verify 3 node MNFA handling due to node power cycle
    PASS: Verify 2 node MNFA handling and recovery due to cable pull
    
    Change-Id: I97363cd309f786b3d41288667d4378b91e4a0d23
    Closes-Bug: 1858216
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit e68db45a2e12fb265adc8e39cec943fb4d7d8032
Author: Al Bailey <Al.Bailey@windriver.com>
Date:   Mon Dec 23 10:49:58 2019 -0600

Add pylint checks for python files in metal
    
    Created a pylint.rc file for running pylint
    Added a pylint task to zuul
    
    Targets the following python files:
     - redfish docker code
     - hwmond_notify
    Other python components in metal are not being included
    because they are being removed in later commits.
    
    Story: 2004515
    Task: 37956
    Change-Id: I782672c366e56d1f1597d40f5754444b2fa76b9e
    Signed-off-by: Al Bailey <Al.Bailey@windriver.com>

commit 9bf231a2866c0ff737064755d0106198d4df7d7d
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Fri Jan 3 09:34:37 2020 -0500

Fix BMC access loss handling
    
    Recent refactoring of the BMC handler FSM introduced a code change that
    prevents the BMC Access alarm from being raised after initial BMC
    accessibility was established and is then lost.
    
    This update ensures BMC access alarm management is working properly.
    
    This update also implements ping failure debounce so that a single ping
    failure does not trigger full reconnection handling. Instead that now
    requires 3 ping failures in a row. This has the effect of adding a minute
    to ping failure action handling before the usual 2 minute BMC access failure
    alarm is raised. ping failure logging is reduced/improved.
    
    Test Plan: for both hwmond and mtcAgent
    
    PASS: Verify BMC access alarm due to bad provisioning (un, pw, ip, type)
    PASS: Verify BMC ping failure debounce handling, recovery and logging
    PASS: Verify BMC ping persistent failure handling
    PASS: Verify BMC ping periodic miss handling
    PASS: Verify BMC ping and access failure recovery timing
    PASS: Verify BMC ping failure and recovery handling over BMC link pull/plug
    PASS: Verify BMC sensor monitoring stops/resumes over ping failure/recovery
    
    Regression:
    
    PASS: Verify IPv6 System Install using provisioned BMCs (wp8-12)
    PASS: Verify BMC power-off request handling with BMC ping failing & recovering
    PASS: Verify BMC power-on request handling with BMC ping failing & recovering
    PASS: Verify BMC reset request handling with BMC ping failing & recovering
    PASS: Verify BMC sensor group read failure handling & recovery
    PASS: Verify sensor monitoring after ping failure handling & recovery
    
    Change-Id: I74870816930ef6cdb11f987424ffed300ff8affe
    Closes-Bug: 1858110
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
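
The ping failure debounce described in the commit message above (reconnection handling only after 3 ping failures in a row) can be sketched as a consecutive-failure counter. This is an illustration under assumptions: `PingDebounce` and its `record` method are hypothetical names, and resetting the counter on any successful ping is an assumed reading of "in a row", not something taken from the hwmond or mtcAgent sources.

```python
class PingDebounce:
    """Toy debounce: act only after `threshold` consecutive ping failures."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, ping_ok):
        """Record one ping result; return True when reconnection
        handling should be triggered."""
        if ping_ok:
            # Assumption: any success clears the failure streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


if __name__ == "__main__":
    debounce = PingDebounce()
    for result in [False, False, True, False, False, False]:
        print(result, debounce.record(result))
```

A single missed ping, or two misses followed by a success, leaves the streak below the threshold, so only a sustained failure triggers the heavier reconnection path.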

commit 09b95bf651c065e9fffae3255ecf0d0e52a61249
Author: Don Penney <don.penney@windriver.com>
Date:   Thu Jan 2 17:40:09 2020 -0500

Update kickstarts to generate DNF repo config files
    
    As the patching framework is updated to use DNF instead of the smart
    package manager, the kickstarts are updated to generate the initial
    DNF repo config files, rather than configure the smartpm channels.
    
    Depends-On: https://review.opendev.org/700961
    Change-Id: Ic625aa4646b45719c9527159aa46f157a4d2cff0
    Story: 2006227
    Task: 37935
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit 8959d8258ddd6be744c66cf68a6fe786d5b56c06
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Sun Dec 8 16:35:51 2019 -0500

Utility to install a server via Redfish
    
    This update introduces the implementation for a docker container named
    'rvmc', standing for Redfish Virtual Media Controller, which executes a
    python file that imports the open source redfish-python-library used to
    establish a secure Redfish communication session with a Redfish
    supported Board Management Controller to insert a URL based ISO image
    into its Virtual Media CD/DVD device so that the server boots and
    installs that image on its next reboot.
    
    This container supports single and multi target configuration files with
    IPV4 and IPV6 BMC addressing.
    
    Change-Id: I4d555046800c8d193686b9ef3a2b1e61c13d4ff8
    Depends-On: https://review.opendev.org/#/c/700434/
    Depends-On: https://review.opendev.org/#/c/700080/
    Story: 2006980
    Task: 37775
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 6ccd0f4a4322b62006a501f81957b7f64a034633
Author: marvin <weifei.yu@intel.com>
Date:   Tue Nov 19 15:26:03 2019 +0800

Monitor the datanetwork for non-OpenStack work node
    
    Update the lmon to support datanetwork interface monitoring
    and use collectd to control the alarm information. Now lmon
    will obtain the list of interfaces from /etc/lmon/lmon.conf
    which can be generated by puppet.
    
    Change-Id: Ice72eda03d1bbdee6c644b1ed7ab878c942eb85c
    Story: #2002948
    Task: #37326
    Signed-off-by: marvin <weifei.yu@intel.com>

tags:

added: in-f-centos8
