mtcAgent is stopped after CentOS networking is upgraded

Bug #2041194 reported by Joshua Kraitberg
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Joshua Kraitberg

Bug Description

Brief Description
-----------------
After the networking runtime manifest is applied during an stx6 to stx8 upgrade, mtcAgent is stopped and the node cannot be unlocked.

Severity
--------
Critical

Steps to Reproduce
------------------
Perform an stx6 to stx8 upgrade.

Expected Behavior
-----------------
The unlock succeeds at the end of the upgrade playbook.

Actual Behavior
---------------
The unlock fails at the end of the upgrade playbook.

Reproducibility
---------------
100% on some systems

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
10-26-2023

Last Pass
---------
N/A

Timestamp/Logs
--------------
mtcAgent.log
2023-10-24T00:28:08.149 [19122.00147] controller-0 mtcAgent hbs nodeClass.cpp (4824) service_netlink_events : Warn : Management link vlan409 is down
2023-10-24T00:28:08.149 [19122.00148] controller-0 mtcAgent hbs nodeClass.cpp (4837) service_netlink_events : Warn : vlan409 is down (oper:down)
2023-10-24T00:28:15.740 [19122.00149] controller-0 mtcAgent hbs nodeClass.cpp (4824) service_netlink_events : Warn : Management link vlan409 is down
2023-10-24T00:28:15.740 [19122.00150] controller-0 mtcAgent hbs nodeClass.cpp (4837) service_netlink_events : Warn : vlan409 is down (oper:down)
2023-10-24T00:28:15.740 [19122.00151] controller-0 mtcAgent hbs nodeClass.cpp (4860) service_netlink_events : Warn : Management link vlan409 is up
2023-10-24T00:28:26.725 [19122.00152] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (3454) online_handler : Info : controller-0 mtcAlive lost ; going 'offline'
2023-10-24T00:28:26.725 fmAPI.cpp(489): Enqueue raise alarm request: UUID (9aeb1e0e-5330-4ee4-b969-7d6b958f6610) alarm id (200.022) instant id (host=controller-0.status=offline)
2023-10-24T00:28:26.725 [19122.00153] controller-0 mtcAgent inv mtcInvApi.cpp (1119) mtcInvApi_update_state : Info : controller-0 offline (seq:14)
2023-10-24T00:28:26.725 [19122.00154] controller-0 mtcAgent inv mtcInvApi.cpp (1115) mtcInvApi_update_state : Info : controller-0-compute offline (seq:15)
2023-10-24T00:28:26.725 [19122.00155] controller-0 mtcAgent vim mtcVimApi.cpp ( 255) mtcVimApi_state_change : Info : controller-0 sending 'host' state change to vim (offline)
2023-10-24T00:28:26.725 [19122.00156] controller-0 mtcAgent --- msgClass.cpp ( 737) write :Error : Failed to send with errno=101
2023-10-24T00:28:26.725 [19122.00157] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 788) send_mtc_cmd :Error : controller-0 Failed to send command (rc:-1)
2023-10-24T00:28:26.725 [19122.00158] controller-0 mtcAgent --- msgClass.cpp ( 737) write :Error : Failed to send with errno=101
2023-10-24T00:28:26.725 [19122.00159] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 788) send_mtc_cmd :Error : controller-0 Failed to send command (rc:-1)
2023-10-24T00:28:26.725 [19122.00160] controller-0 mtcAgent --- mtcHttpUtil.cpp (1238) getEvent :Swerr : controller-0 mtcInvApi_update_state seq:14 is not active ; removing from workQueue
2023-10-24T00:28:26.725 [19122.00161] controller-0 mtcAgent --- mtcWorkQueue.cpp ( 642) workQueue_del_cmd : Warn : controller-0 mtcInvApi_update_state seq:14 force removed from work queue
2023-10-24T00:28:26.725 [19122.00162] controller-0 mtcAgent --- mtcHttpUtil.cpp (1271) mtcHttpUtil_handler :Swerr : HTTP Event Lookup Failed for http base (0x562902c5fd40) <------
2023-10-24T00:28:26.729 fmAlarmUtils.cpp(623): Sending FM raise alarm request: alarm_id (200.022), entity_id (host=controller-0.status=offline)
2023-10-24T00:28:26.729 fmSocket.cpp(139): Socket Error: Failed to write to fd:(19), len:(4), rc:(-1), error:(Network is unreachable)
2023-10-24T00:28:26.729 fmAlarmUtils.cpp(623): Failed to send FM raise alarm request: alarm_id (200.022), entity_id (host=controller-0.status=offline)
2023-10-24T00:28:26.729 fmAPI.cpp(140): Failed to connect to FM Manager.
2023-10-24T00:28:26.732 [19122.00163] controller-0 mtcAgent --- mtcHttpUtil.cpp (1238) getEvent :Swerr : controller-0 mtcInvApi_update_state seq:15 is not active ; removing from workQueue
2023-10-24T00:28:26.732 [19122.00164] controller-0 mtcAgent --- mtcWorkQueue.cpp ( 642) workQueue_del_cmd : Warn : controller-0 mtcInvApi_update_state seq:15 force removed from work queue
2023-10-24T00:28:26.732 [19122.00165] controller-0 mtcAgent --- mtcHttpUtil.cpp (1271) mtcHttpUtil_handler :Swerr : HTTP Event Lookup Failed for http base (0x562902c5fd40) <------
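
The excerpt above shows a recognizable signature: the management link flaps, mtcAlive is lost, and sends fail with errno=101 (ENETUNREACH on Linux, which matches the "Network is unreachable" text in the FM socket error). As a rough aid for spotting the same pattern on another system, here is a minimal sketch. The function name `scan_mtc_log` is hypothetical, and the log path passed to it is an assumption that depends on the installation.

```shell
#!/bin/sh
# Hypothetical helper (not part of StarlingX): count the lines in an
# mtcAgent log that match the failure signature seen in this report.
# Usage: scan_mtc_log <path-to-mtcAgent.log>   (the path is an assumption)
scan_mtc_log() {
    log_file="$1"
    # errno=101 is ENETUNREACH on Linux ("Network is unreachable").
    # grep -c prints the number of matching lines and exits non-zero
    # when there are none.
    grep -c -E 'errno=101|Network is unreachable|mtcAlive lost' "$log_file"
}
```

A non-zero count only suggests that the same symptom occurred. It does not by itself confirm that mtcAgent is stopped.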

Test Activity
-------------
Feature Testing

Workaround
----------
Before unlocking the node, run:
sudo OCF_ROOT="/usr/lib/ocf" OCF_RESKEY_state="active" /usr/lib/ocf/resource.d/platform/mtcAgent reload
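
If the workaround needs to be applied only when mtcAgent is actually down, a guard such as the following sketch could be used. The function name `ensure_mtcagent` is hypothetical, and the `pgrep -x mtcAgent` check assumes the process name matches the one shown in the logs. The reload command itself is copied unchanged from the workaround above.

```shell
#!/bin/sh
# Hypothetical pre-unlock guard (an assumption, not the merged fix):
# apply the workaround only when no mtcAgent process is found.
ensure_mtcagent() {
    # Assumes the process is named exactly "mtcAgent".
    if pgrep -x mtcAgent >/dev/null 2>&1; then
        echo "mtcAgent is running"
        return 0
    fi
    echo "mtcAgent is not running; applying workaround"
    # Command copied verbatim from the workaround in this report.
    sudo OCF_ROOT="/usr/lib/ocf" OCF_RESKEY_state="active" \
        /usr/lib/ocf/resource.d/platform/mtcAgent reload
}
```

Calling `ensure_mtcagent` just before the unlock step keeps the reload from being issued against a healthy agent. Whether that distinction matters in practice is not stated in this report.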

Changed in starlingx:
assignee: nobody → Joshua Kraitberg (jkraitbe-wr)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/899401
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/7e148b5ccd3e602533470b44ad6e9bfff6703b04
Submitter: "Zuul (22348)"
Branch: master

commit 7e148b5ccd3e602533470b44ad6e9bfff6703b04
Author: Joshua Kraitberg <email address hidden>
Date: Wed Oct 25 15:41:09 2023 -0400

    Restart mtcAgent after upgrading CentOS networking

    During stx6 to stx8 upgrade, the CentOS networking is upgraded using
    a puppet runtime manifest. During the application of this manifest,
    there is the possibility of mtcAgent going down depending on the
    network configuration. To unlock the node at the end of the upgrade
    playbook mtcAgent is required.

    To avoid being unable to unlock mtcAgent is restarted after the
    networking runtime manifest is applied.

    TEST PLAN
    PASS: stx6 to stx8 AIO-SX upgrade on unaffected system
    PASS: stx6 to stx8 AIO-SX upgrade on unaffected system

    Closes-Bug: 2041194
    Signed-off-by: Joshua Kraitberg <email address hidden>
    Change-Id: I49f60bcf452fb2e1b5e1a497097f5310b5e40b17

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.update