mtcAgent segfaults on controller-0 initial unlock if lo interface is reset

Bug #1869785 reported by Ghada Khalil
Affects: StarlingX
Status: Invalid
Importance: Low
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
This is a follow-up on https://bugs.launchpad.net/starlingx/+bug/1868584

In the above LP, the code to configure the SR-IOV interfaces resulted in the full network manifest being re-applied. A side effect was that all platform interfaces could be brought down and back up. This, in turn, resulted in issues with the system maintenance code, including a segfault.

The SR-IOV configuration code has been updated to be more targeted, so the initial trigger has been addressed. This is a low-priority follow-up bug to look at the mtcAgent segfault and determine whether the code should be improved to address it.

Severity
--------
Minor -- the trigger for the segfault has already been addressed, so the segfault is not likely to be hit again

Steps to Reproduce
------------------
Originally the issue was triggered on the initial unlock of controller-0.
Given that the trigger has already been fixed, there are no steps to reproduce other than explicitly forcing this code path, e.g. as sketched below.
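A minimal way to force the code path is to bounce the lo interface by hand so that mtcAgent sees the same netlink down/up events. This is only a sketch; it assumes root access on controller-0, and the exact commands used originally are not recorded:

# bounce the loopback interface; mtcAgent should log the link transitions and recover
sudo ip link set dev lo down
sleep 5
sudo ip link set dev lo up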

Expected Behavior
------------------
no mtcAgent segfaults are seen when the lo interface is reset before the initial unlock of controller-0

Actual Behavior
----------------
mtcAgent segfaults are reported in the logs

Reproducibility
---------------
N/A -- trigger is removed

System Configuration
--------------------
One node system
Lab-name: SM-3, wcp-11

Branch/Pull Time/Commit
-----------------------
Load: 2020-03-22_16-04-38

Last Pass
---------
Load: 2020-03-22_04-10-00

Timestamp/Logs
--------------
Logs are attached to https://bugs.launchpad.net/starlingx/+bug/1868584

Key notes:
There appears to have been an mtcAgent segfault. The corresponding kern.log entry is:

2020-03-23T03:13:43.141 localhost kernel: info [ 1224.325011] mtcAgent[110471]: segfault at 29 ip 00007f96c28dc9cb sp 00007ffe25779a40 error 4 in libc-2.17.so[7f96c285c000+1c2000]
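
For reference, "segfault at 29" means the faulting address was 0x29, i.e. a read through a near-NULL pointer (error 4 = user-mode read of an unmapped page). The instruction pointer can be mapped to an offset inside libc by subtracting the mapping base reported on the same line: 0x7f96c28dc9cb - 0x7f96c285c000 = 0x809cb. With the matching glibc debuginfo installed, that offset can be resolved to a function name; the libc path below is the usual CentOS 7 location and is an assumption:

# resolve the faulting libc offset to a function name
addr2line -f -e /usr/lib64/libc-2.17.so 0x809cb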

Note the message a few seconds before the segfault:

2020-03-23T03:13:40.781 [110471.00126] controller-0 mtcAgent hbs nodeClass.cpp (4687) service_netlink_events : Warn : lo is down (oper:down)
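
The link-state transitions that service_netlink_events reports here can also be observed directly from a shell while the interface is being reset, e.g.:

# print link-state netlink events as they arrive (run alongside the interface reset)
ip monitor link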

It is not clear whether this is related to the crash, but the event is coming from apply_network_config.sh. As can be seen in user.log, lo is brought down shortly before the segfault and brought back up after it:

controller-0:~$ cat /var/log/user.log | grep ifcfg-lo
2020-03-23T03:13:40.000 localhost root: notice /usr/local/bin/apply_network_config.sh /var/run/network-scripts.puppet/ifcfg-lo:5 and /etc/sysconfig/network-scripts/ifcfg-lo:5 differ on attribute BOOTPROTO
2020-03-23T03:13:40.000 localhost root: notice /usr/local/bin/apply_network_config.sh ifcfg-lo:5 changed
2020-03-23T03:13:40.000 localhost root: notice /usr/local/bin/apply_network_config.sh Adding ifcfg-lo to upDown list
2020-03-23T03:13:40.000 localhost root: notice /usr/local/bin/apply_network_config.sh /var/run/network-scripts.puppet/ifcfg-lo and /etc/sysconfig/network-scripts/ifcfg-lo differ on attribute BOOTPROTO
2020-03-23T03:13:40.000 localhost root: notice /usr/local/bin/apply_network_config.sh ifcfg-lo changed
2020-03-23T03:13:40.000 localhost root: notice /usr/local/bin/apply_network_config.sh /var/run/network-scripts.puppet/ifcfg-lo:1 and /etc/sysconfig/network-scripts/ifcfg-lo:1 differ on attribute BOOTPROTO
2020-03-23T03:13:40.000 localhost root: notice /usr/local/bin/apply_network_config.sh ifcfg-lo:1 changed
2020-03-23T03:13:40.000 localhost root: notice /usr/local/bin/apply_network_config.sh Bringing /etc/sysconfig/network-scripts/ifcfg-lo down
2020-03-23T03:13:41.000 localhost root: notice /usr/local/bin/apply_network_config.sh copying network cfg /var/run/network-scripts.puppet/ifcfg-lo:5 to /etc/sysconfig/network-scripts/ifcfg-lo:5
2020-03-23T03:13:41.000 localhost root: notice /usr/local/bin/apply_network_config.sh copying network cfg /var/run/network-scripts.puppet/ifcfg-lo to /etc/sysconfig/network-scripts/ifcfg-lo
2020-03-23T03:13:41.000 localhost root: notice /usr/local/bin/apply_network_config.sh copying network cfg /var/run/network-scripts.puppet/ifcfg-lo:1 to /etc/sysconfig/network-scripts/ifcfg-lo:1
2020-03-23T03:13:46.000 localhost root: notice /usr/local/bin/apply_network_config.sh Bringing /var/run/network-scripts.puppet/ifcfg-lo up

Test Activity
-------------
installation

Tags: stx.metal
Ghada Khalil (gkhalil) wrote:

Low priority / not gating any stx release - the trigger for this issue has been addressed, so this will no longer be hit on initial controller-0 unlocks

Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
tags: added: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Eric MacDonald (rocksolidmtce) wrote:

This issue is not reproducible.

Repeated toggling of the 'lo' interface in AIO-SX does not cause a segfault in any process, including mtcAgent, which handles the error condition as expected with retries (see the logs below and the verification sketch after them).

2021-04-13T18:02:37.933 [1255499.00117] controller-0 mtcAgent hbs nodeClass.cpp (4710) service_netlink_events : Warn : Management link lo is up
2021-04-13T18:02:37.933 [1255499.00118] controller-0 mtcAgent hbs nodeClass.cpp (4715) service_netlink_events : Warn : Cluster-host link lo is up
2021-04-13T18:02:38.119 [1255499.00119] controller-0 mtcAgent hbs nodeClass.cpp (4674) service_netlink_events : Warn : Management link lo is down
2021-04-13T18:02:38.119 [1255499.00120] controller-0 mtcAgent hbs nodeClass.cpp (4680) service_netlink_events : Warn : Cluster-host link lo is down
2021-04-13T18:02:38.119 [1255499.00121] controller-0 mtcAgent hbs nodeClass.cpp (4687) service_netlink_events : Warn : lo is down (oper:down)
2021-04-13T18:02:40.040 [1255499.00122] controller-0 mtcAgent hbs nodeClass.cpp (4710) service_netlink_events : Warn : Management link lo is up
2021-04-13T18:02:40.040 [1255499.00123] controller-0 mtcAgent hbs nodeClass.cpp (4715) service_netlink_events : Warn : Cluster-host link lo is up
2021-04-13T18:02:42.179 [1255499.00124] controller-0 mtcAgent --- msgClass.cpp ( 737) write :Error : Failed to send with errno=101
2021-04-13T18:02:42.179 [1255499.00125] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 859) send_hbs_command : Warn : controller-0 send command (0x11110016) failed (abcd:204::2)
2021-04-13T18:02:42.179 [1255499.00126] controller-0 mtcAgent --- nodeBase.cpp ( 306) print_mtc_message : Info : controller-0 rx <- publish active controller (Mgmnt network) 2.0 11110016:1:3.0.0.0 [cgts mtc hbs cmd:] {"
2021-04-13T18:02:52.199 [1255499.00127] controller-0 mtcAgent --- msgClass.cpp ( 737) write :Error : Failed to send with errno=101
2021-04-13T18:02:52.199 [1255499.00128] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 859) send_hbs_command : Warn : controller-0 send command (0x11110016) failed (abcd:204::2)
2021-04-13T18:02:52.199 [1255499.00129] controller-0 mtcAgent --- nodeBase.cpp ( 306) print_mtc_message : Info : controller-0 rx <- publish active controller (Mgmnt network) 2.0 11110016:1:3.0.0.0 [cgts mtc hbs cmd:] {"
2021-04-13T18:03:01.159 [1255499.00130] controller-0 mtcAgent hbs nodeClass.cpp (5561) log_process_failure : Warn : controller-0 pmon: 'sw-patch-controller-daemon' process failed and is being auto recovered
2021-04-13T18:03:02.219 [1255499.00131] controller-0 mtcAgent --- msgClass.cpp ( 737) write :Error : Failed to send with errno=101
2021-04-13T18:03:02.219 [1255499.00132] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 859) send_hbs_command : Warn : controller-0 send command (0x11110016) failed (abcd:204::2)
2021-04-13T18:03:02.219 [1255499.00133] controller-0 mtcAgent --- nodeBase.cpp ( 306) print_mtc_message : Info : controller-0 rx <- publish active controller (Mgmnt network) 2.0 11110016:1:3.0.0.0 [cgts mtc hbs cmd:] {"
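
For completeness, one way to confirm that mtcAgent survived the toggling (the exact verification steps are not recorded here; this check assumes the standard StarlingX log location):

# the mtcAgent PID should be unchanged across the lo down/up cycles,
# and kern.log should contain no new segfault entries
pidof mtcAgent
grep -i 'mtcAgent.*segfault' /var/log/kern.log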

Changed in starlingx:
status: Triaged → Invalid