mtcAgent seen to core dump on process exit

Bug #1831956 reported by Eric MacDonald
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Low         Eric MacDonald

Bug Description

An mtcAgent core dump was observed during unit testing of a fix for another issue that involved rebooting the active controller.

controller-0:~$ ls -lrt /var/lib/systemd/coredump/
-rw-r----- 1 root root 275348 Jun 6 19:56 core.mtcAgent.0.9f361ef237c147d494dc8f268dc81fd5.576155.1559851001000000.xz

Debug of the coredump showed that it occurred inside the nodeLinkClass destructor.

#0 0x00007fcf640b34af in _int_free () from /lib64/libc.so.6
#1 0x0000000000452312 in std::_List_base<libEvent, std::allocator<libEvent> >::_M_clear() ()
#2 0x00000000004a1a82 in nodeLinkClass::~nodeLinkClass() ()
#3 0x00007fcf6406bb69 in __run_exit_handlers () from /lib64/libc.so.6
#4 0x00007fcf6406bbb7 in exit () from /lib64/libc.so.6
#5 0x0000000000414411 in daemon_exit() ()
#6 0x00000000004ad4e5 in daemon_signal_hdlr() ()
#7 0x0000000000417a5b in daemon_service_run() ()
#8 0x0000000000406585 in main ()
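
Frames #2 to #4 are consistent with a static-storage object being destroyed by the exit handlers that exit() runs. The sketch below illustrates only that mechanism, not the mtcAgent code: `Inventory`, `report_fd` and `destructor_runs_inside_exit` are hypothetical names, and it assumes a POSIX system. A forked child calls exit() and its static object's destructor reports back through a pipe.

```cpp
#include <cassert>
#include <cstdlib>
#include <list>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for nodeLinkClass; not the real mtce class.
struct Inventory {
    std::list<std::string> events;
    int report_fd = -1;            // demo-only pipe end, unset in the parent
    ~Inventory() {
        // After this body, std::list's destructor frees every node
        // (compare frame #1, _M_clear). Report that we got here.
        if (report_fd >= 0) {
            ssize_t rc = write(report_fd, "D", 1);
            (void)rc;
        }
    }
};

static Inventory inventory;        // static storage duration, like a global

// Fork a child that calls exit(); return true if the child's static
// destructor ran as part of exit().
bool destructor_runs_inside_exit() {
    int fds[2];
    if (pipe(fds) != 0) return false;
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    if (pid == 0) {                // child
        close(fds[0]);
        inventory.events.push_back("event");
        inventory.report_fd = fds[1];
        std::exit(0);              // runs destructors of static objects
    }
    close(fds[1]);                 // parent keeps only the read end
    char byte = 0;
    ssize_t n = read(fds[0], &byte, 1);
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return n == 1 && byte == 'D';
}
```

Calling `destructor_runs_inside_exit()` returns true when the destructor ran in the child, which matches the order of frames in the backtrace: exit(), then the exit handlers, then the destructor.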

Severity
--------
Minor: not service impacting

Steps to Reproduce
------------------
reboot the active controller

Expected Behavior
------------------
no core dump

Actual Behavior
----------------
occasional core dump

Reproducibility
---------------
Intermittent: 1 in 10

System Configuration
--------------------
Any

Branch/Pull Time/Commit
-----------------------
SW_VERSION="19.01" rebase as of "2019-06-05 18:32:46"

Last Pass
---------
Unknown

Tags: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
status: New → In Progress
Eric MacDonald (rocksolidmtce) wrote :

Ran hundreds of mtcAgent process kills (-INT and -TERM) overnight and did not get any core dumps.

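# Soak loop: kill mtcAgent, restart it by hand, then list any core dumps.
while true
do
    sm-unmanage service mtc-agent    # take mtc-agent out of SM management
    pkill -TERM mtcAgent
    sleep 3
    date
    /usr/local/bin/mtcAgent -l -a
    s=$((1 + RANDOM % 30))           # random wait of 1 to 30 seconds
    echo "sleeping $s seconds"
    sleep "$s"
    ps -efL | grep mtcAgent
    ls /var/lib/systemd/coredump/
    sleep 2
done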

I also ran another variation in which mtcAgent was restarted by SM.

Ghada Khalil (gkhalil) wrote :

Low priority; no service impact; issue is intermittent; does not gate any starlingx release

Changed in starlingx:
importance: Undecided → Low
tags: added: stx.metal
Eric MacDonald (rocksolidmtce) wrote :

The issue is not as frequent as originally reported (1 in 10). I can't reproduce it even after hundreds of attempts.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/685774

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/685774
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=01818fdf090f0a1e0662b1a7e5c1586ecc3862be
Submitter: Zuul
Branch: master

commit 01818fdf090f0a1e0662b1a7e5c1586ecc3862be
Author: Eric MacDonald <email address hidden>
Date: Sun Sep 29 11:33:42 2019 -0400

    Fix rare mtcAgent segfault on process shutdown

    A mtcAgent core dump was observed during unit testing
    of another feature. Debug of that coredump revealed the
    segfault occurred in the freeing of host object memory.

    Occurrence is extremely rare but at the same time there
    is no real need to free this memory in the destructor
    because the kernel, which is much better at that task,
    does that automatically when a process exits.

    Test Plan:

    PASS: Verify system install with no core dumps.
    PASS: Verify mtcAgent process restart soak with no core dumps.

    Change-Id: I6107078fb802be0ef2aaf632b79e751376ba9c42
    Closes-Bug: 1831956
    Signed-off-by: Eric MacDonald <email address hidden>
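
The merged change itself is not reproduced here. As a rough illustration of the idea in the commit message (leave exit-time memory reclamation to the kernel), the sketch below uses a different, commonly seen technique: a heap-allocated instance that is never deleted, so no destructor is registered to run at exit. `HostInventory`, `inventory()`, `g_report_fd` and `exit_skips_teardown` are hypothetical names, and the code assumes a POSIX system.

```cpp
#include <cassert>
#include <cstdlib>
#include <list>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int g_report_fd = -1;       // demo-only pipe end

// Hypothetical host inventory; not the real nodeLinkClass.
struct HostInventory {
    std::list<std::string> hosts;
    ~HostInventory() {
        // Would report through the pipe if it were ever called.
        if (g_report_fd >= 0) {
            ssize_t rc = write(g_report_fd, "D", 1);
            (void)rc;
        }
    }
};

// Allocated once on the heap and never deleted, so exit() has no
// destructor to run; the kernel reclaims the memory at process exit.
HostInventory& inventory() {
    static HostInventory* instance = new HostInventory();
    return *instance;
}

// Fork a child that uses the inventory and calls exit(); return true if
// the child exited cleanly without running the destructor.
bool exit_skips_teardown() {
    int fds[2];
    if (pipe(fds) != 0) return false;
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    if (pid == 0) {                // child
        close(fds[0]);
        g_report_fd = fds[1];
        inventory().hosts.push_back("controller-0");
        std::exit(0);              // no HostInventory destructor registered
    }
    close(fds[1]);                 // parent keeps only the read end
    char byte = 0;
    ssize_t n = read(fds[0], &byte, 1);   // 0 means EOF: nothing written
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return n == 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

The trade-off is that leak checkers may flag the instance as still reachable at exit, and any cleanup with effects outside the process (flushing files, releasing external resources) would have to be done explicitly before exit.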

Changed in starlingx:
status: In Progress → Fix Released