compute node collectd coredump generated during initial setup

Bug #1876728 reported by Peng Peng
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
A coredump file was found on a compute node, and its timestamp shows that it was generated during the node's initial setup.

Severity
--------
Major

Steps to Reproduce
------------------
Install a multi-node system.

TC-name: common/test_system_health.py::TestCoreDumpsAndCrashes::test_system_coredumps_and_crashes[core_dumps]

Expected Behavior
------------------
No coredump file is generated during system setup.

Actual Behavior
----------------
A collectd coredump file was generated on compute-2 during system setup.

Reproducibility
---------------
Unknown. This is the first time it has been seen in sanity; will monitor.

System Configuration
--------------------
Multi node system

Lab-name: WP_8-12

Branch/Pull Time/Commit
-----------------------
2020-05-03_20-00-00

Last Pass
---------
2020-05-01_20-00-00

Timestamp/Logs
--------------
assert not {'compute-2': ['-rw-r----- 1 root root 5901772 2020-05-04_08-15-54 core.collectd.0.d6d037c866ce4ee98ba37b33c98248d6.2547.1588580127000000.xz']}
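The core file name in the assertion above carries more detail than the listing timestamp. A minimal sketch of decoding it, assuming the file was written with the usual systemd-coredump naming layout `core.<comm>.<uid>.<boot_id>.<pid>.<timestamp_us>` plus a compression suffix (the report itself only shows the name; `parse_core_name` and `CoreFile` are hypothetical helpers, not part of the test suite):

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class CoreFile(NamedTuple):
    comm: str            # name of the crashed process
    uid: int             # user ID the process ran as
    boot_id: str         # boot ID of the host at crash time
    pid: int             # PID of the crashed process
    dumped_at: datetime  # crash time in UTC


def parse_core_name(name: str) -> CoreFile:
    """Split a core file name, assuming the layout described above."""
    fields = name.split(".")
    if fields[0] != "core":
        raise ValueError(f"not a core file name: {name!r}")
    # Drop a trailing compression suffix such as "xz" if present.
    if not fields[-1].isdigit():
        fields = fields[:-1]
    if len(fields) < 6:
        raise ValueError(f"unexpected core file name: {name!r}")
    # Read fixed fields from the right so a dotted process name still works.
    timestamp_us = int(fields[-1])
    pid = int(fields[-2])
    boot_id = fields[-3]
    uid = int(fields[-4])
    comm = ".".join(fields[1:-4])
    # Assumed: the timestamp is microseconds since the Unix epoch.
    seconds, micros = divmod(timestamp_us, 1_000_000)
    dumped_at = datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(
        microseconds=micros
    )
    return CoreFile(comm, uid, boot_id, pid, dumped_at)


core = parse_core_name(
    "core.collectd.0.d6d037c866ce4ee98ba37b33c98248d6.2547.1588580127000000.xz"
)
print(core.comm, core.pid, core.dumped_at.isoformat())
```

Under that assumption the name decodes to process collectd, PID 2547, dumped at 2020-05-04 08:15:27 UTC, which falls inside the window in which the install test was still waiting for hosts to be configured.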

====================== Test Step 7: Wait for Deployment Mgr to configure other hosts
[2020-05-04 07:52:40,025] 2169 INFO MainThread fresh_install_helper.wait_for_deploy_mgr_hosts_config:: Waiting for Deploy Mgr to configure and unlock hosts: ['controller-1', 'compute-0', 'compute-1', 'compute-2'] ..

[2020-05-04 08:14:50,814] 58 INFO MainThread kube_helper.exec_kube_cmd:: exec_kube_cmd:kubectl get hosts -n=deployment -o=wide
[2020-05-04 08:15:11,149] 58 INFO MainThread kube_helper.exec_kube_cmd:: exec_kube_cmd:kubectl get hosts -n=deployment -o=wide
[2020-05-04 08:15:31,452] 58 INFO MainThread kube_helper.exec_kube_cmd:: exec_kube_cmd:kubectl get hosts -n=deployment -o=wide
[2020-05-04 08:15:51,769] 58 INFO MainThread kube_helper.exec_kube_cmd:: exec_kube_cmd:kubectl get hosts -n=deployment -o=wide
[2020-05-04 08:16:12,126] 58 INFO MainThread kube_helper.exec_kube_cmd:: exec_kube_cmd:kubectl get hosts -n=deployment -o=wide
[2020-05-04 08:16:12,510] 2183 INFO MainThread fresh_install_helper.wait_for_deploy_mgr_hosts_config:: Waiting for ['controller-1', 'compute-0', 'compute-1', 'compute-2'] to become availability=available and insync=true: [('compute-0', 'available', 'true')]

Test Activity
-------------
installation

Tags: stx.metal
Revision history for this message
Peng Peng (ppeng) wrote :

Collect logs have been added at
https://files.starlingx.kube.cengn.ca/launchpad/1876728

coredump file attached

tags: added: stx.retestneeded
summary: - compute node coredump collected during intial setup
+ compute node coredump collected during initial setup
Revision history for this message
Ghada Khalil (gkhalil) wrote : Re: compute node coredump collected during initial setup

Marking as low priority / not gating since there is no system impact noted.

Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
tags: added: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Eric as he is the domain owner for collectd. However, this is not gating and should NOT be investigated as a priority.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Similar issue reported in https://bugs.launchpad.net/starlingx/+bug/1872979

Was the lab configured with SR-IOV in this failure case?

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_71_75
Load: 2020-05-05_20-29-49

https://files.starlingx.kube.cengn.ca/launchpad/1876728

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Can this be marked as a duplicate of https://bugs.launchpad.net/starlingx/+bug/1872979?

summary: - compute node coredump collected during initial setup
+ compute node collectd coredump generated during initial setup
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

The collected logs for this bug report contain the same failure mode as bug report 'https://bugs.launchpad.net/starlingx/+bug/1872979'

The most identifying aspect of this failure mode is the following log, also reported in my analysis of 1872979.

2020-05-04T08:15:27.343 compute-2 collectd[2547]: info *** Error in `/usr/sbin/collectd': double free or corruption (!prev): 0x00007f0c3c008840 ***

The logs surrounding the event look the same as well.

Marking this issue as duplicate of 1872979.
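For anyone checking other collect logs for the same signature, a minimal sketch of a scanner follows. It assumes log lines shaped like the single line quoted above (`<timestamp> <host> <process>[<pid>]: ... *** Error in ...`); the pattern and the `find_heap_aborts` helper are illustrative only and are not part of any StarlingX tooling.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class HeapAbort(NamedTuple):
    timestamp: str
    host: str
    process: str
    pid: int
    binary: str
    reason: str
    address: str


# Assumed line layout, inferred from the one log line quoted in this report.
HEAP_ABORT = re.compile(
    r"^(?P<ts>\S+) (?P<host>\S+) (?P<proc>[^\[\s]+)\[(?P<pid>\d+)\]: .*?"
    r"\*\*\* Error in `(?P<binary>[^']+)': (?P<reason>.+?): "
    r"(?P<addr>0x[0-9a-fA-F]+) \*\*\*"
)


def find_heap_aborts(lines: Iterable[str]) -> Iterator[HeapAbort]:
    """Yield one record per line that carries the heap-abort signature."""
    for line in lines:
        match = HEAP_ABORT.search(line)
        if match:
            yield HeapAbort(
                timestamp=match["ts"],
                host=match["host"],
                process=match["proc"],
                pid=int(match["pid"]),
                binary=match["binary"],
                reason=match["reason"],
                address=match["addr"],
            )


sample = [
    "2020-05-04T08:15:27.343 compute-2 collectd[2547]: info *** Error in "
    "`/usr/sbin/collectd': double free or corruption (!prev): "
    "0x00007f0c3c008840 ***",
]
for hit in find_heap_aborts(sample):
    print(hit.host, hit.process, hit.pid, hit.reason)
```

Matching on the process name and the reason text, rather than on the address, keeps the comparison stable across runs, since the reported heap address would be expected to differ between occurrences.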

Changed in starlingx:
status: Triaged → Invalid
Peng Peng (ppeng)
tags: removed: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per above, confirmed to be a duplicate of https://bugs.launchpad.net/starlingx/+bug/1872979

Setting the status to Fix Released to match the duplicate.
Fix was merged on 2020-06-30: https://review.opendev.org/#/c/736817/

Changed in starlingx:
status: Invalid → Fix Released