collectd core dump generated after lock/unlock controller-0

Bug #1872979 reported by Anujeyan Manokeran
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
  The coredump files below were generated during the automation run of test cases test_invalid_huge_page_input[1--2M asdf -1G asdf] and test_ptp_over_interface[sriov], after the controller-0 lock and unlock.

controller-0_2020-04-12_02-17-06_core.collectd.0.e31bf7491508489f91925f17ae67873b.2522.1586657808000000.xz
controller-0_2020-04-12_02-47-56_core.collectd.0.030bf5affac94ebdb55618230c994ca9.2520.1586659659000000.xz

Severity
--------
Major

Steps to Reproduce
------------------
1. Configure ptp.
 Eg : system host-update clock_synchronization ptp
2. lock and unlock controller-0

Expected Behavior
------------------
No core dump

Actual Behavior
----------------
As described above, collectd core dumps on controller-0.

Reproducibility
---------------
WCP-11 . Intermittent. Seen twice out of 4 lock/unlock on same system.

System Configuration
--------------------
wildcat-11

Branch/Pull Time/Commit
-----------------------
Build ID: 2020-04-11_00-10-00

Last Pass
---------

Timestamp/Logs
--------------
 02:17 & 02:47

Test Activity
-------------
Sysinv Automated regression run

Yang Liu (yliu12)
description: updated
summary: - controller-0 core dump during sysinv test execution
+ collectd core dump generated after lock/unlock controller-0
tags: added: stx.retestneeded
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Eric MacDonald (rocksolidmtce) wrote :

collectd is not built with symbols, nor does the build system produce a debuginfo rpm with symbols, which makes it difficult to debug the issue from the coredump.

controller-0:~$ sudo gdb /usr/sbin/collectd ./controller-0_2020-04-12_02-47-56_core.collectd.0.030bf5affac94ebdb55618230c994ca9.2520.1586659659000000
Password:
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/collectd...Reading symbols from /usr/sbin/collectd...(no debugging symbols found)...done.
(no debugging symbols found)...done.
[New LWP 3185]
[New LWP 3192]
[New LWP 3186]
[New LWP 3188]
[New LWP 3194]
[New LWP 3189]
[New LWP 3187]
[New LWP 3191]
[New LWP 3190]
[New LWP 3193]
[New LWP 2520]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/collectd'.
Program terminated with signal 6, Aborted.
#0 0x00007f53ceae9207 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install collectd-5.8.1-4.el7.x86_64

daemon.log contains a traceback, but nothing in it pinpoints the plugin, or the place in the plugin, where the traceback occurred.

33373 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info ======= Backtrace: =========
33374 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /lib64/libc.so.6(+0x81489)[0x7f53ceb34489]
33375 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/lib64/collectd/network.so(+0x3445)[0x7f53ce6a5445]
33376 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/lib64/collectd/network.so(+0x3585)[0x7f53ce6a5585]
33377 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/lib64/collectd/network.so(+0x5afe)[0x7f53ce6a7afe]
33378 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/lib64/collectd/network.so(+0x5fb7)[0x7f53ce6a7fb7]
33379 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/lib64/collectd/network.so(+0x6aed)[0x7f53ce6a8aed]
33380 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/sbin/collectd(plugin_write+0xf4)[0x5622b87dce14]
33381 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/sbin/collectd(+0x9b8d)[0x5622b87d8b8d]
33382 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/sbin/collectd(+0xd005)[0x5622b87dc005]
33383 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/sbin/collectd(+0xf4d7)[0x5622b87de4d7]
33384 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /lib64/libpthread.so.0(+0x7dd5)[0x7f53cf290dd5]
33385 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /lib64/libc.so.6(clone+0x6d)[0x7f53cebb0ead]
33386 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info ======= Memory map: ========
33387 2020-04-12T02:47:39....
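
Since the coredump has no symbols, one way to narrow things down is to group the glibc backtrace frames by module. The sketch below is illustrative only: the regular expression and the `parse_frames` helper are assumptions made for this example, and the sample lines are copied from the backtrace above with the timestamp prefix trimmed.

```python
import re
from collections import Counter

# Matches glibc backtrace frames such as
#   /usr/lib64/collectd/network.so(+0x3445)[0x7f53ce6a5445]
#   /usr/sbin/collectd(plugin_write+0xf4)[0x5622b87dce14]
FRAME_RE = re.compile(
    r"(?P<module>/\S+?)\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-f]+)\)"
    r"\[(?P<addr>0x[0-9a-f]+)\]"
)


def parse_frames(log_lines):
    """Return (module, symbol or None, offset) for each backtrace frame."""
    frames = []
    for line in log_lines:
        match = FRAME_RE.search(line)
        if match:
            symbol = match.group("symbol") or None
            frames.append((match.group("module"), symbol,
                           int(match.group("offset"), 16)))
    return frames


# Frames from the daemon.log backtrace, timestamp prefix trimmed.
SAMPLE = [
    "collectd[2520]: info /lib64/libc.so.6(+0x81489)[0x7f53ceb34489]",
    "collectd[2520]: info /usr/lib64/collectd/network.so(+0x3445)[0x7f53ce6a5445]",
    "collectd[2520]: info /usr/lib64/collectd/network.so(+0x3585)[0x7f53ce6a5585]",
    "collectd[2520]: info /usr/lib64/collectd/network.so(+0x5afe)[0x7f53ce6a7afe]",
    "collectd[2520]: info /usr/lib64/collectd/network.so(+0x5fb7)[0x7f53ce6a7fb7]",
    "collectd[2520]: info /usr/lib64/collectd/network.so(+0x6aed)[0x7f53ce6a8aed]",
    "collectd[2520]: info /usr/sbin/collectd(plugin_write+0xf4)[0x5622b87dce14]",
    "collectd[2520]: info /usr/sbin/collectd(+0x9b8d)[0x5622b87d8b8d]",
    "collectd[2520]: info /usr/sbin/collectd(+0xd005)[0x5622b87dc005]",
    "collectd[2520]: info /usr/sbin/collectd(+0xf4d7)[0x5622b87de4d7]",
    "collectd[2520]: info /lib64/libpthread.so.0(+0x7dd5)[0x7f53cf290dd5]",
    "collectd[2520]: info /lib64/libc.so.6(clone+0x6d)[0x7f53cebb0ead]",
]

frames = parse_frames(SAMPLE)
per_module = Counter(module for module, _, _ in frames)
for module, count in per_module.most_common():
    print(f"{count:2d} frame(s) in {module}")
```

For this backtrace, every frame between the libc abort path and `plugin_write` falls in `network.so`, which points at the network plugin's write path even without symbols.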


Eric MacDonald (rocksolidmtce) wrote :

In daemon.log, the collectd network plugin starts reporting sendto errors in the minutes leading up to the traceback.

The first indication of network failures is shown here; blocks like this repeat until the coredump about 2 minutes later.

32758 2020-04-12T02:45:13.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.
32759 2020-04-12T02:45:13.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.
32760 2020-04-12T02:45:13.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.
32761 2020-04-12T02:45:13.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.
32762 2020-04-12T02:45:13.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.
32763 2020-04-12T02:45:13.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.

Then continued failure logs just leading up to the ultimate coredump

2020-04-12T02:47:39.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.
2020-04-12T02:47:39.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.

2020-04-12T02:47:39.161 controller-0 collectd[2520]: info *** Error in `/usr/sbin/collectd': double free or corruption (!prev): 0x00007f53ac00fde0 ***
2020-04-12T02:47:39.161 controller-0 collectd[2520]: info ======= Backtrace: =========

The last log above looks particularly bad. 'double free or corruption' followed by the coredump
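
To put a number on that window, a small script can measure the time from the first sendto failure to the 'double free or corruption' log. This is only a sketch: the helper names are made up for this example, and the three sample lines are copied from the logs quoted above.

```python
from datetime import datetime


def parse_ts(line):
    """Return the first ISO-like timestamp token in a log line, or None."""
    for token in line.split():
        try:
            return datetime.strptime(token, "%Y-%m-%dT%H:%M:%S.%f")
        except ValueError:
            continue
    return None


def seconds_from_first_error_to_crash(lines):
    """Seconds between the first 'sendto failed' log and the glibc abort log."""
    first_err = None
    crash = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if first_err is None and "sendto failed" in line:
            first_err = ts
        if "double free or corruption" in line:
            crash = ts
            break
    if first_err is None or crash is None:
        return None
    return (crash - first_err).total_seconds()


SAMPLE = [
    "32758 2020-04-12T02:45:13.000 controller-0 collectd[2520]: err network plugin: "
    "sendto failed: Network is unreachable. Closing sending socket.",
    "2020-04-12T02:47:39.000 controller-0 collectd[2520]: err network plugin: "
    "sendto failed: Network is unreachable. Closing sending socket.",
    "2020-04-12T02:47:39.161 controller-0 collectd[2520]: info *** Error in "
    "`/usr/sbin/collectd': double free or corruption (!prev): 0x00007f53ac00fde0 ***",
]

gap = seconds_from_first_error_to_crash(SAMPLE)
print(f"first sendto failure to crash: {gap:.3f} s")
```

With these sample lines the gap is roughly 146 seconds, consistent with the ~2 minutes noted above.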

Eric MacDonald (rocksolidmtce) wrote :

Issue followed ptp reconfiguration and host lock/unlock and this lab has a very specific hardware ptp configuration.

  > service-parameter-add ptp global domainNumber=24
  > service-parameter-add ptp global delay_mechanism=e2e
  > service-parameter-apply ptp
  > host-if-modify controller-0 sriov0 --ptp-role slave
  > host-unlock controller-0

Need to try and reproduce following the current activity going on in this lab.

Eric MacDonald (rocksolidmtce) wrote :

Question to PV: Is this the only case where this issue was seen? If not, has it been seen on any other system/lab?

Anujeyan Manokeran (anujeyan) wrote :

This was not seen in any other regular lab. This automation test had not been executed on simplex before because it was automated only recently.

Ghada Khalil (gkhalil) wrote :

@Jeyan, what is the system impact of this issue? Does collectd coredump and then recover?

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx. stx.metal
tags: added: stx.4.0
tags: removed: stx.
Ghada Khalil (gkhalil) wrote :

Marking as stx.4.0 / medium priority for now until further attempts to reproduce and investigate

Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_112
Load: 2020-05-07_21-11-18

log:
[2020-05-08 07:13:30,469] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

'-rw-r----- 1 root root 6074060 2020-05-08_07-18-53 core.collectd.0.be8e03d4a8004b7088e687026a86cf54.2165.1588922304000000.xz'

http://128.224.150.21/auto_logs/wcp_112/202005080309/

log added:
https://files.starlingx.kube.cengn.ca/launchpad/1872979

Eric MacDonald (rocksolidmtce) wrote :

Reproduced the issue over a lock and unlock of an already PTP-provisioned AIO SX (wp11).

controller-0:~$ ls /var/lib/systemd/coredump/
core.collectd.0.b9fb04f48a1b410bad75f9c2e24627c2.2537.1590503151000000.xz

Eric MacDonald (rocksolidmtce) wrote :

Issue is not dependent on PTP.
Was able to reproduce a second time.
This time on the 5th lock/unlock try with NTP rather than PTP enabled.

The common thread is collectd running early and experiencing network plugin 'Network Unreachable' failures.

Ideally, collectd should start only after configuration is complete. However, because there is a collectd configuration manifest, the collectd service file cannot have an After=config dependency directive, as that would cause startup to hang: collectd needs config, but config needs collectd.

Peng Peng (ppeng) wrote :

The issue was reproduced on
2020-06-03_20-00-00
wcp_11

https://files.starlingx.kube.cengn.ca/launchpad/1872979

Difu Hu (difuhu) wrote :

Similar coredump collected on DC-1 subcloud2 after reboot.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (master)

Fix proposed to branch: master
Review: https://review.opendev.org/736817

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (master)

Reviewed: https://review.opendev.org/736817
Committed: https://git.openstack.org/cgit/starlingx/monitoring/commit/?id=63c8d1e55aecfb8aed4f98e29ccf0dc6ccd18cf3
Submitter: Zuul
Branch: master

commit 63c8d1e55aecfb8aed4f98e29ccf0dc6ccd18cf3
Author: Eric MacDonald <email address hidden>
Date: Thu Jun 18 15:44:32 2020 -0400

    Add consistent init and config complete checks to collectd plugins

    Some of the collectd plugins are not waiting for configuration
    complete before starting to monitor or communicate with external
    services such as fm. This leads to the collectd networking plugin
    being triggered to run before or while the host is being configured
    which has been seen to lead to collectd segfaults/coredumps within
    the collectd's internal networking plugin.

    To solve this issue, reduce startup thrash and a slew of plugin
    startup error logs, this update adds consistent initialization
    and configuration complete checks to all of the starlingX
    plugins so monitoring and external service access is not
    performed until the host configuration is complete.

    Test Plan:

    PASS: Verify no plugin sampling till after config is complete
    PASS: Verify alarm assert and clear cycle for all plugins
    PASS: Install AIO SX system install
    PASS: Install AIO DX system install
    PEND: Verify Standard system install
    PASS: Verify logging

    Change-Id: I90a5d1c8c3be77269a571738c9499b2e908e1fc5
    Closes-Bug: 1872979
    Signed-off-by: Eric MacDonald <email address hidden>
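
The "configuration complete check" described in the commit message can be sketched roughly as follows. This is not the merged plugin code: the `ConfigGate` class, the `gated_read` helper, and the flag file path are all hypothetical names chosen for this illustration, and the real plugins may detect configuration completion differently.

```python
import os
import tempfile


class ConfigGate:
    """Remember whether host configuration has completed.

    flag_path is a hypothetical marker file assumed to be created once
    host configuration is complete.
    """

    def __init__(self, flag_path):
        self.flag_path = flag_path
        self.config_complete = False

    def ready(self):
        # Once the flag has been seen, stop checking the filesystem.
        if not self.config_complete:
            self.config_complete = os.path.exists(self.flag_path)
        return self.config_complete


def gated_read(gate, sample):
    """Run the plugin's sampling callable only after config is complete."""
    if not gate.ready():
        return None  # no sampling and no external service access yet
    return sample()


with tempfile.TemporaryDirectory() as tmp:
    flag = os.path.join(tmp, "config_complete")  # hypothetical flag path
    gate = ConfigGate(flag)
    before = gated_read(gate, lambda: 42)  # flag missing: sampling skipped
    open(flag, "w").close()                # simulate config completing
    after = gated_read(gate, lambda: 42)   # flag present: sampling runs

print(before, after)
```

Caching the result means a plugin pays for the filesystem check only until the host is configured, after which each read cycle goes straight to sampling.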

Changed in starlingx:
status: In Progress → Fix Released
Peng Peng (ppeng) wrote :

Issue was reproduced on
2020-07-01_22-00-00
WCP_112, R430_3-4, WP_8-12

Ghada Khalil (gkhalil) wrote :

Re-opening as the issue is still seen with a build containing the above fix

Changed in starlingx:
status: Fix Released → Confirmed
Ghada Khalil (gkhalil) wrote :

Moving to stx.5.0 as there is no functional impact

tags: added: stx.5.0
removed: stx.4.0
Ghada Khalil (gkhalil) wrote :

There is a new occurrence reported in https://bugs.launchpad.net/starlingx/+bug/1901039 which I will mark as duplicate

Eric MacDonald (rocksolidmtce) wrote :

The following additional updates have been merged towards the fix of this issue.

update : Add node ready check to collectd plugins
review : https://review.opendev.org/c/starlingx/monitoring/+/772349
commit : https://opendev.org/starlingx/monitoring/commit/ea4b515f91f38523a22e877ebba9d552962153b2

update : Modify collectd manifest to not start collectd
review : https://review.opendev.org/c/starlingx/stx-puppet/+/772354
commit : https://opendev.org/starlingx/stx-puppet/commit/a34cd954e92f37994d40a31ccc2777249598622d

update : Make mtcClient stop collectd before shutdown
review : https://review.opendev.org/c/starlingx/metal/+/772356
commit : https://opendev.org/starlingx/metal/commit/2d5c5b04edf0d84f78a87e971cf1646e6efda00f

Eric MacDonald (rocksolidmtce) wrote :

Unfortunately, a new occurrence has been observed and initial investigation suggests that there is yet another failure mode that none of the 3 fixes above could address.

Investigation of this new failure mode, which was not observed during the exhaustive testing/soaking of the above fixes, is under way.

Peng Peng (ppeng) wrote :

The issue was reproduced on
WRCP_20.06_POSTGA_Build 2021-02-18_20-00-07
R730_1

Ghada Khalil (gkhalil) wrote :

Lowering the priority as there is no system impact

Changed in starlingx:
importance: Medium → Low
tags: removed: stx.5.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792029

OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/monitoring/+/792244

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/metal/+/792250

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/792250
Committed: https://opendev.org/starlingx/metal/commit/6c2905e665ceeebfa7717c9cbccc1c277d10966b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 5942a56ec6f0b265ca6d1c8c800fe84c4a22860f
Author: Eric MacDonald <email address hidden>
Date: Thu May 13 15:57:43 2021 +0000

    Revert "Align partitions created by kickstarters"

    This reverts commit 0e89acc83c616741952a068a3ff07ba91440eff8.

    Reason for revert: Review should have been abandoned rather than merged.

    Change-Id: I95f1e151183f122d93b834ab2a785736e5a8ef12
    Closes-Bug: 1928341

commit c7c341b198e79bb98f443c7c07f671c6387075af
Author: Don Penney <email address hidden>
Date: Fri May 7 08:56:06 2021 -0400

    Add /pxeboot/grubx64.efi symlink for UEFI pxeboot

    UEFI pxeboot with shim.efi looks for the grubx64.efi in the tftpboot
    root directory. This update creates a symlink to the
    /pxeboot/EFI/grubx64.efi file in /pxeboot.

    Change-Id: Iabf8ec89d0af6e6b1a62e20159ecdfa16729444e
    Partial-Bug: 1927730
    Signed-off-by: Don Penney <email address hidden>

commit ce7529964932a9fd1cc10ce18dbe11e89ee02223
Author: Eric MacDonald <email address hidden>
Date: Wed May 5 19:05:55 2021 -0400

    Fix enabling heartbeat of self from the peer controller

    This issue only occurs over an hbsAgent process restart
    where the ready event response does not include the
    heartbeat start of the peer controller.

    This update reverts a small code change that was
    introduced by the following update.

    https://review.opendev.org/c/starlingx/metal/+/788495

    Remove the my_hostname gate introduced at line 1267 of
    mtcCtrlMsg.cpp because it prevents enabling heartbeat
    of self by the peer controller.

    Change-Id: Id72c35f25e2a5231a8a8363a35a81e042f00085e
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <email address hidden>

commit 48978d804d6f22130d0bd8bd17f361441024bc6c
Author: Eric MacDonald <email address hidden>
Date: Wed Apr 28 09:39:19 2021 -0400

    Improved maintenance handling of spontaneous active controller reboot

    Performing a forced reboot of the active controller sometimes
    results in a second reboot of that controller. The cause of the
    second reboot was due to its reported uptime in the first mtcAlive
    message, following the reboot, as greater than 10 minutes.

    Maintenance has a long standing graceful recovery threshold of
    10 minutes. Meaning that if a host loses heartbeat and enters
    Graceful Recovery, if the uptime value extracted from the first
    mtcAlive message following the recovery of that host exceeds 10
    minutes, then maintenance interprets that the host did not reboot.
    If a host goes absent for longer than this threshold then for
    reasons not limited to security, maintenance declares the host
    as 'failed' and force re-enables it through a reboot.

    With the introduction of containers and addition of new features
    over the last few releases, boot times on some servers are
    approaching the 10 minute threshold an...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/792029
Committed: https://opendev.org/starlingx/stx-puppet/commit/2b026190a3cb6d561b6ec4a46dfb3add67f1fa69
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 3e3940824dfb830ebd39fd93265b983c6a22fc51
Author: Dan Voiculeasa <email address hidden>
Date: Thu May 13 18:03:45 2021 +0300

    Enable kubelet support for pod pid limit

    Enable limiting the number of pids inside of pods.

    Add a default value to protect against a missing value.
    Default to 750 pids limit to align with service parameter default
    value for most resource consuming StarlingX optional app (openstack).
    In fact any value above service parameter minimum value is good for the
    default.

    Closes-Bug: 1928353
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I10c1684fe3145e0a46b011f8e87f7a23557ddd4a

commit 0c16d288fbc483103b7ba5dad7782e97f59f4e17
Author: Jessica Castelino <email address hidden>
Date: Tue May 11 10:21:57 2021 -0400

    Safe restart of the etcd SM service in etcd upgrade runtime class

    While upgrading the central cloud of a DC system, activation failed
    because there was an unexpected SWACT to controller-1. This was due
    to the etcd upgrade script. Part of this script runs the etcd
    manifest. This triggers a reload/restart of the etcd service. As this
    is done outside of the sm, sm saw the process failure and triggered
    the SWACT.

    This commit modifies platform::etcd::upgrade::runtime puppet class
    to do a safe restart of the etcd SM service and thus, solve the
    issue.

    Change-Id: I3381b6976114c77ee96028d7d96a00302ad865ec
    Signed-off-by: Jessica Castelino <email address hidden>
    Closes-Bug: 1928135

commit eec3008f600aeeb69a42338ed44332228a862d11
Author: Mihnea Saracin <email address hidden>
Date: Mon May 10 13:09:52 2021 +0300

    Serialize updates to global_filter in the AIO manifest

    Right now, looking at the aio manifest:
    https://review.opendev.org/c/starlingx/stx-puppet/+/780600/15/puppet-manifests/src/manifests/aio.pp
    there are 3 classes that update
    in parallel the lvm global_filter:
    - include ::platform::lvm::controller
    - include ::platform::worker::storage
    - include ::platform::lvm::compute
    And this generates some errors.

    We fix this by adding dependencies between the above classes
    in order to update the global_filter in a serial mode.

    Closes-Bug: 1927762
    Signed-off-by: Mihnea Saracin <email address hidden>
    Change-Id: If6971e520454cdef41138b2f29998c036d8307ff

commit 97371409b9b2ae3f0db6a6a0acaeabd74927160e
Author: Steven Webster <email address hidden>
Date: Fri May 7 15:33:43 2021 -0400

    Add SR-IOV rate-limit dependency

    Currently, the binding of an SR-IOV virtual function (VF) to a
    driver has a dependency on platform::networking. This is needed
    to ensure that SR-IOV is enabled (VFs created) before actually
    doing the bind.

    This dependency does not exist for configuring the VF rate-limits
    however. There is a cha...

OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/monitoring/+/792244
Committed: https://opendev.org/starlingx/monitoring/commit/fdc0d099fb0d65cbf8f037fe0cc9ac8125410284
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 2ef5451f442482636db3c0c3641e8412821bd8c5
Author: Takamasa Takenaka <email address hidden>
Date: Thu Apr 22 12:28:37 2021 -0300

    Format 2 lines ntpq data into 1 lines

    The problem was logic expected one line data for
    ntpq result. But it was 2 lines for each ntp server
    entry. When peer server is selected, script checked
    refid if refid is reliable or not but it could not
    find because refid is in the following line.
    This fix formats 2 lines data into 1 line.

    The minor alarm "NTP cannot reach external time
    source; syncing with peer controller only" is
    removed because NTP does not prioritize an external
    time source over the peer.

    Closes-Bug: 1889101

    Signed-off-by: Takamasa Takenaka <email address hidden>
    Change-Id: Icc8316bb1a7041bf0351165c671ebf35b97fa3bc

commit d37490b81408ca53b1b8fd61992c6c9337dbcaed
Author: Eric MacDonald <email address hidden>
Date: Tue Apr 20 10:03:07 2021 -0400

    Add alarm audit to starlingx collectd fm notifier plugin

    This update adds common plugin support for alarm state auditing.
    The audit is able to detect and correct the following alarm
    state errors:

       Error Case                  Correction Action
       -----------------------     -----------------
     - stale alarm             ;   delete alarm
     - missing alarm           ;   assert alarm
     - alarm severity mismatch ;   refresh alarm

    The common audit is enabled for the fm_notifier plugin that supports
    alarm management for the following resources.

     - CPU with alarm id 100.101
     - Memory with alarm id 100.103
     - Filesystem with alarm id 100.104

    Other plugins may use this common audit in the future but only the
    above resources have the audit enabled for them by this update.

    Test Plan:

    PASS: Verify stale alarm detection/correction handling
    PASS: Verify missing alarm detection/correction handling
    PASS: Verify alarm severity mismatch detection/correction handling
    PASS: Verify hosts only audits its own specified alarms
    PASS: Verify success path of monitoring a single and mix
          of base and instance alarms of varying severity while
          such alarm conditions come and go
    PASS: Verify alarm audit of mix of base and instance alarms
          over a collectd process restart
    PASS: Verify audit handling of alarm that migrates from
          major to critical to major to clear
    PASS: Verify audit handling transition between alarm and
          no alarm conditions
    PASS: Verify soak of random cpu, memory and filesystem
          overage alarm assertions and clears that also involve
          manual alarm deletions, assertions and severity changes
          that exercise new audit features

    Regression:

    PASS: Verify alarm and audit handling over Swact with mounted
          filesystem that has active alarm
  ...


OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/797509

Changed in starlingx:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/797509
Committed: https://opendev.org/starlingx/stx-puppet/commit/ebcbf953bf82a49f545fc8de01b68ce547e78d6d
Submitter: "Zuul (22348)"
Branch: master

commit ebcbf953bf82a49f545fc8de01b68ce547e78d6d
Author: Eric MacDonald <email address hidden>
Date: Tue Jun 22 12:32:44 2021 -0400

    Reduce collectd write_threads from 5 to 1

    StarlingX currently uses collectd version 5.8.1
    with 5 write threads. This version of collectd is
    seen to coredump in its network plugin 1-2 times
    out of 100 process restarts. This means that
    every time a node is rebooted there is a 1-2 %
    chance it will coredump.

    The opensource collectd version 5.12.0 includes
    the following change which addresses a race
    condition by implementing a mutex pthread lock
    around the sendto network call to prevent the
    race condition and avoid the coredump.

    https://github.com/collectd/collectd/commit/c44c159065daf8bc7ab6c03287f281d317b1d5fd

    StarlingX is not yet prepared to migrate to this
    new version. Instead and until then this update
    reduces number of write_threads to 1, as
    recommended by the collectd update author, until
    StarlingX successfully integrates a version of
    collectd -ge 5.12.0

    Test Plan:

    PASS: Verify no collectd coredumps in over 5000
          process restarts across multiple servers

    Regression:

    PASS: Verify collectd logging
    PASS: Verify collectd sampling
    PASS: Verify alarming and degrade handling

    Closes-Bug: 1872979
    Signed-off-by: Eric MacDonald <email address hidden>
    Change-Id: Ie9297f596d30c2754142a5237608ebb227898ecb
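
The idea of serializing the send path can be illustrated with a small Python analogy. This is not collectd's C code and not the upstream patch: `FakeSocket`, `LockedSender`, and `run_writers` are hypothetical names for this sketch. It only shows why holding one lock around the send and the close-on-error step keeps several writer threads from tearing down the same socket twice. Running a single write thread avoids the same race by removing the concurrency altogether.

```python
import errno
import threading


class FakeSocket:
    """Stand-in for a UDP socket whose sendto always fails (hypothetical)."""

    double_closes = 0

    def __init__(self):
        self.closed = False

    def sendto(self, payload, addr):
        raise OSError(errno.ENETUNREACH, "Network is unreachable")

    def close(self):
        if self.closed:
            # Analogous to freeing the same resource twice.
            FakeSocket.double_closes += 1
        self.closed = True


class LockedSender:
    """Share one socket between writer threads, guarded by a single lock."""

    def __init__(self, sock_factory):
        self._sock_factory = sock_factory
        self._sock = None
        self._lock = threading.Lock()
        self.close_count = 0

    def send(self, payload, addr):
        # The lock covers both the send and the close-on-error path, so only
        # one thread at a time can observe and tear down a failed socket.
        with self._lock:
            if self._sock is None:
                self._sock = self._sock_factory()
            try:
                self._sock.sendto(payload, addr)
                return True
            except OSError:
                self._sock.close()
                self._sock = None
                self.close_count += 1
                return False


def run_writers(sender, threads=5, sends=100):
    """Start several writer threads that all send through the same sender."""
    def worker():
        for _ in range(sends):
            sender.send(b"metric", ("192.0.2.1", 25826))

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()


sender = LockedSender(FakeSocket)
run_writers(sender)
print(sender.close_count, FakeSocket.double_closes)
```

Every send in this sketch fails, so each of the 500 attempts opens and closes exactly one socket, and no socket is closed twice.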

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded