Bug #1872979 “collectd core dump generated after lock/unlock con...” : Bugs : StarlingX

Anujeyan Manokeran (anujeyan) wrote on 2020-04-15:

#1

core dump (5.8 MiB, application/octet-stream)

Anujeyan Manokeran (anujeyan) wrote on 2020-04-15:

#2

collect logs (72.4 MiB, application/x-tar)

Yang Liu (yliu12) on 2020-04-15

description:	updated
summary:
- controller-0 core dump during sysinv test execution
+ collectd core dump generated after lock/unlock controller-0

Anujeyan Manokeran (anujeyan) on 2020-04-15

tags:

added: stx.retestneeded

Ghada Khalil (gkhalil) on 2020-04-15

Changed in starlingx:
assignee:	nobody → Eric MacDonald (rocksolidmtce)

Eric MacDonald (rocksolidmtce) wrote on 2020-04-15:

#3

collectd is not built with symbols, nor is the debuginfo RPM with symbols built by the build system, making it difficult to debug the issue from the coredump.

controller-0:~$ sudo gdb /usr/sbin/collectd ./controller-0_2020-04-12_02-47-56_core.collectd.0.030bf5affac94ebdb55618230c994ca9.2520.1586659659000000
Password:
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/collectd...Reading symbols from /usr/sbin/collectd...(no debugging symbols found)...done.
(no debugging symbols found)...done.
[New LWP 3185]
[New LWP 3192]
[New LWP 3186]
[New LWP 3188]
[New LWP 3194]
[New LWP 3189]
[New LWP 3187]
[New LWP 3191]
[New LWP 3190]
[New LWP 3193]
[New LWP 2520]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/collectd'.
Program terminated with signal 6, Aborted.
#0  0x00007f53ceae9207 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install collectd-5.8.1-4.el7.x86_64

daemon.log contains some traceback, but there is nothing in the traceback that pinpoints the plugin, or the place in the plugin, where the traceback occurred.

33373 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info ======= Backtrace: =========
33374 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /lib64/libc.so.6(+0x81489)[0x7f53ceb34489]
33375 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/lib64/collectd/network.so(+0x3445)[0x7f53ce6a5445]
33376 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/lib64/collectd/network.so(+0x3585)[0x7f53ce6a5585]
33377 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/lib64/collectd/network.so(+0x5afe)[0x7f53ce6a7afe]
33378 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/lib64/collectd/network.so(+0x5fb7)[0x7f53ce6a7fb7]
33379 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/lib64/collectd/network.so(+0x6aed)[0x7f53ce6a8aed]
33380 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/sbin/collectd(plugin_write+0xf4)[0x5622b87dce14]
33381 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/sbin/collectd(+0x9b8d)[0x5622b87d8b8d]
33382 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/sbin/collectd(+0xd005)[0x5622b87dc005]
33383 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /usr/sbin/collectd(+0xf4d7)[0x5622b87de4d7]
33384 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /lib64/libpthread.so.0(+0x7dd5)[0x7f53cf290dd5]
33385 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info /lib64/libc.so.6(clone+0x6d)[0x7f53cebb0ead]
33386 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info ======= Memory map: ========
33387 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info 5622b87cf000-5622b8803000 r-xp 00000000 08:03 1058906                    /usr/sbin/collectd
33388 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info 5622b8a02000-5622b8a03000 r--p 00033000 08:03 1058906                    /usr/sbin/collectd
33389 2020-04-12T02:47:39.161 controller-0 collectd[2520]: info 5622b8a03000-5622b8a04000 rw-p 00034000 08:03 1058906                    /usr/sbin/collectd

[snip other lib dependencies]

33848 2020-04-12T02:47:39.162 controller-0 collectd[2520]: info 7f53cf9c5000-7f53cf9c8000 rw-p 00000000 00:00 0
33849 2020-04-12T02:47:39.162 controller-0 collectd[2520]: info 7f53cf9c8000-7f53cf9c9000 r--p 00021000 08:03 1048991                    /usr/lib64/ld-2.17.so
33850 2020-04-12T02:47:39.162 controller-0 collectd[2520]: info 7f53cf9c9000-7f53cf9ca000 rw-p 00022000 08:03 1048991                    /usr/lib64/ld-2.17.so
33851 2020-04-12T02:47:39.162 controller-0 collectd[2520]: info 7f53cf9ca000-7f53cf9cb000 rw-p 00000000 00:00 0
33852 2020-04-12T02:47:39.162 controller-0 collectd[2520]: info 7ffcdcac4000-7ffcdcae5000 rw-p 00000000 00:00 0                          [stack]
33853 2020-04-12T02:47:39.163 controller-0 collectd[2520]: info 7ffcdcb9f000-7ffcdcba1000 r-xp 00000000 00:00 0                          [vdso]
33854 2020-04-12T02:47:39.163 controller-0 collectd[2520]: info ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
33855 2020-04-12T02:47:56.863 controller-0 collectd[2520]: info =====
33856 2020-04-12T02:47:56.863 controller-0 systemd[1]: notice collectd.service: main process exited, code=killed, status=6/ABRT
33857 2020-04-12T02:47:56.874 controller-0 systemd[1]: notice Unit collectd.service entered failed state.
33858 2020-04-12T02:47:56.874 controller-0 systemd[1]: warning collectd.service failed.
33859 2020-04-12T02:48:03.701 controller-0 systemd[1]: info Reloading.
33860 2020-04-12T02:48:03.764 controller-0 systemd[1]: info Reloading.
33861 2020-04-12T02:48:11.450 controller-0 systemd[1]: info Reloading System Logger Daemon.
33862 2020-04-12T02:48:11.466 controller-0 systemd[1]: info Reloaded System Logger Daemon.
33863 2020-04-12T02:48:20.322 controller-0 systemd[1]: info Got automount request for /proc/sys/fs/binfmt_misc, triggered by 74663 (sysctl)
33864 2020-04-12T02:48:20.332 controller-0 systemd[1]: info Mounting Arbitrary Executable File Formats File System...
33865 2020-04-12T02:48:20.341 controller-0 systemd[1]: info Mounted Arbitrary Executable File Formats File System.
33866 2020-04-12T02:48:22.939 controller-0 systemd[1]: info Reloading.
33867 2020-04-12T02:48:23.010 controller-0 systemd[1]: warning Cannot add dependency job for unit dev-hugepages.mount, ignoring: Unit is masked.
33868 2020-04-12T02:48:23.011 controller-0 systemd[1]: info Stopping Synchronize system clock or PTP hardware clock (PHC)...
33869 2020-04-12T02:48:23.051 controller-0 systemd[1]: info Stopped Synchronize system clock or PTP hardware clock (PHC).
33870 2020-04-12T02:48:23.066 controller-0 systemd[1]: info Starting Synchronize system clock or PTP hardware clock (PHC)...
33871 2020-04-12T02:48:23.074 controller-0 systemd[1]: info Started Synchronize system clock or PTP hardware clock (PHC).
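
Without symbols, the module offsets in the backtrace are the main usable data. As a rough sketch (not part of the original analysis), the frames could be pulled out of daemon.log like this, so that the offsets can later be handed to a symbolizer such as addr2line once a debuginfo build is available. The function names here are illustrative only.

```python
import re
from collections import Counter

# Matches glibc-style backtrace frames such as
#   /usr/lib64/collectd/network.so(+0x3445)[0x7f53ce6a5445]
#   /usr/sbin/collectd(plugin_write+0xf4)[0x5622b87dce14]
FRAME_RE = re.compile(
    r"(?P<module>/\S+?)\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"
)


def parse_frames(log_lines):
    """Return (module, symbol or None, offset, address) for each frame found."""
    frames = []
    for line in log_lines:
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a backtrace frame (memory map, systemd logs, ...)
        frames.append((
            match.group("module"),
            match.group("symbol") or None,  # empty when no symbol is exported
            int(match.group("offset"), 16),
            int(match.group("address"), 16),
        ))
    return frames


def frames_per_module(frames):
    """Count frames per binary or shared object."""
    return Counter(module for module, _symbol, _offset, _address in frames)
```

Run against the backtrace above, the count makes it easy to see that most frames sit in network.so, below plugin_write in the collectd binary.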

Eric MacDonald (rocksolidmtce) wrote on 2020-04-15:

#4

In daemon.log, the collectd network plugin starts reporting sendto errors in the minutes leading up to the traceback.

The first indication of network failures is here; blocks of these logs repeat up to the coredump roughly 2 minutes later.

32758 2020-04-12T02:45:13.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.
32759 2020-04-12T02:45:13.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.
32760 2020-04-12T02:45:13.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.
32761 2020-04-12T02:45:13.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.
32762 2020-04-12T02:45:13.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.
32763 2020-04-12T02:45:13.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.

The failure logs then continue right up to the coredump:

2020-04-12T02:47:39.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.
2020-04-12T02:47:39.000 controller-0 collectd[2520]: err network plugin: sendto failed: Network is unreachable. Closing sending socket.

2020-04-12T02:47:39.161 controller-0 collectd[2520]: info *** Error in `/usr/sbin/collectd': double free or corruption (!prev): 0x00007f53ac00fde0 ***
2020-04-12T02:47:39.161 controller-0 collectd[2520]: info ======= Backtrace: =========

The 'double free or corruption' log above looks particularly bad; it is immediately followed by the backtrace and the coredump.
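
As a rough illustration of how this pattern could be checked in other collected logs, the sketch below counts the 'sendto failed' errors that precede the first 'double free or corruption' line and measures the time between the first failure and the abort. It assumes the daemon.log timestamp format shown above; the function names are made up for this example.

```python
import re
from datetime import datetime

# daemon.log timestamps look like 2020-04-12T02:47:39.161
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}")


def timestamp_of(line):
    """Return the first timestamp in a log line, or None if there is none."""
    match = TIMESTAMP_RE.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M:%S.%f")


def summarize_failures(log_lines):
    """Return (sendto failure count, seconds from first failure to the abort).

    The seconds value is None if no 'double free' line is found, or if no
    sendto failure was logged before it.
    """
    first_failure = None
    failure_count = 0
    for line in log_lines:
        stamp = timestamp_of(line)
        if stamp is None:
            continue
        if "sendto failed" in line:
            failure_count += 1
            if first_failure is None:
                first_failure = stamp
        elif "double free or corruption" in line:
            if first_failure is None:
                return failure_count, None
            return failure_count, (stamp - first_failure).total_seconds()
    return failure_count, None
```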

Eric MacDonald (rocksolidmtce) wrote on 2020-04-15:

#5

The issue followed a PTP reconfiguration and a host lock/unlock, and this lab has a very specific hardware PTP configuration.

  > service-parameter-add ptp global domainNumber=24
  > service-parameter-add ptp global delay_mechanism=e2e
  > service-parameter-apply ptp
  > host-if-modify controller-0 sriov0 --ptp-role slave
  > host-unlock controller-0

Need to try and reproduce following the current activity going on in this lab.

Eric MacDonald (rocksolidmtce) wrote on 2020-04-16:

#6

Question to PV: is this the only case where this issue was seen, and if not, has it been seen on any other system/lab?

Anujeyan Manokeran (anujeyan) wrote on 2020-04-17:

#7

This was not seen in any other regular lab. This automation test had not been executed on simplex before, since it was only recently automated.

Ghada Khalil (gkhalil) wrote on 2020-04-22:

#8

@Jeyan, what is the system impact of this issue? Does collectd coredump and then recover?

Changed in starlingx:
importance:	Undecided → Medium
status:	New → Triaged
tags:	added: stx. stx.metal
tags:	added: stx.4.0
tags:	removed: stx.

Ghada Khalil (gkhalil) wrote on 2020-04-22:

#9

Marking as stx.4.0 / medium priority for now until further attempts to reproduce and investigate

Peng Peng (ppeng) wrote on 2020-05-08:

#10

controller-0_2020-05-08_07-18-53_core.collectd.0.be8e03d4a8004b7088e687026a86cf54.2165.1588922304000000.xz (5.8 MiB, application/octet-stream)

Issue was reproduced on
Lab: WCP_112
Load: 2020-05-07_21-11-18

log:
[2020-05-08 07:13:30,469] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

'-rw-r----- 1 root root 6074060 2020-05-08_07-18-53 core.collectd.0.be8e03d4a8004b7088e687026a86cf54.2165.1588922304000000.xz'

http://128.224.150.21/auto_logs/wcp_112/202005080309/

log added:
https://files.starlingx.kube.cengn.ca/launchpad/1872979

Eric MacDonald (rocksolidmtce) wrote on 2020-05-26:

#11

Reproduced issue over lock and unlock of already ptp provisioned AIO SX (wp11)

controller-0:~$ ls /var/lib/systemd/coredump/
core.collectd.0.b9fb04f48a1b410bad75f9c2e24627c2.2537.1590503151000000.xz

Eric MacDonald (rocksolidmtce) wrote on 2020-05-27:

#12

Issue is not dependent on PTP.
Was able to reproduce a second time.
This time on the 5th lock/unlock try with NTP rather than PTP enabled.

The common thread is collectd running early and experiencing network plugin 'Network Unreachable' failures.

Ideally, collectd should only start after configuration is complete. However, since there is a collectd configuration manifest, the collectd service file cannot have an After=config dependency directive, as that would cause startup to hang; collectd needs config but config needs collectd.
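
The circular ordering described here can be illustrated with a toy dependency graph. This is only a sketch: the unit names "collectd" and "config" are stand-in labels taken from the comment above, not real unit file names, and the helper is hypothetical.

```python
def find_cycle(after):
    """Return one ordering cycle as a list of unit names, or None.

    `after` maps each unit to the units it must start after.
    """
    visiting = []  # units on the current depth-first path
    done = set()   # units already proven to be cycle free

    def visit(unit):
        if unit in done:
            return None
        if unit in visiting:
            # Reached a unit that is already on the path: that is the cycle.
            return visiting[visiting.index(unit):] + [unit]
        visiting.append(unit)
        for dependency in after.get(unit, ()):
            cycle = visit(dependency)
            if cycle is not None:
                return cycle
        visiting.pop()
        done.add(unit)
        return None

    for unit in after:
        cycle = visit(unit)
        if cycle is not None:
            return cycle
    return None
```

With `{"collectd": ["config"], "config": ["collectd"]}` the helper reports the loop collectd, config, collectd, which is the ordering that cannot be satisfied at startup.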

Peng Peng (ppeng) wrote on 2020-06-04:

#13

controller-0_2020-06-04_04-17-40_core.collectd.0.b676a49fa399420eb236f2027df886b8.2539.1591244214000000.xz (5.8 MiB, application/octet-stream)

The issue was reproduced on
2020-06-03_20-00-00
wcp_11

https://files.starlingx.kube.cengn.ca/launchpad/1872979

Difu Hu (difuhu) wrote on 2020-06-15:

#14

controller-0_2020-06-14_09-21-08_core.collectd.0.aba588a90cfc4d4eb03d3a4501e925ba.3443.1592126446000000.xz (5.9 MiB, application/octet-stream)

Similar coredump collected on DC-1 subcloud2 after reboot.

OpenStack Infra (hudson-openstack) wrote on 2020-06-18: Fix proposed to monitoring (master)

#15

Fix proposed to branch: master
Review: https://review.opendev.org/736817

Changed in starlingx:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2020-06-30: Fix merged to monitoring (master)

#16

Reviewed: https://review.opendev.org/736817
Committed: https://git.openstack.org/cgit/starlingx/monitoring/commit/?id=63c8d1e55aecfb8aed4f98e29ccf0dc6ccd18cf3
Submitter: Zuul
Branch: master

commit 63c8d1e55aecfb8aed4f98e29ccf0dc6ccd18cf3
Author: Eric MacDonald <email address hidden>
Date: Thu Jun 18 15:44:32 2020 -0400

Add consistent init and config complete checks to collectd plugins

    Some of the collectd plugins are not waiting for configuration
    complete before starting to monitor or communicate with external
    services such as fm. This leads to the collectd networking plugin
    being triggered to run before or while the host is being configured
    which has been seen to lead to collectd segfaults/coredumps within
    the collectd's internal networking plugin.

    To solve this issue, reduce startup thrash and a slew of plugin
    startup error logs, this update adds consistent initialization
    and configuration complete checks to all of the starlingX
    plugins so monitoring and external service access is not
    performed until the host configuration is complete.

    Test Plan:

    PASS: Verify no plugin sampling till after config is complete
    PASS: Verify alarm assert and clear cycle for all plugins
    PASS: Install AIO SX system install
    PASS: Install AIO DX system install
    PEND: Verify Standard system install
    PASS: Verify logging

    Change-Id: I90a5d1c8c3be77269a571738c9499b2e908e1fc5
    Closes-Bug: 1872979
    Signed-off-by: Eric MacDonald <email address hidden>
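
For readers unfamiliar with this kind of check, the following is a minimal sketch of a "wait for configuration complete" gate in front of a plugin's read callback. It is not the code from the commit: the class, the function names and the flag file path are assumptions made for illustration, and the real plugins may detect configuration completion differently.

```python
import os


class ConfigGate:
    """Latches to ready once a config-complete flag file appears."""

    def __init__(self, flag_path, exists=os.path.exists):
        self.flag_path = flag_path  # hypothetical path, supplied by the caller
        self._exists = exists       # injectable so the check can be faked in tests
        self.ready = False

    def config_complete(self):
        # Once the flag has been seen, skip further filesystem checks.
        if not self.ready and self._exists(self.flag_path):
            self.ready = True
        return self.ready


def make_read_callback(gate, sample):
    """Wrap a sampling function so it is skipped until config completes."""

    def read_callback():
        if not gate.config_complete():
            return False  # too early: no sampling, no external service access
        sample()
        return True

    return read_callback
```

Sharing one gate object across plugins is one way to keep the check consistent, so that every plugin starts sampling at the same point.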

Changed in starlingx:
status:	In Progress → Fix Released

Peng Peng (ppeng) wrote on 2020-07-02:

#17

Issue was reproduced on
2020-07-01_22-00-00
WCP_112, R430_3-4, WP_8-12

Ghada Khalil (gkhalil) wrote on 2020-07-02:

#18

Re-opening as the issue is still seen with a build containing the above fix

Changed in starlingx:
status:	Fix Released → Confirmed

Ghada Khalil (gkhalil) wrote on 2020-07-07:

#19

Moving to stx.5.0 as there is no functional impact

tags:

added: stx.5.0
removed: stx.4.0

Ghada Khalil (gkhalil) wrote on 2020-11-06:

#20

There is a new occurrence reported in https://bugs.launchpad.net/starlingx/+bug/1901039 which I will mark as duplicate

Eric MacDonald (rocksolidmtce) wrote on 2021-02-05:

#21

The following additional updates have been merged towards the fix of this issue.

update : Add node ready check to collectd plugins
review : https://review.opendev.org/c/starlingx/monitoring/+/772349
commit : https://opendev.org/starlingx/monitoring/commit/ea4b515f91f38523a22e877ebba9d552962153b2

update : Modify collectd manifest to not start collectd
review : https://review.opendev.org/c/starlingx/stx-puppet/+/772354
commit : https://opendev.org/starlingx/stx-puppet/commit/a34cd954e92f37994d40a31ccc2777249598622d

update : Make mtcClient stop collectd before shutdown
review : https://review.opendev.org/c/starlingx/metal/+/772356
commit : https://opendev.org/starlingx/metal/commit/2d5c5b04edf0d84f78a87e971cf1646e6efda00f

Eric MacDonald (rocksolidmtce) wrote on 2021-02-05:

#22

Unfortunately, a new occurrence has been observed and initial investigation suggests that there is yet another failure mode that none of the 3 fixes above could address.

Investigation of this new failure mode, which was not observed during the exhaustive testing/soaking of the above fixes, is under way.

Peng Peng (ppeng) wrote on 2021-02-19:

#23

The issue was reproduced on
WRCP_20.06_POSTGA_Build 2021-02-18_20-00-07
R730_1

Ghada Khalil (gkhalil) wrote on 2021-03-17:

#24

Lowering the priority as there is no system impact

Changed in starlingx:
importance:	Medium → Low
tags:	removed: stx.5.0

OpenStack Infra (hudson-openstack) wrote on 2021-05-18: Fix proposed to stx-puppet (f/centos8)

#25

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

OpenStack Infra (hudson-openstack) wrote on 2021-05-18: Change abandoned on stx-puppet (f/centos8)

#26

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

OpenStack Infra (hudson-openstack) wrote on 2021-05-18: Fix proposed to stx-puppet (f/centos8)

#27

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

OpenStack Infra (hudson-openstack) wrote on 2021-05-18: Change abandoned on stx-puppet (f/centos8)

#28

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

OpenStack Infra (hudson-openstack) wrote on 2021-05-18: Fix proposed to stx-puppet (f/centos8)

#29

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

OpenStack Infra (hudson-openstack) wrote on 2021-05-18: Change abandoned on stx-puppet (f/centos8)

#30

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

OpenStack Infra (hudson-openstack) wrote on 2021-05-18: Fix proposed to stx-puppet (f/centos8)

#31

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792029

OpenStack Infra (hudson-openstack) wrote on 2021-05-19: Fix proposed to monitoring (f/centos8)

#32

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/monitoring/+/792244

OpenStack Infra (hudson-openstack) wrote on 2021-05-19: Fix proposed to metal (f/centos8)

#33

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/metal/+/792250

OpenStack Infra (hudson-openstack) wrote on 2021-05-27: Fix merged to metal (f/centos8)

#34

Reviewed:  https://review.opendev.org/c/starlingx/metal/+/792250
Committed: https://opendev.org/starlingx/metal/commit/6c2905e665ceeebfa7717c9cbccc1c277d10966b
Submitter: "Zuul (22348)"
Branch:    f/centos8

commit 5942a56ec6f0b265ca6d1c8c800fe84c4a22860f
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Thu May 13 15:57:43 2021 +0000

Revert "Align partitions created by kickstarters"
    
    This reverts commit 0e89acc83c616741952a068a3ff07ba91440eff8.
    
    Reason for revert: Review should have been abandoned rather than merged.
    
    Change-Id: I95f1e151183f122d93b834ab2a785736e5a8ef12
    Closes-Bug: 1928341

commit c7c341b198e79bb98f443c7c07f671c6387075af
Author: Don Penney <don.penney@windriver.com>
Date:   Fri May 7 08:56:06 2021 -0400

Add /pxeboot/grubx64.efi symlink for UEFI pxeboot
    
    UEFI pxeboot with shim.efi looks for the grubx64.efi in the tftpboot
    root directory. This update creates a symlink to the
    /pxeboot/EFI/grubx64.efi file in /pxeboot.
    
    Change-Id: Iabf8ec89d0af6e6b1a62e20159ecdfa16729444e
    Partial-Bug: 1927730
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit ce7529964932a9fd1cc10ce18dbe11e89ee02223
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed May 5 19:05:55 2021 -0400

Fix enabling heartbeat of self from the peer controller
    
    This issue only occurs over an hbsAgent process restart
    where the ready event response does not include the
    heartbeat start of the peer controller.
    
    This update reverts a small code change that was
    introduced by the following update.
    
    https://review.opendev.org/c/starlingx/metal/+/788495
    
    Remove the my_hostname gate introduced at line 1267 of
    mtcCtrlMsg.cpp because it prevents enabling heartbeat
    of self by the peer controller.
    
    Change-Id: Id72c35f25e2a5231a8a8363a35a81e042f00085e
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 48978d804d6f22130d0bd8bd17f361441024bc6c
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed Apr 28 09:39:19 2021 -0400

Improved maintenance handling of spontaneous active controller reboot
    
    Performing a forced reboot of the active controller sometimes
    results in a second reboot of that controller. The cause of the
    second reboot was due to its reported uptime in the first mtcAlive
    message, following the reboot, as greater than 10 minutes.
    
    Maintenance has a long standing graceful recovery threshold of
    10 minutes. Meaning that if a host loses heartbeat and enters
    Graceful Recovery, if the uptime value extracted from the first
    mtcAlive message following the recovery of that host exceeds 10
    minutes, then maintenance interprets that the host did not reboot.
    If a host goes absent for longer than this threshold then for
    reasons not limited to security, maintenance declares the host
    as 'failed' and force re-enables it through a reboot.
    
    With the introduction of containers and addition of new features
    over the last few releases, boot times on some servers are
    approaching the 10 minute threshold and in this case exceeded
    the threshold.
    
    The primary fix in this update is to increase this long standing
    threshold to 15 minutes to account for evolution of the product.
    
    During the debug of this issue a few other related undesirable
    behaviors related to Graceful Recovery were observed with the
    following additional changes implemented.
    
     - Remove hbsAgent process restart in ha service management
       failover failure recovery handling. This change is in the
       ha git with a loose dependency placed on this update.
       Reason: https://review.opendev.org/c/starlingx/ha/+/788299
    
     - Prevent the hbsAgent from sending heartbeat clear events
       to maintenance in response to a heartbeat stop command.
       Reason: Maintenance receiving these clear events while in
               Graceful Recovery causes it to pop out of graceful
               recovery only to re-enter as a retry and therefore
               needlessly consumes one (of a max of 5) retry count.
    
     - Prevent successful Graceful Recovery until all heartbeat
       monitored networks recover.
       Reason: If heartbeat of one network, say cluster recovers but
               another (management) does not then it's possible the
               max Graceful Recovery Retries could be reached quite
               quickly, while one network recovered but the other
               may not have, causing maintenance to fail the host and
               force a full enable with reboot.
    
     - Extend the wait for the hbsClient ready event in the graceful
       recovery handler timeout from 1 minute to worker config timeout.
       Reason: To give the worker config time to complete before force
               starting the recovery handler's heartbeat soak.
    
     - Add Graceful Recovery Wait state recovery over process restart.
       Reason: Avoid double reboot of Gracefully Recovering host over
               SM service bounce.
    
     - Add requirement for a valid out-of-band mtce flags value before
       declaring configuration error in the subfunction enable handler.
       Reason: rebooting the active controller can sometimes result in
               a falsely reported configuration error due to the
               subfunction enable handler interpreting a zero value as
               a configuration error.
    
     - Add uptime to all Graceful Recovery 'Connectivity Recovered' logs.
       Reason: To assist log analysis and issue debug
    
    Test Plan:
    
    PASS: Verify handling active controller reboot
                 cases: AIO DC, AIO DX, Standard, and Storage
    PASS: Verify Graceful Recovery Wait behavior
                 cases: with and without timeout, with and without bmc
                 cases: uptime > 15 mins and 10 < uptime < 15 mins
    PASS: Verify Graceful Recovery continuation over mtcAgent restart
                 cases: peer controller, compute, MNFA 4 computes
    PASS: Verify AIO DX and DC active controller reboot to standby
                 takeover that is up for less than 15 minutes.
    
    Regression:
    
    PASS: Verify MNFA feature ; 4 computes in 8 node Storage system
    PASS: Verify cluster network only heartbeat loss handling
                 cases: worker and standby controller in all systems.
    PASS: Verify Dead Office Recovery (DOR)
                 cases: AIO DC, AIO DX, Standard, Storage
    PASS: Verify system installations
                 cases: AIO SX/DC/DX and 8 node Storage system
    PASS: Verify heartbeat and graceful recovery of both 'standby
                 controller' and worker nodes in AIO Plus.
    
    PASS: Verify logging and no coredumps over all of testing
    PASS: Verify no missing or stuck alarms over all of testing
    
    Change-Id: I3d16d8627b7e838faf931a3c2039a6babf2a79ef
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
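
For illustration only, the "all heartbeat monitored networks must recover" rule described in the commit above might be sketched as follows. This is not the mtce source; the function name and the network names are hypothetical.

```python
def graceful_recovery_may_complete(network_recovered):
    """Return True only when every monitored network reports recovery.

    network_recovered maps a (hypothetical) network name to a bool that
    says whether heartbeat on that network has recovered.
    """
    # An empty map means nothing is known yet, so do not declare success.
    return bool(network_recovered) and all(network_recovered.values())


if __name__ == "__main__":
    print(graceful_recovery_may_complete({"mgmnt": True, "clstr": False}))
```

With this shape, a recovered cluster network alone is not enough to end Graceful Recovery while the management network is still failing.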

commit 7539d36c3f01a338acfa449204c6034dc43f45df
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed Apr 21 10:12:30 2021 -0400

Prevent mtcClient from sending to uninitialized socket in AIO SX
    
    The mtcClient will perform a socket reinit if it detects a socket
    failure. The mtcClient also avoids setting up its controller-1
    cluster network socket for the AIO SX system type ; because there
    is no controller-1 provisioned.
    
    Most AIO SX systems have the management/cluster networks set to
    the 'loopback' interface. However, when an AIO SX system is setup
    with its management and cluster networks on physical interfaces,
    with or without vlan, the mtcAlive send message utility will try
    to send to the uninitialized controller-1 cluster socket. This
    leads to a socket error that triggers a socket reinitialization
    loop which causes log flooding.
    
    This update adds a check to the mtcAlive send utility to avoid
    sending mtcAlive to controller-1 for AIO SX system type where
    there is no controller-1 provisioned; no send, no error, no flood.
    
    Since this update needed to add a system type check, this update
    also implemented a system type definition rename from CPE to AIO.
    Other related definitions and comments were also changed to make
    the code base more understandable and maintainable
    
    Test Plan:
    
    PASS: Verify AIO SX with mgmnt/clstr on physical (failure mode)
    PASS: Verify AIO SX Install with mgmnt/clstr on 'lo'
    PASS: Verify AIO SX Lock msg and ack over mgmnt and clstr
    PASS: Verify AIO SX locked-disabled-online state
    PASS: Verify mtcClient clstr socket error detect/auto-recovery (fit)
    PASS: Verify mtcClient mgmnt socket error detect/auto-recovery (fit)
    
    Regression:
    
    PASS: Verify AIO SX Lock and Unlock (lazy reboot)
    PASS: Verify AIO DX and DC install with pv regression and sanity
    PASS: Verify Standard system install with pv regression and sanity
    
    Change-Id: I658d33a677febda6c0e3fcb1d7c18e5b76cb3762
    Closes-Bug: 1897334
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 3c1e9d960198c044e382eb7d47b3bb70cbf6ba70
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Tue Apr 6 10:29:09 2021 -0400

Modify mtce daemon log rotation config files
    
    This update makes the following setting changes to the
    maintenance log rotation configuration files:
    
     - add 'create' with permissions to each tuple
     - add 'delaycompress'
     - group together log files with similar settings
     - move global settings to local settings
     - remove 'copytruncate' global setting
     - remove the 'nodateext' global and local setting
    
    Test Plan:
    
    PASS: Verify log rotation for all mtc log files
    PASS: Verify no log loss over rotation
    PASS: Verify log rotation file naming convention
    PASS: Verify delaycompress on all mtce log files
    PASS: Verify log permissions after rotate are 0640
    
    Regression:
    
    PASS: Verify AIO system install
    PASS: Verify Standard system install
    PASS: Verify full and dated collect
    
    Change-Id: I623030fa2c1ce4e8085e654ae3fb782c7e520924
    Partial-Bug: 1918979
    Depends-On: https://review.opendev.org/c/starlingx/config-files/+/784943
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 99a871c7d9dd04b3bd2ce149dd43bf058d805f03
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Mon Jun 15 13:45:23 2020 -0400

Restrict isolcpu_plugin to nodes with worker function
    
    The isolcpu_plugin process is intended to run on worker nodes only.
    This update excludes its rpm parcel from standard controller and
    storage nodes.
    
    Depends-On: https://review.opendev.org/c/starlingx/integ/+/783730
    Story: 2008760
    Task: 42189
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
    Change-Id: Iec61638b49692622e128d8388bc3aa78c922ac3a

commit 031818e55bc255b59e486ebf6faadf4b784c93fe
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Fri Mar 26 13:05:51 2021 -0400

Add in-service test to clear stale config failure alarm
    
    A configuration failure alarm can get stuck asserted if
    that node experiences an uncontrolled reboot that recovers
    without a configuration failure.
    
    This update adds an in-service test that audits host health
    while there is a configuration failure alarm raised and
    clears that alarm if the failure condition goes away. This
    could be a result of an in-service manifest that runs and
    corrects the configuration or if the node reboots and comes
    back up in a healthy (properly configured) state.
    
    Fixed bug that was clearing config alarm severity state
    when a heartbeat clear event is received.
    
    This update also goes a step further and introduces an
    alarms state audit that detects and corrects maintenance
    alarm state mismatches.
    
    Test Plan:
    
    PASS: Verify the add handler loads config alarm state
    PASS: Verify in-service test clears stale config alarm
    PASS: Verify in-service test acts on new config failure
          ... degrade - active controller
          ... fail    - other hosts
    PASS: Verify audit fixes mtce alarm state mismatches
    PASS: Verify audit handles fm not running case
    PASS: Verify audit handling behavior with valid alarm cases
    PASS: Verify locked alarm management over process restart
    PASS: Verify audit only logs active alarms list changes
    PASS: Verify audit runs for both locked/unlocked nodes
    PASS: Verify update as a patch
    
    Regression:
    
    PASS: Verify enable sequence config failure handling
    PASS: ... active controller     - recoverable degrade
    PASS: ... other nodes           - threshold fail
    PASS: ... auto recovery disable - config failure
    PASS: Verify mtcAgent process logging
    PASS: Verify heartbeat handling and alarming
    PASS: Verify Standard system install
    PASS: Verify AIO system install
    
    Change-Id: If9957229810435e9faeb08374f2b5fbcb5b0f826
    Closes-Bug: 1918195
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 5c83453fdf8775e5d776a02a2b5c06810d84cb55
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Tue Mar 16 17:03:49 2021 -0400

Fix Graceful Recovery handling while in Graceful Recovery handling
    
    The current Graceful Recovery handler is not properly handling
    back-to-back Multi Node Failure Avoidance (MNFA) events.
    
    There are two phases to MNFA
    
     phase 1: waiting for number of failed nodes to fall below
              mnfa_threshold as each affected node's heartbeat
              is recovered.
     phase 2: then a Graceful Recovery Wait period which is an
              11 second heartbeat soak to verify that a stable
              heartbeat is regained before declaring the MNFA
              event complete.
    
    The Graceful Recovery Wait status of one or more affected nodes
    has been seen to be left uncleared (stuck) on one or more of the
    affected nodes if phase 2 of MNFA is interrupted by another MNFA
    event ; aka MNFA Nesting.
    
    Although this stuck status is not service affecting it does leave
    one or more nodes' host.task field, as observed under host-show,
    with "Graceful Recovery Wait" rather than empty.
    
    This update makes Multi Node Failure Avoidance (MNFA) handling
    changes to ensure that, upon MNFA exit, the recovery handler
    is properly restarted if MNFA Nesting occurs.
    
    Two additional Graceful Recovery phase issues were identified
    and fixed by this update.
    
     1. Cut Graceful recovery handling in half
    
        - Found and removed a redundant 11 second heartbeat soak
          at the very end of the recovery handler.
        - This cuts the graceful recovery handling time down from
          22 to 11 seconds thereby cutting potential for nesting
          in half.
    
     2. Increased supported Graceful Recovery nesting from 3 to 5
    
        - Found that some links bounce more than others so a nesting
          count of 3 can lead to an occasional single node failure.
        - This adds a bit more resiliency to MNFA handling of cases
          that exhibit more link messaging bounce.
    
    Test Plan: Verified 60+ MNFA occurrences across 4 different
               system types including AIO plus, Standard and Storage
    
    PASS: Verify Single Node Graceful Recovery Handling
    PASS: Verify Multi Node Graceful Recovery Handling
    PASS: Verify Single Node Graceful Recovery Nesting Handling
    PASS: Verify Multi Node Graceful Recovery Nesting Handling
    PASS: Verify MNFA of up to 5 nests can be gracefully recovered
    PASS: Verify MNFA of 6 nests leads to full enable of affected nodes
    PASS: Verify update as a patch
    PASS: Verify mtcAgent logging
    
    Regression:
    
    PASS: Verify standard system install
    PASS: Verify product verification maintenance regression (4 runs)
    PASS: Verify MNFA threshold increase and below threshold behavior
    PASS: Verify MNFA with reduced timeout behavior for
          ... nested case that does not timeout
          ... case that does not timeout
          ... case that does timeout
    
    Closes Bug: 1892877
    Change-Id: I6b7d4478b5cae9521583af78e1370dadacd9536e
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
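
As a rough sketch of the nesting limit described in the commit above (up to 5 nests recover gracefully, a 6th leads to a full enable), the decision could look like this. The constant and function names are hypothetical and are not taken from the mtce code.

```python
MAX_GRACEFUL_RECOVERY_NESTS = 5  # value from the commit text; name is hypothetical


def recovery_action(nest_count):
    """Pick the recovery action for a host given how many times
    Graceful Recovery has been re-entered (nested)."""
    if nest_count < 0:
        raise ValueError("nest_count must be non-negative")
    if nest_count <= MAX_GRACEFUL_RECOVERY_NESTS:
        return "graceful_recovery"
    return "full_enable"


if __name__ == "__main__":
    for n in (3, 5, 6):
        print(n, recovery_action(n))
```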

commit 497a6f93f422bdaab0a5779d5345ba814d1ab3bc
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Tue Mar 16 13:45:18 2021 +0200

Fix reinstall of controller nodes
    
    At shutdown, systemd will try to remount everything read-only
    before attempting to unmount it. In the wipedisk script we
    are deleting the partitions without unmounting
    their corresponding filesystems. This leads to errors because
    systemd will try to remount filesystems
    whose partitions were deleted.
    
    To fix this we have to unmount the filesystems that are linked to the
    removed partitions.
    
    Closes-Bug: 1919153
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
    Change-Id: I49a3c06ae6bce1324dd06f4fc63fb3e5cd4d28c1
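
To illustrate the idea in the commit above (unmount filesystems before their partitions are deleted), here is a minimal sketch that works on /proc/mounts style text. It is an assumption about one way to find the mount points, not the wipedisk script itself; the function name is hypothetical.

```python
def mount_points_for(partitions, proc_mounts_text):
    """Return mount points backed by the given partitions, deepest first.

    proc_mounts_text uses the /proc/mounts layout:
    "<device> <mount point> <fstype> <options> <dump> <pass>".
    """
    targets = set(partitions)
    points = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in targets:
            points.append(fields[1])
    # Nested mount points must be unmounted before their parents.
    return sorted(points, key=lambda p: len(p.rstrip("/").split("/")), reverse=True)


if __name__ == "__main__":
    sample = (
        "/dev/sda5 /var/log ext4 rw 0 0\n"
        "/dev/sda6 /opt/platform ext4 rw 0 0\n"
        "proc /proc proc rw 0 0\n"
    )
    print(mount_points_for(["/dev/sda5", "/dev/sda6"], sample))
```

Each returned path would then be unmounted before the partition is removed, so systemd has nothing stale to remount read-only at shutdown.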

commit 4f5bf78f55ec8b0983262ee351183b1edd8443ad
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Fri Mar 12 17:10:00 2021 -0500

Improve mtcAgent interrupted thread cleanup
    
    A BMC command send will be rejected if its thread
    is not in the IDLE state going into the call.
    
    This issue is seen to occur over a reprovisioning action
    while the bmc access alarmable condition exists.
    
    Maintenance will do retries. So the only visible side effect
    of this issue is a failure to provision to 'redfish' over a
    provisioning switch to 'dynamic' (learn mode). Instead
    ipmi is selected.
    
    The non-return to idle can occur when the bmc handler FSM
    is interrupted by a reprovisioning request while a bmc
    command is in flight.
    
    This update enhances the thread management module by
    introducing a thread consumption utility that is called
    by the bmc command send utility. If the send finds that
    its thread is not in the IDLE state it will either kill
    the thread if it is running or free a completed but-not-
    consumed thread result.
    
    Note: Maintenance only supports the execution of
    a single thread per host per process at one time.
    
    Test Plan:
    
    PASS: Verify BMC provisioning change from ipmi to dynamic
          while the ipmi provisioning was failing prior to
          re-provisioning. Verify the previous error is cleaned
          up and the reprovisioning request succeeds as expected.
    
    PASS: Verify thread 'execution timeout kill' cleanup handling.
    PASS: Verify thread 'complete but not consumed' cleanup handling.
    PASS: Verify logging during regression soaks
    
    Regression:
    
    PASS: Verify bmc protocol reprovisioning script soak
    PASS: Verify sensor monitoring following BMC reprovisioning
    PASS: Verify product verification mtce regression test suite
    
    Change-Id: Ie5e9e89ed2f8db6888c0fc7de03d494c75517178
    Closes-Bug: 1864906
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
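
The thread cleanup rule described in the commit above can be sketched as a small state decision. This is only an illustration under assumed state names; the enum and function below are hypothetical and do not mirror the mtce thread management module.

```python
from enum import Enum


class ThreadState(Enum):
    IDLE = "idle"        # ready to accept a new command
    RUNNING = "running"  # a command is still in flight
    DONE = "done"        # completed, but the result was never consumed


def cleanup_before_send(state):
    """Return (cleanup_action, resulting_state) applied before a new send."""
    if state is ThreadState.RUNNING:
        return "kill_thread", ThreadState.IDLE
    if state is ThreadState.DONE:
        return "free_result", ThreadState.IDLE
    return "none", ThreadState.IDLE


if __name__ == "__main__":
    for s in ThreadState:
        print(s.name, cleanup_before_send(s))
```

Whatever state the thread was left in, the send starts from IDLE instead of being rejected.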

commit 4f7d82308f5f7c663223344873f8b392a1311d82
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Thu Mar 11 11:13:59 2021 -0500

Add NonRecoverable property to Hardware Monitor's Redfish
    
    This update adds 'NonRecoverable' sensor health property
    to the Hardware Monitor's Redfish platform management
    protocol support.
    
    Test Plan:
    
    PASS: Verify handling of Redfish NonRecoverable sensor
          ... using redfish
          ... switching between ipmi and redfish and back
    PASS: Verify sensor model relearn over change of bmc protocol
    
    Regression:
    
    PASS: Verify sensor model relearn by command
    PASS: Verify sensor suppression
    PASS: Verify sensor alarm and degrade management
          ... as sensor events come and go
          ... on sensor suppression and unsuppression
    PASS: Verify sensor monitoring regression test
    PASS: Verify update as a patch (apply/remove)
    
    Change-Id: I2770e63f4d44e269b4410f392707f3cd01e9a2cc
    Closes-Bug: 1918152
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 6cf5e848256c7612e2d5dc3c0a86ac7b76684b6e
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed Feb 24 12:36:31 2021 -0500

Add alarmed process audit to Process Monitor
    
    A failure to query process monitor alarms from
    FM during process startup can lead to a stuck
    failed process alarm.
    
    Rather than hold up the process monitor startup
    sequence due to an unresponsive fault manager,
    this update introduces an in-service alarm audit
    that looks for asserted alarms and compares that
    readout to the process monitor's runtime view.
    
    A difference in view is considered a state mismatch
    that requires corrective action. The runtime state
    of the process monitor always takes precedence over
    what is found in the FM database.
    
    A mismatch is declared and corrective action is
    taken if:
    
     - FM has a process failure alarm that pmond does not
       Corrective Action: Clear alarm in FM database
    
     - FM has a process failure alarm with a severity
       that differs from the pmond runtime state.
       Corrective Action: Update severity in FM database
    
     - FM has a process failure alarm for a process
       that pmond does not recognize.
       Corrective Action: Clear alarm in FM database
    
    This update only runs the audit on process startup
    until first successful query.
    A future update may enable the audit in-service.
    
    Test Plan:
    
    PASS: Verify all mismatch case handling
    PASS: Verify handling of valid active alarm
    PASS: Verify handling severity mismatch ; unsupported
    PASS: Verify pmond failure handling regression soak
    PASS: Verify pmond process restart regression soak
    PASS: Verify alarm handling over pmond process restart
    PASS: Verify alarmed state audit period and logging
    PASS: Verify pmond process failure alarm remains ignored by pmond
    PASS: Verify handling of persistently failed process over pmond restart
    PASS: Verify audit handling while FM is not running
          - audit retries every 50 seconds until fm query is successful
    
    COND: Verify audit handling while FM is stopped/blocked/stalled
          - alarm query blocks till fm runs again or is killed
          - this is the reason the audit is not run in-service.
    
    Change-Id: I697faa804dc7979fbb8b6f6c63811a6dda8c3118
    Closes-Bug: 1892884
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
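
The three mismatch rules listed in the commit above can be written down as a small comparison function. This is a sketch under assumed data shapes (process name mapped to severity); it is not pmond code and the names are hypothetical.

```python
def audit_actions(fm_alarms, pmond_failed, known_processes):
    """Compare FM's process failure alarms with pmond's runtime view.

    fm_alarms:       {process: severity} as read from FM
    pmond_failed:    {process: severity} pmond currently considers failed
    known_processes: set of process names pmond monitors
    Returns a list of (action, process); pmond's view always wins.
    """
    actions = []
    for process, fm_severity in sorted(fm_alarms.items()):
        if process not in known_processes or process not in pmond_failed:
            # Alarm for an unrecognized process, or one pmond does not have.
            actions.append(("clear", process))
        elif pmond_failed[process] != fm_severity:
            actions.append(("update_severity", process))
    return actions


if __name__ == "__main__":
    fm = {"a": "major", "b": "minor", "c": "critical", "d": "major"}
    pmond = {"b": "major", "d": "major"}
    print(audit_actions(fm, pmond, {"a", "b", "d"}))
```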

commit f34d51d3acf1ab45ae81e75ac620042f95d57b6f
Author: Babak Sarashki <babak.sarashki@windriver.com>
Date:   Fri Feb 26 17:50:35 2021 +0000

restrict kernel headers and devel package installation
    
    kernel change-id: Iafb3abe7 adds kernel headers and development
    packages to the default rootfs for pods needing to build drivers
    or other applications with kernel dependencies. This commit
    restricts installation of the above packages to worker and AIO.
    
    Story: 2008434
    Task: 41941
    
    Signed-off-by: Babak Sarashki <babak.sarashki@windriver.com>
    Change-Id: I5bb4e93a60a98dcd52be07c0baa6cb76517b30a8

commit 32fbc7e5aa8ad6e771598456961a760a875aa018
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Fri Feb 26 15:29:15 2021 +0200

Fix reinstall of worker nodes
    
    When the wipedisk code was updated, there were some
    changes that had to be used only on controllers
    but the code was doing the same thing on all the node types.
    
    In this review we add the proper branching of
    the code based on the node type.
    
    Closes-Bug: 1912623
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
    Change-Id: I91f68a7894da51a7d64602254a68cf7acbd4bcf2

commit 0a102143e9ee26485ef4b40b10bb8f32517ef5c2
Author: Angie Wang <angie.wang@windriver.com>
Date:   Wed Feb 24 17:15:54 2021 -0600

Fix mtce compiling issue with gcc8
    
    Remove superfluous 'const' to fix error:
      "type qualifiers ignored on cast result type
       [-Werror=ignored-qualifiers]"
    
    Update the usage of 'operator++' on type of 'bool'
    to fix error:
      "use of an operand of type 'bool' in 'operator++'
       is deprecated [-Werror=deprecated]"
    
    Change-Id: I0ce7b2d48f8365f1dcc23eb48e4c5148db817630
    Story: 2007506
    Task: 39279
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit 5619e3e8b626e1d592f8b99b455de97438910df5
Author: Angie Wang <angie.wang@windriver.com>
Date:   Tue Feb 23 18:19:26 2021 -0500

Increase cgts-vg size for dc-vault fs
    
    Increase the partition size for cgts-vg to include
    dc-vault fs(15G) on AIO.
    
    Tested installation of AIO-DX and AIO-DX DCSC
    
    Partial-bug: 1916797
    Change-Id: I00427820f710946275f99970ad9a7c1d8437955c
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit 95e5906a6b2b3e50cc04d661acf9821f657418f9
Author: Babak Sarashki <babak.sarashki@windriver.com>
Date:   Fri Feb 12 00:31:58 2021 +0000

Add ice kernel module filters
    
    This is in support of the new ice kernel module which is
    initially added to support Intel E810.
    
    Story: 2008436
    Task: 41821
    
    Signed-off-by: Babak Sarashki <babak.sarashki@windriver.com>
    Change-Id: Ic78988e3396cd2504c2d345bc4ca9fd99f2b53ac

commit c3c7ef80e2e165760f317a51c6c5ace600c49794
Author: Nicolas Alvarez <nicolas.alvarez@windriver.com>
Date:   Fri Jan 29 14:55:45 2021 -0300

Filter snmp rpm from non controller nodes
    
    Remove SNMP Host-Based entries
    Add SNMP Armada App entry
    
    Story: 2008132
    Task: 41715
    Depends-On: https://review.opendev.org/766088
    Depends-On: https://review.opendev.org/765381
    Depends-On: https://review.opendev.org/765875
    Signed-off-by: Nicolas Alvarez <nicolas.alvarez@windriver.com>
    Change-Id: I186a1eefb234d9e9e73df41c5e1df29c866c38bf

commit 2d5c5b04edf0d84f78a87e971cf1646e6efda00f
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Mon Jan 25 10:20:05 2021 -0500

Make mtcClient stop collectd before shutdown
    
    The collectd process has been seen to segfault
    in its internal network plugin during system
    shutdown.
    
    This update modifies the mtcClient to stop
    collectd when it is commanded to reboot the
    system.
    
    Change-Id: I681ff45a2afb1ae66d2a929a64027ea3ed75721e
    Partial-Bug: 1872979
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 9ab726b0eba645d5b8a60fbce306035bb6c13149
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Mon Sep 14 16:42:54 2020 -0400

Add support for peer controller reset via mtcClient
    
    This update adds the ability for SM to passively
    request the mtcClient to BMC reset its peer controller
    as a means to recover a severely loaded active controller.
    
    To do this the mtcAgent is modified to keep the controllers'
    mtcClients updated with the BMC info of its peer.
    
    The mtcClient is modified to audit for the SM signal
    and then when asserted issue a BMC reset of its peer
    controller using an ipmitool system call.
    
    The ability to command the peer mtcClient to 'sync'
    prior to the BMC reset is implemented but configured
    disabled for now.
    
    Change-Id: Ibe4c8aaa3a980cbe5f34c3e22f015698a6453c1a
    Partial-Bug: #1895350
    Co-Authored-By: Bin.Qian@windriver.com
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 5ab03b5222f223e93ee299ed91a70a2df95647c4
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Fri Jan 8 09:59:24 2021 -0500

Mtce heartbeat cluster state change notification improvement
    
    The current heartbeat cluster state change notification
    needs to be sent when heartbeat pulses begin to be missed
    rather than only after the host has reached the Heartbeat
    Loss threshold. This buys SM more time, almost a full
    second, and in doing so provides more accurate data for
    it to make its SM heartbeat failure handling decisions.
    
    This update also begins sending maintenance heartbeat
    cluster state change notifications just before the next
    multicast pulse request but after the cluster vault is
    updated from the last pulse period. This ensures that
    SM gets the most up-to-date cluster information.
    
    This update also changes the hbsAgent's service file
    to depend on the local hbsClient. By doing so, the
    hbsAgent shuts down earlier over a graceful reboot
    thereby preventing the hbsAgent from continuing to
    report healthy response to the inactive controller
    during active controller shutdown.
    
    This way the inactive SM sees the failed active
    controller when it queries the cluster in its
    fail-pending state resulting in an inactive SM
    take-over rather than stand-down.
    
    Additional hbsAgent service file changes were made to
    prevent systemd from auto recovering a failed hbsAgent
    process, as it's monitored and managed by pmond, and
    fixed the ExecStop command line.
    
    Test Plan:
    
    PASS: Verify active controller graceful reboot.
          Standby controller takes over rather than shutdown
          - 30 of 30 iterations
    PASS: Verify active controller forced reboot
    PASS: Verify enabled standby controller graceful reboot
    PASS: Verify Standard System install
    PASS: Verify AIO DX system install
    
    Regression:
    
    PASS: Verify SM Uncontrolled Swact if active
          controller Mgmnt link drops.
    PASS: Verify handling of downed cluster interface in
          - AIO DX (fail) and Standard (degrade) system
    PASS: Verify no coredumps
    PASS: Verify update as a patch
    
    Change-Id: I6869631e091eb28a3cbb6f15d9a8ccd939c54410
    Closes-Bug: 1906556
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit f00de2a3114cbd906e18daf908a276c80fe032cb
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Tue Dec 22 17:03:55 2020 -0500

Add controller-0 to Mtce Heartbeat Service in AIO SX
    
    All system types with the exception of AIO SX
    add controller-0 to the heartbeat service.
    
    There is no enabled heartbeating in AIO SX so
    controller-0 was never added. However, without
    being added the alarms the hbsAgent raises are
    not cleared over a process startup.
    
    The local hbsClient was designed to monitor
    pmon, effectively monitor the process monitor,
    and report to the hbsAgent its ongoing health
    state. This way if pmond stops functioning
    maintenance is able to alarm that condition.
    
    However, because in AIO SX controller-0 is never
    added to the heartbeat service the current method
    of looping over the internal heartbeat service
    inventory clearing all the hbsAgent owned alarms
    for each host over a process restart is bypassed.
    
    So, the failure mode where pmond is failing and
    the hbsAgent has raised an alarm against it and is
    followed by a restart of the hbsAgent that coincides
    with 'pmond' process recovery, the pmond alarm gets
    stuck asserted.
    
    This update adds controller-0 to the heartbeat
    service inventory list for all system types so
    the hbsAgent managed alarms are cleared over a
    process restart regardless of the system type.
    
    Additionally, the following logging improvements
    were made:
    
     - add the network name to the heartbeat start log.
     - avoid heartbeat stop log when already stopped.
    
    Test Plan:
    
    PASS: Verify pmond alarm clears over hbsAgent process
          restart in AIO SX, AIO DX, Standard and Storage
          Systems.
    
    Regression:
    
    PASS: Verify Storage System Install and heartbeat
    PASS: Verify Standard System install and heartbeat
    PASS: Verify AIO DX install and heartbeat
    PASS: Verify AIO SX install and heartbeat
    PASS: Verify heartbeat logs and failure handling
    PEND: Verify update as a patch
    
    Change-Id: I9afd92a0b54296ef1f87ce7d912510649ae7560c
    Closes-Bug: 1904918
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 821f2840cc77250d55b6e3281936ebb92ae73f0c
Author: Don Penney <don.penney@windriver.com>
Date:   Thu Dec 17 13:26:24 2020 -0500

Add auto-version for remaining stx/metal packages
    
    Update remaining StarlingX packages with hardcoded TIS_PATCH_VER to
    use PKG_GITREVCOUNT where possible, with offsets as needed to ensure
    the version is incremented above the hardcoded version.
    
    Change-Id: I9fa1ceea76fa13ead2fed325e96a0be3028aa01e
    Story: 2008455
    Task: 41448
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit 484d662cb748747aea4c5137c340cc7ac316d21c
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed Dec 16 21:16:48 2020 -0500

Fix hbsAgent log flooding when SM heartbeat fails persistently
    
    If the SM part of this update is missing or the SM heartbeat
    is missing for a long period of time the hbsAgent produces
    5 logs every 10 seconds reporting the missing SM heartbeat.
    
    This is a follow-up update to its parent update
    https://review.opendev.org/c/starlingx/metal/+/751558
    
    This update throttles the warning log and corresponding
    cluster dump when SM heartbeat is persistently missing.
    
    PASS: Verify hbsAgent service and log behavior when SM
          heartbeat is persistently missing.
    
    Change-Id: Ib379ed5d37b5349ca170b5661a930b6a71c2bed1
    Partial-Fix: 1895350
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
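
The commit above does not say how the warning log is throttled. One common approach, shown here purely as an assumption, is to log the first occurrence and then only every Nth repeat. The class name and interval are hypothetical.

```python
class ThrottledLog:
    """Allow the first message through, then only every `every`-th one."""

    def __init__(self, every):
        if every < 1:
            raise ValueError("every must be >= 1")
        self.every = every
        self.count = 0

    def should_log(self):
        self.count += 1
        return self.count == 1 or self.count % self.every == 0

    def reset(self):
        # Call when the condition clears, e.g. SM heartbeat resumes.
        self.count = 0


if __name__ == "__main__":
    throttle = ThrottledLog(every=5)
    print([throttle.should_log() for _ in range(10)])
```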

commit 7f7ba86d4f2bc2c5e9ea30e29ff37d83e7fab2a2
Author: Martin, Chen <haochuan.z.chen@intel.com>
Date:   Mon Jun 22 16:00:52 2020 +0800

Add rook provisioned osd check in kickstart for restore case
    
    After rook is deployed, an osd disk such as /dev/sdx or /dev/nvmex
    is provisioned as a pv in a volume group whose name is prefixed
    with "ceph". When a user restores the system, kickstart checks
    every disk to see whether it is osd provisioned and wipes it if
    not. Add rook provisioned osd disks to the do-not-wipe list to
    enable rook restore.
    
    Story: 2005527
    Task: 39076
    
    Change-Id: Id0a5718dcdd1d9230ab1be4a33bc4af5cb356e14
    Signed-off-by: Martin, Chen <haochuan.z.chen@intel.com>

commit 0e89acc83c616741952a068a3ff07ba91440eff8
Author: Daniel Safta <daniel.safta@windriver.com>
Date:   Thu Aug 27 11:15:17 2020 +0000

Align partitions created by kickstarters
    
    Partitions on some disks may be created unaligned.
    
    The cause is that partitions are created at boundaries
    expressed in MBs. The kernel exposes a per-disk variable
    that provides an offset for aligning each partition
    (/sys/block/<disk>/alignment_offset).
    
    For better granular control, we transform MB units into
    logical sector units and use the alignment_offset variable
    to properly align the partitions.
    
    Change-Id: I971c232fe0969eac14b85c5796908f0c85e23dbf
    Closes-bug: 1883975
    Signed-off-by: Daniel Safta <daniel.safta@windriver.com>
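
The unit conversion described in the commit above can be sketched as follows. It assumes "MB" means MiB, that alignment_offset is expressed in bytes, and that the logical sector size is known (for example from /sys/block/<disk>/queue/logical_block_size). The function name is hypothetical and this is not the kickstart code.

```python
MIB = 1024 * 1024


def mib_to_aligned_sector(position_mib, logical_block_size=512, alignment_offset=0):
    """Convert a partition boundary in MiB to a logical sector number,
    shifted by the disk's alignment offset (bytes)."""
    byte_position = position_mib * MIB + alignment_offset
    if byte_position % logical_block_size:
        raise ValueError("position is not a whole number of logical sectors")
    return byte_position // logical_block_size


if __name__ == "__main__":
    print(mib_to_aligned_sector(1))
    print(mib_to_aligned_sector(1, alignment_offset=3584))
```

Working in sectors gives finer control than whole MiB steps, which is what allows a non-zero offset to be applied at all.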

tags:

added: in-f-centos8

OpenStack Infra (hudson-openstack) wrote on 2021-06-03: Fix merged to stx-puppet (f/centos8)

#35

Reviewed:  https://review.opendev.org/c/starlingx/stx-puppet/+/792029
Committed: https://opendev.org/starlingx/stx-puppet/commit/2b026190a3cb6d561b6ec4a46dfb3add67f1fa69
Submitter: "Zuul (22348)"
Branch:    f/centos8

commit 3e3940824dfb830ebd39fd93265b983c6a22fc51
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Thu May 13 18:03:45 2021 +0300

Enable kubelet support for pod pid limit
    
    Enable limiting the number of pids inside of pods.
    
    Add a default value to protect against a missing value.
    Default to a limit of 750 pids to align with the service parameter
    default for the most resource-consuming StarlingX optional app
    (openstack). In fact, any value above the service parameter minimum
    is suitable as the default.
    
    Closes-Bug: 1928353
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
    Change-Id: I10c1684fe3145e0a46b011f8e87f7a23557ddd4a
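
The "default value to protect against a missing value" described in the commit above can be sketched as follows. This is only an illustration, not the manifest's actual code: the key name `pod_max_pids` and both helper functions are hypothetical. `podPidsLimit` is the KubeletConfiguration field for this limit, but whether the change sets it through that field is an assumption.

```python
DEFAULT_POD_PIDS_LIMIT = 750  # default value named in the commit message


def pod_pids_limit(service_params):
    """Return the pod pid limit, falling back to the default when unset."""
    value = service_params.get("pod_max_pids")  # hypothetical key name
    if value is None or str(value).strip() == "":
        return DEFAULT_POD_PIDS_LIMIT
    return int(value)


def kubelet_config_fragment(service_params):
    """Render one kubelet configuration line carrying the limit."""
    return "podPidsLimit: %d" % pod_pids_limit(service_params)


if __name__ == "__main__":
    print(kubelet_config_fragment({}))                        # missing value
    print(kubelet_config_fragment({"pod_max_pids": "1000"}))  # explicit value
```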

commit 0c16d288fbc483103b7ba5dad7782e97f59f4e17
Author: Jessica Castelino <jessica.castelino@windriver.com>
Date:   Tue May 11 10:21:57 2021 -0400

Safe restart of the etcd SM service in etcd upgrade runtime class
    
    While upgrading the central cloud of a DC system, activation failed
    because there was an unexpected SWACT to controller-1. This was due
    to the etcd upgrade script. Part of this script runs the etcd
    manifest. This triggers a reload/restart of the etcd service. As this
    is done outside of the sm, sm saw the process failure and triggered
    the SWACT.
    
    This commit modifies the platform::etcd::upgrade::runtime puppet class
    to do a safe restart of the etcd SM service, which solves the issue.
    
    Change-Id: I3381b6976114c77ee96028d7d96a00302ad865ec
    Signed-off-by: Jessica Castelino <jessica.castelino@windriver.com>
    Closes-Bug: 1928135

commit eec3008f600aeeb69a42338ed44332228a862d11
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Mon May 10 13:09:52 2021 +0300

Serialize updates to global_filter in the AIO manifest
    
    Right now, looking at the aio manifest:
    https://review.opendev.org/c/starlingx/stx-puppet/+/780600/15/puppet-manifests/src/manifests/aio.pp
    there are 3 classes that update the lvm global_filter in parallel:
    - include ::platform::lvm::controller
    - include ::platform::worker::storage
    - include ::platform::lvm::compute
    This generates some errors.
    
    We fix this by adding dependencies between the above classes so that
    the global_filter is updated serially.
    
    Closes-Bug: 1927762
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
    Change-Id: If6971e520454cdef41138b2f29998c036d8307ff
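
The effect of chaining the three classes in the commit above can be illustrated with a small dependency graph. This is a sketch only. The class names come from the commit message, but the particular order of the chain is an assumption, and the real fix is expressed as Puppet dependencies rather than Python.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# class -> classes that must be applied before it (the order is assumed)
DEPENDS_ON = {
    "platform::lvm::controller": set(),
    "platform::worker::storage": {"platform::lvm::controller"},
    "platform::lvm::compute": {"platform::worker::storage"},
}


def apply_order(depends_on):
    """Return an order in which no class runs before its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for name in apply_order(DEPENDS_ON):
        print(name)
```

Because every class depends on the previous one, only one valid order exists, so the global_filter is never updated by two classes at once.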

commit 97371409b9b2ae3f0db6a6a0acaeabd74927160e
Author: Steven Webster <steven.webster@windriver.com>
Date:   Fri May 7 15:33:43 2021 -0400

Add SR-IOV rate-limit dependency
    
    Currently, the binding of an SR-IOV virtual function (VF) to a
    driver has a dependency on platform::networking.  This is needed
    to ensure that SR-IOV is enabled (VFs created) before actually
    doing the bind.
    
    This dependency does not exist for configuring the VF rate-limits
    however.  There is a chance that the VF rate-limiting configuration
    happens before the VFs are actually created.
    
    This commit fixes the issue by creating a dependency on
    platform::networking from the sriov::config class, which ensures
    the VFs are created before both driver binding and rate
    limiting configuration occur.
    
    Closes-Bug: #1927758
    Signed-off-by: Steven Webster <steven.webster@windriver.com>
    Change-Id: Ic452247eb8c980e1b18bdc54832eb635d7a9fc54

commit 0b429c7cb0c16e34755c1b1e146ebb8b006d44dc
Author: Jim Gauld <james.gauld@windriver.com>
Date:   Thu May 6 12:33:24 2021 -0400

Configure etcd service critical process nice and ionice
    
    The etcd server is a critical "interactive" process that requires
    low latency. This process has many etcd threads; each worker does
    minimal work and wakes up frequently. The threads do a small amount
    of writes to commit.
    
    The etcd server will start exceeding the heartbeat interval of 100ms
    and the election timeout of 1000ms under load and independent disk
    stress if not properly tuned as a critical process. This cascades
    into many failures.
    
    This requires io-scheduler 'cfq' to take advantage of io-nice policy
    and priority. This bumps up to best-effort/0 from best-effort/4.
    
    This sets nice -19 from nice 0. This helps tremendously with
    interactive processes for linux CFS (completely-fair-scheduler).
    
    With tuned settings, under application load and additional disk stress,
    we see a dramatic reduction of 'blocked_max' and no more kern.log
    etcdserver related errors for exceeding the timeouts.
    We see dramatic improvement to system responsiveness for kubectl and
    kube-apiserver. This prevents pods from failing when their clients
    cannot renew a lease.
    
    Note that 'blocked_max' scheduler stats for this process represents
    involuntary wait for disk related delay, scheduling delay, etc.
    
    Testing coverage:
    - various root disk HW: RAID, NVMe, SSD, VBox
    - sanity on multiple labs: R730_1 with RAID, WFP13_14
    
    Configuration change used in testing:
    - baseline: deadline, best-effort/4, nice 0
    - system under test: cfq, best-effort/0, nice -19
    - dd stress was single writer to root disk:
      while true; do
        dd if=/dev/zero of=./test.dd bs=200K count=20000 conv=fsync
      done
    
    Compared results and observe system behaviour:
    - watch kern.log for etcdserver 'took too long', and 'wal: sync'
    - watch fm alarms
    - watch kubectl pod status
    - observe performance with: iotop, schedtop, iostat
    
    Tests performed:
    - DRBD resync with and without dd writer stress
    - swact with and without dd stress
    - large application apply + dd writer stress
    - launch large number of pods (eg, scale nginx with 80 pods),
      watch systemctl status commands using strace to check for hang
    - copy very large files, create big tarballs, write mkisofs iso
    - host install
    
    Closes-Bug: 1927515
    Depends-On: https://review.opendev.org/c/starlingx/config-files/+/790098
    Signed-off-by: Jim Gauld <james.gauld@windriver.com>
    Change-Id: Ieeeba5c1375d8d99401f839c7409a9de356fda87
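
For illustration, the tuned values from the commit above (nice -19, io class best-effort with priority 0) map onto the standard `renice` and `ionice` command lines as sketched below. The commit message does not show how the settings are actually applied, so this helper is hypothetical. It only builds the argument lists and executes nothing.

```python
def tuning_commands(pid, nice=-19, io_class=2, io_priority=0):
    """Build renice/ionice argv lists for one process (nothing is run)."""
    return [
        # CPU scheduling priority: -19 is close to the highest priority
        ["renice", "-n", str(nice), "-p", str(pid)],
        # ionice class 2 is best-effort; priority 0 is its highest level
        ["ionice", "-c", str(io_class), "-n", str(io_priority), "-p", str(pid)],
    ]


if __name__ == "__main__":
    for argv in tuning_commands(1234):  # 1234 is a placeholder pid
        print(" ".join(argv))
```

As the commit notes, the io priority only takes effect with the 'cfq' io-scheduler.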

commit 9782bb104c07b4aed0876d88d1743d4816a34515
Author: Don Penney <don.penney@windriver.com>
Date:   Fri May 7 08:51:19 2021 -0400

Update dnsmasq.conf for UEFI pxeboot
    
    Due to the recent grub2 update for CVE-2020-15705, pxeboot must use the
    shim.efi file for secure boot, rather than grubx64.efi directly.
    
    Change-Id: I864ff46f449e92dfd5f1667379bc56aaaf6dfe2c
    Closes-Bug: 1927730
    Depends-On: https://review.opendev.org/c/starlingx/metal/+/790253
    Depends-On: https://review.opendev.org/c/starlingx/integ/+/790254
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit c120fb798091db9fb756e51b895dccfa8d80a947
Author: Andre Fernando Zanella Kantek <AndreFernandoZanella.Kantek@windriver.com>
Date:   Wed May 5 17:30:19 2021 -0400

AIO-SX reboots after change OAM ip address
    
    On HW tests, it was detected that the openstack-endpoints restart was
    happening at the same time as the service-manager restart, creating
    a conflict that prevented SM services from reaching enabled-active.
    This was provoking the reboot.
    
    The correction creates the openstack::keystone::endpoint::runtime::post
    class, which is executed in the post stage rather than the main stage,
    to avoid the conflict with service-manager.
    
    It also marks platform::network::runtime to run in the pre stage
    to avoid some apply errors caused by the delay in haproxy bringup.
    The delay was due to the IP address missing from the interface, since
    it was only configured later. This way the other restarted services
    will have the address on the interface when the restart happens.
    
    Tested on AIO-SX by monitoring the manifest apply and validating that
    no reboot happens.
    
    Closes-Bug: 1927275
    
    Signed-off-by: Andre Fernando Zanella Kantek <AndreFernandoZanella.Kantek@windriver.com>
    Change-Id: Ia70a3395753e43b3c1e2c037818c8c23e4ec0fd6

commit cb7858c65982c250f07a5022719d4f2b6d547d64
Author: Pedro Henrique Linhares <PedroHenriqueLinhares.Silva@windriver.com>
Date:   Wed May 5 11:11:27 2021 -0300

Fix for failure during AIO-SX to AIO-DX migration on standalone system
    
    Fix drbd-cephmon mount error by manually remounting monitor DRBD after
    DRBD::Resource creation. Removed patching of Kubernetes Persistent
    Volumes from puppet manifest since Kubelet and kube-api are no longer
    available during puppet run.
    
    Partial-Bug: 1927224
    Signed-off-by: Pedro Henrique Linhares <PedroHenriqueLinhares.Silva@windriver.com>
    Change-Id: Id5565ac734499b617b470499cfc2aa1ae2972da3

commit 5695a29e6a5ed8ee5d211e937496384027d7fd4e
Author: Bin Qian <bin.qian@windriver.com>
Date:   Thu Apr 29 13:35:38 2021 -0400

Fix missing kubelet service enable for worker nodes
    
    In the previous commit:
      https://review.opendev.org/c/starlingx/stx-puppet/+/780600/
    kubelet enable was skipped for the worker nodes.
    
    Change-Id: I7769aebb4a9e38404af0c883640e1a27bb1e9e84
    Closes-Bug: 1918139
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit 94ec35ff2d5363d3816f6d267a77a4efba6c6aa8
Author: Zhixiong Chi <zhixiong.chi@windriver.com>
Date:   Wed Apr 14 23:28:03 2021 -0400

Increase min_free_kbytes to 256M for storage to avoid OOM issue
    
    Helps prevent the OOM issue in which memory allocation fails with
    the error message 'page allocation failure: order:2, mode:0x104020'.
    
    As the Linux documentation for min_free_kbytes states:
    This is used to force the Linux VM to keep a minimum number
    of kilobytes free.  The VM uses this number to compute a
    watermark[WMARK_MIN] value for each lowmem zone in the system.
    Each lowmem zone gets a number of reserved free pages based
    proportionally on its size.
    
    Keeping more memory free in those zones means that the os itself is
    less likely to run out of memory during high memory pressure and high
    allocation events.
    
    Since the issue has so far occurred only on the storage node, we
    update the value only for the storage node.
    
    Closes-Bug: #1924209
    
    Change-Id: Iae2e5a0787f69c62ba5da53663371fd2be148e15
    Signed-off-by: Zhixiong Chi <zhixiong.chi@windriver.com>
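
As a quick check of the units in the commit above: min_free_kbytes is expressed in kilobytes, so a 256M reserve, read here as 256 MiB (an assumption), corresponds to the value computed below. The helper names are illustrative only.

```python
def mib_to_kbytes(mib):
    """Convert MiB to the kilobyte unit used by vm.min_free_kbytes."""
    return mib * 1024


def sysctl_line(mib):
    """Render the sysctl setting for a reserve given in MiB."""
    return "vm.min_free_kbytes = %d" % mib_to_kbytes(mib)


if __name__ == "__main__":
    print(sysctl_line(256))
```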

commit 736199af4106378b86b4cdca784105fe2cd8ed05
Author: Andre Fernando Zanella Kantek <AndreFernandoZanella.Kantek@windriver.com>
Date:   Wed Apr 28 14:50:21 2021 -0400

On runtime, kube-sriov-device-plugin needs to be restarted
    
    The previous correction for bug 1918139 removed the sriov plugin
    restart that is necessary at runtime when an sriov interface is
    assigned to a datanetwork (allowed on an unlocked AIO-SX). Without
    it, pod creation will not be able to use a datanetwork created
    at runtime.
    
    This correction brings back the platform::kubernetes::worker::sriovdp
    class, to be used only at runtime.
    
    Closes-Bug: 1918139
    
    Signed-off-by: Andre Fernando Zanella Kantek <AndreFernandoZanella.Kantek@windriver.com>
    Change-Id: Ied19bf3138b58b279b350d067ae0c1080e220f31

commit 69b9809465b5e7a837917cce7d0a731ddf257f0d
Author: Steven Webster <steven.webster@windriver.com>
Date:   Tue Apr 27 17:54:24 2021 -0400

Fix interface (re)configuration for single-nic system
    
    Currently, the apply-network-config manifest step launches a script
    that detects differences between puppet's view of what the
    ifcfg-* network scripts should be and what the ifcfg files in the
    /etc/sysconfig/network-scripts/ directory actually contain.
    
    If there are differences, the puppet representation of the interface
    configuration is copied to the system network-scripts directory and
    the interface is brought down and up to apply the config.
    If there are no changes between the puppet view and the system view,
    the interface is left alone.
    
    An issue can occur in a single-nic system comprising a physical
    lower ethernet interface configured for SR-IOV with upper vlan
    interfaces (oam, mgmt, etc).  If the lower interface is
    re-configured, it is subsequently brought down/up to apply
    the changes.  This causes the upper vlan interfaces to also
    be brought down by the kernel.  In the case of an IPv6 system,
    the interfaces will lose their addresses as well as any configured
    default route.  In the case of an IPv4 system, the default route
    will be wiped out, which could cause issues in a distributed cloud
    environment.
    
    This commit addresses the issue by detecting whether any lower
    interface associated with a vlan interface has been marked for
    re-configuration.  If this is the case, the vlan interface is
    also added to the up/down list to cause it to re-apply the
    existing static configuration (if it is not already in the list).
    
    Closes-Bug: 1926366
    Signed-off-by: Steven Webster <steven.webster@windriver.com>
    Change-Id: I40177900ef58a9619fecb34ceffc412f31d1a965
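
The detection step described in the commit above can be sketched as follows. The real logic lives in a shell script, so this Python version is only an illustration, and the function name and interface names are hypothetical.

```python
def interfaces_to_restart(changed, vlan_lower):
    """Extend the up/down list with vlans whose lower interface changed.

    changed    -- interface names already marked for re-configuration
    vlan_lower -- mapping of vlan interface name -> lower interface name
    """
    restart = list(changed)
    for vlan, lower in vlan_lower.items():
        # Add the vlan only if its lower interface is being re-configured
        # and the vlan itself is not already in the list.
        if lower in changed and vlan not in restart:
            restart.append(vlan)
    return restart


if __name__ == "__main__":
    vlans = {"vlan10": "enp0s3", "vlan20": "enp0s8"}
    print(interfaces_to_restart(["enp0s3"], vlans))
```

In this example only `vlan10` is added, because `enp0s8` was not marked for re-configuration.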

commit 139ba4aa6c143e495b8b7136b359254ceb3ba296
Author: Bin Qian <bin.qian@windriver.com>
Date:   Mon Apr 26 14:59:51 2021 -0400

Reset N3000 fpgas only when it exists
    
    Stop calling the N3000 FPGA reset before detecting whether the hardware exists.
    
    Closes-Bug: 1918139
    Change-Id: I81b7fbc9500fac7e86424537551c1e9aac7492ec
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit e6b1ae7d222f83625110d80a576b95f88f5ed04a
Author: Charles Short <charles.short@windriver.com>
Date:   Mon Apr 26 11:16:00 2021 -0400

Fix zuul errors due to changes in dependencies
    
    Pin hacking to < 4.0.1 to fix zuul gate issues.
    
    Test:
    Ran tox -e pep8 command to validate the pep8 job and result.
    
    Related-Bug: 1926172
    
    Signed-off-by: Charles Short <charles.short@windriver.com>
    Change-Id: Ia85b584d7ff4e5e7cb19a820d6f6323aa672f52e

commit 16f0b0cc66b23a9e74005a9cd9379de6a2d78234
Author: Yuxing Jiang <yuxing.jiang@windriver.com>
Date:   Fri Apr 23 10:02:35 2021 -0400

Rename the dnsmasq runtime class
    
    As the platform::dns::runtime class only references the dnsmasq
    resource, this commit renames it to platform::dns::dnsmasq::runtime
    to indicate its function clearly.
    
    Story: 2008774
    Task: 42365
    
    Change-Id: I79dd23bf64abfd63906daa59ec59c4496dedda31
    Signed-off-by: Yuxing Jiang <yuxing.jiang@windriver.com>

commit 70971df9f35886f5ece04c82bfccee105d3d0861
Author: Bin Qian <bin.qian@windriver.com>
Date:   Tue Mar 30 15:58:15 2021 -0400

AIO manifest to start kubernetes once
    
    This change avoids restarting kubernetes.
    It also calls sysinv-reset-n3000-fpgas to reset the N3000 FPGAs
    on host start up.
    
    Depends-On: https://review.opendev.org/c/starlingx/config/+/785683
    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/780600
    Change-Id: I4a27840820fd45ad86cef4dfce6ea0389e583f68
    Partial-Bug: 1918139
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit f4694f8a30f1e5cbe0f7d354f95949a1601eb1e1
Author: Bin Qian <bin.qian@windriver.com>
Date:   Mon Feb 8 13:00:38 2021 -0500

Single puppet for AIO controllers
    
    This change includes:
    1. create aio.pp for AIO controller nodes
    2. execute aio.pp for nodes with subfunctions of 'controller,worker'
    3. remove sriov device plugin restart code as now kubelet starts
       after related config are applied.
    
    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/784761
    Change-Id: I54b90a76454c6c545bf2891b81225bbf2ba15b03
    Partial-Bug: 1918139
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit accc39cefe9f54efa656b99bb3fad949ba030367
Author: Pedro Henrique Linhares <PedroHenriqueLinhares.Silva@windriver.com>
Date:   Sun Mar 7 19:38:23 2021 -0300

DRBD replication, rebuilding monitor and PVCs during migration to AIO-DX
    
    Given that the system capability "simplex_to_duplex_migration" exists
    on the system to indicate that it is going through a migration from
    AIO-SX to AIO-DX, this commit will, during the unlock process, create
    a DRBD replicated filesystem for the floating monitor, rebuild the
    monitor store.db from the existing Ceph OSDs on the system, recover
    the previously existing cephfs filesystems, update the ceph crushmap
    and update the Ceph monitor IP on existing PersistentVolume resources.
    
    Story: 2008587
    Task: 42078
    
    Signed-off-by: Pedro Linhares <PedroHenriqueLinhares.Silva@windriver.com>
    Change-Id: Iba6ec8bf812c9623724c357455a370d79ffd7b60
    Signed-off-by: Pedro Henrique Linhares <PedroHenriqueLinhares.Silva@windriver.com>

commit 569b457592d3f3c95aba72f5f52108316842b6fe
Author: Bin Qian <bin.qian@windriver.com>
Date:   Wed Apr 14 14:54:40 2021 -0400

Generate admin ep cert on subcloud controllers in puppet
    
    Enabled the admin endpoint cert to be generated in the manifest
    directly from k8s secret data (via secure hieradata). This operation
    is consistent with the system controller as well as with admin
    endpoint cert renewal.
    
    Partial-Bug: 1923510
    
    Change-Id: I442f3c2c97cf83588aefa8b4fe808834a31fdcc5
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit ffddc103ca66f87fb96ae02e9cfbb656d39f38ab
Author: Andre Fernando Zanella Kantek <AndreFernandoZanella.Kantek@windriver.com>
Date:   Thu Apr 15 09:59:55 2021 -0400

OAM IP change needs double lock/unlock controllers for IPV6 system
    
    Added IPv6 address fields to the list used by apply_network_config.sh
    to detect whether the interface has changed. Without them, the script
    was only copying the interface config file from
    /var/run/network-scripts.puppet/ to /etc/sysconfig/network-scripts/,
    which explains why it was working on the second reboot.
    
    Tested on:
    AIO-DX
    AIO-SX
    
    Closes-Bug: 1895555
    Signed-off-by: Andre Fernando Zanella Kantek <AndreFernandoZanella.Kantek@windriver.com>
    Change-Id: I25e60a04b4aec38c254ff3e3a7b2f0d80ce5daaf
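
The reason a missing field hides a change can be shown with a small comparison sketch. It is not the code of apply_network_config.sh. The field list below uses common ifcfg keys and is an assumption, as are the function names.

```python
# Assumed set of ifcfg keys; the script's real list is not shown here.
IPV4_FIELDS = ("IPADDR", "NETMASK", "GATEWAY")
IPV6_FIELDS = ("IPV6ADDR", "IPV6_DEFAULTGW")


def parse_ifcfg(text):
    """Parse KEY=value lines from an ifcfg file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip('"')
    return fields


def interface_changed(puppet_text, system_text, fields):
    """Return True if any compared field differs between the two files."""
    new, old = parse_ifcfg(puppet_text), parse_ifcfg(system_text)
    return any(new.get(name) != old.get(name) for name in fields)


if __name__ == "__main__":
    old = "DEVICE=oam0\nIPV6ADDR=fd00::2/64\n"
    new = "DEVICE=oam0\nIPV6ADDR=fd00::3/64\n"
    print(interface_changed(new, old, IPV4_FIELDS))                # False
    print(interface_changed(new, old, IPV4_FIELDS + IPV6_FIELDS))  # True
```

With only the IPv4 keys compared, an IPv6 address change goes unnoticed, which matches the behaviour the commit describes.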

commit f46c154188b5d90bdd19ba2a5952b4f8c565d5d3
Author: Jim Somerville <Jim.Somerville@windriver.com>
Date:   Wed Apr 14 17:13:59 2021 -0400

kdump config remove intel eth drivers from ramdisk
    
    Problem:
    On a kernel crash, such as the watchdog timer firing, kexec
    tries booting the crash recovery kernel in order to capture
    a vmcore so that the issue can be debugged. This normally
    succeeds unless the platform has ice network hardware. Why?
    Because the crash recovery kernel has only a small amount of
    memory set aside for it, and the ice driver allocates enough
    memory to cause memory exhaustion.  This causes the crash
    recovery kernel's startup to fail, leading to complete platform
    hang.  In order to break out of the hang, one needs to manually
    do a hardware reset or power cycle.
    
    Solution:
    Change kdump.conf to leave the ice driver module out of the
    initramfs that is used by the crash recovery kernel.  In
    fact, leave all of the intel ethernet drivers out since they
    are not needed and increase the risk of memory exhaustion.
    Upon changing kdump.conf, the kdump service is restarted to
    regenerate the initramfs.
    
    Verification:
    Install, check the kdump.conf file and unpack the initramfs file
    making sure that those modules are gone.  Check controller,
    worker, and storage node types.  Reboot node, make sure things
    behave as expected ie. no extra kdump.conf mangling and no
    unexpected kdump service restarts.
    Also crash a node with intel ethernet hardware on it and make
    sure it comes back up with a vmcore left in /var/log/crash.
    
    Change-Id: I9112f722cee8e199d94393bca887d3bb9bb89b39
    Closes-Bug: 1923879
    Signed-off-by: Jim Somerville <Jim.Somerville@windriver.com>
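
One way to leave drivers out of the crash-kernel initramfs is the kdump.conf `dracut_args` directive combined with dracut's `--omit-drivers` option. Whether the change above uses exactly this form is an assumption, and the driver list below is illustrative rather than the list from the commit.

```python
# Assumed list of Intel ethernet kernel modules, for illustration only.
INTEL_ETH_DRIVERS = ["ice", "i40e", "iavf", "ixgbe", "ixgbevf", "igb", "e1000e"]


def kdump_dracut_args(drivers):
    """Render a kdump.conf line that omits the given kernel modules."""
    return 'dracut_args --omit-drivers "%s"' % " ".join(drivers)


if __name__ == "__main__":
    print(kdump_dracut_args(INTEL_ETH_DRIVERS))
```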

commit f21842a2c46c656086234b9b006c224a41485acb
Author: Yuxing Jiang <yuxing.jiang@windriver.com>
Date:   Mon Apr 12 17:06:33 2021 -0400

Creates a LDAP client runtime class
    
    This commit creates a wrapper class platform::ldap::client::runtime to
    update the LDAP client in runtime.
    
    Tested by applying this class at runtime to update the LDAP server URI.
    
    Change-Id: Ia3e40617c9e628deeca839734bd3a3b41431f336
    Story: 2008774
    Task: 42248
    Signed-off-by: Yuxing Jiang <yuxing.jiang@windriver.com>

commit 2a80652598f399995edfc434f1aa0154f1b8299c
Author: Andre Fernando Zanella Kantek <AndreFernandoZanella.Kantek@windriver.com>
Date:   Tue Mar 30 13:46:11 2021 -0400

Applying SRIOV VF configuration at runtime
    
    During runtime, if a user converts a non pci-sriov classed interface
    to a pci-sriov classed interface with type 'ethernet', or creates an
    SR-IOV interface of type 'VF', logic is implemented to enable and
    configure the interface.
    
    Story: 2008531
    Task: 42203
    Change-Id: I0edb4abf2cea6dc29b9485fa09d1fecab4b76c65
    Signed-off-by: Andre Fernando Zanella Kantek <AndreFernandoZanella.Kantek@windriver.com>

commit 28ef813cda9fd0191d8cee9c1f2bd80d64175f6f
Author: Don Penney <don.penney@windriver.com>
Date:   Tue Mar 30 18:10:02 2021 -0400

Add aggregate to DX service group reprovisioning
    
    The following prior update added service group reprovisioning on DX
    nodes, but was missing the aggregate option necessary to ensure
    certain groups were active on the same controller:
    https://review.opendev.org/c/starlingx/stx-puppet/+/773277
    
    As a result, failures during swact could lead to these groups being
    assigned to different nodes, causing other failures in the system.
    
    This update adds the missing aggregate option.
    
    Partial-Bug: 1893669
    Signed-off-by: Don Penney <don.penney@windriver.com>
    Change-Id: I063d1549aa456bd4bb68c4c69c50dbc078ae7be0

commit 6a4907694c386aa6e85b0c51ac1963903f9092c8
Author: Robert Church <robert.church@windriver.com>
Date:   Sat Oct 24 03:30:47 2020 -0400

Add support for setting optional k8s cpu configuration flag
    
    If the host-label 'kube-ignore-isol-cpus=enabled' is added to a host,
    then file '/etc/kubernetes/ignore_isolcpus' will be created for kubelet
    to consume so that it can determine how to handle application-isolated
    CPUs.
    
    Story: 2008760
    Task: 42166
    Signed-off-by: Robert Church <robert.church@windriver.com>
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
    Change-Id: Ifbcc245d0e2716b7abb7726d38d3662e7b53d770
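
The label-to-flag-file behaviour described in the commit above can be sketched as follows. The label and file path come from the commit message. The function name and the `root` parameter (added so the sketch can run against a scratch directory) are hypothetical, and the real change is implemented in puppet.

```python
import os

LABEL = "kube-ignore-isol-cpus"
FLAG_FILE = "etc/kubernetes/ignore_isolcpus"  # relative to the root directory


def ensure_ignore_isolcpus(labels, root="/"):
    """Create the flag file when the host label is set to 'enabled'.

    Returns True if the file was created or already present because of
    the label, False if the label is not enabled.
    """
    if labels.get(LABEL) != "enabled":
        return False
    path = os.path.join(root, FLAG_FILE)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a"):
        pass  # an empty file is enough; kubelet only checks for presence
    return True


if __name__ == "__main__":
    import tempfile

    scratch = tempfile.mkdtemp()
    print(ensure_ignore_isolcpus({LABEL: "enabled"}, root=scratch))
```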

commit 5927be3eed92dee4192bf76af04b171c9758bfd5
Author: Andre Fernando Zanella Kantek <AndreFernandoZanella.Kantek@windriver.com>
Date:   Thu Mar 11 05:59:50 2021 -0500

Added classes to restart service manager and vim-webserver
    
    For service manager, created a class to stop it, modify the OAM IP and
    restart it. For vim-webserver, created a new class that only restarts
    the service during runtime.
    
    Story: 2008531
    Task: 42061
    Change-Id: I7846c5ab3f1f8d0adb741356164a20932f9ed25f
    Signed-off-by: Andre Fernando Zanella Kantek <AndreFernandoZanella.Kantek@windriver.com>

commit e1552be5bcd4f32ae5d9c30a4158ca98368005a6
Author: Robert Church <robert.church@windriver.com>
Date:   Thu Mar 11 01:30:24 2021 -0500

Enabling Ceph MDS as part of adding Ceph at runtime
    
    A metadata server is assigned to every node that has a monitor.
    
    Restructure the metadataserver class to ensure that the metadata server:
     - is started after the Ceph monitor and the Ceph manager on controllers
     - is started after the Ceph monitor on a worker assigned a monitor
    
    If the metadata server is started prior to the monitor, it will not
    start properly.
    
    Future optimization may be to create a MDS SM service on the
    controllers, but based on current testing, it seems unnecessary.
    
    Tested:
     - Adding Ceph pre-controller-0 unlock
       - AIO-SX, AIO-DX, Standard 2+2, Storage 2+2+2
     - Adding Ceph at runtime after installed nodes are fully provisioned.
       - AIO-SX, AIO-DX, Standard 2+2
     - For all the above configs also added storage tiers and confirmed
       proper functionality
     - NOTE: No Ceph runtime option for labs with storage node
       configuration.
    
    Change-Id: I27b53b55738d0aec70db6a9e4004c920029869fa
    Closes-Bug: #1919276
    Signed-off-by: Robert Church <robert.church@windriver.com>

commit 1aef5b8968d8ce28c4fbc42a5160f77a8ebff642
Author: Robert Church <robert.church@windriver.com>
Date:   Thu Mar 11 01:29:04 2021 -0500

Re-enable adding bare-metal Ceph storage backend at runtime
    
    Adding the bare-metal Ceph storage backend at runtime fails as
    $::platform::rook::params::service_enabled can not be resolved.
    
    This update explicitly includes the class to allow resolution and enable
    the Ceph storage backend to be added.
    
    Change-Id: I1bd12910784387c2a2d37a29d2f299e3cebb8cd2
    Closes-Bug: #1919274
    Signed-off-by: Robert Church <robert.church@windriver.com>

commit a79bc74c31350fc05cf16d89b1c2cdf35af5ef5f
Author: Carmen Rata <carmen.rata@windriver.com>
Date:   Fri Mar 12 13:47:19 2021 -0500

Cleanup config.toml.erb of MARK_* comments
    
    The comments that contain the strings "MARK_BEGIN" and "MARK_END"
    are no longer used in ansible bootstrap and need to be cleaned up
    from the config.toml.erb template.
    
    Change-Id: Id53bc58d2624581b6e50ead1c77a5cd424631ae5
    Closes-Bug: 1892768
    Depends-On: https://review.opendev.org/c/starlingx/ansible-playbooks/+/779047
    Signed-off-by: Carmen Rata <carmen.rata@windriver.com>

commit 13ba2d4a7e6f9337eda22d98df872cb40ec983ac
Author: John Kung <john.kung@windriver.com>
Date:   Sun Feb 28 10:38:54 2021 -0600

puppet manifest apply check hieradata rsync
    
    Update the puppet manifest apply to check whether the hieradata
    has been rsynced successfully.  Check the return value and, in
    certain cases, reattempt before continuing.  This is needed because
    the hieradata is actually generated by the controller node,
    and this script may be running on another host.
    
    It has been observed that there are instances on worker hosts
    where some of the hieradata is missing (e.g. missing
    system.yaml openstack_host in puppet.log).
    
    Verified:
        install, deployment and sanity on multinode and AIO
        backup and restore with Ceph
        platform upgrade
    
    Change-Id: I9e7a0a02dd28c06d914fafe8234f4fee5e05247c
    Closes-Bug: 1917229
    Signed-off-by: John Kung <john.kung@windriver.com>

commit fab4ea75c03d96ece44f441725f7f385202e737c
Author: Jessica Castelino <jessica.castelino@windriver.com>
Date:   Fri Feb 26 16:23:35 2021 -0500

Increase haproxy timeout for patching
    
    Some patching operations can take a significant amount of time.
    Thus, in this commit the haproxy timeouts for
    patching-restapi-admin-internal and patching-restapi-internal
    are updated to be 600s.
    
    Change-Id: I1b73793c2963be2d1e40634ed6f85d747c6d6985
    Story: 2007267
    Task: 41944
    Signed-off-by: Jessica Castelino <jessica.castelino@windriver.com>

commit 8300408337a5051aba0c7106add4f0068ba7d461
Author: Babak Sarashki <babak.sarashki@windriver.com>
Date:   Wed Feb 17 15:56:40 2021 +0000

Add container runtime interface (CRI) placeholder to config.toml
    
    This commit extends the containerd config.toml template file to include
    a placeholder for custom CRI entries. The custom CRI entries can be
    specified via the service-parameter method (Change-Id: Icc5fd16 stx/config).
    
    Story: 2008434
    Task: 41389
    
    Depends-On: https://review.opendev.org/c/starlingx/config/+/776220
    
    Signed-off-by: Babak Sarashki <babak.sarashki@windriver.com>
    Change-Id: Ib1dd5bd2fbb5e386cf06ab4161226c3bf6f107ac

commit a8cf39d9d37d51869503cfa4d239faa4ced7e67f
Author: Teresa Ho <teresa.ho@windriver.com>
Date:   Thu Feb 11 13:23:23 2021 -0500

Device image repository
    
    The device images are stored in the drbd filesystem
    (/opt/platform/device_images) in the active controller.
    In order to allow the other worker hosts to retrieve the device images
    from the active controller over lighttpd, the directory
    /www/pages/device_images is created as a bind mount of the drbd
    directory. This mount resource is managed by SM.
    The 'device_images' is added to the lighttpd static content list.
    
    Tests performed on the following systems:
    AIO-DX, AIO-DX plus compute, Standard 2+1
    DC with AIO-DX plus subcloud
    DC with Standard subcloud
    
    Story: 2007875
    Task: 41877
    Depends-On: https://review.opendev.org/c/starlingx/ha/+/776489
    
    Change-Id: I4e7686ece49546d7ef84f5724370167afaf21375
    Signed-off-by: Teresa Ho <teresa.ho@windriver.com>

commit 2e92d0ec7e3e39b69ac4838d54f5ec8e0ad752bc
Author: Litao Gao <litao.gao@windriver.com>
Date:   Thu Feb 18 07:59:17 2021 -0500

Add retry to tolerate 'ip link set' failure
    
    If 'ip link set' is executed too quickly for X710, it is possible
    that some invocations fail with 'Resource temporarily unavailable'.
    Add a retry in the puppet Exec resource to tolerate this failure case.
    
    Story: 2008470
    Task: 41936
    
    Signed-off-by: Litao Gao <litao.gao@windriver.com>
    Change-Id: Ib80ea77d36a0b0f63d3db2015dadb3911c56d1e9
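
The retry idea in the commit above can be sketched as follows. The actual change uses the retry support of the puppet Exec resource. This Python helper, its parameter names, and the choice to retry only on the quoted error text are assumptions made for illustration.

```python
import subprocess
import time

TRANSIENT = "Resource temporarily unavailable"  # error quoted in the commit


def run_with_retry(argv, tries=5, try_sleep=1.0, runner=subprocess.run):
    """Run a command, retrying while it fails with the transient error."""
    for attempt in range(tries):
        result = runner(argv, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        if TRANSIENT not in result.stderr:
            return False  # a different failure: retrying would not help
        if attempt < tries - 1:
            time.sleep(try_sleep)
    return False
```

A call could look like `run_with_retry(["ip", "link", "set", "dev", "eth0", "up"])`, where `eth0` is a placeholder device name. The `runner` argument exists so the helper can be exercised without touching real interfaces.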

commit 598427b294ee25fa817fb2de5e56bb18825c984e
Author: Douglas Henrique Koerich <douglashenrique.koerich@windriver.com>
Date:   Thu Feb 25 07:49:26 2021 -0500

Increase timeout for sriovdp deletion
    
    In an AIO-SX, pods get launched by kubelet soon after puppet is done
    with the controller's manifest but is still working on the worker's
    manifest. When there are many pods, the concurrency may prevent the
    SRIOV device plugin from being deleted (then restarted) in the
    expected time frame.
    The final solution for this problem is on the way, by refactoring the
    current AIO to orchestrate between pod bring-up and worker setup. In
    the meantime, the workaround in this change is to increase the
    original timeout for deletion of the SRIOV device plugin.
    
    Closes-Bug: 1916620
    Implements: increased timeout value to manage sriovdp in kubernetes.pp
    Signed-off-by: Douglas Henrique Koerich <douglashenrique.koerich@windriver.com>
    Change-Id: I0f6fb20a0ed5086fc80794b35715eea8d3d74cb8

commit 98601d637cb8f421a0fdccb2acb63339309d0dbe
Author: albailey <Al.Bailey@windriver.com>
Date:   Fri Feb 12 11:33:36 2021 -0600

Use kubelet.conf instead of admin.conf on worker nodes during upgrade
    
    Specifying a config file that does not exist causes kubelet upgrades
    to fail on worker nodes when some of the commands return errors.
    
    admin.conf does not exist on worker nodes, but exists on controllers.
    The code has been updated to use kubelet.conf during worker kubelet
    upgrade actions.
    
    The worker init code has also been changed when pulling the pause
    image so that it does not try to contact k8s.gcr.io.
    The kubernetes-version needed to be passed in when querying for the
    pause image.
    
    Story: 2008137
    Task: 41828
    Change-Id: I6565132bd587927bd26c845c2ea56a995ac6da1c
    Signed-off-by: albailey <Al.Bailey@windriver.com>

commit 20f211cbef89be56ad7dd26e93cb720d81a93172
Author: albailey <Al.Bailey@windriver.com>
Date:   Fri Feb 19 12:14:38 2021 -0600

Add bindep target to tox
    
    bindep is a helpful tox target to assist in determining
    what components a test environment needs to have installed.
    
    For stx-puppet, puppet-lint needs ruby headers otherwise
    the tox linters target will fail.
    
    Partial-Bug: #1907678
    Signed-off-by: albailey <Al.Bailey@windriver.com>
    Change-Id: Iaccd8d8f3af292ef29028cde59f0d344b94f1d72

commit c67bd455f8b8bd06ab9611d1b5d6ce7f2f948337
Author: albailey <Al.Bailey@windriver.com>
Date:   Fri Feb 19 10:01:19 2021 -0600

Fix running tox linters in a python2 env
    
    The bandit target is python3, and the package
    fails to be installed in a python2 env.
    
    Partial-Bug: #1907678
    Signed-off-by: albailey <Al.Bailey@windriver.com>
    Change-Id: I9d683c99274dc3120995e0376ace53644dc2a050

commit 54be537f9edea23df45dc3221d9be41d83f13778
Author: Chris Friesen <chris.friesen@windriver.com>
Date:   Fri Feb 12 17:45:58 2021 -0600

Add support for dcmanager-audit-worker service
    
    We're moving the bulk of the dcmanager subcloud audits to separate
    worker processes, so we need to add a service for the main worker
    processes (which will then spawn additional workers).
    
    Story: 2007267
    Task: 41869
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
    Depends-On: https://review.opendev.org/c/starlingx/ha/+/775457
    Change-Id: I119d24ae67ec4a40c360ac721582b45388231cbf

commit 3c2f1530c9ee8ccd2c27cb757655a9c851b926ae
Author: Babak Sarashki <zbsarashki@gmail.com>
Date:   Thu Feb 4 23:52:37 2021 +0000

platform puppet: Config ACC100 bbdev with QMGR val
    
    The ACC100 PF and VF configuration takes the same puppet
    config code path as the N3000 except that the ACC100 does
    not require a reset, but requires bbdev config.
    
    This patch adds platform::devices::acc100::fec class to
    exec pf-bb-config to configure QMGR on the Intel ACC100
    (Mt. Bryce) with number of 5G UL/DL qgroups and configures
    the device with the number of VF's.
    
    Story: 2008440
    Task: 41530
    
    Depends-On: https://review.opendev.org/c/starlingx/integ/+/775252
    
    Signed-off-by: Babak Sarashki <zbsarashki@gmail.com>
    Change-Id: I7d42852009fedba5136d9d726092f273ef41c7fd

commit 5a555ad98eb4fb978c7b553d463dfedf4d9b3a25
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed Feb 3 12:11:16 2021 -0500

Change collectd plugin search path
    
    This update changes the collectd's plugin
    search path from /etc/collectd.d to
    /etc/collectd.d/starlingx to avoid loading
    the collectd default plugins.
    
    Partial-Fix: 1905581
    Depends-On: https://review.opendev.org/c/starlingx/monitoring/+/772516
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
    Change-Id: I1999f25244465430d9c1385a2fc3c002d0e108c9

commit fbb4cdef07c52acb66cfcaf91bc2d029ffb00ff1
Author: Don Penney <don.penney@windriver.com>
Date:   Sun Jan 31 22:02:43 2021 -0500

Reprovision SM services on duplex
    
    Update SM provisioning for duplex to reprovision services
    if needed. The default configuration in SM is duplex services,
    and a simplex node will reprovision these to be simplex. In
    order to support SX to DX migration, these services will also
    be reprovisioned on duplex to ensure the configuration
    is correct.
    
    Story: 2008587
    Task: 41743
    
    Change-Id: Ifb61a6046c680d0dee7c76660397c6fe8c2cbe73
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit b2e37caaeb90e5931cb1522d8d23e6258d506fdb
Author: Jerry Sun <jerry.sun@windriver.com>
Date:   Wed Feb 3 09:26:47 2021 -0500

Etcd parameters lost when changing kube-apiserver parameters
    
    Etcd parameters are getting lost when changing kube-apiserver
    parameters. This is due to no default values being present. The
    missing etcd parameters causes kube-apiserver to fail to start up.
    This commit makes the script for changing kube-apiserver parameters
    keep any existing etcd parameters in the previous config.
    
    Change-Id: I83eb5426ba72a36a5eed3ecbddcbbacdf38803c5
    Closes-bug: 1914291
    Signed-off-by: Jerry Sun <jerry.sun@windriver.com>
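
The "keep any existing etcd parameters" behavior described above can be sketched as a merge of two parameter maps. This is a hedged Python illustration, not the script from the commit: `merge_apiserver_params` and the `etcd-` prefix test are assumptions about how etcd parameters might be recognized.

```python
def merge_apiserver_params(existing, updates, keep_prefix="etcd-"):
    """Return the updated parameters, carrying over existing etcd ones.

    existing: parameters from the previous config (name -> value)
    updates:  parameters requested by the change (name -> value)
    Matching on the 'etcd-' prefix is an assumption for illustration.
    """
    merged = dict(updates)
    for name, value in existing.items():
        # Preserve etcd settings that the update did not mention.
        if name.startswith(keep_prefix) and name not in merged:
            merged[name] = value
    return merged
```

Non-etcd parameters missing from the update are dropped in this sketch, while etcd parameters survive unless the update overrides them explicitly.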

commit a4decc6fbc0f796e03f25119b635fad51962fbdd
Author: Andre Fernando Zanella Kantek <AndreFernandoZanella.Kantek@windriver.com>
Date:   Wed Jan 27 14:29:59 2021 -0500

Added class to handle pci runtime config on kubernetes
    
    The new class adds a handler on puppet to trigger
    configuration on runtime when an SR-IOV interface is
    assigned to a data network on an unlocked host
    
    Story: 2008531
    Task: 41707
    Depends-On: https://review.opendev.org/c/starlingx/config/+/772759
    Change-Id: Iddbc272eb6b3321c987c2700e63734ee57244cf9
    Signed-off-by: Andre Fernando Zanella Kantek <AndreFernandoZanella.Kantek@windriver.com>

commit a34cd954e92f37994d40a31ccc2777249598622d
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Mon Jan 25 10:01:29 2021 -0500

Modify collectd manifest to not start collectd
    
    The collectd puppet manifest auto-starts the
    collectd process before a node's configuration
    is complete. This has been seen to lead to a
    collectd process core dump in collectd's
    network plugin because it is started before
    networking is set up or fully operational.
    
    Collectd has a service file that has been
    modified by the Depends-On update to start
    collectd after config is complete.
    
    Partial-Bug: 1872979
    Depends-On: https://review.opendev.org/772349
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
    Change-Id: I70ded9b745b7dadb7c50b1d5f9ba8bdcb5ffa2da

commit 389f37582a1568bc34089956dc52b6fe5c274b83
Author: John Kung <john.kung@windriver.com>
Date:   Thu Jan 21 09:41:41 2021 -0600

Adjust dcorch database pool size for dcorch scaling
    
    The dcorch database pool sizes are updated following
    the delivery of the feature with multiple dcorch engine workers.
    
    As the workload is allocated among 5 workers,
    the values for the dcorch database pools can be lowered
    from the defaults previously set for the single process case.
    
    The max theoretical database connections allowable per worker
    is based on 100 audit and 100 sync threads.
    
    Furthermore, the dcorch scaling feature audits based on
    audit timestamp so the peak loads are also likely more balanced.
    
    Testcases:
    In a multiple-subcloud environment, monitor the database
    connection usage of each dcorch engine worker during:
    subcloud add
    subcloud initial manage
    subcloud resource sync
    subcloud manage and unmanage
    
    Change-Id: Id3386df6289d42080a90b9d97cc0834054160805
    Story: 2007267
    Task: 41650
    Signed-off-by: John Kung <john.kung@windriver.com>

commit b112fce71e8ce69b7065fa3bf8f4da896cd637a3
Author: Litao Gao <litao.gao@windriver.com>
Date:   Tue Jan 12 10:18:08 2021 -0500

VF rate limiting support
    
    This commit implements the puppet-side logic to apply the
    VF max_tx_rate setting according to the configuration.
    
    Story: 2008470
    Task:  41508
    
    Depends-On: https://review.opendev.org/c/starlingx/config/+/770135
    Signed-off-by: Litao Gao <litao.gao@windriver.com>
    Change-Id: Ic599f9ac70430529f31d57a74d0809f7077b98e5

commit ce66ffd30cb33f2b770587d7731732e34593e8f1
Author: Martin, Chen <haochuan.z.chen@intel.com>
Date:   Sun Jul 5 20:42:55 2020 +0800

Add puppet class for rook
    
    Create a new drbd device for rook. On a duplex system, the device is
    mounted at /var/lib/ceph/mon-a to sync mon data between the two
    controllers.
    
    Story: 2005527
    Task: 40281
    
    Change-Id: Ic5edca16e2dce905aeb582b0359446bd222e5ad3
    Signed-off-by: Martin, Chen <haochuan.z.chen@intel.com>

commit 2680e463198cd75b01a8f04140e2d4f72e4844c9
Author: David Sullivan <david.sullivan@windriver.com>
Date:   Tue Jan 19 10:47:17 2021 -0600

Migrate etcd after both controllers are upgraded
    
    The flag to trigger the data migration is now set by the conductor on
    controller-1 and the migration will be performed on controller-0. The
    flag is now set in a drbd synced filesystem so it is accessible to both
    controllers.
    
    Depends-On: https://review.opendev.org/c/starlingx/config/+/771668
    Story: 2008055
    Task: 41631
    Signed-off-by: David Sullivan <david.sullivan@windriver.com>
    Change-Id: I761740f4de24f33f2d314ec1bc8fbc5941607900

commit 978dea28f21592ad4aa79e99821b70a1b07ab438
Author: Takamasa Takenaka <takamasa.takenaka@windriver.com>
Date:   Wed Jan 20 09:47:34 2021 -0300

Remove trap destination from fm.conf
    
    With the host-based SNMP removal,
    remove trap_destination entry from fm.conf
    
    Story: 2008132
    Task: 41350
    Change-Id: I3f0298233beedc3370fa8c4c2dbc65fe678b14a6
    Depends-On: https://review.opendev.org/765381
    Signed-off-by: Takamasa Takenaka <takamasa.takenaka@windriver.com>

commit 0f7418e761fa49b0f5a5edc9593ff9f6c0921206
Author: Angie Wang <angie.wang@windriver.com>
Date:   Mon Sep 28 11:39:17 2020 -0400

Configure SQL as helm storage backend
    
    Configmap is the default helmv2 storage backend to store
    release information but its 1MB resource limit prevents
    scaling up stx openstack worker nodes, so we want to use
    SQL as helm storage backend.
    
    Add class in helm puppet manifest to setup helm database
    during ansible bootstrap.
    
    This commit also fixes the IP address in postgres pg_hba.conf.
    
    Currently, we have the following rules for both IPv4 and
    IPv6 systems:
      Rule Name: allow access to all users with encrypted password
                 from all IPv4 addresses.
        host  all  all         0.0.0.0/0   md5
      Rule Name: deny access to postgresql user.
        host  all  postgres    0.0.0.0/32  reject
    
    For the IPv6 system, the address of pods is IPv6. The CIDR
    address in the rule should be changed to corresponding
    IPv6 address (::0/0) to allow tiller running in container
    to access helm database.
    
    Depends-On: https://review.opendev.org/#/c/761645/
    Change-Id: Ifd072000e0680a59d5be0f2f1ef2ce1cbabc1e4f
    Partial-Bug: 1887677
    Signed-off-by: Angie Wang <angie.wang@windriver.com>
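
The address-family point in the commit above (an IPv6 system needs `::0/0` in place of `0.0.0.0/0`) can be sketched as follows. This is a minimal Python illustration under stated assumptions: `pg_hba_allow_rule` is a hypothetical helper, and taking a pod network CIDR as input is an assumption; the commit does not say how the manifest determines the address family.

```python
import ipaddress


def pg_hba_allow_rule(pod_network_cidr):
    """Build the 'allow all users with md5' rule for the pod address family.

    pod_network_cidr is a hypothetical input used only to detect whether
    the system is IPv4 or IPv6.
    """
    version = ipaddress.ip_network(pod_network_cidr, strict=False).version
    any_addr = "0.0.0.0/0" if version == 4 else "::0/0"
    return f"host  all  all  {any_addr}  md5"
```

With an IPv4-only rule, a client connecting from an IPv6 pod address matches no allow entry, which is why tiller could not reach the helm database on IPv6 systems.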

commit 4b97414655f5126ce65acf9b15be635483955c74
Author: Takamasa Takenaka <takamasa.takenaka@windriver.com>
Date:   Thu Jan 7 11:12:11 2021 -0300

Support trap_server_port configurable
    
    Add a trap_server_port parameter so that the user can
    configure the SNMP trap server port number through a
    user helm override.
    
    Story: 2008132
    Task: 41548
    Signed-off-by: Takamasa Takenaka <takamasa.takenaka@windriver.com>
    Change-Id: Iac44d813447881591efd7b4a088185f2d59986be

commit 777d5d0de78c97fdc223e56662f7d3db6def2768
Author: Zhipeng Liu <zhipengs.liu@intel.com>
Date:   Sat Oct 31 01:15:33 2020 +0800

Enable etcd with security setting.
    
    Update etcd puppet to support security settings.
    
    Partial-Bug: 1894870
    
    Change-Id: Ifb5bb2506a260186bf4e8caa487bbeaae04df80b
    Signed-off-by: Zhipeng Liu <zhipengs.liu@intel.com>

commit 6182d3f94990ea282004245e2d821eae5ac573ea
Author: Don Penney <don.penney@windriver.com>
Date:   Thu Dec 17 13:21:50 2020 -0500

Add auto-version for remaining stx/stx-puppet packages
    
    Update remaining StarlingX packages with hardcoded TIS_PATCH_VER to
    use PKG_GITREVCOUNT where possible, with offsets as needed to ensure
    the version is incremented above the hardcoded version.
    
    Change-Id: I110ef3a10c3164f8edb706b9257f33178b4a2517
    Story: 2008455
    Task: 41456
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit f8397fe71bae28a4126bbdf38da0731ba529b4c0
Author: Nicolas Alvarez <nicolas.alvarez@windriver.com>
Date:   Thu Nov 26 16:51:32 2020 -0300

Delete SNMP Host-Based entries.
    
    Delete entries related with SNMP Host-Based.
    
    Story: 2008132
    Task: 41323
    Signed-off-by: Nicolas Alvarez <nicolas.alvarez@windriver.com>
    Depends-On: https://review.opendev.org/766094
    
    Change-Id: I2c4a89fd7c4bac9895311787663a6d693600b090

commit b1997248da4bcb1d3ec0ce15d423eb42d2219a3e
Author: Daniel Safta <daniel.safta@windriver.com>
Date:   Mon Oct 5 10:33:36 2020 +0000

Add mds support in puppet for CephFS.
    
    Mds configuration needs to be present on every node that
    has a ceph monitor in order for CephFS to be available.
    
    Change-Id: Ic4270e401b2c3e5123aecfab21af1e874b733830
    Story: 2008162
    Task: 40908
    Signed-off-by: Daniel Safta <daniel.safta@windriver.com>

commit 6f881cc84e3d3c922423441304e7157effc505e7
Author: Andy Ning <andy.ning@windriver.com>
Date:   Thu Dec 3 09:41:08 2020 -0500

Skip platform ceph osds puppet manifest following DOR
    
    ceph::osd puppet manifest will fail during controller puppet
    manifests apply following DOR, because as both controllers are
    booting up, there is no ceph monitor cluster so puppet is unable
    to validate or invalidate the existing configuration.
    
    This change updated platform::ceph::controller class to skip
    platform ceph osds in the case of DOR.
    
    Change-Id: I0254ce28869bc87c5e939ea8984d175244ebb65f
    Partial-Bug: 1904739
    Signed-off-by: Andy Ning <andy.ning@windriver.com>

commit 8ba9e81db4e238c69edebdfec4738063aad7eb14
Author: Carmen Rata <carmen.rata@windriver.com>
Date:   Tue Dec 1 22:22:57 2020 -0500

Fix directory permissions for /var/log/rabbitmq
    
    Updated /var/log/rabbitmq directory permissions to 750 from 755
    to disallow world access to rabbitmq log files but at the same
    time to allow group access.
    The changes are made to comply as much as possible with
    openscap rules security requirements.
    Verified that installation is successful for AIO-SX
    and Standard 2+2 system configurations.
    
    Story: 2008037
    Task: 40694
    
    Change-Id: I1c0112575033c04983c56298e2131882911333de
    Signed-off-by: Carmen Rata <carmen.rata@windriver.com>

commit 3b7c55174aafffd8f35545ad8e20d928322de2f9
Author: Lu Yao Chen <luyao.chen@windriver.com>
Date:   Wed Nov 25 14:28:48 2020 -0500

Retain more puppet log files
    
    Increased max log directories to retain more
    debugging info from puppet.log.
    
    Tested by looping system host-cpu-modify commands;
    /var/log/puppet now caps at 50 log directories
    instead of 20.
    
    Closes-Bug: 1903994
    
    Signed-off-by: Lu Yao Chen <luyao.chen@windriver.com>
    Change-Id: Ia8458396867f988d5061d3aa49fa2a21ee6ebac2
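
The retention cap described above amounts to keeping the newest N log directories and deleting the rest. A minimal Python sketch of that idea follows; it is not the code from the commit. `prune_log_dirs` is a hypothetical helper, and ordering by modification time is an assumption.

```python
import os
import shutil


def prune_log_dirs(base, max_dirs=50):
    """Delete the oldest subdirectories of base beyond max_dirs.

    Returns the list of removed paths. Age is judged by mtime, which is
    an assumption made for this illustration.
    """
    dirs = sorted(
        (entry.path for entry in os.scandir(base) if entry.is_dir()),
        key=os.path.getmtime,
    )
    excess = dirs[: max(0, len(dirs) - max_dirs)]
    for path in excess:
        shutil.rmtree(path)
    return excess
```

Raising `max_dirs` from 20 to 50 keeps more puppet runs available for debugging at the cost of more disk usage under the log directory.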

commit 77d3382d2c63dba2e04cb92333a37b0370992cd5
Author: Carmen Rata <carmen.rata@windriver.com>
Date:   Mon Nov 23 18:22:12 2020 -0500

Fix permission of puppet saved logs tar file
    
    Changed the permissions of puppet saved logs tar file from
    644 to 600 to comply with openscap rules security requirements.
    Verified that installation is successful for AIO-SX
    and Standard 2+2 system configurations.
    
    Story: 2008037
    Task: 40694
    
    Change-Id: I1fe365e808a085999667e898788afacf61fd6612
    Signed-off-by: Carmen Rata <carmen.rata@windriver.com>

commit e5ff48c2ca6931eadff3566de33519a3496beeab
Author: Andy Ning <andy.ning@windriver.com>
Date:   Mon Nov 23 14:07:35 2020 -0500

Remove comments in keystone::upgrade class
    
    The TODO comments in keystone::upgrade class no longer apply.
    This update removes them.
    
    Change-Id: Id9f7b39c15db1f73428d4f23d93ef3e3b4ad50f5
    Partial-Bug: 1886064
    Signed-off-by: Andy Ning <andy.ning@windriver.com>

commit c10b5897b9d972555228f6510803c48981050e5f
Author: Jerry Sun <jerry.sun@windriver.com>
Date:   Thu Nov 19 11:05:12 2020 -0500

Update dnsmasq config for slow DNS servers
    
    When a configured DNS server takes a long time to respond to
    unknown domains or hosts, registry interactions like push, pull,
    and querying for images through system commands will fail due to
    hostname resolution for registry.local. This is because the lookup
    attempts to resolve registry.local using the A record first, which
    times out since it hits the configured external DNS server. This
    prevents the process from looking up the AAAA record, which would
    resolve to the dnsmasq CNAME record. This commit updates the dnsmasq
    config to prevent forwarding the local domain to upstream servers.
    
    Change-Id: Ic3cf6aae87f8f2d5c61a24db00a4cb814c20aac6
    Closes-Bug: 1904885
    Signed-off-by: Jerry Sun <jerry.sun@windriver.com>

commit 3ca2387ddbb455a081689be72632b408988c5d39
Author: Takamasa Takenaka <takamasa.takenaka@windriver.com>
Date:   Tue Nov 3 15:35:29 2020 -0300

Add variables for snmp in fm.conf
    
    The SNMP trap client needs the following three variables
    to connect to the SNMP trap server:
    - trap_server_ip
    - trap_server_port
    - snmp_enabled
    Modify puppet to add these variables. trap_server_ip
    and trap_server_port are fixed. snmp_enabled is True or
    False depending on whether the snmp armada app is applied
    (True when applied).
    
    Change-Id: Ibedaf772153f49c6dfefe644044da07b5d32bb20
    Story: 2008132
    Task: 41207
    Depends-On: https://review.opendev.org/761213
    Signed-off-by: Takamasa Takenaka <takamasa.takenaka@windriver.com>

OpenStack Infra (hudson-openstack) wrote on 2021-06-07: Fix merged to monitoring (f/centos8)

#36

Reviewed: https://review.opendev.org/c/starlingx/monitoring/+/792244
Committed: https://opendev.org/starlingx/monitoring/commit/fdc0d099fb0d65cbf8f037fe0cc9ac8125410284
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 2ef5451f442482636db3c0c3641e8412821bd8c5
Author: Takamasa Takenaka <email address hidden>
Date: Thu Apr 22 12:28:37 2021 -0300

Format 2 lines ntpq data into 1 lines

    The logic expected one line of data per ntpq result,
    but each NTP server entry spanned 2 lines. When a peer
    server was selected, the script checked whether its
    refid was reliable but could not find the refid because
    it is on the following line.
    This fix formats the 2-line data into 1 line.

    The minor alarm "NTP cannot reach external time
    source; syncing with peer controller only" is
    removed because NTP does not prioritize an
    external time source over the peer.

Closes-Bug: 1889101

Signed-off-by: Takamasa Takenaka <email address hidden>
Change-Id: Icc8316bb1a7041bf0351165c671ebf35b97fa3bc
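
The two-lines-into-one formatting described above can be sketched as follows. This is a hedged Python illustration rather than the script from the commit: `join_wrapped_lines` is hypothetical, and it assumes that the first line of a wrapped entry holds only the remote field, with the refid and remaining columns on the next line.

```python
def join_wrapped_lines(peer_lines):
    """Join ntpq peer entries that were split across two lines.

    peer_lines must contain only peer rows (header and separator rows
    already removed). A row with a single field is assumed to be the
    first half of a wrapped entry.
    """
    joined = []
    pending = None
    for line in peer_lines:
        if pending is not None:
            # Second half of a wrapped entry: append it to the first half.
            joined.append(pending + " " + line.strip())
            pending = None
        elif len(line.split()) == 1:
            pending = line.rstrip()
        else:
            joined.append(line.rstrip())
    if pending is not None:
        joined.append(pending)
    return joined
```

After joining, the refid is always the second field of a row, so a single per-line parser can handle both wrapped and unwrapped entries.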

commit d37490b81408ca53b1b8fd61992c6c9337dbcaed
Author: Eric MacDonald <email address hidden>
Date: Tue Apr 20 10:03:07 2021 -0400

Add alarm audit to starlingx collectd fm notifier plugin

    This update adds common plugin support for alarm state auditing.
    The audit is able to detect and correct the following alarm
    state errors:

       Error Case                Correction Action
       -----------------------   -----------------
     - stale alarm             ; delete alarm
     - missing alarm           ; assert alarm
     - alarm severity mismatch ; refresh alarm

The common audit is enabled for the fm_notifier plugin, which supports
alarm management for the following resources.

     - CPU with alarm id 100.101
     - Memory with alarm id 100.103
     - Filesystem with alarm id 100.104

Other plugins may use this common audit in the future but only the
above resources have the audit enabled for them by this update.

Test Plan:

    PASS: Verify stale alarm detection/correction handling
    PASS: Verify missing alarm detection/correction handling
    PASS: Verify alarm severity mismatch detection/correction handling
    PASS: Verify each host audits only its own specified alarms
    PASS: Verify success path of monitoring a single and mix
          of base and instance alarms of varying severity while
          such alarm conditions come and go
    PASS: Verify alarm audit of mix of base and instance alarms
          over a collectd process restart
    PASS: Verify audit handling of alarm that migrates from
          major to critical to major to clear
    PASS: Verify audit handling transition between alarm and
          no alarm conditions
    PASS: Verify soak of random cpu, memory and filesystem
          overage alarm assertions and clears that also involve
          manual alarm deletions, assertions and severity changes
          that exercise new audit features

Regression:

    PASS: Verify alarm and audit handling over Swact with mounted
          filesystem that has active alarm
  ...
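
The three audit corrections listed in the commit above (stale alarm, missing alarm, severity mismatch) can be sketched as a comparison of two maps. This is a minimal Python illustration, not the plugin's implementation: `audit_alarms` and the dictionary representation of alarm state are assumptions.

```python
def audit_alarms(expected, raised):
    """Compare expected alarm state with what is actually raised.

    expected: alarms the plugin believes should exist (key -> severity)
    raised:   alarms currently present in the alarm system (key -> severity)
    Returns (action, key) tuples; both dictionaries are hypothetical
    stand-ins for the plugin's real data structures.
    """
    actions = []
    for key in sorted(raised.keys() - expected.keys()):
        actions.append(("delete", key))    # stale alarm
    for key in sorted(expected.keys() - raised.keys()):
        actions.append(("assert", key))    # missing alarm
    for key in sorted(expected.keys() & raised.keys()):
        if expected[key] != raised[key]:
            actions.append(("refresh", key))  # severity mismatch
    return actions
```

When both maps agree, the audit returns no actions, so running it periodically is harmless in the steady state.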
