StarlingX

A large number of files generated by filebeat pod are not removed

Bug #1865924 reported by Tee Ngo on 2020-03-03

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
StarlingX	Fix Released	High	Kevin Smith	

Bug Description

Brief Description
-----------------
The following issue was observed in a distributed cloud configuration. The /var/log partition was filled up due to space taken by a large number of filebeat deleted files.

Severity
--------
Critical

Steps to Reproduce
------------------
Set up a large distributed cloud with stx-monitor applied and soak for a few days with some test activities such as deploying, managing/unmanaging and removing subclouds.

Expected Behavior
------------------
Service logs are saved to disk and rotated accordingly.

Actual Behavior
----------------
The logmgmt process was hogging the CPU and no logs were flushed to disk. Log files were rotated rapidly with almost no content, and a filesystem critical alarm was generated.

The problem documented here (courtesy of Al Bailey)
https://www.elastic.co/guide/en/beats/filebeat/master/faq-deleted-files-are-not-freed.html
might be the cause of this issue
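
The effect described on that FAQ page follows from how POSIX filesystems treat unlinked files: the data stays allocated until the last open handle is closed. A minimal Python sketch of the effect, assuming a POSIX system. The helper name, file suffix and size are illustrative only, and nothing here is taken from filebeat itself:

```python
import os
import tempfile


def hold_deleted_file(size_bytes):
    """Create a file, delete it while it is still open, and report what the
    open handle still sees. Returns (path_exists, link_count, size_seen)."""
    fd, path = tempfile.mkstemp(suffix=".log")  # illustrative log file
    try:
        os.write(fd, b"x" * size_bytes)
        os.unlink(path)  # roughly what cleanup of a rotated log does
        info = os.fstat(fd)  # the open descriptor still reaches the data
        return os.path.exists(path), info.st_nlink, info.st_size
    finally:
        os.close(fd)  # only now can the filesystem free the space


if __name__ == "__main__":
    exists, links, size = hold_deleted_file(1024 * 1024)
    print(f"path exists: {exists}, links: {links}, bytes still held: {size}")
```

A process that keeps many such handles open, as a log shipper harvesting rotated files can, keeps the partition full even though the files no longer appear in a directory listing.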

Reproducibility
---------------
Seen once

System Configuration
--------------------
IPv6 Distributed Cloud

Branch/Pull Time/Commit
-----------------------
Feb 22 master code

Last Pass
---------
N/A

Timestamp/Logs
--------------
As logs were not flushed to disk, there are
See the attached list of deleted files, produced by running the command "sudo lsof|grep deleted".
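
The same check can be approximated without lsof by reading /proc directly. A rough Python sketch, assuming Linux, where the kernel appends " (deleted)" to the link target of an unlinked open file. The function names are hypothetical, and root privileges are needed to see other users' processes:

```python
import os

DELETED_SUFFIX = " (deleted)"


def is_deleted_target(link_target):
    """True if a /proc/<pid>/fd symlink target is marked as deleted."""
    return link_target.endswith(DELETED_SUFFIX)


def deleted_open_files(proc_root="/proc"):
    """Yield (pid, path, size_bytes) for open files that have been unlinked.

    Processes and descriptors that cannot be read (permissions, races with
    exiting processes) are skipped silently.
    """
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return
    for pid in filter(str.isdigit, entries):
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue
        for fd in fds:
            fd_path = os.path.join(fd_dir, fd)
            try:
                target = os.readlink(fd_path)
                if not is_deleted_target(target):
                    continue
                size = os.stat(fd_path).st_size  # follows the link to the open file
            except OSError:
                continue
            yield int(pid), target[: -len(DELETED_SUFFIX)], size


if __name__ == "__main__":
    total = 0
    for pid, path, size in deleted_open_files():
        total += size
        print(f"{pid}\t{size}\t{path}")
    print(f"bytes held by deleted files: {total}")
```

Grouping the output by pid shows which process is holding the space. Note that some non-disk objects (for example memfd regions) also carry the deleted marker.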

Test Activity
-------------
Evaluation

Workaround
----------
Kill logmgmt process and delete filebeat pods.

Tee Ngo (teewrs) wrote on 2020-03-03:

deleted_files.txt (9.4 MiB, text/plain)

description:	updated

Ghada Khalil (gkhalil) wrote on 2020-03-04:

stx.4.0 / high priority - stx-monitor resulting in running out of log space on distributed cloud

tags:	added: stx.4.0 stx.distcloud stx.monitor
Changed in starlingx:
importance:	Undecided → High
status:	New → Triaged
assignee:	nobody → Kevin Smith (kevin.smith.wrs)

OpenStack Infra (hudson-openstack) wrote on 2020-03-19: Fix merged to config (master)

Reviewed: https://review.opendev.org/713957
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=241ea2871b15965bd694895f796660f7f1fddbf3
Submitter: Zuul
Branch: master

commit 241ea2871b15965bd694895f796660f7f1fddbf3
Author: Tee Ngo <email address hidden>
Date: Thu Mar 19 13:54:15 2020 -0400

Set time limit for filebeat open filehandlers

    In a large system, filebeat can harvest a large number of files
    and with the default file closing policies, many deleted files are
    not freed. Over time, this leads to /var/log partition running out
    of space, services not being able to flush their logs to disk and
    logmgmt process continuously rotating logs.

    This commit sets a default time limit for each open file harvester.
    This value can be adjusted as needed via user overrides.

    Closes-Bug: 1865924
    Change-Id: I9dbf9cb2128157834b937357dcc6c4945dc5d2f3
    Signed-off-by: Tee Ngo <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released
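
To illustrate the idea of a per-harvester time limit, here is a toy Python model. It is not filebeat's implementation. Filebeat's log input documents a `close_timeout` option that gives each harvester a fixed lifetime; whether the commit uses exactly that option is an assumption here. The class, function name and the 300 second value are hypothetical:

```python
import time


class Harvester:
    """Toy stand-in for a log harvester that holds one file open."""

    def __init__(self, path, opened_at):
        self.path = path
        self.opened_at = opened_at
        self.closed = False

    def close(self):
        self.closed = True  # a real harvester would close its file handle here


def close_expired(harvesters, close_timeout, now=None):
    """Close every harvester that has been open for close_timeout seconds or
    longer and return the ones that are still open."""
    now = time.monotonic() if now is None else now
    still_open = []
    for h in harvesters:
        if now - h.opened_at >= close_timeout:
            h.close()
        else:
            still_open.append(h)
    return still_open


if __name__ == "__main__":
    hs = [Harvester("/var/log/a.log", opened_at=0.0),
          Harvester("/var/log/b.log", opened_at=250.0)]
    remaining = close_expired(hs, close_timeout=300.0, now=310.0)
    print([h.path for h in remaining])
```

With a bound like this, a handle on a deleted file is released after at most the timeout, so the space it holds can be reclaimed even if the file never reaches the other closing conditions.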

OpenStack Infra (hudson-openstack) wrote on 2020-03-31: Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/716137

OpenStack Infra (hudson-openstack) wrote on 2020-03-31: Fix merged to config (f/centos8)

Reviewed:  https://review.opendev.org/716137
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=cb4cf4299c2ec10fb2eb03cdee3f6d78a6413089
Submitter: Zuul
Branch:    f/centos8

commit 16477935845e1c27b4c9d31743e359b0aa94a948
Author: Steven Webster <steven.webster@windriver.com>
Date:   Sat Mar 28 17:19:30 2020 -0400

Fix SR-IOV runtime manifest apply
    
    When an SR-IOV interface is configured, the platform's
    network runtime manifest is applied in order to apply the virtual
    function (VF) config and restart the interface.  This results in
    sysinv being able to determine and populate the puppet hieradata
    with the virtual function PCI addresses.
    
    A side effect of the network manifest apply is that potentially
    all platform interfaces may be brought down/up if it is determined
    that their configuration has changed.  This will likely be the case
    for a system which configures SR-IOV interfaces before initial
    unlock.
    
    A few issues have been encountered because of this, with some
    services not behaving well when the interface they are communicating
    over suddenly goes down.
    
    This commit makes the SR-IOV VF configuration much more targeted
    so that only the operation of setting the desired number of VFs
    is performed.
    
    Closes-Bug: #1868584
    Depends-On: https://review.opendev.org/715669
    Change-Id: Ie162380d3732eb1b6e9c553362fe68cbc313ae2b
    Signed-off-by: Steven Webster <steven.webster@windriver.com>

commit 45c9fe2d3571574b9e0503af108fe7c1567007db
Author: Zhipeng Liu <zhipengs.liu@intel.com>
Date:   Thu Mar 26 01:58:34 2020 +0800

Add ipv6 support for novncproxy_base_url.
    
    For ipv6 address, we need url with below format
    [ip]:port
    
    Partial-Bug: 1859641
    
    Change-Id: I01a5cd92deb9e88c2d31bd1e16e5bce1e849fcc7
    Signed-off-by: Zhipeng Liu <zhipengs.liu@intel.com>
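
The [ip]:port format mentioned in this commit message can be sketched in Python with the standard ipaddress module. The helper name and the port number are illustrative and are not taken from the actual change:

```python
import ipaddress


def host_port(ip, port):
    """Format ip and port for use in a URL, bracketing IPv6 literals."""
    addr = ipaddress.ip_address(ip)
    host = f"[{addr}]" if addr.version == 6 else str(addr)
    return f"{host}:{port}"


if __name__ == "__main__":
    print(host_port("fd00::1", 6080))
    print(host_port("192.168.1.2", 6080))
```

The brackets are needed because an unbracketed IPv6 literal makes the colon before the port ambiguous.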

commit d119336b3a3b24d924e000277a37ab0b5f93aae1
Author: Andy Ning <andy.ning@windriver.com>
Date:   Mon Mar 23 16:26:21 2020 -0400

Fix timeout waiting for CA cert install during ansible replay
    
    During ansible bootstrap replay, the ssl_ca_complete_flag file is
    removed. It expects puppet platform::config::runtime manifest apply
    during system CA certificate install to re-generate it. So this commit
    updates conductor manager to run that puppet manifest even if the CA cert
    has already been installed so that the ssl_ca_complete_flag file is created
    and ansible replay can continue.
    
    Change-Id: Ic9051fba9afe5d5a189e2be8c8c2960bdb0d20a4
    Closes-Bug: 1868585
    Signed-off-by: Andy Ning <andy.ning@windriver.com>

commit 24a533d800b2c57b84f1086593fe5f04f95fe906
Author: Zhipeng Liu <zhipengs.liu@intel.com>
Date:   Fri Mar 20 23:10:31 2020 +0800

Fix rabbitmq could not bind port to ipv6 address issue
    
    When we use Armada to deploy openstack services for ipv6, the rabbitmq
    pod could not start listening on [::]:5672 and [::]:15672.
    For ipv6, we need an override for configuration file.
    
    Upstream patch link is:
    https://review.opendev.org/#/c/714027/
    
    Test pass for deploying rabbitmq service on both ipv4 and ipv6 setup
    
    Partial-Bug: 1859641
    
    Change-Id: I6495c45fbd8cc1de3c9f5d9ef5003447079d91b8
    Signed-off-by: Zhipeng Liu <zhipengs.liu@intel.com>

commit 08aa950393a7e3c5fd5299b88e134307800584aa
Author: Kevin Smith <kevin.smith@windriver.com>
Date:   Sun Mar 22 14:29:15 2020 -0400

application-apply error string too long
    
    During application-apply exception handling, str(e) is
    used as the input to the progress column of the kube_app
    table in the database, which may be longer than the 255
    character limit.  The result is an application stuck
    in 'applying' status.  This update adds a more readable
    error message to just check logs.
    
    There are other instances where str(e) is used as input to
    the database and could cause a similar problem which should
    also be looked at.
    
    Change-Id: I01a5e8f56a628726163e2cfffc58143ae8d5f845
    Closes-Bug: 1867019
    Signed-off-by: Kevin Smith <kevin.smith@windriver.com>
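
For the other str(e) call sites mentioned in this commit message, one possible guard is to bound the string before it is stored. This is a different technique from the one the commit describes (a fixed "check logs" message), shown only as a sketch. The 255 limit comes from the commit message; the helper name and suffix are hypothetical:

```python
MAX_PROGRESS_LEN = 255  # column limit quoted in the commit message


def progress_message(exc, limit=MAX_PROGRESS_LEN):
    """Return a string describing exc that fits within limit characters."""
    text = str(exc)
    if len(text) <= limit:
        return text
    suffix = "... (see logs)"
    return text[: limit - len(suffix)] + suffix


if __name__ == "__main__":
    print(len(progress_message(RuntimeError("x" * 1000))))
```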

commit c1c18871d72cdcd877b95f593bd119b47b3ddbb6
Author: Andy Ning <andy.ning@windriver.com>
Date:   Tue Feb 18 14:52:06 2020 -0500

Support multiple CA certificates installation
    
    This update enhanced sysinv certificate install API to be able to
    install multiple CA certs from a file. The returns from the API call
    indicates the certs actually installed in the call (i.e., excluding those
    that are already in the system). This is necessary especially for DC to
    support multiple CA certs synchronization.
    
    This update also added sysinv certificate uninstall API. The API is to
    be used to remove a particular CA certificate from the system, identified
    by its uuid. The API returns a json body with information about the
    certificate that has been removed. This is required by DC sysinv api
    proxy for certificate deletion synchronization, since DC tracks subcloud
    certificates resource by signature while the uninstall API request
    contains only uuid.
    
    The uninstall API only supports ssl_ca certificate.
    
    cgtsclient and system CLI are also updated to align with the updated
    and new APIs. User can use "system certificate-install ..." to install
    one or multiple CA certificates, and "system certificate-uninstall ..."
    to remove a particular CA certificate from the system.
    
    When multiple CA certificates are installed in the system,
    "system certificate-list" will display each of the individual
    certificates.
    
    The sysinv certificate configuration API reference is updated with the
    new uninstall API. Unit tests are added for CA certificate install and
    delete APIs.
    
    Change-Id: I7dba11e56792b7d198403c436c37f71d7b7193c9
    Depends-On: https://review.opendev.org/#/c/711633/
    Closes-Bug: 1861438
    Closes-Bug: 1860995
    Signed-off-by: Andy Ning <andy.ning@windriver.com>

commit 241ea2871b15965bd694895f796660f7f1fddbf3
Author: Tee Ngo <tee.ngo@windriver.com>
Date:   Thu Mar 19 13:54:15 2020 -0400

Set time limit for filebeat open filehandlers
    
    In a large system, filebeat can harvest a large number of files
    and with the default file closing policies, many deleted files are
    not freed. Over time, this leads to /var/log partition running out
    of space, services not being able to flush their logs to disk and
    logmgmt process continuously rotating logs.
    
    This commit sets a default time limit for each open file harvester.
    This value can be adjusted as needed via user overrides.
    
    Closes-Bug: 1865924
    Change-Id: I9dbf9cb2128157834b937357dcc6c4945dc5d2f3
    Signed-off-by: Tee Ngo <tee.ngo@windriver.com>

commit d7c3822a52ecc3b4288106c4e544e67add80fbf5
Author: Jerry Sun <jerry.sun@windriver.com>
Date:   Fri Mar 13 12:37:39 2020 -0400

Remove usage of /etc/kubernetes/kubeadm.yaml
    
    /etc/kubernetes/kubeadm.yaml could contain stale data, for example, from
    changing kube-apiserver parameters. There are currently no system impacts
    from using the stale file, but as we change more parameters, there could
    be system impact. This commit makes the existing usage of kubeadm.yaml
    generate a temp copy of the file with current data first.
    
    Change-Id: I62391d184e3e5d6397a9af4f43c7c7ec19314afc
    Partial-bug: 1866695
    Signed-off-by: Jerry Sun <jerry.sun@windriver.com>

commit 8ecdcbbbcdc2807113c7b7004f92653acffa0b41
Author: Teresa Ho <teresa.ho@windriver.com>
Date:   Tue Mar 10 16:46:04 2020 -0400

Add platform network type for storage
    
    Added a new platform network type for optional backend storage.
    
    Story: 2007391
    Task: 39018
    
    Change-Id: I1a389b8aede49095e4f7f7d24ed8224504575d45
    Signed-off-by: Teresa Ho <teresa.ho@windriver.com>

commit 2528dce84b5891038ca56c6959304ac4c1fc934a
Author: Thomas Gao <Thomas.Gao@windriver.com>
Date:   Thu Feb 13 18:52:15 2020 -0500

Allow VF type interface to detect underlying port
    
    Running `host-if-show` on a VF interface whose underlying port supports
    dpdk will now display accelerated [True]. Before this fix, only
    ethernet, vlan, and ae type interfaces supported detecting
    underlying ports that support dpdk.
    
    Closes-Bug: 1846260
    
    Change-Id: Ifdee31811824a38ebc7d3a8febde2341d39ba986
    Signed-off-by: Thomas Gao <Thomas.Gao@windriver.com>

commit 95d8bb436b625c82e78ebb2a2134e0e861bd5574
Author: Jerry Sun <jerry.sun@windriver.com>
Date:   Wed Mar 4 16:07:22 2020 -0500

Support post-bootstrap config of kube-apiserver parameters
    
    Add system service parameters for each of the kube-apiserver parameters
    for openid connect.
    
    Story: 2006711
    Task: 38944
    
    Depends-On: https://review.opendev.org/711336
    
    Change-Id: Ib4b9aee036447087f88f803548e3f982446ccda4
    Signed-off-by: Jerry Sun <jerry.sun@windriver.com>

commit 6f162c3422df6c11b0d9f548487bfb3b9e401ca5
Author: Thomas Gao <Thomas.Gao@windriver.com>
Date:   Fri Feb 7 15:28:42 2020 -0500

Fixed address interface foreign key inconsistency
    
    Foreign key in sysinv.object.address.Address is `interface_uuid`,
    which is inconsistent with the foreign key `interface_id` defined
    in the database schema. This fix corrected that.
    
    Added a unit test to verify that addresses associated with an interface
    could be deleted.
    
    Additionally wrote a set of TODO unit tests blocked by
    the bug: tested delete address for orphaned-routes case, unlocked
    host state, and the case where address is allocated from pool.
    
    Modified interface querying mechanism to look up all interfaces.
    This modification is necessary because the current implementation of
    add_interface_filter only looks up those of type ethernet, ae and
    vlan. Attempting to get a virtual-type interface will raise an
    exception, causing Jenkins installation to fail.
    
    After a visual inspection of interface_uuid occurrences, fixed a few
    other occurrences of bad address.interface_uuid that are not caught
    by the unit test. Added new unit test suites in place to cover the
    code paths.
    
    Closes-Bug: 1861131
    
    Change-Id: I6f2449bbbb69d6f2353e521bfcd138d880ce878f
    Signed-off-by: Thomas Gao <Thomas.Gao@windriver.com>

commit 964a2b7c6238ce91d4ace34dcac790fa5a37d55c
Author: Kevin Smith <kevin.smith@windriver.com>
Date:   Tue Mar 3 14:17:42 2020 -0500

stx-monitor: only delete pvcs on app delete.
    
    It may be desired to keep the persistent volumes after removing the
    stx-monitor application.  This update will not remove the pvcs on
    application-remove, but remove them on application-delete
    
    Closes-Bug: 1865568
    
    Change-Id: I9b06008fe6b6033e5a1ce6808cc5d4fa6aabcd05
    Signed-off-by: Kevin Smith <kevin.smith@windriver.com>

commit c5d43da89e7fd2a12407bc4bebd14ab87d16c638
Author: Angie Wang <angie.wang@windriver.com>
Date:   Tue Feb 25 17:00:53 2020 -0500

Allow users to override a single image with a custom registry
    
    When the user overrides a single image with a
    custom registry that is not one of the known registries
    in Sysinv, the image download fails because the
    docker.io registry is prepended to the image reference,
    which generates an invalid image tag.
    
    The original purpose of that logic was to handle
    an image that comes from docker.io but does not have
    docker.io explicitly specified in its image name. This
    case is already handled in the class
    "AppImageParser".
    
    This commit removes the related logic that is causing the
    issue.
    
    Tested:
     - system helm-override-update stx-openstack nova openstack \
         --set images.tags.nova_api=mycustomregistry.com/stx-nova:latest
     - system application-apply stx-openstack
    
    Change-Id: I07d1a658c3cf56a3e09e81e1f947f93de50b513d
    Closes-Bug: 1859881
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit 347af170f9cf1fd49be2a52107f0594d9d4b8ba8
Author: David Sullivan <david.sullivan@windriver.com>
Date:   Tue Feb 25 21:13:59 2020 -0500

Update PTP API ref and unit tests
    
    Add the PTP apply function to the API ref and the unit tests.
    
    Story: 2006759
    Task: 38848
    Change-Id: Iae3cc9e90b653fd92a83a0d9a216d87016cf4c6c
    Signed-off-by: David Sullivan <david.sullivan@windriver.com>

commit 8e2e5f7e82efde39407d34c1a26daffb97dbe26d
Author: Kevin Smith <kevin.smith@windriver.com>
Date:   Fri Feb 21 07:56:04 2020 -0500

Set elasticsearch pod java options according to ip config
    
    The "-Djava.net.preferIPv6Addresses=true" java option was set
    for both ipv4 and ipv6 configurations which worked fine in both
    configs.  At some point recently in ipv4 configurations, the
    stx-monitor application stopped applying successfully due to
    elasticsearch cluster discovery failure.  Why the ipv4 failures
    are only recently occurring is unknown, but removal of this
    unnecessary java option for ipv4 eliminates the failures.
    
    This update will set the above java option for elasticsearch
    pods only if the cluster service network is ipv6.
    
    Closes-Bug: 1864193
    
    Change-Id: I2952f1c799b121d0812314156162af7696ebd6b0
    Signed-off-by: Kevin Smith <kevin.smith@windriver.com>

commit 6065f1318af289001d2017111cc8633c3320efda
Author: Matt Peters <matt.peters@windriver.com>
Date:   Thu Feb 20 16:22:02 2020 -0500

Remove system name from default index naming
    
    Remove the system name from the default index naming
    since it causes a large number of small independent
    indexes to be created that does not scale well against
    the current daily index rotation.
    
    Change-Id: Ia880a1d8c48703a0741a72e999c0cdb93c229423
    Story: 2006990
    Task: 38834
    Signed-off-by: Matt Peters <matt.peters@windriver.com>

commit 73d407bdf44933673e8e975e2523828b9c43e25d
Author: Matt Peters <matt.peters@windriver.com>
Date:   Thu Feb 20 16:21:40 2020 -0500

Add normalized percentages to cpu metric collection
    
    CPU metric collection which has been normalized against the
    number of cores is not enabled.  This update adds the
    appropriate configuration option to enable these metrics.
    
    Change-Id: I1e2dcd0fac144236dab3718a917344c339444003
    Closes-Bug: 1864128
    Signed-off-by: Matt Peters <matt.peters@windriver.com>

commit bbb9a477c1cb33ca51a134d742073cc200f89fb0
Author: Angie Wang <angie.wang@windriver.com>
Date:   Thu Feb 20 11:56:01 2020 -0500

Reject the k8s first control plane upgrade after networking is upgraded
    
    The first upgraded control plane shouldn't be allowed to re-upgrade
    after the k8s networking upgrade is done. This commit adds a check
    to prevent this action.
    
    Change-Id: I01c6539fe89749663dff6159e56d14f9a510ebe0
    Story: 2006781
    Task: 38761
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit cb2b83365e823cd69a0e8e2a3c54b3e679f48776
Author: Teresa Ho <teresa.ho@windriver.com>
Date:   Thu Feb 20 11:37:03 2020 -0500

Support for https in OIDC client
    
    Changed OIDC client to use HTTPS by default.
    
    Story: 2006711
    Task: 38481
    
    Depends-On: https://review.opendev.org/#/c/708911
    Change-Id: I567b224030cfe2278cdca57f2d40ad36c98d7ff6
    Signed-off-by: Teresa Ho <teresa.ho@windriver.com>

commit 4687ea36b5fadb7dad0cfe0a1ede4b488a0b5aeb
Author: David Sullivan <david.sullivan@windriver.com>
Date:   Fri Feb 14 15:30:41 2020 -0500

Apply PTP configuration at runtime
    
    Allow PTP configuration to be applied at runtime. Previously this would
    have required a lock/unlock of the host. A new command 'system
    ptp-apply' has been added to apply the ptp configuration.
    
    Note we will not apply ptp configurations to hosts that have switched
    from ntp to ptp. That change will require a lock/unlock as before.
    
    Depends-On: https://review.opendev.org/707904
    Change-Id: I098bd12336f34324a77615a20a4e36b7620ab79b
    Story: 2006759
    Task: 38770
    Signed-off-by: David Sullivan <david.sullivan@windriver.com>

commit d93d5804c626955fb711897745dce4a61136183b
Author: Jessica Castelino <jessica.castelino@windriver.com>
Date:   Fri Feb 14 16:47:03 2020 -0500

Fixed error responses in controller-fs
    
    Error response given by controller-fs-modify erroneously mentions
    filesystem names which are not controller filesystems. To fix this,
    hard-coded filesystem names have been completely removed.
    
    Change-Id: Ic6f563dd0b347ac7ece628f6e716c952205c1687
    Closes-Bug: 1862416
    Signed-off-by: Jessica Castelino <jessica.castelino@windriver.com>

commit f6eebbd318f3c596c7d408696ce1558fd03a5497
Author: Bart Wensley <barton.wensley@windriver.com>
Date:   Wed Feb 19 12:56:21 2020 -0600

Disable keystone caching on subclouds
    
    The use of keystone caching on subclouds causes problems because
    the syncing of fernet keys to the subcloud results in stale
    cache entries. This causes authentication failures until the
    cache entries age out or new tokens are created.
    
    Since the keystone load in a subcloud is light, there is really
    no need for caching at this time - it is being disabled in
    subclouds.
    
    Change-Id: I777c57c46cf1bcd701fbbac73228a2cb81d8424b
    Closes-Bug: 1860372
    Signed-off-by: Bart Wensley <barton.wensley@windriver.com>

commit b330498aecb7068e8bfa65c41c71e974b2d674aa
Author: Mingyuan Qi <mingyuan.qi@intel.com>
Date:   Tue Feb 18 03:48:44 2020 +0000

Change docker client to crictl in cert rotation
    
    With the container runtime moving to containerd, the containers are
    created by containerd. Accordingly, the client tool is changed
    to crictl. In the kube cert rotation script, the containers will
    be stopped by crictl and automatically started by kubelet to
    update the renewed certificates within the container.
    
    Story: 2006145
    Task: 37619
    
    Change-Id: Ia8cf76c15811f8f9d88199158e83ccba31534e4e
    Signed-off-by: Mingyuan Qi <mingyuan.qi@intel.com>

commit 7afe5de64d0d23ec951620e0380fb65e2f49f4c3
Author: Angie Wang <angie.wang@windriver.com>
Date:   Tue Feb 11 17:25:11 2020 -0500

Add semantic checks for k8s upgrade
    
    Semantic checks added:
      - verify whether all installed applications are compatible with
        the new k8s version before starting k8s upgrade
      - prevent host-unlock if the host kubelet upgrade is in progress
        (allow --force to do force unlock).
      - prevent application-apply/update if the app is incompatible with
        the current k8s version.
    
    For the application that has k8s version restriction, the following
    keys need to be optionally specified in its metadata file:
    ie...
    supported_k8s_version:
      minimum: v1.16.1
      maximum: v1.16.3
    
    The k8s version related information in metadata file will be used for
    compatibility check. The metadata file is updated to copy over to the
    drbd fs during application-upload.
    
    Tests conducted:
      - "system kube-upgrade-start" rejected if any installed app's k8s
        version check failed
      - host-unlock rejected if the host is in upgrading-kubelet status
      - was able to forcibly unlock host even if it's upgrading kubelet
      - application-apply/update testing
    
    Change-Id: I1ef852cccddf7ae39eca4b4e25b80a7f4347d8a4
    Story: 2006781
    Task: 38761
    Signed-off-by: Angie Wang <angie.wang@windriver.com>
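
A compatibility check against the minimum/maximum keys described in this commit message could look roughly like the following Python sketch. The function names are hypothetical and the real sysinv code may differ. The key point is that versions are compared numerically rather than as strings:

```python
def parse_k8s_version(version):
    """Turn 'v1.16.2' into (1, 16, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def is_k8s_version_supported(current, minimum=None, maximum=None):
    """Check current against optional inclusive minimum/maximum bounds."""
    cur = parse_k8s_version(current)
    if minimum is not None and cur < parse_k8s_version(minimum):
        return False
    if maximum is not None and cur > parse_k8s_version(maximum):
        return False
    return True


if __name__ == "__main__":
    print(is_k8s_version_supported("v1.16.2", "v1.16.1", "v1.16.3"))
```

Both bounds are optional in this sketch, matching the commit message's note that the keys are optionally specified.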

commit 2b49e9f3f93c9913961b437d4e51d1e7d46f1222
Author: Robert Church <robert.church@windriver.com>
Date:   Thu Feb 13 10:00:56 2020 -0600

Workaround for cleaning up MatchNodeSelector pods after host reboot
    
    Added a K8sPodOperator class to look for and remove Failed pods with a
    MatchNodeSelector reason.
    
    MatchNodeSelector pods related to applications will not be removed by
    K8S automatically. These pods may block subsequent application applies
    as tiller expects these pods to be in a non failed state.
    
    A check for this condition is added in two locations:
    - to the _k8s_application_audit() which is run immediately on
      sysinv-conductor startup and runs every minute. This runs 4 times in a
      5 minute window at startup on a simplex install. This should catch all
      cases unless there is a delay accessing the k8s API that lasts longer
      than 5 minutes at startup.
    - to the application-apply path. This would cover any case that occurs
      after the initial 5 minute conductor startup OR any occurrence on a
      non-simplex installation (so far only observed on AIO-SX)
    
    NOTE: This commit will be reverted once a proper upstream k8S fix is
    provided.
    
    Related upstream bugs:
    - https://github.com/kubernetes/kubernetes/issues/80745
    - https://github.com/kubernetes/kubernetes/issues/85334
    
    The following PR was tested and fixed this issue but has not landed
    upstream in a new k8s release:
    - https://github.com/kubernetes/kubernetes/pull/80976
    
    Change-Id: Ia5418794a44e7821933e8335d5c5db25b58a739f
    Closes-Bug: #1849688
    Signed-off-by: Robert Church <robert.church@windriver.com>

commit 34e410821b7b0699444b303fcdec1ab89d860cc6
Author: Jessica Castelino <jessica.castelino@windriver.com>
Date:   Thu Feb 13 15:38:51 2020 -0500

Fix inconsistent disk space calculation
    
    The / operator performs integer division on integers in Python 2
    but floating-point division in Python 3. Thus, changes are made to
    make the calculation behave consistently.
    
    Change-Id: I6a5905a4d97df5b9e73e165580801c865006f316
    Signed-off-by: Jessica Castelino <jessica.castelino@windriver.com>
    Closes-Bug: 1862668
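
The Python 2 versus Python 3 difference behind this fix can be avoided by stating the intended kind of division explicitly. A small Python sketch with hypothetical helper names, not the code from the commit:

```python
def mib_to_gib_floor(size_mib):
    """Whole GiB in size_mib; // floors on both Python 2 and Python 3."""
    return size_mib // 1024


def mib_to_gib_exact(size_mib):
    """Fractional GiB in size_mib; float() forces true division on Python 2 too."""
    return float(size_mib) / 1024


if __name__ == "__main__":
    print(mib_to_gib_floor(1536), mib_to_gib_exact(1536))
```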

commit e6e37c949a39e4ee3d4f4c9407a85089e7514345
Author: Jessica Castelino <jessica.castelino@windriver.com>
Date:   Mon Feb 10 16:26:13 2020 -0500

Added unit test cases for host file system.
    
    Test cases added for API endpoints used by:
     1. host-fs-list
     2. host-fs-modify
     3. host-fs-show
    
    This commit also fixes the issue of Host FS disk space calculations
    yielding different values in Python 2 and Python 3.
    
    Change-Id: I50a1ca43c43c3bba30730c616b3788664920d0c9
    Story: 2007082
    Task: 38013
    Partial-Bug: 1862668
    Signed-off-by: Jessica Castelino <jessica.castelino@windriver.com>

commit 227ddec6189fdabdc75d45162fc22b9af7118982
Author: Thomas Gao <Thomas.Gao@windriver.com>
Date:   Thu Feb 13 10:47:55 2020 -0500

Fix device plugin port handling for pci-passthrough
    
    While generating the SR-IOV device plugin configuration data,
    it is necessary to get the underlying port information.
    For SR-IOV ports there is special handling required to deal
    with the case of a 'VF' subinterface.  For PCI-Passthrough,
    the port can and should be accessed directly.
    
    Closes-Bug: 1856587
    
    Co-Authored-By: Steven Webster <steven.webster@windriver.com>
    
    Change-Id: I70f315669776a591e23e69c6653098e720815b99
    Signed-off-by: Thomas Gao <Thomas.Gao@windriver.com>

commit cab522030f79c0060b80050c6a560696d7db80d9
Author: Stefan Dinescu <stefan.dinescu@windriver.com>
Date:   Fri Jan 31 17:31:17 2020 +0200

Make Ceph storage backend optional
    
    Changes included in this commit:
    - change consistency checks to allow a system to
      be deployed without ceph configured
    - allow ceph to be provisioned before unlocking
      controller-0
    - add support for runtime provisioning of ceph
      on an already fully deployed system
    - move default cluster and storage tier config
      from conductor initialization to storage-backend
      creation
    - move CephOperator initialization from conductor
      initialization to a greenthread that waits for
      the ceph cluster to become responsive
    - make adding ceph storage-backend timing consistent
      across all setups: you can add it before unlocking
      controller-0 or only after all controller nodes
      have been unlocked.
    
    Tests run:
    - all tests were run on AIO-SX, AIO-DX, Standard
      and Storage configs
    - deploy system without ceph
    - configure ceph after running ansible bootstrap,
      but before unlocking controller-0
    - configure ceph at runtime on an already deployed
      system
    - swacting
    
    Change-Id: I05fbd494d9a22a535eae200a26c21b1702500194
    Depends-On: https://review.opendev.org/705234
    Story: 2007064
    Task: 37931
    Signed-off-by: Stefan Dinescu <stefan.dinescu@windriver.com>

commit f1605d465b5cb10a9d46803e88096951cdacc3a5
Author: David Sullivan <david.sullivan@windriver.com>
Date:   Mon Feb 3 14:35:45 2020 -0500

PTP Configuration Enhancements
    
    Add PTP service parameters. Any service parameters in the global ptp
    section will be written to the ptp4l conf. phc2sys service parameters
    will be used to specify the command line options used with the phc2sys
    service.
    
    Values specified in the service parameters will take precedence over
    values specified by the PTP table.
    
    Story: 2006759
    Task: 38669
    Depends-On: https://review.opendev.org/#/c/706364
    Change-Id: I791ec251be44d963bfb5eb69268fbc7a8a75391a
    Signed-off-by: David Sullivan <david.sullivan@windriver.com>
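
The precedence rule described in this commit (service parameters override values from the PTP table) can be illustrated with a minimal sketch. The function names and the dictionary inputs are hypothetical and are not taken from the sysinv or puppet code; only the "[global]" section header and the "key value" line layout follow the ptp4l configuration file format.

```python
def merge_ptp4l_options(ptp_table_values, service_parameters):
    """Combine two option sources; service parameters win on conflict.

    Both arguments are hypothetical plain dicts of option name -> value.
    """
    merged = dict(ptp_table_values)    # start from the PTP table defaults
    merged.update(service_parameters)  # service parameters take precedence
    return merged


def render_global_section(options):
    """Render options as the [global] section of a ptp4l conf file."""
    lines = ["[global]"]
    for key in sorted(options):
        lines.append("%s %s" % (key, options[key]))
    return "\n".join(lines)


if __name__ == "__main__":
    table = {"time_stamping": "hardware", "domainNumber": "0"}
    params = {"domainNumber": "24"}
    print(render_global_section(merge_ptp4l_options(table, params)))
```

Copying the table values first and then applying the service parameters keeps the override order explicit and leaves both inputs unmodified.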

commit 173eb3bea75e2a774976461a5caef482c20a814a
Author: Jessica Castelino <jessica.castelino@windriver.com>
Date:   Mon Feb 3 16:21:42 2020 -0500

Added unit test cases for controller file system
    
    Test cases added for API endpoints used by:
     1. controllerfs-list
     2. controllerfs-modify
     3. controllerfs-show
    
    Change-Id: Ifd525d2218a099b15139f17d6b4ae1b7279e8810
    Story: 2007082
    Task: 38003
    Signed-off-by: Jessica Castelino <jessica.castelino@windriver.com>

commit aead92341082065798ee4450d804f64d63ba35f1
Author: Thomas Gao <Thomas.Gao@windriver.com>
Date:   Tue Jan 21 18:12:46 2020 -0500

Enabled platform interfaces to add ip address(es)
    
    Removed network type check in api controller interface to allow platform
    interfaces to have static address mode in the database.
    
    Removed broken network type check in api controller address.
    
    Loosened interface-class and network-type restrictions in puppet
    controller to allow platform interfaces to have static ip address
    during system unlock.
    
    Added unit tests to test puppet interface's new restriction logic of
    get_interface_address_method for ipv4 static mode (valid), ipv6 static
    mode (valid), and ipv4 static mode with network type (invalid).
    
    Added unit test to ensure one can add an ip address to the static
    platform interface. Enabled DAD for ipv6 tests. Renamed get_post_object
    parameter interface_id to interface_uuid to eliminate usage
    inconsistency because the former is rejected in the POST request.
    
    Closes-Bug: 1855191
    
    Change-Id: I1f2bc92bb1a97dc4afb21966de4055b12855510a
    Signed-off-by: Thomas Gao <Thomas.Gao@windriver.com>

commit b27ae6b348fdd03d83859e7c1a21baf828859328
Author: Thomas Gao <Thomas.Gao@windriver.com>
Date:   Thu Jan 16 11:21:30 2020 -0500

Fixed semantic checks for SR-IOV VF parameters.
    
    Only interfaces of class pci-sriov may have numvfs and vf_driver.
    However, an interface of class data that tried to set numvfs and
    vf_driver via the CLI was able to pass the semantic check.
    Moreover, when an interface class changed from pci-sriov to data,
    the numvfs and vf_driver fields were not cleared.
    
    This fix tackles the above issues by altering the condition
    check that resets the 2 fields before the semantic check, so
    that an invalid request no longer passes the semantic check.
    This fix also ensures the 2 fields are permanently reset
    once the interface class is changed from pci-sriov to data.
    
    Added several unit tests to verify all situations described
    above.
    
    Depends-On: https://review.opendev.org/#/c/705293
    
    Closes-Bug: 1855933
    
    Change-Id: I3c25c57edcdd50c5e76e17da658c7985821a3436
    Signed-off-by: Thomas Gao <Thomas.Gao@windriver.com>
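
The two behaviours described in this commit (reject SR-IOV fields on a non-SR-IOV interface, and clear them when the class changes away from pci-sriov) can be sketched as follows. This is an illustration under stated assumptions, not the sysinv implementation: the function name, the "ifclass" key, and the dictionary representation of an interface are hypothetical.

```python
SRIOV_CLASS = "pci-sriov"
SRIOV_ONLY_FIELDS = ("numvfs", "vf_driver")


def apply_interface_patch(current, patch):
    """Return the updated interface dict, or raise ValueError.

    `current` and `patch` are hypothetical plain dicts; `patch` holds only
    the fields the user asked to change.
    """
    updated = dict(current)
    updated.update(patch)

    if updated.get("ifclass") != SRIOV_CLASS:
        # Semantic check: an explicit request to set an SR-IOV-only field
        # on a non-SR-IOV interface is rejected.
        for field in SRIOV_ONLY_FIELDS:
            if patch.get(field):
                raise ValueError(
                    "%s is only valid for class %s" % (field, SRIOV_CLASS))
        # Leftover values from a previous pci-sriov class are cleared.
        for field in SRIOV_ONLY_FIELDS:
            updated[field] = None

    return updated
```

Checking only the fields present in the patch is what lets a class change from pci-sriov to data succeed while still rejecting a direct attempt to set numvfs on a data interface.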

commit 4598ca8d65417b7ac9f19f6fd3954639d230b46b
Author: Al Bailey <Al.Bailey@windriver.com>
Date:   Wed Feb 5 09:38:42 2020 -0600

Deprecate sysinv.openstack.common.db in favor of oslo_db
    
    openstack.common.db was not being used except by unit tests.
    The sysinv engine had previously been converted, so the
    changes are primarily in the unit test environment.
    
    Story: 2006796
    Task: 37426
    Change-Id: Ie638ee7e347fef0ada061ed4047decd0cbb919ef
    Signed-off-by: Al Bailey <Al.Bailey@windriver.com>

commit fb84bf9bdcb7844e6ac0ea192480a43ae4ac7480
Author: Thomas Gao <Thomas.Gao@windriver.com>
Date:   Fri Jan 31 10:06:25 2020 -0500

Forbid unlocked hosts to modify interfaces
    
    Simplified the convoluted logic that allows certain unlocked hosts to
    modify interfaces. Now the logic simply rejects unlocked hosts.
    
    Fixed a series of unit tests that modified an unlocked test controller by
    transferring the modification operations to locked test workers. Moreover,
    the hardcoded test controller id is replaced with the worker id attribute.
    
    Fixed another set of tests that attempted to create ethernet, vlan, or
    bond interfaces on an unlocked test controller, even though those tests
    are intended for locked test workers. This redundant network configuration
    was removed, because keeping it would force the only active controller
    node to be locked.
    
    Closes-Bug: 1855187
    
    Change-Id: I7eacba9d064a4efb2c2032c3879d11460401ca08
    Signed-off-by: Thomas Gao <Thomas.Gao@windriver.com>

commit 29f38ce63725a829a165989bb134fd98ac8bea78
Author: Andy Ning <andy.ning@windriver.com>
Date:   Tue Feb 4 15:42:20 2020 -0500

Copy encryption provider config file to second controller
    
    kube-apiserver encryption provider config file is generated by ansible
    bootstrap on the first controller and stored in the shared fs. It is
    then copied over to the second controller. When kube-apiserver pod
    starts it will take this configuration file as its encryption provider
    configuration.
    
    Change-Id: Ibfcfb13c8a6685e38a1043acd7ec752ea116911c
    Story: 2007243
    Task: 38627
    Signed-off-by: Andy Ning <andy.ning@windriver.com>

commit c4fa36214c444b34ae9c2b06f35758eb1ba8c987
Author: Thomas Gao <Thomas.Gao@windriver.com>
Date:   Mon Feb 3 15:41:28 2020 -0500

Forbid IPv4 DNS in an IPv6 OAM config
    
    Implemented IP version check in DNS controller api to reject patch
    operations with mismatched DNS server IP version.
    
    Enabled and fixed relevant unit tests.
    
    Rearranged unit test inheritance hierarchy to eliminate undesired test
    repetitions.
    
    Closes-Bug: 1860489
    
    Change-Id: Ief4a19eeea03086bb5816a13cb3a706a48bab51a
    Signed-off-by: Thomas Gao <Thomas.Gao@windriver.com>
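
The IP version check described in this commit can be sketched with the Python standard library. This is a minimal illustration, not the code from the DNS controller API: the function name and its arguments are hypothetical, and the real change may use a different address library.

```python
import ipaddress


def check_dns_ip_version(nameservers, oam_subnet):
    """Raise ValueError if any DNS server differs in IP version from OAM.

    `nameservers` is a list of address strings and `oam_subnet` is a CIDR
    string; both are assumptions made for this sketch.
    """
    oam_version = ipaddress.ip_network(oam_subnet).version
    for server in nameservers:
        if ipaddress.ip_address(server).version != oam_version:
            raise ValueError(
                "DNS server %s is not IPv%d, which the OAM network requires"
                % (server, oam_version))
```

For example, calling check_dns_ip_version(["8.8.8.8"], "fd00::/64") raises ValueError, while check_dns_ip_version(["2001:4860:4860::8888"], "fd00::/64") returns without error.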

commit 5df1f3a89a6e1ef699fc6030a18902faf45daf88
Author: Bin Qian <bin.qian@windriver.com>
Date:   Wed Feb 5 13:26:43 2020 -0500

Adding job to upload commits to GitHub
    
    Add job to publish config repo to GitHub
    Fix host_key
    
    Story: 2007252
    Task: 38657
    
    Change-Id: Id0c1fe7278cbddbf6082f452323537427fefe95f
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit 8ab1e2d7c624f83d72efcbfcddcdffa567a26bad
Author: Shuicheng Lin <shuicheng.lin@intel.com>
Date:   Wed Dec 11 16:37:03 2019 +0800

Audit local registry secret info when there is user update in keystone
    
    The local registry uses the admin username and password for
    authentication. The admin password can be changed with an openstack
    client command, which makes the auth info in the secrets obsolete and
    leads to invalid authentication in keystone.
    To keep the secrets info updated, keystone event notification is enabled
    and an event notification listener is added in sysinv. When a user
    password changes, keystone sends out a user update event, and sysinv
    calls the function audit_local_registry_secrets to check whether the
    kubernetes secret info needs to be updated.
    
    A periodic task is also added to ensure secrets are always synced, in
    case a notification is missed or handling a notification fails.
    
    oslo_messaging is added to tox's requirements.txt to avoid tox failure.
    The version is based on global-requirements.txt from Openstack Train.
    
    Test:
    Deployment passes, and secrets are updated automatically with the new
    auth info.
    host-swact passes in duplex mode.
    
    Closes-Bug: 1853017
    Depends-On: https://review.opendev.org/700677
    Depends-On: https://review.opendev.org/699547
    Change-Id: I959b65288e0834b989aa87e40506e41d0bba0d59
    Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>

tags:

added: in-f-centos8