k8s tmp files are cleared every 10 days causing config failures

Bug #1883599 reported by Ghada Khalil
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Andy
Milestone: (none listed)

Bug Description

Brief Description
-----------------
The starlingx config code calls the k8s python client to perform a number of operations. The k8s python client creates a file under /tmp and continues to use this tmp file for the life-cycle of the sysinv-conductor process. A cleanup service in starlingx/centos runs daily and removes /tmp files that have not been used for 10 days. Once the tmp file has been removed, sysinv starts to fail with an error that the tmp file is no longer there.

This is a known issue with k8s:
https://github.com/kubernetes-client/python/issues/765

The best option is to keep these files in a location other than /tmp. This is required for any starlingx process that calls the k8s python client. A directory under /var/run is a good option.
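
A minimal sketch of why relocating the files can work, assuming the client obtains its temp files through Python's standard tempfile module: tempfile consults the TMPDIR environment variable when it first chooses a default directory. The helper name `temp_file_in` and the directory used here are hypothetical stand-ins for a service-owned location under /var/run.

```python
import os
import tempfile


def temp_file_in(directory):
    """Create a temp file after pointing TMPDIR at `directory`; return its path."""
    os.makedirs(directory, exist_ok=True)
    os.environ["TMPDIR"] = directory
    # tempfile caches its default directory on first use; clear the cache so
    # the new TMPDIR is picked up inside an already-running interpreter.
    tempfile.tempdir = None
    fd, path = tempfile.mkstemp()
    os.close(fd)
    return path


if __name__ == "__main__":
    # Throwaway stand-in for a service-owned directory (assumption).
    base = tempfile.mkdtemp()
    target = os.path.join(base, "service_tmp")
    print(temp_file_in(target))
```

Setting TMPDIR before the process starts avoids the cache reset, since tempfile then reads the variable on first use.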

Severity
--------
Major - sysinv/config cmds will start failing after the system is up for 10 days w/o any controller swact

Steps to Reproduce
------------------
- Leave a system up for more than 10 days
- Attempt to make a config change -- For example: updating from http to https

Expected Behavior
------------------
config cmds remain functional regardless of how long the system has been up

Actual Behavior
----------------
config cmds start failing after the system is up for 10 days

Reproducibility
---------------
Was seen on one system which was up for more than 10 days, but expected to be reproducible

System Configuration
--------------------
any

Branch/Pull Time/Commit
-----------------------
Seen with a recent stx master load, but is a day 1 issue

Last Pass
---------
Never

Timestamp/Logs
--------------
sysinv 2020-06-11 20:51:51.446 106052 ERROR sysinv.puppet.puppet [-] failed to create secure_system config: ConfigException: File does not exists: /tmp/tmpFQ1byr
sysinv 2020-06-11 22:27:03.641 106052 ERROR sysinv.puppet.puppet [-] failed to create secure_system config: ConfigException: File does not exists: /tmp/tmpFQ1byr
sysinv 2020-06-11 22:29:09.146 106052 ERROR sysinv.puppet.puppet [-] failed to create secure_system config: ConfigException: File does not exists: /tmp/tmpFQ1byr
sysinv 2020-06-11 22:40:19.170 106052 ERROR sysinv.puppet.puppet [-] failed to create secure_system config: ConfigException: File does not exists: /tmp/tmpFQ1byr

Test Activity
-------------
System soak

Workaround
----------
Restart the sysinv-conductor to recover the system:
sudo sm-restart service sysinv-conductor

Ghada Khalil (gkhalil)
description: updated
description: updated
Ghada Khalil (gkhalil)
tags: added: stx.config stx.containers
Changed in starlingx:
assignee: nobody → Andy (andy.wrs)
importance: Undecided → High
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/736246

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/736247

Ghada Khalil (gkhalil)
tags: added: stx.4.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/736246
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=cbf5837fb0238612fc7837980892bda3e1aa4f7b
Submitter: Zuul
Branch: master

commit cbf5837fb0238612fc7837980892bda3e1aa4f7b
Author: Andy Ning <email address hidden>
Date: Wed Jun 17 10:21:32 2020 -0400

    Set up /var/run/sysinv as sysinv's default temp files location

    sysinv calls the k8s python client to perform a number of operations.
    The k8s python client creates temp files under /tmp and continues to
    use these tmp files for the life-cycle of the processes.

    However, systemd-tmpfiles-clean.service will run every day to clean up
    files in /tmp dir that are older than 10 days. If the k8s client code
    is not triggered for more than 10 days (thus its temp files are not
    accessed for more than 10 days), these temp files will be removed as
    part of the cleanup. Certain sysinv operations then start to fail with
    an error that the tmp file is no longer there.

    This is a known issue of kubernetes python client:
    https://github.com/kubernetes-client/python/issues/765

    The commit fixes this issue by setting TMPDIR to /var/run/sysinv when
    sm starts sysinv-conductor and sysinv-inv.

    Change-Id: I365d637abd080bd03b65758e4e8db9203d6bfa4d
    Closes-Bug: 1883599
    Signed-off-by: Andy Ning <email address hidden>
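
To see which files are at risk of an age-based cleanup like the one described in the commit message, something along these lines could be used. It is a simplified approximation, not the systemd-tmpfiles implementation: it treats the newest of atime, mtime and ctime as the "last used" time, and the function name `stale_files` and the 10-day default are illustrative assumptions.

```python
import os
import time

SECONDS_PER_DAY = 86400


def stale_files(directory, max_age_days=10, now=None):
    """Return paths of regular files in `directory` untouched for max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for entry in os.scandir(directory):
        if not entry.is_file(follow_symlinks=False):
            continue
        st = entry.stat(follow_symlinks=False)
        # Approximation: the newest of the three timestamps counts as "last used".
        last_used = max(st.st_atime, st.st_mtime, st.st_ctime)
        if last_used < cutoff:
            stale.append(entry.path)
    return sorted(stale)
```

Calling `stale_files("/tmp")` on a long-running controller would list candidates; the optional `now` argument exists only so the check can be exercised without waiting 10 days.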

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (master)

Change abandoned by Andy Ning (<email address hidden>) on branch: master
Review: https://review.opendev.org/736247
Reason: Abandoned as this change is not needed.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Need to re-open as there is an issue with the commit already merged. Need to revert and re-work the fix.

Changed in starlingx:
status: Fix Released → Confirmed
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/736761

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/736761
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=31533adf4252e76e25a3678d78559672df6d3729
Submitter: Zuul
Branch: master

commit 31533adf4252e76e25a3678d78559672df6d3729
Author: Andy Ning <email address hidden>
Date: Wed Jun 17 10:21:32 2020 -0400

    Set up /var/run/sysinv_tmp as sysinv's default temp files location

    sysinv calls the k8s python client to perform a number of operations.
    The k8s python client creates temp files under /tmp and continues to
    use these tmp files for the life-cycle of the processes.

    However, systemd-tmpfiles-clean.service will run every day to clean up
    files in /tmp dir that are older than 10 days. If the k8s client code
    is not triggered for more than 10 days (thus its temp files are not
    accessed for more than 10 days), these temp files will be removed as
    part of the cleanup. Certain sysinv operations then start to fail with
    an error that the tmp file is no longer there.

    This is a known issue of kubernetes python client:
    https://github.com/kubernetes-client/python/issues/765

    The commit fixes this issue by setting TMPDIR to /var/run/sysinv_tmp
    when sm starts sysinv-conductor and sysinv-inv.

    Change-Id: I8544272b2431607ed1041473c5da2eecb64635af
    Closes-Bug: 1883599
    Signed-off-by: Andy Ning <email address hidden>
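
The actual change sets TMPDIR in the way sm starts the services, which is not shown in this report. As an illustration only, the same idea (a child process inheriting a redirected TMPDIR) can be sketched in Python; `run_with_tmpdir` and the directory used are hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_with_tmpdir(argv, tmpdir):
    """Run `argv` with TMPDIR pointing at `tmpdir`; return the child's stdout."""
    os.makedirs(tmpdir, exist_ok=True)
    # Copy the parent's environment and override only TMPDIR for the child.
    env = dict(os.environ, TMPDIR=tmpdir)
    result = subprocess.run(
        argv, env=env, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Throwaway stand-in for a directory such as /var/run/sysinv_tmp.
    tmpdir = os.path.join(tempfile.mkdtemp(), "service_tmp")
    child = [sys.executable, "-c", "import tempfile; print(tempfile.gettempdir())"]
    print(run_with_tmpdir(child, tmpdir))
```

The child reports the redirected directory as its default temp location, so any temp files it creates through tempfile land outside /tmp.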

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

This appears to be an issue with dcmanager and the vim code as well. Re-opening to source additional fixes for these components.

Initially, it was mistakenly determined that the bug would not affect dcmanager or the VIM because they did not cache the kubernetes client (sysinv does cache the client). This is incorrect. Even though dcmanager and the VIM don’t cache the client, they call a function in the client module (kubernetes.config.load_kube_config) that creates a file under /tmp that then gets deleted after 10 days (see the upstream bug for details).
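
A simplified stand-in for the behaviour described above (not the client's actual code): a module-level cache maps content to a temp file path, so the path outlives any individual client object. If the file is removed externally, later callers still receive the stale path. All names below are hypothetical, and the second function shows one possible defensive variant.

```python
import os
import tempfile

_temp_files = {}  # content -> path, kept for the life of the process


def cached_temp_file(content):
    """Write `content` to a temp file once, then keep returning the same path."""
    if content not in _temp_files:
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        _temp_files[content] = path
    return _temp_files[content]


def cached_temp_file_checked(content):
    """Defensive variant: recreate the file if something removed it."""
    path = _temp_files.get(content)
    if path is not None and not os.path.isfile(path):
        # Drop the stale entry so the file is written again.
        del _temp_files[content]
    return cached_temp_file(content)
```

With the first function, deleting the file between two calls leaves the caller holding a path that no longer exists, which matches the "File does not exists" errors in the logs. The checked variant would hide the symptom, but relocating the temp directory addresses the cause.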

Changed in starlingx:
status: Fix Released → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (master)

Fix proposed to branch: master
Review: https://review.opendev.org/754417

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/736247
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=e8d916756d439cf63eed52f347a2a547fc5124cb
Submitter: Zuul
Branch: master

commit e8d916756d439cf63eed52f347a2a547fc5124cb
Author: Andy Ning <email address hidden>
Date: Wed Jun 17 10:40:23 2020 -0400

    Set up /var/run/dcmanager as dcmanager's default temp files location

    dcmanager calls the k8s python client to perform a number of
    operations. The k8s python client creates temp files under /tmp and
    continues to use these tmp files for the life-cycle of the processes.

    However, systemd-tmpfiles-clean.service will run every day to clean up
    files in /tmp dir that are older than 10 days. If the k8s client code
    is not triggered for more than 10 days (thus its temp files are not
    accessed for more than 10 days), these temp files will be removed as
    part of the cleanup. Certain dcmanager operations then start to fail
    with an error that the tmp file is no longer there.

    This is a known issue of kubernetes python client:
    https://github.com/kubernetes-client/python/issues/765

    The commit fixes this issue by setting TMPDIR to /var/run/dcmanager
    when sm starts dcmanager-manager.

    Change-Id: Ib147c2ab26e303032e18da51a506e3768bc471e0
    Closes-Bug: 1883599
    Signed-off-by: Andy Ning <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/754417
Committed: https://git.openstack.org/cgit/starlingx/nfv/commit/?id=43e6a9f768a9a4d7c23abe02d7dfb45a31befc6d
Submitter: Zuul
Branch: master

commit 43e6a9f768a9a4d7c23abe02d7dfb45a31befc6d
Author: Andy Ning <email address hidden>
Date: Fri Sep 25 10:48:11 2020 -0400

    Set up /var/run/nfv-vim as vim's default temp files location

    nfv vim calls the k8s python client to perform a number of
    operations. The k8s python client creates temp files under /tmp and
    continues to use these tmp files for the life-cycle of the processes.

    However, systemd-tmpfiles-clean.service will run every day to clean up
    files in /tmp dir that are older than 10 days. If the k8s client code
    is not triggered for more than 10 days (thus its temp files are not
    accessed for more than 10 days), these temp files will be removed as
    part of the cleanup. Certain vim operations then start to fail
    with an error that the tmp file is no longer there.

    This is a known issue of kubernetes python client:
    https://github.com/kubernetes-client/python/issues/765

    The commit fixes this issue by setting TMPDIR to /var/run/nfv-vim
    when sm starts vim.

    Change-Id: I4f0544055e9d10ba2374e9fdb5133d767c1fa2c3
    Closes-Bug: 1883599
    Signed-off-by: Andy Ning <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298
Reason: Updated merge soon

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/796528

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/796528
Committed: https://opendev.org/starlingx/distcloud/commit/4c5344f8765b372cb84d2b1181589c16db2ae6e4
Submitter: "Zuul (22348)"
Branch: f/centos8

commit cb979811017bd193fc1f06e53bb7830fd3184859
Author: Yuxing Jiang <email address hidden>
Date: Wed Jun 9 11:11:27 2021 -0400

    Format the IP addresses in payload before adding a subcloud

    The IPv6 addresses can be represented in multiple formats. As IP
    addresses are stored as text in database, ansible inventory and
    overrides, this commit converts the IP addresses in payload to
    standard text format of IPv6 address during adding a new subcloud.

    Tested with installing and bootstrapping a new subcloud(RVMC
    configured) with the correct IPv6 address values, but with
    unrecommended upper case letters and '0'. The addresses are
    converted to standard format in database, ansible inventory and
    overrides files.

    Partial-Bug: 1931459
    Signed-off-by: Yuxing Jiang <email address hidden>
    Change-Id: I6c26e749941f1ea2597f91886ad8f7da64521f0d

commit 2cf5d6d5cef0808c354f7575336aec34253993b3
Author: albailey <email address hidden>
Date: Thu May 20 14:19:24 2021 -0500

    Delete existing vim strategy from subcloud during patch orch

    When dcmanager creates a patch strategy, if a subcloud has an
    existing vim patch strategy, it will attempt to re-use
    that strategy during its patching phase, which may result in an
    error.

    This commit deletes the existing vim patch strategy in
    a subcloud, if it exists, so it can be re-created.
    If the strategy cannot be deleted, orchestration fails.

    Change-Id: Id35ef26ed3ddae6d71874fc6bac11df147f72323
    Closes-Bug: 1929221
    Signed-off-by: albailey <email address hidden>

commit 9e14c83f0162549a2a94cb8bc1e73dbc4f4d4887
Author: albailey <email address hidden>
Date: Tue Jun 1 14:37:14 2021 -0500

    Adding activation retry to upgrade orchestration

    When performing an activation, the keystone endpoints may not
    be accessible in the subcloud due to the asynchronous way that
    cert-mon can trigger a restart of keystone.

    This would have occasionally resulted in the upgrade activation
    failing to be initiated, and orchestration needing to be invoked
    again to resume.

    This 'hack' adds retries and sleeps to the initial
    activation action.

    Change-Id: Ic757521dec7bdc248a51a70b5463caafe7927360
    Partial-Bug: 1927550
    Signed-off-by: albailey <email address hidden>

commit bb604c0a9b872efd65fa45f1e2269995818c6262
Author: Tee Ngo <email address hidden>
Date: Thu May 27 22:17:16 2021 -0400

    Fix subcloud show --detail command related issues

    If the subcloud is offline, the command stalls and eventually returns
    the "ERROR (app)" output. If the subcloud is online, the oam_floating_ip
    info is excluded from the output when the subcloud id instead of subcloud
    name is specified.

    This commit fixes both of the above issues.

    Closes-Bug: 1929893
    Change-Id: I995591368564539b0e6af185b1adba2db73e0e46
    Sign...

tags: added: in-f-centos8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to config (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/801131

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/801131
Committed: https://opendev.org/starlingx/config/commit/e255896eea3a2c773be5293df1bad4ed6c9c4c51
Submitter: "Zuul (22348)"
Branch: master

commit e255896eea3a2c773be5293df1bad4ed6c9c4c51
Author: Kyle MacLeod <email address hidden>
Date: Fri Jul 16 10:42:19 2021 -0400

    Set cert-mon temp dir location to /var/run/cert-mon_tmp

    Redirect the k8s python client's use of /tmp to /var/run/cert-mon_tmp
    via setting TMPDIR

    This is a known issue of kubernetes python client:
    https://github.com/kubernetes-client/python/issues/765

    The fix is the same as for
    https://bugs.launchpad.net/starlingx/+bug/1883599
    See commit message there for more details.

    Related-Bug: 1883599
    Closes-Bug: 1936435

    Signed-off-by: Kyle MacLeod <email address hidden>
    Change-Id: I0e163bd1b4d5a19f07267dd4cd14bad1b8cb20bb
