Large number of orphaned pod volumes

Bug #2027810 reported by Gleb Aronsky
Affects: StarlingX
Status: In Progress
Importance: Medium
Assigned to: Gleb Aronsky

Bug Description

Brief Description
-----------------
Large number of orphaned pod volumes on some worker nodes (Worker-11 and Worker-13).
This possibly causes an unexpected "NodeNotReady" state on the worker nodes.
After entering "NodeNotReady", the system cannot recover automatically.

Severity
--------
Critical

Steps to Reproduce
------------------
1. Worker-13 rebooted.

<Worker-13 kern.log>
2023-05-30T02:31:41.753 worker-13 kernel: notice [ 0.000000] Linux version 5.10.112-200.27.tis.el7.x86_64 (<email address hidden>) (gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3), GNU ld version 2.30-55.el7.2) #1 SMP PREEMPT Thu Feb 16 14:05:56 EST 2023
2023-05-30T02:31:41.753 worker-13 kernel: info [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.10.112-200.27.tis.el7.x86_64 root=UUID=4df4067a-05a7-44a7-84e7-64c50a48cf84 ro security_profile=standard module_blacklist=integrity,ima tboot=false crashkernel=512M biosdevname=0 console=tty0 iommu=pt usbcore.autosuspend=-1 selinux=0 enforcing=0 nmi_watchdog=panic,1 softlockup_panic=1 softdog.soft_panic=1 intel_iommu=on user_namespace.enable=1 intel_idle.max_cstate=0 LANG=en_US.UTF-8 hugepagesz=1G hugepages=60 hugepagesz=2M hugepages=0 default_hugepagesz=1G irqaffinity=1-6,13,16-18,25,28-55,57-62,69,72-74,81,84-111 isolcpus=7-12,14-15,19-24,26-27,63-68,70-71,75-80,82-83 rcu_nocbs=1-55,57-111 nohz_full=7-12,14-15,19-24,26-27,63-68,70-71,75-80,82-83 kthread_cpus=0,56 audit=0 audit_backlog_limit=8192 nopti nospectre_v2 nospectre_v1
2. Kubelet Server started

<Worker-13 daemon.log.1>
2023-05-30T02:33:08.120 worker-13 systemd[1]: info Starting Kubernetes Kubelet Server...
3. Kubelet process started

<Worker-13 daemon.log.1>
2023-05-30T02:33:15.221 worker-13 kubelet[43341]: info I0530 02:33:15.221448 43341 server.go:1190] "Started kubelet"
4. The orphaned pod volume messages started.

<Worker-13 daemon.log.1>
2023-05-30T02:33:17.431 worker-13 kubelet[43341]: info I0530 02:33:17.430994 43341 kubelet_getters.go:300] "Path does not exist" path="/var/lib/kubelet/pods/0016f912-d586-4291-9040-8e407313645f/volumes"
2023-05-30T02:33:17.431 worker-13 kubelet[43341]: info I0530 02:33:17.431011 43341 kubelet_getters.go:300] "Path does not exist" path="/var/lib/kubelet/pods/0016f912-d586-4291-9040-8e407313645f/volumes"
2023-05-30T02:33:17.431 worker-13 kubelet[43341]: info I0530 02:33:17.431452 43341 kubelet_volumes.go:140] "Cleaned up orphaned pod volumes dir" podUID=0016f912-d586-4291-9040-8e407313645f path="/var/lib/kubelet/pods/0016f912-d586-4291-9040-8e407313645f/volumes"
5. A "There were many similar errors. Turn up verbosity to see them." error was issued.

<Worker-13 daemon.log.1>
2023-05-30T02:33:17.482 worker-13 kubelet[43341]: info E0530 02:33:17.482942 43341 kubelet_volumes.go:225] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"0016f912-d586-4291-9040-8e407313645f\" found, but failed to rmdir() subpath at path /var/lib/kubelet/pods/0016f912-d586-4291-9040-8e407313645f/volume-subpaths/localtime-cuupoam/oamcuup/0: not a directory" numErrs=118
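
The error in step 5 shows why cleanup stalls: the leftover entry under volume-subpaths (here a file-based subpath mount, localtime-cuupoam) is a regular file, and kubelet's orphan cleanup tries to remove it with rmdir(), which only works on directories. A minimal sketch for spotting such leftovers on a node, assuming the standard /var/lib/kubelet layout (the function name is illustrative, not part of the fix):

```shell
#!/bin/sh
# Sketch only: list volume-subpath entries that are regular files under a
# kubelet state directory. kubelet's orphan cleanup calls rmdir() on these
# paths and fails with "not a directory", matching the error in step 5.
# Function name and default path are illustrative assumptions.
list_orphaned_subpath_files() {
    kubelet_dir="${1:-/var/lib/kubelet}"
    find "$kubelet_dir/pods" -path '*/volume-subpaths/*' -type f 2>/dev/null
}
```

On an affected worker this would print entries like the volume-subpaths/localtime-cuupoam/oamcuup/0 path from the error above.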

Expected Behavior
------------------
Orphaned pod volumes are cleaned up.

Actual Behavior
----------------
Orphaned pod volume files are not cleaned up.

Reproducibility
-----------------
100%

System Configuration
--------------------
<One node system, Two node system, Multi-node system, Dedicated storage, https, IPv4, IPv6 etc.>

Branch/Pull Time/Commit
-----------------------
Branch and the time when code was pulled or git commit or cengn load info

Last Pass
---------
Did this test scenario pass previously? If so, please indicate the load/pull time info of the last pass.
Use this section to also indicate if this is a new test scenario.

Timestamp/Logs
--------------
<Worker-13 daemon.log.1>
2023-05-30T02:33:17.480 worker-13 kubelet[43341]: info I0530 02:33:17.480675 43341 kubelet_getters.go:300] "Path does not exist" path="/var/lib/kubelet/pods/ef8363d3-0368-4c0d-8bf5-26ddd44e5f64/volumes"
2023-05-30T02:33:17.480 worker-13 kubelet[43341]: info I0530 02:33:17.480690 43341 kubelet_getters.go:300] "Path does not exist" path="/var/lib/kubelet/pods/ef8363d3-0368-4c0d-8bf5-26ddd44e5f64/volumes"
2023-05-30T02:33:17.481 worker-13 kubelet[43341]: info I0530 02:33:17.481132 43341 kubelet_volumes.go:140] "Cleaned up orphaned pod volumes dir" podUID=ef8363d3-0368-4c0d-8bf5-26ddd44e5f64 path="/var/lib/kubelet/pods/ef8363d3-0368-4c0d-8bf5-26ddd44e5f64/volumes"
2023-05-30T02:33:17.481 worker-13 kubelet[43341]: info I0530 02:33:17.481309 43341 kubelet_volumes.go:113] "Cleaned up orphaned volume from pod" podUID=f262085c-4a97-4e84-9bef-d407048338bf path="/var/lib/kubelet/pods/f262085c-4a97-4e84-9bef-d407048338bf/volumes/kubernetes.io~projected/kube-api-access-wzpgx"
2023-05-30T02:33:17.481 worker-13 kubelet[43341]: info I0530 02:33:17.481585 43341 kubelet_volumes.go:113] "Cleaned up orphaned volume from pod" podUID=f434ac8c-02dc-4c49-a49e-a7e08b82ed54 path="/var/lib/kubelet/pods/f434ac8c-02dc-4c49-a49e-a7e08b82ed54/volumes/kubernetes.io~projected/kube-api-access-g66q9"
2023-05-30T02:33:17.481 worker-13 kubelet[43341]: info I0530 02:33:17.481651 43341 kubelet_volumes.go:140] "Cleaned up orphaned pod volumes dir" podUID=f434ac8c-02dc-4c49-a49e-a7e08b82ed54 path="/var/lib/kubelet/pods/f434ac8c-02dc-4c49-a49e-a7e08b82ed54/volumes"
2023-05-30T02:33:17.482 worker-13 kubelet[43341]: info I0530 02:33:17.482197 43341 kubelet_volumes.go:113] "Cleaned up orphaned volume from pod" podUID=f8eb7eef-7395-4bde-b069-686a3ecc5b7d path="/var/lib/kubelet/pods/f8eb7eef-7395-4bde-b069-686a3ecc5b7d/volumes/kubernetes.io~projected/kube-api-access-pcf6x"
2023-05-30T02:33:17.482 worker-13 kubelet[43341]: info E0530 02:33:17.482942 43341 kubelet_volumes.go:225] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"0016f912-d586-4291-904

Test Activity
-------------

Workaround
----------

Revision history for this message
Gleb Aronsky (gleb-aronsky) wrote :
Changed in starlingx:
status: New → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.co
tags: added: stx.9.0 stx.containers
removed: stx.co
Changed in starlingx:
assignee: nobody → Gleb Aronsky (gleb-aronsky)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/893091

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/893091
Committed: https://opendev.org/starlingx/stx-puppet/commit/7d870177c6c01f5b47a33ef1cd7ce92c3c58694f
Submitter: "Zuul (22348)"
Branch: master

commit 7d870177c6c01f5b47a33ef1cd7ce92c3c58694f
Author: Gleb Aronsky <email address hidden>
Date: Tue Aug 29 09:03:14 2023 -0700

    Update kubelet system overrides on unlock

    Move generation of kubelet's systemd override file,
    kube-stx-override.conf, from platform::kubernetes::master::init
    to platform::kubernetes::configuration so that the file will
    be generated on every host unlock. This facilitates delivery
    of systemd service changes via patches to existing installs.

    This change is needed by bug 2027810 to ensure that the
    orphan volume cleanup script is executed as part of the systemd
    ExecStartPre kubelet service override.

    Test Plan:

    Pass: - Update the kube-stx-override.conf.erb file
           - Lock the host
           - Unlock the host
           - Verify that kube-stx-override.conf has been updated
           - Verify AIO-SX
           - Verify Standard config

    Partial-Bug: 2027810
    Change-Id: I3b496abc807bf75716d28079c62ef4700dcd3244
    Signed-off-by: Gleb Aronsky <email address hidden>
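
The commit above regenerates kube-stx-override.conf on every host unlock so that an ExecStartPre cleanup hook can be delivered by patch. For context, a systemd drop-in of this kind looks roughly like the following; the cleanup script path is an assumption for illustration, not the actual StarlingX file contents:

```ini
# /etc/systemd/system/kubelet.service.d/kube-stx-override.conf (sketch)
# ExecStartPre runs before kubelet starts, so leftover orphaned volume
# entries can be removed before kubelet attempts its own cleanup.
[Service]
ExecStartPre=/usr/local/sbin/kubelet-cleanup-orphaned-volumes.sh
```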

Revision history for this message
Ghada Khalil (gkhalil) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/896154

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/896154
Committed: https://opendev.org/starlingx/stx-puppet/commit/f426f5c67a22d9eedd941b6033e24cf2cb25f61a
Submitter: "Zuul (22348)"
Branch: master

commit f426f5c67a22d9eedd941b6033e24cf2cb25f61a
Author: Gleb Aronsky <email address hidden>
Date: Mon Sep 18 05:17:53 2023 -0700

    Update kubelet system overrides on unlock

    Add logic to the platform::kubernetes::configuration method
    to generate the kubelet's systemd override file. This
    change ensures the file is generated every time a host is
    unlocked. This facilitates delivery of systemd service changes
    via patches to existing installs.

    This change is needed by bug 2027810 to ensure that the
    orphan volume cleanup script is executed as part of the systemd
    ExecStartPre kubelet service override.

    This bug is an update for this reverted commit:
    https://review.opendev.org/c/starlingx/stx-puppet/+/895364

    Test Plan:

    Pass: - Update the kube-stx-override.conf.erb file
           - Lock the AIO-SX host
           - Unlock the AIO-SX host
           - Verify that kube-stx-override.conf has been updated
           - Verify AIO-SX fresh install
           - Verify Standard Duplex lock/unlock and
             verify that kube-stx-override.conf has been updated
           - Verify Standard Duplex Install

    Partial-Bug: 2027810
    Change-Id: I4e47bce634c21396acb2e5f1540cac0be3ed34ec
    Signed-off-by: Gleb Aronsky <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/898972

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Re-opening. There is still an active review linked to this LP. The original reviews that merged previously were later reverted.

Changed in starlingx:
status: Fix Released → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/898972
Committed: https://opendev.org/starlingx/stx-puppet/commit/88f8f06d393646ff310813ae9542ec4769c9647d
Submitter: "Zuul (22348)"
Branch: master

commit 88f8f06d393646ff310813ae9542ec4769c9647d
Author: Gleb Aronsky <email address hidden>
Date: Fri Oct 20 08:56:44 2023 -0700

    Update kubelet system overrides on unlock

    Add logic to the `platform::kubernetes::configuration` method
    to generate the kubelet's systemd override file. This
    change ensures the file is generated every time a host is
    unlocked. This facilitates delivery of systemd service changes
    via patches to existing installs.

    We only want to update on lock and unlock, so we need to
    check that the flag is_initial_k8s_config is not
    set before creating the resource in platform::kubernetes::configuration.
    This ensures that the file is only regenerated on host unlock and
    not during the initial installation, which is currently handled
    in platform::kubernetes::master::init.

    This change is needed by bug 2027810 to ensure that the
    orphan volume cleanup script is executed as part of the systemd
    ExecStartPre kubelet service override.

    This bug is an update for this reverted commit:
    https://review.opendev.org/c/starlingx/stx-puppet/+/896154

    Test Plan:
    PASS - Verify successful installation from an ISO on AIO-SX with
           the controller unlocked.
    PASS - Verify successful installation from an ISO on AIO-DX with
           the controllers unlocked.
    PASS - Verify successful installation from an ISO on STANDARD with
           the controllers unlocked.
    PASS - Verify that kube-stx-override.conf is updated on AIO-SX:
           - Update the kube-stx-override.conf.erb file.
           - Lock/Unlock the AIO-SX host.
           - Verify that kube-stx-override.conf has been updated.
    PASS - Verify that kube-stx-override.conf is updated on STANDARD:
           - Update the kube-stx-override.conf.erb file on compute-0.
           - Lock/Unlock compute-0 and verify that kube-stx-override.conf
             is updated.

    Partial-Bug: 2027810
    Change-Id: Id473fd0e2c807d1e9d1e3fdd707bc3e9e36688b1
    Signed-off-by: Gleb Aronsky <email address hidden>
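
The guard described in the commit above can be pictured as follows. This is a sketch under assumptions: the resource title, template path, and the way the is_initial_k8s_config flag is exposed are illustrative, not the actual stx-puppet code:

```puppet
# Sketch: only manage the override file when this is not the initial
# K8s configuration, so the initial install path (handled in
# platform::kubernetes::master::init) is left alone and the file is
# regenerated only on subsequent host unlocks.
if ! $is_initial_k8s_config {
  file { '/etc/systemd/system/kubelet.service.d/kube-stx-override.conf':
    ensure  => file,
    content => template('platform/kube-stx-override.conf.erb'),
  }
}
```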
