StarlingX

100.104 alarm raised after Lock/Reboot/Unlock Standby Controller operation

Bug #1964111 reported by Daniel Safta on 2022-03-08

This bug affects 2 people

Affects    Status        Importance  Assigned to   Milestone
StarlingX  Fix Released  Medium      Daniel Safta

Bug Description

Brief Description
-----------------
The 100.104 alarm (see below) was raised after a Lock/Reboot/Unlock operation on the Standby Controller (Controller-0 in this case).

Alarm 100.104, entity host=controller-0.filesystem=/:
+--------------------------------------+----------+------------------------------------------------------------------+--------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------+--------------------------------+----------+----------------------------+
| f137b20e-456e-406a-8f63-a63d29d581b9 | 100.104 | File System threshold exceeded ; threshold 80.00%, actual 89.11% | host=controller-0.filesystem=/ | major | 2021-08-01T04:41:45.285620 |
+--------------------------------------+----------+------------------------------------------------------------------+--------------------------------+----------+----------------------------+
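
For context, the reason text compares root filesystem usage with an 80% threshold. The following is a minimal Python sketch of that kind of check. The function names and the use of shutil.disk_usage are illustrative assumptions; the platform's own monitoring may compute the percentage differently.

```python
import shutil


def usage_percent(used_bytes: int, total_bytes: int) -> float:
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def exceeds_threshold(path: str = "/", threshold: float = 80.0) -> bool:
    """Return True when usage of the filesystem holding `path` is above
    `threshold` percent (hypothetical helper, not the platform's code)."""
    usage = shutil.disk_usage(path)
    return usage_percent(usage.used, usage.total) > threshold
```

With the values from the alarm above, a filesystem that is 89.11% full would be reported as exceeding the 80.00% threshold.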

Severity
--------
Major

Steps to Reproduce
------------------
Install an app that should react to an incoming SIGTERM from kubelet, then run:
testcases/functional/mtc/test_host_reboot.py::test_host_reboot[controller]

def test_host_reboot(host_type):

Verify that host-reboot on an unlocked node is rejected and that the host-reboot command works on locked hosts

Test Steps:
        - Attempt to host-reboot from the active controller
        - Verify that the host-reboot was rejected for unlocked hosts
        - lock host
        - host reboot
        - unlock host
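
The test steps above can be modeled with a small, self-contained Python sketch. This is a toy state model, not the actual test_host_reboot implementation: the FakeHost class, the HostRejected exception, and the step labels are all hypothetical names introduced for illustration.

```python
class HostRejected(Exception):
    """Raised when an operation is not allowed in the host's current state."""


class FakeHost:
    """Toy stand-in for a host: reboot is only accepted while locked."""

    def __init__(self, name):
        self.name = name
        self.locked = False

    def lock(self):
        self.locked = True

    def unlock(self):
        self.locked = False

    def reboot(self):
        if not self.locked:
            raise HostRejected("host-reboot rejected: %s is unlocked" % self.name)


def run_host_reboot_flow(host):
    """Walk through the listed steps and record what happened at each one."""
    steps = []
    try:
        host.reboot()
        steps.append("reboot-accepted-while-unlocked")
    except HostRejected:
        steps.append("reboot-rejected-while-unlocked")
    host.lock()
    steps.append("locked")
    host.reboot()
    steps.append("rebooted")
    host.unlock()
    steps.append("unlocked")
    return steps
```

Calling run_host_reboot_flow(FakeHost("controller-0")) records the rejected reboot first, followed by the lock, reboot, and unlock steps.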

Expected Behavior
------------------

No alarms are raised after the test execution and the pods stop properly.

Actual Behavior
----------------
The 100.104 alarm is raised after the test execution because a pod generates a huge coredump.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
DX

Branch/Pull Time/Commit
-----------------------
2021-06-09_18-58-11

Last Pass
---------
-

Timestamp/Logs
--------------
-

Test Activity
-------------
Regression testing

Workaround
----------
-

OpenStack Infra (hudson-openstack) wrote on 2022-03-08: Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/832519

Changed in starlingx:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2022-03-09: Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/832519
Committed: https://opendev.org/starlingx/integ/commit/f3c18b0f79e3b145d378474b24d861926dd61a13
Submitter: "Zuul (22348)"
Branch: master

commit f3c18b0f79e3b145d378474b24d861926dd61a13
Author: Daniel Safta <email address hidden>
Date: Wed Mar 9 06:36:13 2022 -0500

Add k8s container cleanup

    When executing a reboot/shutdown
    k8s pods are not receiving the SIGTERM
    signal which leads some of them to
    unexpected behaviour such as generating
    huge coredumps.

    There is an upstream issue regarding this:
    https://github.com/kubernetes/kubernetes/issues/107158
    The problem seems to be systemd related
    but this commit addresses the problem
    with a workaround.

    This commit introduces a new script that
    will clean up all the remaining pods and will
    be run after kubelet is stopped.

    The script is executed successfully when
    kubelet stops and the pods are stopped
    before the system shuts down.

    Closes-bug: 1964111
    Signed-off-by: Daniel Safta <email address hidden>
    Change-Id: Ia0376aa510dd0dc3983e16cd89840726c15d6c92

Changed in starlingx:
status:	In Progress → Fix Released

Ghada Khalil (gkhalil) on 2022-03-11

Changed in starlingx:
assignee:	nobody → Daniel Safta (dsafta)
importance:	Undecided → Medium
tags:	added: stx.7.0 stx.containers

Jim Gauld (jgauld) wrote on 2022-03-23:

Note that the fix does not always work as expected.
The new k8s-container-cleanup script is called by kubelet.service stop, but there is a service timing issue between kubelet.service and containerd.service. Since there is no enforced dependency between these services, the shutdown timing is not predictable. If containerd.service completes shutdown prior to kubelet.service, then the 'crictl' and 'crictl stop' commands cannot shut down all containers.

The standard error of the 'crictl' command was being captured in daemon.log indicating:
2022-03-21T18:04:38.590 localhost k8s-container-cleanup[190298]: info time="2022-03-21T18:04:38Z" level=fatal msg="connect: connect endpoint 'unix:///var/run/containerd/containerd.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded"
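
The failure above means the containerd socket was no longer accepting connections when crictl ran. A minimal Python sketch of a pre-check for that condition follows. The function name is hypothetical and the real k8s-container-cleanup script is not shown in this report; only the socket path is taken from the log line.

```python
import socket


def containerd_socket_reachable(
    path: str = "/var/run/containerd/containerd.sock",
    timeout: float = 2.0,
) -> bool:
    """Return True if a UNIX stream connection to `path` succeeds."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers a missing socket file, a refused connection, and timeouts.
        return False
    finally:
        sock.close()
```

A check like this only detects the race; it does not remove it, which is why the ordering of the cleanup relative to containerd shutdown matters.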

The recommended change is to add simple logging to the k8s-container-cleanup script, and to move this script to the ExecStop of containerd.service so that it is executed before containerd shuts down.

Ideally an ordering relationship should be added to these services so that containerd starts before kubelet, and containerd stops after kubelet (e.g., with an After/Before type service directive), but that is not required to solve this issue.
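
As an illustration of the ordering idea only (it is not part of the merged fix), the Python sketch below renders the text of a systemd drop-in that orders a unit after containerd.service. systemd stops units in the reverse of their start order, so applying After=containerd.service to kubelet.service would make kubelet start after containerd and stop before it. The function name and any drop-in file location are assumptions.

```python
def render_ordering_dropin(after_unit: str = "containerd.service") -> str:
    """Return the contents of a drop-in that orders a unit after `after_unit`."""
    return "[Unit]\nAfter={}\n".format(after_unit)


if __name__ == "__main__":
    # Print the drop-in text; where to install it is left to the reader.
    print(render_ordering_dropin(), end="")
```

The rendered text would be placed in a drop-in directory for kubelet.service, followed by a systemctl daemon-reload.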

OpenStack Infra (hudson-openstack) wrote on 2022-03-23: Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/834973

OpenStack Infra (hudson-openstack) wrote on 2022-03-23: Fix proposed to config-files (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config-files/+/834979

Ghada Khalil (gkhalil) wrote on 2022-03-24:

Re-opening as additional fixes are required

Changed in starlingx:
status:	Fix Released → In Progress

OpenStack Infra (hudson-openstack) wrote on 2022-04-14: Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/834973
Committed: https://opendev.org/starlingx/integ/commit/169a0c0ee3416d0c082abeca59639cc479cf1866
Submitter: "Zuul (22348)"
Branch: master

commit 169a0c0ee3416d0c082abeca59639cc479cf1866
Author: Jim Gauld <email address hidden>
Date: Wed Mar 23 16:51:14 2022 -0400

Move k8s container cleanup to containerd service

    This introduces k8s-container-cleanup script that will be called
    when containerd.service is stopped. The script detects whether systemd
    state is 'stopping' due to shutdown/reboot, then stops all running
    containers before the service shuts down.

    During shutdown/reboot, some containers are not receiving the
    SIGTERM signal. This leads to unexpected behaviour such as
    generating huge coredumps.

    There is an upstream issue regarding this:
    https://github.com/kubernetes/kubernetes/issues/107158
    The problem seems to be systemd related but this commit
    addresses the problem with a workaround.

    This reverts commit f3c18b0f79e3b145d378474b24d861926dd61a13.
    The k8s-container-cleanup script is moved from kubelet.service
    to containerd.service. The ExecStopPost that calls this script
    is removed, and replaced with ExecStop in containerd.service
    to call the script (in config-files repo).

    The k8s-container-cleanup script requires that containerd is running
    in order to use the crictl utility. The shutdown of kubelet and
    containerd has unpredictable timing, so the cleanup must be done
    in containerd.

    Test Plan: On AIO-SX
    PASS: Verify k8s-container-cleanup logs to daemon.log during 'stopping'.
    PASS: Manually change containerd/kubelet shutdown timing and verify
          k8s-container-cleanup running to completion before containerd stopped.
    PASS: Reboot and verify k8s-container-cleanup running to completion.
    PASS: Lock/unlock and verify k8s-container-cleanup running to completion.
    PASS: Manually run spellintian tool against k8s-container-cleanup.sh.
    PASS: Manually run shellcheck tool against k8s-container-cleanup.sh.
    PASS: Zuul tox bashate tool against k8s-container-cleanup.sh.

    Partial-Bug: 1964111
    Change-Id: Ic8a9e257f861ae218a8520205eced3eaa580dd20
    Signed-off-by: Jim Gauld <email address hidden>
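
The behavior described in this commit message (detect that systemd is in the 'stopping' state, then stop all running containers) can be sketched in Python as follows. This is a hedged approximation, not the actual k8s-container-cleanup shell script, whose contents are not shown in this report. The function names and the injectable runner are assumptions; the sketch relies on `systemctl is-system-running`, `crictl ps -q`, and `crictl stop`.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], str]


def run_command(argv: List[str]) -> str:
    """Run a command and return its stdout, ignoring the exit status
    (systemctl is-system-running exits non-zero for most states)."""
    result = subprocess.run(argv, check=False, capture_output=True, text=True)
    return result.stdout


def system_is_stopping(run: Runner = run_command) -> bool:
    """Return True when systemd reports the overall state 'stopping'."""
    return run(["systemctl", "is-system-running"]).strip() == "stopping"


def stop_all_containers(run: Runner = run_command) -> List[str]:
    """Stop every running container, but only during shutdown/reboot.

    Returns the list of container IDs that were passed to 'crictl stop'.
    """
    if not system_is_stopping(run):
        return []
    container_ids = run(["crictl", "ps", "-q"]).split()
    if container_ids:
        run(["crictl", "stop", *container_ids])
    return container_ids
```

Passing the runner as a parameter keeps the logic testable without systemd or containerd being present; in real use the default runner shells out to the two tools.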

Changed in starlingx:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2022-04-14: Fix merged to config-files (master)

Reviewed: https://review.opendev.org/c/starlingx/config-files/+/834979
Committed: https://opendev.org/starlingx/config-files/commit/e7b7dbad7c8820af96bfc284b74e9e9c08df34ed
Submitter: "Zuul (22348)"
Branch: master

commit e7b7dbad7c8820af96bfc284b74e9e9c08df34ed
Author: Jim Gauld <email address hidden>
Date: Wed Mar 23 18:50:31 2022 -0400

Add k8s container cleanup to containerd service

    This adds ExecStop=/usr/local/sbin/k8s-container-cleanup
    to containerd.service. This will execute the container
    cleanup before containerd.service is stopped.

    Depends-On: https://review.opendev.org/c/starlingx/integ/+/834973
    Closes-Bug: 1964111

    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: I4cd585b9fae630a278e830057cf71496fdf41007

OpenStack Infra (hudson-openstack) wrote on 2022-04-14: Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/838039

OpenStack Infra (hudson-openstack) wrote on 2022-04-15: Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/838075

OpenStack Infra (hudson-openstack) wrote on 2022-07-06: Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/838039
Committed: https://opendev.org/starlingx/integ/commit/fcd8b87c030fe779a130d6257271cb8841cac10c
Submitter: "Zuul (22348)"
Branch: master

commit fcd8b87c030fe779a130d6257271cb8841cac10c
Author: Jim Gauld <email address hidden>
Date: Thu Apr 14 15:54:35 2022 -0400

Debian: Enable containerd package customization

    This provides the original Debian containerd package files:
    rules, containerd.install. These files are contained within
    the tarball: containerd-debian-1.4.12_ds1-1.tar.gz .

    Subsequent changes to these are package customizations.

    Test Plan: Debian
    PASS: Build Debian containerd package

    Partial-Bug: 1964111

    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: Icf5356c94b64b2c786ee988ad34cdd0a6e25c915

OpenStack Infra (hudson-openstack) wrote on 2022-07-06: Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/838075
Committed: https://opendev.org/starlingx/integ/commit/c1b1d85a9321cda634968931788d10f840c89615
Submitter: "Zuul (22348)"
Branch: master

commit c1b1d85a9321cda634968931788d10f840c89615
Author: Jim Gauld <email address hidden>
Date: Fri Apr 15 02:28:55 2022 +0000

Debian: containerd package customization with k8s-container-cleanup

    This provides the Debian containerd package changes to include
    the k8s-container-cleanup script.

    Test Plan: Debian:
    PASS: Build containerd package
    PASS: Build image
    PASS: Install ISO for AIO-SX
    PASS: Reboot host, verify we get daemon.log:
          k8s-container-cleanup(283049): info : Stopping all containers.

    Closes-Bug: 1964111

    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: I56170b98cf32c2e7e51b1c35779305a90cdc6db8
