pods do not get restarted in an AIO-DX system

Bug #1900920 reported by Steven Webster
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Douglas Henrique Koerich

Bug Description

Brief Description
-----------------

Pods in a k8s deployment, daemonset, etc. can be labeled with restart-on-reboot="true", which causes them to be restarted automatically after the worker manifest has completed in an AIO system. This label is primarily used for pods using SR-IOV interfaces, since such a pod will start coming up after the controller manifest is completed, but before the SR-IOV devices are bound to an appropriate driver.

In an AIO-DX system, however, the restart can fail to occur if no node selector has been set, because the query for labeled pods depends on a field selector specifying the host the recovery script is running on.

The restart will fail to occur if the script queries for labeled pods before the pod has been scheduled on the node the script is running on.
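The query in question resembles the following. This is a minimal sketch, not the actual k8s-pod-recovery code; the selector values are taken from the description above, and the helper name is hypothetical:

```shell
# Sketch of the labeled-pod query (hypothetical helper, not the actual
# k8s-pod-recovery script). The field selector restricts results to pods
# already scheduled on this host, so a pod that has not yet been
# scheduled here when the script runs is silently missed.
build_pod_query() {
    local node="$1"
    echo "kubectl get pods --all-namespaces --selector=restart-on-reboot=true --field-selector=spec.nodeName=${node}"
}

build_pod_query "controller-0"   # prints the kubectl command this sketch would run
```

If the pod is scheduled onto the node after this query runs, it never appears in the result set and is never restarted.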

Severity
--------
Minor: System/Feature is usable with a minor issue

Steps to Reproduce
------------------
- As part of a daemonset, label a pod with restart-on-reboot=true
- Ensure the pod cannot be scheduled on the other AIO-DX node (label, taint, etc)
- Reboot the node the pod is scheduled on and observe the k8s-pod-recovery logs in /var/log/daemon.log
- Observe no log specifying the pod has been recovered
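A daemonset along these lines can be used for the reproduction. All names and the image are hypothetical; the nodeSelector pins the pod to one AIO-DX node as the steps require:

```yaml
# Hypothetical DaemonSet fragment for reproducing the issue.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: restart-test                  # hypothetical name
spec:
  selector:
    matchLabels:
      app: restart-test
  template:
    metadata:
      labels:
        app: restart-test
        restart-on-reboot: "true"     # the label k8s-pod-recovery queries for
    spec:
      nodeSelector:
        kubernetes.io/hostname: controller-0   # pin to one AIO-DX node
      containers:
      - name: test
        image: busybox                # hypothetical image
        command: ["sleep", "infinity"]
```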

Expected Behavior
------------------
The pod should be recovered by the script

Actual Behavior
----------------
The pod may not be recovered by the script

Reproducibility
---------------
Intermittent (approximately 50% of attempts)

System Configuration
--------------------
AIO-DX

Branch/Pull Time/Commit
-----------------------
master 2020-10-20

Test Activity
-------------
Developer Testing

Workaround
----------
Use an init container with a few-second delay for the pod in question,
or
restart the pod manually.
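The init-container workaround amounts to a pod-spec fragment like the following (names and image are hypothetical):

```yaml
# Hypothetical pod-spec fragment: delay startup by a few seconds so the
# SR-IOV devices are bound before the main container comes up.
spec:
  initContainers:
  - name: startup-delay         # hypothetical name
    image: busybox
    command: ["sleep", "10"]
  containers:
  - name: app                   # the pod's normal container(s)
    image: my-sriov-app         # hypothetical image
```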

tags: added: stx.networking
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0
Changed in starlingx:
assignee: nobody → Cole Walker (cwalops)
Revision history for this message
Steven Webster (swebster-wr) wrote :
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Cole Walker (cwalops) → Douglas Henrique Koerich (dkoerich-wr)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Douglas Henrique Koerich (dkoerich-wr) wrote :

Following the steps indicated in the bug description above, it was possible to reproduce the issue in an AIO-DX environment, with the following timeline on the host where the pod(s) were scheduled:

t=0s: Finished controller manifest
t=8s: Started worker manifest
t=37s: Start of k8s-pod-recovery
t=38s: Finished worker manifest
t=63s: Created and started "restart-on-reboot"-labeled pod(s)
t=281s: Same labeled pod(s) verified as not restarted

The restart of the pod(s) is not performed because the query for the labeled pods to be recovered returns an empty set when k8s-pod-recovery is launched.

By deferring the handling of labeled pods until after they have reached a stable state, their restart is performed correctly:

t=0s: Finished controller manifest
t=9s: Started worker manifest
t=66s: Start of k8s-pod-recovery
t=67s: Finished worker manifest
t=73s: Created and started "restart-on-reboot"-labeled pod(s)
t=190s: Labeled pod(s) is(are) restarted
t=408s: New labeled pod(s) verified
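The deferral can be pictured as a retry loop that keeps re-running the query until it returns something. This is a sketch of the idea under assumed behavior, not the merged change; the helper name and timings are hypothetical:

```shell
# Sketch (not the merged change) of deferring labeled-pod handling:
# retry the query until it returns a non-empty result, or give up.
wait_for_labeled_pods() {
    local tries="$1"; shift      # max attempts; remaining args = query command
    local i=0 result
    while [ "$i" -lt "$tries" ]; do
        result="$("$@")"
        if [ -n "$result" ]; then
            printf '%s\n' "$result"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1                     # pods never appeared; skip recovery
}
```

With this shape, a pod that is scheduled onto the node a minute after the script starts is still picked up, at the cost of a bounded wait when no labeled pods exist at all.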

Revision history for this message
Douglas Henrique Koerich (dkoerich-wr) wrote :
Revision history for this message
Douglas Henrique Koerich (dkoerich-wr) wrote :

Tests were done with the additional waiting procedure for labeled pods; the results can be inspected in the attached daemon.log file.

Revision history for this message
Douglas Henrique Koerich (dkoerich-wr) wrote :

Pods used in the test are described by the test.yaml file attached.

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/integ/+/793754

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/793754
Committed: https://opendev.org/starlingx/integ/commit/a13966754d4e19423874ca31bf1533f057380c52
Submitter: "Zuul (22348)"
Branch: f/centos8

commit b310077093fd567944c6a46b7d0adcabe1f2b4b9
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 18:19:54 2021 +0300

    Fix resize of filesystems in puppet logical_volume

    After system reinstalls there is stale data on the disk
    and puppet fails when resizing, reporting some wrong filesystem
    types. In our case docker-lv was reported as drbd when
    it should have been xfs.

    This problem was solved in some cases e.g:
    when doing a live fs resize we wipe the last 10MB
    at the end of partition:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L146

    Our issue happened here:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L65
    Resize can happen at unlock when a bigger size is detected for the
    filesystem and the 'logical_volume' will resize it.
    To fix this we have to wipe the last 10MB of the partition after the
    'lvextend' cmd in the 'logical_volume' module.
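The wipe described in the quoted commit can be sketched as follows. This is a hypothetical helper illustrating the technique, not the puppet module change itself, and the device path handling is an assumption:

```shell
# Sketch of wiping the last 10MB of a device after lvextend, so stale
# filesystem signatures from a prior install are not misreported by
# resize tooling. Hypothetical helper, not the actual puppet change.
wipe_tail() {
    local dev="$1"
    local size_bytes wipe_mb seek_mb
    # blockdev works for block devices; fall back to stat for plain files
    size_bytes=$(blockdev --getsize64 "$dev" 2>/dev/null || stat -c %s "$dev")
    wipe_mb=10
    seek_mb=$(( size_bytes / 1024 / 1024 - wipe_mb ))
    dd if=/dev/zero of="$dev" bs=1M count="$wipe_mb" seek="$seek_mb" \
       conv=notrunc,fsync 2>/dev/null
}
```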

    Tested the following scenarios:

    B&R on SX with default sizes of filesystems and cgts-vg.

    B&R on SX with docker-lv of size 50G, backup-lv also 50G and
    cgts-vg with additional physical volumes:

    - name: cgts-vg
      physicalVolumes:
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    B&R on DX system with backup of size 70G and cgts-vg
    with additional physical volumes:

    physicalVolumes:
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 50
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 30
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
      type: disk

    Closes-Bug: 1926591
    Change-Id: I55ae6954d24ba32e40c2e5e276ec17015d9bba44
    Signed-off-by: Mihnea Saracin <email address hidden>

commit 3225570530458956fd642fa06b83360a7e4e2e61
Author: Mihnea Saracin <email address hidden>
Date: Thu May 20 14:33:58 2021 +0300

    Execute once the ceph services script on AIO

    The MTC client manages ceph services via ceph.sh which
    is installed on all node types in
    /etc/service.d/{controller,worker,storage}/ceph.sh

    Since the AIO controllers have both controller and worker
    personalities, the MTC client will execute the ceph script
    twice (/etc/service.d/worker/ceph.sh,
    /etc/service.d/controller/ceph.sh).
    This behavior will generate some issues.

    We fix this by exiting the ceph script if it is the one from
    /etc/services.d/worker on AIO systems.
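The guard amounts to something like the following. This is a hypothetical helper sketching the logic of the quoted commit, not the actual ceph.sh:

```shell
# Sketch of the guard described above (hypothetical helper): on an AIO
# node, only the controller copy of ceph.sh should proceed; the worker
# copy exits early so the services are not managed twice.
should_run_ceph() {
    local script_path="$1" nodetype="$2"
    case "$script_path" in
        */worker/ceph.sh) [ "$nodetype" != "aio" ] ;;  # worker copy: skip on AIO
        *) true ;;                                     # all other copies run
    esac
}
```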

    Closes-Bug: 1928934
    Change-Id: I3e4dc313cc3764f870b8f6c640a60338...

tags: added: in-f-centos8