AIO-SX: armada pod stuck in Unknown after host-lock/unlock

Bug #1928018 reported by Angie Wang
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Angie Wang

Bug Description

Brief Description
-----------------
After a reboot or lock/unlock of an AIO-SX, the Armada pod gets stuck in an Unknown state and does not recover.

This is the same issue as the following LPs, but here it impacts the Armada pod:
https://bugs.launchpad.net/starlingx/+bug/1874858
https://bugs.launchpad.net/starlingx/+bug/1893977

Severity
--------
Medium

Steps to Reproduce
------------------
Apply stx-openstack application to an AIO-SX
system host-lock controller-0
system host-unlock controller-0

Expected Behavior
------------------
All pods should recover and be in a ready/running state shortly after the controller recovers.

Actual Behavior
----------------
The Armada pod is stuck in an Unknown state.

Reproducibility
---------------
Intermittent - seen rarely

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
stx master

Timestamp/Logs
--------------
[2021-04-21 19:50:21,796] 314 DEBUG MainThread ssh.send :: Send 'kubectl get pod --all-namespaces --field-selector=status.phase=Running -o=wide | grep --color=never -v -E '([0-9])+/\1''
[2021-04-21 19:50:22,133] 436 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
armada armada-api-84f66996f6-ztjmv 0/2 Unknown 0 8h <none> controller-0 <none> <none>

  Warning FailedMount 105m kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[armada-etc armada-api-token-g846b pod-tmp pod-etc-armada]: timed out waiting for the condition
  Warning FailedMount 103m kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[pod-etc-armada armada-etc armada-api-token-g846b pod-tmp]: timed out waiting for the condition
  Warning FailedMount 97m kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[pod-tmp pod-etc-armada armada-etc armada-api-token-g846b]: timed out waiting for the condition
  Warning FailedMount 37m (x22 over 101m) kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[armada-api-token-g846b pod-tmp pod-etc-armada armada-etc]: timed out waiting for the condition
  Warning FailedMount 32m (x43 over 108m) kubelet, controller-0 MountVolume.SetUp failed for volume "armada-etc" : stat /var/lib/kubelet/pods/10faba32-eea1-4af5-91fa-7ce8072f7114/volumes/kubernetes.io~configmap/armada-etc: no such file or directory
  Warning FailedMount 18m kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[pod-tmp pod-etc-armada armada-etc armada-api-token-g846b]: timed out waiting for the condition
  Warning FailedMount 8m11s (x3 over 14m) kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[pod-etc-armada armada-etc armada-api-token-g846b pod-tmp]: timed out waiting for the condition
  Warning FailedMount 4m4s (x3 over 16m) kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[armada-etc armada-api-token-g846b pod-tmp pod-etc-armada]: timed out waiting for the condition
  Warning FailedMount 2m (x3 over 20m) kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[armada-api-token-g846b pod-tmp pod-etc-armada armada-etc]: timed out waiting for the condition
  Warning FailedMount 103s (x18 over 22m) kubelet, controller-0 MountVolume.SetUp failed for volume "armada-etc" : stat /var/lib/kubelet/pods/10faba32-eea1-4af5-91fa-7ce8072f7114/volumes/kubernetes.io~configmap/armada-etc: no such file or directory

Test Activity
-------------
Sanity

Workaround
----------
Delete the pod that is stuck in the Unknown state.
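
For example (the pod name is taken from the logs above; the force flags are a common way to clear a pod stuck in Unknown and are not prescribed by this report):

    kubectl -n armada delete pod armada-api-84f66996f6-ztjmv --force --grace-period=0

Its controller (a Deployment, judging by the pod name) then recreates the pod, which should come up Ready.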

Angie Wang (angiewang)
Changed in starlingx:
assignee: nobody → Angie Wang (angiewang)
Angie Wang (angiewang)
description: updated
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/790530
Committed: https://opendev.org/starlingx/integ/commit/03665ae745babb4524e2b9b9cc0f768eaf1e8781
Submitter: "Zuul (22348)"
Branch: master

commit 03665ae745babb4524e2b9b9cc0f768eaf1e8781
Author: Angie Wang <email address hidden>
Date: Mon May 10 18:54:07 2021 -0400

    Add armada namespace in k8s pod recovery

    Update the k8s pod recovery service to include the armada namespace
    so an Armada pod stuck in an unknown state after a host lock/unlock
    or reboot can be recovered by the service.

    Change-Id: Iacd92637a9b4fcaf4c0076e922e1bd739f69a584
    Closes-Bug: 1928018
    Signed-off-by: Angie Wang <email address hidden>
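
For context, a minimal sketch of the recovery flow the commit describes, assuming a bash implementation; the namespace list and loop below are illustrative and not the actual k8s-pod-recovery script:

    # Illustrative only: scan managed namespaces after the node recovers and
    # delete any pods left in an Unknown state so their controllers recreate
    # them. The fix effectively adds 'armada' to the managed namespace list.
    RECOVERY_NAMESPACES="kube-system armada"
    for ns in ${RECOVERY_NAMESPACES}; do
        kubectl get pods -n "${ns}" --no-headers 2>/dev/null \
            | awk '$3 == "Unknown" {print $1}' \
            | while read -r pod; do
                  kubectl delete pod -n "${ns}" "${pod}" --wait=false
              done
    done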

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.6.0 stx.containers
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/integ/+/793754

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/793754
Committed: https://opendev.org/starlingx/integ/commit/a13966754d4e19423874ca31bf1533f057380c52
Submitter: "Zuul (22348)"
Branch: f/centos8

commit b310077093fd567944c6a46b7d0adcabe1f2b4b9
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 18:19:54 2021 +0300

    Fix resize of filesystems in puppet logical_volume

    After system reinstalls there is stale data on the disk
    and puppet fails when resizing, reporting some wrong filesystem
    types. In our case docker-lv was reported as drbd when
    it should have been xfs.

    This problem was already solved in some cases, e.g. when doing a
    live fs resize we wipe the last 10MB at the end of the partition:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L146

    Our issue happened here:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L65
    Resize can happen at unlock when a bigger size is detected for the
    filesystem and the 'logical_volume' will resize it.
    To fix this we have to wipe the last 10MB of the partition after the
    'lvextend' cmd in the 'logical_volume' module.

    Tested the following scenarios:

    B&R on SX with default sizes of filesystems and cgts-vg.

    B&R on SX with docker-lv of size 50G, backup-lv also 50G and
    cgts-vg with additional physical volumes:

    - name: cgts-vg
      physicalVolumes:
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    B&R on DX system with backup of size 70G and cgts-vg
    with additional physical volumes:

    physicalVolumes:
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 50
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 30
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
      type: disk

    Closes-Bug: 1926591
    Change-Id: I55ae6954d24ba32e40c2e5e276ec17015d9bba44
    Signed-off-by: Mihnea Saracin <email address hidden>
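
As a rough illustration of the wipe described in the commit above (shell only, not the puppet 'logical_volume' provider; the device path and new size are made-up examples):

    # Extend the LV, then wipe its last 10MB so stale filesystem signatures
    # left over from a previous install do not survive the resize.
    LV=/dev/cgts-vg/docker-lv            # hypothetical device path
    lvextend -L 50G "${LV}"
    SIZE_MB=$(( $(blockdev --getsize64 "${LV}") / 1024 / 1024 ))
    dd if=/dev/zero of="${LV}" bs=1M count=10 seek=$(( SIZE_MB - 10 ))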

commit 3225570530458956fd642fa06b83360a7e4e2e61
Author: Mihnea Saracin <email address hidden>
Date: Thu May 20 14:33:58 2021 +0300

    Execute once the ceph services script on AIO

    The MTC client manages ceph services via ceph.sh which
    is installed on all node types in
    /etc/services.d/{controller,worker,storage}/ceph.sh

    Since the AIO controllers have both controller and worker
    personalities, the MTC client will execute the ceph script
    twice (/etc/service.d/worker/ceph.sh,
    /etc/service.d/controller/ceph.sh).
    This behavior will generate some issues.

    We fix this by exiting the ceph script if it is the one from
    /etc/services.d/worker on AIO systems.

    Closes-Bug: 1928934
    Change-Id: I3e4dc313cc3764f870b8f6c640a60338...
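
A rough sketch of the guard this commit describes (illustrative shell; the AIO detection via platform.conf is an assumption, not the actual ceph.sh code):

    # If this is the worker copy of the script and the node also carries the
    # controller personality (AIO), exit so ceph services are handled only
    # once, by the controller copy.
    if [[ "$0" == */worker/* ]] \
       && grep -q 'subfunction=.*controller.*worker' /etc/platform/platform.conf
    then
        exit 0
    fi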

tags: added: in-f-centos8