AIO-SX: OpenStack pods are stuck in Unknown/Init after lock/unlock and after reboot

Bug #1893977 reported by Frank Miller
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bob Church

Bug Description

Brief Description
-----------------
After a reboot or lock/unlock of an AIO-SX, some stx-openstack pods remain in an unknown or init state and do not recover.

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
Apply stx-openstack application to an AIO-SX
system host-lock controller-0
system host-unlock controller-0
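
For reference, the apply step above is normally done with the system CLI once the application tarball has been uploaded (shown as typically used; exact status checks may vary by release):

controller-0:~$ system application-apply stx-openstack
controller-0:~$ system application-list    # wait for stx-openstack to reach the 'applied' state before locking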

Expected Behavior
------------------
All pods should recover and be in a ready/running state shortly after the controller recovers.

Actual Behavior
----------------
One or more stx-openstack pods remain in an Unknown or Init state.

Reproducibility
---------------
Intermittent - seen rarely on some labs and 25-50% of the time on other labs.

System Configuration
--------------------
One node system (AIO-SX)

Branch/Pull Time/Commit
-----------------------
Any STX master branch load from an August build

Last Pass
---------
unknown

Timestamp/Logs
--------------
From a fairly recent test, here is an example of the pod states after an AIO-SX lock/unlock:
controller-0:~$ kubectl get pods --all-namespaces -o wide | grep -v -e Running -e Completed
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
monitor mon-elastic-services-8684f65895-nb8mg 0/1 Unknown 0 4d22h <none> controller-0 <none> <none>
monitor mon-elasticsearch-client-0 0/1 Unknown 0 4d22h <none> controller-0 <none> <none>
monitor mon-elasticsearch-data-0 0/1 Unknown 0 4d22h <none> controller-0 <none> <none>
monitor mon-elasticsearch-master-0 0/1 Unknown 0 4d22h <none> controller-0 <none> <none>
monitor mon-filebeat-42vd2 0/1 Init:CrashLoopBackOff 20 4d21h 172.16.192.67 controller-0 <none> <none>
monitor mon-kibana-7f6cfc6bb7-9lgkf 0/1 Unknown 0 4d22h <none> controller-0 <none> <none>
monitor mon-logstash-0 0/1 Unknown 0 4d1h <none> controller-0 <none> <none>
monitor mon-metricbeat-metrics-77fbfc68d6-756rd 0/1 Unknown 0 4d21h <none> controller-0 <none> <none>
monitor mon-metricbeat-vmvvh 0/1 Init:CrashLoopBackOff 25 4d21h 192.168.204.2 controller-0 <none> <none>
openstack cinder-api-5fdf48bf5d-h8rjv 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack cinder-backup-6548b6767-pfwjz 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack cinder-scheduler-65f4b69f66-7s8wx 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack cinder-volume-5d98966645-5d977 0/1 Init:0/4 0 122m 172.16.192.66 controller-0 <none> <none>
openstack cinder-volume-usage-audit-1595861100-k9rn5 0/1 Init:0/1 0 122m 172.16.192.112 controller-0 <none> <none>
openstack glance-api-5bfd4f599c-4s274 0/1 Init:0/3 0 122m 172.16.192.88 controller-0 <none> <none>
openstack heat-api-5b9598987f-8n9fb 0/1 Init:0/1 0 122m 172.16.192.78 controller-0 <none> <none>
openstack heat-cfn-679bc9cbfc-v8mf2 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack heat-engine-78fb44c4c6-gvvqv 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack heat-engine-cleaner-1595861100-kwhhx 0/1 Init:0/1 0 122m 172.16.192.81 controller-0 <none> <none>
openstack horizon-6d6dbcd779-vbrzp 0/1 Init:0/1 0 122m 172.16.192.119 controller-0 <none> <none>
openstack ingress-79d7f888cd-8hl67 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack ingress-error-pages-6554f75d57-ndjqd 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack keystone-api-cc7995bbf-rtwfk 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack libvirt-libvirt-controller-0-937646f6-6g5kg 0/1 StartError 1 4d23h 192.168.204.2 controller-0 <none> <none>
openstack mariadb-ingress-5d6c5b7944-gxzz7 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack mariadb-ingress-error-pages-598984c99f-cxz2n 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack mariadb-server-0 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack networking-avs-avr-agent-controller-0-937646f6-l297w 0/1 Unknown 0 4d23h 192.168.204.2 controller-0 <none> <none>
openstack networking-avs-avs-agent-controller-0-937646f6-7mkqd 0/1 Unknown 0 4d23h 192.168.204.2 controller-0 <none> <none>
openstack neutron-dhcp-agent-controller-0-937646f6-985kq 0/1 Unknown 0 4d23h 192.168.204.2 controller-0 <none> <none>
openstack neutron-metadata-agent-controller-0-937646f6-7q9pp 0/1 Init:0/2 0 122m 192.168.204.2 controller-0 <none> <none>
openstack neutron-server-58cb698cf-56j95 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack neutron-sriov-agent-controller-0-937646f6-54cr2 0/1 Unknown 0 4d23h 192.168.204.2 controller-0 <none> <none>
openstack nova-api-metadata-6545d5dddc-lcsqx 0/1 Unknown 1 4d23h <none> controller-0 <none> <none>
openstack nova-api-osapi-555c5474cd-bwcr4 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack nova-api-proxy-98497fbdf-dfdf5 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack nova-compute-controller-0-937646f6-w8rzf 0/2 Unknown 0 4d23h 192.168.204.2 controller-0 <none> <none>
openstack nova-conductor-6f6d9df696-sfqzx 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack nova-novncproxy-58fd88b78f-szpsq 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack nova-scheduler-69bf6574f7-92sfg 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack nova-service-cleaner-1595862000-8pbdw 0/1 Init:0/1 0 107m 172.16.192.74 controller-0 <none> <none>
openstack osh-openstack-memcached-memcached-85f5694d98-4jph9 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack osh-openstack-rabbitmq-rabbitmq-0 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>
openstack placement-api-575f9f9f8c-5sndm 0/1 Unknown 0 4d23h <none> controller-0 <none> <none>

Test Activity
-------------
System Testing

Workaround
----------
Delete the pods in the Unknown state, which causes them to be recreated and start back up.
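
For example (an illustrative command, not necessarily the exact one used), the stuck pods can be deleted (force-deleted if a normal delete hangs) so their controllers recreate them; repeat for the monitor namespace:

controller-0:~$ kubectl get pods -n openstack --no-headers | awk '$3 == "Unknown" {print $1}' | xargs -r kubectl -n openstack delete pod --force --grace-period=0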

Revision history for this message
Frank Miller (sensfan22) wrote :

Chris Friesen took a look and believes this is similar to https://bugs.launchpad.net/starlingx/+bug/1874858; however, I opened a new LP as this one is now impacting stx-openstack pods.

tags: added: stx.5.0
tags: added: stx.containers
Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Bob Church (rchurch)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/749634

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/749635

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/749637

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/749634
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=17c1b8894deeb973dfb29a5fcac9fd630591b649
Submitter: Zuul
Branch: master

commit 17c1b8894deeb973dfb29a5fcac9fd630591b649
Author: Robert Church <email address hidden>
Date: Wed Sep 2 00:59:44 2020 -0400

    Introduce k8s pod recovery service

    Add a recovery service, started by systemd on a host boot, that waits
    for pod transitions to stabilize and then takes corrective action for
    the following set of conditions:
    - Delete to restart pods stuck in an Unknown or Init:Unknown state for
      the 'openstack' and 'monitor' namespaces.
    - Delete to restart Failed pods stuck in a NodeAffinity state that occur
      in any namespace.
    - Delete to restart the libvirt pod in the 'openstack' namespace when
      any of its conditions (Initialized, Ready, ContainersReady,
      PodScheduled) are not True.

    This will only recover pods specific to the host where the service is
    installed.

    This service is installed on all controller types. There is currently no
    evidence that we need this on dedicated worker nodes.

    Each of these conditions should be evaluated after the next k8s
    component rebase to determine if any of these recovery actions can be
    removed.

    Change-Id: I0e304d1a2b0425624881f3b2d9c77f6568844196
    Closes-Bug: #1893977
    Signed-off-by: Robert Church <email address hidden>
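
As a rough illustration only (this is not the merged k8s-pod-recovery script; the namespaces and conditions come from the commit text above, while the use of jq and the application=libvirt label selector are assumptions), the recovery actions amount to kubectl operations like these, run on the local host once pod transitions have settled:

# Restart pods stuck in Unknown or Init:Unknown in the openstack and monitor namespaces (local host only)
for ns in openstack monitor; do
    kubectl get pods -n "$ns" --field-selector spec.nodeName="$(hostname)" --no-headers | \
        awk '$3 == "Unknown" || $3 == "Init:Unknown" {print $1}' | \
        xargs -r kubectl -n "$ns" delete pod --wait=false
done

# Restart Failed pods reporting a NodeAffinity reason, in any namespace
kubectl get pods --all-namespaces --field-selector status.phase=Failed -o json | \
    jq -r '.items[] | select(.status.reason == "NodeAffinity") | "\(.metadata.namespace) \(.metadata.name)"' | \
    while read -r ns pod; do kubectl -n "$ns" delete pod "$pod" --wait=false; done

# Restart the libvirt pod when any of its conditions is not True (assumes the chart's application=libvirt label)
kubectl get pods -n openstack -l application=libvirt -o json | \
    jq -r '.items[] | select([(.status.conditions // [])[].status] | any(. != "True")) | .metadata.name' | \
    xargs -r kubectl -n openstack delete pod --wait=false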

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/749637
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=35959f30c778bf56c2374686b1d4317c7bce56a5
Submitter: Zuul
Branch: master

commit 35959f30c778bf56c2374686b1d4317c7bce56a5
Author: Robert Church <email address hidden>
Date: Wed Sep 2 02:24:22 2020 -0400

    Remove NodeAffinity workaround from sysinv

    This functionality is now delivered as part of the k8s-pod-recovery
    service.

    Change-Id: Ie29b8ae4854b401aa500e95d8bd1e07dc19d0d20
    Partial-Bug: #1893977
    Depends-On: https://review.opendev.org/#/c/749634/
    Signed-off-by: Robert Church <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/749635
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=49e5911fdbbf6dce199758ad7f07f4cf124718ce
Submitter: Zuul
Branch: master

commit 49e5911fdbbf6dce199758ad7f07f4cf124718ce
Author: Robert Church <email address hidden>
Date: Wed Sep 2 02:13:55 2020 -0400

    Limit installation of k8s-pod-recovery service

    Only install the service on controllers

    Change-Id: Ib75b54770690fdce64907d19be1a40b788af547b
    Partial-Bug: #1893977
    Depends-On: https://review.opendev.org/#/c/749634/
    Signed-off-by: Robert Church <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/790530
