apps failed to upload or apply due to armada pod not ready

Bug #1886429 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
During system installation, three apps (cert-manager, oidc-auth-apps and platform-integ-apps) failed to upload or apply because the armada pod was not ready.

Severity
--------
Major

Steps to Reproduce
------------------
Install a system and apply applications on it.

Expected Behavior
------------------
Applications apply successfully.

Actual Behavior
----------------
Applications end up in apply-failed or upload-failed status.

Reproducibility
---------------
Intermittent
~1/10; it seems to happen about once every two weeks.

System Configuration
--------------------
One node system
Two node system

Lab-name: AIO-Plus

Branch/Pull Time/Commit
-----------------------
2020-07-02_22-00-00

Last Pass
---------
2020-07-01_22-00-00

Timestamp/Logs
--------------
[2020-07-03 06:37:27,166] 314 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2020-07-03 06:37:28,608] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------+---------+-----------------------------------+----------------------------------------+-----------+---------------------------------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------+-----------------------------------+----------------------------------------+-----------+---------------------------------+
| cert-manager | 1.0-5 | cert-manager-manifest | certmanager-manifest.yaml | applied | completed |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-27 | oidc-auth-manifest | manifest.yaml | uploading | extracting application tar file |
| platform-integ-apps | 1.0-9 | platform-integration-manifest | manifest.yaml | uploading | extracting application tar file |

[2020-07-03 07:18:08,570] 314 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2020-07-03 07:18:09,969] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------+---------+-----------------------------------+-------------------------------------+---------------+----------------------------------------------------------------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------+-----------------------------------+-------------------------------------+---------------+----------------------------------------------------------------------------------------------------+
| cert-manager | 1.0-5 | cert-manager-manifest | certmanager-manifest.yaml | apply-failed | operation aborted, check logs for detail |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest. | applied | completed |
| | | | yaml | | |
| | | | | | |
| oidc-auth-apps | 1.0-27 | oidc-auth-manifest | manifest.yaml | upload-failed | Upload of application oidc-auth-apps (1.0-27) failed: Failed to validate application manifest. |
| platform-integ-apps | 1.0-9 | platform-integration-manifest | manifest.yaml | upload-failed | Upload of application platform-integ-apps (1.0-9) failed: Failed to validate application manifest. |
+--------------------------+---------+-----------------------------------+-------------------------------------+---------------+----------------------------------------------------------------------------------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list --nowrap
+----------+----------------------------+-------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+----------------------------+-------------------------------------+----------+----------------------------+
| 750.002 | Application Apply Failure | k8s_application=cert-manager | major | 2020-07-03T07:15:45.879778 |
| 750.001 | Application Upload Failure | k8s_application=oidc-auth-apps | warning | 2020-07-03T06:38:16.358362 |
| 750.001 | Application Upload Failure | k8s_application=platform-integ-apps | warning | 2020-07-03T06:37:45.114042 |
+----------+----------------------------+-------------------------------------+----------+----------------------------+

sysinv 2020-07-03 06:37:13.618 104562 INFO sysinv.conductor.kube_app [-] PluginHelper: Installing platform-integ-apps plugin /scratch/apps/platform-integ-apps/1.0-9/plugins/k8sapp_platform-1.0-py2.py3-none-any.whl to /opt/platform/helm/20.06/platform-integ-apps/1.0-9/plugins.
sysinv 2020-07-03 06:37:13.788 104562 ERROR sysinv.conductor.kube_app [-] Failed to copy /etc/build.info to armada-api-86587b4c76-5r5vk:/tmp, error: Unexpected error while running command.
Command: kubectl --kubeconfig /etc/kubernetes/admin.conf cp -n armada /etc/build.info armada-api-86587b4c76-5r5vk:/tmp --container armada-api
Exit code: 1
Stdout: ''
Stderr: 'error: unable to upgrade connection: container not found ("armada-api")\n'
sysinv 2020-07-03 06:37:13.789 104562 INFO sysinv.conductor.kube_app [-] Could not get Armada service : armada pod not ready

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces | grep arma
armada armada-api-86587b4c76-5r5vk 0/2 Unknown 0 20h

kubectl describe of the above armada pod:

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Normal Scheduled 20h default-scheduler Successfully assigned armada/armada-api-86587b4c76-5r5vk to controller-0
  Normal Pulled 20h kubelet, controller-0 Container image "registry.local:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1" already present on machine
  Normal Created 20h kubelet, controller-0 Created container init
  Normal Started 20h kubelet, controller-0 Started container init
  Normal Pulled 20h kubelet, controller-0 Container image "registry.local:9001/quay.io/airshipit/armada:8a1638098f88d92bf799ef4934abe569789b885e-ubuntu_bionic" already present on machine
  Normal Created 20h kubelet, controller-0 Created container armada-api
  Normal Started 20h kubelet, controller-0 Started container armada-api
  Normal Pulled 20h kubelet, controller-0 Container image "registry.local:9001/gcr.io/kubernetes-helm/tiller:v2.16.1" already present on machine
  Normal Created 20h kubelet, controller-0 Created container tiller
  Normal Started 20h kubelet, controller-0 Started container tiller
  Warning NodeAffinity 20h kubelet, controller-0 Predicate NodeAffinity failed
  Warning FailedMount 20h kubelet, controller-0 MountVolume.SetUp failed for volume "armada-etc" : failed to sync configmap cache: timed out waiting for the condition
  Warning FailedMount 20h kubelet, controller-0 MountVolume.SetUp failed for volume "armada-api-token-2jqvc" : failed to sync secret cache: timed out waiting for the condition
  Warning FailedMount 83m (x96 over 20h) kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[armada-etc armada-api-token-2jqvc pod-tmp pod-etc-armada]: timed out waiting for the condition
  Warning FailedMount 58m (x129 over 19h) kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[pod-tmp pod-etc-armada armada-etc armada-api-token-2jqvc]: timed out waiting for the condition
  Warning FailedMount 13m (x107 over 20h) kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[pod-etc-armada armada-etc armada-api-token-2jqvc pod-tmp]: timed out waiting for the condition
  Warning FailedMount 8m15s (x607 over 20h) kubelet, controller-0 MountVolume.SetUp failed for volume "armada-etc" : stat /var/lib/kubelet/pods/b391e965-011d-4cd2-b3fa-75520543390f/volumes/kubernetes.io~configmap/armada-etc: no such file or directory
  Warning FailedMount 3m30s (x252 over 20h) kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[armada-api-token-2jqvc pod-tmp pod-etc-armada armada-etc]: timed out waiting for the condition
Sat Jul 4 03:10:07 UTC 2020

Test Activity
-------------
installation

Revision history for this message
Peng Peng (ppeng) wrote :
tags: added: stx.retestneeded
Frank Miller (sensfan22)
Changed in starlingx:
importance: Undecided → Critical
importance: Critical → High
status: New → Confirmed
tags: added: stx.4.0 stx.containers
Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
Revision history for this message
Dan Voiculeasa (dvoicule) wrote :

One way to reproduce during B&R of AIO-DX:
Deploy AIO-DX, remove the armada label from node controller-0, then delete the armada pod; the armada pod moves to controller-1.
Take the backup. Do the restore up to and including the controller-0 unlock. Then the easiest way to reproduce is to repeatedly apply cert-manager, which will consistently fail with `Failed to copy /etc/build.info`.

> /usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py(3072)_start_armada_service()
-> if pod and pod.status.phase == 'Running':

print pods[0]
            'phase': 'Running',
            'pod_ip': '172.16.166.134',

armada armada-api-6d7cdc4555-nk4xh 2/2 Terminating 0 165m 172.16.166.134 controller-1 <none> <none>
armada armada-api-74cffccf8c-d48jd 2/2 Running 2 115m 172.16.192.102 controller-0 <none> <none>

At line 3063, `pod = pods[0]`: the logic only takes one pod into account.
But why isn't Kubernetes returning the phase `Terminating`? It seems to me like a k8s bug.
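
A minimal sketch (not the sysinv code) of why phase alone is misleading: a pod that is being deleted keeps status.phase == 'Running' until its containers actually exit, so deletion shows up in metadata.deletion_timestamp and readiness in the 'Ready' condition, and both need to be checked. The label selector value and function names below are illustrative assumptions; only the kubeconfig path comes from the logs above.

    from kubernetes import client, config

    config.load_kube_config('/etc/kubernetes/admin.conf')
    v1 = client.CoreV1Api()

    def is_usable(pod):
        # A Terminating pod still reports phase 'Running'; deletion_timestamp is the tell.
        if pod.metadata.deletion_timestamp is not None:
            return False
        if pod.status.phase != 'Running':
            return False
        ready = [c for c in (pod.status.conditions or []) if c.type == 'Ready']
        return bool(ready) and ready[0].status == 'True'

    # 'application=armada' is an assumed value for the ARMADA_APPLICATION selector used by sysinv.
    pods = v1.list_namespaced_pod('armada', label_selector='application=armada').items
    usable = [p for p in pods if is_usable(p)]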

Revision history for this message
Dan Voiculeasa (dvoicule) wrote :

Consider the case where there are multiple armada pods: one stuck in Terminating, one Ready.

Selecting the good pod [reference 1] instead of `pod = pods[0]` does not work on its own, because helmv2-cli is called later:

  helmv2-cli
  helmv2-cli[3839105] ERROR Could not find tiller listen port.

The Kubernetes Python client does not support the pod delete --force flag, so a kubectl call with --force must be made:
kubectl delete pods -n armada armada-pod-hash47 --force
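
A hedged sketch of what such a forced deletion could look like from Python, shelling out to kubectl as described above; the helper name and defaults are illustrative only and not part of sysinv.

    import subprocess

    def force_delete_pod(pod_name, namespace='armada',
                         kubeconfig='/etc/kubernetes/admin.conf'):
        # kubectl performs the forced delete, since the Python client path
        # used here has no equivalent of the --force flag.
        cmd = ['kubectl', '--kubeconfig', kubeconfig,
               'delete', 'pod', '-n', namespace, pod_name,
               '--grace-period=0', '--force']
        subprocess.check_call(cmd)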

[1]
diff --git a/sysinv/sysinv/sysinv/sysinv/conductor/kube_app.py b/sysinv/sysinv/sysinv/sysinv/conductor/kube_app.py
index 8f5f5cb0..03a07041 100644
--- a/sysinv/sysinv/sysinv/sysinv/conductor/kube_app.py
+++ b/sysinv/sysinv/sysinv/sysinv/conductor/kube_app.py
@@ -3027,6 +3027,24 @@ class ArmadaHelper(object):
                 return False
         return True

+    def _check_pod_ready_probe(self, pod):
+        """ Pod is of the form returned by self._kube.kube_get_pods_by_selector
+        Returns true if last probe shows the container is in `Ready` state
+        """
+        conditions = list(filter(lambda x: x.type == 'Ready', pod.status.conditions))
+        return conditions[0].status == 'True'
+
+    def _prefer_select_one_running_ready_pod(self, pods):
+        """ Find one running and ready pod.
+        Return found if one, otherwise first pod.
+        """
+        for pod in pods:
+            if pod.status.phase == 'Running' and \
+                    self._check_pod_ready_probe(pod):
+                return pod
+
+        return pods[0]
+
     def _start_armada_service(self):
         """Armada pod is managed by Kubernetes / Helm.
            This routine checks and waits for armada to be providing service.
@@ -3058,7 +3076,7 @@ class ArmadaHelper(object):
                     "application=%s" % ARMADA_APPLICATION, "")
                 if not pods:
                     raise RuntimeError('armada pod not found')
-                pod = pods[0]
+                pod = self._prefer_select_one_running_ready_pod(pods)

                 if pod and pod.status.phase != 'Running':
                     # Delete the pod, it should restart if it can
@@ -3067,7 +3085,8 @@ class ArmadaHelper(object):
                         LOG.warning("Pod %s/%s deletion unsuccessful...",
                             ARMADA_NAMESPACE, pod.metadata.name)

-                if pod and pod.status.phase == 'Running':
+                if pod and pod.status.phase == 'Running' and \
+                        self._check_pod_ready_probe(pod):
                     # Test that we can copy files into armada-api container
                     src = '/etc/build.info'
                     dest_dir = '{}:{}'.format(pod.metadata.name, '/tmp')
@@ -3081,6 +3100,16 @@ class ArmadaHelper(object):
                     else:
                         return True
                     return True
+                #
+                # elif pod and pod.status.phase == 'Running':
+                #     LOG.warning("Pod %s/%s running but not ready",
+                #         ARMADA_NAMESPACE, pod.metadata.name)
+                #
+                #     # Delete the pod, it should restart if it can
+                #     if n...


Revision history for this message
Frank Miller (sensfan22) wrote :

Lowering importance to medium as the issue is very intermittent.

Changed in starlingx:
importance: High → Medium
Revision history for this message
Frank Miller (sensfan22) wrote :

Note: This is related to the upversion to helm v3 done in stx.4.0.

Revision history for this message
Frank Miller (sensfan22) wrote :

As this is related to helm v3, assigning to Jim as he was the prime for the upversion to helm v3.

Changed in starlingx:
assignee: Bob Church (rchurch) → Jim Gauld (jgauld)
Revision history for this message
Jim Gauld (jgauld) wrote :

If there are lingering armada pods in a non-Running state (e.g., Unknown, Terminating), then Dan's proposed solution should work. Ideally we should also clean up stale armada pods, but that requires --force.

WORKAROUND: Forcibly delete all armada pods (note this is safe to do):

kubectl get pods -n armada | awk '{print $1}' | \
  xargs --no-run-if-empty -i{} kubectl delete pod -n armada {} --grace-period=0 --force
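
After the forced deletion, the armada-api deployment should recreate the pod; checking `kubectl get pods -n armada` until a fresh armada-api pod reaches 2/2 Running before re-applying the failed applications is a reasonable follow-up. (Note that the pipeline above also feeds the NAME header line to kubectl delete, which only produces a harmless "not found" error.)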

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
2020-07-29_00-00-00
sm-3

log uploaded at
https://files.starlingx.kube.cengn.ca/launchpad/1886429

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per the 2020-07-30 release meeting, this will be included in a 4.0 maintenance (mtce) release.

Ghada Khalil (gkhalil)
tags: added: not-yet-in-r-stx40
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Jim Gauld (jgauld) → Suvro Ghosh (suvr0)
Ghada Khalil (gkhalil)
tags: added: stx.5.0
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
SM-3
2020-09-03_00-00-00

log:
sysinv 2020-09-03 07:43:48.334 99598 ERROR sysinv.conductor.kube_app [-] Failed to copy /etc/build.info to armada-api-5ff859b99-c4bs5:/tmp, error: Unexpected error while running command.
Command: kubectl --kubeconfig /etc/kubernetes/admin.conf cp -n armada /etc/build.info armada-api-5ff859b99-c4bs5:/tmp --container armada-api
Exit code: 1
Stdout: ''
Stderr: 'Error from server: error dialing backend: dial tcp [abcd:204::2]:10250: connect: connection refused\n': RuntimeError: armada pod not ready

Jim Gauld (jgauld)
Changed in starlingx:
assignee: Suvro Ghosh (suvr0) → Jim Gauld (jgauld)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/752304

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/752305

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/752304
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=a4d7cb98329b95e3182597d19bc09eebe8de34b4
Submitter: Zuul
Branch: master

commit a4d7cb98329b95e3182597d19bc09eebe8de34b4
Author: Jim Gauld <email address hidden>
Date: Wed Sep 16 15:58:09 2020 -0400

    helmv2-cli armada pod running and ready check

    This modifies helmv2-cli check for armada pod readiness.
    There can be multiple kubernetes armada pods in the system, each
    in different condition. This now selects armada pods that are
    running and ready instead of just selecting the first found pod.

    i.e., status.phase 'Running' and status.conditions 'Ready=True'

    Change-Id: Ib998d60f3442835d9dbd8e93bf6aaa54ede2e218
    Closes-Bug: 1886429
    Signed-off-by: Jim Gauld <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/752305
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=e0963bff80f7e1202f19f109aa7f5ce0074972b1
Submitter: Zuul
Branch: master

commit e0963bff80f7e1202f19f109aa7f5ce0074972b1
Author: Jim Gauld <email address hidden>
Date: Wed Sep 16 15:26:09 2020 -0400

    sysinv kube_app armada pod running and ready check

    This modifies sysinv kube_app check for armada pod readiness.
    There can be multiple kubernetes armada pods in the system, each
    in different condition. This now prefers armada pods that are
    running and ready instead of just selecting the first found pod.

    i.e., status.phase 'Running' and status.conditions 'Ready=True'

    Have been able to recreate multiple armada pod scenarios
    where the previous check was insufficient, e.g.,
    - Manually delete armada pod, and re-apply application;
      see pod in 'Terminating' phase for some time,
      and new armada pod being created and eventually Running.
    - During B&R of AIO-DX. Deploy AIO-DX, remove armada label on
      node controller-0. Delete armada pod, armada pod moves to
      controller-1. Take backup. Do the restore up to including
      controller-0 unlock. Constantly re-apply cert-manager.
    - Change armada pod helm overrides to have 2 replicas, or scale
      armada pod to have 2 replicas.

    Note that kube_app.py uses helmv2-cli, which also requires
    checking for pod ready.

    Change-Id: I45523c8da36618b8ac465269ad95ea28429c03e7
    Partial-Bug: 1886429
    Depends-On: https://review.opendev.org/752304
    Signed-off-by: Jim Gauld <email address hidden>

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was not reproduced recently.

tags: removed: stx.retestneeded