apps failed to upload or apply due to armada pod not ready

Bug #1886429 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
During system installation, three apps (cert-manager, oidc-auth-apps and platform-integ-apps) failed to upload or apply because the armada pod was not ready.

Severity
--------
Major

Steps to Reproduce
------------------
Install a system and apply applications on it.

Expected Behavior
------------------
Applications apply successfully.

Actual Behavior
----------------
Applications end up in apply-failed or upload-failed status.

Reproducibility
---------------
Intermittent
~1/10; it seems to happen about once every two weeks.

System Configuration
--------------------
One node system
Two node system

Lab-name: AIO-Plus

Branch/Pull Time/Commit
-----------------------
2020-07-02_22-00-00

Last Pass
---------
2020-07-01_22-00-00

Timestamp/Logs
--------------
[2020-07-03 06:37:27,166] 314 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2020-07-03 06:37:28,608] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------+---------+-----------------------------------+----------------------------------------+-----------+---------------------------------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------+-----------------------------------+----------------------------------------+-----------+---------------------------------+
| cert-manager | 1.0-5 | cert-manager-manifest | certmanager-manifest.yaml | applied | completed |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-27 | oidc-auth-manifest | manifest.yaml | uploading | extracting application tar file |
| platform-integ-apps | 1.0-9 | platform-integration-manifest | manifest.yaml | uploading | extracting application tar file |

[2020-07-03 07:18:08,570] 314 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2020-07-03 07:18:09,969] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------+---------+-----------------------------------+-------------------------------------+---------------+----------------------------------------------------------------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------+-----------------------------------+-------------------------------------+---------------+----------------------------------------------------------------------------------------------------+
| cert-manager | 1.0-5 | cert-manager-manifest | certmanager-manifest.yaml | apply-failed | operation aborted, check logs for detail |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest. | applied | completed |
| | | | yaml | | |
| | | | | | |
| oidc-auth-apps | 1.0-27 | oidc-auth-manifest | manifest.yaml | upload-failed | Upload of application oidc-auth-apps (1.0-27) failed: Failed to validate application manifest. |
| platform-integ-apps | 1.0-9 | platform-integration-manifest | manifest.yaml | upload-failed | Upload of application platform-integ-apps (1.0-9) failed: Failed to validate application manifest. |
+--------------------------+---------+-----------------------------------+-------------------------------------+---------------+----------------------------------------------------------------------------------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list --nowrap
+----------+----------------------------+-------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+----------------------------+-------------------------------------+----------+----------------------------+
| 750.002 | Application Apply Failure | k8s_application=cert-manager | major | 2020-07-03T07:15:45.879778 |
| 750.001 | Application Upload Failure | k8s_application=oidc-auth-apps | warning | 2020-07-03T06:38:16.358362 |
| 750.001 | Application Upload Failure | k8s_application=platform-integ-apps | warning | 2020-07-03T06:37:45.114042 |
+----------+----------------------------+-------------------------------------+----------+----------------------------+

sysinv 2020-07-03 06:37:13.618 104562 INFO sysinv.conductor.kube_app [-] PluginHelper: Installing platform-integ-apps plugin /scratch/apps/platform-integ-apps/1.0-9/plugins/k8sapp_platform-1.0-py2.py3-none-any.whl to /opt/platform/helm/20.06/platform-integ-apps/1.0-9/plugins.
sysinv 2020-07-03 06:37:13.788 104562 ERROR sysinv.conductor.kube_app [-] Failed to copy /etc/build.info to armada-api-86587b4c76-5r5vk:/tmp, error: Unexpected error while running command.
Command: kubectl --kubeconfig /etc/kubernetes/admin.conf cp -n armada /etc/build.info armada-api-86587b4c76-5r5vk:/tmp --container armada-api
Exit code: 1
Stdout: ''
Stderr: 'error: unable to upgrade connection: container not found ("armada-api")\n'
sysinv 2020-07-03 06:37:13.789 104562 INFO sysinv.conductor.kube_app [-] Could not get Armada service : armada pod not ready

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces | grep arma
armada armada-api-86587b4c76-5r5vk 0/2 Unknown 0 20h

kubectl describe of the above armada pod:

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Normal Scheduled 20h default-scheduler Successfully assigned armada/armada-api-86587b4c76-5r5vk to controller-0
  Normal Pulled 20h kubelet, controller-0 Container image "registry.local:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1" already present on machine
  Normal Created 20h kubelet, controller-0 Created container init
  Normal Started 20h kubelet, controller-0 Started container init
  Normal Pulled 20h kubelet, controller-0 Container image "registry.local:9001/quay.io/airshipit/armada:8a1638098f88d92bf799ef4934abe569789b885e-ubuntu_bionic" already present on machine
  Normal Created 20h kubelet, controller-0 Created container armada-api
  Normal Started 20h kubelet, controller-0 Started container armada-api
  Normal Pulled 20h kubelet, controller-0 Container image "registry.local:9001/gcr.io/kubernetes-helm/tiller:v2.16.1" already present on machine
  Normal Created 20h kubelet, controller-0 Created container tiller
  Normal Started 20h kubelet, controller-0 Started container tiller
  Warning NodeAffinity 20h kubelet, controller-0 Predicate NodeAffinity failed
  Warning FailedMount 20h kubelet, controller-0 MountVolume.SetUp failed for volume "armada-etc" : failed to sync configmap cache: timed out waiting for the condition
  Warning FailedMount 20h kubelet, controller-0 MountVolume.SetUp failed for volume "armada-api-token-2jqvc" : failed to sync secret cache: timed out waiting for the condition
  Warning FailedMount 83m (x96 over 20h) kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[armada-etc armada-api-token-2jqvc pod-tmp pod-etc-armada]: timed out waiting for the condition
  Warning FailedMount 58m (x129 over 19h) kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[pod-tmp pod-etc-armada armada-etc armada-api-token-2jqvc]: timed out waiting for the condition
  Warning FailedMount 13m (x107 over 20h) kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[pod-etc-armada armada-etc armada-api-token-2jqvc pod-tmp]: timed out waiting for the condition
  Warning FailedMount 8m15s (x607 over 20h) kubelet, controller-0 MountVolume.SetUp failed for volume "armada-etc" : stat /var/lib/kubelet/pods/b391e965-011d-4cd2-b3fa-75520543390f/volumes/kubernetes.io~configmap/armada-etc: no such file or directory
  Warning FailedMount 3m30s (x252 over 20h) kubelet, controller-0 Unable to attach or mount volumes: unmounted volumes=[armada-etc], unattached volumes=[armada-api-token-2jqvc pod-tmp pod-etc-armada armada-etc]: timed out waiting for the condition
Sat Jul 4 03:10:07 UTC 2020

Test Activity
-------------
installation

Revision history for this message
Peng Peng (ppeng) wrote :
tags: added: stx.retestneeded
Frank Miller (sensfan22)
Changed in starlingx:
importance: Undecided → Critical
importance: Critical → High
status: New → Confirmed
tags: added: stx.4.0 stx.containers
Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
Revision history for this message
Dan Voiculeasa (dvoicule) wrote :

One way to reproduce during B&R of AIO-DX:
Deploy AIO-DX, remove the armada label from node controller-0, then delete the armada pod; the armada pod moves to controller-1.
Take the backup. Do the restore up to and including the controller-0 unlock. Then the easiest way to reproduce is to repeatedly apply cert-manager, which will consistently fail with `Failed to copy /etc/build.info`.

> /usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py(3072)_start_armada_service()
-> if pod and pod.status.phase == 'Running':

print pods[0]
            'phase': 'Running',
            'pod_ip': '172.16.166.134',

armada armada-api-6d7cdc4555-nk4xh 2/2 Terminating 0 165m 172.16.166.134 controller-1 <none> <none>
armada armada-api-74cffccf8c-d48jd 2/2 Running 2 115m 172.16.192.102 controller-0 <none> <none>

At line 3063, `pod = pods[0]`: the logic only takes one pod into account.
But why isn't Kubernetes returning the phase `Terminating`? It seems to me like a k8s bug.
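
A minimal sketch (not the sysinv code) of why phase alone is misleading: a pod that is being deleted keeps status.phase == 'Running' until its containers actually exit, so deletion shows up in metadata.deletion_timestamp and readiness in the 'Ready' condition, and both need to be checked. The label selector value and function names below are illustrative assumptions; only the kubeconfig path comes from the logs above.

    from kubernetes import client, config

    config.load_kube_config('/etc/kubernetes/admin.conf')
    v1 = client.CoreV1Api()

    def is_usable(pod):
        # A Terminating pod still reports phase 'Running'; deletion_timestamp is the tell.
        if pod.metadata.deletion_timestamp is not None:
            return False
        if pod.status.phase != 'Running':
            return False
        ready = [c for c in (pod.status.conditions or []) if c.type == 'Ready']
        return bool(ready) and ready[0].status == 'True'

    # 'application=armada' is an assumed value for the ARMADA_APPLICATION selector used by sysinv.
    pods = v1.list_namespaced_pod('armada', label_selector='application=armada').items
    usable = [p for p in pods if is_usable(p)]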

Revision history for this message
Dan Voiculeasa (dvoicule) wrote :

Consider the case where there are multiple armada pods: one stuck in Terminating, one Ready.

Selecting the good pod [reference 1] instead of `pod = pods[0]` does not work on its own, because helmv2-cli is called later:

  helmv2-cli
  helmv2-cli[3839105] ERROR Could not find tiller listen port.

The Kubernetes Python client does not support the pod delete --force flag, so a kubectl call with --force must be made:
kubectl delete pods -n armada armada-pod-hash47 --force
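
A hedged sketch of what such a forced deletion could look like from Python, shelling out to kubectl as described above; the helper name and defaults are illustrative only and not part of sysinv.

    import subprocess

    def force_delete_pod(pod_name, namespace='armada',
                         kubeconfig='/etc/kubernetes/admin.conf'):
        # kubectl performs the forced delete, since the Python client path
        # used here has no equivalent of the --force flag.
        cmd = ['kubectl', '--kubeconfig', kubeconfig,
               'delete', 'pod', '-n', namespace, pod_name,
               '--grace-period=0', '--force']
        subprocess.check_call(cmd)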

[1]
diff --git a/sysinv/sysinv/sysinv/sysinv/conductor/kube_app.py b/sysinv/sysinv/sysinv/sysinv/conductor/kube_app.py
index 8f5f5cb0..03a07041 100644
--- a/sysinv/sysinv/sysinv/sysinv/conductor/kube_app.py
+++ b/sysinv/sysinv/sysinv/sysinv/conductor/kube_app.py
@@ -3027,6 +3027,24 @@ class ArmadaHelper(object):
                 return False
         return True

+    def _check_pod_ready_probe(self, pod):
+        """ Pod is of the form returned by self._kube.kube_get_pods_by_selector
+        Returns true if last probe shows the container is in `Ready` state
+        """
+        conditions = list(filter(lambda x: x.type == 'Ready', pod.status.conditions))
+        return conditions[0].status == 'True'
+
+    def _prefer_select_one_running_ready_pod(self, pods):
+        """ Find one running and ready pod.
+        Return found if one, otherwise first pod.
+        """
+        for pod in pods:
+            if pod.status.phase == 'Running' and \
+                    self._check_pod_ready_probe(pod):
+                return pod
+
+        return pods[0]
+
     def _start_armada_service(self):
         """Armada pod is managed by Kubernetes / Helm.
            This routine checks and waits for armada to be providing service.
@@ -3058,7 +3076,7 @@ class ArmadaHelper(object):
                     "application=%s" % ARMADA_APPLICATION, "")
                 if not pods:
                     raise RuntimeError('armada pod not found')
-                pod = pods[0]
+                pod = self._prefer_select_one_running_ready_pod(pods)

                 if pod and pod.status.phase != 'Running':
                     # Delete the pod, it should restart if it can
@@ -3067,7 +3085,8 @@ class ArmadaHelper(object):
                         LOG.warning("Pod %s/%s deletion unsuccessful...",
                             ARMADA_NAMESPACE, pod.metadata.name)

-                if pod and pod.status.phase == 'Running':
+                if pod and pod.status.phase == 'Running' and \
+                        self._check_pod_ready_probe(pod):
                     # Test that we can copy files into armada-api container
                     src = '/etc/build.info'
                     dest_dir = '{}:{}'.format(pod.metadata.name, '/tmp')
@@ -3081,6 +3100,16 @@ class ArmadaHelper(object):
                     else:
                         return True
                     return True
+                #
+                # elif pod and pod.status.phase == 'Running':
+                #     LOG.warning("Pod %s/%s running but not ready",
+                #         ARMADA_NAMESPACE, pod.metadata.name)
+                #
+                #     # Delete the pod, it should restart if it can
+                #     if n...


Revision history for this message
Frank Miller (sensfan22) wrote :

Lowering importance to medium as the issue is very intermittent.

Changed in starlingx:
importance: High → Medium
Revision history for this message
Frank Miller (sensfan22) wrote :

Note: This is related to the upversion to helm v3 done in stx.4.0.

Revision history for this message
Frank Miller (sensfan22) wrote :

As this is related to helm v3, assigning to Jim as he was the prime for the upversion to helm v3.

Changed in starlingx:
assignee: Bob Church (rchurch) → Jim Gauld (jgauld)
Revision history for this message
Jim Gauld (jgauld) wrote :

If there are lingering armada pods in a non-Running state (e.g., Unknown, Terminating), then Dan's proposed solution should work. Ideally we should also clean up stale armada pods, but that requires --force.

WORKAROUND: Forcibly delete all armada pods (note this is safe to do):

kubectl get pods -n armada | awk '{print $1}' | \
  xargs --no-run-if-empty -i{} kubectl delete pod -n armada {} --grace-period=0 --force
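
After the forced deletion, the armada-api deployment should recreate the pod; checking `kubectl get pods -n armada` until a fresh armada-api pod reaches 2/2 Running before re-applying the failed applications is a reasonable follow-up. (Note that the pipeline above also feeds the NAME header line to kubectl delete, which only produces a harmless "not found" error.)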

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
2020-07-29_00-00-00
sm-3

log uploaded at
https://files.starlingx.kube.cengn.ca/launchpad/1886429

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per the 2020-07-30 release meeting, this will be included in a 4.0 maintenance (mtce) release.

Ghada Khalil (gkhalil)
tags: added: not-yet-in-r-stx40
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Jim Gauld (jgauld) → Suvro Ghosh (suvr0)
Ghada Khalil (gkhalil)
tags: added: stx.5.0
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
SM-3
2020-09-03_00-00-00

log:
sysinv 2020-09-03 07:43:48.334 99598 ERROR sysinv.conductor.kube_app [-] Failed to copy /etc/build.info to armada-api-5ff859b99-c4bs5:/tmp, error: Unexpected error while running command.
Command: kubectl --kubeconfig /etc/kubernetes/admin.conf cp -n armada /etc/build.info armada-api-5ff859b99-c4bs5:/tmp --container armada-api
Exit code: 1
Stdout: ''
Stderr: 'Error from server: error dialing backend: dial tcp [abcd:204::2]:10250: connect: connection refused\n': RuntimeError: armada pod not ready

Jim Gauld (jgauld)
Changed in starlingx:
assignee: Suvro Ghosh (suvr0) → Jim Gauld (jgauld)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/752304

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/752305

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/752304
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=a4d7cb98329b95e3182597d19bc09eebe8de34b4
Submitter: Zuul
Branch: master

commit a4d7cb98329b95e3182597d19bc09eebe8de34b4
Author: Jim Gauld <email address hidden>
Date: Wed Sep 16 15:58:09 2020 -0400

    helmv2-cli armada pod running and ready check

    This modifies helmv2-cli check for armada pod readiness.
    There can be multiple kubernetes armada pods in the system, each
    in different condition. This now selects armada pods that are
    running and ready instead of just selecting the first found pod.

    i.e., status.phase 'Running' and status.conditions 'Ready=True'

    Change-Id: Ib998d60f3442835d9dbd8e93bf6aaa54ede2e218
    Closes-Bug: 1886429
    Signed-off-by: Jim Gauld <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/752305
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=e0963bff80f7e1202f19f109aa7f5ce0074972b1
Submitter: Zuul
Branch: master

commit e0963bff80f7e1202f19f109aa7f5ce0074972b1
Author: Jim Gauld <email address hidden>
Date: Wed Sep 16 15:26:09 2020 -0400

    sysinv kube_app armada pod running and ready check

    This modifies sysinv kube_app check for armada pod readiness.
    There can be multiple kubernetes armada pods in the system, each
    in different condition. This now prefers armada pods that are
    running and ready instead of just selecting the first found pod.

    i.e., status.phase 'Running' and status.conditions 'Ready=True'

    Have been able to recreate multiple armada pod scenarios
    where the previous check was insufficient, e.g.,
    - Manually delete armada pod, and re-apply application;
      see pod in 'Terminating' phase for some time,
      and new armada pod being created and eventually Running.
    - During B&R of AIO-DX. Deploy AIO-DX, remove armada label on
      node controller-0. Delete armada pod, armada pod moves to
      controller-1. Take backup. Do the restore up to including
      controller-0 unlock. Constantly re-apply cert-manager.
    - Change armada pod helm overrides to have 2 replicas, or scale
      armada pod to have 2 replicas.

    Note that kube_app.py uses helmv2-cli, which also requires
    checking for pod ready.

    Change-Id: I45523c8da36618b8ac465269ad95ea28429c03e7
    Partial-Bug: 1886429
    Depends-On: https://review.opendev.org/752304
    Signed-off-by: Jim Gauld <email address hidden>

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was not reproduced recently.

tags: removed: stx.retestneeded