StarlingX

Failed scheduling of armada pod resulting in subcloud bootstrap failure

Bug #1902266 reported by Tee Ngo on 2020-10-30

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Won't Fix	Medium	Angie Wang

Bug Description

Brief Description
-----------------
Failed scheduling of armada pod resulted in subcloud bootstrap failure.

Severity
--------
Major

Steps to Reproduce
------------------
Deploy a batch of roughly 50 subclouds using "dcmanager subcloud add".

Expected Behavior
------------------
All subclouds are deployed successfully.

Actual Behavior
----------------
One of the subclouds failed bootstrap: it timed out while waiting for nginx-ingress-controller to reach the uploaded state. The upload failed because sysinv on the subcloud could not communicate with the armada pod, which had never been scheduled.

Bootstrap log (on the system controller):
=============
TASK [bootstrap/bringup-bootstrap-applications : Wait until application is in the uploaded state] ***
FAILED - RETRYING: Wait until application is in the uploaded state (3 retries left).
FAILED - RETRYING: Wait until application is in the uploaded state (2 retries left).
FAILED - RETRYING: Wait until application is in the uploaded state (1 retries left).
fatal: [subcloud37]: FAILED! => {"attempts": 3, "changed": true, "cmd": "source /etc/platform/openrc; system application-show nginx-ingress-controller --column status --format value", "delta": "0:00:02.829967", "end": "2020-10-30 00:46:54.724136", "rc": 0, "start": "2020-10-30 00:46:51.894169", "stderr": "", "stderr_lines": [], "stdout": "uploading", "stdout_lines": ["uploading"]}
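
The task above polls the application status a fixed number of times and fails if the status never becomes "uploaded". A minimal sketch of that retry pattern follows. It assumes a hypothetical get_status callable standing in for the "system application-show nginx-ingress-controller --column status --format value" command shown in the log; the function name, defaults, and delay are illustrative and not taken from the bootstrap playbook.

```python
import time
from typing import Callable


def wait_for_status(get_status: Callable[[], str],
                    wanted: str = "uploaded",
                    retries: int = 3,
                    delay_s: float = 0.0) -> bool:
    """Poll get_status() up to `retries` times; return True once it yields `wanted`."""
    for attempt in range(1, retries + 1):
        status = get_status().strip()
        if status == wanted:
            return True
        # Report the intermediate status before the next attempt.
        print(f"attempt {attempt}/{retries}: status={status!r}")
        if attempt < retries:
            time.sleep(delay_s)
    return False


# Hypothetical stand-in: the app never leaves "uploading", as seen on subcloud37.
stuck = wait_for_status(lambda: "uploading")
```

With a status source that stays at "uploading", the helper returns False after the last attempt, which corresponds to the fatal result in the log. A real caller would use a non-zero delay.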

Sysinv log (on the subcloud):
==========
sysinv 2020-10-30 00:46:54.576 512516 ERROR sysinv.conductor.kube_app [-] Upload of application nginx-ingress-controller (1.0-0) failed: Failed to validate application manifest.: KubeAppUploadFailure: Upload of application nginx-ingress-controller (1.0-0) failed: Failed to validate application manifest.
2020-10-30 00:46:54.576 512516 ERROR sysinv.conductor.kube_app Traceback (most recent call last):
2020-10-30 00:46:54.576 512516 ERROR sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1974, in perform_app_upload
2020-10-30 00:46:54.576 512516 ERROR sysinv.conductor.kube_app reason="Failed to validate application manifest.")
2020-10-30 00:46:54.576 512516 ERROR sysinv.conductor.kube_app KubeAppUploadFailure: Upload of application nginx-ingress-controller (1.0-0) failed: Failed to validate application manifest.
2020-10-30 00:46:54.576 512516 ERROR sysinv.conductor.kube_app
fmAPI.cpp(137): Connected to FM Manager.
sysinv 2020-10-30 00:46:54.744 512516 ERROR sysinv.conductor.kube_app [-] Application upload aborted!.: KubeAppUploadFailure: Upload of application nginx-ingress-controller (1.0-0) failed: Failed to validate application manifest.
sysinv 2020-10-30 00:46:54.745 512516 ERROR sysinv.openstack.common.rpc.amqp [-] Exception during message handling: KubeAppUploadFailure: Upload of application nginx-ingress-controller (1.0-0) failed: Failed to validate application manifest.
2020-10-30 00:46:54.745 512516 ERROR sysinv.openstack.common.rpc.amqp Traceback (most recent call last):
2020-10-30 00:46:54.745 512516 ERROR sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/amqp.py", line 437, in _process_data
2020-10-30 00:46:54.745 512516 ERROR sysinv.openstack.common.rpc.amqp **args)
2020-10-30 00:46:54.745 512516 ERROR sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/dispatcher.py", line 172, in dispatch
2020-10-30 00:46:54.745 512516 ERROR sysinv.openstack.common.rpc.amqp result = getattr(proxyobj, method)(ctxt, **kwargs)
2020-10-30 00:46:54.745 512516 ERROR sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 11334, in perform_app_upload
2020-10-30 00:46:54.745 512516 ERROR sysinv.openstack.common.rpc.amqp self._app.perform_app_upload(rpc_app, tarfile)
2020-10-30 00:46:54.745 512516 ERROR sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1974, in perform_app_upload
2020-10-30 00:46:54.745 512516 ERROR sysinv.openstack.common.rpc.amqp reason="Failed to validate application manifest.")
2020-10-30 00:46:54.745 512516 ERROR sysinv.openstack.common.rpc.amqp KubeAppUploadFailure: Upload of application nginx-ingress-controller (1.0-0) failed: Failed to validate application manifest.

Armada pod description
======================
controller-0:~$ kubectl get pods --all-namespaces
NAMESPACE    NAME                                      READY  STATUS   RESTARTS  AGE
armada       armada-api-544c47fb68-gfrtw               0/2    Pending  0         14h
kube-system  calico-kube-controllers-5cd4695574-t8bb9  1/1    Running  1         14h
kube-system  calico-node-fdg4t                         1/1    Running  0         14h
kube-system  coredns-7fc965fbd7-fwg59                  1/1    Running  0         14h
kube-system  kube-apiserver-controller-0               1/1    Running  0         14h
kube-system  kube-controller-manager-controller-0      1/1    Running  0         14h
kube-system  kube-multus-ds-amd64-tv6v9                1/1    Running  0         14h
kube-system  kube-proxy-dgv54                          1/1    Running  0         14h
kube-system  kube-sriov-cni-ds-amd64-d5zqx             1/1    Running  0         14h
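
When checking many subclouds, the listing above can be scanned for pods that are not fully ready. The following is a rough sketch only: it assumes the plain whitespace-separated layout shown here (newer kubectl versions may print extra text in the RESTARTS column, which this parser does not handle), and the helper names are hypothetical.

```python
from typing import Dict, List

# Sample rows copied from the listing above (columns are whitespace-separated).
LISTING = """\
NAMESPACE NAME READY STATUS RESTARTS AGE
armada armada-api-544c47fb68-gfrtw 0/2 Pending 0 14h
kube-system calico-kube-controllers-5cd4695574-t8bb9 1/1 Running 1 14h
kube-system coredns-7fc965fbd7-fwg59 1/1 Running 0 14h
"""


def parse_pods(text: str) -> List[Dict[str, str]]:
    """Turn `kubectl get pods --all-namespaces` text into one dict per pod."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def not_ready(pods: List[Dict[str, str]]) -> List[str]:
    """Return namespace/name for pods whose READY count (e.g. "0/2") is below the total."""
    names = []
    for pod in pods:
        ready, total = pod["READY"].split("/")
        if int(ready) < int(total):
            names.append(f'{pod["NAMESPACE"]}/{pod["NAME"]}')
    return names


unready = not_ready(parse_pods(LISTING))
```

For the sample rows, only the armada-api pod is reported.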

controller-0:~$ kubectl describe pod -n armada armada-api-544c47fb68-gfrtw
….
….
QoS Class: BestEffort
Node-Selectors: armada=enabled
Tolerations: node.kubernetes.io/not-ready:NoExecute for 30s
                 node.kubernetes.io/unreachable:NoExecute for 30s
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning FailedScheduling 3m56s (x587 over 14h) default-scheduler 0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
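
The FailedScheduling event follows from the Kubernetes taint/toleration matching rule: a toleration covers a taint only if the keys match (or the toleration has an empty key with operator Exists), the effects match (or the toleration's effect is empty), and, for operator Equal, the values match. The sketch below applies that rule to the pod's two tolerations. The taint effect is not printed in the event, so NoSchedule is an assumption here, as is the Exists operator on the pod's tolerations; the class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass
class Toleration:
    key: str = ""
    operator: str = "Equal"  # Kubernetes default operator
    value: str = ""
    effect: str = ""         # an empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Apply the Kubernetes matching rule for a single toleration and taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        # An empty key with Exists tolerates every taint.
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def untolerated(taints: List[Taint], tolerations: List[Toleration]) -> List[Taint]:
    """Return the taints that no toleration in the list covers."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


# Assumption: effect NoSchedule (the event message does not show the effect).
master = Taint(key="node-role.kubernetes.io/master", effect="NoSchedule")

# Tolerations listed in the pod description; operator Exists is assumed.
pod_tolerations = [
    Toleration(key="node.kubernetes.io/not-ready", operator="Exists", effect="NoExecute"),
    Toleration(key="node.kubernetes.io/unreachable", operator="Exists", effect="NoExecute"),
]

blocking = untolerated([master], pod_tolerations)
```

Neither toleration shares a key with the master taint, so the taint remains in `blocking` and the single node is rejected. A toleration such as Toleration(key="node-role.kubernetes.io/master", operator="Exists", effect="NoSchedule") would clear it under this rule; it is shown only to illustrate the matching logic, not as the fix adopted for this bug.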

Reproducibility
---------------
Seen once

System Configuration
--------------------
IPv6 Distributed Cloud

Branch/Pull Time/Commit
-----------------------
Oct. 27th master load

Last Pass
---------
Not sure whether batch subcloud deployment tests have been executed since the containerized armada feature was introduced.

Timestamp/Logs
--------------
See attached

Test Activity
-------------
Evaluation

Workaround
----------
1. Delete the subcloud that failed bootstrap.
2. Log in to the subcloud using its bootstrap IP and delete the application that failed manifest validation.
3. Re-add the subcloud.

Tags:

Tee Ngo (teewrs) wrote on 2020-10-30:

controller-0_20201030.145952.tar (20.6 MiB, application/x-tar)

Ghada Khalil (gkhalil) wrote on 2020-11-06:

stx.5.0 / medium priority - intermittent armada issue affecting the DC scaling feature

tags:	added: stx.containers stx.distcloud
Changed in starlingx:
importance:	Undecided → Medium
status:	New → Triaged
tags:	added: stx.5.0
Changed in starlingx:
assignee:	nobody → Angie Wang (angiewang)

Frank Miller (sensfan22) wrote on 2021-04-15:

This issue has not been seen recently and was reported only once. If it starts occurring more frequently, please open a new LP against a recent load.