Failed scheduling of armada pod resulting in subcloud bootstrap failure
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Won't Fix | Medium | Angie Wang | | |
Bug Description
Brief Description
-----------------
Failed scheduling of armada pod resulted in subcloud bootstrap failure.
Severity
--------
Major
Steps to Reproduce
------------------
Deploy a batch of roughly 50 subclouds via dcmanager subcloud add.
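The batch deployment in this step can be sketched as a small script. This is only a sketch: the subcloud addresses and the values-file naming are hypothetical, and the --bootstrap-address and --bootstrap-values options are an assumption about the dcmanager CLI on the load under test. The function builds the command lines without running them.

```python
import ipaddress


def build_subcloud_add_commands(count, first_address, values_template):
    """Return one `dcmanager subcloud add` argv list per subcloud.

    All inputs are hypothetical placeholders. The option names are an
    assumption and should be checked against
    `dcmanager subcloud add --help` on the system controller.
    """
    base = ipaddress.ip_address(first_address)
    commands = []
    for index in range(1, count + 1):
        commands.append([
            "dcmanager", "subcloud", "add",
            # Consecutive bootstrap addresses, starting at first_address.
            "--bootstrap-address", str(base + (index - 1)),
            # One bootstrap values file per subcloud.
            "--bootstrap-values", values_template.format(index=index),
        ])
    return commands


if __name__ == "__main__":
    # 2001:db8::/32 is the IPv6 documentation prefix, used as a stand-in.
    for argv in build_subcloud_add_commands(50, "2001:db8::10", "subcloud{index}.yml"):
        print(" ".join(argv))
```

Each returned list could be passed to subprocess.run to issue the adds back to back, which is the load pattern under which the failure was seen.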
Expected Behavior
------------------
All subclouds are successfully deployed
Actual Behavior
----------------
One of the subclouds failed bootstrap, timing out while waiting for nginx-ingress-
Bootstrap log (on the system controller):
=============
TASK [bootstrap/
FAILED - RETRYING: Wait until application is in the uploaded state (3 retries left).
FAILED - RETRYING: Wait until application is in the uploaded state (2 retries left).
FAILED - RETRYING: Wait until application is in the uploaded state (1 retries left).
fatal: [subcloud37]: FAILED! => {"attempts": 3, "changed": true, "cmd": "source /etc/platform/
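The log above shows a bounded poll: the task checks the application state a fixed number of times and then fails. A generic version of that pattern is sketched below. It is not the playbook's actual task; the function name, parameters, and the injected sleep are all hypothetical.

```python
import time


def wait_until(check, retries=3, delay=0.0, sleep=time.sleep):
    """Poll `check` until it returns True or the attempts run out.

    Returns the number of attempts used on success and raises
    TimeoutError once `retries` attempts have all failed.
    """
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            # Pause before the next attempt; skipped after the last one.
            sleep(delay)
    raise TimeoutError(f"condition not met after {retries} attempts")
```

With a pod that never gets scheduled, the check can never succeed, so a loop like this always ends in the timeout branch no matter how long the delay is.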
Sysinv log (on the subcloud):
==========
sysinv 2020-10-30 00:46:54.576 512516 ERROR sysinv.
2020-10-30 00:46:54.576 512516 ERROR sysinv.
2020-10-30 00:46:54.576 512516 ERROR sysinv.
2020-10-30 00:46:54.576 512516 ERROR sysinv.
2020-10-30 00:46:54.576 512516 ERROR sysinv.
2020-10-30 00:46:54.576 512516 ERROR sysinv.
fmAPI.cpp(137): Connected to FM Manager.
sysinv 2020-10-30 00:46:54.744 512516 ERROR sysinv.
sysinv 2020-10-30 00:46:54.745 512516 ERROR sysinv.
2020-10-30 00:46:54.745 512516 ERROR sysinv.
2020-10-30 00:46:54.745 512516 ERROR sysinv.
2020-10-30 00:46:54.745 512516 ERROR sysinv.
2020-10-30 00:46:54.745 512516 ERROR sysinv.
2020-10-30 00:46:54.745 512516 ERROR sysinv.
2020-10-30 00:46:54.745 512516 ERROR sysinv.
2020-10-30 00:46:54.745 512516 ERROR sysinv.
2020-10-30 00:46:54.745 512516 ERROR sysinv.
2020-10-30 00:46:54.745 512516 ERROR sysinv.
2020-10-30 00:46:54.745 512516 ERROR sysinv.
Armada pod description
=======
controller-0:~$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
armada armada-
kube-system calico-
kube-system calico-node-fdg4t 1/1 Running 0 14h
kube-system coredns-
kube-system kube-apiserver-
kube-system kube-controller
kube-system kube-multus-
kube-system kube-proxy-dgv54 1/1 Running 0 14h
kube-system kube-scheduler-
kube-system kube-sriov-
controller-0:~$ kubectl describe pod -n armada armada-
….
….
QoS Class: BestEffort
Node-Selectors: armada=enabled
Tolerations: node.kubernetes
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m56s (x587 over 14h) default-scheduler 0/1 nodes are available: 1 node(s) had taint {node-role.
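The FailedScheduling event can be read against a simplified model of two scheduler checks: the pod's node selector must match the node's labels, and every blocking taint on the node must be tolerated by the pod. The sketch below is an illustration of those rules, not the kube-scheduler implementation. The taint key is truncated in this report, so the usage note uses a placeholder key.

```python
# Taint effects that keep a pod off a node; PreferNoSchedule is only a hint.
BLOCKING_EFFECTS = {"NoSchedule", "NoExecute"}


def selector_matches(node_labels, node_selector):
    """Every key/value in the pod's node selector must be a node label."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())


def toleration_matches(toleration, taint):
    """Simplified Kubernetes toleration/taint matching."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key")
    operator = toleration.get("operator", "Equal")
    if not key:
        # An empty key with operator Exists tolerates every taint key.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def unschedulable_reasons(pod, node):
    """Return the reasons this node is rejected; empty means schedulable."""
    reasons = []
    if not selector_matches(node.get("labels", {}), pod.get("node_selector", {})):
        reasons.append("node selector mismatch")
    for taint in node.get("taints", []):
        if taint["effect"] not in BLOCKING_EFFECTS:
            continue
        if not any(toleration_matches(t, taint) for t in pod.get("tolerations", [])):
            reasons.append(f"untolerated taint {taint['key']}")
    return reasons
```

For example, a pod with node_selector {"armada": "enabled"} and no matching toleration, placed against a single node labelled armada=enabled that carries a NoSchedule taint with the placeholder key example.com/placeholder, yields one reason: the untolerated taint. On a one-node system that corresponds to "0/1 nodes are available".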
Reproducibility
---------------
Seen once
System Configuration
-------
IPv6 Distributed Cloud
Branch/Pull Time/Commit
-------
Oct. 27th master load
Last Pass
---------
Not sure whether batch subcloud deployment tests have been executed since the containerized armada feature was introduced.
Timestamp/Logs
--------------
See attached
Test Activity
-------------
Evaluation
Workaround
----------
Delete the subcloud that failed bootstrap.
Log in to the subcloud using its bootstrap IP and delete the app that failed manifest validation.
Re-add the subcloud.
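The three workaround steps can be written down as an ordered plan that records where each command runs. This is a sketch under assumptions: the subcloud name, application name, bootstrap IP, and values file are placeholders supplied by the caller, and the command and option names (dcmanager subcloud delete, system application-delete, --bootstrap-address, --bootstrap-values) should be verified against the CLI help on the load in use.

```python
def workaround_plan(subcloud, app_name, bootstrap_ip, values_file):
    """Return (where to run, argv) pairs for the workaround, in order.

    Nothing is executed here; the caller decides how to run each step
    (for example over SSH for the step on the subcloud).
    """
    return [
        # Step 1: remove the failed subcloud from the system controller.
        ("system controller",
         ["dcmanager", "subcloud", "delete", subcloud]),
        # Step 2: on the subcloud itself, delete the failed application.
        (f"subcloud at {bootstrap_ip}",
         ["system", "application-delete", app_name]),
        # Step 3: add the subcloud again from the system controller.
        ("system controller",
         ["dcmanager", "subcloud", "add",
          "--bootstrap-address", bootstrap_ip,
          "--bootstrap-values", values_file]),
    ]
```

Keeping the location next to each command makes it explicit that the middle step runs on the subcloud, reached through its bootstrap IP, and not on the system controller.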
stx.5.0 / medium priority - intermittent armada issue affecting the DC scaling feature