Brief Description
-----------------
The platform-integration-manifest apply failed during install, and reapply also failed. Further investigation found that controller-1 was unlocked-enabled, but the rbd-provisioner-77bfb6dbb-k98vh pod was not up and controller-1 was tainted.
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces -o wide | grep rbd
kube-system rbd-provisioner-77bfb6dbb-k98vh 0/1 Pending 0 26m <none> <none> <none> <none>
kube-system rbd-provisioner-77bfb6dbb-vm242 1/1 Running 1 26m dead:beef::8e22:765f:6121:eb5b controller-0 <none> <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 36s (x21 over 26m) default-scheduler 0/4 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) didn't match node selector.
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe node controller-1 | grep Taints
Taints: node-role.kubernetes.io/master:NoSchedule
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe node controller-0 | grep Taints
Taints: <none>
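For reference, the per-node taint check above can be scripted against saved `kubectl describe node` output. A minimal sketch, assuming only grep/awk (nothing StarlingX-specific); the here-doc reproduces the controller-1 output captured in this report, and on a live system you would pipe `kubectl describe node controller-1` in directly:

```shell
#!/bin/sh
# Sketch: pull the Taints line out of saved 'kubectl describe node' output.
# The here-doc reproduces the controller-1 state captured in this report.
describe_output=$(cat <<'EOF'
Name:               controller-1
Roles:              <none>
Taints:             node-role.kubernetes.io/master:NoSchedule
Unschedulable:      false
EOF
)
# Keep only the Taints line and print the taint itself (second field).
echo "$describe_output" | grep '^Taints:' | awk '{print $2}'
```

This prints `node-role.kubernetes.io/master:NoSchedule` for the tainted node, matching the transcript above.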
[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+---------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
| cert-manager | 1.0-0 | cert-manager-manifest | certmanager-manifest.yaml | applied | completed |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-0 | oidc-auth-manifest | manifest.yaml | remove-failed | operation aborted, check logs for detail |
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applying | applying application manifest |
+--------------------------+---------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
Severity
--------
Major
Steps to Reproduce
------------------
1. Install the 2020-05-07_21-11-18 load.
System Configuration
--------------------
DC + compute wcp-63-66
Expected Behavior
------------------
The platform-integ-apps application applies successfully.
Actual Behavior
----------------
As described above, the application apply failed.
Reproducibility
---------------
Unknown; not tried more than once.
Load
----
2020-05-07_21-11-18
Last Pass
---------
The same load passed many times. This may be an intermittent issue.
Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+---------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
| cert-manager | 1.0-0 | cert-manager-manifest | certmanager-manifest.yaml | applied | completed |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-0 | oidc-auth-manifest | manifest.yaml | remove-failed | operation aborted, check logs for detail |
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applying | applying application manifest |
+--------------------------+---------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
# platform-integ-apps:
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 533, in __call__
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada return _end_unary_response_blocking(state, call, False, None)
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada raise _Rendezvous(state, None, None, deadline)
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada status = StatusCode.UNKNOWN
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada details = "release stx-rbd-provisioner failed: timed out waiting for the condition"
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada debug_error_string = "{"created":"@1592676136.365005702","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release stx-rbd-provisioner failed: timed out waiting for the condition","grpc_status":2}"
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada >
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada During handling of the above exception, another exception occurred:
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada Traceback (most recent call last):
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 225, in handle_result
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada result = get_result()
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 236, in <lambda>
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada if (handle_result(chart, lambda: deploy_chart(chart))):
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 214, in deploy_chart
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada chart, cg_test_all_charts, prefix, known_releases)
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 239, in execute
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada timeout=timer)
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 486, in install_release
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada raise ex.ReleaseException(release, status, 'Install')
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada armada.exceptions.tiller_exceptions.ReleaseException: Failed to Install release: stx-rbd-provisioner - Tiller Message: b'Release "stx-rbd-provisioner" failed: timed out waiting for the condition'
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada ^[[00m
2020-06-20 18:02:16.480 80317 ERROR armada.handlers.armada [-] Chart deploy(s) failed: ['kube-system-rbd-provisioner']^[[00m
2020-06-20 18:02:17.025 80317 INFO armada.handlers.lock [-] Releasing lock^[[00m
2020-06-20 18:02:17.030 80317 ERROR armada.cli [-] Caught internal exception: armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['kube-system-rbd-provisioner']
2020-06-20 18:02:17.030 80317 ERROR armada.cli Traceback (most recent call last):
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke
2020-06-20 18:02:17.030 80317 ERROR armada.cli self.invoke()
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 213, in invoke
2020-06-20 18:02:17.030 80317 ERROR armada.cli resp = self.handle(documents, tiller)
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py", line 81, in func_wrapper
2020-06-20 18:02:17.030 80317 ERROR armada.cli return future.result()
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2020-06-20 18:02:17.030 80317 ERROR armada.cli return self.__get_result()
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2020-06-20 18:02:17.030 80317 ERROR armada.cli raise self._exception
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2020-06-20 18:02:17.030 80317 ERROR armada.cli result = self.fn(*self.args, **self.kwargs)
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 256, in handle
2020-06-20 18:02:17.030 80317 ERROR armada.cli return armada.sync()
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 252, in sync
2020-06-20 18:02:17.030 80317 ERROR armada.cli raise armada_exceptions.ChartDeployException(failures)
2020-06-20 18:02:17.030 80317 ERROR armada.cli armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['kube-system-rbd-provisioner']
# oidc:
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.wait [-] [chart=kube-system-dex]: Timed out waiting for pods (namespace=kube-system, labels=(app=dex)). These pods were not ready=['oidc-dex-6585f5f9bc-qs8v2']^[[00m
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada [-] Chart deploy [kube-system-dex] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=kube-system, labels=(app=dex)). These pods were not ready=['oidc-dex-6585f5f9bc-qs8v2']
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada Traceback (most recent call last):
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 225, in handle_result
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada result = get_result()
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 236, in <lambda>
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada if (handle_result(chart, lambda: deploy_chart(chart))):
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 214, in deploy_chart
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada chart, cg_test_all_charts, prefix, known_releases)
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 248, in execute
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada chart_wait.wait(timer)
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 134, in wait
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada wait.wait(timeout=timeout)
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 294, in wait
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada modified = self._wait(deadline)
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 354, in _wait
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada raise k8s_exceptions.KubernetesWatchTimeoutException(error)
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=kube-system, labels=(app=dex)). These pods were not ready=['oidc-dex-6585f5f9bc-qs8v2']
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada ^[[00m
2020-06-20 21:22:53.746 89262 ERROR armada.handlers.armada [-] Chart deploy(s) failed: ['kube-system-dex']^[[00m
2020-06-20 21:22:54.455 89262 INFO armada.handlers.lock [-] Releasing lock^[[00m
2020-06-20 21:22:54.461 89262 ERROR armada.cli [-] Caught internal exception: armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['kube-system-dex']
2020-06-20 21:22:54.461 89262 ERROR armada.cli Traceback (most recent call last):
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke
2020-06-20 21:22:54.461 89262 ERROR armada.cli self.invoke()
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 213, in invoke
2020-06-20 21:22:54.461 89262 ERROR armada.cli resp = self.handle(documents, tiller)
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py", line 81, in func_wrapper
2020-06-20 21:22:54.461 89262 ERROR armada.cli return future.result()
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2020-06-20 21:22:54.461 89262 ERROR armada.cli return self.__get_result()
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2020-06-20 21:22:54.461 89262 ERROR armada.cli raise self._exception
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2020-06-20 21:22:54.461 89262 ERROR armada.cli result = self.fn(*self.args, **self.kwargs)
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 256, in handle
2020-06-20 21:22:54.461 89262 ERROR armada.cli return armada.sync()
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 252, in sync
2020-06-20 21:22:54.461 89262 ERROR armada.cli raise armada_exceptions.ChartDeployException(failures)
2020-06-20 21:22:54.461 89262 ERROR armada.cli armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['kube-system-dex']
Test Activity
-------------
Regression test
# The system looked healthy when this happened:
# alarm free, all nodes unlocked-enabled per 'system host-list', and all nodes Ready per 'kubectl get nodes'.
# The rbd-provisioner pod could not be scheduled on controller-1 because the node was tainted for some reason. The app apply succeeded after the taint was removed.
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces -o wide | grep rbd
kube-system rbd-provisioner-77bfb6dbb-k98vh 0/1 Pending 0 26m <none> <none> <none> <none>
kube-system rbd-provisioner-77bfb6dbb-vm242 1/1 Running 1 26m dead:beef::8e22:765f:6121:eb5b controller-0 <none> <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 36s (x21 over 26m) default-scheduler 0/4 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) didn't match node selector.
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe node controller-1 | grep Taints
Taints: node-role.kubernetes.io/master:NoSchedule
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe node controller-0 | grep Taints
Taints: <none>
# Workaround:
kubectl taint node controller-1 node-role.kubernetes.io/master:NoSchedule-
The following app applies had already failed before the workaround:
[sysadmin@controller-0 ~(keystone_admin)]$ ls -lrt /var/log/armada/platform-integ-apps-apply_2020-06-2*
-rw-r--r-- 1 1000 users 21784 Jun 20 16:00 /var/log/armada/platform-integ-apps-apply_2020-06-20-15-30-07.log
-rw-r--r-- 1 1000 users 21784 Jun 20 16:30 /var/log/armada/platform-integ-apps-apply_2020-06-20-16-00-39.log
-rw-r--r-- 1 1000 users 21784 Jun 20 17:01 /var/log/armada/platform-integ-apps-apply_2020-06-20-16-31-11.log
-rw-r--r-- 1 1000 users 21784 Jun 20 17:31 /var/log/armada/platform-integ-apps-apply_2020-06-20-17-01-43.log
-rw-r--r-- 1 1000 users 21784 Jun 20 18:02 /var/log/armada/platform-integ-apps-apply_2020-06-20-17-32-15.log
-rw-r--r-- 1 1000 users 21987 Jun 21 20:17 /var/log/armada/platform-integ-apps-apply_2020-06-21-19-47-22.log
-rw-r--r-- 1 1000 users 22122 Jun 21 20:47 /var/log/armada/platform-integ-apps-apply_2020-06-21-20-17-54.log
-rw-r--r-- 1 1000 users 22122 Jun 21 21:18 /var/log/armada/platform-integ-apps-apply_2020-06-21-20-48-26.log
-rw-r--r-- 1 1000 users 9154 Jun 21 21:19 /var/log/armada/platform-integ-apps-apply_2020-06-21-21-18-58.log
[sysadmin@controller-0 ~(keystone_admin)]$ ls -lrt /var/log/armada/oidc-auth-apps-apply_2020-*
-rw-r--r-- 1 1000 users 19243 Jun 20 21:22 /var/log/armada/oidc-auth-apps-apply_2020-06-20-20-52-52.log