Platform integration apply failure during the installation

Bug #1884469 reported by Anujeyan Manokeran
Affects: StarlingX
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

Brief Description
-----------------
The platform-integration-manifest apply failed during install, and the reapply also failed. Further investigation found that controller-1 was unlocked-enabled, but the rbd-provisioner-77bfb6dbb-k98vh pod was not up and controller-1 was tainted.
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces -o wide | grep rbd
kube-system rbd-provisioner-77bfb6dbb-k98vh 0/1 Pending 0 26m <none> <none> <none> <none>
kube-system rbd-provisioner-77bfb6dbb-vm242 1/1 Running 1 26m dead:beef::8e22:765f:6121:eb5b controller-0 <none> <none>

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning FailedScheduling 36s (x21 over 26m) default-scheduler 0/4 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) didn't match node selector.
[sysadmin@controller-0 ~(keystone_admin)]$

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe node controller-1 | grep Taints
Taints: node-role.kubernetes.io/master:NoSchedule
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe node controller-0 | grep Taints
Taints: <none>
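
Note: per the FailedScheduling event above, the second rbd-provisioner replica cannot co-locate with the one already running on controller-0 (pod anti-affinity) and is restricted by a node selector to the controllers, so it can only land on controller-1, which is unexpectedly tainted. A quick way to confirm what the scheduler is matching against (a sketch, not part of the original capture; the pod name is the Pending replica shown above):

# Show the anti-affinity and node selector carried by the Pending replica
kubectl -n kube-system get pod rbd-provisioner-77bfb6dbb-k98vh -o jsonpath='{.spec.affinity.podAntiAffinity}'
kubectl -n kube-system get pod rbd-provisioner-77bfb6dbb-k98vh -o jsonpath='{.spec.nodeSelector}'
# And the taint keeping it off controller-1
kubectl get node controller-1 -o jsonpath='{.spec.taints}'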

[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+---------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
| cert-manager | 1.0-0 | cert-manager-manifest | certmanager-manifest.yaml | applied | completed |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-0 | oidc-auth-manifest | manifest.yaml | remove-failed | operation aborted, check logs for detail |
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applying | applying application manifest |
+--------------------------+---------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+

Severity
--------
Major

Steps to Reproduce
------------------
1. Install the 2020-05-07_21-11-18 load

System Configuration
--------------------
DC + compute wcp-63-66

Expected Behavior
------------------
Platform integration app applies successfully.

Actual Behavior
----------------
Application apply fails as described above.

Reproducibility
---------------
Not sure; not tried more than once.

Load
----

 2020-05-07_21-11-18

Last Pass
---------
The same load has passed many times. This may be an intermittent issue.

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+---------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
| cert-manager | 1.0-0 | cert-manager-manifest | certmanager-manifest.yaml | applied | completed |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-0 | oidc-auth-manifest | manifest.yaml | remove-failed | operation aborted, check logs for detail |
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applying | applying application manifest |
+--------------------------+---------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+

# platform-integ-apps:

2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 533, in __call__
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada return _end_unary_response_blocking(state, call, False, None)
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada raise _Rendezvous(state, None, None, deadline)
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada status = StatusCode.UNKNOWN
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada details = "release stx-rbd-provisioner failed: timed out waiting for the condition"
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada debug_error_string = "{"created":"@1592676136.365005702","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release stx-rbd-provisioner failed: timed out waiting for the condition","grpc_status":2}"
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada >
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada During handling of the above exception, another exception occurred:
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada Traceback (most recent call last):
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 225, in handle_result
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada result = get_result()
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 236, in <lambda>
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada if (handle_result(chart, lambda: deploy_chart(chart))):
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 214, in deploy_chart
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada chart, cg_test_all_charts, prefix, known_releases)
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 239, in execute
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada timeout=timer)
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 486, in install_release
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada raise ex.ReleaseException(release, status, 'Install')
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada armada.exceptions.tiller_exceptions.ReleaseException: Failed to Install release: stx-rbd-provisioner - Tiller Message: b'Release "stx-rbd-provisioner" failed: timed out waiting for the condition'
2020-06-20 18:02:16.479 80317 ERROR armada.handlers.armada ^[[00m
2020-06-20 18:02:16.480 80317 ERROR armada.handlers.armada [-] Chart deploy(s) failed: ['kube-system-rbd-provisioner']^[[00m
2020-06-20 18:02:17.025 80317 INFO armada.handlers.lock [-] Releasing lock^[[00m
2020-06-20 18:02:17.030 80317 ERROR armada.cli [-] Caught internal exception: armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['kube-system-rbd-provisioner']
2020-06-20 18:02:17.030 80317 ERROR armada.cli Traceback (most recent call last):
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke
2020-06-20 18:02:17.030 80317 ERROR armada.cli self.invoke()
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 213, in invoke
2020-06-20 18:02:17.030 80317 ERROR armada.cli resp = self.handle(documents, tiller)
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py", line 81, in func_wrapper
2020-06-20 18:02:17.030 80317 ERROR armada.cli return future.result()
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2020-06-20 18:02:17.030 80317 ERROR armada.cli return self.__get_result()
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2020-06-20 18:02:17.030 80317 ERROR armada.cli raise self._exception
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2020-06-20 18:02:17.030 80317 ERROR armada.cli result = self.fn(*self.args, **self.kwargs)
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 256, in handle
2020-06-20 18:02:17.030 80317 ERROR armada.cli return armada.sync()
2020-06-20 18:02:17.030 80317 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 252, in sync
2020-06-20 18:02:17.030 80317 ERROR armada.cli raise armada_exceptions.ChartDeployException(failures)
2020-06-20 18:02:17.030 80317 ERROR armada.cli armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['kube-system-rbd-provisioner']
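
The Tiller "timed out waiting for the condition" above lines up with the rbd-provisioner replica stuck in Pending: armada/Tiller waits for the release's pods to become ready, and the second replica never schedules. A way to correlate the two at the time of failure (a sketch, not part of the captured logs):

# Check why the replica that Tiller was waiting on never became Ready
kubectl -n kube-system describe pod rbd-provisioner-77bfb6dbb-k98vh | tail -n 20
kubectl -n kube-system get events --field-selector involvedObject.name=rbd-provisioner-77bfb6dbb-k98vh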

# oidc:
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.wait [-] [chart=kube-system-dex]: Timed out waiting for pods (namespace=kube-system, labels=(app=dex)). These pods were not ready=['oidc-dex-6585f5f9bc-qs8v2']^[[00m
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada [-] Chart deploy [kube-system-dex] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=kube-system, labels=(app=dex)). These pods were not ready=['oidc-dex-6585f5f9bc-qs8v2']
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada Traceback (most recent call last):
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 225, in handle_result
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada result = get_result()
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 236, in <lambda>
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada if (handle_result(chart, lambda: deploy_chart(chart))):
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 214, in deploy_chart
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada chart, cg_test_all_charts, prefix, known_releases)
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 248, in execute
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada chart_wait.wait(timer)
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 134, in wait
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada wait.wait(timeout=timeout)
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 294, in wait
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada modified = self._wait(deadline)
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 354, in _wait
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada raise k8s_exceptions.KubernetesWatchTimeoutException(error)
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=kube-system, labels=(app=dex)). These pods were not ready=['oidc-dex-6585f5f9bc-qs8v2']
2020-06-20 21:22:53.745 89262 ERROR armada.handlers.armada ^[[00m
2020-06-20 21:22:53.746 89262 ERROR armada.handlers.armada [-] Chart deploy(s) failed: ['kube-system-dex']^[[00m
2020-06-20 21:22:54.455 89262 INFO armada.handlers.lock [-] Releasing lock^[[00m
2020-06-20 21:22:54.461 89262 ERROR armada.cli [-] Caught internal exception: armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['kube-system-dex']
2020-06-20 21:22:54.461 89262 ERROR armada.cli Traceback (most recent call last):
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke
2020-06-20 21:22:54.461 89262 ERROR armada.cli self.invoke()
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 213, in invoke
2020-06-20 21:22:54.461 89262 ERROR armada.cli resp = self.handle(documents, tiller)
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py", line 81, in func_wrapper
2020-06-20 21:22:54.461 89262 ERROR armada.cli return future.result()
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2020-06-20 21:22:54.461 89262 ERROR armada.cli return self.__get_result()
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2020-06-20 21:22:54.461 89262 ERROR armada.cli raise self._exception
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2020-06-20 21:22:54.461 89262 ERROR armada.cli result = self.fn(*self.args, **self.kwargs)
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 256, in handle
2020-06-20 21:22:54.461 89262 ERROR armada.cli return armada.sync()
2020-06-20 21:22:54.461 89262 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 252, in sync
2020-06-20 21:22:54.461 89262 ERROR armada.cli raise armada_exceptions.ChartDeployException(failures)
2020-06-20 21:22:54.461 89262 ERROR armada.cli armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['kube-system-dex']

Test Activity
-------------
Regression test

description: updated
tags: added: stx.retestneeded
Revision history for this message
Yang Liu (yliu12) wrote :

# system looked healthy when this happened.
System was alarm free, all nodes were unlocked-enabled via system host-list, and all nodes were ready via kubectl get nodes -n deployment.

# The rbd pod could not be scheduled on controller-1, because it was tainted for some reason. App apply worked after removing the taint.

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces -o wide | grep rbd
kube-system rbd-provisioner-77bfb6dbb-k98vh 0/1 Pending 0 26m <none> <none> <none> <none>
kube-system rbd-provisioner-77bfb6dbb-vm242 1/1 Running 1 26m dead:beef::8e22:765f:6121:eb5b controller-0 <none> <none>

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning FailedScheduling 36s (x21 over 26m) default-scheduler 0/4 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) didn't match node selector.
[sysadmin@controller-0 ~(keystone_admin)]$

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe node controller-1 | grep Taints
Taints: node-role.kubernetes.io/master:NoSchedule
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe node controller-0 | grep Taints
Taints: <none>

# Workaround:
kubectl taint node controller-1 node-role.kubernetes.io/master:NoSchedule-
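
After removing the taint, it can be verified that the Pending replica schedules and the app apply goes through. A sketch of the verification steps (not from the original report; standard kubectl and StarlingX system CLI commands):

# Confirm the taint is gone and the second replica starts
kubectl describe node controller-1 | grep Taints        # expect: <none>
kubectl get pods -n kube-system -o wide | grep rbd      # both replicas should reach Running
# Re-trigger the apply and watch its status
system application-apply platform-integ-apps
system application-list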

The following app applies had already failed before the workaround was applied:
[sysadmin@controller-0 ~(keystone_admin)]$ ls -lrt /var/log/armada/platform-integ-apps-apply_2020-06-2*
-rw-r--r-- 1 1000 users 21784 Jun 20 16:00 /var/log/armada/platform-integ-apps-apply_2020-06-20-15-30-07.log
-rw-r--r-- 1 1000 users 21784 Jun 20 16:30 /var/log/armada/platform-integ-apps-apply_2020-06-20-16-00-39.log
-rw-r--r-- 1 1000 users 21784 Jun 20 17:01 /var/log/armada/platform-integ-apps-apply_2020-06-20-16-31-11.log
-rw-r--r-- 1 1000 users 21784 Jun 20 17:31 /var/log/armada/platform-integ-apps-apply_2020-06-20-17-01-43.log
-rw-r--r-- 1 1000 users 21784 Jun 20 18:02 /var/log/armada/platform-integ-apps-apply_2020-06-20-17-32-15.log
-rw-r--r-- 1 1000 users 21987 Jun 21 20:17 /var/log/armada/platform-integ-apps-apply_2020-06-21-19-47-22.log
-rw-r--r-- 1 1000 users 22122 Jun 21 20:47 /var/log/armada/platform-integ-apps-apply_2020-06-21-20-17-54.log
-rw-r--r-- 1 1000 users 22122 Jun 21 21:18 /var/log/armada/platform-integ-apps-apply_2020-06-21-20-48-26.log
-rw-r--r-- 1 1000 users 9154 Jun 21 21:19 /var/log/armada/platform-integ-apps-apply_2020-06-21-21-18-58.log

[sysadmin@controller-0 ~(keystone_admin)]$ ls -lrt /var/log/armada/oidc-auth-apps-apply_2020-*
-rw-r--r-- 1 1000 users 19243 Jun 20 21:22 /var/log/armada/oidc-auth-apps-apply_2020-06-20-20-52-52.log
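
If useful, the failure reason can be pulled out of these armada logs with something like the following (a sketch, using the filenames listed above):

# Summarize the chart failures recorded in the apply logs
grep -h "ERROR armada" /var/log/armada/platform-integ-apps-apply_2020-06-2*.log | grep -i "failed" | sort -u
grep -h "ERROR armada" /var/log/armada/oidc-auth-apps-apply_2020-*.log | grep -i "timed out" | sort -u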

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as low priority given this is a highly intermittent issue seen with a relatively older load.

tags: added: stx.containers
Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
Yang Liu (yliu12) wrote :

This issue was seen again on DC-4 with a newer load.
platform-integ-apps and cert-manager app applies failed because controller-1 was tainted after a fresh install.

Both apps were applied successfully after removing the taint manually.

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces -o wide | grep rbd
kube-system rbd-provisioner-77bfb6dbb-5l9j9 1/1 Running 1 14m dead:beef::8e22:765f:6121:eb5d controller-0 <none> <none>
kube-system rbd-provisioner-77bfb6dbb-rjszw 0/1 Pending 0 14m <none> <none> <none> <none>

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces -o wide | grep cert
cert-manager cm-cert-manager-856678cfb7-vfndk 1/1 Running 1 20h dead:beef::8e22:765f:6121:eb49 controller-0 <none> <none>
cert-manager cm-cert-manager-856678cfb7-xdfnp 0/1 Pending 0 16h <none> <none> <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-n7dfc 0/1 Pending 0 16h <none> <none> <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-v64lr 1/1 Running 2 20h dead:beef::8e22:765f:6121:eb48 controller-0 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-nfts9 1/1 Running 1 20h dead:beef::8e22:765f:6121:eb47 controller-0 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-zr52l 0/1 Pending 0 16h <none> <none> <none> <none>

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning FailedScheduling 18s (x16 over 15m) default-scheduler 0/2 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe nodes controller-1 | grep -i taint
Taints: node-role.kubernetes.io/master:NoSchedule

New logs uploaded to:
https://files.starlingx.kube.cengn.ca/launchpad/1884469

Revision history for this message
Yang Liu (yliu12) wrote :

The load used in comment #4 was from "2020-07-24_20-00-00"
