platform-integ-apps apply-failed after lock/unlock controller

Bug #1849688 reported by Yang Liu on 2019-10-24
This bug affects 4 people
Affects: StarlingX
Importance: Medium
Assigned to: Bob Church

Bug Description

Brief Description
-----------------
platform-integ-apps is in the apply-failed state after a lock/unlock of controller-0 on a simplex system.
The tiller pod is stuck at MatchNodeSelector.

Severity
--------
Major

Steps to Reproduce
------------------
1. Install and configure a simplex system --> Initial apply was successful
2. lock/unlock controller

TC-name: test_lock_unlock_host[controller]
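
For reference, the lock/unlock in step 2 corresponds to the following sysinv commands (simplified here; the full invocations with credentials appear in the logs below):

$ system host-lock controller-0
$ system host-unlock controller-0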

Expected Behavior
------------------
2. Lock/unlock succeeds and the system remains healthy afterwards.

Actual Behavior
----------------
2. Lock/unlock succeeded, but platform-integ-apps went to apply-failed and the tiller pod was stuck at MatchNodeSelector.

Reproducibility
---------------
Happened 2/3 times on simplex systems.

System Configuration
--------------------
One node system
Lab-name: wcp122

Branch/Pull Time/Commit
-----------------------
2019-10-23_20-00-00

Last Pass
---------
2019-10-21_20-00-00 on same system

Timestamp/Logs
--------------
[2019-10-24 07:04:45,331] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock controller-0'

[2019-10-24 07:05:24,844] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

[2019-10-24 07:14:26,862] 433 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system tiller-deploy-d6b59fcb-z4jzg 0/1 MatchNodeSelector 0 34m <none> controller-0 <none> <none>

[2019-10-24 07:14:48,208] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+

Test Activity
-------------
Sanity

Bob Church (rchurch) wrote :

It looks like an endpoint/firewall update may be impacting an application apply that is in progress.

# Initial apply works during provisioning
2019-10-24 06:56:58.365 110188 INFO sysinv.conductor.manager [-] Platform managed application platform-integ-apps: Applying...
2019-10-24 06:56:58.640 110188 INFO sysinv.conductor.kube_app [-] Register the initial abort status of app platform-integ-apps
2019-10-24 06:56:58.940 110188 INFO sysinv.conductor.kube_app [-] Application platform-integ-apps (1.0-8) apply started.
2019-10-24 06:58:09.356 110188 INFO sysinv.conductor.kube_app [-] All docker images for application platform-integ-apps were successfully downloaded in 70 seconds
2019-10-24 06:58:33.243 110188 INFO sysinv.conductor.kube_app [-] Application manifest /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml was successfully applied/re-applied.
2019-10-24 06:58:33.244 110188 INFO sysinv.conductor.kube_app [-] Exiting progress monitoring thread for app platform-integ-apps
2019-10-24 06:58:33.549 110188 INFO sysinv.conductor.kube_app [-] Application platform-integ-apps (1.0-8) apply completed.

# An override change has been detected. Not sure why this is the case. Needs investigation…
2019-10-24 07:13:16.722 102497 INFO sysinv.conductor.manager [-] There has been an overrides change, setting up reapply of platform-integ-apps

# Firewall update is triggered
2019-10-24 07:13:16.726 102497 INFO sysinv.agent.rpcapi [-] config_apply_runtime_manifest: fanout_cast: sending config 66c95e55-43a1-4b79-847d-43e6960123d2 {'classes': ['openstack::keystone::endpoint::runtime', 'platform::firewall::runtime', 'platform::sysinv::runtime'], 'force': False, 'personalities': ['controller'], 'host_uuids': [u'4624ddd2-6b83-4e12-ada6-f6862e120509']} to agent
2019-10-24 07:13:16.728 22171 INFO sysinv.agent.manager [req-337b5587-475e-4645-8aee-9b8013fcc669 admin None] config_apply_runtime_manifest: 66c95e55-43a1-4b79-847d-43e6960123d2 {u'classes': [u'openstack::keystone::endpoint::runtime', u'platform::firewall::runtime', u'platform::sysinv::runtime'], u'force': False, u'personalities': [u'controller'], u'host_uuids': [u'4624ddd2-6b83-4e12-ada6-f6862e120509']} controller

2019-10-24 07:13:31.950 102497 INFO sysinv.conductor.manager [req-de8da658-fc3b-423c-b47e-eb2ff9cf9342 admin admin] Updating platform data for host: 4624ddd2-6b83-4e12-ada6-f6862e120509 with: {u'availability': u'services-enabled'}
2019-10-24 07:13:32.171 102497 INFO sysinv.helm.manifest_base [req-de8da658-fc3b-423c-b47e-eb2ff9cf9342 admin admin] Delete manifest file /opt/platform/armada/19.10/platform-integ-apps/1.0-8/platform-integ-apps-manifest-del.yaml generated
2019-10-24 07:13:32.172 102497 INFO sysinv.conductor.manager [req-de8da658-fc3b-423c-b47e-eb2ff9cf9342 admin admin] There has been an overrides change, setting up reapply of platform-integ-apps

# Re-apply occurs due to reapply flag being raised
2019-10-24 07:14:13.743 102497 INFO sysinv.conductor.manager [-] Reapplying platform-integ-apps app
2019-10-24 07:14:13.747 102497 INFO sysinv.conductor.kube_app [-] Register the initial abort status of app platform-integ-apps
2019-10-24 07:14:14.054 102497 INFO sy...


Yang Liu (yliu12) wrote :

Removed "tiller pod" portion from title - the tiller status may not be related to this LP since there are more than 1 tiller pods on system and one of them seems to be running.

There's another LP for multiple tiller pods - https://bugs.launchpad.net/starlingx/+bug/1848033

summary: - platform-integ-apps apply-failed after lock/unlock controller - tiller
- pod stuck at MatchNodeSelector
+ platform-integ-apps apply-failed after lock/unlock controller
Yang Liu (yliu12) on 2019-10-24
description: updated
Bob Church (rchurch) wrote :

Attempting a manual reapply fails:

2019-10-24 16:42:21.443 11 ERROR armada.handlers.wait [-] [chart=kube-system-rbd-provisioner]: Timed out waiting for pods (namespace=kube-system, labels=(app=rbd-provisioner)). These pods were not ready=['rbd-provisioner-7484d49cf6-k9bw8']
2019-10-24 16:42:21.443 11 ERROR armada.handlers.armada [-] Chart deploy [kube-system-rbd-provisioner] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=kube-system, labels=(app=rbd-provisioner)). These pods were not ready=['rbd-provisioner-7484d49cf6-k9bw8']
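
For context, a manual reapply on StarlingX is driven through the sysinv CLI; presumably something like the following was run (command assumed, as it is not shown in this comment):

$ system application-apply platform-integ-apps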

Checking the pod status, I observe some rather old pods that are stuck in MatchNodeSelector:

kube-system rbd-provisioner-7484d49cf6-k9bw8 0/1 MatchNodeSelector 0 9h <none> controller-0 <none> <none>
kube-system rbd-provisioner-7484d49cf6-vlr62 1/1 Running 1 9h 172.16.192.99 controller-0 <none> <none>
kube-system storage-init-rbd-provisioner-xwnrl 0/1 Completed 0 9h 172.16.192.72 controller-0 <none> <none>
kube-system tiller-deploy-d6b59fcb-j8q2s 1/1 Running 1 9h 192.168.204.3 controller-0 <none> <none>
kube-system tiller-deploy-d6b59fcb-ntd2v 0/1 MatchNodeSelector 0 9h <none> controller-0 <none> <none>
kube-system tiller-deploy-d6b59fcb-z4jzg 0/1 MatchNodeSelector 0 10h <none> controller-0 <none> <none>
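
These stuck pods report a Failed phase (see the describe output below), so they can be listed directly; a possible one-liner (field selector assumed):

# List pods the kubelet has marked Failed (includes the MatchNodeSelector leftovers)
$ kubectl get pods -n kube-system --field-selector=status.phase=Failed -o wide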

Looking at the status of the pods, it looks like the originally applied pods may have been scheduled and started, but eventually failed due to a missing node selector.

2019-10-24 06:58:33.549 110188 INFO sysinv.conductor.kube_app [-] Application platform-integ-apps (1.0-8) apply completed.

Name: rbd-provisioner-7484d49cf6-k9bw8
Namespace: kube-system
Priority: 0
Node: controller-0/
Start Time: Thu, 24 Oct 2019 06:58:10 +0000
Labels: app=rbd-provisioner
                pod-template-hash=7484d49cf6
Annotations: cni.projectcalico.org/podIP: 172.16.192.78/32
Status: Failed
Reason: MatchNodeSelector
Message: Pod Predicate MatchNodeSelector failed
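
To pull out the failure reason and message for all such pods at once, a jsonpath query along these lines could be used (a sketch; the filter expression is an assumption):

# Print name and failure message for every pod whose status.reason is MatchNodeSelector
$ kubectl get pods -n kube-system -o jsonpath='{range .items[?(@.status.reason=="MatchNodeSelector")]}{.metadata.name}{"\t"}{.status.message}{"\n"}{end}'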

Another provisioner is started later, but does not appear to be associated with a particular re-apply:

Name: rbd-provisioner-7484d49cf6-vlr62
Namespace: kube-system
Priority: 0
Node: controller-0/192.168.204.3
Start Time: Thu, 24 Oct 2019 07:30:51 +0000
Labels: app=rbd-provisioner
              pod-template-hash=7484d49cf6
Annotations: cni.projectcalico.org/podIP: 172.16.192.99/32
Status: Running

Bob Church (rchurch) wrote :

Killing the stuck pod allows the app reapply to succeed:

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl delete pods -nkube-system rbd-provisioner-7484d49cf6-k9bw8

2019-10-24 17:35:50.842 96382 INFO sysinv.conductor.kube_app [-] Application platform-integ-apps (1.0-8) apply started.
2019-10-24 17:35:50.868 96382 INFO sysinv.conductor.kube_app [-] Generating application overrides...
2019-10-24 17:35:51.017 96382 INFO sysinv.helm.manifest_base [req-ef2900cd-a6a8-460d-9743-27eac307844f admin admin] Delete manifest file /opt/platform/armada/19.10/platform-integ-apps/1.0-8/platform-integ-apps-manifest-del.yaml generated
2019-10-24 17:35:51.017 96382 INFO sysinv.conductor.kube_app [-] Application overrides generated.
2019-10-24 17:35:51.043 96382 INFO sysinv.conductor.kube_app [-] Armada manifest file has no img tags for chart helm-toolkit
2019-10-24 17:35:51.070 96382 INFO sysinv.conductor.kube_app [-] Image registry.local:9001/quay.io/external_storage/rbd-provisioner:v2.1.1-k8s1.11 download started from local registry
2019-10-24 17:35:51.072 96382 INFO sysinv.conductor.kube_app [-] Image registry.local:9001/docker.io/starlingx/ceph-config-helper:v1.15.0 download started from local registry
2019-10-24 17:35:51.438 96382 INFO sysinv.conductor.kube_app [-] Image registry.local:9001/quay.io/external_storage/rbd-provisioner:v2.1.1-k8s1.11 download succeeded in 0 seconds
2019-10-24 17:35:51.452 96382 INFO sysinv.conductor.kube_app [-] Image registry.local:9001/docker.io/starlingx/ceph-config-helper:v1.15.0 download succeeded in 0 seconds
2019-10-24 17:35:51.452 96382 INFO sysinv.conductor.kube_app [-] All docker images for application platform-integ-apps were successfully downloaded in 0 seconds
2019-10-24 17:35:51.458 96382 INFO sysinv.conductor.kube_app [-] Armada apply command = /bin/bash -c 'set -o pipefail; armada apply --enable-chart-cleanup --debug /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-rbd-provisioner.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-ceph-pools-audit.yaml --values /overrides/platform-integ-apps/1.0-8/helm-toolkit-helm-toolkit.yaml --tiller-host tiller-deploy.kube-system.svc.cluster.local | tee /logs/platform-integ-apps-apply.log'
2019-10-24 17:35:52.456 96382 INFO sysinv.conductor.kube_app [-] Starting progress monitoring thread for app platform-integ-apps
2019-10-24 17:35:52.613 96382 INFO sysinv.conductor.kube_app [-] processing chart: stx-ceph-pools-audit, overall completion: 100.0%
2019-10-24 17:35:53.265 96382 INFO sysinv.conductor.kube_app [-] Application manifest /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml was successfully applied/re-applied.
2019-10-24 17:35:53.265 96382 INFO sysinv.conductor.kube_app [-] Exiting progress monitoring thread for app platform-integ-apps
2019-10-24 17:35:53.546 96382 INFO sysinv.conductor.kube_app [-] Application platform-integ-apps (1.0-8) apply completed.
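
More generally, the same workaround can be applied to every pod left in this state before reapplying; a minimal sketch (the field selector and the application-apply step are assumptions):

# Delete all Failed pods in kube-system (covers the MatchNodeSelector leftovers), then reapply
$ kubectl get pods -n kube-system --field-selector=status.phase=Failed -o name | xargs -r -n1 kubectl delete -n kube-system
$ system application-apply platform-integ-apps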

Wendy Mitchell (wmitchellwr) wrote :

Hit this issue again after a lock/unlock on the single-node system.
2019-10-23_20-00-00

$ system application-list
...
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail

2019-10-24 20:09:03.486 11 ERROR armada.handlers.wait [-] [chart=kube-system-rbd-provisioner]: Timed out waiting for pods (namespace=kube-system, labels=(app=rbd-provisioner)). These pods were not ready=['rbd-provisioner-7484d49cf6-dfx6h']
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada [-] Chart deploy [kube-system-rbd-provisioner] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=kube-system, labels=(app=rbd-provisioner)). These pods were not ready=['rbd-provisioner-7484d49cf6-dfx6h']
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada Traceback (most recent call last):
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 225, in handle_result
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada result = get_result()
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 236, in <lambda>
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada if (handle_result(chart, lambda: deploy_chart(chart))):
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 214, in deploy_chart
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada chart, cg_test_all_charts, prefix, known_releases)
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 248, in execute
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada chart_wait.wait(timer)
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 134, in wait
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada wait.wait(timeout=timeout)
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 294, in wait
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada modified = self._wait(deadline)
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 354, in _wait
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada raise k8s_exceptions.KubernetesWatchTimeoutException(error)
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=kube-system, labels=(app=rbd-provisioner)). These pods were not ready=['rbd-provisioner-7484d49cf6-dfx6h']
2019-10-24 20:09:03.487 11 ERROR armada.handlers.armada
2019-10-24 20:09:03.489 11 ERROR armada.handlers.armada [-] Chart deploy(s) failed: ['kube-system-rbd-provisioner']
2019-10-24 20:09:03.979 11 ERROR armada.cli [-] Caught internal exception: armada.exceptions...


Frank Miller (sensfan22) wrote :

Assigning to Bob to triage.

Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
Ghada Khalil (gkhalil) wrote :

stx.3.0 / high priority - issue related to container recovery and occurs frequently

Changed in starlingx:
importance: Undecided → High
tags: added: stx.containers
Changed in starlingx:
status: New → Triaged
tags: added: stx.3.0
Yang Liu (yliu12) on 2019-10-31
tags: added: stx.retestneeded
Bob Church (rchurch) wrote :

I applied the following upstream proposed fix: https://github.com/kubernetes/kubernetes/pull/80976

With this incorporated into the build and installed in my test lab, I ran a continuous lock/unlock cycle for over 36 hours without seeing the MatchNodeSelector issue. Prior to including this patch, I would see the issue in under ~45 minutes.
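
For reference, the soak test was essentially a loop of this shape (a sketch; wait times and the detection check are assumptions, and a real run would poll host availability rather than sleep):

# lock/unlock soak loop (sketch)
while true; do
  system host-lock controller-0;   sleep 120   # wait for the lock to complete
  system host-unlock controller-0; sleep 900   # wait for the host to recover
  kubectl get pods --all-namespaces | grep MatchNodeSelector && break
done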

Possible next steps here:

1) Live with the issue occurring periodically, wait for the fix to land upstream (if it does), and pull it in on the following rebase.

2) Patch k8s in StarlingX with this change and evaluate if this fixes the issue over the coming weeks.

Frank Miller (sensfan22) wrote :

Changing the tag for this issue from stx.3.0 to stx.4.0, as the solution requires moving to a new version of Kubernetes, which won't happen until stx.4.0.

Also lowering the priority to medium as a workaround exists: delete any pods stuck in the MatchNodeSelector state.

tags: added: stx.4.0
removed: stx.3.0
Changed in starlingx:
importance: High → Medium