platform-integ-apps application apply failure

Bug #1848721 reported by Anujeyan Manokeran
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bob Church

Bug Description

Brief Description
-----------------
An AIO DX+N system failed on the platform-integ-apps apply. This was an intermittent issue during install; a re-apply was successful.
system application-show platform-integ-apps
+---------------+------------------------------------------+
| Property | Value |
+---------------+------------------------------------------+
| active | False |
| app_version | 1.0-8 |
| created_at | 2019-10-18T00:48:05.144205+00:00 |
| manifest_file | manifest.yaml |
| manifest_name | platform-integration-manifest |
| name | platform-integ-apps |
| progress | operation aborted, check logs for detail |
| status | apply-failed |
| updated_at | 2019-10-18T01:20:29.208565+00:00 |
+---------------+------------------------------------------+

2019-10-18 01:18:29.843 46 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:176
2019-10-18 01:19:29.901 46 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:176
2019-10-18 01:20:28.108 46 ERROR armada.handlers.wait [-] [chart=kube-system-rbd-provisioner]: Timed out waiting for pods (namespace=kube-system, labels=(app=rbd-provisioner)). None found! Are `wait.labels` correct? Does `wait.resources` need to exclude `type: pod`?
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada [-] Chart deploy [kube-system-rbd-provisioner] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=kube-system, labels=(app=rbd-provisioner)). None found! Are `wait.labels` correct? Does `wait.resources` need to exclude `type: pod`?
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada Traceback (most recent call last):
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 225, in handle_result
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada result = get_result()
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 236, in <lambda>
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada if (handle_result(chart, lambda: deploy_chart(chart))):
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 214, in deploy_chart
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada chart, cg_test_all_charts, prefix, known_releases)
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 248, in execute
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada chart_wait.wait(timer)
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 134, in wait
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada wait.wait(timeout=timeout)
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 294, in wait
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada modified = self._wait(deadline)
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 354, in _wait
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada raise k8s_exceptions.KubernetesWatchTimeoutException(error)
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=kube-system, labels=(app=rbd-provisioner)). None found! Are `wait.labels` correct? Does `wait.resources` need to exclude `type: pod`?
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada
2019-10-18 01:20:28.111 46 ERROR armada.handlers.armada [-] Chart deploy(s) failed: ['kube-system-rbd-provisioner']
2019-10-18 01:20:28.963 46 INFO armada.handlers.lock [-] Releasing lock
2019-10-18 01:20:28.968 46 ERROR armada.cli [-] Caught internal exception: armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['kube-system-rbd-provisioner']
2019-10-18 01:20:28.968 46 ERROR armada.cli Traceback (most recent call last):
2019-10-18 01:20:28.968 46 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke

Severity
--------
Major

Steps to Reproduce
------------------
1. Install an AIO DX+N system as per the install procedure and unlock all the nodes.

System Configuration
--------------------
AIO DX+N (woflpass-8-12)

Expected Behavior
------------------
Apply succeeds with no errors and no re-apply required.

Actual Behavior
----------------
Apply failure, as per the description above.

Reproducibility
---------------
Intermittent; mostly seen on AIO DX.

Branch/Pull Time/Commit
-----------------------
BUILD_DATE= 2019-10-16 20:02:12 -0400

Last Pass
---------
2019-10-10_20-00-00

Timestamp/Logs
--------------
2019-10-18 01:18:29

Test Activity
-------------
Regression test

description: updated
summary: - AIO DX+N platform-integ-apps application apply failure
+ platform-integ-apps application apply failure
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This was seen on wcp-76-77 with today's sanity load, 20191018T013000Z.

http://128.224.150.21/jenkins/job/cgcs-wildcat-76_77_k8s/178/console

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per Yang, this issue has recently been happening 50% of the time.

tags: added: stx.containers
tags: added: stx.3.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Bob Church (rchurch)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.3.0 / high priority - issue appears to be happening regularly

Revision history for this message
Bob Church (rchurch) wrote :

platform-integ-apps fails to apply because the replica count for the rbd-provisioner pods is zero. With zero replicas, the armada manifest apply will time out: no pods will be launched, yet the apply is waiting on notification of that event. This occurs because VIM services go enabled 4s after the overrides are generated, and the replica count is based on the number of enabled controllers with vim_services.

2019-10-18 00:35:28.293 95972 INFO sysinv.api.controllers.v1.host [-] controller-0 Action unlock perform notify_mtce
2019-10-18 00:48:05.275 113546 INFO sysinv.conductor.manager [-] Platform managed application platform-integ-apps: Uploading...
2019-10-18 00:48:08.556 113546 INFO sysinv.conductor.kube_app [-] Generating application overrides...
2019-10-18 00:48:08.987 113546 INFO sysinv.conductor.kube_app [-] Application platform-integ-apps (1.0-8) upload completed.
2019-10-18 00:49:05.671 113546 INFO sysinv.conductor.manager [-] Platform managed application platform-integ-apps: Applying...
2019-10-18 00:49:05.990 113546 INFO sysinv.conductor.kube_app [-] Application platform-integ-apps (1.0-8) apply started.
2019-10-18 00:49:06.294 113546 INFO sysinv.conductor.kube_app [-] Generating application overrides...
2019-10-18 00:49:10.474 114087 INFO sysinv.api.controllers.v1.host [-] controller-0 notify_availability=services-enabled
2019-10-18 01:20:29.138 113546 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml. See /var/log/armada/platform-integ-apps-apply.log for details.

If you are seeing success 50% of the time, then we have a race condition here. The fix is to update the apply criteria in _met_app_apply_prerequisites() to align with the logic that determines the number of replicas. This way we can ensure the apply happens only when there is a guarantee of at least one replica; a sketch of the idea follows.
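
To illustrate, here is a minimal, self-contained sketch of the race and of the fix. The names and data shapes below are hypothetical stand-ins for sysinv's actual host inventory API, not its real code:

def num_rbd_provisioner_replicas(controllers):
    # Mirror of the override calculation described above: one replica per
    # enabled controller whose VIM services have gone enabled.
    return sum(1 for c in controllers
               if c["operational"] == "enabled" and c["vim_services_enabled"])

def met_app_apply_prerequisites(controllers):
    # The fix: gate the managed apply on the same calculation used when
    # generating overrides, so an apply is never started that is guaranteed
    # to time out waiting on zero pods.
    return num_rbd_provisioner_replicas(controllers) >= 1

# Before the fix, the apply could start in the ~4s window where controller-0
# was enabled but its VIM services were not yet reported, yielding a replica
# count of zero and a manifest wait timeout.
controllers = [{"operational": "enabled", "vim_services_enabled": False}]
assert num_rbd_provisioner_replicas(controllers) == 0
assert not met_app_apply_prerequisites(controllers)  # apply deferred instead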

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/689643

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/689643
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=690cea15ee982fa2f27eac9b7992b61b24d797fb
Submitter: Zuul
Branch: master

commit 690cea15ee982fa2f27eac9b7992b61b24d797fb
Author: Robert Church <email address hidden>
Date: Sun Oct 20 01:46:10 2019 -0400

    Ensure minimal replicas for platform-integ-apps managed apply

    Align the application apply pre-requisites with the replica calculation
    used when generating overrides for the rbd-provisioner.

    This avoids the situation where the application is applied and the
    manifest apply times out due to a replica count of zero being provided in
    the overrides.

    Change-Id: I30bb8816febd33b60e4623b83fd8060f4bbf1f97
    Closes-Bug: #1848721
    Signed-off-by: Robert Church <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

This error was faced during sanity execution from Oct/22 (BUILD: 20191021T230000Z)

controller-0:~$ cat /etc/build.info
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20191021T230000Z"

JOB="STX_build_master_master"
BUILD_BY="<email address hidden>"
BUILD_NUMBER="292"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-10-21 23:00:00 +0000"
controller-0:~$ !source /etc/platform/openrc
source /etc/platform/openrc /etc/platform/openrc
[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+

However, I'm not sure if the fix was merged after the build was created. I'm attaching a full collect.

Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Bob,
It seems your patch does not fix the issue.
I tested my EB based on 20191021T144814Z, which already includes your patch according to the change_log file.
It still failed during the platform-integ-apps application apply.

INFO: sysadmin@10.10.10.3's password:
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applying | processing chart: stx-rbd-provisioner, overall completion: 50.0% |
Connection to 10.10.10.3 closed.

attached sysinv.log

Revision history for this message
zhipeng liu (zhipengs) wrote :

It passes if platform-integ-apps is manually applied again after the first failure.

Zhipeng

Revision history for this message
Bob Church (rchurch) wrote :

I took a look at the collect logs. The fix for this LP looks to be present, as is https://review.opendev.org/#/c/689438/, which was later reverted due to causing some ceph-related issues. I'm not sure if the following is related to that change or to something else, like the k8s upgrade to 1.16.2.

This problem does not appear to be related to this LP's change. All the application overrides look correct and are being applied based on the updated apply prerequisites.

What I'm observing is the following pattern on application apply (see the diagnostic sketch after the log excerpts below):
 - sysinv fires off the manifest apply
 - tiller requests all the current releases
 - an API WARNING shows up in horizon.log for the configmap requests that tiller makes for those releases
 - tiller fails to get the release configmaps and the apply fails

Apply #1:
2019-10-22 12:30:40.297 111637 INFO sysinv.conductor.kube_app [-] Armada apply command = /bin/bash -c 'set -o pipefail; armada apply --enable-chart-cleanup --debug /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-rbd-provisioner.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-ceph-pools-audit.yaml --values /overrides/platform-integ-apps/1.0-8/helm-toolkit-helm-toolkit.yaml --tiller-host tiller-deploy.kube-system.svc.cluster.local | tee /logs/platform-integ-apps-apply.log'
2019-10-22 12:30:41.120893971Z: [storage] 2019/10/22 12:30:41 listing all releases with filter
2019-10-22 12:30:41.836 [WARNING] django.request: Not Found: /api/v1/namespaces/kube-system/configmaps
2019-10-22 12:30:41.83790798Z: [storage/driver] 2019/10/22 12:30:41 list: failed to list: the server could not find the requested resource (get configmaps)
2019-10-22 12:30:42.215 111637 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml. See /var/log/armada/platform-integ-apps-apply.log for details.
2019-10-22 12:30:42.484 111637 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.

Apply #2:
2019-10-22 13:08:49.308 111637 INFO sysinv.conductor.kube_app [-] Armada apply command = /bin/bash -c 'set -o pipefail; armada apply --enable-chart-cleanup --debug /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-rbd-provisioner.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-ceph-pools-audit.yaml --values /overrides/platform-integ-apps/1.0-8/helm-toolkit-helm-toolkit.yaml --tiller-host tiller-deploy.kube-system.svc.cluster.local | tee /logs/platform-integ-apps-apply.log'
2019-10-22 13:08:50.067400757Z: [storage] 2019/10/22 13:08:50 listing all releases with filter
2019-10-22 13:08:50.139 [WARNING] django.request: Not Found: /api/v1/namespaces/kube-system/configmaps
2019-10-22 13:08:50.140220136Z: [storage/driver] 2019/10/22 13:08:50 list: failed to list: the server could not find the requested resource (get configmaps)
2019-10-22 13:08:51.125 111637 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml. See /var/log/...
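
For reference, tiller (Helm v2) stores release state as ConfigMaps in kube-system labeled OWNER=TILLER, so one way to check whether that state is listable is a small diagnostic like the following. This is a sketch using the kubernetes Python client, not part of any StarlingX tooling, and it assumes a valid kubeconfig:

from kubernetes import client, config
from kubernetes.client.rest import ApiException

# Load credentials the same way kubectl does.
config.load_kube_config()
v1 = client.CoreV1Api()

try:
    # Tiller keeps one ConfigMap per release revision in kube-system,
    # labeled OWNER=TILLER.
    cms = v1.list_namespaced_config_map("kube-system",
                                        label_selector="OWNER=TILLER")
    print("tiller release configmaps:",
          [cm.metadata.name for cm in cms.items])
except ApiException as e:
    # A failure here corresponds to the "could not find the requested
    # resource (get configmaps)" errors in the logs above.
    print("configmap list failed:", e.status, e.reason)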


Revision history for this message
Bob Church (rchurch) wrote :

Following up on this LP and the failed sanity results from the build on 20191021T230000Z.

I installed a storage lab with this build. I can confirm that this issue is related to this commit present in the build https://review.opendev.org/#/c/689438/. This was later reverted by: https://review.opendev.org/#/c/690083/

This commit prevented the mgr-restful-plugin from accessing ceph.conf during storage provisioning.

I also installed an AIO-DX with this build; it installed correctly and I did not see the tiller issues reported in the collect logs.

As the sanity report with the 20191024 build was green, no further action is required here related to this LP. Any further issues with platform-integ-apps applying should result in a new LP.

Revision history for this message
Yang Liu (yliu12) wrote :

We did not see the original issue in recent WR sanity on various systems.
The new tiller issue mentioned above was seen once in WR today and a new LP has been opened (1850189).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/691992

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Note: The above commit was linked to this bug by mistake; it is unrelated to this LP.
