platform-integ-apps application apply failure

Bug #1848721 reported by Anujeyan Manokeran
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bob Church

Bug Description

Brief Description
-----------------
An AIO DX+N system failed on the platform-integ-apps apply. This was an intermittent issue during install; a re-apply was successful.
system application-show platform-integ-apps
+---------------+------------------------------------------+
| Property | Value |
+---------------+------------------------------------------+
| active | False |
| app_version | 1.0-8 |
| created_at | 2019-10-18T00:48:05.144205+00:00 |
| manifest_file | manifest.yaml |
| manifest_name | platform-integration-manifest |
| name | platform-integ-apps |
| progress | operation aborted, check logs for detail |
| status | apply-failed |
| updated_at | 2019-10-18T01:20:29.208565+00:00 |
+---------------+------------------------------------------+

2019-10-18 01:18:29.843 46 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:176
2019-10-18 01:19:29.901 46 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:176
2019-10-18 01:20:28.108 46 ERROR armada.handlers.wait [-] [chart=kube-system-rbd-provisioner]: Timed out waiting for pods (namespace=kube-system, labels=(app=rbd-provisioner)). None found! Are `wait.labels` correct? Does `wait.resources` need to exclude `type: pod`?
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada [-] Chart deploy [kube-system-rbd-provisioner] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=kube-system, labels=(app=rbd-provisioner)). None found! Are `wait.labels` correct? Does `wait.resources` need to exclude `type: pod`?
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada Traceback (most recent call last):
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 225, in handle_result
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada result = get_result()
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 236, in <lambda>
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada if (handle_result(chart, lambda: deploy_chart(chart))):
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 214, in deploy_chart
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada chart, cg_test_all_charts, prefix, known_releases)
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 248, in execute
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada chart_wait.wait(timer)
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 134, in wait
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada wait.wait(timeout=timeout)
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 294, in wait
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada modified = self._wait(deadline)
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 354, in _wait
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada raise k8s_exceptions.KubernetesWatchTimeoutException(error)
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=kube-system, labels=(app=rbd-provisioner)). None found! Are `wait.labels` correct? Does `wait.resources` need to exclude `type: pod`?
2019-10-18 01:20:28.109 46 ERROR armada.handlers.armada
2019-10-18 01:20:28.111 46 ERROR armada.handlers.armada [-] Chart deploy(s) failed: ['kube-system-rbd-provisioner']
2019-10-18 01:20:28.963 46 INFO armada.handlers.lock [-] Releasing lock
2019-10-18 01:20:28.968 46 ERROR armada.cli [-] Caught internal exception: armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['kube-system-rbd-provisioner']
2019-10-18 01:20:28.968 46 ERROR armada.cli Traceback (most recent call last):
2019-10-18 01:20:28.968 46 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke

Severity
--------
Major

Steps to Reproduce
------------------
1. Install an AIO DX+N system as per the install procedure and unlock all the nodes.

System Configuration
--------------------
AIO DX+N (woflpass-8-12)

Expected Behavior
------------------
Apply succeeds with no errors and no re-apply required.

Actual Behavior
----------------
Apply failure, as per the description above.

Reproducibility
---------------
Intermittent; mostly seen on AIO DX.

Branch/Pull Time/Commit
-----------------------
BUILD_DATE= 2019-10-16 20:02:12 -0400

Last Pass
---------
2019-10-10_20-00-00

Timestamp/Logs
--------------
2019-10-18 01:18:29

Test Activity
-------------
Regression test

description: updated
summary: - AIO DX+N platform-integ-apps application apply failure
+ platform-integ-apps application apply failure
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This was seen on wcp-76-77 with today's sanity load, 20191018T013000Z.

http://128.224.150.21/jenkins/job/cgcs-wildcat-76_77_k8s/178/console

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per Yang, this issue has recently been happening 50% of the time.

tags: added: stx.containers
tags: added: stx.3.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Bob Church (rchurch)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.3.0 / high priority - issue appears to be happening regularly

Revision history for this message
Bob Church (rchurch) wrote :

platform-integ-apps fails to apply because the replica count for the rbd-provisioner pods is zero. With zero replicas, the armada manifest apply will time out: no pods will be launched, yet the apply is waiting on notification of that event. This occurs because VIM services go enabled 4s after the overrides are generated, and the replica count is based on the number of enabled controllers with vim_services.

2019-10-18 00:35:28.293 95972 INFO sysinv.api.controllers.v1.host [-] controller-0 Action unlock perform notify_mtce
2019-10-18 00:48:05.275 113546 INFO sysinv.conductor.manager [-] Platform managed application platform-integ-apps: Uploading...
2019-10-18 00:48:08.556 113546 INFO sysinv.conductor.kube_app [-] Generating application overrides...
2019-10-18 00:48:08.987 113546 INFO sysinv.conductor.kube_app [-] Application platform-integ-apps (1.0-8) upload completed.
2019-10-18 00:49:05.671 113546 INFO sysinv.conductor.manager [-] Platform managed application platform-integ-apps: Applying...
2019-10-18 00:49:05.990 113546 INFO sysinv.conductor.kube_app [-] Application platform-integ-apps (1.0-8) apply started.
2019-10-18 00:49:06.294 113546 INFO sysinv.conductor.kube_app [-] Generating application overrides...
2019-10-18 00:49:10.474 114087 INFO sysinv.api.controllers.v1.host [-] controller-0 notify_availability=services-enabled
2019-10-18 01:20:29.138 113546 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml. See /var/log/armada/platform-integ-apps-apply.log for details.

If you are seeing success 50% of the time, then we have a race condition here. The fix is to update the apply criteria in _met_app_apply_prerequisites() to align with the logic that determines the number of replicas. This way we can ensure the apply happens only when there is a guarantee of at least one replica; a sketch of the idea follows.
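
To illustrate, here is a minimal, self-contained sketch of the race and of the fix. The names and data shapes below are hypothetical stand-ins for sysinv's actual host inventory API, not its real code:

def num_rbd_provisioner_replicas(controllers):
    # Mirror of the override calculation described above: one replica per
    # enabled controller whose VIM services have gone enabled.
    return sum(1 for c in controllers
               if c["operational"] == "enabled" and c["vim_services_enabled"])

def met_app_apply_prerequisites(controllers):
    # The fix: gate the managed apply on the same calculation used when
    # generating overrides, so an apply is never started that is guaranteed
    # to time out waiting on zero pods.
    return num_rbd_provisioner_replicas(controllers) >= 1

# Before the fix, the apply could start in the ~4s window where controller-0
# was enabled but its VIM services were not yet reported, yielding a replica
# count of zero and a manifest wait timeout.
controllers = [{"operational": "enabled", "vim_services_enabled": False}]
assert num_rbd_provisioner_replicas(controllers) == 0
assert not met_app_apply_prerequisites(controllers)  # apply deferred instead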

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/689643

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/689643
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=690cea15ee982fa2f27eac9b7992b61b24d797fb
Submitter: Zuul
Branch: master

commit 690cea15ee982fa2f27eac9b7992b61b24d797fb
Author: Robert Church <email address hidden>
Date: Sun Oct 20 01:46:10 2019 -0400

    Ensure minimal replicas for platform-integ-apps managed apply

    Align the application apply pre-requisites with the replica calculation
    used when generating overrides for the rbd-provisioner.

    This avoids the situation where the application is applied and the
    manifest apply times out due to a replica count of zero being provided in
    the overrides.

    Change-Id: I30bb8816febd33b60e4623b83fd8060f4bbf1f97
    Closes-Bug: #1848721
    Signed-off-by: Robert Church <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

This error was faced during sanity execution from Oct/22 (BUILD: 20191021T230000Z)

controller-0:~$ cat /etc/build.info
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20191021T230000Z"

JOB="STX_build_master_master"
BUILD_BY="<email address hidden>"
BUILD_NUMBER="292"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-10-21 23:00:00 +0000"
controller-0:~$ !source /etc/platform/openrc
source /etc/platform/openrc /etc/platform/openrc
[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+

However, I'm not sure if the fix was merged after the build was created. I'm attaching a full collect.

Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Bob,
It seems your patch does not fix the issue.
I tested my EB based on 20191021T144814Z, which already includes your patch according to the change_log file.
It still failed during the platform-integ-apps application apply.

INFO: sysadmin@10.10.10.3's password:
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applying | processing chart: stx-rbd-provisioner, overall completion: 50.0% |
Connection to 10.10.10.3 closed.

attached sysinv.log

Revision history for this message
zhipeng liu (zhipengs) wrote :

It passes if platform-integ-apps is manually applied again after the first failure.

Zhipeng

Revision history for this message
Bob Church (rchurch) wrote :

I took a look at the collect logs. The fix for this LP looks to be present, as is https://review.opendev.org/#/c/689438/, which was later reverted due to causing some ceph-related issues. I'm not sure if the following is related to that change or to something else, like the k8s upgrade to 1.16.2.

This problem does not appear to be related to this LP's change. All the application overrides look correct and are being applied based on the updated apply prerequisites.

What I'm observing is the following pattern on application apply (see the diagnostic sketch after the log excerpts below):
 - sysinv fires off the manifest apply
 - tiller requests all the current releases
 - an API WARNING shows up in horizon.log for the configmap requests that tiller makes for those releases
 - tiller fails to get the release configmaps and the apply fails

Apply #1:
2019-10-22 12:30:40.297 111637 INFO sysinv.conductor.kube_app [-] Armada apply command = /bin/bash -c 'set -o pipefail; armada apply --enable-chart-cleanup --debug /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-rbd-provisioner.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-ceph-pools-audit.yaml --values /overrides/platform-integ-apps/1.0-8/helm-toolkit-helm-toolkit.yaml --tiller-host tiller-deploy.kube-system.svc.cluster.local | tee /logs/platform-integ-apps-apply.log'
2019-10-22 12:30:41.120893971Z: [storage] 2019/10/22 12:30:41 listing all releases with filter
2019-10-22 12:30:41.836 [WARNING] django.request: Not Found: /api/v1/namespaces/kube-system/configmaps
2019-10-22 12:30:41.83790798Z: [storage/driver] 2019/10/22 12:30:41 list: failed to list: the server could not find the requested resource (get configmaps)
2019-10-22 12:30:42.215 111637 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml. See /var/log/armada/platform-integ-apps-apply.log for details.
2019-10-22 12:30:42.484 111637 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.

Apply #2:
2019-10-22 13:08:49.308 111637 INFO sysinv.conductor.kube_app [-] Armada apply command = /bin/bash -c 'set -o pipefail; armada apply --enable-chart-cleanup --debug /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-rbd-provisioner.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-ceph-pools-audit.yaml --values /overrides/platform-integ-apps/1.0-8/helm-toolkit-helm-toolkit.yaml --tiller-host tiller-deploy.kube-system.svc.cluster.local | tee /logs/platform-integ-apps-apply.log'
2019-10-22 13:08:50.067400757Z: [storage] 2019/10/22 13:08:50 listing all releases with filter
2019-10-22 13:08:50.139 [WARNING] django.request: Not Found: /api/v1/namespaces/kube-system/configmaps
2019-10-22 13:08:50.140220136Z: [storage/driver] 2019/10/22 13:08:50 list: failed to list: the server could not find the requested resource (get configmaps)
2019-10-22 13:08:51.125 111637 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml. See /var/log/...
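
For reference, tiller (Helm v2) stores release state as ConfigMaps in kube-system labeled OWNER=TILLER, so one way to check whether that state is listable is a small diagnostic like the following. This is a sketch using the kubernetes Python client, not part of any StarlingX tooling, and it assumes a valid kubeconfig:

from kubernetes import client, config
from kubernetes.client.rest import ApiException

# Load credentials the same way kubectl does.
config.load_kube_config()
v1 = client.CoreV1Api()

try:
    # Tiller keeps one ConfigMap per release revision in kube-system,
    # labeled OWNER=TILLER.
    cms = v1.list_namespaced_config_map("kube-system",
                                        label_selector="OWNER=TILLER")
    print("tiller release configmaps:",
          [cm.metadata.name for cm in cms.items])
except ApiException as e:
    # A failure here corresponds to the "could not find the requested
    # resource (get configmaps)" errors in the logs above.
    print("configmap list failed:", e.status, e.reason)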


Revision history for this message
Bob Church (rchurch) wrote :

Following up on this LP and the failed sanity results from the build on 20191021T230000Z.

I installed a storage lab with this build. I can confirm that this issue is related to this commit present in the build https://review.opendev.org/#/c/689438/. This was later reverted by: https://review.opendev.org/#/c/690083/

This commit prevented the mgr-restful-plugin from accessing ceph.conf during storage provisioning.

I also installed an AIO-DX with this build; it installed correctly and I did not see the tiller issues reported in the collect logs.

As the sanity report with the 20191024 build was green, no further action is required here related to this LP. Any further issues with platform-integ-apps applying should result in a new LP.

Revision history for this message
Yang Liu (yliu12) wrote :

We did not see the original issue in recent WR sanity on various systems.
The new tiller issue mentioned above was seen once in WR today and a new LP has been opened (1850189).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/691992

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Note: The above commit was linked to this bug by mistake; it is unrelated to this LP.
