platform-integ-apps occasionally fails to apply initially because tiller is unable to get release info

Bug #1850189 reported by Yang Liu on 2019-10-28
This bug affects 3 people

Affects: StarlingX
Importance: Medium
Assigned to: Stefan Dinescu

Bug Description

Brief Description
-----------------
After initial install and config, platform-integ-apps occasionally ends up in apply-failed status because tiller is unable to get release info.

The workaround is to re-apply platform-integ-apps; the re-apply usually passes.

Severity
--------
Minor

Steps to Reproduce
------------------
Install and configure a stx system

Expected Behavior
------------------
- platform-integ-apps automatically applied successfully

Actual Behavior
----------------
- platform-integ-apps occasionally ends up in apply-failed status because tiller is unable to get release info

Reproducibility
---------------
Rare - less than 10%

System Configuration
--------------------
One node system
Lab-name: SM-4

Branch/Pull Time/Commit
-----------------------
2019-10-27_20-00-00

Last Pass
---------
This is very intermittent, so the last pass is unknown. This is the first occurrence in WR sanity in the past 1.5 weeks of runs that I checked.

Timestamp/Logs
--------------
# platform-integ-apps failed before following timestamp
[2019-10-28 07:02:53,338] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| platform-integ-apps | 1.0-9 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+

# Comment from Robert Church:
Looks like tiller couldn't get release information, which is why the apply failed.
{"log":"[main] 2019/10/28 06:56:29 Starting Tiller v2.13.1 (tls=false)\n","stream":"stderr","time":"2019-10-28T06:56:29.952167656Z"}
{"log":"[main] 2019/10/28 06:56:29 GRPC listening on :44134\n","stream":"stderr","time":"2019-10-28T06:56:29.952227906Z"}
{"log":"[main] 2019/10/28 06:56:29 Probes listening on :44135\n","stream":"stderr","time":"2019-10-28T06:56:29.95223832Z"}
{"log":"[main] 2019/10/28 06:56:29 Storage driver is ConfigMap\n","stream":"stderr","time":"2019-10-28T06:56:29.952245037Z"}
{"log":"[main] 2019/10/28 06:56:29 Max history per release is 0\n","stream":"stderr","time":"2019-10-28T06:56:29.952251347Z"}
{"log":"[storage] 2019/10/28 06:59:50 listing all releases with filter\n","stream":"stderr","time":"2019-10-28T06:59:50.862282171Z"}
{"log":"[storage/driver] 2019/10/28 06:59:52 list: failed to list: the server could not find the requested resource (get configmaps)\n","stream":"stderr","time":"2019-10-28T06:59:52.094561718Z"}

Test Activity
-------------
Sanity

Yang Liu (yliu12) wrote :
Ghada Khalil (gkhalil) wrote :

stx.3.0 / medium priority - should be investigated further

description: updated
tags: added: stx.containers
tags: added: stx.3.0
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Bob Church (rchurch)
Bob Church (rchurch) wrote :

I saw this occur again on SM-4. I think the next step here is to come up with a reliable reproduction scenario; a test that runs consecutive lock/unlock cycles should achieve this. What I observed, after an unlock, was that the k8s API was not accessible (from logs in sysinv) within the same timeframe as the tiller error. We might need to determine whether there is intermittent API access on the AIO-SX, since there is only one replica. If intermittent access is not an issue, we may need to test API access prior to running the application-apply, as after a host unlock the apply might be happening before all the pods are recovered.
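Testing API access prior to running the application-apply, as suggested above, amounts to a poll-until-ready loop. This is a minimal sketch, not actual StarlingX code; the probe function, the `/healthz` endpoint URL, and the retry parameters are assumptions for illustration.

```python
import time


def wait_until_ready(probe, attempts=30, delay=2.0):
    """Poll `probe` until it returns True or attempts are exhausted.

    `probe` is any zero-argument callable returning True when the
    dependency (e.g. the k8s API) is reachable, False otherwise.
    Returns True if the dependency became ready, False on timeout.
    """
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


def apiserver_healthy(url="https://localhost:6443/healthz"):
    """Example probe hitting the apiserver health endpoint.

    The URL/port are assumptions; a real check would use the cluster's
    kubeconfig and proper TLS verification.
    """
    import ssl
    import urllib.request
    try:
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        with urllib.request.urlopen(url, timeout=3, context=ctx) as resp:
            return resp.status == 200
    except OSError:
        return False
```

With such a helper, sysinv could gate the apply on `wait_until_ready(apiserver_healthy)` and fail fast with a clear message instead of surfacing an opaque tiller error.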

Frank Miller (sensfan22) wrote :

Based on Bob's analysis, this looks like a timing or dependency issue with k8s recovery on an AIO-SX unlock. Assigning to Stefan to continue the investigation. Suggest keeping Bart informed if this is a k8s recovery/dependency issue so he can help identify the best solution.

Changed in starlingx:
assignee: Bob Church (rchurch) → Stefan Dinescu (stefandinescu)
Wendy Mitchell (wmitchellwr) wrote :

Experienced a case of platform-integ-apps apply failing
2019-11-02_08-39-54
2 controller system
R720 1-2

see logs attached
Exceptions in processing chart rbd-provisioner

starting here:
2019-11-04 15:46:22.982 11 INFO armada.handlers.chart_deploy [-] [chart=kube-system-rbd-provisioner]: Processing Chart, release=stx-rbd-provisioner
...
 2019-11-04 15:46:44.059 11 ERROR armada.handlers.tiller [-] [chart=kube-system-rbd-provisioner]: Error while updating release stx-rbd-provisioner: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
...
2019-11-04 15:47:04.080 11 ERROR armada.handlers.tiller During handling of the above exception, another exception occurred:
...

2019-11-04 15:47:04.080 11 ERROR armada.handlers.tiller During handling of the above exception, another exception occurred:
2019-11-04 15:47:04.080 11 ERROR armada.handlers.tiller
2019-11-04 15:47:04.080 11 ERROR armada.handlers.tiller Traceback (most recent call last):
2019-11-04 15:47:04.080 11 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 546, in get_release_status
2019-11-04 15:47:04.080 11 ERROR armada.handlers.tiller status_request, self.timeout, metadata=self.metadata)
...
2019-11-04 15:47:04.080 11 ERROR armada.handlers.armada During handling of the above exception, another exception occurred:
...

2019-11-04 15:47:04.080 11 ERROR armada.handlers.armada [-] Chart deploy [kube-system-rbd-provisioner] failed: armada.exceptions.tiller_exceptions.GetReleaseStatusException: Failed to get stx-rbd-provisioner status 0 version
...

2019-11-04 15:47:04.088 11 ERROR armada.handlers.armada [-] Chart deploy(s) failed: ['kube-system-rbd-provisioner']
2019-11-04 15:47:04.793 11 INFO armada.handlers.lock [-] Releasing lock
...

Wendy Mitchell (wmitchellwr) wrote :
tags: added: stx.retestneeded
Wendy Mitchell (wmitchellwr) wrote :

Steps (roughly) prior to the problem

[Unlocked controller-1 then swact was initiated here]
2019-11-04 15:25:10,211 [INFO] horizon.operation_log: [admin 0b18a42bf702449a83245107e807391f] [admin 801af43854bb4bc4a8527e9409fa6afd] [POST /admin/ 302] parameters:[{"action": "hostscontroller__swact__1", "csrfmiddlewaretoken": "a7HwE603lUKsk7PEWLbWXQVPf7rwYttoWJPWmDWqLrm1kEsOowBQcze9tVPperiZ"}] message:[success: Swact Initiated Host: controller-0]

[Lock controller-1 then elastic labels added (via horizon) the controller-1 was unlocked - elastic-data, elastic-controller,elastic client, elastic-master]

2019-11-04 15:46:43,736 [INFO] horizon.operation_log: [admin 0b18a42bf702449a83245107e807391f] [admin 801af43854bb4bc4a8527e9409fa6afd] [POST /admin/ 302] parameters:[{"action": "hostscontroller__lock__2", "csrfmiddlewaretoken": "r3wLFmZ4VcOyMt4wTm5sSQDgRGbAJTUzUR1fc0Vg13qFx4b8obqectAGIsOnFp94"}] message:[success: Locking Host: controller-1]

2019-11-04 15:47:45,148 [INFO] horizon.operation_log: [admin 0b18a42bf702449a83245107e807391f] [admin 801af43854bb4bc4a8527e9409fa6afd] [POST /admin/2/assignlabel/ 200] parameters:[{"clabelvalue": "enabled", "clabelkey": "elastic-data", "host_id": "2", "csrfmiddlewaretoken": "UQbQ6qKK1jvcAkPYavBAQoJHheGRFZstnEGkD4GW7a7jlVWAFkWma1G780jEBvHY", "host_uuid": "1c95aa14-657e-494d-a617-76e460981339", "labelkey": "customized_label"}] message:[success: Label "elastic-data" was successfully created.]

2019-11-04 15:47:55,759 [INFO] horizon.operation_log: [admin 0b18a42bf702449a83245107e807391f] [admin 801af43854bb4bc4a8527e9409fa6afd] [POST /admin/2/assignlabel/ 200] parameters:[{"clabelvalue": "enabled", "clabelkey": "elastic-controller", "host_id": "2", "csrfmiddlewaretoken": "R8lSVzWJzbExf28mf6nLP6tg1fOsBJyHkWQmsdSVF2gE0DfYKVIx9JqGS1rfxfNc", "host_uuid": "1c95aa14-657e-494d-a617-76e460981339", "labelkey": "customized_label"}] message:[success: Label "elastic-controller" was successfully created.]

2019-11-04 15:48:05,502 [INFO] horizon.operation_log: [admin 0b18a42bf702449a83245107e807391f] [admin 801af43854bb4bc4a8527e9409fa6afd] [POST /admin/2/assignlabel/ 200] parameters:[{"clabelvalue": "enabled", "clabelkey": "elastic-client", "host_id": "2", "csrfmiddlewaretoken": "2yMXXqt9UlscnQdltdvz5DHU9eJRjuVYvmhru4pl0c4j8rkXY2QlpgEk00mEf0at", "host_uuid": "1c95aa14-657e-494d-a617-76e460981339", "labelkey": "customized_label"}] message:[success: Label "elastic-client" was successfully created.]

2019-11-04 15:48:14,962 [INFO] horizon.operation_log: [admin 0b18a42bf702449a83245107e807391f] [admin 801af43854bb4bc4a8527e9409fa6afd] [POST /admin/2/assignlabel/ 200] parameters:[{"clabelvalue": "enabled", "clabelkey": "elastic-master", "host_id": "2", "csrfmiddlewaretoken": "PPIuuORIADvIkgBaWCjljpqp17Jqy377iDdY1sNUGu7P5RIMrrE7D2nPSTmduzmC", "host_uuid": "1c95aa14-657e-494d-a617-76e460981339", "labelkey": "customized_label"}] message:[success: Label "elastic-master" was successfully created.]

[Unlock controller-1]
2019-11-04 15:48:27,365 [INFO] horizon.operation_log: [admin 0b18a42bf702449a83245107e807391f] [admin 801af43854bb4bc4a8527e9409fa6afd] [POST /admin/ 302] parameters:[{"action": "hostscontroller__unlock__2", "csrfmiddlewar...


Stefan Dinescu (stefandinescu) wrote :

I managed to reproduce this issue in a VirtualBox AIO-SX setup, but the error I got is different. It should also be noted that the error in the original LP and the one Wendy encountered differ from each other (and from mine).

It seems that in some cases K8s services start later than usual, and the apply/re-apply logic in sysinv does not check for that.

I received a name resolution error because the coredns pod started after the re-apply of platform-integ-apps was triggered. My theory is that, depending on which pods are or are not running at the moment the platform-integ-apps apply is triggered, we get different error messages.

Also, this issue is very hard to reproduce; on the vbox SX, I am able to reproduce it only once every 10-15 lock/unlock cycles.

Logs from when I reproduced the issue:

sysinv.log
sysinv 2019-11-13 11:38:08.422 98276 INFO sysinv.conductor.kube_app [-] Starting Armada service...
sysinv 2019-11-13 11:38:08.424 98276 INFO sysinv.conductor.kube_app [-] kube_config=/opt/platform/armada/19.09/admin.conf, manifests_dir=/opt/platform/armada/19.09, overrides_dir=/opt/platform/helm/19.09, logs_dir=/var/log/armada.
sysinv 2019-11-13 11:38:08.824 98276 INFO sysinv.conductor.kube_app [-] Armada service started!
sysinv 2019-11-13 11:38:08.825 98276 INFO sysinv.conductor.kube_app [-] Armada apply command = /bin/bash -c 'set -o pipefail; armada apply --enable-chart-cleanup --debug /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-rbd-provisioner.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-ceph-pools-audit.yaml --values /overrides/platform-integ-apps/1.0-8/helm-toolkit-helm-toolkit.yaml --tiller-host tiller-deploy.kube-system.svc.cluster.local | tee /logs/platform-integ-apps-apply_2019-11-13-11-38-08.log'
sysinv 2019-11-13 11:38:09.420 98276 INFO sysinv.conductor.kube_app [-] Starting progress monitoring thread for app platform-integ-apps
sysinv 2019-11-13 11:38:16.833 98276 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml. See /var/log/armada/platform-integ-apps-apply_2019-11-13-11-38-08.log for details.
sysinv 2019-11-13 11:38:16.835 98276 INFO sysinv.conductor.kube_app [-] Exiting progress monitoring thread for app platform-integ-apps
sysinv 2019-11-13 11:38:17.017 98276 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.

Pod states during this time (note coredns marked as completed):
Wed Nov 13 11:38:08 UTC 2019
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-7f985db75c-d5xml 0/1 Error 9 21h
calico-node-zld2m 0/1 CrashLoopBackOff 20 21h
ceph-pools-audit-1573644000-8v4w8 0/1 Completed 0 18m
ceph-pools-audit-1573644300-wc2st 0/1 Completed 0 10m
ceph-pools-audit-1573644600-9lqv7 0/1 Completed 0 8m1s
coredns-6889846b6b-5nmng 0/1 Completed 9 21h
kube-apiserver-co...
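The pod listing above (coredns marked Completed, calico in Error/CrashLoopBackOff while the apply ran) suggests a pre-apply readiness check over the pod states. A rough sketch, assuming the JSON shape returned by `kubectl get pods -o json`; the function name and the skip-Succeeded policy are illustrative, not part of the actual fix.

```python
def not_ready_pods(pod_list):
    """Return names of pods that are not fully ready.

    `pod_list` is a parsed `kubectl get pods -o json` dict. A pod is
    considered ready if its phase is Running and every container
    reports ready; Succeeded pods (completed jobs, e.g. the
    ceph-pools-audit cronjobs above) are skipped.
    """
    bad = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # completed jobs are fine
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            bad.append(name)
    return bad
```

Running such a check before triggering the apply would have flagged the calico and coredns pods, turning the intermittent tiller/name-resolution errors into a deterministic "cluster not ready" result.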

Stefan Dinescu (stefandinescu) wrote :

I am working on a fix for this issue: issuing a retry for platform-integ-apps in case of a failure.

platform-integ-apps is not expected to fail, so retrying a limited number of times should be safe. I am currently testing this approach, but testing takes some time as the issue is not easy to reproduce.
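The fix described wraps the apply in a retry decorator that fires only for a specific exception. A minimal sketch of that pattern; the exception class, retry counts, and delays here are illustrative stand-ins, not the actual sysinv code (see the linked review for the real change).

```python
import functools
import time


class ApplicationApplyFailure(Exception):
    """Hypothetical stand-in for the platform-integ-apps-specific error."""


def retry_on(exc_type, tries=3, delay=1.0):
    """Retry the wrapped function only when `exc_type` is raised.

    Any other exception propagates immediately; after the final
    attempt, the original exception is re-raised.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return fn(*args, **kwargs)
                except exc_type:
                    if attempt == tries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


@retry_on(ApplicationApplyFailure, tries=3, delay=0)
def apply_application(state={"failures": 2}):
    # Simulated apply: fails twice (pods not ready yet), then succeeds.
    if state["failures"] > 0:
        state["failures"] -= 1
        raise ApplicationApplyFailure("tiller could not get release info")
    return "applied"
```

Keying the retry to one exception type keeps the behavior safe: genuinely unexpected failures still surface immediately, while the known transient "pods not ready yet" case is absorbed by a bounded number of retries.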

Fix proposed to branch: master
Review: https://review.opendev.org/696311

Changed in starlingx:
status: Triaged → In Progress

Reviewed: https://review.opendev.org/696311
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=638487f67b7cdcf3c01331cb057d9a273c4ed50e
Submitter: Zuul
Branch: master

commit 638487f67b7cdcf3c01331cb057d9a273c4ed50e
Author: Stefan Dinescu <email address hidden>
Date: Wed Nov 27 16:10:42 2019 +0200

    Retry applying platform-integ-apps

    In rare cases, platform-integ-apps does an automatic apply or
    re-apply before the kubernetes core system pods are ready.

    To fix this issue, we wrap the function that applies applications
    in a retry decorator that retries only when a platform-integ-apps
    specific exception is raised.

    Change-Id: I6b0bf996658079e0c10871254c75045662ad9db4
    Closes-bug: 1850189
    Signed-off-by: Stefan Dinescu <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

Next step is to cherrypick to r/stx.3.0
