platform-integ-apps occasionally fails to apply initially because tiller is unable to get release info

Bug #1850189 reported by Yang Liu on 2019-10-28
This bug affects 3 people

Affects: StarlingX
Importance: Medium
Assigned to: Stefan Dinescu

Bug Description

Brief Description
-----------------
After initial install and config, platform-integ-apps occasionally ends up in apply-failed status because tiller is unable to get release info.

The workaround is to re-apply platform-integ-apps; the re-apply usually passes.

Severity
--------
Minor

Steps to Reproduce
------------------
Install and configure a stx system

Expected Behavior
------------------
- platform-integ-apps automatically applied successfully

Actual Behavior
----------------
- platform-integ-apps occasionally ends up in apply-failed status because tiller is unable to get release info

Reproducibility
---------------
Rare - less than 10%

System Configuration
--------------------
One node system
Lab-name: SM-4

Branch/Pull Time/Commit
-----------------------
2019-10-27_20-00-00

Last Pass
---------
This is very intermittent, so the last pass is unknown. This is the first occurrence in WR sanity in the past 1.5 weeks of runs that I checked.

Timestamp/Logs
--------------
# platform-integ-apps failed before following timestamp
[2019-10-28 07:02:53,338] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| platform-integ-apps | 1.0-9 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+

# Comment from Robert Church:
Looks like tiller couldn't get release information, which is why the apply failed.
{"log":"[main] 2019/10/28 06:56:29 Starting Tiller v2.13.1 (tls=false)\n","stream":"stderr","time":"2019-10-28T06:56:29.952167656Z"}
{"log":"[main] 2019/10/28 06:56:29 GRPC listening on :44134\n","stream":"stderr","time":"2019-10-28T06:56:29.952227906Z"}
{"log":"[main] 2019/10/28 06:56:29 Probes listening on :44135\n","stream":"stderr","time":"2019-10-28T06:56:29.95223832Z"}
{"log":"[main] 2019/10/28 06:56:29 Storage driver is ConfigMap\n","stream":"stderr","time":"2019-10-28T06:56:29.952245037Z"}
{"log":"[main] 2019/10/28 06:56:29 Max history per release is 0\n","stream":"stderr","time":"2019-10-28T06:56:29.952251347Z"}
{"log":"[storage] 2019/10/28 06:59:50 listing all releases with filter\n","stream":"stderr","time":"2019-10-28T06:59:50.862282171Z"}
{"log":"[storage/driver] 2019/10/28 06:59:52 list: failed to list: the server could not find the requested resource (get configmaps)\n","stream":"stderr","time":"2019-10-28T06:59:52.094561718Z"}

Test Activity
-------------
Sanity

Yang Liu (yliu12) wrote :
Ghada Khalil (gkhalil) wrote :

stx.3.0 / medium priority - should be investigated further

description: updated
tags: added: stx.containers
tags: added: stx.3.0
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Bob Church (rchurch)
Bob Church (rchurch) wrote :

I saw this occur again on SM-4. I think the next step here is to come up with a reliable reproduction scenario; a test that runs consecutive lock/unlock cycles should achieve this. What I observed, after an unlock, was that the k8s API was not accessible (from logs in sysinv) within the same timeframe as the tiller error. We might need to determine whether there is intermittent API access on the AIO-SX, since there is only one replica. If intermittent access is not an issue, we may need to test API access prior to running the application-apply, as after a host unlock the apply might be happening before all the pods are recovered.
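Testing API access prior to running the application-apply, as suggested above, amounts to a poll-until-ready loop. This is a minimal sketch, not actual StarlingX code; the probe function, the `/healthz` endpoint URL, and the retry parameters are assumptions for illustration.

```python
import time


def wait_until_ready(probe, attempts=30, delay=2.0):
    """Poll `probe` until it returns True or attempts are exhausted.

    `probe` is any zero-argument callable returning True when the
    dependency (e.g. the k8s API) is reachable, False otherwise.
    Returns True if the dependency became ready, False on timeout.
    """
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


def apiserver_healthy(url="https://localhost:6443/healthz"):
    """Example probe hitting the apiserver health endpoint.

    The URL/port are assumptions; a real check would use the cluster's
    kubeconfig and proper TLS verification.
    """
    import ssl
    import urllib.request
    try:
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        with urllib.request.urlopen(url, timeout=3, context=ctx) as resp:
            return resp.status == 200
    except OSError:
        return False
```

With such a helper, sysinv could gate the apply on `wait_until_ready(apiserver_healthy)` and fail fast with a clear message instead of surfacing an opaque tiller error.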

Frank Miller (sensfan22) wrote :

Based on Bob's analysis, this looks like a timing or dependency issue with k8s recovery on an AIO-SX unlock. Assigning to Stefan to continue the investigation. Suggest keeping Bart informed if this is a k8s recovery/dependency issue so he can help identify the best solution.

Changed in starlingx:
assignee: Bob Church (rchurch) → Stefan Dinescu (stefandinescu)
Wendy Mitchell (wmitchellwr) wrote :

Experienced a case of platform-integ-apps apply failing
2019-11-02_08-39-54
2 controller system
R720 1-2

see logs attached
Exceptions in processing chart rbd-provisioner

starting here:
2019-11-04 15:46:22.982 11 INFO armada.handlers.chart_deploy [-] [chart=kube-system-rbd-provisioner]: Processing Chart, release=stx-rbd-provisioner
...
 2019-11-04 15:46:44.059 11 ERROR armada.handlers.tiller [-] [chart=kube-system-rbd-provisioner]: Error while updating release stx-rbd-provisioner: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
...
2019-11-04 15:47:04.080 11 ERROR armada.handlers.tiller During handling of the above exception, another exception occurred:
...

2019-11-04 15:47:04.080 11 ERROR armada.handlers.tiller During handling of the above exception, another exception occurred:
2019-11-04 15:47:04.080 11 ERROR armada.handlers.tiller
2019-11-04 15:47:04.080 11 ERROR armada.handlers.tiller Traceback (most recent call last):
2019-11-04 15:47:04.080 11 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 546, in get_release_status
2019-11-04 15:47:04.080 11 ERROR armada.handlers.tiller status_request, self.timeout, metadata=self.metadata)
...
2019-11-04 15:47:04.080 11 ERROR armada.handlers.armada During handling of the above exception, another exception occurred:
...

2019-11-04 15:47:04.080 11 ERROR armada.handlers.armada [-] Chart deploy [kube-system-rbd-provisioner] failed: armada.exceptions.tiller_exceptions.GetReleaseStatusException: Failed to get stx-rbd-provisioner status 0 version
...

2019-11-04 15:47:04.088 11 ERROR armada.handlers.armada [-] Chart deploy(s) failed: ['kube-system-rbd-provisioner']
2019-11-04 15:47:04.793 11 INFO armada.handlers.lock [-] Releasing lock
...

Wendy Mitchell (wmitchellwr) wrote :
tags: added: stx.retestneeded
Wendy Mitchell (wmitchellwr) wrote :

Steps (roughly) prior to the problem

[Unlocked controller-1 then swact was initiated here]
2019-11-04 15:25:10,211 [INFO] horizon.operation_log: [admin 0b18a42bf702449a83245107e807391f] [admin 801af43854bb4bc4a8527e9409fa6afd] [POST /admin/ 302] parameters:[{"action": "hostscontroller__swact__1", "csrfmiddlewaretoken": "a7HwE603lUKsk7PEWLbWXQVPf7rwYttoWJPWmDWqLrm1kEsOowBQcze9tVPperiZ"}] message:[success: Swact Initiated Host: controller-0]

[Lock controller-1 then elastic labels added (via horizon) the controller-1 was unlocked - elastic-data, elastic-controller,elastic client, elastic-master]

2019-11-04 15:46:43,736 [INFO] horizon.operation_log: [admin 0b18a42bf702449a83245107e807391f] [admin 801af43854bb4bc4a8527e9409fa6afd] [POST /admin/ 302] parameters:[{"action": "hostscontroller__lock__2", "csrfmiddlewaretoken": "r3wLFmZ4VcOyMt4wTm5sSQDgRGbAJTUzUR1fc0Vg13qFx4b8obqectAGIsOnFp94"}] message:[success: Locking Host: controller-1]

2019-11-04 15:47:45,148 [INFO] horizon.operation_log: [admin 0b18a42bf702449a83245107e807391f] [admin 801af43854bb4bc4a8527e9409fa6afd] [POST /admin/2/assignlabel/ 200] parameters:[{"clabelvalue": "enabled", "clabelkey": "elastic-data", "host_id": "2", "csrfmiddlewaretoken": "UQbQ6qKK1jvcAkPYavBAQoJHheGRFZstnEGkD4GW7a7jlVWAFkWma1G780jEBvHY", "host_uuid": "1c95aa14-657e-494d-a617-76e460981339", "labelkey": "customized_label"}] message:[success: Label "elastic-data" was successfully created.]

2019-11-04 15:47:55,759 [INFO] horizon.operation_log: [admin 0b18a42bf702449a83245107e807391f] [admin 801af43854bb4bc4a8527e9409fa6afd] [POST /admin/2/assignlabel/ 200] parameters:[{"clabelvalue": "enabled", "clabelkey": "elastic-controller", "host_id": "2", "csrfmiddlewaretoken": "R8lSVzWJzbExf28mf6nLP6tg1fOsBJyHkWQmsdSVF2gE0DfYKVIx9JqGS1rfxfNc", "host_uuid": "1c95aa14-657e-494d-a617-76e460981339", "labelkey": "customized_label"}] message:[success: Label "elastic-controller" was successfully created.]

2019-11-04 15:48:05,502 [INFO] horizon.operation_log: [admin 0b18a42bf702449a83245107e807391f] [admin 801af43854bb4bc4a8527e9409fa6afd] [POST /admin/2/assignlabel/ 200] parameters:[{"clabelvalue": "enabled", "clabelkey": "elastic-client", "host_id": "2", "csrfmiddlewaretoken": "2yMXXqt9UlscnQdltdvz5DHU9eJRjuVYvmhru4pl0c4j8rkXY2QlpgEk00mEf0at", "host_uuid": "1c95aa14-657e-494d-a617-76e460981339", "labelkey": "customized_label"}] message:[success: Label "elastic-client" was successfully created.]

2019-11-04 15:48:14,962 [INFO] horizon.operation_log: [admin 0b18a42bf702449a83245107e807391f] [admin 801af43854bb4bc4a8527e9409fa6afd] [POST /admin/2/assignlabel/ 200] parameters:[{"clabelvalue": "enabled", "clabelkey": "elastic-master", "host_id": "2", "csrfmiddlewaretoken": "PPIuuORIADvIkgBaWCjljpqp17Jqy377iDdY1sNUGu7P5RIMrrE7D2nPSTmduzmC", "host_uuid": "1c95aa14-657e-494d-a617-76e460981339", "labelkey": "customized_label"}] message:[success: Label "elastic-master" was successfully created.]

[Unlock controller-1]
2019-11-04 15:48:27,365 [INFO] horizon.operation_log: [admin 0b18a42bf702449a83245107e807391f] [admin 801af43854bb4bc4a8527e9409fa6afd] [POST /admin/ 302] parameters:[{"action": "hostscontroller__unlock__2", "csrfmiddlewar...


Stefan Dinescu (stefandinescu) wrote :

I managed to reproduce this issue in a VirtualBox AIO-SX setup, but the error I got is different. It should also be noted that the error in the original LP and the one Wendy encountered differ from each other (and from mine).

It seems that in some cases K8s services start later than usual, and the apply/re-apply logic in sysinv does not check for that.

I received a name resolution error because the coredns pod started after the re-apply of platform-integ-apps was triggered. My theory is that, depending on which pods are or are not running at the moment the platform-integ-apps apply is triggered, we get different error messages.

Also, this issue is very hard to reproduce; on the vbox SX, I am able to reproduce it only once every 10-15 lock/unlock cycles.

Logs from when I reproduced the issue:

sysinv.log
sysinv 2019-11-13 11:38:08.422 98276 INFO sysinv.conductor.kube_app [-] Starting Armada service...
sysinv 2019-11-13 11:38:08.424 98276 INFO sysinv.conductor.kube_app [-] kube_config=/opt/platform/armada/19.09/admin.conf, manifests_dir=/opt/platform/armada/19.09, overrides_dir=/opt/platform/helm/19.09, logs_dir=/var/log/armada.
sysinv 2019-11-13 11:38:08.824 98276 INFO sysinv.conductor.kube_app [-] Armada service started!
sysinv 2019-11-13 11:38:08.825 98276 INFO sysinv.conductor.kube_app [-] Armada apply command = /bin/bash -c 'set -o pipefail; armada apply --enable-chart-cleanup --debug /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-rbd-provisioner.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-ceph-pools-audit.yaml --values /overrides/platform-integ-apps/1.0-8/helm-toolkit-helm-toolkit.yaml --tiller-host tiller-deploy.kube-system.svc.cluster.local | tee /logs/platform-integ-apps-apply_2019-11-13-11-38-08.log'
sysinv 2019-11-13 11:38:09.420 98276 INFO sysinv.conductor.kube_app [-] Starting progress monitoring thread for app platform-integ-apps
sysinv 2019-11-13 11:38:16.833 98276 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml. See /var/log/armada/platform-integ-apps-apply_2019-11-13-11-38-08.log for details.
sysinv 2019-11-13 11:38:16.835 98276 INFO sysinv.conductor.kube_app [-] Exiting progress monitoring thread for app platform-integ-apps
sysinv 2019-11-13 11:38:17.017 98276 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.

Pod states during this time (note coredns marked as completed):
Wed Nov 13 11:38:08 UTC 2019
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-7f985db75c-d5xml 0/1 Error 9 21h
calico-node-zld2m 0/1 CrashLoopBackOff 20 21h
ceph-pools-audit-1573644000-8v4w8 0/1 Completed 0 18m
ceph-pools-audit-1573644300-wc2st 0/1 Completed 0 10m
ceph-pools-audit-1573644600-9lqv7 0/1 Completed 0 8m1s
coredns-6889846b6b-5nmng 0/1 Completed 9 21h
kube-apiserver-co...
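The pod listing above (coredns marked Completed, calico in Error/CrashLoopBackOff while the apply ran) suggests a pre-apply readiness check over the pod states. A rough sketch, assuming the JSON shape returned by `kubectl get pods -o json`; the function name and the skip-Succeeded policy are illustrative, not part of the actual fix.

```python
def not_ready_pods(pod_list):
    """Return names of pods that are not fully ready.

    `pod_list` is a parsed `kubectl get pods -o json` dict. A pod is
    considered ready if its phase is Running and every container
    reports ready; Succeeded pods (completed jobs, e.g. the
    ceph-pools-audit cronjobs above) are skipped.
    """
    bad = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # completed jobs are fine
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            bad.append(name)
    return bad
```

Running such a check before triggering the apply would have flagged the calico and coredns pods, turning the intermittent tiller/name-resolution errors into a deterministic "cluster not ready" result.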

Stefan Dinescu (stefandinescu) wrote :

I am working on a fix for this issue: issuing a retry for platform-integ-apps in case of a failure.

platform-integ-apps is not expected to fail, so retrying a limited number of times should be safe. I am currently testing this approach, but testing takes some time as the issue is not easy to reproduce.
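The fix described wraps the apply in a retry decorator that fires only for a specific exception. A minimal sketch of that pattern; the exception class, retry counts, and delays here are illustrative stand-ins, not the actual sysinv code (see the linked review for the real change).

```python
import functools
import time


class ApplicationApplyFailure(Exception):
    """Hypothetical stand-in for the platform-integ-apps-specific error."""


def retry_on(exc_type, tries=3, delay=1.0):
    """Retry the wrapped function only when `exc_type` is raised.

    Any other exception propagates immediately; after the final
    attempt, the original exception is re-raised.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return fn(*args, **kwargs)
                except exc_type:
                    if attempt == tries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


@retry_on(ApplicationApplyFailure, tries=3, delay=0)
def apply_application(state={"failures": 2}):
    # Simulated apply: fails twice (pods not ready yet), then succeeds.
    if state["failures"] > 0:
        state["failures"] -= 1
        raise ApplicationApplyFailure("tiller could not get release info")
    return "applied"
```

Keying the retry to one exception type keeps the behavior safe: genuinely unexpected failures still surface immediately, while the known transient "pods not ready yet" case is absorbed by a bounded number of retries.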

Fix proposed to branch: master
Review: https://review.opendev.org/696311

Changed in starlingx:
status: Triaged → In Progress

Reviewed: https://review.opendev.org/696311
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=638487f67b7cdcf3c01331cb057d9a273c4ed50e
Submitter: Zuul
Branch: master

commit 638487f67b7cdcf3c01331cb057d9a273c4ed50e
Author: Stefan Dinescu <email address hidden>
Date: Wed Nov 27 16:10:42 2019 +0200

    Retry applying platform-integ-apps

    In rare cases, platform-integ-apps does an automatic apply or
    re-apply before the kubernetes core system pods are ready.

    To fix this issue, we wrap the function that applies applications
    in a retry decorator that retries only when a platform-integ-apps
    specific exception is raised.

    Change-Id: I6b0bf996658079e0c10871254c75045662ad9db4
    Closes-bug: 1850189
    Signed-off-by: Stefan Dinescu <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

Next step is to cherrypick to r/stx.3.0
