DC system controllers: some pods in "unknown" state after RR patching

Bug #1874858 reported by Nimalini Rasa
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Teresa Ho

Bug Description

Brief Description
-----------------
Some pods are stuck in "unknown" state after a reboot-required (RR) patch is applied

Severity
--------
Major

Steps to Reproduce
------------------
1) Apply a reboot-required (RR) test patch

....
TC-name: Patching system controllers

Expected Behavior
------------------
All pods in running state

Actual Behavior
----------------
Some of the pods got stuck in "unknown" state

Reproducibility
---------------
Seen once.

System Configuration
--------------------
DC system controllers (Duplex) - IPv6

Branch/Pull Time/Commit
-----------------------
2020-04-21

Last Pass
---------
N/A

Timestamp/Logs
--------------
2020-04-24T15:06:31.628 (controller-1 recovered)

Test Activity
-------------
SystemTest

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Please include a list of the pods that are having issues.
Can you clarify the frequency? Was the TC tried multiple times but only failed once, or was it tried only once and failed that one time?

Changed in starlingx:
status: New → Incomplete
assignee: nobody → Nimalini Rasa (nrasa)
Revision history for this message
Nimalini Rasa (nrasa) wrote :

The test case was tried once and it failed.
The following pods were in unknown state:

coredns-78d9fd7cb9-tf76j
stx-oidc-client-6c8cfc5f65-9kllc

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Please re-run the TC again and let us know if it's reproducible. Please include the output of
kubectl get pods --all-namespaces
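or, filtered to surface only the pods not in Running state:
kubectl get pods --all-namespaces | grep -v Running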

tags: added: stx.distcloud stx.update
Revision history for this message
Nimalini Rasa (nrasa) wrote :

Saw the issue with stx-monitor pods after the RR patch was removed; seen in a few subclouds:
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cm-cert-manager-7b8b94bf9f-khwbt 1/1 Running 1 6h11m
cert-manager cm-cert-manager-cainjector-56b68989b5-fp75x 1/1 Running 3 6h11m
cert-manager cm-cert-manager-webhook-7d5c897795-pqflz 1/1 Running 1 6h11m
default small-7497946d9-2k4wd 1/1 Running 2 2d3h
default small-7497946d9-58qmd 1/1 Running 2 2d3h
default small-7497946d9-5jvqx 1/1 Running 2 2d3h
default small-7497946d9-5p7ng 1/1 Running 2 2d3h
default small-7497946d9-7cpzb 1/1 Running 2 2d3h
default small-7497946d9-88hwf 1/1 Running 2 2d3h
default small-7497946d9-8p75z 1/1 Running 2 2d3h
default small-7497946d9-9sv8m 1/1 Running 2 2d3h
default small-7497946d9-b7h8g 1/1 Running 2 2d3h
default small-7497946d9-bxpdn 1/1 Running 2 2d3h
default small-7497946d9-dljff 1/1 Running 2 2d3h
default small-7497946d9-fvmf2 1/1 Running 2 2d3h
default small-7497946d9-fxq9s 1/1 Running 2 2d3h
default small-7497946d9-h7rj8 1/1 Running 2 2d3h
default small-7497946d9-jgxcm 1/1 Running 2 2d3h
default small-7497946d9-k8ntz 1/1 Running 2 2d3h
default small-7497946d9-kk2jh 1/1 Running 2 2d3h
default small-7497946d9-mtt4m 1/1 Running 2 2d3h
default small-7497946d9-p42l7 1/1 Running 2 2d3h
default small-7497946d9-pjq57 1/1 Running 2 2d3h
default small-7497946d9-pp5dm 1/1 Running 2 2d3h
default small-7497946d9-q4vrs 1/1 Running 2 2d3h
default small-7497946d9-rnt24 ...

Revision history for this message
Nimalini Rasa (nrasa) wrote :

On the next iteration of the test case (apply RR patch), the stx-oidc-client pod in one subcloud is in "unknown" state.
kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cm-cert-manager-7b8b94bf9f-qjc7x 1/1 Running 2 7h10m
cert-manager cm-cert-manager-cainjector-56b68989b5-qqq2p 1/1 Running 6 7h10m
cert-manager cm-cert-manager-webhook-7d5c897795-kpzgx 1/1 Running 2 7h10m
default small-7497946d9-24rlb 1/1 Running 3 2d4h
default small-7497946d9-2cn2r 1/1 Running 3 2d4h
default small-7497946d9-2jvfc 1/1 Running 3 2d4h
default small-7497946d9-492nk 1/1 Running 3 2d4h
default small-7497946d9-4jvtv 1/1 Running 3 2d4h
default small-7497946d9-4lzbc 1/1 Running 3 2d4h
default small-7497946d9-4nc99 1/1 Running 3 2d4h
default small-7497946d9-5mcvf 1/1 Running 3 2d4h
default small-7497946d9-64xqv 1/1 Running 3 2d4h
default small-7497946d9-6tmb7 1/1 Running 3 2d4h
default small-7497946d9-84khl 1/1 Running 3 2d4h
default small-7497946d9-895qk 1/1 Running 3 2d4h
default small-7497946d9-8ks9m 1/1 Running 3 2d4h
default small-7497946d9-922xz 1/1 Running 3 2d4h
default small-7497946d9-b58bl 1/1 Running 3 2d4h
default small-7497946d9-bhw82 1/1 Running 3 2d4h
default small-7497946d9-cdrnr 1/1 Running 3 2d4h
default small-7497946d9-cr7br 1/1 Running 3 2d4h
default small-7497946d9-jwgvl 1/1 Running 3 2d4h
default small-7497946d9-l745j 1/1 Running 3 2d4h
default small-7497946d9-lfhvs 1/1 Running 3 2d4h
default small-7497946d9-m6hvj 1/1 Running 3 2d4h
default small-7497946d9-mjdrj...

Revision history for this message
Nimalini Rasa (nrasa) wrote :

Logs for subcloud9 can be found here:
https://files.starlingx.kube.cengn.ca/download_file/130

Revision history for this message
Frank Miller (sensfan22) wrote :

Additional info:
- Nimalini has seen this issue more than 50% of the time; it occurs after a host reboot.
- Most of the time one or both of these pods is in unknown state: stx-oidc-client-6c8cfc5f65-brcfk, mon-logstash-0

These events may indicate the reason for the pod failures; they come from the kubectl describe pods -n <namespace> output:
  Warning FailedMount 5m31s (x16 over 25m) kubelet, controller-0 MountVolume.SetUp failed for volume "config" : stat /var/lib/kubelet/pods/f9202e49-2a61-4322-8d44-8b35669aab27/volumes/kubernetes.io~configmap/config: no such file or directory
  Warning FailedMount 14m (x12 over 30m) kubelet, controller-0 MountVolume.SetUp failed for volume "logstashsetup" : stat /var/lib/kubelet/pods/1322abfc-ca78-4fac-9c91-a8483c527f12/volumes/kubernetes.io~configmap/logstashsetup: no such file or directory
  Warning FailedMount 3m53s (x18 over 30m) kubelet, controller-0 MountVolume.SetUp failed for volume "logstashpipeline" : stat /var/lib/kubelet/pods/1322abfc-ca78-4fac-9c91-a8483c527f12/volumes/kubernetes.io~configmap/logstashpipeline: no such file or directory

Ghada Khalil (gkhalil)
Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → Medium
assignee: Nimalini Rasa (nrasa) → Bob Church (rchurch)
Ghada Khalil (gkhalil)
tags: added: stx.4.0
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Bob Church (rchurch) → Bart Wensley (bartwensley)
Revision history for this message
Bart Wensley (bartwensley) wrote :

Reproduction:
- The issue (pods stuck with Unknown status after AIO-SX reboot) is relatively easy to reproduce on our DC-4 lab (AIO-DX System Controller with 10 AIO-SX subclouds).
- The best environment to reproduce includes:
  - Following apps applied: cert-manager, nginx-ingress-controller, oidc-auth-apps, stx-monitor.
  - 30 PV ‘small’ test pods running.
- Doing a lock/unlock of an AIO-SX subcloud reproduces the issue about 1 out of 4 reboots.

Potential cause:
- On startup, the kubelet tries to reattach all the volumes to the pods. For some reason it sometimes fails to do this, and the pods are then stuck in the Unknown state. The kubelet logs show a “MountVolume.SetUp failed for volume” error when this happens.
- There are several k8s bug reports that reference the “MountVolume.SetUp failed for volume” error shown above. The most promising one is: https://github.com/kubernetes/kubernetes/issues/68211
- The pods that usually experience the failure are stx-oidc-client and some of the monitor pods (e.g. mon-metricbeat-metrics and mon-logstash). These pods use the “subPath” volume configuration referenced in the above issue (see the sketch after this list).
- Another pointer to the above issue: when a pod is stuck in the Unknown state, listing the contents of the pod's volumes under /var/lib/kubelet/<uuid>/volumes shows the contents of all the volumes except the volume using the subPath.
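
For reference, a minimal sketch of the subPath pattern implicated in the upstream issue (pod, image and configMap names here are hypothetical):

  # A single configMap key is mounted as a file via subPath. On kubelet
  # restart, the per-pod directory under
  # /var/lib/kubelet/pods/<uid>/volumes/kubernetes.io~configmap/ can be
  # missing, and MountVolume.SetUp fails with "no such file or directory".
  apiVersion: v1
  kind: Pod
  metadata:
    name: subpath-example
  spec:
    containers:
    - name: app
      image: registry.local/app:latest
      volumeMounts:
      - name: config
        mountPath: /etc/app/app.yaml
        subPath: app.yaml
    volumes:
    - name: config
      configMap:
        name: app-config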

Yesterday I tried to update the deployment for the stx-oidc-client pods to use one of the workarounds suggested in the bug report, both of which avoid the subPath volume configuration:
- Use “items” in the volume config instead of subPath (sketched below).
- Use projected volumes instead of subPath.
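
A rough illustration of the “items” workaround, using the same hypothetical names as the sketch above. The key difference is that the configMap is mounted as a whole directory, so no subPath is involved:

  apiVersion: v1
  kind: Pod
  metadata:
    name: items-example
  spec:
    containers:
    - name: app
      image: registry.local/app:latest
      volumeMounts:
      - name: config
        mountPath: /etc/app        # whole-directory mount, no subPath
    volumes:
    - name: config
      configMap:
        name: app-config
        items:
        - key: app.yaml
          path: app.yaml           # file appears as /etc/app/app.yaml

Note that a whole-directory mount shadows anything the image ships at that path, which may be part of why the updated deployment failed without an image change (see below).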

After several hours of effort I was not able to get the updated deployment to work. The pod would always fail to run, with errors like this:
Error: failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"./stx-oidc-client\": stat ./stx-oidc-client: no such file or directory": unknown

I suspect an actual change to the image might be required.

My thinking is that if we remove the subPath config from stx-oidc-client and verify that the issue no longer happens with that pod, we'd at least be able to confirm the cause. We would then want to remove the subPath config from the stx-monitor pods (that could take some effort). Even then, we may still have cases where user pods use subPath (assuming that is the cause), so we may also need to update the workaround we have in sysinv that kills pods in the NodeAffinity state, so that it also kills any pods stuck with Unknown status after a certain amount of time has expired after the reboot. That workaround is not ideal, though, because we'd need to wait a significant amount of time before killing them - some pods pass through the Unknown status as they come up. That is why it makes sense to avoid using subPath, so we can avoid the workaround as much as possible.
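
For reference, the standard manual recovery for a pod stuck in Unknown is a forced delete, after which its controller reschedules it (pod name and namespace are placeholders):

  kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0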

tags: removed: stx.distcloud
Revision history for this message
Bart Wensley (bartwensley) wrote :

Since the analysis indicates this issue has nothing to do with distributed cloud, I have removed the stx.distcloud tag.

Frank Miller (sensfan22)
tags: added: stx.containers
Revision history for this message
Frank Miller (sensfan22) wrote :

Re-assigning to Paul to work through the next step, which is to prove Bart's theory that pods that use a subPath config are the ones that see this issue.

Changed in starlingx:
assignee: Bart Wensley (bartwensley) → Paul-Ionut Vaduva (pvaduva)
Revision history for this message
Paul-Ionut Vaduva (pvaduva) wrote :

I stressed the lab the same way as the test team who discovered this, with multiple small pods, except that my pods also use the subPath option when mounting configMaps, just like in the bug description for the kubernetes issue. I found that the bug tends to reproduce with higher frequency on some pods, like the ones from the stx-monitor and oidc-auth-apps applications. I can't seem to reproduce it on pods I manually created to stress the system, even though they also use configMap-mounted volumes with the subPath option. That said, even though I can't manually create pods that reproduce this issue, I still think we are facing the aforementioned kubernetes issue; it is just difficult to reproduce manually.

Revision history for this message
Frank Miller (sensfan22) wrote :

The next step for this LP is to change the stx-oidc-client pod to remove its use of subPath and then confirm this prevents that pod from coming up in unknown state. If this fixes the issue for the stx-oidc-client pod, then a similar solution will be required from a stx-monitor developer.

Assigning to Teresa to remove subpath from stx-oidc-client.

Changed in starlingx:
assignee: Paul-Ionut Vaduva (pvaduva) → Teresa Ho (teresaho)
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oidc-auth-armada-app (master)

Fix proposed to branch: master
Review: https://review.opendev.org/735647

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oidc-auth-armada-app (master)

Reviewed: https://review.opendev.org/735647
Committed: https://git.openstack.org/cgit/starlingx/oidc-auth-armada-app/commit/?id=04b525ea2a0a8795c4d556676ad565acd67831cf
Submitter: Zuul
Branch: master

commit 04b525ea2a0a8795c4d556676ad565acd67831cf
Author: Teresa Ho <email address hidden>
Date: Mon Jun 15 08:02:11 2020 -0400

    Remove subPath in volume mount for oidc-client

    The use of subPath to configure volume mount path causes failure in
    stx-oidc-client pod restarts.
    This update is to remove the use of subPath.

    Ran security regression tests for oidc-auth-apps

    Partial-Bug: 1874858

    Change-Id: Iba97f6b1da8c7a7cc280662bf4fd9b2f80b1d33f
    Signed-off-by: Teresa Ho <email address hidden>

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The fix submitted applies to the stx-oidc-client pod.
No fix is planned for the stx-monitor mon-logstash pod as stx-monitor is no longer actively maintained in stx master.

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Nimalini Rasa (nrasa) wrote :

Verified with load built on 2020-06-27_00-41-42.

Revision history for this message
Bart Wensley (bartwensley) wrote :

Note that the suspected upstream issue (https://github.com/kubernetes/kubernetes/issues/68211) now has a fix submitted for it:
https://github.com/kubernetes/kubernetes/pull/89629

I'm not sure which future kubernetes version will include this fix.

Revision history for this message
Chris Friesen (cbf123) wrote :

I've seen application pods in "Unknown" state after RR patching on AIO-SX as well.

I tried bringing in the upstream fix from https://github.com/kubernetes/kubernetes/pull/89629 but it doesn't seem to have fixed the problem. I'm still seeing pods in the "Unknown" state.

This fix may be necessary (yet to be determined) but it is not sufficient by itself to fix the issue.
