DC system controllers: some pods in "unknown" state after RR patching

Bug #1874858 reported by Nimalini Rasa
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Teresa Ho

Bug Description

Brief Description
-----------------
Some pods are stuck in "unknown" state after a reboot-required (RR) patch is applied

Severity
--------
Major

Steps to Reproduce
------------------
1) Apply a reboot-required (RR) test patch

....
TC-name: Patching system controllers

Expected Behavior
------------------
All pods in running state

Actual Behavior
----------------
Some of the pods got stuck in "unknown" state

Reproducibility
---------------
Seen once.

System Configuration
--------------------
DC system controllers (Duplex) - IPv6

Branch/Pull Time/Commit
-----------------------
2020-04-21

Last Pass
---------
N/A

Timestamp/Logs
--------------
2020-04-24T15:06:31.628 (controller-1 recovered)

Test Activity
-------------
SystemTest

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Please include a list of the pods that are having issues.
Can you clarify the frequency? Was the TC tried multiple times but only failed once, or was it tried only once and failed that one time?

Changed in starlingx:
status: New → Incomplete
assignee: nobody → Nimalini Rasa (nrasa)
Revision history for this message
Nimalini Rasa (nrasa) wrote :

The test case was tried once and it failed.
The following pods were in unknown state:

coredns-78d9fd7cb9-tf76j
stx-oidc-client-6c8cfc5f65-9kllc

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Please re-run the TC again and let us know if it's reproducible. Please include the output of
kubectl get pods --all-namespaces
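or, filtered to surface only the pods not in Running state:
kubectl get pods --all-namespaces | grep -v Running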

tags: added: stx.distcloud stx.update
Revision history for this message
Nimalini Rasa (nrasa) wrote :

Saw the issue with stx-monitor pods after the RR patch was removed; seen in a few subclouds:
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cm-cert-manager-7b8b94bf9f-khwbt 1/1 Running 1 6h11m
cert-manager cm-cert-manager-cainjector-56b68989b5-fp75x 1/1 Running 3 6h11m
cert-manager cm-cert-manager-webhook-7d5c897795-pqflz 1/1 Running 1 6h11m
default small-7497946d9-2k4wd 1/1 Running 2 2d3h
default small-7497946d9-58qmd 1/1 Running 2 2d3h
default small-7497946d9-5jvqx 1/1 Running 2 2d3h
default small-7497946d9-5p7ng 1/1 Running 2 2d3h
default small-7497946d9-7cpzb 1/1 Running 2 2d3h
default small-7497946d9-88hwf 1/1 Running 2 2d3h
default small-7497946d9-8p75z 1/1 Running 2 2d3h
default small-7497946d9-9sv8m 1/1 Running 2 2d3h
default small-7497946d9-b7h8g 1/1 Running 2 2d3h
default small-7497946d9-bxpdn 1/1 Running 2 2d3h
default small-7497946d9-dljff 1/1 Running 2 2d3h
default small-7497946d9-fvmf2 1/1 Running 2 2d3h
default small-7497946d9-fxq9s 1/1 Running 2 2d3h
default small-7497946d9-h7rj8 1/1 Running 2 2d3h
default small-7497946d9-jgxcm 1/1 Running 2 2d3h
default small-7497946d9-k8ntz 1/1 Running 2 2d3h
default small-7497946d9-kk2jh 1/1 Running 2 2d3h
default small-7497946d9-mtt4m 1/1 Running 2 2d3h
default small-7497946d9-p42l7 1/1 Running 2 2d3h
default small-7497946d9-pjq57 1/1 Running 2 2d3h
default small-7497946d9-pp5dm 1/1 Running 2 2d3h
default small-7497946d9-q4vrs 1/1 Running 2 2d3h
default small-7497946d9-rnt24 ...

Revision history for this message
Nimalini Rasa (nrasa) wrote :

On the next iteration of the test case (apply RR patch), the stx-oidc-client pod in one subcloud is in "unknown" state.
kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cm-cert-manager-7b8b94bf9f-qjc7x 1/1 Running 2 7h10m
cert-manager cm-cert-manager-cainjector-56b68989b5-qqq2p 1/1 Running 6 7h10m
cert-manager cm-cert-manager-webhook-7d5c897795-kpzgx 1/1 Running 2 7h10m
default small-7497946d9-24rlb 1/1 Running 3 2d4h
default small-7497946d9-2cn2r 1/1 Running 3 2d4h
default small-7497946d9-2jvfc 1/1 Running 3 2d4h
default small-7497946d9-492nk 1/1 Running 3 2d4h
default small-7497946d9-4jvtv 1/1 Running 3 2d4h
default small-7497946d9-4lzbc 1/1 Running 3 2d4h
default small-7497946d9-4nc99 1/1 Running 3 2d4h
default small-7497946d9-5mcvf 1/1 Running 3 2d4h
default small-7497946d9-64xqv 1/1 Running 3 2d4h
default small-7497946d9-6tmb7 1/1 Running 3 2d4h
default small-7497946d9-84khl 1/1 Running 3 2d4h
default small-7497946d9-895qk 1/1 Running 3 2d4h
default small-7497946d9-8ks9m 1/1 Running 3 2d4h
default small-7497946d9-922xz 1/1 Running 3 2d4h
default small-7497946d9-b58bl 1/1 Running 3 2d4h
default small-7497946d9-bhw82 1/1 Running 3 2d4h
default small-7497946d9-cdrnr 1/1 Running 3 2d4h
default small-7497946d9-cr7br 1/1 Running 3 2d4h
default small-7497946d9-jwgvl 1/1 Running 3 2d4h
default small-7497946d9-l745j 1/1 Running 3 2d4h
default small-7497946d9-lfhvs 1/1 Running 3 2d4h
default small-7497946d9-m6hvj 1/1 Running 3 2d4h
default small-7497946d9-mjdrj...

Revision history for this message
Nimalini Rasa (nrasa) wrote :

Logs for subcloud9 can be found here:
https://files.starlingx.kube.cengn.ca/download_file/130

Revision history for this message
Frank Miller (sensfan22) wrote :

Additional info:
- Nimalini has seen this issue more than 50% of the time; it occurs after a host reboot.
- Most of the time one or both of these pods is in unknown state: stx-oidc-client-6c8cfc5f65-brcfk, mon-logstash-0

These events may indicate the reason for the pod failures; they come from the kubectl describe pods -n <namespace> output:
  Warning FailedMount 5m31s (x16 over 25m) kubelet, controller-0 MountVolume.SetUp failed for volume "config" : stat /var/lib/kubelet/pods/f9202e49-2a61-4322-8d44-8b35669aab27/volumes/kubernetes.io~configmap/config: no such file or directory
  Warning FailedMount 14m (x12 over 30m) kubelet, controller-0 MountVolume.SetUp failed for volume "logstashsetup" : stat /var/lib/kubelet/pods/1322abfc-ca78-4fac-9c91-a8483c527f12/volumes/kubernetes.io~configmap/logstashsetup: no such file or directory
  Warning FailedMount 3m53s (x18 over 30m) kubelet, controller-0 MountVolume.SetUp failed for volume "logstashpipeline" : stat /var/lib/kubelet/pods/1322abfc-ca78-4fac-9c91-a8483c527f12/volumes/kubernetes.io~configmap/logstashpipeline: no such file or directory

Ghada Khalil (gkhalil)
Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → Medium
assignee: Nimalini Rasa (nrasa) → Bob Church (rchurch)
Ghada Khalil (gkhalil)
tags: added: stx.4.0
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Bob Church (rchurch) → Bart Wensley (bartwensley)
Revision history for this message
Bart Wensley (bartwensley) wrote :

Reproduction:
- The issue (pods stuck with Unknown status after AIO-SX reboot) is relatively easy to reproduce on our DC-4 lab (AIO-DX System Controller with 10 AIO-SX subclouds).
- The best environment to reproduce includes:
  - Following apps applied: cert-manager, nginx-ingress-controller, oidc-auth-apps, stx-monitor.
  - 30 PV ‘small’ test pods running.
- Doing a lock/unlock of an AIO-SX subcloud reproduces the issue about 1 out of 4 reboots.

Potential cause:
- On startup, the kubelet tries to reattach all the volumes to the pods. For some reason it sometimes fails to do this, and the pods are then stuck in the Unknown state. The kubelet logs show a “MountVolume.SetUp failed for volume” error when this happens.
- There are several k8s bug reports that reference the “MountVolume.SetUp failed for volume” error shown above. The most promising one is: https://github.com/kubernetes/kubernetes/issues/68211
- The pods that usually experience the failure are stx-oidc-client and some of the monitor pods (e.g. mon-metricbeat-metrics and mon-logstash). These pods use the “subPath” volume configuration referenced in the above issue (see the sketch after this list).
- Another pointer to the above issue: when a pod is stuck in the Unknown state, listing the contents of the pod's volumes under /var/lib/kubelet/<uuid>/volumes shows the contents of all the volumes except the volume using the subPath.
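
For reference, a minimal sketch of the subPath pattern implicated in the upstream issue (pod, image and configMap names here are hypothetical):

  # A single configMap key is mounted as a file via subPath. On kubelet
  # restart, the per-pod directory under
  # /var/lib/kubelet/pods/<uid>/volumes/kubernetes.io~configmap/ can be
  # missing, and MountVolume.SetUp fails with "no such file or directory".
  apiVersion: v1
  kind: Pod
  metadata:
    name: subpath-example
  spec:
    containers:
    - name: app
      image: registry.local/app:latest
      volumeMounts:
      - name: config
        mountPath: /etc/app/app.yaml
        subPath: app.yaml
    volumes:
    - name: config
      configMap:
        name: app-config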

Yesterday I tried to update the deployment for the stx-oidc-client pods to use one of the workarounds suggested in the bug report, both of which avoid the subPath volume configuration:
- Use “items” in the volume config instead of subPath (sketched below).
- Use projected volumes instead of subPath.
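
A rough illustration of the “items” workaround, using the same hypothetical names as the sketch above. The key difference is that the configMap is mounted as a whole directory, so no subPath is involved:

  apiVersion: v1
  kind: Pod
  metadata:
    name: items-example
  spec:
    containers:
    - name: app
      image: registry.local/app:latest
      volumeMounts:
      - name: config
        mountPath: /etc/app        # whole-directory mount, no subPath
    volumes:
    - name: config
      configMap:
        name: app-config
        items:
        - key: app.yaml
          path: app.yaml           # file appears as /etc/app/app.yaml

Note that a whole-directory mount shadows anything the image ships at that path, which may be part of why the updated deployment failed without an image change (see below).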

After several hours of effort I was not able to get the updated deployment to work. The pod would always fail to run, with errors like this:
Error: failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"./stx-oidc-client\": stat ./stx-oidc-client: no such file or directory": unknown

I suspect an actual change to the image might be required.

My thinking is that if we remove the subPath config from stx-oidc-client and verify that the issue no longer happens with that pod, we'd at least be able to confirm the cause. We would then want to remove the subPath config from the stx-monitor pods (that could take some effort). Even then, we may still have cases where user pods use subPath (assuming that is the cause), so we may also need to update the workaround we have in sysinv that kills pods in the NodeAffinity state, so that it also kills any pods stuck with Unknown status after a certain amount of time has expired after the reboot. That workaround is not ideal, though, because we'd need to wait a significant amount of time before killing them - some pods pass through the Unknown status as they come up. That is why it makes sense to avoid using subPath, so we can avoid the workaround as much as possible.
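
For reference, the standard manual recovery for a pod stuck in Unknown is a forced delete, after which its controller reschedules it (pod name and namespace are placeholders):

  kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0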

tags: removed: stx.distcloud
Revision history for this message
Bart Wensley (bartwensley) wrote :

Since the analysis indicates this issue has nothing to do with distributed cloud, I have removed the stx.distcloud tag.

Frank Miller (sensfan22)
tags: added: stx.containers
Revision history for this message
Frank Miller (sensfan22) wrote :

Re-assigning to Paul to work through the next step, which is to prove Bart's theory that pods that use a subPath config are the ones that see this issue.

Changed in starlingx:
assignee: Bart Wensley (bartwensley) → Paul-Ionut Vaduva (pvaduva)
Revision history for this message
Paul-Ionut Vaduva (pvaduva) wrote :

I stressed the lab the same way as the test team who discovered this, with multiple small pods, except that my pods also use the subPath option when mounting configMaps, just like in the bug description for the kubernetes issue. I found that the bug tends to reproduce with higher frequency on some pods, like the ones from the stx-monitor and oidc-auth-apps applications. I can't seem to reproduce it on pods I manually created to stress the system, even though they also use configMap-mounted volumes with the subPath option. That said, even though I can't manually create pods that reproduce this issue, I still think we are facing the aforementioned kubernetes issue; it is just difficult to reproduce manually.

Revision history for this message
Frank Miller (sensfan22) wrote :

The next step for this LP is to change the stx-oidc-client pod to remove its use of subPath and then confirm this prevents that pod from coming up in unknown state. If this fixes the issue for the stx-oidc-client pod, then a similar solution will be required from a stx-monitor developer.

Assigning to Teresa to remove subpath from stx-oidc-client.

Changed in starlingx:
assignee: Paul-Ionut Vaduva (pvaduva) → Teresa Ho (teresaho)
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oidc-auth-armada-app (master)

Fix proposed to branch: master
Review: https://review.opendev.org/735647

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oidc-auth-armada-app (master)

Reviewed: https://review.opendev.org/735647
Committed: https://git.openstack.org/cgit/starlingx/oidc-auth-armada-app/commit/?id=04b525ea2a0a8795c4d556676ad565acd67831cf
Submitter: Zuul
Branch: master

commit 04b525ea2a0a8795c4d556676ad565acd67831cf
Author: Teresa Ho <email address hidden>
Date: Mon Jun 15 08:02:11 2020 -0400

    Remove subPath in volume mount for oidc-client

    The use of subPath to configure volume mount path causes failure in
    stx-oidc-client pod restarts.
    This update is to remove the use of subPath.

    Ran security regression tests for oidc-auth-apps

    Partial-Bug: 1874858

    Change-Id: Iba97f6b1da8c7a7cc280662bf4fd9b2f80b1d33f
    Signed-off-by: Teresa Ho <email address hidden>

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The fix submitted applies to the stx-oidc-client pod.
No fix is planned for the stx-monitor mon-logstash pod as stx-monitor is no longer actively maintained in stx master.

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Nimalini Rasa (nrasa) wrote :

Verified with load built on 2020-06-27_00-41-42.

Revision history for this message
Bart Wensley (bartwensley) wrote :

Note that the suspected upstream issue (https://github.com/kubernetes/kubernetes/issues/68211) now has a fix submitted for it:
https://github.com/kubernetes/kubernetes/pull/89629

I'm not sure which future kubernetes version will include this fix.

Revision history for this message
Chris Friesen (cbf123) wrote :

I've seen application pods in "Unknown" state after RR patching on AIO-SX as well.

I tried bringing in the upstream fix from https://github.com/kubernetes/kubernetes/pull/89629 but it doesn't seem to have fixed the problem. I'm still seeing pods in the "Unknown" state.

This fix may be necessary (yet to be determined) but it is not sufficient by itself to fix the issue.
