On deploy, pods go into CrashLoopBackOff and leak mounts on workers

Bug #1868368 reported by Barry Price
Affects: Openstack Integrator Charm
Status: Incomplete
Importance: High
Assigned to: Unassigned

Bug Description

Pod status:

$ kubectl get po --all-namespaces | grep csi-cinder
kube-system csi-cinder-controllerplugin-0 2/4 CrashLoopBackOff 1300 45h
kube-system csi-cinder-nodeplugin-p6lzs 1/2 CrashLoopBackOff 3391 9d
kube-system csi-cinder-nodeplugin-schgg 1/2 CrashLoopBackOff 2622 9d

Pod descriptions:

csi-cinder-controllerplugin-0:
https://paste.ubuntu.com/p/DNVmMHW89M/

csi-cinder-nodeplugin-p6lzs:
https://paste.ubuntu.com/p/bn9khQcHvm/

csi-cinder-nodeplugin-schgg:
https://paste.ubuntu.com/p/mnshrW6qCt/

Pod logs, for controllerplugin:

https://paste.ubuntu.com/p/vPbD52GrRm/

And for nodeplugin (both seem identical):

https://paste.ubuntu.com/p/NMsh7MfhZt/
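
For reference, equivalent output can be re-gathered with something along these lines (the exact container name to pass to -c varies per pod, so list the containers first):

$ kubectl -n kube-system describe pod csi-cinder-controllerplugin-0
$ kubectl -n kube-system get pod csi-cinder-controllerplugin-0 -o jsonpath='{.spec.containers[*].name}'
$ kubectl -n kube-system logs csi-cinder-controllerplugin-0 -c <container-name> --previous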

Mount leak on worker/0:

$ wc -l /proc/self/mounts
34914 /proc/self/mounts

And on worker/1:

$ wc -l /proc/self/mounts
49210 /proc/self/mounts

And details from both, which show suspiciously round numbers (this part may make more sense as a separate bug against Kubernetes itself, or some component of it?):
https://paste.ubuntu.com/p/HqVYyWPK4x/
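
To get a rough picture of what's leaking, grouping the mount targets by path shows which ones are duplicated - a quick sketch, run on the affected worker:

$ awk '{print $2}' /proc/self/mounts | sort | uniq -c | sort -rn | head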

Tags: canonical-is
Changed in charm-openstack-integrator:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Calvin Hartwell (calvinh) wrote :

@Barry any chance you could provide information about the OpenStack release (version used and charm versions) and also the Juju bundle used?

It would be interesting to see if the same workload runs on a newer release of OpenStack, as Tom mentioned this is running on Icehouse.

Revision history for this message
Tom Haddon (mthaddon) wrote :

The setup here is that we've added cs:~containers/openstack-integrator to a pre-deployed 1.16 CDK instance with the following relations:

- ["openstack-integrator", "kubernetes-master"]
- ["openstack-integrator", "kubernetes-worker"]

We then also run "juju trust openstack-integrator".
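
Roughly, the equivalent commands are the following (a sketch of the steps above, exact charm revision aside):

$ juju deploy cs:~containers/openstack-integrator
$ juju add-relation openstack-integrator kubernetes-master
$ juju add-relation openstack-integrator kubernetes-worker
$ juju trust openstack-integrator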

The OpenStack version this is deployed on is Icehouse, FWIW.

Revision history for this message
Barry Price (barryprice) wrote :

> We then also run "juju trust openstack-integrator".

This actually didn't work, so was undone, and we configured authentication via charm options instead.
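
In practice that means setting the Keystone credentials directly on the charm, along these lines (option names here are from memory and the values are placeholders - `juju config openstack-integrator` lists the authoritative set):

$ juju config openstack-integrator \
    auth-url=<keystone-url> region=<region> \
    username=<user> password=<password> project-name=<project>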

See LP:1867095 for details there.

Revision history for this message
George Kraft (cynerva) wrote :

What charm revisions of kubernetes-master and openstack-integrator are you running?

What version of Kubernetes are you running?

Can you share logs from kubelet? `journalctl -o cat -u snap.kubelet.daemon` on the kubernetes-worker machines.

Changed in charm-openstack-integrator:
status: Triaged → Incomplete
Revision history for this message
Barry Price (barryprice) wrote :

Charm revisions:-

kubernetes-master:

collect_date: '2019-10-07 10:28:25.478476'
collect_url: cs:~containers/kubernetes-master-754

openstack-integrator:
collect_date: '2020-03-12 04:14:25.490696'
collect_url: cs:~containers/openstack-integrator-49

Version-wise, we're on the default 1.16/stable channel - installed snaps on master units are:

Name Version Rev Tracking Publisher Notes
canonical-livepatch 9.5.2 94 stable canonical✓ -
cdk-addons 1.16.7 1645 1.16 canonical✓ -
core 16-2.43.3 8689 stable canonical✓ core
kube-apiserver 1.16.4 1501 1.16 canonical✓ -
kube-controller-manager 1.16.4 1413 1.16 canonical✓ -
kube-proxy 1.16.4 1400 1.16 canonical✓ classic
kube-scheduler 1.16.4 1380 1.16 canonical✓ -
kubectl 1.16.4 1380 1.16 canonical✓ classic

And here are those kubelet logs from each worker unit (Canonical internal access only, sorry):

https://private-fileshare.canonical.com/~barryprice/worker0.log.xz

https://private-fileshare.canonical.com/~barryprice/worker1.log.xz

Changed in charm-openstack-integrator:
status: Incomplete → New
Revision history for this message
Barry Price (barryprice) wrote :

@Calvin - I'm going to attempt a deploy on a newer cloud next, will get back to you.

Revision history for this message
Barry Price (barryprice) wrote :

@Calvin - I can't reproduce on a bionic/stein cloud, where things seem to be working okay, but I've just noticed the current bundle deploys K8s 1.17/stable rather than 1.16/stable.

I'll look into upgrading our K8s cluster to 1.17 and see whether that changes anything.
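
The upgrade itself should just be a matter of retargeting the snap channel via the charms and running the upgrade action on each unit - a sketch, assuming the standard CDK upgrade path and our unit numbers:

$ juju config kubernetes-master channel=1.17/stable
$ juju config kubernetes-worker channel=1.17/stable
$ juju run-action kubernetes-master/0 upgrade
$ juju run-action kubernetes-worker/0 upgrade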

George Kraft (cynerva)
Changed in charm-openstack-integrator:
status: New → Triaged
Revision history for this message
George Kraft (cynerva) wrote :

Thanks for the details.

From the kubelet logs, it looks like your 1.16 cluster is running docker instead of containerd. Can you confirm? Could that be something that differs between your existing 1.16 cluster and the test 1.17 deployment?
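
A quick way to check the runtime on each node, in case it helps:

$ kubectl get nodes -o wide

(the CONTAINER-RUNTIME column will show docker:// or containerd://)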

Can you share output of `juju status --format yaml`? I'm trying to get a sense of everything involved in this deployment, which Ubuntu series you're using, etc.

A lot of your snaps appear to be out of date - you have 1.16.4, but the latest version on 1.16/stable is 1.16.8. Can you try updating those and see if it helps?
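
In a CDK deployment the snap versions are normally driven by the charms, so the charm upgrade action is the usual route, but a direct refresh on a unit would look something like:

$ sudo snap refresh kube-apiserver --channel=1.16/stable
$ sudo snap refresh kube-controller-manager --channel=1.16/stable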

tags: added: canonical-is
Revision history for this message
Tom Haddon (mthaddon) wrote :

We're going to take another approach for now and work with Juju 2.8, which won't need storage in k8s for operator charms.

I'll mark this incomplete for now and we'll only reopen if we're looking to pick this up again for some reason.

Changed in charm-openstack-integrator:
status: Triaged → Incomplete