occm-v1.18.0 is crashing on focal

Bug #1898726 reported by Jake Hill
This bug affects 1 person
Affects: CDK Addons
Importance: Medium
Assigned to: Unassigned

Bug Description

I installed charmed-kubernetes-519 with openstack-integrator-81. This seems to prefer focal. With this configuration, the openstack-cloud-controller-manager pods are crashing with:

runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.3.15+, 5.4.2+, or 5.5+
fatal error: mlock failed

which seems to be a wontfix problem with Go 1.14 on focal. cdk-addons pins this 1.18.0 version in its Makefile, I think?

I was asked to try occm-1.19.1 (which is built with Go 1.15), but I found that a little tricky.
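For anyone else triaging this, a quick generic sketch (not part of the charm) to report the two things the Go runtime complains about, the kernel release and the memlock limit:

```shell
# Triage sketch: the Go 1.14 runtime mlocks a page per signal stack as a
# workaround for a kernel bug, so a small RLIMIT_MEMLOCK (64 KiB is the
# systemd default for services on focal) can be exhausted; kernels
# 5.3.15+/5.4.2+/5.5+ don't need the workaround at all.
kernel="$(uname -r)"        # e.g. 5.4.0-48-generic
memlock_kib="$(ulimit -l)"  # e.g. 64 under systemd service defaults
echo "kernel=${kernel} memlock=${memlock_kib}"
```

Note that Ubuntu's 5.4.0 carries backported fixes the Go version check cannot see, which is why the runtime can misfire even on a patched kernel.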

Revision history for this message
George Kraft (cynerva) wrote :

I'm unable to reproduce this. I also confirmed with our QA team that they tested Charmed Kubernetes 1.19 with openstack-integrator on Focal, and they also did not run into this issue. We're going to need more details about your environment to figure out what's going on here.

What version of OpenStack are you running?

Can you please paste or attach the full logs from the openstack-cloud-controller-manager pod?

Can you please paste or attach output from the following commands:

juju run --application kubernetes-worker -- uname -a
juju run --application kubernetes-worker -- ulimit -l

Changed in cdk-addons:
status: New → Incomplete
Revision history for this message
Jake Hill (routergod) wrote :

Thank you for looking at this.

Openstack is also charmed. openstack-origin=cloud:bionic-ussuri.

I have attached one of the pod logs, hope this helps!

routergod@management:~$ juju run --application kubernetes-worker -- uname -a
- Stdout: |
    Linux juju-db327e-k8s-6 5.4.0-48-generic #52-Ubuntu SMP Thu Sep 10 10:58:49 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  UnitId: kubernetes-worker/0
- Stdout: |
    Linux juju-db327e-k8s-7 5.4.0-48-generic #52-Ubuntu SMP Thu Sep 10 10:58:49 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  UnitId: kubernetes-worker/1
- Stdout: |
    Linux juju-db327e-k8s-8 5.4.0-48-generic #52-Ubuntu SMP Thu Sep 10 10:58:49 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  UnitId: kubernetes-worker/2

routergod@management:~$ juju run --application kubernetes-worker -- ulimit -l
- Stdout: |
    64
  UnitId: kubernetes-worker/1
- Stdout: |
    64
  UnitId: kubernetes-worker/0
- Stdout: |
    64
  UnitId: kubernetes-worker/2

Revision history for this message
Jake Hill (routergod) wrote :

Strangely, if I repeat that ulimit check against the machines directly, I get a different result:

routergod@management:~$ juju run --machine 6,7,8 -- ulimit -l
- MachineId: "6"
  Stdout: |
    65536
- MachineId: "7"
  Stdout: |
    65536
- MachineId: "8"
  Stdout: |
    65536

Do I have a problem somewhere else?
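If my understanding is right (an assumption on my part), the discrepancy is about where each command runs: `juju run --application` executes via the unit agent, a systemd service that inherits systemd's LimitMEMLOCK (64 KiB by default on focal), while `--machine` goes through a login path with the usual PAM limits. Resource limits are per-process and inherited, which a generic check can show:

```shell
# Limits are inherited per process, not set per machine: a child started
# from this shell sees the shell's limit, while a process started by
# systemd sees the service's LimitMEMLOCK instead.
ulimit -l            # this shell's limit
sh -c 'ulimit -l'    # a child process inherits the same value
```

So both numbers are "correct"; they just describe different process trees.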

Revision history for this message
Jake Hill (routergod) wrote :

I did some digging around the LimitMEMLOCK setting(s) on Focal.

  https://discourse.juju.is/t/ulimit-limitmemlock-on-focal/3661

It seems that 64 KiB is the normal LimitMEMLOCK for non-interactive things under systemd on Focal. With this setting, OCCM does not run.

Not a proper fix, but if I manually set LimitMEMLOCK=infinity in containerd.service, OCCM runs OK.
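For reference, the same manual change can be expressed as a systemd drop-in so it survives package upgrades (the path below is the standard drop-in location, not something the charm manages):

```ini
# /etc/systemd/system/containerd.service.d/memlock.conf
# Apply with: systemctl daemon-reload && systemctl restart containerd
[Service]
LimitMEMLOCK=infinity
```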

I don't know if OCCM actually requires the higher setting, or if the Go 1.14 bug-detection logic is just being tickled in my case. It is very confusing!

Any suggestions please?

Revision history for this message
Jake Hill (routergod) wrote :
Revision history for this message
George Kraft (cynerva) wrote :

Thanks for the details. It's still unclear to me why we're unable to reproduce this issue, but I think we're just going to have to move forward without that.

Yes, the Go change you linked appears to be a fix for this problem. The Go issue[1] discusses it in detail. In short, to fix this we need to ship a version of openstack-cloud-controller-manager built with Go 1.14.1+ or 1.15+. That, or raise LimitMEMLOCK for the containerd service, but that strikes me as more of a workaround than a long-term solution.

[1]: https://github.com/golang/go/issues/37436
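A small sketch of the version cutoff, since it is easy to get wrong: the mlock fix landed in Go 1.14.1 and every 1.15+ release, and `sort -V` can do the comparison (the helper name here is illustrative, not an existing tool):

```shell
# Sketch: decide whether a given Go toolchain version includes the mlock
# fix (1.14.1+ or any 1.15+). `sort -V` handles the version comparison.
has_mlock_fix() {
  v="${1#go}"    # strip the "go" prefix, e.g. go1.14 -> 1.14
  min="1.14.1"
  [ "$(printf '%s\n%s\n' "$min" "$v" | sort -V | head -n1)" = "$min" ]
}
has_mlock_fix go1.14 && echo fixed || echo affected   # prints "affected"
has_mlock_fix go1.15 && echo fixed || echo affected   # prints "fixed"
```

For a built binary, `go version <path-to-binary>` reports the toolchain it was compiled with, which is a quick way to check an occm image.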

Changed in cdk-addons:
importance: Undecided → Medium
status: Incomplete → Triaged
Revision history for this message
Drew Freiberger (afreiberger) wrote :

@cynerva

Working with the Field team, we found that this is not reproducible with k8s-worker constraints of:
cores=4 mem=4G root-disk=16G root-disk-source=volume

But we are able to reproduce simply by increasing these constraints to:
cpu-cores=16 mem=131072 root-disk=153600 root-disk-source=volume

This doesn't change the NUMA architecture (both VMs end up with a single NUMA node), but it somehow affects whether mlocking is attempted. Either the larger number of cores or the larger memory footprint is triggering the mlock attempt.

We found a workaround: set the following in /etc/default/docker and restart docker.service.

DOCKER_OPTS="--default-ulimit=memlock=-1:-1"

The docker charm's "docker-opts" option does not populate this setting, so we had to configure it manually on the kubernetes-worker units. We've also filed a related bug: https://bugs.launchpad.net/charm-docker/+bug/1939038
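A generic way to confirm the new ulimit actually reached a workload is to read the live limit from /proc for the process in question (shown here against the current shell; in practice substitute the container process's PID):

```shell
# Read the effective memlock limit of a running process straight from
# /proc. $$ is this shell; use the PID of the occm container process to
# verify the --default-ulimit change took effect.
pid=$$
grep 'Max locked memory' "/proc/${pid}/limits"
```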
