occm-v1.18.0 is crashing on focal

Bug #1898726 reported by Jake Hill
This bug affects 1 person
Affects: CDK Addons
Importance: Medium
Assigned to: Unassigned

Bug Description

I installed charmed-kubernetes-519 with openstack-integrator-81. This seems to prefer focal. With this configuration, the openstack-cloud-controller-manager pods are crashing with:

runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.3.15+, 5.4.2+, or 5.5+
fatal error: mlock failed

which seems to be a wontfix problem with Go 1.14 on focal. cdk-addons pins this 1.18.0 version in its Makefile, I think?

I was asked to try occm-1.19.1 (which is built with Go 1.15), but I found that a little tricky.
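For anyone else triaging this, a quick generic sketch (not part of the charm) to report the two things the Go runtime complains about, the kernel release and the memlock limit:

```shell
# Triage sketch: the Go 1.14 runtime mlocks a page per signal stack as a
# workaround for a kernel bug, so a small RLIMIT_MEMLOCK (64 KiB is the
# systemd default for services on focal) can be exhausted; kernels
# 5.3.15+/5.4.2+/5.5+ don't need the workaround at all.
kernel="$(uname -r)"        # e.g. 5.4.0-48-generic
memlock_kib="$(ulimit -l)"  # e.g. 64 under systemd service defaults
echo "kernel=${kernel} memlock=${memlock_kib}"
```

Note that Ubuntu's 5.4.0 carries backported fixes the Go version check cannot see, which is why the runtime can misfire even on a patched kernel.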

Revision history for this message
George Kraft (cynerva) wrote :

I'm unable to reproduce this. I also confirmed with our QA team that they tested Charmed Kubernetes 1.19 with openstack-integrator on Focal, and they also did not run into this issue. We're going to need more details about your environment to figure out what's going on here.

What version of OpenStack are you running?

Can you please paste or attach the full logs from the openstack-cloud-controller-manager pod?

Can you please paste or attach output from the following commands:

juju run --application kubernetes-worker -- uname -a
juju run --application kubernetes-worker -- ulimit -l

Changed in cdk-addons:
status: New → Incomplete
Revision history for this message
Jake Hill (routergod) wrote :

Thank you for looking at this.

Openstack is also charmed. openstack-origin=cloud:bionic-ussuri.

I have attached one of the pod logs, hope this helps!

routergod@management:~$ juju run --application kubernetes-worker -- uname -a
- Stdout: |
    Linux juju-db327e-k8s-6 5.4.0-48-generic #52-Ubuntu SMP Thu Sep 10 10:58:49 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  UnitId: kubernetes-worker/0
- Stdout: |
    Linux juju-db327e-k8s-7 5.4.0-48-generic #52-Ubuntu SMP Thu Sep 10 10:58:49 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  UnitId: kubernetes-worker/1
- Stdout: |
    Linux juju-db327e-k8s-8 5.4.0-48-generic #52-Ubuntu SMP Thu Sep 10 10:58:49 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  UnitId: kubernetes-worker/2

routergod@management:~$ juju run --application kubernetes-worker -- ulimit -l
- Stdout: |
    64
  UnitId: kubernetes-worker/1
- Stdout: |
    64
  UnitId: kubernetes-worker/0
- Stdout: |
    64
  UnitId: kubernetes-worker/2

Revision history for this message
Jake Hill (routergod) wrote :

Strangely, if I repeat that ulimit check against the machines directly, I get a different result:

routergod@management:~$ juju run --machine 6,7,8 -- ulimit -l
- MachineId: "6"
  Stdout: |
    65536
- MachineId: "7"
  Stdout: |
    65536
- MachineId: "8"
  Stdout: |
    65536

Do I have a problem somewhere else?
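If my understanding is right (an assumption on my part), the discrepancy is about where each command runs: `juju run --application` executes via the unit agent, a systemd service that inherits systemd's LimitMEMLOCK (64 KiB by default on focal), while `--machine` goes through a login path with the usual PAM limits. Resource limits are per-process and inherited, which a generic check can show:

```shell
# Limits are inherited per process, not set per machine: a child started
# from this shell sees the shell's limit, while a process started by
# systemd sees the service's LimitMEMLOCK instead.
ulimit -l            # this shell's limit
sh -c 'ulimit -l'    # a child process inherits the same value
```

So both numbers are "correct"; they just describe different process trees.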

Revision history for this message
Jake Hill (routergod) wrote :

I did some digging around the LimitMEMLOCK setting(s) on Focal.

  https://discourse.juju.is/t/ulimit-limitmemlock-on-focal/3661

It seems that 64 KiB is the normal LimitMEMLOCK for non-interactive things under systemd on Focal. With this setting, OCCM does not run.

Not a proper fix, but if I manually set LimitMEMLOCK=infinity in containerd.service, OCCM runs OK.
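For reference, the same manual change can be expressed as a systemd drop-in so it survives package upgrades (the path below is the standard drop-in location, not something the charm manages):

```ini
# /etc/systemd/system/containerd.service.d/memlock.conf
# Apply with: systemctl daemon-reload && systemctl restart containerd
[Service]
LimitMEMLOCK=infinity
```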

I don't know if OCCM actually requires the higher setting, or if the Go 1.14 bug-detection logic is just being tickled in my case. It is very confusing!

Any suggestions please?

Revision history for this message
Jake Hill (routergod) wrote :
Revision history for this message
George Kraft (cynerva) wrote :

Thanks for the details. It's still unclear to me why we're unable to reproduce this issue, but I think we're just going to have to move forward without that.

Yes, the Go change you linked appears to be a fix for this problem. The Go issue[1] discusses it in detail. In short, to fix this we need to ship a version of openstack-cloud-controller-manager built with Go 1.14.1+ or 1.15+. That, or raise LimitMEMLOCK for the containerd service, but that strikes me as more of a workaround than a long-term solution.

[1]: https://github.com/golang/go/issues/37436
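A small sketch of the version cutoff, since it is easy to get wrong: the mlock fix landed in Go 1.14.1 and every 1.15+ release, and `sort -V` can do the comparison (the helper name here is illustrative, not an existing tool):

```shell
# Sketch: decide whether a given Go toolchain version includes the mlock
# fix (1.14.1+ or any 1.15+). `sort -V` handles the version comparison.
has_mlock_fix() {
  v="${1#go}"    # strip the "go" prefix, e.g. go1.14 -> 1.14
  min="1.14.1"
  [ "$(printf '%s\n%s\n' "$min" "$v" | sort -V | head -n1)" = "$min" ]
}
has_mlock_fix go1.14 && echo fixed || echo affected   # prints "affected"
has_mlock_fix go1.15 && echo fixed || echo affected   # prints "fixed"
```

For a built binary, `go version <path-to-binary>` reports the toolchain it was compiled with, which is a quick way to check an occm image.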

Changed in cdk-addons:
importance: Undecided → Medium
status: Incomplete → Triaged
Revision history for this message
Drew Freiberger (afreiberger) wrote :

@cynerva

Working with the Field team, we found that this is not reproducible with k8s-worker constraints of:
cores=4 mem=4G root-disk=16G root-disk-source=volume

But we are able to reproduce simply by increasing these constraints to:
cpu-cores=16 mem=131072 root-disk=153600 root-disk-source=volume

This doesn't change the NUMA architecture (both VMs end up with a single NUMA node), but it somehow affects whether mlocking is attempted. Either the larger number of cores or the larger memory footprint is triggering the mlock attempt.

We found a workaround: set the following in /etc/default/docker and restart docker.service.

DOCKER_OPTS="--default-ulimit=memlock=-1:-1"

The docker charm's "docker-opts" option does not populate this setting, so we had to configure it manually on the kubernetes-worker units. We've also filed a related bug: https://bugs.launchpad.net/charm-docker/+bug/1939038
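A generic way to confirm the new ulimit actually reached a workload is to read the live limit from /proc for the process in question (shown here against the current shell; in practice substitute the container process's PID):

```shell
# Read the effective memlock limit of a running process straight from
# /proc. $$ is this shell; use the PID of the occm container process to
# verify the --default-ulimit change took effect.
pid=$$
grep 'Max locked memory' "/proc/${pid}/limits"
```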
