deploying openstack on top of focal results in failures due to bumping up against the maxkeys limit

Bug #1891223 reported by Jason Hobbs
This bug affects 2 people

Affects          Status        Importance  Assigned to  Milestone
Canonical Juju   Triaged       Wishlist    Unassigned
lxd              Fix Released  Unknown

Bug Description

Forked from bug 1839616.

Deploying PCB (Ussuri) on top of focal results in sporadic failures due to running up against the kernel.keys.maxkeys limit. This limit is discussed here:

https://linuxcontainers.org/lxd/docs/master/production-setup

Normally, a system administrator would adjust this limit based on the workload. In this case, Juju owns the system, so it should be responsible for adjusting the limit.

Juju users can work around this by providing cloud-init userdata that adjusts the sysctl limit.
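
As an illustrative sketch only (the file name and exact values are placeholders; the numbers follow the LXD production-setup guide linked above, and the keys accepted by cloudinit-userdata can vary by Juju version), such userdata can be attached to a model roughly like this:

    # cloudinit-userdata.yaml -- cloud-init fragment consumed by Juju;
    # postruncmd runs once after the machine has been provisioned
    postruncmd:
      - echo 'kernel.keys.maxkeys = 2000' >> /etc/sysctl.d/90-keys.conf
      - echo 'kernel.keys.maxbytes = 2000000' >> /etc/sysctl.d/90-keys.conf
      - sysctl --system

    # apply to all new models (or use "juju model-config" for an existing model)
    juju model-defaults cloudinit-userdata="$(cat cloudinit-userdata.yaml)"

Treat this as a starting point, not a vetted configuration; the right values depend on container and snap density on each host.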

Revision history for this message
Pen Gale (pengale) wrote :

Does this bug show up when just deploying openstack-base on top of something? Or does it only show up in more complex bundles?

If the latter, can you drop a bundle in the comments here that can reproduce the issue?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I don't know about openstack-base; we don't ever use that.

Here's our bundle. You're unlikely to be able to deploy this without big hardware and specially crafted networks.

https://pastebin.canonical.com/p/v5KbqcNSzW/

You might be able to reproduce it by deploying a special charm that installs lots of snaps.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

You could probably also reproduce it by artificially lowering the maxkeys limit on a system and then deploying lots of containers to it.
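
A rough, untested sketch of that reproduction (the quota value and container count are arbitrary):

    # shrink the per-user key quota well below the default of 200
    sudo sysctl -w kernel.keys.maxkeys=20
    # launch containers until key creation starts failing
    # (typically surfacing as "Disk quota exceeded")
    for i in $(seq 1 30); do lxc launch ubuntu:20.04 "repro-$i"; done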

Revision history for this message
Tim Penhey (thumper) wrote :

This makes much more sense as a charm issue, not a Juju issue.

It is charms that should make sure the host is sufficiently configured. There is no way that Juju could make reasonable decisions that work for all deployments.


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

It's not a single charm. We deploy many charms to a system. Some are in containers. How are charms in containers going to influence system-wide sysctl limits?


Revision history for this message
Ryan Beisner (1chb1n) wrote :

Generically speaking, I think this is simply a matter of having multiple charm application lxd units on any given bare-metal host, with enough snaps installed cumulatively to exceed the limit. It shouldn't need to be OpenStack specifically.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

I agree that Juju is the place in the stack to address this issue. The only other option in my mind (a bad option at that) is a base-node type of charm that tunes the host before lxd units are deployed.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I suggest that the machine agent on the metal host is the right place to configure this, as the agents inside the containers won't be able to reconfigure the host limits, nor will they know that it's necessary.

Revision history for this message
Pen Gale (pengale) wrote :

It sounds like the problem here arises when deploying many charms that contain snaps into lxd containers, correct?

The tricky thing about solving this at the Juju level is that Juju doesn't know much about what's inside a charm. Juju isn't aware that the OpenStack charms contain snaps, and that this puts us in a situation where kernel.keys.maxkeys is a concern (or is a concern sooner than it would be when deploying non-snap-containing charms to lxd containers).

Of course, you can't solve this at the charm level, either, because the charm doesn't really even know that it's in an lxd container, much less how densely populated the host is with containers.

To a certain extent, this is a cloud provider issue. I should be able to request machines from my cloud that can handle running a whole lot of containers and snaps at once. I could cheekily tag this as a MAAS bug :-) Though of course, MAAS can already apply cloud-init scripts to machines, which solves this problem, as noted in the original description.

I agree that this is a legitimate problem, and that we might want to extend Juju's modelling in some sensible way to grapple with it. Doing so is non-trivial, however, and would have to fight for a place in the roadmap along with other feature requests.

Changed in juju:
status: New → Triaged
importance: Undecided → Wishlist
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

It's not really a MAAS bug; MAAS doesn't know how the machines it deploys will be used. Machines can have all the memory in the world, but setting maxkeys higher than it has to be would still waste memory (though I'm not sure how much).

Charms could somehow signal to Juju how many keys they need. A new feature, I agree.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Not sure I'm fully following here, but this kind of sysctl bump was previously made by the LXD deb, e.g. fs.inotify.max_user_instances in /etc/sysctl.d/10-lxd-inotify.conf:
https://bugs.launchpad.net/ubuntu/+source/lxd/+bug/1602192

Now that LXD is snap-only on focal, would the new place be here instead?
https://github.com/lxc/lxd-pkg-snap/blob/7c0719817b047d9670f2c4969fdcbb0a01a304f5/snapcraft/commands/daemon.start#L336-L357
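
For context, the deb-era mechanism was just a persistent sysctl drop-in; an equivalent file for the key quotas discussed in this bug would look roughly like the following (file name illustrative, values as recommended in the production-setup guide linked in the description), whereas the snap handles its bumps from daemon.start as in the lines linked above:

    # /etc/sysctl.d/10-lxd-keys.conf (illustrative)
    kernel.keys.maxkeys = 2000
    kernel.keys.maxbytes = 2000000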

Revision history for this message
Nobuto Murata (nobuto) wrote :

https://github.com/lxc/lxd-pkg-snap/pull/66 has been merged and is available in latest/edge. I've been told it will be applied to the 4.0 series in the next update.

Changed in lxd:
status: Unknown → Fix Released
Revision history for this message
Przemyslaw Hausman (phausman) wrote :

I have hit a similar issue on my current disaggregated OpenStack Ussuri deployment. In my case the problem is kernel.keys.maxbytes instead of kernel.keys.maxkeys.

In my deployment I have 3 control nodes. Each control node hosts ~20 LXD containers.

As described in "Production setup" (https://linuxcontainers.org/lxd/docs/master/production-setup/), kernel.keys.maxbytes defaults to 20000. In my environment, though, the usage I'm currently seeing is over 30000; please see below. Note that I have increased kernel.keys.maxbytes to 2000000 (as recommended in the "Production setup" guide) via cloudinit-userdata in my model-defaults.

cat /proc/key-users
    0: 214 213/213 115/1000000 2326/25000000
  100: 1 1/1 1/2000 9/2000000
  101: 1 1/1 1/2000 9/2000000
  117: 1 1/1 1/2000 9/2000000
  996: 1 1/1 1/2000 9/2000000
  997: 1 1/1 1/2000 9/2000000
 1000: 3 3/3 3/2000 37/2000000
64061: 1 1/1 1/2000 9/2000000
1000000: 1548 1548/1548 1548/2000 31521/2000000
1000100: 22 22/22 22/2000 198/2000000
1000101: 22 22/22 22/2000 198/2000000
1000107: 4 4/4 4/2000 36/2000000
1000113: 13 13/13 13/2000 117/2000000
1000114: 9 9/9 9/2000 81/2000000
1000115: 12 12/12 12/2000 108/2000000
1000116: 7 7/7 7/2000 63/2000000
1000117: 5 5/5 5/2000 45/2000000
1000118: 1 1/1 1/2000 9/2000000
1000996: 14 14/14 14/2000 126/2000000
1000997: 22 22/22 22/2000 198/2000000
1001000: 7 7/7 7/2000 93/2000000
1064060: 3 3/3 3/2000 27/2000000
1064061: 1 1/1 1/2000 9/2000000
1064062: 1 1/1 1/2000 9/2000000
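
As a reading aid (per keyrings(7)), each line of /proc/key-users is laid out as

    <uid>: <usage> <nkeys>/<nikeys> <qnkeys>/<maxkeys> <qnbytes>/<maxbytes>

so the 1000000 row above (the containers' shared idmap root) is at 1548 of 2000 keys and 31521 of 2000000 bytes.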
