Juju has to allow tuning lxd-related kernel parameters

Bug #1817774 reported by Andrey Grebennikov
This bug affects 3 people
Affects         Status         Importance  Assigned to      Milestone
Canonical Juju  Fix Released   High        Richard Harding
2.6             Fix Committed  High        John A Meinel
2.7             Fix Released   High        John A Meinel

Bug Description

Juju 2.5 deployment.
When many lxd containers are created on the same host, the error "too many open files" appears.
The reason is the kernel parameter fs.inotify.max_user_instances = 256, which has been set by juju.
Most likely here:
https://github.com/juju/juju/blob/a5ab92ec9b7f5da3678d9ac603fe52d45af24412/snap/hooks/configure#L15

Ideally these parameters should be tunable; otherwise they should be calculated based on the machine configuration (amount of RAM, CPU, etc.).

Richard Harding (rharding) wrote :

Can you please clarify whether this is an lxd provider based deploy, or lxd containers running on another provider such as maas/openstack?

Ed Stewart (emcs2) wrote :

This was creating LXDs on a VM with Juju targeting GCE. Thanks

Richard Harding (rharding) wrote :

Thanks Ed. I reviewed how this is set up, and we had done some investigation around this in the past because of the localhost provider, where you're more apt to run more lxd containers. During that investigation we followed the official LXD guidelines for running it at scale:

https://lxd.readthedocs.io/en/latest/production-setup/

When you go through and implement those suggestions, some changes require a system reboot, some require specific memory allocation, and there is an array of other side effects. In the end we found that really doing this at scale involves a number of settings, in a way that makes it too difficult to properly predict/find a "best practice" for users to be safe by default.

Our suggestion would be to use the production setup guide for LXD to create a 'lxd-host' charm or subordinate that sets the properties to values your specific install is comfortable with and that work for the systems you're running on.

Due to the array of changes that need to be made and the potential consequences I'm going to mark this as won't fix but if you'd like to discuss our understanding of the problem please don't hesitate to reach out and let me know.

Changed in juju:
status: New → Won't Fix
Andrey Grebennikov (agrebennikov) wrote :

Just to clarify here: it turned out that the piece of code I pointed to initially has nothing to do with setting up the lxd host, but rather with setting up the juju-client side host while building the snap.
The lxd host is being set up with just the default parameters.

Ed Stewart (emcs2) wrote :

Hi both,

Thanks for the feedback. It's probably my ignorance, but I'm still a bit confused.

Our host VM (on GCP) has the following in sysctl.d:

ubuntu@juju-0298df-0:/etc/sysctl.d$ ls -ltr
total 52
-rw-r--r-- 1 root root 519 Jan 17 2018 README
-rw-r--r-- 1 root root 506 Jan 17 2018 10-zeropage.conf
-rw-r--r-- 1 root root 1292 Jan 17 2018 10-ptrace.conf
-rw-r--r-- 1 root root 509 Jan 17 2018 10-network-security.conf
-rw-r--r-- 1 root root 1184 Jan 17 2018 10-magic-sysrq.conf
-rw-r--r-- 1 root root 257 Jan 17 2018 10-link-restrictions.conf
-rw-r--r-- 1 root root 726 Jan 17 2018 10-kernel-hardening.conf
-rw-r--r-- 1 root root 490 Jan 17 2018 10-ipv6-privacy.conf
-rw-r--r-- 1 root root 77 Jan 17 2018 10-console-messages.conf
-rw-r--r-- 1 root root 2115 Sep 5 21:51 11-gce-network-security.conf
-rw-r--r-- 1 root root 153 Sep 10 19:19 10-lxd-inotify.conf
-rw-r--r-- 1 root root 185 Nov 20 15:14 99-cloudimg-ipv6.conf
-rw-r--r-- 1 root root 2064 Nov 20 16:08 99-gce.conf
lrwxrwxrwx 1 root root 14 Feb 13 21:32 99-sysctl.conf -> ../sysctl.conf

Only one of these confs includes max_user_instances:

ubuntu@juju-0298df-0:/etc/sysctl.d$ grep "max_user_" *
10-lxd-inotify.conf:fs.inotify.max_user_instances = 1024

(and this isn't being reset by anything in ../sysctl.conf).

So sysctl, on boot-up, should be setting max_user_instances to 1024 (which would be fine for us!)

But when I check the running value in the host, it's not 1024, but it's 256:

ubuntu@juju-0298df-0:/etc/sysctl.d$ sudo sysctl -a | grep max_user_inst
fs.inotify.max_user_instances = 256

This is on a VM built entirely by Juju. So something is setting it back to 256 after sysctl.d runs. This is what's limiting our LXD counts.

So whilst I understand that asking Juju to mess around with lxd parameters is probably not a great idea, I think something is pushing the lxd parameters down from the default lxd parameters.

This VM is built by Juju, however, we do install the Juju client on it after Juju has deployed.

So in our case, I think Andrey's original comment was correct. When the Juju client is installed on a machine, it forces max_user_instances down to 256 even if it was originally higher:

ubuntu@juju-0298df-0:/usr/lib/sysctl.d$ ls -ltr
total 8
-rw-r--r-- 1 root root 682 Feb 13 21:32 50-default.conf
-rw-r--r-- 1 root root 73 Feb 15 00:20 juju-2.conf
ubuntu@juju-0298df-0:/usr/lib/sysctl.d$ more juju-2.conf
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 256
ubuntu@juju-0298df-0:/usr/lib/sysctl.d$

So should the Juju client snap really be doing this?
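
One possible explanation for 256 winning over 1024, assuming the host applies sysctl.d files the way sysctl.d(5) describes (sorted by file name across all directories, with later assignments overriding earlier ones): juju-2.conf sorts after 10-lxd-inotify.conf, so it is applied later. Here is a minimal sketch that lists the assignments of one key in that order. The function name sysctl_assignments is made up for illustration, and the sketch ignores the rule that a file in /etc/sysctl.d masks a same-named file in /usr/lib/sysctl.d.

```shell
# Print every assignment of a sysctl key found in the given directories,
# ordered by file name across all directories (C locale), so the last line
# printed is the assignment expected to take effect.
# Assumption: file names contain no whitespace.
# Usage: sysctl_assignments KEY DIR...
sysctl_assignments() {
    key=$1
    shift
    for dir in "$@"; do
        for file in "$dir"/*.conf; do
            [ -f "$file" ] || continue
            # Emit "basename path" so the sort key is the file name only.
            printf '%s %s\n' "$(basename "$file")" "$file"
        done
    done | LC_ALL=C sort -k1,1 | while read -r _name path; do
        # Match lines of the form "KEY = value", ignoring leading blanks.
        awk -v key="$key" -v path="$path" '
            { line = $0; sub(/^[ \t]+/, "", line) }
            index(line, key) == 1 && substr(line, length(key) + 1) ~ /^[ \t]*=/ {
                print path ": " $0
            }
        ' "$path"
    done
}
```

For example:

    sysctl_assignments fs.inotify.max_user_instances /etc/sysctl.d /usr/lib/sysctl.d

On a host laid out like the one above, this should list 10-lxd-inotify.conf first and juju-2.conf last.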

Richard Harding (rharding) wrote :

Thanks, the fact that the client is installed makes sense now. I appreciate the analysis; it helped me look and note that lxd added that sysctl file in Apr 2017, whereas we had added this to the snap setup in Feb 2017. I think we originally needed that config and set a default, and lxd later set a larger default. We missed that it had been added and that it negates the need for Juju to set this when using the client.

https://github.com/lxc/lxd-pkg-ubuntu/commits/8dc4a41e780000c323e119ba14f4463b4c222c8e/debian/lxd.sysctl

https://github.com/juju/juju/commit/609dbb29a77e01f54d602be8ca373a9a171f8653#diff-8f4e953fdcce135fef1df9e88717ab5d

We'll look at updating the snap definition and see if we can test/validate that we no longer need the lxd tweaks and update it.

As an immediate workaround for your case, I'd suggest testing removing the file '/usr/lib/sysctl.d/juju-2.conf' and kicking the system.
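
A minimal sketch of that workaround, under the assumption that the override lives at /usr/lib/sysctl.d/juju-2.conf as shown in the listing earlier in this thread. The function name and its optional root-prefix argument are hypothetical; the prefix exists only so the function can be tried against a scratch directory first.

```shell
# Remove the sysctl override written by the Juju snap, if it is present.
# Usage: remove_juju_sysctl_override [ROOT_PREFIX]
#   ROOT_PREFIX is a hypothetical testing aid; leave it empty to act on
#   the real filesystem (which needs root privileges).
remove_juju_sysctl_override() {
    root=${1:-}
    conf="$root/usr/lib/sysctl.d/juju-2.conf"
    if [ -f "$conf" ]; then
        rm -- "$conf" && echo "removed $conf"
    else
        echo "not present: $conf"
    fi
}
```

Deleting the file does not change the running value by itself. The remaining sysctl.d files have to be re-applied afterwards, either by rebooting or with `sudo sysctl --system`, and `sysctl fs.inotify.max_user_instances` can then be used to check the result.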

John A Meinel (jameinel) wrote :

If lxd is always adding a reasonable value, we can just remove it from our
configuration. We only need it when Juju interacts with LXD. We added it to
Juju a long time ago because LXD was by-default-installed on all machines,
and they didn't want to tweak the value unless people were actually using
LXD.

If their policy has finally changed, then I'm more than happy to remove our
override.

John
=:->


Richard Harding (rharding) wrote :

I've put up a PR here: https://github.com/juju/juju/pull/9805

I also removed the max_user_watches setting, as the two seem to go hand in hand.

James Page (james-page) wrote :

Rick - would it be possible to backport this fix back to 2.5.x as well?

John A Meinel (jameinel) wrote :

Are you able to test the 2.6 edge snap and make sure it doesn't cause
problems? I think Rick wanted to wait until it had some real world exposure
before we change the stable snap branch.


Tim Penhey (thumper)
Changed in juju:
status: Won't Fix → Fix Committed
assignee: nobody → Richard Harding (rharding)
milestone: none → 2.6-beta1
importance: Undecided → High
Changed in juju:
status: Fix Committed → Fix Released
Paul Goins (vultaire) wrote :

I just hit this on one of the machines we use for charm testing, after freshly redeploying the host system today with Bionic. Was at a loss as to why our inotify changes were being "ignored".

I see "juju --version" reporting 2.6.5-bionic-amd64 here, but the file is still present on the system despite the fresh reinstall, so I'm wondering if either an older version of the snap is making it onto the system first, or if the file hasn't quite been removed yet?

Changed in juju:
status: Fix Released → Confirmed
tags: added: canonical-bootstack
Paul Goins (vultaire) wrote :

I'm not sure what branch the juju snap is coming from; is it master? It seems master's last change was March 21st, whereas this change was merged to the develop branch and has yet to make it over to the master branch, as far as I can tell.

Running "snap download juju" and looking at the snap directly, I clearly see the offending code which creates that sysctl file is still in there.

Tim Penhey (thumper) wrote :

Point releases are made from the related branch. For example, the 2.6.9 release comes from the 2.6 branch.

The next release, currently 2.7, comes from the develop branch.

Paul Goins (vultaire) wrote :

Hi Tim, thanks for the explanation, this helps.

Unfortunately, it seems the issue exists even in the current edge release. I just downloaded via "snap download juju --channel edge" and, when I mount the snap, I see the juju-2.conf code is still present in snap/hooks/configure. (Specifically, the code removed in this merged commit: https://github.com/juju/juju/commit/ef1db148d06f46223af45972ff00898783b959cb)

The code is not present in any of the 2.5, 2.6 or develop branches at present, so I'm not sure how what seems to be an older version of this file is making it into the snap.

Jason Hobbs (jason-hobbs) wrote :

We're hitting this too with juju 2.7.1

tags: added: cdo-qa
Richard Harding (rharding) wrote :

Can you please file a new bug and post it with the config noted above? The fix
for this landed and was distributed, so if we're hitting it again we'll
have to diagnose anew, as we removed the custom config.


Tim Penhey (thumper) wrote :

We don't add the file any more, but if the file is there, we don't delete it.

You should be able to manually delete the file, and reinstall the snap to confirm it doesn't re-add it.

Tim Penhey (thumper) wrote :

Sorry, didn't delete from all the right places...

John A Meinel (jameinel) wrote :

I just pushed an update to:
lp:~juju-qa-bot/+git/juju-2-6-edge-tracking-snap
lp:~juju-qa-bot/+git/juju-2-7-edge-tracking-snap
lp:~juju-qa-bot/+git/juju-lp-snap
lp:~juju-qa-bot/+git/juju-edge-snap

We are also working on changing our snapcraft so that instead of being built from multiple snapcraft repositories, we instead build from multiple juju branches, but that will take a bit longer.

Our edge snaps should get built fairly soon and be updated so that the configure script no longer creates the juju-2.conf file.
We still don't delete it if it exists, which we could take as a stronger step, but we at least shouldn't be creating it anymore.

John A Meinel (jameinel) wrote :

I confirmed that at least 2.6/edge has been updated:
$ snap info juju
...
installed: 2.6.11+2.6-b96e122 (10408) 75MB classic
$ sudo snap run --shell juju
# cat /snap/juju/current/meta/hooks/configure
#!/bin/bash

# Make sure we have lxd installed to use
if [ "$(which lxd)" = "" ]; then
    snap install lxd || true
fi

# copy bash completions to host system
cp -a $SNAP/bash_completions/* /usr/share/bash-completion/completions/. || true
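
The manual check above could be scripted roughly as follows. This is only a sketch: the function name is hypothetical, and the default path is the one shown in the session above. It reports whether a configure hook still mentions juju-2.conf, which is a textual check and not proof of what the hook does at run time.

```shell
# Report whether a snap configure hook still mentions the juju-2.conf override.
# Usage: hook_mentions_juju_conf [HOOK_PATH]
# Returns 0 if there is no mention, 1 if there is, 2 if the hook is unreadable.
hook_mentions_juju_conf() {
    hook=${1:-/snap/juju/current/meta/hooks/configure}
    if [ ! -r "$hook" ]; then
        echo "cannot read $hook" >&2
        return 2
    fi
    if grep -q 'juju-2\.conf' "$hook"; then
        echo "still references juju-2.conf: $hook"
        return 1
    fi
    echo "no juju-2.conf reference: $hook"
}
```

The same function can be pointed at the hook inside a snap fetched with "snap download juju" and mounted locally, to check a channel without installing it.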

John A Meinel (jameinel) wrote :

I also confirmed this for 2.7/edge and latest/edge. Those still need a final release (2.7.2, etc.) before the fix is in a stable release.

Changed in juju:
status: Confirmed → Fix Committed
milestone: 2.6-beta1 → 2.8-beta1
Harry Pidcock (hpidcock)
Changed in juju:
status: Fix Committed → Fix Released