lxd profile removed on reboot

Bug #1844937 reported by Chris Sanders
This bug affects 2 people
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: none

Bug Description

I have a few different scenarios where I am applying a custom profile to a juju-deployed LXD container. These range from charms where I'm waiting for profile support to be added to the charm, to very custom-to-me use cases where I need to apply settings that don't belong in the charm and that juju doesn't provide tooling for.

What I'm seeing is that *sometimes* profiles are removed from LXD containers after a host reboot. The fact that this only happens part of the time makes me think it is unintended, and I originally thought it was an LXD bug. In fact, I filed a bug against LXD, which was evaluated and determined not to be an LXD issue, leaving juju as the most likely source of this problem.

The GitHub bug is here: https://github.com/lxc/lxd/issues/6139
I'm happy to post more information here, but that bug report is quite thorough, and copy/pasting it here doesn't seem to add much value.

If you have any trouble reproducing this, I've also triggered it in other environments. Specifically, I have a K8s-in-LXD setup running right now on top of OpenStack that triggered this on 4 out of 8 containers on the first host reboot. I can provide that if it helps, but in my experience, applying a custom profile to a few juju-deployed containers and then rebooting the host is a pretty easy way to trigger this.

Revision history for this message
Chris Sanders (chris.sanders) wrote :

Of course, I should have provided some version numbers.

While some versions are explicitly listed in the github bug, I'm reproducing this right now with juju 2.6.8 running on top of OpenStack.

Revision history for this message
Richard Harding (rharding) wrote :

Hmm, this does sound unintentional. Juju namespaces the profiles it manages to help make sure we don't do things like this. If you can pin down any more details on the "sometimes", we'd be interested in helping to reproduce. It would be interesting to see, on these occasions, whether you can tell from the logs which hooks ran immediately after the reboot, and whether there were any strange entries from around reboot time in the unit or controller logs.

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.7-beta1
Revision history for this message
Chris Sanders (chris.sanders) wrote :

Unfortunately, while triggering this isn't hard, it is very disruptive to my cluster. Next time I lose power, I'll grab whatever logging you think will be helpful. Are you looking for a debug log specifically including the affected unit(s)? I had one fail this past weekend and might still have logs available for it. I'll grab those if they're helpful, but knowing what logging you want will help me get the appropriate logs next time it happens.

I don't currently have access to a free-web account, but it's super easy to reproduce on a public cloud: deploy LXD containers to a machine, create a profile directly on the host to make the containers privileged, and then reboot. You'll trigger it at least 50% of the time.
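
Roughly, the repro looks like this (the profile and container names here are illustrative, not taken from a real environment):

  # deploy anything into an LXD container on an existing machine
  juju deploy ubuntu --to lxd:0
  # on the host, create a custom profile that makes the container privileged
  lxc profile create my-privileged
  lxc profile set my-privileged security.privileged true
  lxc profile add juju-89bcd1-0-lxd-0 my-privileged
  # reboot the host, then check whether the custom profile is still attached
  sudo reboot
  lxc config show juju-89bcd1-0-lxd-0 | grep -A 5 profiles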

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

@chris.sanders, this sounds like an unintended consequence of the "lxd profiles provided by charms" work. See if you can reproduce this by restarting the juju machine agent on the machine one of these lxd containers lives on: machine 3 if the container is 3/lxd/0. Set the logging level first:
  juju model-config logging-config='<root>=DEBUG;unit=DEBUG;juju.worker.instancemutater=TRACE'
Please provide the /var/log/juju/machine-#.log from the machine whose agent was restarted.
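
(On a systemd-based machine, restarting the agent should look roughly like the following; the unit name follows the jujud-machine-<id> pattern, using machine 3 from the example above:)

  sudo systemctl restart jujud-machine-3.service
  # then watch the machine log while reproducing
  tail -f /var/log/juju/machine-3.log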

When a machine agent starts, it checks that the juju-created containers are using the expected lxd profiles, and that's it. I'm surprised that the default lxd profile isn't being added back. Though that could be a race between what the machine agent is trying to do and the delay, on autostart, in changes to the actual profile you've applied.

What scenario are you working towards with these lxd profile changes?

Changed in juju:
milestone: 2.7-beta1 → 2.7-rc1
Revision history for this message
Chris Sanders (chris.sanders) wrote :

@Heather, I don't actually think the machine agent is doing this (unless I'm misunderstanding which agent you are referring to). To help mitigate this, I've set all of the containers with profiles to not autostart. A power outage this past weekend caused one machine to reboot; the container in question (and hence its agent) was not started, and the LXD container (still stopped) had its profile removed.
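
(For reference, I disabled autostart per container with something like this; the container name is illustrative:)

  lxc config set juju-89bcd1-7-lxd-2 boot.autostart false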

I've captured a debug log from the host after the reboot and before I started the LXD container, and attached it. I'm not sure exactly what to look for, but I didn't see any mention of profiles in it.

While I can easily reproduce this (rebooting a machine tends to cause it), it's very disruptive to my environment, so I'm not able to debug it regularly. However, the bundle listed in the LXD bug above is a guaranteed reproducer; it's very easy to trigger if you have hardware or an environment to test in. I was doing this in a public cloud, but the test cycles aren't really free, and I don't currently have an environment where I can stand up 4 machines with sufficient resources for the bundle.

In the case of the above bug, I installed a 4-node ceph cluster and installed K8s on top of the nodes for a converged K8s/Ceph setup. The K8s charms do not set the profiles necessary for installing in LXD containers, but the team provides profiles for test environments that add the extra privileges.

There are other use cases where I've done this as well. For example, I like to mount ceph-backed filesystems into my applications; the containers require elevated permissions to do this, and most charms do not include a 'mount ceph based filesystem' profile, so I have to add it myself externally. If you're interested in the overall use case of user-defined profiles for LXD, there is a great bug detailing several use cases for profile support in bundles that aren't covered by charm profiles: https://bugs.launchpad.net/juju/+bug/1822016
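
(As a sketch, the kind of profile I mean looks roughly like this; the profile name and exact keys are illustrative of the extra privileges involved:)

  lxc profile create ceph-mount
  lxc profile set ceph-mount security.privileged true
  # preload the kernel module the container will need for the mount
  lxc profile set ceph-mount linux.kernel_modules ceph
  lxc profile add juju-89bcd1-10-lxd-1 ceph-mount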

Revision history for this message
Chris Sanders (chris.sanders) wrote :

I've caught another occurrence of the issue on two different machines and grabbed the machine-X.log from them. I corrected the profiles (removed default, added the custom profile; sketched below) and started the containers before copying the logs, so this time they include the containers starting up.

Machine 7 has 3 containers, it's only container 2 that has a custom profile.
Machine 10 has 2 containers, and container 1 has a custom profile.
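
(The correction on each affected container was along these lines; the container and profile names are illustrative:)

  lxc profile remove juju-89bcd1-7-lxd-2 default
  lxc profile add juju-89bcd1-7-lxd-2 my-custom-profile
  lxc start juju-89bcd1-7-lxd-2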

Revision history for this message
Chris Sanders (chris.sanders) wrote :

Posting machine-10.log

Changed in juju:
milestone: 2.7-rc1 → 2.7.1
Changed in juju:
milestone: 2.7.1 → 2.7.2
Changed in juju:
milestone: 2.7.2 → 2.7.3
Changed in juju:
milestone: 2.7.3 → 2.7.4
Changed in juju:
milestone: 2.7.4 → 2.7.5
Changed in juju:
milestone: 2.7.5 → 2.7.6
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.7.6 → 2.8-rc1
Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

The debug.log from #5 did not capture any messages from juju.worker.instancemutater as hoped.

Unfortunately, neither machine-7.log nor machine-10.log had juju.worker.instancemutater=DEBUG.

From the two machine logs, lines like:
  DEBUG juju.container.lxd server.go:247 updated "juju-89bcd1-7-lxd-2" profiles, waiting on Updating container
indicate we were updating the profiles on a container, though I'm not sure whether that was at unit add or at reboot.

@chris.sanders, I was referring to the juju machine agent in #4.

I can't find the bundle mentioned, but I'll try to reproduce this myself.

Revision history for this message
Pen Gale (pengale) wrote :

I think that this is a feature request, as the behavior here is unsupported.

Currently, you can edit your default profile, which will cause Juju to apply custom settings to everything.
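
(That is, something like the following, which every container that includes the default profile will pick up:)

  lxc profile edit default                           # opens the profile in $EDITOR
  lxc profile set default security.privileged true   # or set a single key directly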

There is no current support in Juju for replacing the default profile on just specific machines.

I think the feature request here is: "Juju should have a mechanism for applying a different default profile to specific machines in my model."

(There might also be a race that is causing Juju to fail to replace the profile as expected, but since the behavior is unsupported, I don't think this needs to be fixed before 2.8-rc1 as currently tagged. Going to move to wishlist and remove the milestone.)

Changed in juju:
assignee: Heather Lanigan (hmlanigan) → Pete Vander Giessen (petevg)
milestone: 2.8-rc1 → none
importance: High → Wishlist
assignee: Pete Vander Giessen (petevg) → nobody
Revision history for this message
Chris Sanders (chris.sanders) wrote :

Pete, while I agree that allowing other profiles is a good feature request, that's already covered by the bug I linked earlier: https://bugs.launchpad.net/juju/+bug/1822016

The behavior here is that juju is removing, from an LXD container, a non-default profile that it did not add. As stated above, "Juju namespaces the profiles it's managing to help make sure we don't do things like this." If that's the design and it's not working as intended, I'm not sure how this could be anything but a bug.

Just for clarity, this still happens today on the 2.7 series of Juju. It's very easy to reproduce; I'm actually tempted to say it happens on every reboot at this point. My environment, however, isn't a QA environment where I'm willing to run through reboots to validate that.

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Wishlist → Low
tags: added: expirebugs-bot