Bug #1844937 “lxd profile removed on reboot” : Bugs : Canonical Juju

Revision history for this message

Chris Sanders (chris.sanders) wrote on 2019-09-22:

#1

Of course, I should have provided some version numbers.

While some versions are explicitly listed in the github bug, I'm reproducing this right now with juju 2.6.8 running on top of OpenStack.

Revision history for this message

Richard Harding (rharding) wrote on 2019-09-24:

#2

Hmm, this does sound unintentional. Juju namespaces the profiles it's managing to help make sure we don't do things like this. If you can get any more details on the "sometimes" we'd be interested in assisting in reproducing. It'd be interesting to see, on these occasions, if you can tell from the logs the hooks that ran immediately after the reboot and if there was any strange log entries from the reboot time coming up on the unit or controller logs.

Changed in juju:
status:	New → Triaged
importance:	Undecided → High
milestone:	none → 2.7-beta1

Revision history for this message

Chris Sanders (chris.sanders) wrote on 2019-10-07:

#3

Unfortunately triggering this isn't hard, but it is very disruptive to my cluster. Next time I loose power, I'll grab whatever logging you think will be helpful. Are you looking for a debug log specifically including the affected unit(s)? I had one fail this past weekend, and might still have logs available for it. I'll grab that if it's helpful, but knowing what logging you might want will help me next time it happens get appropriate logs.

I don't currently have access to a free-web account but it's super easy to reproduce on public cloud just by deploying LXD containers to a machine, creating a profiles directly on the host to add priviliged to the containers, and then reboot. You'll trigger it at least 50% of the time.

Revision history for this message

Heather Lanigan (hmlanigan) wrote on 2019-10-08:

#4

@chris.sanders, this sounds like an unintended consequence of the lxd profiles provided by charms work. See if you can reproduce by restarting the juju machine agent of the machine one of these lxd containers lives on: machine 3 if the container is 3/lxd/0. Set the logging level first:
juju model-config logging-config='<root>=DEBUG;unit=DEBUG;juju.worker.instancemutater=TRACE'
Please provide the /var/log/juju/machine-#.log from the machine whose agent was restarted.

When a machine agent starts, it checks that the juju created containers are using the expected lxd profiles and that's it. I'm surprised that the default lxd profile isn't being added back. Though that could be a race between what the machine agent is trying to do, and the delay in autostart changes in the actual profile you've applied.

What scenario are you working towards with these lxd profile changes?

Canonical Juju QA Bot (juju-qa-bot) on 2019-10-22

Changed in juju:
milestone:	2.7-beta1 → 2.7-rc1

Revision history for this message

Chris Sanders (chris.sanders) wrote on 2019-10-24:

#5

debug.log Edit (12.0 MiB, text/plain)

@Heather, I don't think the machine agent is doing this actually (unless I'm misunderstanding which agent you are referring to). To help mitigate this I've set all of the containers with profiles to not autostart. A power outage this past weekened caused one machine to reboot, the container in question (and hence it's agent) were not started and the lxd (still stopped) had it's profile removed.

I've captured a debug log and attached from the host after the reboot and before I started the LXD container. I'm not sure what I'm looking for but I didn't see any mention of profiles in the log but it's attached.

While I can easily reproduce this (rebooting a machine tends to cause it) it's very disruptive to my environment so I'm not able to regularly debug this. However, the above listed bundle from the LXD bug has a guaranteed reproducer if you have hardware or an environment to test in it's very easy to reproduce. I was doing this in public cloud, but the test cycles aren't really free and I don't currently have an environment where I can stand up 4 machines with sufficient resources for the bundle.

In the case of the above bug, I installed a 4 node ceph cluster and installed K8s on top of the nodes for a converged K8s/Ceph setup. K8s charms do not set profiles necessary for installing in LXD containers, but profiles are provided for test environments where the team adds extra privileges via a profile.

There are other use case where I've done this as well. For example I like to mount ceph based mounts on my applications and the containers require elevated permissions to do this and most charms do not include a 'mount ceph based filesystem' profile in the individual charm so I have to add it myself externally. There is a great bug detailing several use cases for profile support in bundles that aren't covered by charm profiles if you're interested in the overall use case of user defined profiles for LXD: https://bugs.launchpad.net/juju/+bug/1822016

@Heather, I don't think the machine agent is doing this actually (unless I'm misunderstanding which agent you are referring to). To help mitigate this I've set all of the containers with profiles to not autostart. A power outage this past weekened caused one machine to reboot, the container in question (and hence it's agent) were not started and the lxd (still stopped) had it's profile removed.

I've captured a debug log and attached from the host after the reboot and before I started the LXD container. I'm not sure what I'm looking for but I didn't see any mention of profiles in the log but it's attached.

While I can easily reproduce this (rebooting a machine tends to cause it) it's very disruptive to my environment so I'm not able to regularly debug this. However, the above listed bundle from the LXD bug has a guaranteed reproducer if you have hardware or an environment to test in it's very easy to reproduce. I was doing this in public cloud, but the test cycles aren't really free and I don't currently have an environment where I can stand up 4 machines with sufficient resources for the bundle.

In the case of the above bug, I installed a 4 node ceph cluster and installed K8s on top of the nodes for a converged K8s/Ceph setup. K8s charms do not set profiles necessary for installing in LXD containers, but profiles are provided for test environments where the team adds extra privileges via a profile.

There are other use case where I've done this as well. For example I like to mount ceph based mounts on my applications and the containers require elevated permissions to do this and most charms do not include a 'mount ceph based filesystem' profile in the individual charm so I have to add it myself externally. There is a great bug detailing several use cases for profile support in bundles that aren't covered by charm profiles if you're interested in the overall use case of user defined profiles for LXD: https://bugs.launchpad.net/juju/+bug/1822016

Revision history for this message

Chris Sanders (chris.sanders) wrote on 2019-10-27:

#6

machine-7.log Edit (1.3 MiB, text/plain)

I've caught another occurrence of the issue on two different machines and grabbed the machine-X.log form them. I corrected to profiles (remove default, add lxd) and started the containers before copying the logs so this time it includes the starting of the containers.

Machine 7 has 3 containers, it's only container 2 that has a custom profile.
Machine 10 has 2 containers, and container 1 has a custom profile.

Revision history for this message

Chris Sanders (chris.sanders) wrote on 2019-10-27:

#7

machine-10.log Edit (1.4 MiB, text/plain)

Posting machine-10.log

Richard Harding (rharding) on 2019-11-01

Changed in juju:
milestone:	2.7-rc1 → 2.7.1

Canonical Juju QA Bot (juju-qa-bot) on 2020-01-17

Changed in juju:
milestone:	2.7.1 → 2.7.2

Canonical Juju QA Bot (juju-qa-bot) on 2020-02-18

Changed in juju:
milestone:	2.7.2 → 2.7.3

Canonical Juju QA Bot (juju-qa-bot) on 2020-02-27

Changed in juju:
milestone:	2.7.3 → 2.7.4

Canonical Juju QA Bot (juju-qa-bot) on 2020-03-12

Changed in juju:
milestone:	2.7.4 → 2.7.5

Canonical Juju QA Bot (juju-qa-bot) on 2020-04-01

Changed in juju:
milestone:	2.7.5 → 2.7.6

Ian Booth (wallyworld) on 2020-04-07

Changed in juju:
milestone:	2.7.6 → 2.8-rc1

Heather Lanigan (hmlanigan) on 2020-05-05

Changed in juju:
assignee:	nobody → Heather Lanigan (hmlanigan)

Revision history for this message

Heather Lanigan (hmlanigan) wrote on 2020-05-05:

#8

The debug.log from #5 did not capture any messages from juju.worker.instancemutater as hoped.

Unfortunately neither machine-7.log nor machine-10.log did not have juju.worker.instancemutater=DEBUG.

From the two machine logs, lines like:
DEBUG juju.container.lxd server.go:247 updated "juju-89bcd1-7-lxd-2" profiles, waiting on Updating container
Indicate we were updating the profiles on a container, though I'm not sure if that was at unit add or reboot.

@chris.sanders, I was referring to the juju machine agent in #4.

I can't find the bundle mentioned, but I'll try to reproduce myself.

Revision history for this message

Pen Gale (pengale) wrote on 2020-05-05:

#9

I think that this is a feature request, as the behavior here is unsupported.

Currently, you can edit your default profile, which will cause Juju to apply custom settings to everything.

There is no current support in Juju for replacing the default profile on just specific machines.

I think the feature request here is: "Juju should have a mechanism for applying a different default profile to specific machines in my model."

(There might also be a race which is causing Juju to fail to replace the profile, as expected, but since the behavior is unsupported, I don't think that this needs to be fixed before 2.8-rc1, as currently tagged. Going to move to wishlist and remove the milestone.)

Changed in juju:
assignee:	Heather Lanigan (hmlanigan) → Pete Vander Giessen (petevg)
milestone:	2.8-rc1 → none
importance:	High → Wishlist
assignee:	Pete Vander Giessen (petevg) → nobody

Revision history for this message

Chris Sanders (chris.sanders) wrote on 2020-06-26:

#10

Pete, while I agree that allowing other profiles is a good feature request that's already covered as I linked before here: https://bugs.launchpad.net/juju/+bug/1822016

The behavior here is that juju is removing a non-default profile that it did not add to a LXD container. As stated above "Juju namespaces the profiles it's managing to help make sure we don't do things like this." If that's the design and it's not working as intended I'm not sure how that could be anything but a bug.

Just for clarity this still happens today on 2.7 series of Juju. It's very easy to reproduce, I'm actually tempted to say it happens on every reboot at this point. My environment isn't however a QA environment where I'm willing to run through reboots to validate that.

Revision history for this message

Canonical Juju QA Bot (juju-qa-bot) wrote on 2022-11-03:

#11

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance:	Wishlist → Low
tags:	added: expirebugs-bot

Canonical Juju

lxd profile removed on reboot

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches