Several containers stuck in Pending with cloud-init failing to start

Bug #1911067 reported by Joshua Genet
This bug affects 1 person
Affects      Status    Importance   Assigned to   Milestone
cloud-init   Expired   Undecided    Unassigned

Bug Description

Run here:
https://solutions.qa.canonical.com/testruns/testRun/4a343cb3-2b6e-44b7-8aa0-a7bf569514f0

Logs/artifacts here:
https://oil-jenkins.canonical.com/artifacts/4a343cb3-2b6e-44b7-8aa0-a7bf569514f0/index.html

OpenStack model crashdump here:
https://oil-jenkins.canonical.com/artifacts/4a343cb3-2b6e-44b7-8aa0-a7bf569514f0/generated/generated/openstack/juju-crashdump-openstack-2021-01-09-06.49.25.tar.gz
---

- We (solutions-qa) hit this in a handful of runs over the weekend. A few of the containers get stuck in "Pending".

- It doesn't appear to be a Juju application issue, as no single application is consistently being deployed to the containers that end up in the "Pending" state.

- In the crashdump, I'm seeing this message in /var/log/lxd/$JUJU_INSTANCE_NAME/console.log, under the baremetal logs of the machine that has the "Pending" containers:

[FAILED] Failed to start Initial cloud-init job (metadata service crawler)

- We're using a Level 2 CIS Hardened image. It could make sense that something is cutting off its ability to make a network call near the beginning of its run. But if that were the case, it seems like all of the containers would fail to come up.

---

I'm going to work on reproducing this manually and will update this bug with any new info I find.

Richard Harding (rharding) wrote :

I took a peek at the artifacts and I am not able to see how to get at the cloud-init logs for the failed containers. Can you give me a hint or include those separately, please?

Changed in cloud-init:
status: New → Incomplete
Joshua Genet (genet022) wrote :

Download and extract the crashdump:
generated/generated/openstack/juju-crashdump-openstack-2021-01-09-06.49.25.tar.gz

Then you can find the console logs here:
3/baremetal/var/log/lxd/juju-84a282-3-lxd-0/console.log

Where 3 is the Juju machine number, juju-84a282-3-lxd-0 is the Juju instance name.
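For anyone triaging several machines at once, the lookup described above can be sketched in Python. This is only an illustration: it assumes the crashdump has already been extracted and that every machine follows the same `<machine>/baremetal/var/log/lxd/<instance>/console.log` layout as the path above. The function name `find_failed_consoles` and the `FAILED_MARKER` constant are hypothetical helpers, not part of juju-crashdump.

```python
from pathlib import Path

# Marker text taken from the console.log line quoted in the bug description.
FAILED_MARKER = "Failed to start Initial cloud-init job"


def find_failed_consoles(crashdump_root):
    """Map each LXD instance name to the console.log lines containing the marker.

    Assumes (hypothetically) that all machines use the layout
    <machine>/baremetal/var/log/lxd/<instance>/console.log.
    """
    results = {}
    root = Path(crashdump_root)
    for log in sorted(root.glob("*/baremetal/var/log/lxd/*/console.log")):
        # Console logs can contain stray bytes, so decode leniently.
        text = log.read_text(errors="replace")
        hits = [line.strip() for line in text.splitlines() if FAILED_MARKER in line]
        if hits:
            # The parent directory name is the Juju instance name.
            results[log.parent.name] = hits
    return results
```

Calling `find_failed_consoles("path/to/extracted/crashdump")` would return a dictionary keyed by instance name, which makes it easier to see whether the failing containers share a host.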

Let me know if you have any other questions!

Changed in cloud-init:
status: Incomplete → New
Michael Skalka (mskalka)
description: updated
Dan Watkins (oddbloke) wrote :

Thanks for the info, Joshua and Michael! Looking at those log files I see these lines also:

[ 566.418804] cloud-init[157]: No /sbin/ifup, applying netplan configuration.
[ OK ] Stopped Wait for Network to be Configured.
         Stopping Network Service...
[ OK ] Stopped Network Service.
         Starting Network Service...
[ 566.856577] cloud-init[157]: error: cannot communicate with server: Put http://localhost/v2/snaps/system/conf: dial unix /run/snapd.socket: connect: no such file or directory
[ 566.874458] cloud-init[157]: error: cannot communicate with server: Put http://localhost/v2/snaps/system/conf: dial unix /run/snapd.socket: connect: no such file or directory
[ 566.877350] cloud-init[157]: 2021-01-09 03:04:18,904 - util.py[WARNING]: Failed to run bootcmd module bootcmd
[ 566.879592] cloud-init[157]: 2021-01-09 03:04:18,907 - util.py[WARNING]: Running module bootcmd (<module 'cloudinit.config.cc_bootcmd' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_bootcmd.py'>) failed

This indicates that something in your bootcmd is failing, rather than cloud-init itself. Looking at the bootcmd section in 3/lxd/0/var/lib/cloud/seed/nocloud-net/user-data, you can see that the "No /sbin/ifup, applying netplan configuration." message originates from line 144, which confirms it.

(Given the snapd.socket error messages, I suspect that the failures are in the `snap set` calls: `bootcmd`s are run in the init stage of cloud-init (`cloud-init-local.service`) and systemd dependencies mean that snapd will not come up until well after cloud-init-local has completed.)
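The reading of the log excerpt above can be sketched as a small Python scan, which may help when comparing many console logs. It is a rough illustration only: the regular expressions are inferred from the few lines quoted in this comment and may not match other cloud-init versions or console formats, and `summarize_console` is a hypothetical name.

```python
import re

# Matches console lines such as:
# "[ 566.856577] cloud-init[157]: <message>"
CONSOLE_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+cloud-init\[\d+\]:\s+(?P<msg>.*)$"
)
# Matches the warning "Failed to run bootcmd module bootcmd".
FAILED_MODULE = re.compile(r"Failed to run (\S+) module")


def summarize_console(lines):
    """Count snapd socket errors and collect cloud-init modules reported as failed."""
    summary = {"snapd_socket_errors": 0, "failed_modules": []}
    for line in lines:
        match = CONSOLE_LINE.match(line.strip())
        if not match:
            continue  # systemd status lines like "[ OK ] ..." are skipped
        msg = match.group("msg")
        if "cannot communicate with server" in msg and "/run/snapd.socket" in msg:
            summary["snapd_socket_errors"] += 1
        failed = FAILED_MODULE.search(msg)
        if failed:
            summary["failed_modules"].append(failed.group(1))
    return summary
```

A log where `snapd_socket_errors` is non-zero and `failed_modules` contains `bootcmd` would be consistent with the explanation above, though it does not by itself prove that the `snap set` calls are the cause.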

Whatever the cause, I don't believe this is a cloud-init bug; if you find a more specific reason why you think this _is_ a cloud-init bug, please do reply and set this back to New.

Thanks!

Changed in cloud-init:
status: New → Incomplete
Joshua Genet (genet022) wrote :

Just to close the loop here, I think you're correct that this isn't an issue with cloud-init. The error messages appear in all of our runs (even ones that succeed), and they seem to occur when we make the `snap set` call, as you mentioned.

I've filed a new bug against Juju. Thanks for the great help!

Dan Watkins (oddbloke) wrote :

Thanks for the update!

Launchpad Janitor (janitor) wrote :

[Expired for cloud-init because there has been no activity for 60 days.]

Changed in cloud-init:
status: Incomplete → Expired
James Falcon (falcojr) wrote :