Libvirt VM host provisioning never completes with image revision 20221018

Bug #1993836 reported by Nobuto Murata
This bug affects 3 people
Affects               Status         Importance   Assigned to
MAAS                  Invalid        Critical     Christian Grabowski
maas-images           Fix Released   Critical     Unassigned
cloud-init (Ubuntu)   Fix Released   Critical     James Falcon

Bug Description

I believe this is a regression in a MAAS image and/or package updates between image revisions 20221010 and 20221018. This is happening for both bionic and focal deployments.

The focal image has the following package updates between 20221010 and 20221018:

cloud-init 22.3.4-0ubuntu1~20.04.1
git 1:2.25.1-1ubuntu3.6
git-man 1:2.25.1-1ubuntu3.6
libasn1-8-heimdal:amd64 7.7.0
libgmp10:amd64 2:6.2.0
libgssapi3-heimdal:amd64 7.7.0
libhcrypto4-heimdal:amd64 7.7.0
libheimbase1-heimdal:amd64 7.7.0
libheimntlm0-heimdal:amd64 7.7.0
libhx509-5-heimdal:amd64 7.7.0
libkrb5-26-heimdal:amd64 7.7.0
libroken18-heimdal:amd64 7.7.0
libwind0-heimdal:amd64 7.7.0
zlib1g:amd64 1:1.2.11.dfsg-2ubuntu1.5

How to reproduce:
1. install maas: 1:3.2.6-12016-g.19812b4da-0ubuntu1~20.04.1
2. enlist one node
3. deploy the node with the `install_kvm` parameter or select "Libvirt" in the web UI

Expected:
the machine becomes "Deployed"

Actual:
the machine becomes "Failed deployment" due to timeout

There is no major error in the cloud-init log on either the target host or the MAAS side, but the image revision makes the difference.
---
ProblemType: Bug
ApportVersion: 2.20.11-0ubuntu27.24
Architecture: amd64
CasperMD5CheckResult: skip
CloudName: MAAS
DistroRelease: Ubuntu 20.04
Package: cloud-init 22.3.4-0ubuntu1~20.04.1
PackageArchitecture: all
ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 5.4.0-131.147-generic 5.4.210
Tags: focal uec-images
Uname: Linux 5.4.0-131-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
user_data.txt: Error: path contained symlinks.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Subscribing ~field-critical as this regression is blocking a project.

Revision history for this message
Nobuto Murata (nobuto) wrote :

I wonder how we can easily import an older revision of a MAAS image for testing.

The following steps are half-baked because MAAS tries to fetch /bionic/amd64/20221018/squashfs based on /streams/v1/com.ubuntu.maas:stable:1:bootloader-download.sjson, even though 20221018 was explicitly excluded when mirroring.

How to confirm the behavior with the previous image revision:

1. Mirror content with a specific date

$ sstream-mirror --progress --max=1 \
    --keyring=/usr/share/keyrings/ubuntu-cloudimage-keyring.gpg \
    http://images.maas.io/ephemeral-v3/stable/ \
    images_maas/ arch=amd64 \
    'release~(focal|bionic)' \
    version_name=20221010

$ python3 -m http.server --directory ./images_maas 8080

2. Set the image source as "http://<mirror_host>:8080/"

3. Import the bionic image and use bionic for a VM host. (It seems MAAS doesn't allow downgrading an image, and focal is already imported out of the box.)

4. Switch the image source back to the upstream one and update the bionic image to the latest.

5. Use bionic again for a VM host to see the difference.

description: updated
Revision history for this message
Adam Collard (adam-collard) wrote :

Please attach or share a sosreport from this issue, specifically logs from the machine that is failing to deploy.

Changed in maas:
status: New → Incomplete
Changed in maas:
assignee: nobody → Christian Grabowski (cgrabowski)
Revision history for this message
Nobuto Murata (nobuto) wrote :

on-spider.maas (system_id: qgcef8) is used as the Libvirt VM host with focal. Cloud-init completes, but the status in MAAS doesn't turn into Deployed.

Revision history for this message
Nobuto Murata (nobuto) wrote :
Changed in maas:
status: Incomplete → New
Revision history for this message
Nobuto Murata (nobuto) wrote :

sosreport from the MAAS side

Revision history for this message
Nobuto Murata (nobuto) wrote :

Okay, the following command works as a hack to test an older squashfs.

$ sudo wget -O /var/lib/maas/boot-resources/current/ubuntu/amd64/generic/focal/stable/squashfs \
    https://images.maas.io/ephemeral-v3/stable/focal/amd64/20221010/squashfs

and grown-swift.maas (system_id: aa3ktf) succeeded with the focal squashfs 20221010.

Revision history for this message
Nobuto Murata (nobuto) wrote :
Revision history for this message
Nobuto Murata (nobuto) wrote :

Refreshed MAAS sos report.

tags: added: apport-collected focal uec-images
description: updated
Revision history for this message
Nobuto Murata (nobuto) wrote : Dependencies.txt

apport information

Revision history for this message
Nobuto Murata (nobuto) wrote : ProcCpuinfoMinimal.txt

apport information

Revision history for this message
Nobuto Murata (nobuto) wrote : cloud-init-log-warnings.txt

apport information

Revision history for this message
Nobuto Murata (nobuto) wrote : cloud-init-output.log.txt.txt

apport information

Revision history for this message
Nobuto Murata (nobuto) wrote : logs.tgz.gz

apport information

Revision history for this message
Nobuto Murata (nobuto) wrote : lshw.txt.txt

apport information

Bill Wear (billwear)
Changed in maas:
status: New → Triaged
Changed in maas-images:
status: New → Triaged
Changed in maas:
importance: Undecided → Critical
Changed in maas-images:
importance: Undecided → Critical
Revision history for this message
Christian Grabowski (cgrabowski) wrote :

On a local reproduction of the issue, I am able to SSH into the host post-reboot; it appears the host is not sending a request to MAAS' metadata endpoints after rebooting.

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

A tcpdump of the deployment shows the machine successfully sends a status update from cloud-init saying the deployment has finished; however, MAAS continues to wait for the host to register as a VM host.

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

So it appears the status update that registers the node as a KVM host is happening earlier than intended, prior to the node requesting the user-data endpoint. This leaves agent_name unset at the time the node attempts to register as a KVM host, which puts the node in a broken state that goes unhandled, and MAAS will not mark the node as DEPLOYED. The cloud-init config MAAS is serving is identical between image versions; was there a change in cloud-init that could result in these requests being made out of order?

Changed in maas:
status: Triaged → Incomplete
Nobuto Murata (nobuto)
tags: added: regression-update
Revision history for this message
Nobuto Murata (nobuto) wrote :

"sru cloud-init (22.3 update) Bionic, Focal, Jammy"
https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1987318
> The notable ones are:
>
> Config Module Additions / Deletions:
> - Ansible config module
> - Wireguard config module
> - Drop debug module
>
> Behavior changes:
> - schema: Resolve user-data if --system given
> - mounts: fix suggested_swapsize for > 64GB hosts
>
> New Features:
> - clean: add param to remove /etc/machine-id for golden image creation
> - Return cc_set_hostname to PER_INSTANCE frequency
> - clean: allow third party cleanup scripts in /etc/cloud/clean.d
> - ssh_util: Handle sshd_config.d folder
>
> Optimizations:
> - meta-schema: add infra to skip inapplicable modules
> - main: avoid downloading full contents cmdline urls
> - Update WebHookHandler to run as background thread
> - net: Implement link-local ephemeral ipv6

And this one caught my eye since it has the MAAS use case in mind and talks about asynchronous behavior:
https://bugs.launchpad.net/cloud-init/+bug/1910552

Revision history for this message
Nobuto Murata (nobuto) wrote :

Until we track down the root cause, could we retract revision 20221018 from the stable channel of the MAAS image index?
https://images.maas.io/ephemeral-v3/stable/focal/amd64/

I think we can continue the investigation with the candidate channel instead.
https://images.maas.io/ephemeral-v3/candidate/focal/amd64/

Revision history for this message
James Falcon (falcojr) wrote (last edit ):

I have a hunch as to what's happening here. The issue fixed in https://bugs.launchpad.net/cloud-init/+bug/1910552 was that all reporting events were happening synchronously. If the event endpoint was down, cloud-init effectively stalled as it spent 20 seconds per event (or whatever the configured retry/timeout was) timing out on over 100 events. To fix that, all reporting events now get queued to a background thread to be sent independently of the main thread.

It appears that in this bug, events are getting generated before networking is configured. For the sake of example, let's just say there are 10 events happening before networking is up. Previously, this meant that those 10 events were never emitted to the reporting URL. Cloud-init would try to POST, fail, and then keep moving along. Once the network came up, events emitted after that time would POST successfully. However, since the reporting now happens asynchronously, we're likely not missing any messages at all. If cloud-init emits a message pre-network, it will retry until max attempts are reached or we hit a timeout. In the meantime, cloud-init local continues to run and sets up networking. Once networking is set up, the queued message can POST on retry. Messages that we weren't seeing before are now being POSTed successfully.
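To illustrate the pattern, here is a rough Python sketch of the queued-reporting idea; the class, endpoint handling, and retry values are made up for illustration and are not cloud-init's actual code:

import queue
import threading
import time
import urllib.request

class BackgroundReporter:
    """Queue reporting events and POST them from a background thread."""

    def __init__(self, endpoint, retries=5, delay=2):
        self.endpoint = endpoint
        self.retries = retries
        self.delay = delay
        self.events = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def report(self, event):
        # Called from the main thread; never blocks, even if the
        # network is not configured yet.
        self.events.put(event)

    def _drain(self):
        while True:
            event = self.events.get()
            for _ in range(self.retries):
                try:
                    req = urllib.request.Request(
                        self.endpoint, data=event.encode(), method="POST")
                    urllib.request.urlopen(req, timeout=10)
                    break  # delivered
                except OSError:
                    # Network may not be up yet; back off and retry.
                    time.sleep(self.delay)
            self.events.task_done()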

Christian, does this sound like a likely explanation based on what you're seeing?

Changed in cloud-init (Ubuntu):
status: New → Incomplete
Revision history for this message
Nobuto Murata (nobuto) wrote :

MAAS, as the metadata provider and consumer of cloud-init generated data, and cloud-init are both marked as Incomplete. Can you folks discuss and agree on a path forward, i.e. what kind of changes are necessary to overcome the current situation, please?

Revision history for this message
James Falcon (falcojr) wrote (last edit ):

Incomplete doesn't mean "I'm not doing anything". Incomplete means "I don't have enough information to properly triage this yet." Discussion is still happening here.

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

The explanation James gives does line up with what I am seeing, though it consistently seems to be the case that the request responsible for registering the KVM host comes first. I would believe the network may be down for the user-data endpoint request (where we set agent_name) while the request to register succeeds on the first attempt, though this is just an assumption based on which requests MAAS receives.

Changed in cloud-init (Ubuntu):
status: Incomplete → New
Revision history for this message
James Falcon (falcojr) wrote :

Christian, can I get a cloud-init.log from one of your affected systems along with a good cloud-init.log from a good system?

If what I said is true, I'm not exactly sure how to proceed. I see no evidence that there is a bug in how our fix was implemented; rather, MAAS came to rely on cloud-init's buggy behavior. If we were to roll back this change, we'd re-introduce the bug that we fixed, along with longer boot times stemming from waiting on sending events before networking is up.

Can MAAS hold back images for now and fix the issue on the MAAS side?

Revision history for this message
Chris Johnston (cjohnston) wrote :

Diego removed B/F/J 20221018 images. After he removed these, I was able to deploy 22.04 KVM.

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

Here is cloud-init.log from a failed deployment: https://pastebin.canonical.com/p/jR86cPYpF6/

And here is cloud-init.log from a successful deployment: https://pastebin.canonical.com/p/GmJdbMpccJ/

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

As for fixing the issue from MAAS: it seems a bit odd that we could receive a status for modules-final and yet other requests will follow. But if we were to fix it on the MAAS side, would returning a 409 for the modules-final request result in retries from cloud-init?

Revision history for this message
James Falcon (falcojr) wrote :

"As far as fixing the issue from MAAS, it seems a bit odd that we could receive a status for modules-final and other requests will follow, but if we were to fix it on the MAAS side, would returning a 409 for the modules-final request result in retries from cloud-init?"

To answer your specific question, yes: "raise_for_status()" from requests is used, which will raise on HTTP responses 400-599. We'll catch this and then retry until time/tries is up.
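For illustration, the retry behavior looks roughly like this (a sketch assuming the requests library; not the actual cloud-init code path):

import time
import requests

def post_with_retries(url, data, tries=5, delay=2, timeout=10):
    for _ in range(tries):
        try:
            resp = requests.post(url, data=data, timeout=timeout)
            resp.raise_for_status()  # raises on 4xx/5xx, e.g. a 409 from MAAS
            return resp
        except requests.RequestException:
            time.sleep(delay)  # retry until the tries are exhausted
    return None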

I'm not sure I follow though. How would this be used to help the situation? Do you know which message specifically is the problematic one? Why is it not expected to be received as soon as it is?

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

So the ideal flow is: cloud-init makes a request to the user-data endpoint, setting agent_name, and then makes the request to MAAS' metadata status endpoint for modules-final, where the node gets registered by MAAS as a KVM host, assuming agent_name has been set. This ordering assumes modules-final is normally the final request; however, that is no longer the case. By sending the 409 and having the modules-final request reattempted, the retried user-data request (delayed because the network was not yet available) can still set agent_name before MAAS considers cloud-init done. This does seem odd to me, however, as the modules-final request had been treated as the way MAAS decides cloud-init is done.
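To make the idea concrete, a purely hypothetical sketch of such MAAS-side handling (the names and structure are illustrative, not actual MAAS code):

from dataclasses import dataclass

@dataclass
class Node:
    agent_name: str = ""
    is_vm_host: bool = False

def handle_status_update(node: Node, event_name: str) -> int:
    """Return the HTTP status MAAS would send for a cloud-init status POST."""
    if event_name == "modules-final":
        if not node.agent_name:
            return 409  # not ready yet; cloud-init will retry this request
        node.is_vm_host = True  # stand-in for registering the KVM host
        return 200
    return 200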

Revision history for this message
James Falcon (falcojr) wrote :

"This ordering assumes modules-final is the final request normally, however that is no longer the case."
You're saying you're seeing this event earlier than a request for userdata? That shouldn't be happening. What event specifically? Are you able to see the event_type, name, or description? Could MAAS be misinterpreting a different event? I can't think of a possibility where final message would get sent before a userdata get.

I was able to dig deeper and find that with this change it is possible "modules-final: running modules for final" message doesn't get sent. It's possible cloud-init main thread can shut down before that message gets sent. If you're checking specifically for this message, then I think we found our problem.

But it sounds like your symptoms are different?

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

If modules-final is not being sent, that's definitely an issue for MAAS. I observed it being sent, and then additional requests were sent, but not receiving it would also cause this issue.

James Falcon (falcojr)
Changed in cloud-init (Ubuntu):
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Chad Smith (chad.smith) wrote :

Upstream PR up that should resolve this problem:
https://github.com/canonical/cloud-init/pull/1796

PPA available for testing the updated cloud-init in provisioning/commissioning images in MAAS:

sudo add-apt-repository ppa:chad.smith/maas-test

What cloud-init needs help with is a mechanism to deliver a custom image stream to MAAS provisioning that includes this cloud-init in the base image.
We'll reach out to (cgrabowski) for assistance with this.

Changed in cloud-init (Ubuntu):
status: Triaged → In Progress
assignee: nobody → James Falcon (falcojr)
Revision history for this message
Christian Grabowski (cgrabowski) wrote :

I tested using the candidate simplestreams: it fails without the fix (expected), and when installing the cloud-init fix early in the deployment process, the deployment succeeds and the host is registered as a VM host. So I'd say this is fixed from my perspective.

Changed in maas:
status: Incomplete → Triaged
status: Triaged → Invalid
Chad Smith (chad.smith)
Changed in cloud-init (Ubuntu):
status: In Progress → Fix Committed
Revision history for this message
Chad Smith (chad.smith) wrote :

This fix is queued and undergoing SRU validation in cloud-init 22.4.0, which is currently present in the -proposed pocket for Bionic, Focal, Jammy and Kinetic. The fix has already been released to Ubuntu Lunar (23.04).

An unrelated regression [1] was found during SRU verification, forcing cloud-init to upload 22.4.2 to -proposed for verification on Bionic, Focal, Jammy and Kinetic. Once 22.4.2 completes validation, this bug will be marked Fix Released.

[1] cloud-init status --wait regression in early boot https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1997559

James Falcon (falcojr)
Changed in cloud-init (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Nobuto Murata (nobuto) wrote :
Changed in maas-images:
status: Triaged → Fix Released