Libvirt VM host provisioning never completes with image revision 20221018

Bug #1993836 reported by Nobuto Murata
This bug affects 3 people
Affects               Status         Importance   Assigned to
MAAS                  Invalid        Critical     Christian Grabowski
maas-images           Fix Released   Critical     Unassigned
cloud-init (Ubuntu)   Fix Released   Critical     James Falcon

Bug Description

I believe this is a regression in a MAAS image and/or package updates between image revisions 20221010 and 20221018. This is happening for both bionic and focal deployments.

The focal image has the following package updates between 20221010 and 20221018:

cloud-init 22.3.4-0ubuntu1~20.04.1
git 1:2.25.1-1ubuntu3.6
git-man 1:2.25.1-1ubuntu3.6
libasn1-8-heimdal:amd64 7.7.0
libgmp10:amd64 2:6.2.0
libgssapi3-heimdal:amd64 7.7.0
libhcrypto4-heimdal:amd64 7.7.0
libheimbase1-heimdal:amd64 7.7.0
libheimntlm0-heimdal:amd64 7.7.0
libhx509-5-heimdal:amd64 7.7.0
libkrb5-26-heimdal:amd64 7.7.0
libroken18-heimdal:amd64 7.7.0
libwind0-heimdal:amd64 7.7.0
zlib1g:amd64 1:1.2.11.dfsg-2ubuntu1.5

How to reproduce:
1. install maas: 1:3.2.6-12016-g.19812b4da-0ubuntu1~20.04.1
2. enlist one node
3. deploy the node with the `install_kvm` parameter or select "Libvirt" in the web UI

Expected:
the machine becomes "Deployed"

Actual:
the machine becomes "Failed deployment" due to timeout

There is no major error in the cloud-init log on either the target host or the MAAS side, but the image revision makes the difference.
---
ProblemType: Bug
ApportVersion: 2.20.11-0ubuntu27.24
Architecture: amd64
CasperMD5CheckResult: skip
CloudName: MAAS
DistroRelease: Ubuntu 20.04
Package: cloud-init 22.3.4-0ubuntu1~20.04.1
PackageArchitecture: all
ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 5.4.0-131.147-generic 5.4.210
Tags: focal uec-images
Uname: Linux 5.4.0-131-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
user_data.txt: Error: path contained symlinks.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Subscribing ~field-critical as this regression is blocking a project.

Revision history for this message
Nobuto Murata (nobuto) wrote :

I wonder how we can easily import an older revision of a MAAS image for testing.

The following steps are half-baked because MAAS tries to fetch /bionic/amd64/20221018/squashfs based on /streams/v1/com.ubuntu.maas:stable:1:bootloader-download.sjson, even though 20221018 was explicitly excluded when mirroring.

How to confirm the behavior with the previous image revision:

1. Mirror content with a specific date

$ sstream-mirror --progress --max=1 \
    --keyring=/usr/share/keyrings/ubuntu-cloudimage-keyring.gpg \
    http://images.maas.io/ephemeral-v3/stable/ \
    images_maas/ arch=amd64 \
    'release~(focal|bionic)' \
    version_name=20221010

$ python3 -m http.server --directory ./images_maas 8080

2. Set the image source as "http://<mirror_host>:8080/"

3. Import the bionic image and use bionic for a VM host. (It seems MAAS doesn't allow downgrading an image, and focal is already imported out of the box.)

4. Switch the image source back to the upstream one and update the bionic image to the latest.

5. Use bionic again for a VM host to see the difference.

description: updated
Revision history for this message
Adam Collard (adam-collard) wrote :

Please attach or share a sosreport from this issue, specifically logs from the machine that is failing to deploy.

Changed in maas:
status: New → Incomplete
Changed in maas:
assignee: nobody → Christian Grabowski (cgrabowski)
Revision history for this message
Nobuto Murata (nobuto) wrote :

on-spider.maas (system_id: qgcef8) is used as the Libvirt VM host with focal. Cloud-init completes, but the status in MAAS doesn't turn into Deployed.

Revision history for this message
Nobuto Murata (nobuto) wrote :
Changed in maas:
status: Incomplete → New
Revision history for this message
Nobuto Murata (nobuto) wrote :

sosreport from the MAAS side

Revision history for this message
Nobuto Murata (nobuto) wrote :

Okay, the following command works as a hack to test an older squashfs.

$ sudo wget -O /var/lib/maas/boot-resources/current/ubuntu/amd64/generic/focal/stable/squashfs \
    https://images.maas.io/ephemeral-v3/stable/focal/amd64/20221010/squashfs

and grown-swift.maas (system_id: aa3ktf) succeeded with the focal squashfs 20221010.

Revision history for this message
Nobuto Murata (nobuto) wrote :
Revision history for this message
Nobuto Murata (nobuto) wrote :

Refreshed MAAS sos report.

tags: added: apport-collected focal uec-images
description: updated
Revision history for this message
Nobuto Murata (nobuto) wrote : Dependencies.txt

apport information

Revision history for this message
Nobuto Murata (nobuto) wrote : ProcCpuinfoMinimal.txt

apport information

Revision history for this message
Nobuto Murata (nobuto) wrote : cloud-init-log-warnings.txt

apport information

Revision history for this message
Nobuto Murata (nobuto) wrote : cloud-init-output.log.txt.txt

apport information

Revision history for this message
Nobuto Murata (nobuto) wrote : logs.tgz.gz

apport information

Revision history for this message
Nobuto Murata (nobuto) wrote : lshw.txt.txt

apport information

Bill Wear (billwear)
Changed in maas:
status: New → Triaged
Changed in maas-images:
status: New → Triaged
Changed in maas:
importance: Undecided → Critical
Changed in maas-images:
importance: Undecided → Critical
Revision history for this message
Christian Grabowski (cgrabowski) wrote :

On a local reproduction of the issue, I am able to SSH into the host post-reboot; it appears the host is not sending a request to MAAS' metadata endpoints after rebooting.

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

A tcpdump of the deployment shows the machine successfully sends a status update from cloud-init saying the deployment has finished; however, MAAS continues to wait for the host to register as a VM host.

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

So it appears the status update that registers the node as a KVM host is happening earlier than intended, prior to the node requesting the user-data endpoint. This leaves agent_name unset at the time the node attempts to register as a KVM host, which puts the node in a broken state that goes unhandled, and MAAS will not mark the node as DEPLOYED. The cloud-init config MAAS is serving is identical between image versions; was there a change in cloud-init that could result in these requests being made out of order?

Changed in maas:
status: Triaged → Incomplete
Nobuto Murata (nobuto)
tags: added: regression-update
Revision history for this message
Nobuto Murata (nobuto) wrote :

"sru cloud-init (22.3 update) Bionic, Focal, Jammy"
https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1987318
> The notable ones are:
>
> Config Module Additions / Deletions:
> - Ansible config module
> - Wireguard config module
> - Drop debug module
>
> Behavior changes:
> - schema: Resolve user-data if --system given
> - mounts: fix suggested_swapsize for > 64GB hosts
>
> New Features:
> - clean: add param to remove /etc/machine-id for golden image creation
> - Return cc_set_hostname to PER_INSTANCE frequency
> - clean: allow third party cleanup scripts in /etc/cloud/clean.d
> - ssh_util: Handle sshd_config.d folder
>
> Optimizations:
> - meta-schema: add infra to skip inapplicable modules
> - main: avoid downloading full contents cmdline urls
> - Update WebHookHandler to run as background thread
> - net: Implement link-local ephemeral ipv6

And this one caught my eye since it has the MAAS use case in mind and talks about asynchronous behavior:
https://bugs.launchpad.net/cloud-init/+bug/1910552

Revision history for this message
Nobuto Murata (nobuto) wrote :

Until we track down the root cause, could we retract revision 20221018 from the stable channel of the MAAS image index?
https://images.maas.io/ephemeral-v3/stable/focal/amd64/

I think we can continue the investigation with the candidate channel instead.
https://images.maas.io/ephemeral-v3/candidate/focal/amd64/

Revision history for this message
James Falcon (falcojr) wrote (last edit ):

I have a hunch as to what's happening here. The issue fixed in https://bugs.launchpad.net/cloud-init/+bug/1910552 was that all reporting events were happening synchronously. If the event endpoint was down, cloud-init effectively stalled as it spent 20 seconds per event (or whatever the configured retry/timeout was) timing out on over 100 events. To fix that, all reporting events now get queued to a background thread to be sent independently of the main thread.

It appears that in this bug, events are getting generated before networking is configured. For the sake of example, let's just say there are 10 events happening before networking is up. Previously, this meant that those 10 events were never emitted to the reporting URL. Cloud-init would try to POST, fail, and then keep moving along. Once the network came up, events emitted after that time would POST successfully. However, since the reporting now happens asynchronously, we're likely not missing any messages at all. If cloud-init emits a message pre-network, it will retry until max attempts are reached or we hit a timeout. In the meantime, cloud-init local continues to run and sets up networking. Once networking is set up, the queued message can POST on retry. Messages that we weren't seeing before are now being POSTed successfully.
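To illustrate the pattern, here is a rough Python sketch of the queued-reporting idea; the class, endpoint handling, and retry values are made up for illustration and are not cloud-init's actual code:

import queue
import threading
import time
import urllib.request

class BackgroundReporter:
    """Queue reporting events and POST them from a background thread."""

    def __init__(self, endpoint, retries=5, delay=2):
        self.endpoint = endpoint
        self.retries = retries
        self.delay = delay
        self.events = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def report(self, event):
        # Called from the main thread; never blocks, even if the
        # network is not configured yet.
        self.events.put(event)

    def _drain(self):
        while True:
            event = self.events.get()
            for _ in range(self.retries):
                try:
                    req = urllib.request.Request(
                        self.endpoint, data=event.encode(), method="POST")
                    urllib.request.urlopen(req, timeout=10)
                    break  # delivered
                except OSError:
                    # Network may not be up yet; back off and retry.
                    time.sleep(self.delay)
            self.events.task_done()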

Christian, does this sound like a likely explanation based on what you're seeing?

Changed in cloud-init (Ubuntu):
status: New → Incomplete
Revision history for this message
Nobuto Murata (nobuto) wrote :

MAAS, as the metadata provider and consumer of cloud-init generated data, and cloud-init are both marked as Incomplete. Can you folks discuss and agree on a path forward, i.e. what kind of changes are necessary to overcome the current situation, please?

Revision history for this message
James Falcon (falcojr) wrote (last edit ):

Incomplete doesn't mean "I'm not doing anything". Incomplete means "I don't have enough information to properly triage this yet." Discussion is still happening here.

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

The explanation James gives does line up with what I am seeing, though it consistently seems to be the case that the request responsible for registering the KVM host comes first. I would believe the network may be down for the user-data endpoint request (where we set agent_name) while the request to register succeeds on the first attempt, though this is just an assumption based on which requests MAAS receives.

Changed in cloud-init (Ubuntu):
status: Incomplete → New
Revision history for this message
James Falcon (falcojr) wrote :

Christian, can I get a cloud-init.log from one of your affected systems along with a good cloud-init.log from a good system?

If what I said is true, I'm not exactly sure how to proceed. I see no evidence that there is a bug in how our fix was implemented; rather, MAAS came to rely on cloud-init's buggy behavior. If we were to roll back this change, we'd re-introduce the bug that we fixed, along with longer boot times stemming from waiting on sending events before networking is up.

Can MAAS hold back images for now and fix the issue on the MAAS side?

Revision history for this message
Chris Johnston (cjohnston) wrote :

Diego removed B/F/J 20221018 images. After he removed these, I was able to deploy 22.04 KVM.

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

Here is cloud-init.log from a failed deployment: https://pastebin.canonical.com/p/jR86cPYpF6/

And here is cloud-init.log from a successful deployment: https://pastebin.canonical.com/p/GmJdbMpccJ/

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

As for fixing the issue from MAAS: it seems a bit odd that we could receive a status for modules-final and yet other requests will follow. But if we were to fix it on the MAAS side, would returning a 409 for the modules-final request result in retries from cloud-init?

Revision history for this message
James Falcon (falcojr) wrote :

"As far as fixing the issue from MAAS, it seems a bit odd that we could receive a status for modules-final and other requests will follow, but if we were to fix it on the MAAS side, would returning a 409 for the modules-final request result in retries from cloud-init?"

To answer your specific question, yes: "raise_for_status()" from requests is used, which will raise on HTTP responses 400-599. We'll catch this and then retry until time/tries is up.
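For illustration, the retry behavior looks roughly like this (a sketch assuming the requests library; not the actual cloud-init code path):

import time
import requests

def post_with_retries(url, data, tries=5, delay=2, timeout=10):
    for _ in range(tries):
        try:
            resp = requests.post(url, data=data, timeout=timeout)
            resp.raise_for_status()  # raises on 4xx/5xx, e.g. a 409 from MAAS
            return resp
        except requests.RequestException:
            time.sleep(delay)  # retry until the tries are exhausted
    return None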

I'm not sure I follow though. How would this be used to help the situation? Do you know which message specifically is the problematic one? Why is it not expected to be received as soon as it is?

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

So the ideal flow is: cloud-init makes a request to the user-data endpoint, setting agent_name, and then makes the request to MAAS' metadata status endpoint for modules-final, where the node gets registered by MAAS as a KVM host, assuming agent_name has been set. This ordering assumes modules-final is normally the final request; however, that is no longer the case. By sending the 409 and having the modules-final request reattempted, the retried user-data request (delayed because the network was not yet available) can still set agent_name before MAAS considers cloud-init done. This does seem odd to me, however, as the modules-final request had been treated as the way MAAS decides cloud-init is done.
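To make the idea concrete, a purely hypothetical sketch of such MAAS-side handling (the names and structure are illustrative, not actual MAAS code):

from dataclasses import dataclass

@dataclass
class Node:
    agent_name: str = ""
    is_vm_host: bool = False

def handle_status_update(node: Node, event_name: str) -> int:
    """Return the HTTP status MAAS would send for a cloud-init status POST."""
    if event_name == "modules-final":
        if not node.agent_name:
            return 409  # not ready yet; cloud-init will retry this request
        node.is_vm_host = True  # stand-in for registering the KVM host
        return 200
    return 200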

Revision history for this message
James Falcon (falcojr) wrote :

"This ordering assumes modules-final is the final request normally, however that is no longer the case."
You're saying you're seeing this event earlier than a request for userdata? That shouldn't be happening. What event specifically? Are you able to see the event_type, name, or description? Could MAAS be misinterpreting a different event? I can't think of a possibility where final message would get sent before a userdata get.

I was able to dig deeper and find that with this change it is possible "modules-final: running modules for final" message doesn't get sent. It's possible cloud-init main thread can shut down before that message gets sent. If you're checking specifically for this message, then I think we found our problem.

But it sounds like your symptoms are different?

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

If modules-final is not being sent, that's definitely an issue for MAAS. I observed it being sent, and then additional requests were sent, but not receiving it would also cause this issue.

James Falcon (falcojr)
Changed in cloud-init (Ubuntu):
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Chad Smith (chad.smith) wrote :

Upstream PR up that should resolve this problem:
https://github.com/canonical/cloud-init/pull/1796

PPA available for testing the updated cloud-init in provisioning/commissioning images in MAAS:

sudo add-apt-repository ppa:chad.smith/maas-test

What cloud-init needs help with is a mechanism to deliver a custom image stream to MAAS provisioning that includes this cloud-init in the base image.
We'll reach out to (cgrabowski) for assistance with this.

Changed in cloud-init (Ubuntu):
status: Triaged → In Progress
assignee: nobody → James Falcon (falcojr)
Revision history for this message
Christian Grabowski (cgrabowski) wrote :

I tested using the candidate simplestreams: it fails without the fix (expected), and when installing the cloud-init fix early in the deployment process, the deployment succeeds and the host is registered as a VM host. So I'd say this is fixed from my perspective.

Changed in maas:
status: Incomplete → Triaged
status: Triaged → Invalid
Chad Smith (chad.smith)
Changed in cloud-init (Ubuntu):
status: In Progress → Fix Committed
Revision history for this message
Chad Smith (chad.smith) wrote :

This fix is queued and undergoing SRU validation in cloud-init 22.4.0, which is currently present in the -proposed pocket for Bionic, Focal, Jammy and Kinetic. The fix has already been released to Ubuntu Lunar (23.04).

An unrelated regression [1] was found during SRU verification, forcing cloud-init to upload 22.4.2 to -proposed for verification on Bionic, Focal, Jammy and Kinetic. Once 22.4.2 completes validation, this bug will be marked Fix Released.

[1] cloud-init status --wait regression in early boot https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1997559

James Falcon (falcojr)
Changed in cloud-init (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Nobuto Murata (nobuto) wrote :
Changed in maas-images:
status: Triaged → Fix Released