Deployment fails if install_kvm=true and user_data blob is passed

Bug #1801420 reported by Craig Bender
This bug affects 2 people
Affects: MAAS
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

Issue started with 2.5 beta3 (beta 2 was OK).

When deploying from the command line, passing both 1) install_kvm=true AND 2) user_data="$(cat ~/user-data.yaml|base64)" causes the installation to fail. No significant errors are logged, and the MAAS and cloud-init logs look fine, except that MAAS reports the deployment as failed.

Using either install_kvm=true OR user_data="$(cat your.yaml|base64)" by itself works fine; only the combination fails.

Note: the user-data file can be extremely simple and still reproduce the failure.

Steps to recreate:

1) Create simple cloud-init user-data file

cat <<EOF|tee ~/user-data.yaml
#cloud-config
runcmd:
 - touch /var/tmp/install-completed
EOF

2) Allocate machine:

maas admin machines allocate system_id=<systemid>

3) Deploy machine with install_kvm and user_data args:

maas admin machine deploy <systemid> install_kvm=true user_data="$(cat ~/user-data.yaml|base64)"
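
A side note for anyone reproducing this: GNU `base64` wraps its output at 76 columns by default, so the value passed in `user_data` may contain line breaks. Nothing in this report says that is related to the failure. The following is only a minimal Python sketch of producing an unwrapped value for comparison; `encode_user_data` is a hypothetical helper name, not part of the MAAS CLI.

```python
import base64


def encode_user_data(text: str) -> str:
    """Return cloud-init user-data as a single-line base64 string."""
    # b64encode never inserts line breaks, unlike `base64` without -w0.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Same content as the ~/user-data.yaml created in step 1.
USER_DATA = "#cloud-config\nruncmd:\n - touch /var/tmp/install-completed\n"

encoded = encode_user_data(USER_DATA)
print(encoded)
```

The printed string can be passed as the `user_data=` argument in place of the `$(cat ...|base64)` substitution.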

Output of `dpkg -l "*maas*"|cat` and the contents of /var/log/maas are included in the attachment.

Tags: cpe-onsite
Craig Bender (craig-bender) wrote :

Attachment also includes cloud-init.log and cloud-init-output.log from a failed host.

Changed in maas:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Mike Pontillo (mpontillo)
milestone: none → 2.5.0rc1
Revision history for this message
Mike Pontillo (mpontillo) wrote :

Using install_kvm causes the cloud-init vendor-data to be preset by MAAS. If you are providing custom user-data, it will need to be provided in a specific way (documented by the cloud-init team[1]) so that it can be merged correctly.

What specific user-data are you using? Are you using `cloud-config-jsonp` to merge your user-data with the MAAS vendor-data, or are you allowing cloud-init to overwrite the vendor-data from MAAS with your custom user-data?

[1]:
https://cloudinit.readthedocs.io/en/latest/topics/vendordata.html
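
To make the distinction in the question above concrete, here is a toy Python sketch of "overwrite" versus "merge" semantics for two cloud-config documents. It is not cloud-init's merge code and makes no claim about which behaviour cloud-init applies to vendor-data and user-data; the function names and the trimmed-down `vendor` dictionary are assumptions for illustration only.

```python
def override(vendor: dict, user: dict) -> dict:
    """Top-level keys in user replace the same keys in vendor."""
    merged = dict(vendor)
    merged.update(user)
    return merged


def append_lists(vendor: dict, user: dict) -> dict:
    """Like override, but list values are concatenated instead of replaced."""
    merged = dict(vendor)
    for key, value in user.items():
        if isinstance(value, list) and isinstance(merged.get(key), list):
            merged[key] = merged[key] + value
        else:
            merged[key] = value
    return merged


# Trimmed-down stand-ins for the MAAS vendor-data and the custom user-data.
vendor = {
    "packages": ["qemu-kvm", "libvirt-bin", "qemu-efi"],
    "runcmd": [["systemctl", "restart", "sshd"]],
}
user = {"runcmd": ["touch /var/tmp/install-completed"]}

print(override(vendor, user)["runcmd"])
print(append_lists(vendor, user)["runcmd"])
```

With override semantics the vendor `runcmd` entries are lost; with list-append semantics both sets of commands survive.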

Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Mike Pontillo (mpontillo) wrote :

The vendor-data for installing KVM is complex[1]; depending on what you're doing, I can see that you could easily cause a conflict between your custom user-data and this vendor-data.

[1]:
https://git.launchpad.net/maas/tree/src/metadataserver/vendor_data.py#n103

Revision history for this message
Mike Pontillo (mpontillo) wrote :

Thanks for attaching the logs - I did find one thing that needs fixing[1], but I'm not certain it has anything to do with the user-data you provided.

[1]:
https://paste.ubuntu.com/p/W3JGQVdPhj/

Revision history for this message
Craig Bender (craig-bender) wrote :

Hi Mike,

I understand the complexity, but even the simplest of cloud-init files fails.

Simple meaning the only thing running is touching a file as a run command. That should not cause a conflict and far more complex cloud-init files worked under beta-2. The following is enough to make it fail:

#cloud-config
runcmd:
 - touch /var/tmp/install-completed

Furthermore, nothing is actually failing in cloud-init and everything actually is installed. MAAS just reports back that it failed.

With beta2, I could do all of the following and more in a user-data file when install_kvm=true was selected:
- Remove lxd pkgs
- Add lxd as a snap
- Install virt-manager, libvirt-daemon-driver-storage-zfs, jq, and other useful tools
- Create zfs-based libvirt storage pool
- Preseed ssh keys
- Load self-signed CA cert

I'm happy to post what used to work, but I don't think that would help since the simplest user-data file possible is causing a failure.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

With the information provided in this bug report, I can't narrow the cause of the failure down to the user-data blob. A week or two ago I hit bug #1800573, which had a symptom that exactly corresponds to the failure mode in your logs, but I subsequently can't reproduce it (with the latest code on the MAAS master branch).

Can you confirm that deploying KVM works /without/ the user-data blob in the same scenario? I feel like there may be a more general issue blocking KVM deployment on beta2. Can you try it again with beta4?

I think having a known-to-work user-data would be great. I still wonder if the complexity inherent in merging cloud-init YAML configurations will throw us off here, but we should probably add known-good use cases to CI.

Revision history for this message
Craig Bender (craig-bender) wrote :

Did more testing and it appears it's install_kvm that's broken in 2.5.0~beta4-7361-g401d6, both CLI and GUI. Does not appear to have anything to do with passing user-data.

This failure appears in rackd.log every time, right before the node gets marked as failed deployment:

2018-11-11 17:51:00 provisioningserver.rpc.pods: [critical] Failed to discover pod.
        Traceback (most recent call last):
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 500, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 567, in _startRunCallbacks
            self._runCallbacks()
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1442, in gotResult
            _inlineCallbacks(r, g, deferred)
        --- <exception caught here> ---
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/lib/python3/dist-packages/provisioningserver/drivers/pod/virsh.py", line 1298, in discover
            conn = yield self.get_virsh_connection(context)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
            result = g.send(result)
          File "/usr/lib/python3/dist-packages/provisioningserver/drivers/pod/virsh.py", line 1289, in get_virsh_connection
            raise VirshError('Failed to login to virsh console.')
        provisioningserver.drivers.pod.virsh.VirshError: Failed to login to virsh console.
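
When checking whether a failed deployment matches this signature, it can help to pull only the `[critical]` entries out of rackd.log. The following is a minimal Python sketch; the regular expression assumes the `timestamp logger: [critical] message` layout shown in the excerpt above, and `find_critical_entries` is a hypothetical helper name.

```python
import re

# Matches lines such as:
# 2018-11-11 17:51:00 provisioningserver.rpc.pods: [critical] Failed to discover pod.
CRITICAL_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+): \[critical\] (.*)$",
    re.MULTILINE,
)


def find_critical_entries(log_text: str) -> list:
    """Return (timestamp, logger, message) tuples for each [critical] line."""
    return [match.groups() for match in CRITICAL_RE.finditer(log_text)]


SAMPLE = """\
2018-11-11 17:51:00 provisioningserver.rpc.pods: [critical] Failed to discover pod.
        Traceback (most recent call last):
        provisioningserver.drivers.pod.virsh.VirshError: Failed to login to virsh console.
"""

for timestamp, logger, message in find_critical_entries(SAMPLE):
    print(timestamp, logger, message)
```

To run it against a real log, read the file contents (for example from /var/log/maas/rackd.log) and pass the text to the function.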

The vendor-data file that MAAS creates when install_kvm=true (or when the GUI option is selected) is valid; I tested deploying it on its own.

#cloud-config
ntp:
  pools: []
  servers: [172.27.20.1]
packages: [qemu-kvm, libvirt-bin, qemu-efi]
runcmd:
- [mkdir, -p, /home/virsh/bin]
- [ln, -s, /usr/bin/virsh, /home/virsh/bin/virsh]
- [sh, -c, echo "PATH=/home/virsh/bin" >> /home/virsh/.bashrc]
- [sh, -c, printf "Match user virsh\n X11Forwarding no\n AllowTcpForwarding
    no\n PermitTTY no\n ForceCommand nc -q 0 -U /var/run/libvirt/libvirt-sock\n" >>
    /etc/ssh/sshd_config]
- [/usr/sbin/usermod, --append, --groups, 'libvirt,libvirt-qemu', virsh]
- [systemctl, restart, sshd]
- [/bin/sleep, '10']
ssh_pwauth: true
users:
- default
- {lock_passwd: false, name: virsh, passwd: $6$VJPW8amrHCyaWIF.$RJQr2rgySb64hHNfOE41sLqvd.TzXdCCQDcoYiO6hlmPI3cB0yJUNRtLDBNzxGJQGxpUim.Ewj.klmuC14qoz0, shell: /bin/rbash}

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Craig,

From your last message, it seems the machine gets marked as failed deployment because MAAS cannot power manage it: it failed to log in to libvirt via virsh.

"provisioningserver.drivers.pod.virsh.VirshError: Failed to login to virsh console."

This didn't fail because of the metadata.
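
One way to test the virsh login independently of MAAS is to run the same kind of connection by hand from the rack controller. The following is a minimal Python sketch that only builds the command line. The `qemu+ssh://virsh@<host>/system` URI is an assumption based on the `virsh` user created in the vendor-data above, not something confirmed in this report; `virsh_check_command` is a hypothetical helper and 192.0.2.10 is a placeholder address.

```python
import shlex
from typing import List


def virsh_check_command(host: str, user: str = "virsh") -> List[str]:
    """Build a virsh command that lists domains over an SSH connection."""
    # Assumed URI form; adjust to match the power address MAAS actually uses.
    uri = f"qemu+ssh://{user}@{host}/system"
    return ["virsh", "-c", uri, "list", "--all"]


command = virsh_check_command("192.0.2.10")
print(shlex.join(command))
```

If the printed command fails with an authentication or connection error when run manually, that points at the libvirt/SSH setup on the deployed host and not at the user-data.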

Changed in maas:
milestone: 2.5.0rc1 → 2.5.0rc2
Changed in maas:
milestone: 2.5.0rc2 → 2.5.1
Changed in maas:
milestone: 2.5.1 → 2.5.2
Changed in maas:
milestone: 2.5.2 → 2.5.3
Changed in maas:
milestone: 2.5.3 → 2.5.4
Changed in maas:
milestone: 2.5.4 → none
Revision history for this message
Björn Tillenius (bjornt) wrote :

Is this still an issue with MAAS 2.6.1?

Changed in maas:
assignee: Mike Pontillo (mpontillo) → nobody
importance: High → Undecided
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired
Revision history for this message
Jeff Hillman (jhillman) wrote :

This still happens in MAAS 2.6.2

Changed in maas:
status: Expired → Confirmed
tags: added: cpe-onsite
Changed in maas:
status: Confirmed → Triaged
importance: Undecided → Medium
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Is it still reproducible on MAAS 3.3 or later?

Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired