[2.5] pod VMs fail to commission due to corrupted initrd

Bug #1810972 reported by Ashley Lai
This bug affects 2 people
Affects: MAAS
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

On a MAAS 2.5 HA deployment, some pod VMs reach the 'Ready' state, but a few VMs seem to get stuck in commissioning.

2019-01-08-15:12:40 foundationcloudengine.layers.maaslayer INFO Ready: 14/19
Status: {u'landscapeamqp-2': u'Ready', u'landscapeha-1': u'Commissioning', u'landscape-1': u'Ready', u'landscapeamqp-1': u'Ready', u'juju-1': u'Ready', u'juju-3': u'Ready', u'grafana-1': u'Commissioning', u'prometheus-3': u'Ready', u'landscapesql-1': u'Ready', u'landscapesql-2': u'Ready', u'juju-2': u'Ready', u'graylog-3': u'Commissioning', u'elastic-3': u'Ready', u'elastic-2': u'Ready', u'landscape-3': u'Ready', u'landscapeamqp-3': u'Ready', u'thruk-2': u'Commissioning', u'nagios-1': u'Commissioning', u'landscape-2': u'Ready'}

Revision history for this message
Ashley Lai (alai) wrote :
summary: - maas 2.5: pod VMs stuck on commissioning
+ [2.5] Pod VMs stuck on commissioning
Revision history for this message
Mike Pontillo (mpontillo) wrote : Re: [2.5] Pod VMs stuck on commissioning

First I looked at the logs from `landscapesql-1`, which (according to your logs) seems to be in Ready state. Indeed, it looks like cloud-init runs and commissions the machine successfully.

2019-01-08T14:53:01+00:00 landscapesql-1 cloud-init[925]: All scripts successfully ran
2019-01-08T14:53:01+00:00 landscapesql-1 cloud-init[925]: Cloud-init v. 18.4-0ubuntu1~18.04.1 finished at Tue, 08 Jan 2019 14:53:01 +0000. Datasource DataSourceMAAS [http://10-244-40-0--21.maas-internal:5248/MAAS/metadata/]. Up 176.97 seconds

Then I looked at the logs for `landscapeha-1`, which seems to still be in 'Commissioning' state. In contrast, there is no rsyslog output, and the machine is terminated by libvirtd less than a second after attempting startup:

2019-01-08 14:49:22.811+0000: starting up libvirt version: 4.0.0, package: 1ubuntu8.6 (Christian Ehrhardt <email address hidden> Fri, 09 Nov 2018 07:42:01 +0100), qemu version: 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.9), hostname: leafeon
...
2019-01-08 14:49:23.162+0000: shutting down, reason=destroyed

However, later in the logs I see the startup message with no corresponding shutdown message, so I don't know if the machine actually booted and is attempting to commission.

Can you see any pattern regarding which hosts failed to commission, or is it random every time? Do these machines have a unique networking or storage configuration? (For example, maybe MAAS has created machines based on interface constraints without a PXE network. I thought we had a separate open bug on that, but I can't find it.)

Are these hosts still in commissioning state according to the MAAS database? What happens if you try to commission them again? Can you use a tool such as virt-manager to browse the hypervisor and determine whether anything is suspicious about the VM configuration, such as missing or incorrectly attached NICs, a duplicate MAC address, etc.?
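
A quick way to eyeball the NIC configuration without a GUI is a sketch like the following (the connection URI, user, and domain name are placeholders, not values confirmed in this report):

# Dump the domain definition from the pod host and pull out the NIC stanzas to
# look for missing interfaces or duplicate MAC addresses.
virsh -c qemu+ssh://ubuntu@leafeon/system dumpxml landscapeha-1 | grep -A4 '<interface'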

Changed in maas:
status: New → Incomplete
Revision history for this message
Ashley Lai (alai) wrote :

Logs with console logging enabled are attached.

Ashley Lai (alai)
Changed in maas:
status: Incomplete → New
Revision history for this message
Mike Pontillo (mpontillo) wrote :

I looked at the logging from the new run, and it seems like a race condition of some type. We rebooted one of the machines that failed commissioning, and then observed it working.

In the new logs there is still no console output, but not because console logging wasn't turned on: the machines failed to boot from the network, and thus did nothing at all besides idle on a "No bootable device" screen.

I'd like to understand the order of operations here; it could be the machines are being composed too soon, or it could be that MAAS didn't bring up its services in a timely manner.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

sub'd to field-critical because this is blocking the majority of our 2.5 runs and we don't have a workaround.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Unfortunately, it looks like the repro from comments #3 and #4 is actually bug 1792965, which presents similarly but is due to dhcpd not starting properly. In the logs from comment #1, there is no "There's already a DHCP server running." error message, and it looks like thruk-2 gets a DHCP lease successfully and starts the PXE boot process but maybe never makes it to Linux.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

Thanks Jason. Marking this Incomplete until we get logs from a reproduction of the same failure.

Changed in maas:
status: New → Incomplete
Revision history for this message
Ashley Lai (alai) wrote :

Here's the correct log.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

alai is uploading a full set of logs, but this has been reproduced with console logging on and it shows that the initrd is corrupted:

https://pastebin.canonical.com/p/5vHyzbjxH5/

[ 0.527730] Unpacking initramfs...
[ 1.206497] Initramfs unpacking failed: junk in compressed archive

This could be the same issue as bug 1811021 if the underlying issue is tftp serving up invalid files.

Ashley Lai (alai)
Changed in maas:
status: Incomplete → New
Revision history for this message
Ashley Lai (alai) wrote :

Here's a list of VMs and their status. This may be helpful.

Status: {u'landscapeamqp-2': u'Commissioning', u'landscapeha-1': u'Commissioning', u'landscape-1': u'Ready', u'landscapeamqp-1': u'Ready', u'juju-1': u'Ready', u'juju-3': u'Ready', u'grafana-1': u'Commissioning', u'prometheus-3': u'Ready', u'landscapesql-1': u'Ready', u'landscapesql-2': u'Ready', u'juju-2': u'Ready', u'graylog-3': u'Commissioning', u'elastic-3': u'Ready', u'elastic-2': u'Ready', u'landscape-3': u'Ready', u'landscapeamqp-3': u'Ready', u'thruk-2': u'Commissioning', u'nagios-1': u'Commissioning', u'landscape-2': u'Ready'}

summary: - [2.5] Pod VMs stuck on commissioning
+ [2.5] pod VMs fail to commission due to corrupted initrd
Revision history for this message
Lee Trager (ltrager) wrote :

As of MAAS 2.5, the kernel and initrd are served from nginx when booting with pxelinux. Since you seem to be having issues with multiple files, can you post the output of dmesg from each of your rack controllers? I would also ensure that the SHA256 of the initrd is found in https://images.maas.io/ephemeral-v3/daily/streams/v1/com.ubuntu.maas:daily:v3:download.sjson
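
One quick way to do that check is sketched below (it assumes the default MAAS 2.5 boot-resources path mentioned later in this thread; adjust if your layout differs):

# Hash the local initrd and count how many times that checksum appears in the
# simplestreams index (the sjson is signed JSON text, so plain grep is enough).
INITRD=/var/lib/maas/boot-resources/current/ubuntu/amd64/generic/bionic/daily/boot-initrd
SUM=$(sha256sum "$INITRD" | awk '{ print $1 }')
curl -s https://images.maas.io/ephemeral-v3/daily/streams/v1/com.ubuntu.maas:daily:v3:download.sjson | grep -c "$SUM"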

Revision history for this message
Lee Trager (ltrager) wrote :

On 10.244.40.30 the rack controller did encounter a problem downloading bootloaders:

simplestreams.checksum_util.InvalidChecksum: Invalid sha256 Checksum at http://10.244.40.33:80/MAAS/images-stream/grub-ieee1275/ppc64el/generic/open-firmware/20181212.0/grub2.tar.xz. Found 22da7c572515b7d9f93b6fc91b09dbabdf093381a83fa606ad5afded1788798a. Expected 73abb7166f0bfa26d04a8494093cbbdaad6ff6dfecaca71f2c01b8c0df0791ec. read 522240 bytes expected 513244 bytes. (size 522240 expected 513244)

It seems it was later corrected, but I'm wondering if there is a problem with the disks on the rack controller.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We have syslog for each rack controller included in the logs attached in comment #8. There are no signs of disk issues; if there were, I would expect other things to fail and warnings/errors about the filesystem to appear in syslog.

Also, we're only seeing this with initrd.

Where are the nginx logs for the rack controller? It would be great to see size of the response and response code for the initrd requests.

Revision history for this message
Lee Trager (ltrager) wrote :

nginx logs are stored in /var/log/maas/http. Looking at your logs, it seems your images are a few days old. Could you try updating? Once updated, can you confirm the following SHA256s:

e172821ef8bdb36e745b2044f126eb63c89484806a93c2e9790295ac2c8b7944 /var/lib/maas/boot-resources/current/ubuntu/amd64/generic/bionic/daily/boot-initrd
10ab1a486eb79068a52256a375403134200ecf00ded9e24c8da98945b810a2f5 /var/lib/maas/boot-resources/current/ubuntu/amd64/generic/bionic/daily/boot-kernel
f6922096bd3e70b18a04b80b1917a0610ee766bc2ca189372e3c4bc072051cd9 /var/lib/maas/boot-resources/current/ubuntu/amd64/generic/bionic/daily/squashfs

If those check out, could you try downloading and re-verifying the boot-initrd on a remote machine?
wget http://MAAS_IP:5240/images/ubuntu/amd64/generic/bionic/daily/boot-initrd
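
To get the response code and size for each initrd request, something like the following against the rack's nginx access log should work (a sketch: the access.log filename under /var/log/maas/http and the combined log format field positions are assumptions):

# Print status code and body size for every boot-initrd request, then count how
# many distinct (status, size) pairs were seen.
awk '$7 ~ /boot-initrd/ { print $9, $10 }' /var/log/maas/http/access.log | sort | uniq -c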

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1810972] Re: [2.5] pod VMs fail to commission due to corrupted initrd

These failures are from a CI job; we always install MAAS just a few minutes before we hit this failure. At the same time that these VMs are having issues with the initrd, other ones are succeeding. It seems really unlikely that it's a bad image on disk. We can try to reproduce again manually tomorrow and see, though.

Jason

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

OK, on another reproduction the hashes check out:

ubuntu@swoobat:/var/lib/maas/boot-resources/current/ubuntu/amd64/generic/bionic/daily$ sha256sum *
e172821ef8bdb36e745b2044f126eb63c89484806a93c2e9790295ac2c8b7944 boot-initrd
10ab1a486eb79068a52256a375403134200ecf00ded9e24c8da98945b810a2f5 boot-kernel
f6922096bd3e70b18a04b80b1917a0610ee766bc2ca189372e3c4bc072051cd9 squashfs

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here's a funny thing that shows up in the nginx logs: all of the boot-initrd requests are the proper size except one, which is 25554114 bytes.

http://paste.ubuntu.com/p/6bXnXkzdnB/

I haven't been able to correlate that to a failure yet, but it definitely seems like a red flag to me.

I think we need to reproduce this with tcpdump running to capture what's on the wire. If we can rule out an issue with what's on the wire, we can focus on what happens after it gets to the VM.
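
A minimal capture command for that (the interface name and output file are placeholders):

# Capture full packets of the image traffic served on the rack's HTTP port (5248)
# so the transferred initrd bytes can be reassembled and verified later.
sudo tcpdump -i eth0 -s 0 -w /tmp/initrd-transfer.pcap 'tcp port 5248'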

Revision history for this message
Lee Trager (ltrager) wrote :

Is MAAS running in VirtualBox? There is a known issue [1] with VirtualBox corrupting files coming from nginx. Turning off sendfile in /var/lib/maas/http/nginx.conf may help.

[1] https://forums.virtualbox.org/viewtopic.php?f=1&t=24905
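
For reference, a minimal sketch of that change (it assumes the generated config contains a literal "sendfile on;" directive; nginx also needs to be reloaded afterwards for it to take effect):

# Flip the sendfile directive in the MAAS-managed nginx config.
sudo sed -i 's/sendfile on;/sendfile off;/' /var/lib/maas/http/nginx.conf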

Revision history for this message
Ashley Lai (alai) wrote :

MAAS is not running in VirtualBox.

Revision history for this message
Ashley Lai (alai) wrote :

It may be worth a try to turn it off and see what happens, though.

Revision history for this message
Lee Trager (ltrager) wrote :

I'm having difficulty reproducing this issue locally. If possible, can you try booting into Ubuntu and downloading the initrd with the following script:

# Repeatedly fetch the initrd from the rack controller and flag any download
# whose SHA256 does not match the expected value.
for i in {1..100}; do
    wget -q "http://${RACK_IP}:5248/images/ubuntu/amd64/generic/bionic/daily/boot-initrd"
    if [ "$(sha256sum boot-initrd | awk '{ print $1 }')" != "e172821ef8bdb36e745b2044f126eb63c89484806a93c2e9790295ac2c8b7944" ]; then
        echo "Invalid hash detected!"
    fi
    rm boot-initrd
done

If that works, the problem may be in lpxelinux instead of nginx.

Revision history for this message
Ashley Lai (alai) wrote :

Turning off sendfile did not help. I've also tried the script in comment #21 and it went fine for all rack controllers.

One thing to note is that on the new repro (different from what jhobbs looked at), I don't see the size difference in the nginx log.

Status: {u'landscapeamqp-2': u'Ready', u'landscapeha-1': u'Commissioning', u'landscape-1': u'Ready', u'landscapeamqp-1': u'Ready', u'juju-1': u'Ready', u'juju-3': u'Ready', u'grafana-1': u'Commissioning', u'prometheus-3': u'Ready', u'landscapesql-1': u'Ready', u'landscapesql-2': u'Ready', u'juju-2': u'Ready', u'graylog-3': u'Commissioning', u'elastic-3': u'Ready', u'elastic-2': u'Ready', u'landscape-3': u'Ready', u'landscapeamqp-3': u'Ready', u'thruk-2': u'Commissioning', u'nagios-1': u'Commissioning', u'landscape-2': u'Ready'}

Revision history for this message
Lee Trager (ltrager) wrote :

If you have a machine on which you can reliably reproduce this, could you post a tcpdump capture from each of the rack controllers when it happens? I'd like to see if there is anything on the network that may be interrupting the transfer.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

One of our forum users pointed out that `sendfile on` in our nginx config was causing strange problems.[1]

I suggest we try removing that from the configuration and then check if we can still reproduce the issue.

[1]:
https://discourse.maas.io/t/slow-image-download-unless-running-tcpdump/309/3

Revision history for this message
Mike Pontillo (mpontillo) wrote :

Ah, my mistake, I see this has already been tried. Sorry. I shouldn't comment while jet lagged. ;-)

Revision history for this message
Thiago Martins (martinx) wrote :

My workaround is to compose the KVM guests with virt-manager, using UEFI firmware.

To do that, I have to run the following on each KVM host (pod):

sudo apt install ovmf

Then compose the KVM guests with "Firmware = UEFI" instead of "BIOS".

Voilà! Problem solved!

No more kernel panic, and MAAS can take over the VM after a simple "Pod refresh"!

It's even better now: when deleting a pod, even accidentally, all the VMs are left untouched. I really like this, so no more composing KVM guests via the MAAS UI for me (for now).

More details here:

https://discourse.maas.io/t/maas-2-5-bionic-pod-composing-guests-kernel-panic-no-init/302
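
For anyone who prefers the command line over virt-manager, an equivalent sketch with virt-install (the guest name, memory, disk size, and network are illustrative placeholders, not values from this report):

# Create a PXE-booting guest with UEFI (OVMF) firmware instead of BIOS.
virt-install --name maas-node-1 --memory 2048 --vcpus 2 \
  --boot uefi --pxe --network network=default \
  --disk size=8 --os-variant ubuntu18.04 --noautoconsole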

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Attached is a pcap from a corrupted initramfs transfer. The file on the wire checks out: it has the right sha256sum. You can test this by loading the capture in Wireshark and exporting the HTTP data bytes.
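
One way to do that verification non-interactively, assuming a reasonably recent tshark and a capture file named initrd.pcap (both assumptions):

# Reassemble and export the HTTP objects from the capture, then hash the initrd.
# The exported filename is taken from the request URL, so it may vary.
tshark -r initrd.pcap --export-objects http,./exported
sha256sum ./exported/boot-initrd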

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This is also reproducible by restarting commissioning; I've got the same machine failing to commission three times in a row now.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Looking at MAAS shows that the machines that fail to commission and get this initrd failure all have only 2GB of RAM. Everything with more is working fine.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Commissioning with xenial works fine, we only hit this when commissioning with bionic.

Even with the same kernel version, xenial passes and bionic fails.

xenial:
[ 0.000000] Linux version 4.15.0-43-generic (buildd@lcy01-amd64-007) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)) #46~16.04.1-Ubuntu SMP Fri Dec 7 13:31:08 UTC 2018 (Ubuntu 4.15.0-43.46~16.04.1-generic 4.15.18)

bionic:
[ 0.000000] Linux version 4.15.0-43-generic (buildd@lgw01-amd64-001) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 (Ubuntu 4.15.0-43.46-generic 4.15.18)

The initrd is bigger on bionic - 56M compressed for bionic vs 54M compressed on xenial.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This is bug 1797581. It doesn't reproduce for other RAM sizes, larger or smaller.
