pod kvm fails to boot for install, boot-kernel transfer size is wrong

Bug #1815843 reported by John George
This bug affects 1 person
Affects: MAAS
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

1 of 19 POD KVMs failed to commission.

In the attached logs:
1. Power on messages for landscapesql-2 can be seen in 10.244.40.32/var/log/maas/maas.log
     2019-02-13T04:04:23+00:00 meinfoo maas.power: [info] Changed power state (on) of node: landscapesql-2 (hwmces)

2. No evidence that the KVM ever makes a DHCP request is present in the syslogs.

3. Comparing logs in /var/log/libvirt/qemu between a KVM that completed commissioning (e.g. landscapesql-1.log) and the one that failed (i.e. landscapesql-2.log), there are no log entries showing that /usr/bin/qemu-system-x86_64 was run for the failed one.
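
For future triage, one quick way to check point 3 across every guest is to scan the libvirt logs for the "starting up" line libvirt writes when it launches a guest (visible in the landscapesql-2.log excerpt quoted later in this report). A minimal sketch, not MAAS tooling; it only assumes the /var/log/libvirt/qemu/<domain>.log layout referenced above:

import glob
import os

# Report whether each libvirt domain log contains a "starting up" entry,
# i.e. whether libvirt ever launched a QEMU process for that guest.
for path in sorted(glob.glob("/var/log/libvirt/qemu/*.log")):
    domain = os.path.basename(path)[:-len(".log")]
    with open(path, errors="replace") as f:
        started = any("starting up" in line for line in f)
    print(f"{domain}: {'started' if started else 'no start entry'}")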

Tags: cdo-qa
Changed in maas:
status: New → Incomplete
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi guys,

1. maas.log: I see that the machine was successfully told to commission and, more importantly, I see that the machine successfully powered on:

2019-02-13T04:04:16.402502+00:00 meinfoo maas.drivers.pod.virsh: [info] landscapesql-2: Successfully set network boot order
2019-02-13T04:04:17+00:00 leafeon maas.node: [info] landscapesql-2: Status transition from NEW to COMMISSIONING
2019-02-13T04:04:17.567458+00:00 meinfoo maas.power: [info] Changing power state (on) of node: landscapesql-2 (hwmces)
2019-02-13T04:04:17+00:00 leafeon maas.node: [info] landscapesql-2: Commissioning started

2019-02-13T04:04:23.057874+00:00 meinfoo maas.power: [info] Changed power state (on) of node: landscapesql-2 (hwmces)

2. regiond.log: I also see that the VM in question obtained an IP address, because MAAS updated the DNS configuration with that IP:

2019-02-13 04:04:32 maasserver.region_controller: [info] Reloaded DNS configuration; ip 10.244.40.168 connected to landscapesql-2 on eth0

3. rackd.log: I see that this IP PXE booted off MAAS, requested the boot-kernel, and requested the boot-initrd (a sketch for reconstructing this per-client sequence follows point 4):

2019-02-13 04:04:39 provisioningserver.rackdservices.tftp: [info] lpxelinux.0 requested by 10.244.40.168
2019-02-13 04:04:39 provisioningserver.rackdservices.tftp: [info] ldlinux.c32 requested by 10.244.40.168
2019-02-13 04:04:39 provisioningserver.rackdservices.tftp: [info] pxelinux.cfg/fbf3f3c3-4261-7043-9ca3-85081ecb806c requested by 10.244.40.168
2019-02-13 04:04:39 provisioningserver.rackdservices.tftp: [info] pxelinux.cfg/01-52-54-00-82-f0-f9 requested by 10.244.40.168
2019-02-13 04:04:40 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/generic/bionic/daily/boot-kernel requested by 10.244.40.168
2019-02-13 04:04:40 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/generic/bionic/daily/boot-initrd requested by 10.244.40.168

4. /var/log/libvirt/qemu/landscapesql-2.log: I see that the machine was started:

2019-02-13 04:04:19.723+0000: starting up libvirt version: 4.0.0, package: 1ubuntu8.6 (Christian Ehrhardt <email address hidden> Fri, 09 Nov 2018 07:42:01 +0100), qemu version: 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.9), hostname: meinfoo.maas
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=spice /usr/bin/kvm-spice -name guest=landscapesql-2,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-135-landscapesql-2/master-key.aes -machine pc-i440fx-bionic,accel=kvm,usb=off,dump-guest-core=off -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid c3f3f3fb-6142-4370-9ca3-85081ecb806c -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-135-landscapesql-2/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/var/lib/virt/images/0b62df7c-b153-4846-878b-c3ede473e81e,format=raw,if=none,id=drive-virtio-disk0,serial=0b62df7c-b153-4846-878b...
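
Regarding point 3, the per-client request sequence can be reconstructed from rackd.log to see exactly where a failing machine stops. A rough sketch, assuming the standard /var/log/maas/rackd.log location and the "[info] <file> requested by <ip>" message format quoted above:

import re
from collections import defaultdict

# Group every "<file> requested by <ip>" rackd.log message by client IP,
# then show how far each client got in its boot sequence.
pattern = re.compile(r"\[info\] (\S+) requested by (\S+)")
sequence = defaultdict(list)

with open("/var/log/maas/rackd.log", errors="replace") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            sequence[m.group(2)].append(m.group(1))

for ip, files in sorted(sequence.items()):
    print(f"{ip}: {len(files)} requests, last: {files[-1]}")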

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

OK, we hit this again and I dug into it more. We aren't going to get serial console logs because it never makes it to Linux. The VM downloads the boot-kernel file but never downloads the initrd.

The transfer size of boot-kernel is wrong. Here is the bad request:
10.244.40.32/var/log/maas/http/access.log:10.244.40.223 - - [28/Feb/2019:06:52:58 +0000] "GET /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel HTTP/1.0" 200 3908156 "-" "Syslinux/6.03"

Here is a good request:
10.244.40.32/var/log/maas/http/access.log:10.244.40.220 - - [28/Feb/2019:06:51:50 +0000] "GET /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel HTTP/1.0" 200 8281848 "-" "Syslinux/6.03"

There are MANY good requests from other VMs for the kernel, and they all have the size 8281848. Only the failed one has 3908156.

The machine is grafana-1. I'm attaching logs.
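
A quick way to spot these short transfers in bulk is to scan the access log and compare every boot-kernel response size against the most common one. A minimal sketch, assuming the /var/log/maas/http/access.log path and the combined log format shown in the excerpts above:

import re
from collections import Counter

# Collect (client IP, response size) for every successful boot-kernel GET,
# then flag responses whose size differs from the most common one.
pattern = re.compile(r'^(\S+) .*"GET (\S*/boot-kernel) HTTP/[\d.]+" 200 (\d+)')
transfers = []

with open("/var/log/maas/http/access.log", errors="replace") as f:
    for line in f:
        m = pattern.match(line)
        if m:
            transfers.append((m.group(1), int(m.group(3))))

if transfers:
    expected = Counter(size for _, size in transfers).most_common(1)[0][0]
    for ip, size in transfers:
        if size != expected:
            print(f"{ip}: {size} bytes (expected {expected})")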

summary: - POD KVMs sometimes fail to power on
+ pod kvm fails to boot for install, boot-kernel transfer size is wrong
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Changed in maas:
status: Incomplete → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Subscribed field-high, as this is causing a sudden spike in failures.

Changed in maas:
milestone: none → 2.6.0
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Is automatic image syncing enabled in this deployment? If so, while this error happens, are images being downloaded and/or updated?

Changed in maas:
status: New → Incomplete
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Can you please specify which machines are having these failures?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Andres, I specified in comment #5 that it's grafana-1 that had the issue. The logs for that are attached in comment #6.

Changed in maas:
status: Incomplete → New
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Setting this back to Incomplete. Since these requests go through nginx, we would like to see its configuration, which is available in /var/lib/maas/http. Please gather all the nginx config files; that should give us an idea of whether the configuration has any impact.

Changed in maas:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We don't modify the nginx templates at all; they are just the defaults. Here they are:

http://paste.ubuntu.com/p/mQSkr7tpZs/

Changed in maas:
status: Incomplete → New
Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
Changed in maas:
milestone: 2.6.0 → 2.6.0rc1
Revision history for this message
Blake Rouse (blake-rouse) wrote :

Do you have a tcpdump of the network traffic? That is needed to determine what the issue is. We need to inspect the HTTP headers being sent from nginx to the client, specifically the "Content-Length" header, to see whether there is a mismatch there.

It is possible that the correct Content-Length is being sent to the client, but the client is either closing the HTTP connection too soon or a TCP reset is occurring. That will still be reported as a 200 response on the server side, because the response status was 200. The difference will be whether the actual amount of data sent to the client matches the Content-Length of the response HTTP header.

Knowing whether those mismatch will be the first step, because there are a few possible outcomes:

1. The server is reading the file wrong and setting Content-Length to something other than the actual file length.
2. The server is sending the file and something causes the file handle to be closed, so the streaming stops (I would expect a log entry in the nginx error log if this occurred, but maybe not).
3. The client is closing the connection before reading all the data being streamed from the server.
4. A TCP reset is occurring and breaking the TCP connection, giving the same result as #3 but caused by a TCP reset instead of a client disconnect.

Please also include the output of the whole tree structure of /var/lib/maas/boot-resources with file sizes in bytes, so the Content-Length can be compared to the actual files.
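
For what it's worth, here is an illustrative sketch of the comparison described above: fetch boot-kernel over HTTP and compare the declared Content-Length, the bytes actually received, and the size of the file on disk. The URL, port, and boot-resources path below are placeholders and will differ per deployment:

import os
import urllib.request

# Placeholders: point these at the rack controller's image endpoint and the
# matching file under /var/lib/maas/boot-resources for this deployment.
url = "http://10.244.40.32:5248/images/ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel"
path = "/var/lib/maas/boot-resources/current/ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel"

with urllib.request.urlopen(url) as resp:
    declared = int(resp.headers.get("Content-Length", "-1"))
    received = len(resp.read())
on_disk = os.path.getsize(path)

print(f"Content-Length={declared} received={received} on-disk={on_disk}")
if declared != on_disk:
    print("Server is advertising the wrong length (outcome 1)")
elif received != declared:
    print("Transfer was cut short (outcomes 2-4)")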

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

No, it's impractical to capture a tcpdump for every single VM on every single run... the logs would be enormous.

Why don't we just configure nginx to log the headers?

Our logs tarball already includes the /var/lib/maas tree, for example at:

./10.244.40.30/var/log/maas/var.lib.maas.log

Revision history for this message
Blake Rouse (blake-rouse) wrote :

If you could configure nginx to log the required information, that would also work, but a tcpdump would really give us a lot of information about what is going on. We would only need to run it on the rack controller that the booting is occurring from. I understand that it would be a huge amount of data, so if that's not possible we can do our best to work without it; it will just take more time to find the actual issue.

Changed in maas:
milestone: 2.6.0rc1 → 2.6.0rc2
Changed in maas:
milestone: 2.6.0rc2 → 2.7.0alpha1
Revision history for this message
Björn Tillenius (bjornt) wrote :

Can this still be reproduced with MAAS 2.6? If so, we'll try to reproduce it on our side.

Changed in maas:
status: New → Incomplete
assignee: Blake Rouse (blake-rouse) → nobody
Changed in maas:
milestone: 2.7.0b1 → 2.7.0b2
Changed in maas:
milestone: 2.7.0b2 → none
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired