[b1] pod created vm fails commissioning after getting 404 from metadata api

Bug #1742971 reported by Jason Hobbs on 2018-01-12
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Critical
Blake Rouse
2.3
High
Blake Rouse
cloud-init
Undecided
Unassigned

Bug Description

A vm created via the pods API in maas failed to commission immediately after it was created.

It PXE booted, got initrd and the kernel, dhcp'd again, and then wasn't heard from anymore:

http://paste.ubuntu.com/26372429/

There are no rsyslog logs for it. The hostname of the vm is landscapeamqp-1.maas.

If I connect to the node with virt-viewer, it is sitting at an Ubuntu prompt, but I can't login because there are no passwords set. There are no console logs available on disk (bug 1742971).

This is with maas 2.3.0 (6434-gd354690-0ubuntu1~16.04.1) in an HA setup - logs from the MAAS servers are available in the attached infra-logs.tar.

To check for this bug you can look for 404 errors like this:
http://paste.ubuntu.com/p/hsbw22BFps/

This was using the default daily maas images.

Related branches

Jason Hobbs (jason-hobbs) wrote :
tags: added: foundations-engine
removed: cpe-foundations
description: updated
description: updated
Andres Rodriguez (andreserl) wrote :

@Jason,

As per our IRC chat, while there are no console logs stored on the host, this doesn't prevent you from obtaining the logs by attempting to recommission:

1. Attempt to get the kernel params being sent, you will see those on the VM's KVM. (Try re-commissioning to obtain these).

2. Enable console log as kernel params in the specific VM, and attempt to re-commission.

Changed in maas:
status: New → Incomplete
Jason Hobbs (jason-hobbs) wrote :

I hit this again today. This is a race condition - it isn't reproducible by re-commissioning the failing VM. I did that, and commissioning succeeded that time. I still grabbed the kernel command line, but I don't think it's very useful. http://paste.ubuntu.com/26404667/

Changed in maas:
status: Incomplete → New
Changed in maas:
milestone: none → 2.4.0beta1
summary: - pod created vm fails commissioning after apparently booting to linux
+ [b1] pod created vm fails commissioning after apparently booting to
+ linux
summary: - [b1] pod created vm fails commissioning after apparently booting to
- linux
+ [b1] pod created vm fails commissioning after getting 404 from metadata
+ api
Jason Hobbs (jason-hobbs) wrote :

We reproduced this some last week, and I noticed that we're getting 404's when the failing node tries to retrieve its preseed:

10.244.40.31/var/log/maas/regiond.log:2018-03-15 11:28:33 regiond: [info] 10.244.40.32 GET /metadata/latest/by-id/cwerrd/?op=get_preseed HTTP/1.1 --> 404 NOT_FOUND (referrer: -; agent: Cloud-Init/17.2)

The problem there is the /MAAS prefix is missing from the path; here's a successful request from another VM:
10.244.40.32/var/log/maas/regiond.log:2018-03-15 11:29:04 regiond: [info] 10.244.40.32 GET /MAAS/metadata/latest/by-id/w3hg84/?op=get_preseed HTTP/1.1 --> 200 OK (referrer: -; agent: Cloud-Init/17.2)

notice the /MAAS in the URL on the successful one.

Now, the question is why the /MAAS is missing in the failure case.

Jason Hobbs (jason-hobbs) wrote :

infra logs from the failure in the last comment.

description: updated
Chris Gregan (cgregan) wrote :

Upgraded to field high given the number of times we have seen this issue

Andres Rodriguez (andreserl) wrote :

Marking this as incomplete in MAAS as Jason was to provide more logs when the next occurrence of this happens.

Changed in maas:
status: New → Incomplete
assignee: nobody → Andres Rodriguez (andreserl)
Changed in maas:
milestone: 2.4.0beta1 → 2.4.0beta2
Jason Hobbs (jason-hobbs) wrote :

I'm sorry, which logs was I supposed to supply?

Changed in maas:
status: Incomplete → New

Ahh right, we turned on an extra maaslog around the kernel parameters.

On Mon, Mar 26, 2018 at 3:02 PM, Jason Hobbs <email address hidden> wrote:
> I'm sorry, which logs was I supposed to supply?
>
> ** Changed in: maas
> Status: Incomplete => New
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1742971
>
> Title:
> [b1] pod created vm fails commissioning after getting 404 from
> metadata api
>
> Status in cloud-init:
> New
> Status in MAAS:
> New
>
> Bug description:
> A vm created via the pods API in maas failed to commission immediately
> after it was created.
>
> It PXE booted, got initrd and the kernel, dhcp'd again, and then
> wasn't heard from anymore:
>
> http://paste.ubuntu.com/26372429/
>
> There are no rsyslog logs for it. The hostname of the vm is
> landscapeamqp-1.maas.
>
> If I connect to the node with virt-viewer, it is sitting at an Ubuntu
> prompt, but I can't login because there are no passwords set. There
> are no console logs available on disk (bug 1742971).
>
> This is with maas 2.3.0 (6434-gd354690-0ubuntu1~16.04.1) in an HA
> setup - logs from the MAAS servers are available in the attached
> infra-logs.tar.
>
> To check for this bug you can look for 404 errors like this:
> http://paste.ubuntu.com/p/hsbw22BFps/
>
>
> This was using the default daily maas images.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/cloud-init/+bug/1742971/+subscriptions

Changed in maas:
status: New → Incomplete
Jason Hobbs (jason-hobbs) wrote :

I captured the requested log information, and it shows maas is sending the incorrect URL to the node in the kernel parameters:

10.244.40.30/var/log/maas/maas.log:Mar 28 11:59:18 leafeon maas.kernel_opts: message repeated 5 times: [ [info] ---: kernel parameters landscapeha-1 "nomodeset root=squash:http://10.244.40.30:5248/images/ubuntu/amd64/generic/xenial/daily/squashfs ro ip=::::landscapeha-1:BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:{'datasource_list': ['MAAS']}end_cc cloud-config-url=http://10.244.40.33/metadata/latest/by-id/b4hrca/?op=get_preseed apparmor=0 log_host=10.244.40.33 log_port=514"]

Changed in maas:
status: Incomplete → New
Jason Hobbs (jason-hobbs) wrote :
Changed in maas:
assignee: Andres Rodriguez (andreserl) → Blake Rouse (blake-rouse)
status: New → Triaged
importance: Undecided → High
Andres Rodriguez (andreserl) wrote :

From the machine running region/rack 10.244.40.30, we see the following:

1. /etc/maas/rackd.conf:maas_url: http://10.244.40.33/MAAS
2. /etc/maas/regiond.conf:maas_url: http://10.244.40.33/MAAS

The logs show this only for 1 machine, which correctly tells the machine to download the squashfs from the rack controller, and gives it the correct IP of the metadata URL, *but* removes the /MAAS for some reason, which effectively makes it invalid.

Mar 28 11:59:18 leafeon maas.kernel_opts: [info] ---: kernel parameters landscapeha-1 "nomodeset root=squash:http://10.244.40.30:5248/images/ubuntu/amd64/generic/xenial/daily/squashfs ro ip=::::landscapeha-1:BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:{'datasource_list': ['MAAS']}end_cc cloud-config-url=http://10.244.40.33/metadata/latest/by-id/b4hrca/?op=get_preseed apparmor=0 log_host=10.244.40.33 log_port=514"

On the other hand, there are quite a few other logs that show that the URL is correct, with the same localtion for the images, and the correct URL for the metadata.

Mar 28 12:02:50 leafeon maas.kernel_opts: [info] ---: kernel parameters maas-enlist "nomodeset root=squash:http://10.244.40.30:5248/images/ubuntu/amd64/ga-16.04/xenial/daily/squashfs ro ip=::::maas-enlist:BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:{'datasource_list': ['MAAS']}end_cc cloud-config-url=http://10.244.40.33/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed apparmor=0 log_host=10.244.40.33 log_port=514"

So somewhere in the code the /MAAS is being removed and we need to find out why.

Changed in cloud-init:
status: New → Invalid
Changed in maas:
assignee: Blake Rouse (blake-rouse) → Mike Pontillo (mpontillo)
Andres Rodriguez (andreserl) wrote :

In this case it doesn't include the port:

cloud-config-url=http://10.244.40.33/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed

Chris Gregan (cgregan) wrote :

This bug requires a task for 2.3.x as well. SLAs do not apply to unreleased versions and besides that, this issue cannot remain broken in xenial. Please add task

John George (jog) wrote :

Attaching logs from the latest reproduction of this bug.

Changed in maas:
milestone: 2.4.0beta2 → 2.3.x
milestone: 2.3.x → 2.4.0beta2
Andres Rodriguez (andreserl) wrote :

I'm marking this as incomplete for 2.3 provided that the latest logs show the failure on 2.3.1. Since 2.3.2 is already released, I'd like to see if 2.3.2 has improved this.

2018-04-08 10:20:28 provisioningserver.rpc.clusterservice: [info] Rack controller 'fcffxf' registered (via swoobat:pid=25312) with MAAS version 2.3.1-6470-g036d646-0ubuntu1~16.04.1.

Once we have 2.3.2 runs with this issue, please mark it as 'New'.

no longer affects: maas/2.4
Changed in maas:
milestone: 2.4.0beta2 → 2.4.0rc1
Jason Hobbs (jason-hobbs) wrote :

We hit this again on 2.3.2:

10.244.40.32/var/log/maas/regiond.log:2018-04-12 10:39:08 regiond: [info] 10.244.40.32 GET /metadata/latest/by-id/swywya/?op=get_preseed HTTP/1.1 --> 404 NOT_FOUND (referrer: -; agent: Cloud-Init/17.2)

Changed in maas:
assignee: Mike Pontillo (mpontillo) → Blake Rouse (blake-rouse)
Andres Rodriguez (andreserl) wrote :

Hi guys,

As per already discussed, Blake needs the DB dump to further debug this issue. Since we don't yet have a test run that doesn't; include the DB dump, I'll mark this as incomplete.

Please set as 'New' once we have logs with this so we can notice the status change.

Thanks!

Changed in maas:
status: Triaged → Incomplete
John George (jog) wrote :

Logs which include a database dump in the file named dump.dmp

Christian Reis (kiko) wrote :

And bump to NEW

Changed in maas:
status: Incomplete → New
Changed in maas:
importance: High → Critical
Blake Rouse (blake-rouse) wrote :

 system_id | hostname | node_type | url
-----------+----------+-----------+--------------------------
 byskdg | leafeon | 4 | http://10.244.40.33/MAAS
 wdmyam | swoobat | 4 | http://10.244.40.33/MAAS
 dftght | meinfoo | 4 | http://10.244.40.33/MAAS

Shows they all have the correct url, I do wonder about the missing '/' at the end. urljoin without a '/' at the end I believe will replace the /MAAS.

Just providing updates, still looking into the issue.

Blake Rouse (blake-rouse) wrote :

Thats what I thought:

>>> from urllib.parse import urljoin
>>> urljoin('http://localhost:5240/MAAS', '/metadata/latest/by-id/swywya/?op=get_preseed')
'http://localhost:5240/metadata/latest/by-id/swywya/?op=get_preseed'

Changed in maas:
status: New → In Progress
Jason Hobbs (jason-hobbs) wrote :

Given that theory, I'm confused how, on the same deployment of MAAS, it would work with most nodes but not with others.

Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 2.4.0rc1 → 2.4.0beta3
Changed in maas:
status: Fix Committed → Fix Released
Chris Gregan (cgregan) wrote :

Field High SLA now requires that a estimated date for a fix is listed in the comments. Please provide this estimate for the open tasks.

tags: added: track
Andres Rodriguez (andreserl) wrote :

To provide an update. We have looked at backporting these fixes to 2.3, and they are not straight forward backport. There are a few conflicts, but these fixes are built on top of changes that are only present on 2.4+ and *not* present on 2.3, which makes this takes much more complicated.

As such, this changes may not be backportable at all. We'll keep you posted.

Dean Henrichsmeyer (dean) wrote :

For some added clarification here: this doesn't backport cleanly to 2.3.x. The fix was built on machinery in 2.4 that does not exist in 2.3 making the backport more expensive and increases the risk of regressions on 2.3. Given that complexity and the fact that PODs were not officially supported until 2.4, we've decided not to backport this. Upgrading to 2.4 is the best path forward.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers