node failed commissioning or deployment - HTTP Error 400: {'boot_interface': ["Must be one of the node's interfaces."]}

Bug #1768870 reported by Jason Hobbs
Affects     Status          Importance   Assigned to    Milestone
MAAS        Fix Released    High         Blake Rouse
MAAS 2.3    Fix Committed   Undecided    Unassigned

Bug Description

Several pod VMs failed to commission on a deploy of FCB.

In rsyslog output, I see errors like this:

May 3 06:29:00 nagios-1 cloud-init[1057]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 1.: HTTP Error 400: BAD REQUEST

In regiond.log, there is this traceback that appears to be associated with the error:

http://paste.ubuntu.com/p/3CBVCfRVzG/

Some other VMs in this deploy successfully commissioned.

This is with 2.3.3-6492-ge999a54-0ubuntu1~16.04.1.


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
summary: - node failed commissioning - HTTP Error 400: BAD REQUEST when requesting
- user data
+ node failed commissioning - HTTP Error 400: {'boot_interface': ["Must be
+ one of the node's interfaces."]}
Revision history for this message
Andres Rodriguez (andreserl) wrote : Re: node failed commissioning - HTTP Error 400: {'boot_interface': ["Must be one of the node's interfaces."]}

Jason,

Can you please attach the 'machines read' of all VMs from the same or other pods? Looking at the code, this is what it says:

        It's possible, though very unlikely, that the boot interface we are
        seeing is already assigned to another Node. If this happens, we need to
        catch the failure as early as possible.

Changed in maas:
status: New → Incomplete
milestone: none → 2.4.0rc1
Revision history for this message
Andres Rodriguez (andreserl) wrote :

"""Check that this Node's boot interface (if present) belongs to this
Node.

It's possible, though very unlikely, that the boot interface we are
seeing is already assigned to another Node. If this happens, we need to
catch the failure as early as possible.
"""

This would imply to me that a VM could be booting off a MAC address that's assigned to another node?
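
For context, the error in the bug title ({'boot_interface': ["Must be one of the node's interfaces."]}) is the kind of message a Django model validation produces, and the MAAS region controller is a Django application. A minimal sketch of what such a check could look like follows; the function and attribute names here are illustrative assumptions, not the actual MAAS source:

    from django.core.exceptions import ValidationError

    def clean_boot_interface(node):
        # Sketch of the check described in the docstring above: if the
        # node has a boot_interface recorded, it must belong to this node.
        boot_interface = node.boot_interface
        if boot_interface is not None and boot_interface.node_id != node.id:
            raise ValidationError(
                {'boot_interface': ["Must be one of the node's interfaces."]})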

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Andres, the logs tarball includes a database dump that has all of the information about machines and MAC addresses in it.

Changed in maas:
status: Incomplete → New
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Right before your error, I see this:

May 3 06:27:18 meinfoo maas.rpc.rackcontrollers: [warn] swoobat: Detected an external DHCP server on an interface with no VLAN defined: 'eno49 (physical) on swoobat': 10.245.208.5
May 3 06:27:20 meinfoo maas.node: [info] quilava: Commissioning started
May 3 06:27:21 meinfoo maas.node: [info] quilava: Deleting node
May 3 06:27:21 meinfoo maas.node: [info] quilava: Deleting my BMC '10 (10.246.56.17:type=STICKY)'
May 3 06:27:21 meinfoo maas.node: [info] 10 (10.246.56.17:type=STICKY): Deleting BMC
May 3 06:27:23 meinfoo maas.api: [info] ralts: Enlisted new machine
May 3 06:27:24 meinfoo maas.node: [info] ralts: Status transition from NEW to COMMISSIONING
May 3 06:27:24 meinfoo maas.node: [info] ralts: Commissioning started
May 3 06:27:25 meinfoo maas.node: [info] ralts: Deleting node
May 3 06:27:25 meinfoo maas.node: [info] ralts: Deleting my BMC '11 (192.168.246.122:type=STICKY)'
May 3 06:27:25 meinfoo maas.node: [info] 11 (192.168.246.122:type=STICKY): Deleting BMC
May 3 06:27:28 meinfoo maas.api: [info] elgyem: Enlisted new machine
May 3 06:27:28 meinfoo maas.node: [info] elgyem: Status transition from NEW to COMMISSIONING
May 3 06:27:29 meinfoo maas.node: [info] elgyem: Commissioning started
May 3 06:27:29 meinfoo maas.node: [info] elgyem: Deleting node
May 3 06:27:30 meinfoo maas.node: [info] elgyem: Deleting my BMC '12 (10.246.56.24:type=STICKY)'
May 3 06:27:30 meinfoo maas.node: [info] 12 (10.246.56.24:type=STICKY): Deleting BMC

It seems like you are adding a machine, it then starts commissioning, but then it gets deleted?

Revision history for this message
Andres Rodriguez (andreserl) wrote :

On the other hand, around the same time I also see this:

May 3 06:28:38 leafeon maas.interface: [info] eno1 (physical) on leafeon: New MAC, IP binding observed: 14:02:ec:41:d7:38, 10.244.40.170
May 3 06:28:40 leafeon maas.node: [error] juju-1: Marking node failed: Commissioning failed, cloud-init reported a failure (refer to the event log for more information)
May 3 06:28:40 leafeon maas.node: [info] juju-1: Status transition from COMMISSIONING to FAILED_COMMISSIONING
May 3 06:28:41 leafeon maas.interface: [info] eno1 (physical) on leafeon: New MAC, IP binding observed: 14:02:ec:42:28:70, 10.244.40.171
May 3 06:28:41 leafeon maas.interface: [info] eno1 (physical) on leafeon: New MAC, IP binding observed: 14:02:ec:41:d7:44, 10.244.40.172
May 3 06:28:42 leafeon maas.node: [info] landscape-3: Status transition from TESTING to READY
May 3 06:28:52 leafeon maas.node: [info] kibana-3: Storage layout was set to flat.
May 3 06:28:52 leafeon maas.node: [info] kibana-3: Status transition from COMMISSIONING to TESTING
May 3 06:28:55 leafeon maas.node: [error] grafana-1: Marking node failed: Commissioning failed, cloud-init reported a failure (refer to the event log for more information)
May 3 06:28:55 leafeon maas.node: [info] grafana-1: Status transition from COMMISSIONING to FAILED_COMMISSIONING

With this specific message:

May 3 06:28:40 leafeon maas.node: [error] juju-1: Marking node failed: Commissioning failed, cloud-init reported a failure (refer to the event log for more information)

Revision history for this message
Andres Rodriguez (andreserl) wrote :

2184 2018-05-03 06:34:24.808986+00 2018-05-03 06:34:24.808986+00 mark_failed 00-maas-01-cpuinfo has run past it's timeout(0:00:10) 14 22

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1768870] Re: node failed commissioning - HTTP Error 400: {'boot_interface': ["Must be one of the node's interfaces."]}

Yes, we add a node and then delete it right afterwards. We've been
doing that for a long time. This failure just started showing up with
2.3.3.

As a reminder, here's how we add nodes in FCE:

We don't have MAC addresses for the nodes, and we don't have direct IPMI
access to the nodes either. To make this work, we have MAAS do all the
heavy lifting.

The steps are as follows:

1) Check if nodes have already been enlisted

2) We add the node to MAAS with correct IPMI credentials and a fake MAC
address (MAAS requires a MAC address).

3) MAAS, prior to returning from the API call to add the machine, issues
the IPMI commands required to PXE boot the machine. It handles this
regardless of the machine's current state.

4) Immediately upon return from the add machine API call, we issue
another API call to delete the machine from MAAS. MAAS does not issue
any power commands in response to this, so the machine continues to
PXE boot, and will show up in MAAS as a 'New' node once enlistment
completes.

5) We poll MAAS for nodes in 'New' state, looking for a machine to match
our IPMI power address. When we find it, we set the proper hostname and
zone on it, and start commissioning.

6) We poll to ensure commissioning completes successfully.

If bug 1707216 were fixed, we could just add the node and MAAS
would handle the rest.
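
For illustration only, a rough Python sketch of steps 2-5 is below. It assumes a hypothetical maas_api() helper that performs an OAuth-signed request against the MAAS 2.x REST API and returns parsed JSON; the endpoint paths, parameter names, and the fake MAC are assumptions based on the MAAS API documentation, not the actual FCE code:

    import time

    FAKE_MAC = "00:16:3e:00:00:01"  # placeholder; MAAS requires some MAC

    def enlist_via_bmc(maas_api, ipmi_address, ipmi_user, ipmi_pass,
                       hostname, zone):
        # Steps 2/3: add the node with IPMI credentials and a fake MAC;
        # MAAS issues the IPMI commands to PXE boot it before returning.
        machine = maas_api("POST", "/api/2.0/machines/", {
            "architecture": "amd64/generic",
            "mac_addresses": FAKE_MAC,
            "power_type": "ipmi",
            "power_parameters_power_address": ipmi_address,
            "power_parameters_power_user": ipmi_user,
            "power_parameters_power_pass": ipmi_pass,
        })
        # Step 4: delete it immediately; the machine keeps PXE booting and
        # later re-appears as a 'New' node once enlistment completes.
        maas_api("DELETE", "/api/2.0/machines/%s/" % machine["system_id"])
        # Step 5: poll for a 'New' node whose power address matches ours,
        # then set hostname/zone and start commissioning (step 6 would
        # keep polling until commissioning finishes).
        while True:
            for node in maas_api("GET", "/api/2.0/machines/"):
                if node["status_name"] != "New":
                    continue
                params = maas_api(
                    "GET", "/api/2.0/machines/%s/?op=power_parameters"
                    % node["system_id"])
                if params.get("power_address") == ipmi_address:
                    maas_api("PUT", "/api/2.0/machines/%s/" % node["system_id"],
                             {"hostname": hostname, "zone": zone})
                    maas_api("POST", "/api/2.0/machines/%s/?op=commission"
                             % node["system_id"])
                    return node
            time.sleep(10)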


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

But we don't do that with VMs - they are all created via pods, and the
failures in this bug are all with VMs, so it doesn't seem related.


Revision history for this message
Blake Rouse (blake-rouse) wrote : Re: node failed commissioning - HTTP Error 400: {'boot_interface': ["Must be one of the node's interfaces."]}

Seems we finally got proper tracebacks for this bug!

Revision history for this message
Andres Rodriguez (andreserl) wrote :

May 3 06:27:50 grafana-1 cloud-init[2108]: 0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
May 3 06:28:54 grafana-1 cloud-init[2108]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 1.: HTTP Error 400: BAD REQUEST
May 3 06:28:54 grafana-1 cloud-init[2108]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 1.: HTTP Error 400: BAD REQUEST
May 3 06:28:54 grafana-1 cloud-init[2108]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 2.: HTTP Error 400: BAD REQUEST
May 3 06:28:54 grafana-1 cloud-init[2108]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 4.: HTTP Error 400: BAD REQUEST
May 3 06:28:54 grafana-1 cloud-init[2108]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 8.: HTTP Error 400: BAD REQUEST
May 3 06:28:54 grafana-1 cloud-init[2108]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 16.: HTTP Error 400: BAD REQUEST
May 3 06:28:54 grafana-1 cloud-init[2108]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 32.: HTTP Error 400: BAD REQUEST
May 3 06:28:54 grafana-1 cloud-init[2108]: FAIL: HTTP error [400]Cloud-init v. 18.2 running 'modules:final' at Thu, 03 May 2018 06:27:47 +0000. Up 170.99 seconds.
May 3 06:28:54 grafana-1 cloud-init[2108]: 2018-05-03 06:28:54,945 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/user_data.sh [1]
May 3 06:28:54 grafana-1 cloud-init[2108]: 2018-05-03 06:28:54,952 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
May 3 06:28:54 grafana-1 cloud-init[2108]: 2018-05-03 06:28:54,960 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
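
The retry pattern in these lines (sleeping 1, 1, 2, 4, 8, 16, 32, then FAIL) is cloud-init backing off exponentially on its metadata requests before giving up. A rough sketch of that behaviour, not cloud-init's actual code, looks like this:

    import time
    import urllib.error
    import urllib.request

    METADATA_URL = "http://10.244.40.33/MAAS/metadata//2012-03-01/"

    def fetch_with_backoff(url, sleeps=(1, 1, 2, 4, 8, 16, 32)):
        # Mirrors the retry schedule visible in the rsyslog excerpt above.
        last_error = None
        for sleep in sleeps:
            try:
                return urllib.request.urlopen(url).read()
            except urllib.error.HTTPError as err:
                last_error = err
                print("request to %s failed. sleeping %d.: %s" % (url, sleep, err))
                time.sleep(sleep)
        raise SystemExit("FAIL: HTTP error [%d]" % last_error.code)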

Revision history for this message
Andres Rodriguez (andreserl) wrote :

The error above would usually indicate that the machine's OAuth key was revoked and cloud-init was unable to send any more messages. But before that, we can see in the MAAS logs that the machine was marked failed because cloud-init reported that something failed.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

May 3 06:27:35 juju-1 cloud-init[2166]: python3-yaml is already the newest version (3.11-3build1).
May 3 06:27:35 juju-1 cloud-init[2166]: The following package was automatically installed and is no longer required:
May 3 06:27:35 juju-1 cloud-init[2166]: libfreetype6
May 3 06:27:35 juju-1 cloud-init[2166]: Use 'apt autoremove' to remove it.
May 3 06:27:35 juju-1 cloud-init[2166]: 0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
May 3 06:28:40 juju-1 cloud-init[2166]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 1.: HTTP Error 400: BAD REQUEST
May 3 06:28:40 juju-1 cloud-init[2166]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 1.: HTTP Error 400: BAD REQUEST
May 3 06:28:40 juju-1 cloud-init[2166]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 2.: HTTP Error 400: BAD REQUEST
May 3 06:28:40 juju-1 cloud-init[2166]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 4.: HTTP Error 400: BAD REQUEST
May 3 06:28:40 juju-1 cloud-init[2166]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 8.: HTTP Error 400: BAD REQUEST
May 3 06:28:40 juju-1 cloud-init[2166]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 16.: HTTP Error 400: BAD REQUEST
May 3 06:28:40 juju-1 cloud-init[2166]: request to http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 32.: HTTP Error 400: BAD REQUEST
May 3 06:28:40 juju-1 cloud-init[2166]: FAIL: HTTP error [400]Cloud-init v. 18.2 running 'modules:final' at Thu, 03 May 2018 06:27:33 +0000. Up 163.99 seconds.
May 3 06:28:40 juju-1 cloud-init[2166]: 2018-05-03 06:28:40,034 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/user_data.sh [1]
May 3 06:28:40 juju-1 cloud-init[2166]: 2018-05-03 06:28:40,038 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
May 3 06:28:40 juju-1 cloud-init[2166]: 2018-05-03 06:28:40,046 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
May 3 06:28:40 juju-1 ec2:

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Same error for another machine.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

So I've dug through the dump and found a few interesting things:

1. A 'Failed Commissioning' machine (landscapeamqp-2) failed because one script ran past its timeout. Judging from the rsyslog, it is not obvious why that would be the case, because the rsyslog truncates at a certain point and doesn't show the whole commissioning process the machine would normally have followed to get to where it got. This suggests to me that there could have been some sort of network issue.

2. Other machines in the 'Commissioning' state have NO interfaces. Since these machines come from a pod, they should have had one interface attached, but they didn't. Given the error in comment #3, I think what could be happening is that there are VMs with duplicated MAC addresses across the different pods.

As such, Jason:

1. This would imply a network issue, so I would recommend you explore the possibility that there was some network breakage at some point.
2. We need logs from libvirt. Could you please start gathering /var/log/libvirt/?

Changed in maas:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1768870] Re: node failed commissioning - HTTP Error 400: {'boot_interface': ["Must be one of the node's interfaces."]}

How could a machine that's PXE booted possibly have no interfaces?
There is something wrong in how that's being determined.


Changed in maas:
milestone: 2.4.0rc1 → 2.4.0rc2
Revision history for this message
Blake Rouse (blake-rouse) wrote : Re: node failed commissioning - HTTP Error 400: {'boot_interface': ["Must be one of the node's interfaces."]}

Yep, duplicate MACs are causing the issue.

May 3 06:25:01 meinfoo maas.pod: [warn] meinfoo: interface with MAC address 52:54:00:84:41:f7 was discovered on machine juju-2 and was moved from juju-1.

May 3 06:25:11 meinfoo maas.pod: [warn] meinfoo: interface with MAC address 52:54:00:3e:5e:bb was discovered on machine thruk-2 and was moved from grafana-1.
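
Given that the root cause is duplicate MAC addresses across pod VMs, one quick way to confirm it from the MAAS side is to look for MACs that appear on more than one machine in the output of 'machines read'. A small hedged sketch (assuming the interface_set/mac_address field names used by the MAAS 2.x machines endpoint):

    from collections import defaultdict

    def find_duplicate_macs(machines):
        # 'machines' is the parsed JSON list returned by 'maas <profile>
        # machines read'. Report every MAC claimed by more than one
        # machine, i.e. the situation the pod log above complains about.
        owners = defaultdict(set)
        for machine in machines:
            for iface in machine.get("interface_set", []):
                owners[iface["mac_address"]].add(machine["hostname"])
        return {mac: sorted(hosts)
                for mac, hosts in owners.items() if len(hosts) > 1}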

Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
Changed in maas:
status: Incomplete → Triaged
importance: Undecided → Critical
importance: Critical → High
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
Revision history for this message
John George (jog) wrote :

Solutions QA testing with ppa:maas/stable for 16.04 still hit this bug. Are there plans to update stable with a fix?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This bug is still present in 2.3.5.

Changed in maas:
status: Fix Released → New
Revision history for this message
Andres Rodriguez (andreserl) wrote :

I'm setting this back to Fix Released. This bug was fixed in 2.4 and, at the time, as per discussion, we agreed that it would only be fixed in 2.4+.

Changed in maas:
status: New → Fix Released
Revision history for this message
Andres Rodriguez (andreserl) wrote :

I've added a task for 2.3 and we will evaluate the likelihood of getting this backported, and the regression potential, before even considering doing the actual backport.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Why is it not being fixed in 2.3?

tags: added: track
summary: - node failed commissioning - HTTP Error 400: {'boot_interface': ["Must be
- one of the node's interfaces."]}
+ node failed commissioning or deployment - HTTP Error 400:
+ {'boot_interface': ["Must be one of the node's interfaces."]}