Comment 9 for bug 1768870

Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1768870] Re: node failed commissioning - HTTP Error 400: {'boot_interface': ["Must be one of the node's interfaces."]}

But we don't do that with VMs - they are all created via pods, and the
failures in this bug are all with VMs, so it doesn't seem related.

On Thu, May 3, 2018 at 12:00 PM, Jason Hobbs <email address hidden> wrote:
> Yes, we add a node and then delete it right afterwards. We've been
> doing that for a long time. This failure just started showing up with
> 2.3.3.
>
> As a reminder, here's how we add nodes in FCE:
>
> We don't have MAC addresses for the nodes, and we don't have direct IPMI
> access to the nodes either. To make this work, we have MAAS do all the
> heavy lifting.
>
> The steps are as follows:
>
> 1) We check whether the nodes have already been enlisted.
>
> 2) We add the node to MAAS with correct IPMI credentials and a fake MAC
> address (MAAS requires a MAC address).
>
> 3) MAAS, prior to returning from the API call to add the machine, issues
> the IPMI commands required to PXE boot the machine. It handles this
> regardless of the machine's current state.
>
> 4) Immediately upon return from the add machine API call, we issue
> another API call to delete the machine from MAAS. MAAS does not issue
> any power commands in response to this, so the machine continues to
> PXE boot, and will show up in MAAS as a 'New' node once enlistment
> completes.
>
> 5) We poll MAAS for nodes in 'New' state, looking for a machine to match
> our IPMI power address. When we find it, we set the proper hostname and
> zone on it, and start commissioning.
>
> 6) We poll to ensure commissioning completes successfully.
>
> If bug 1707216 were fixed, we could just add the node and MAAS
> would handle the rest.
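
The six steps above can be sketched roughly as follows. This is only an illustration of the add-then-delete flow, not FCE's actual code: `MaasClient` and all of its method names are hypothetical stand-ins for whatever MAAS API wrapper is in use, and the status strings, the placeholder MAC, and the timeouts are assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Protocol

# Assumed status labels; adjust to whatever the MAAS API actually returns.
NEW, READY, FAILED = "New", "Ready", "Failed commissioning"


@dataclass
class Node:
    system_id: str
    status: str
    power_address: str


class MaasClient(Protocol):
    """Hypothetical wrapper interface; these are NOT real MAAS method names."""

    def list_nodes(self) -> List[Node]: ...
    def add_machine(self, mac: str, power_address: str,
                    power_user: str, power_pass: str) -> Node: ...
    def delete_machine(self, system_id: str) -> None: ...
    def update_machine(self, system_id: str, hostname: str, zone: str) -> None: ...
    def commission(self, system_id: str) -> None: ...


def find_by_power_address(client: MaasClient, power_address: str,
                          status: Optional[str] = None) -> Optional[Node]:
    """Return the first node with this power address (and status, if given)."""
    for node in client.list_nodes():
        if node.power_address == power_address and status in (None, node.status):
            return node
    return None


def poll(check: Callable[[], Optional[Node]], timeout: float = 1800.0,
         interval: float = 10.0, sleep: Callable[[float], None] = time.sleep) -> Node:
    """Call check() until it returns a node or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before timeout")
        sleep(interval)


def enlist(client: MaasClient, power_address: str, power_user: str,
           power_pass: str, hostname: str, zone: str,
           fake_mac: str = "02:00:00:00:00:01", **poll_kwargs) -> Node:
    # 1) Skip machines that are already enlisted.
    existing = find_by_power_address(client, power_address)
    if existing is not None:
        return existing

    # 2) + 3) Add the machine with real IPMI credentials and a placeholder
    # MAC; per the description above, MAAS PXE boots it before returning.
    placeholder = client.add_machine(fake_mac, power_address, power_user, power_pass)

    # 4) Delete the placeholder right away; the machine keeps PXE booting.
    client.delete_machine(placeholder.system_id)

    # 5) Wait for the machine to enlist itself as 'New', then name it,
    # assign its zone, and start commissioning.
    node = poll(lambda: find_by_power_address(client, power_address, NEW),
                **poll_kwargs)
    client.update_machine(node.system_id, hostname, zone)
    client.commission(node.system_id)

    # 6) Wait for commissioning to finish, failing fast on a failed status.
    def commissioned() -> Optional[Node]:
        current = find_by_power_address(client, power_address)
        if current is not None and current.status == FAILED:
            raise RuntimeError(f"{hostname}: commissioning failed")
        return current if current is not None and current.status == READY else None

    return poll(commissioned, **poll_kwargs)
```

Matching on the power address in step 5 is what ties the self-enlisted 'New' node back to the machine that was added and deleted, since the real MAC addresses are not known up front.
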
>
> On Thu, May 3, 2018 at 11:42 AM, Andres Rodriguez
> <email address hidden> wrote:
>> On the other hand, around the same time I also see this:
>>
>>
>> May 3 06:28:38 leafeon maas.interface: [info] eno1 (physical) on leafeon: New MAC, IP binding observed: 14:02:ec:41:d7:38, 10.244.40.170
>> May 3 06:28:40 leafeon maas.node: [error] juju-1: Marking node failed: Commissioning failed, cloud-init reported a failure (refer to the event log for more information)
>> May 3 06:28:40 leafeon maas.node: [info] juju-1: Status transition from COMMISSIONING to FAILED_COMMISSIONING
>> May 3 06:28:41 leafeon maas.interface: [info] eno1 (physical) on leafeon: New MAC, IP binding observed: 14:02:ec:42:28:70, 10.244.40.171
>> May 3 06:28:41 leafeon maas.interface: [info] eno1 (physical) on leafeon: New MAC, IP binding observed: 14:02:ec:41:d7:44, 10.244.40.172
>> May 3 06:28:42 leafeon maas.node: [info] landscape-3: Status transition from TESTING to READY
>> May 3 06:28:52 leafeon maas.node: [info] kibana-3: Storage layout was set to flat.
>> May 3 06:28:52 leafeon maas.node: [info] kibana-3: Status transition from COMMISSIONING to TESTING
>> May 3 06:28:55 leafeon maas.node: [error] grafana-1: Marking node failed: Commissioning failed, cloud-init reported a failure (refer to the event log for more information)
>> May 3 06:28:55 leafeon maas.node: [info] grafana-1: Status transition from COMMISSIONING to FAILED_COMMISSIONING
>>
>>
>> With this specific message:
>>
>> May 3 06:28:40 leafeon maas.node: [error] juju-1: Marking node failed:
>> Commissioning failed, cloud-init reported a failure (refer to the event
>> log for more information)
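
For anyone triaging similar logs, here is a minimal sketch for listing the nodes that moved to FAILED_COMMISSIONING. It assumes the syslog line format shown in the excerpt above; the function name and the regular expression are illustrative and not part of MAAS.

```python
import re

# Matches lines such as:
#   "... maas.node: [info] juju-1: Status transition from COMMISSIONING to FAILED_COMMISSIONING"
TRANSITION = re.compile(
    r"maas\.node: \[info\] (?P<node>\S+): "
    r"Status transition from (?P<old>\w+) to (?P<new>\w+)"
)


def failed_commissioning(lines):
    """Return node names, in log order, that transitioned to FAILED_COMMISSIONING."""
    failed = []
    for line in lines:
        match = TRANSITION.search(line)
        if match and match.group("new") == "FAILED_COMMISSIONING":
            failed.append(match.group("node"))
    return failed


sample = [
    "May 3 06:28:40 leafeon maas.node: [info] juju-1: Status transition from COMMISSIONING to FAILED_COMMISSIONING",
    "May 3 06:28:42 leafeon maas.node: [info] landscape-3: Status transition from TESTING to READY",
    "May 3 06:28:52 leafeon maas.node: [info] kibana-3: Status transition from COMMISSIONING to TESTING",
    "May 3 06:28:55 leafeon maas.node: [info] grafana-1: Status transition from COMMISSIONING to FAILED_COMMISSIONING",
]
print(failed_commissioning(sample))  # ['juju-1', 'grafana-1']
```
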
>>
>> https://bugs.launchpad.net/bugs/1768870
>>
>> Title:
>> node failed commissioning - HTTP Error 400: {'boot_interface': ["Must
>> be one of the node's interfaces."]}
>>
>> Status in MAAS:
>> New
>>
>> Bug description:
>> Several pod VMs failed to commission on a deploy of FCB.
>>
>> In rsyslog output, I see errors like this:
>>
>> May 3 06:29:00 nagios-1 cloud-init[1057]: request to
>> http://10.244.40.33/MAAS/metadata//2012-03-01/ failed. sleeping 1.:
>> HTTP Error 400: BAD REQUEST
>>
>> In regiond.log, there is this traceback that appears to be associated
>> with the error:
>>
>> http://paste.ubuntu.com/p/3CBVCfRVzG/
>>
>> Some other VMs in this deploy successfully commissioned.
>>
>> This is with 2.3.3-6492-ge999a54-0ubuntu1~16.04.1.