failure to parse valid JSON commissioning data

Bug #2056336 reported by David Torrey

This bug affects 1 person

Affects  Status   Importance  Assigned to  Milestone
MAAS     Triaged  High        Unassigned
3.2      Triaged  High        Unassigned
3.3      Triaged  High        Unassigned
3.4      Triaged  High        Unassigned

Bug Description

MAAS 3.2.10, upgraded stepwise from 2.9.3

The customer reports that the rackd machines go through a recommissioning step after the upgrade. During this recommissioning, something in the JSON gathered by 20-maas-03-machine-resources is not handled correctly: when MAAS parses it, we see a Python backtrace ending in:

  File "/usr/lib/python3/dist-packages/metadataserver/builtin_scripts/network.py", line 368, in update_vlan_interface
    if parent_nic.vlan.fabric_id != vlan.fabric_id:
builtins.AttributeError: 'NoneType' object has no attribute 'fabric_id'

This rackd server's topology had not changed and was working in the prior MAAS version, so the cause of the error is unclear. The workaround was to completely remove the controller and redeploy it with the same fabric/vlan/subnet assignments.

Ideally, MAAS should never fail commissioning unless there is a true hardware error or an environmental condition that prevents the machine from being deployed. Where a detected network topology does not match the database, the result should be the creation of extra fabrics/VLANs/subnets rather than an error and a backtrace.

This could relate to the VLAN-subnet mapping changes in LP 2031482, which this customer is also experiencing.

Further troubleshooting information is available in the support case, which I'm happy to pass along out of band.

Jorge Merlino (jorge-merlino) wrote :

Just adding some more information to this issue. The problem occurs when running this code in /usr/lib/python3/dist-packages/metadataserver/builtin_scripts/network.py:

def update_vlan_interface(node, name, network, links):
    """Update a VLAN interface.

    :param name: Name of the interface.
    :param network: Network settings from commissioning data.
    """
    vid = network["vlan"]["vid"]
    parent_name = network["vlan"]["lower_device"]
    parent_nic = Interface.objects.get(
        node_config=node.current_config, name=parent_name
    )
    links_vlan = get_interface_vlan_from_links(node, links)
    if links_vlan:
        vlan = links_vlan
        if parent_nic.vlan.fabric_id != vlan.fabric_id:  # <---- fails here

That if condition only decides whether an error message is shown; it should not make commissioning fail. The customer solved the issue by patching this code and removing that if line.
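One way to keep that comparison from aborting commissioning is to treat a parent interface without a VLAN as having nothing to compare. The following is a minimal, self-contained sketch of such a guard, not the actual MAAS fix: the Vlan and Nic classes are stand-ins for the Django models, and fabric_mismatch is a hypothetical helper name introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vlan:
    """Stand-in for the MAAS VLAN model (illustration only)."""
    vid: int
    fabric_id: int


@dataclass
class Nic:
    """Stand-in for the MAAS Interface model (illustration only)."""
    name: str
    vlan: Optional[Vlan] = None


def fabric_mismatch(parent_nic: Nic, vlan: Vlan) -> bool:
    """Return True only when the parent has a VLAN on a different fabric.

    Hypothetical helper: a parent with no VLAN yields False instead of
    raising AttributeError on parent_nic.vlan.fabric_id.
    """
    parent_vlan = parent_nic.vlan
    if parent_vlan is None:
        return False
    return parent_vlan.fabric_id != vlan.fabric_id


links_vlan = Vlan(vid=100, fabric_id=2)
# Parent without a VLAN: no exception, no mismatch reported.
print(fabric_mismatch(Nic("eno1"), links_vlan))
# Parent whose VLAN sits on another fabric: mismatch reported.
print(fabric_mismatch(Nic("bond0", Vlan(vid=0, fabric_id=1)), links_vlan))
```

Whether a parent without a VLAN should be silently accepted or logged is a design decision for the MAAS developers; the sketch only shows that the check can be made without raising.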

Looking at the JSON file, there are four VLAN interfaces: two with lower_device eno1 and two with lower_device bond0. Both of those parent devices have a null vlan. My hypothesis is that this is why parent_nic.vlan can be None, so that accessing parent_nic.vlan.fabric_id raises the AttributeError.
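To cross-check this kind of observation on other controllers, the VLAN interfaces and their parents can be listed directly from the commissioning output. The snippet below is a rough sketch under stated assumptions: it assumes the JSON has a top-level "networks" mapping keyed by interface name, with each entry carrying the "vlan" object that update_vlan_interface reads ("vid" and "lower_device"). The sample data is invented for illustration and is not the customer's file, and vlan_parents is a hypothetical helper name.

```python
import json

# Invented sample shaped like the assumed layout, not real customer data.
sample = json.loads("""
{
  "networks": {
    "eno1":      {"vlan": null},
    "bond0":     {"vlan": null},
    "eno1.100":  {"vlan": {"vid": 100, "lower_device": "eno1"}},
    "bond0.200": {"vlan": {"vid": 200, "lower_device": "bond0"}}
  }
}
""")


def vlan_parents(data):
    """List (interface, vid, parent, parent_present) for each VLAN interface."""
    networks = data.get("networks", {})
    rows = []
    for name, network in sorted(networks.items()):
        vlan = network.get("vlan")
        if not vlan:
            # Entry is not a VLAN interface (its "vlan" is null or missing).
            continue
        parent = vlan["lower_device"]
        rows.append((name, vlan["vid"], parent, parent in networks))
    return rows


for row in vlan_parents(sample):
    print(row)
```

The parent names it reports are the interfaces worth inspecting in the MAAS database, since those are the ones looked up as parent_nic.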

Bill Wear (billwear) wrote :

Your assessment appears to be logically sound, and the if line appears to be in error. Triaging on trust in your excellent work.

Changed in maas:
status: New → Triaged
Bill Wear (billwear)
Changed in maas:
importance: Undecided → High
milestone: none → 3.2.x

Jerzy Husakowski (jhusakowski) wrote :

Is this issue still reproducible once the fix for LP 2031482 lands? Would it be possible to test it?

Changed in maas:
milestone: 3.2.x → 3.5.x