MAAS is wiping out network config

Bug #1852678 reported by Jeff Lane 
This bug affects 2 people
Affects                           Status     Importance  Assigned to  Milestone
MAAS                              Won't Fix  Medium      Unassigned
The Ubuntu-power-systems project  Won't Fix  Medium      bugproxy

Bug Description

I'm on a MAAS server attempting to deploy a node with 4 network ports. I configured all four of them and started a deployment of 19.10. However, during deployment, MAAS wipes out the configuration I had set, returning three of four network devices back to Unconfigured.

I've only seen this on this one machine, oddly enough, but MAAS is silently resetting the config and gives no indication in the web UI of why it's doing so; there are no errors.

I do not have access to this box for logs; it is the Server Team's Power MAAS environment.

Revision history for this message
Jeff Lane  (bladernr) wrote :

Screenshot of the interface configuration I set before deployment (note this is right after I started deployment and it shows the IP addresses MAAS has assigned to each interface)

Revision history for this message
Jeff Lane  (bladernr) wrote :

Screenshot of the interface configuration well into deployment where MAAS has wiped my config and reset three of four interfaces back to Unconfigured.

Frank Heimes (fheimes)
tags: added: ppc64el
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
assignee: nobody → MAAS (maas)
Revision history for this message
Lee Trager (ltrager) wrote :

What version of MAAS are you using? Can you post the machine output from the API (maas $PROFILE machine read $SYSTEM_ID)?
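
For readers comparing the "before" and "during deployment" outputs of that command, a short script can reduce the JSON to one line per interface. This is a minimal sketch and not part of the original report. The field names (`interface_set`, `links`, `mode`, `subnet`, `cidr`) are assumptions about the MAAS 2.x machine JSON and should be checked against real output; the sample data is invented.

```python
import json

# Hypothetical, heavily trimmed sample of `maas $PROFILE machine read $SYSTEM_ID`.
# Field names are assumptions; verify them against your own MAAS output.
SAMPLE = """
{
  "hostname": "example-node",
  "interface_set": [
    {"name": "eth0", "mac_address": "00:00:00:00:00:01",
     "links": [{"mode": "auto", "subnet": {"cidr": "10.0.0.0/24"}}]},
    {"name": "eth1", "mac_address": "00:00:00:00:00:02",
     "links": [{"mode": "link_up"}]}
  ]
}
"""

def summarize_interfaces(machine):
    """Return (name, mac, mode, cidr) rows, one per configured link.

    Assumption: an interface with no link that carries a subnet is what
    the web UI shows as Unconfigured.
    """
    rows = []
    for iface in machine.get("interface_set", []):
        configured = [l for l in iface.get("links") or [] if l.get("subnet")]
        if not configured:
            rows.append((iface["name"], iface["mac_address"], "unconfigured", None))
        for link in configured:
            rows.append((iface["name"], iface["mac_address"],
                         link.get("mode"), link["subnet"].get("cidr")))
    return rows

if __name__ == "__main__":
    for row in summarize_interfaces(json.loads(SAMPLE)):
        print(row)
```

Running it against both saved outputs and diffing the results would show which interfaces lost their subnet link.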

Changed in maas:
status: New → Incomplete
Revision history for this message
Frank Heimes (fheimes) wrote :

Not exactly what you want, but at least some more version info taken from the UI:
MAAS name: power8-maas MAAS
MAAS version: 2.6.0 (7802-g59416a869-0ubuntu1~18.04.1)

Changed in ubuntu-power-systems:
status: New → Triaged
Revision history for this message
Jeff Lane  (bladernr) wrote :

Hi Lee,

this is the machine output before deployment with everything configured.

Revision history for this message
Jeff Lane  (bladernr) wrote :

uhhh... scratch that, wrong file, that's the original screen shot. THIS is the machine output.

Revision history for this message
Jeff Lane  (bladernr) wrote :

And this is the output during deployment after the interfaces are reset to Unconfigured.

Changed in maas:
status: Incomplete → New
Revision history for this message
Jeff Lane  (bladernr) wrote :

OK so I'm a bit confused. I reconfigured all the ports to use the same subnet, and this time it worked.

But this is just weird, because OTHER machines have deployed fine with mixed subnets, such as the one in the file attached to this comment.

Revision history for this message
Lee Trager (ltrager) wrote :

This looks like it is getting reset in the backend and is not a UI issue. It's very difficult for me to see what's going on, as I can't reproduce this locally and don't have access to logs. Is there any way you can get logs or give me access to the system so I can poke around and see what's happening?

Revision history for this message
Jeff Lane  (bladernr) wrote :

This was on the Server Team's Power8 MAAS infra. Unfortunately, it's not my environment, so I'm not able to give you access. Perhaps Josh Powers could get you that access?

Revision history for this message
Lee Trager (ltrager) wrote :

I am unable to reproduce this. I configured thiel to auto assign every interface on the system and deployed 19.10. IP addresses were assigned as expected. I confirmed that the host was configured as expected as well.

ubuntu@thiel:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="19.10 (Eoan Ermine)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 19.10"
VERSION_ID="19.10"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=eoan
UBUNTU_CODENAME=eoan
ubuntu@thiel:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enP2p1s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 0c:c4:7a:89:f0:64 brd ff:ff:ff:ff:ff:ff
    inet 10.245.71.140/21 brd 10.245.71.255 scope global enP2p1s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::ec4:7aff:fe89:f064/64 scope link
       valid_lft forever preferred_lft forever
3: enP2p1s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 0c:c4:7a:89:f0:65 brd ff:ff:ff:ff:ff:ff
    inet 10.245.71.139/21 brd 10.245.71.255 scope global enP2p1s0f1
       valid_lft forever preferred_lft forever
    inet6 fe80::ec4:7aff:fe89:f065/64 scope link
       valid_lft forever preferred_lft forever
4: enP2p1s0f2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 0c:c4:7a:89:f0:66 brd ff:ff:ff:ff:ff:ff
    inet 10.245.71.177/21 brd 10.245.71.255 scope global enP2p1s0f2
       valid_lft forever preferred_lft forever
    inet6 fe80::ec4:7aff:fe89:f066/64 scope link
       valid_lft forever preferred_lft forever
5: enP2p1s0f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 0c:c4:7a:89:f0:67 brd ff:ff:ff:ff:ff:ff
    inet 10.245.71.162/21 brd 10.245.71.255 scope global enP2p1s0f3
       valid_lft forever preferred_lft forever
    inet6 fe80::ec4:7aff:fe89:f067/64 scope link
       valid_lft forever preferred_lft forever

Changed in maas:
status: New → Incomplete
Revision history for this message
Jeff Lane  (bladernr) wrote :

I recreated this immediately. Please review the screenshots and configure Thiel as I have in this new screenshot.

Revision history for this message
Jeff Lane  (bladernr) wrote :

And it only took a couple minutes for it to reset the config per this screenshot.

Changed in maas:
status: Incomplete → Confirmed
Changed in ubuntu-power-systems:
importance: Undecided → Medium
Changed in maas:
importance: Undecided → Medium
Revision history for this message
Lee Trager (ltrager) wrote :

Looking at our two screenshots, the only difference seems to be that I was able to deploy all interfaces on 10.245.71.0/21 while you tried to deploy 3 interfaces on 192.168.122.0/24. When I configured the machine, I believe 10.245.71.0/21 was the default network and the only one I could choose.

Which network should these interfaces be on?
Have you tried recommissioning the machine?
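
As a side note on the subnets named above: "10.245.71.0/21" is not a network address in the strict sense. With a /21 prefix, the addresses shown in the earlier `ip addr` output (broadcast 10.245.71.255) belong to 10.245.64.0/21. A minimal sketch using Python's standard `ipaddress` module shows the calculation; the helper name `network_of` is only illustrative.

```python
import ipaddress

def network_of(address_with_prefix):
    """Return the canonical network that contains an 'ip/prefix' address."""
    return ipaddress.ip_interface(address_with_prefix).network

if __name__ == "__main__":
    # Address taken from the `ip addr` output earlier in this report.
    print(network_of("10.245.71.140/21"))
    # The subnet as written in the comment above normalizes to the same network.
    print(network_of("10.245.71.0/21"))
    # The second subnet mentioned in the comment is a separate network.
    other = ipaddress.ip_network("192.168.122.0/24")
    print(other.overlaps(network_of("10.245.71.140/21")))
```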

Revision history for this message
Lee Trager (ltrager) wrote :

I poked around in the logs and found what appears to be happening. I think the VLAN isn't being updated until the netplan config is generated. When that happens MAAS deletes all assigned IPs. Will dig in further to see what the fix is tomorrow.

2020-01-14 07:49:33 maasserver.region_controller: [info] Reloaded DNS configuration:
         * ip 10.245.71.163 allocated
         * ip 192.168.122.4 allocated
         * ip 192.168.123.3 allocated
         * ip 192.168.122.3 allocated
2020-01-14 07:49:56 regiond: [info] 10.245.71.3 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2020-01-14 07:50:26 regiond: [info] 10.245.71.3 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2020-01-14 07:50:56 regiond: [info] 10.245.71.3 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2020-01-14 07:51:17 maasserver.regiondservices.active_discovery: [info] Active network discovery: Active scanning is not enabled on any subnet. Skipping periodic scan.
2020-01-14 07:51:26 regiond: [info] 10.245.71.3 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2020-01-14 07:51:28 maasserver.models.signals.interfaces: [info] enP2p1s0f0 (physical) on thiel: deleted IP addresses due to VLAN update (5002 -> 0).
2020-01-14 07:51:28 maasserver.models.signals.interfaces: [info] enP2p1s0f2 (physical) on thiel: deleted IP addresses due to VLAN update (5002 -> 0).
2020-01-14 07:51:28 maasserver.models.signals.interfaces: [info] enP2p1s0f1 (physical) on thiel: deleted IP addresses due to VLAN update (5002 -> 0).
2020-01-14 07:51:28 maasserver.models.signals.interfaces: [info] enP2p1s0f2 (physical) on thiel: deleted IP addresses due to VLAN update (5002 -> 0).
2020-01-14 07:51:28 maasserver.models.signals.interfaces: [info] enP2p1s0f1 (physical) on thiel: deleted IP addresses due to VLAN update (5002 -> 0).
2020-01-14 07:51:28 maasserver.models.signals.interfaces: [info] enP2p1s0f1 (physical) on thiel: deleted IP addresses due to VLAN update (5002 -> 0).
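
To spot this pattern in a longer regiond log, the "deleted IP addresses due to VLAN update" messages can be counted per interface. The following is a rough sketch; the regular expression is written against the line format shown in the excerpt above and may need adjusting for other MAAS versions.

```python
import re
from collections import Counter

# Pattern derived from the log lines quoted above; other versions may differ.
VLAN_RESET = re.compile(
    r"^(?P<ts>\S+ \S+) maasserver\.models\.signals\.interfaces: \[info\] "
    r"(?P<iface>\S+) \((?P<type>\w+)\) on (?P<host>[^\s:]+): "
    r"deleted IP addresses due to VLAN update \((?P<old>\d+) -> (?P<new>\d+)\)\."
)

# A subset of the lines from the excerpt above.
SAMPLE_LINES = [
    "2020-01-14 07:51:28 maasserver.models.signals.interfaces: [info] enP2p1s0f0 (physical) on thiel: deleted IP addresses due to VLAN update (5002 -> 0).",
    "2020-01-14 07:51:28 maasserver.models.signals.interfaces: [info] enP2p1s0f2 (physical) on thiel: deleted IP addresses due to VLAN update (5002 -> 0).",
    "2020-01-14 07:51:28 maasserver.models.signals.interfaces: [info] enP2p1s0f1 (physical) on thiel: deleted IP addresses due to VLAN update (5002 -> 0).",
    "2020-01-14 07:51:28 maasserver.models.signals.interfaces: [info] enP2p1s0f1 (physical) on thiel: deleted IP addresses due to VLAN update (5002 -> 0).",
]

def count_vlan_resets(lines):
    """Count VLAN-update IP deletions per (host, interface)."""
    counts = Counter()
    for line in lines:
        match = VLAN_RESET.match(line.strip())
        if match:
            counts[(match["host"], match["iface"])] += 1
    return counts

if __name__ == "__main__":
    for (host, iface), n in sorted(count_vlan_resets(SAMPLE_LINES).items()):
        print(host, iface, n)
```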

Revision history for this message
Lee Trager (ltrager) wrote :

I see what is happening now. On boot all four interfaces are requesting pxelinux.cfg at once using the same IP.

2020-01-23 02:50:58 provisioningserver.rackdservices.tftp: [info] /ppc64el/pxelinux.cfg/01-0c-c4-7a-89-f0-67 requested by 10.245.71.191
2020-01-23 02:50:58 provisioningserver.rackdservices.tftp: [info] /ppc64el/pxelinux.cfg/01-0c-c4-7a-89-f0-67 requested by 10.245.71.191
2020-01-23 02:50:59 provisioningserver.rackdservices.tftp: [info] /ppc64el/pxelinux.cfg/01-0c-c4-7a-89-f0-65 requested by 10.245.71.191
2020-01-23 02:50:59 provisioningserver.rackdservices.tftp: [info] /ppc64el/pxelinux.cfg/01-0c-c4-7a-89-f0-64 requested by 10.245.71.191
2020-01-23 02:50:59 provisioningserver.rackdservices.tftp: [info] /ppc64el/pxelinux.cfg/01-0c-c4-7a-89-f0-66 requested by 10.245.71.191
2020-01-23 02:50:59 provisioningserver.rackdservices.tftp: [info] /ppc64el/pxelinux.cfg/01-0c-c4-7a-89-f0-64 requested by 10.245.71.191
2020-01-23 02:50:59 provisioningserver.rackdservices.tftp: [info] /ppc64el/pxelinux.cfg/01-0c-c4-7a-89-f0-66 requested by 10.245.71.191
2020-01-23 02:51:28 provisioningserver.rackdservices.http: [info] /images/ubuntu/ppc64el/ga-19.10/eoan/daily/boot-kernel requested by 10.245.71.191
2020-01-23 02:51:28 provisioningserver.rackdservices.http: [info] /images/ubuntu/ppc64el/ga-19.10/eoan/daily/boot-initrd requested by 10.245.71.191
2020-01-23 02:51:56 provisioningserver.rackdservices.http: [info] /images/ubuntu/ppc64el/ga-19.10/eoan/daily/squashfs requested by 10.245.71.191
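
The key observation in this log is that four different MAC-based config files are requested from a single source IP. PXELINUX names these files `01-` (the Ethernet hardware type) followed by the MAC address with dashes. A small sketch that groups requested MACs by source IP makes this easy to check in a longer rackd log; the regular expression is based only on the lines quoted above.

```python
import re
from collections import defaultdict

# Matches ".../pxelinux.cfg/01-<mac-with-dashes> requested by <ip>".
TFTP_REQUEST = re.compile(
    r"/pxelinux\.cfg/01-(?P<mac>(?:[0-9a-f]{2}-){5}[0-9a-f]{2}) "
    r"requested by (?P<ip>\S+)$"
)

# One line per MAC, taken from the excerpt above.
SAMPLE_LINES = [
    "2020-01-23 02:50:58 provisioningserver.rackdservices.tftp: [info] /ppc64el/pxelinux.cfg/01-0c-c4-7a-89-f0-67 requested by 10.245.71.191",
    "2020-01-23 02:50:59 provisioningserver.rackdservices.tftp: [info] /ppc64el/pxelinux.cfg/01-0c-c4-7a-89-f0-65 requested by 10.245.71.191",
    "2020-01-23 02:50:59 provisioningserver.rackdservices.tftp: [info] /ppc64el/pxelinux.cfg/01-0c-c4-7a-89-f0-64 requested by 10.245.71.191",
    "2020-01-23 02:50:59 provisioningserver.rackdservices.tftp: [info] /ppc64el/pxelinux.cfg/01-0c-c4-7a-89-f0-66 requested by 10.245.71.191",
]

def macs_by_source_ip(lines):
    """Map each requesting IP to the set of MACs whose config it asked for."""
    seen = defaultdict(set)
    for line in lines:
        match = TFTP_REQUEST.search(line.strip())
        if match:
            seen[match["ip"]].add(match["mac"].replace("-", ":"))
    return seen

if __name__ == "__main__":
    for ip, macs in macs_by_source_ip(SAMPLE_LINES).items():
        print(ip, sorted(macs))
```

A source IP that maps to more than one MAC is the symptom discussed in this bug.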

MAAS keeps track of the interface being used for booting as well as the VLAN booting happens on. Because each device is requesting pxelinux.cfg, MAAS sets the boot_interface to each device. MAAS sees that the request came in on a VLAN other than the one that interface is configured for and updates it. Updating a VLAN causes all IP information to be automatically deleted.

Machines normally request boot information one device at a time, not in parallel, which allows MAAS's algorithm to work.
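
The rule described above can be illustrated with a toy model. This is not MAAS code: the class, function names, VLAN numbers, and addresses are all hypothetical, and the sketch only mirrors the behaviour as described in this comment (a VLAN mismatch on a boot request updates the VLAN and drops the interface's IPs).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Interface:
    name: str
    vlan: int
    ips: List[str] = field(default_factory=list)

def handle_boot_request(iface: Interface, observed_vlan: int) -> bool:
    """Apply the described rule; return True if the IPs were dropped."""
    if iface.vlan == observed_vlan:
        return False
    iface.vlan = observed_vlan
    iface.ips.clear()  # a VLAN change discards the assigned addresses
    return True

def simulate_parallel_boot() -> List[Interface]:
    # Hypothetical setup: one NIC on the boot VLAN (10), three on VLAN 20.
    nics = [
        Interface("nic0", 10, ["10.0.0.10"]),
        Interface("nic1", 20, ["192.168.0.11"]),
        Interface("nic2", 20, ["192.168.0.12"]),
        Interface("nic3", 20, ["192.168.0.13"]),
    ]
    # A config file is requested for every NIC, and every request is
    # seen arriving on VLAN 10.
    for nic in nics:
        handle_boot_request(nic, observed_vlan=10)
    return nics

if __name__ == "__main__":
    for nic in simulate_parallel_boot():
        print(nic.name, nic.vlan, nic.ips or "unconfigured")
```

Under these assumptions, three of the four interfaces end up without addresses, which matches the symptom in the original report.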

* Why are all interfaces requesting boot information at once?
* Why are all requests coming in using the same IP on 10.245.64.0/21 from 0c:c4:7a:89:f0:67?
* Can you try updating the firmware?

Changed in maas:
status: Confirmed → Incomplete
Revision history for this message
Frank Heimes (fheimes) wrote :

I checked the firmware and it looks like it's the recommended "prod" level.
Petitboot System Information:
 System type: 8001-22C
 System id: C829UAF04B10265
 Primary platform versions:
        open-power-IBM-P8DTU-V2.00.GA2.SP1-20180105-prod
        op-build-4059438
        hostboot-7fdfb37
        occ-301b535
        skiboot-5.4.2-2a21b57
        linux-4.4.24-openpower1-48c3582
        petitboot-v1.4.0-ee0f918
        p8dtu-xml-04e8a01
 BMC current side:
        Device ID: 0x20
        Device Rev: 0x1
        Firmware version: 1.27.00000
        IPMI version: 2

Since I booted manually into Petitboot anyway, I verified what's possible there to avoid PXE boot from multiple interfaces. And it looks like it can be configured / restricted.

I changed the Petitboot settings to only allow network boot from enP2p1s0f0 and to only allow DHCP on that same single interface. It now looks like this:

 Petitboot (v1.4.0-ee0f918) 8001-22C C829UAF04B10265
 ──────────────────────────────────────────────────────────────────────────────
  [Network: enP2p1s0f0 / 0c:c4:7a:89:f0:64]
    execute
    netboot enP2p1s0f0 (pxelinux.0)
  [Disk: sda2 / bdbebffe-6fa8-4783-a82b-3b470dd78440]
    Ubuntu, with Linux 5.4.0-12-generic (recovery mode)
    Ubuntu, with Linux 5.4.0-12-generic
    Ubuntu

  System information
  System configuration
  System status log
  Language
  Rescan devices
  Retrieve config from URL
 *Exit to shell

After booting (an already existing) test Ubuntu from disk, I was able to verify that all interfaces are (still) there (as expected, just double-checked).

I then commissioned the system again and deployed it using MAAS - it all worked fine.
And afaics it only did PXE from one interface.

So it 'seems' like it is now fixed with the Petitboot re-config.

So would you be able to re-try, Jeff?

PS: I think we haven't faced such an issue before because we usually have only one port connected to the network, so as not to waste too many switch ports. But for this machine it was recently requested to have all ports connected...

Changed in maas:
status: Incomplete → New
Revision history for this message
Frank Heimes (fheimes) wrote :

I just released 'thiel' again ...

Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Looking at Lee's comment #16, if I'm reading that correctly, four interfaces with different MAC addresses appear to be using the same IP address to request pxelinux.cfg in parallel.

What seems odd to me, is that the four interfaces (with different MAC addresses) are using the same IP address. Is it possible to see from the MAAS server's DHCP log whether the four different MAC addresses have been allocated different IP addresses?

Revision history for this message
Lee Trager (ltrager) wrote :

In #15 you can see MAAS allocates four different IP addresses on the correct subnets. For whatever reason the firmware is using the same IP for all interfaces. I'm marking this as needs more information as from what I can tell this is a firmware issue.

Changed in maas:
status: New → Incomplete
Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

To summarise this issue:
* when the 4 NICs are all configured to PXEBoot and are connected to the same vlan, even though they have been assigned different IP addresses, during the PXEBoot sequence all NICs appear to request pxelinux.cfg using the same IP address (see comment #16).
* when PXEBoot is disabled on all but one of the NICs, it works fine.

The question is whether this is a firmware issue with the NIC, or whether this is expected behaviour?

Changed in ubuntu-power-systems:
assignee: MAAS (maas) → bugproxy (bugproxy)
Frank Heimes (fheimes)
tags: added: reverse-proxy-bugzilla
Revision history for this message
Mike Ranweiler (mranweil) wrote :

This is a Briggs - can you also list the BMC and pnor version? Or if you just copy the firmware revision/time and PNOR info from the BMC gui that helps.

Revision history for this message
Frank Heimes (fheimes) wrote :

This is the data from the BMC:

System:

System Firmware Revision : 01.27
IP address : <IP address>
Firmware Build Time : 20170802
BMC MAC address : 0c:c4:7a:68:2d:d9
PNOR Build Time : 20180105
CPLD Version : B2.81.01

FRU Reading:

FRU Information
FRU Device ID: 
Chassis Info:
Chassis Type: Other
Chassis Part Number: 8001-22C
Chassis Serial Number: C829UAF04B10265
Board Info:
Language: English
MfgDateTime: 1996/01/01 00:00:00
Board Manufacturer: IBM
Board Product Name: P8DTU
Board Serial Num: OM166S007292
Board Part Num: P8DTU
Product Info:
Language: English
Manufacturer Name: IBM
Product Name: P8DTU-IBM-2
Product PartNum: SSP-6028U-ENR4T-05-IB001
Product Version: NONE
Product SerialNum: S234158X6625811
AssetTag: NONE
FRU File ID:
Custom Product Info:

and this is the IPMI view:

# fru print 47
 Product Name : OpenPOWER Firmware
 Product Version : open-power-IBM-P8DTU-V2.00.GA2.SP1-20180105-prod
 Product Extra : op-build-4059438
 Product Extra : hostboot-7fdfb37
 Product Extra : occ-301b535
 Product Extra : skiboot-5.4.2-2a21b57
 Product Extra : linux-4.4.24-openpower1-48c3582
 Product Extra : petitboot-v1.4.0-ee0f918
 Product Extra : p8dtu-xml-04e8a01

# fru print 0
 Chassis Type : Other
 Chassis Part Number : 8001-22C
 Chassis Serial : C829UAF04B10265
 Board Mfg Date : Mon Jan 1 01:00:00 1996
 Board Mfg : IBM
 Board Product : P8DTU
 Board Serial : OM166S007292
 Board Part Number : P8DTU
 Product Manufacturer : IBM
 Product Name : P8DTU-IBM-2
 Product Part Number : SSP-6028U-ENR4T-05-IB001
 Product Version : NONE
 Product Serial : S234158X6625811
 Product Asset Tag : NONE

Changed in ubuntu-power-systems:
status: Triaged → Incomplete
Revision history for this message
Oliver O'Halloran (oohal) wrote :

> The question is whether this is a firmware issue with the NIC, or whether this is expected behaviour?

For background, on Power servers with the OPAL firmware (such as this one) we use an embedded Linux distribution as the boot environment. All the PXE interactions are handled by petitboot which runs in userspace rather than a PXE ROM on the NIC itself.

When no specific interface is assigned as the boot device petitboot will request an IP via DHCP on every available interface and assign it as you'd expect. You're seeing all the tftp requests from the same IP because petitboot doesn't bind the tftp client to the "correct" NIC when requesting the config file for its MAC. I'll agree that's a bit odd and probably a bug on our end, but that is the default behaviour.

Anyway, I think you'll either need to document this as a limitation which requires the user to configure a specific NIC as the boot device or re-think how things are handled on the MAAS end. If we patch petitboot to bind the tftp request to the "correct NIC" it'll still depend on SuperMicro shipping a firmware update for the system to update the bootloader since the bootloader is signed and a part of the firmware secureboot chain.
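
To illustrate what "binding" a client to a particular address means in general, here is a minimal sketch using Python's standard `socket` module. It is not petitboot code and says nothing about how petitboot's tftp client is implemented. It only shows that a UDP socket bound to a local address sends from that address instead of whichever one the routing table would pick. The function name is illustrative, and 127.0.0.1 is used so that the sketch runs on any machine.

```python
import socket

def udp_socket_from(local_ip):
    """Create a UDP socket whose outgoing packets use local_ip as source."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((local_ip, 0))  # port 0 lets the OS choose a free source port
    return sock

if __name__ == "__main__":
    # On a multi-NIC host, local_ip would be the address that DHCP
    # assigned to the specific interface being booted.
    with udp_socket_from("127.0.0.1") as sock:
        print(sock.getsockname())
```

Binding to an address fixes the source IP but does not by itself force the egress interface; on Linux that would additionally need something like the `SO_BINDTODEVICE` socket option, which normally requires elevated privileges.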

Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Thanks Oliver. It sounds like fixing this issue in firmware will take some time, and documenting the issue may be a more immediate solution.

@MAAS team, could this be added to a Power MAAS discourse page?

Changed in maas:
status: Incomplete → Triaged
Changed in ubuntu-power-systems:
status: Incomplete → Triaged
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: Triaged → Won't Fix
Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Thanks to Bill for drafting and publishing the following on the MAAS team discourse page: https://maas.io/docs/tips-tricks-and-traps.

As we understand from Oliver that IBM will not be fixing this in firmware, marking as "won't fix".

Changed in maas:
status: Triaged → Won't Fix
bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-183939 severity-medium targetmilestone-inin---