MAAS cannot deploy/boot if OVS bridge is configured on a single PXE NIC

Bug #1898997 reported by Lukas Märdian on 2020-10-08
This bug affects 1 person
Affects              Importance  Assigned to
cloud-init           Undecided   Unassigned
netplan              Undecided   Lukas Märdian
netplan.io (Ubuntu)  Undecided   Lukas Märdian
  Focal              Undecided   Lukas Märdian
  Groovy             Undecided   Lukas Märdian

Bug Description

Problem description:
If we try to deploy a single-NIC machine via MAAS, configuring an Open vSwitch bridge as the primary/PXE interface, the machine will install and boot Ubuntu 20.04 but it cannot finish the whole configuration (e.g. copying of SSH keys) and cannot be accessed/controlled via MAAS. It ends up in a "Failed" state.

This is because systemd-networkd-wait-online.service fails (for an unknown reason) before netplan can fully set up and configure the OVS bridge. Because networking is broken, cloud-init cannot complete its final stages, such as setting up SSH keys or signaling its state back to MAAS. If we wait a little longer, the OVS bridge does come online and networking works, but SSH is still not set up and the MAAS state remains "Failed".

Steps to reproduce:
* Set up a (virtual) MAAS system, e.g. inside a LXD container using a KVM host, as described here:
https://discourse.maas.io/t/setting-up-a-flexible-virtual-maas-test-environment/142
* Install & set up the maas[-cli] snap from the 2.9/beta channel (instead of the deb/PPA from the discourse post)
* Configure netplan PPA+key for testing via "Settings" -> "Package repos":
https://launchpad.net/~slyon/+archive/ubuntu/ovs
* Prepare curtin preseed in /var/snap/maas/current/preseeds/curtin_userdata, inside the LXD container (so you can access the broken machine afterwards):
======================
#cloud-config
debconf_selections:
 maas: |
  {{for line in str(curtin_preseed).splitlines()}}
  {{line}}
  {{endfor}}
late_commands:
  maas: [wget, '--no-proxy', '{{node_disable_pxe_url}}', '--post-data', '{{node_disable_pxe_data}}', '-O', '/dev/null']
  90_create_user: ["curtin", "in-target", "--", "sh", "-c", "sudo useradd test -g 0 -G sudo"]
  92_set_user_password: ["curtin", "in-target", "--", "sh", "-c", "echo 'test:test' | sudo chpasswd"]
  94_cat: ["curtin", "in-target", "--", "sh", "-c", "cat /etc/passwd"]
  98_cloud_init: ["curtin", "in-target", "--", "apt-get", "-y", "install", "cloud-init"]
======================
* Compose a new virtual machine via MAAS' "KVM" menu, named e.g. "test1"
* Watch it being commissioned via MAAS' "Machines" menu
* Once it's ready select your machine (e.g. "test1.maas") -> Network
* Select the single network interface (e.g. "ens4") -> Create bridge
* Choose "Bridge type: Open vSwitch (ovs)", Select "Subnet" and "IP mode", save.
* Deploy machine to Ubuntu 20.04 via "Take action" button

The machine will install the OS and boot, but will end up in a "Failed" state inside MAAS because the network/OVS bridge is not set up correctly. Neither MAAS nor SSH can reach it. You can access the (broken) machine via the serial console from the KVM host (i.e. the LXD container) with "virsh console test1", using the "test:test" credentials.
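
For orientation, the bridge created in the steps above corresponds roughly to a netplan configuration with an "openvswitch" key on the bridge. The following is a minimal sketch, not the exact output MAAS or cloud-init produces: the bridge name "br0", the MAC address and the IP address are made-up placeholders, and the helper function is hypothetical. It prints JSON, which is also valid YAML for a structure this simple.

```python
import json


def ovs_bridge_config(nic, mac, bridge="br0", address="10.0.0.10/24"):
    """Build a netplan-style dict for an OVS bridge on top of a single NIC.

    All values passed in are illustrative placeholders, not taken from
    a real MAAS deployment.
    """
    return {
        "network": {
            "version": 2,
            "ethernets": {
                # The physical NIC is matched by MAC and carries no address.
                nic: {"match": {"macaddress": mac}, "set-name": nic},
            },
            "bridges": {
                bridge: {
                    "interfaces": [nic],
                    "macaddress": mac,
                    "addresses": [address],
                    # An (empty) openvswitch mapping marks this bridge
                    # as an OVS bridge instead of a Linux bridge.
                    "openvswitch": {},
                },
            },
        }
    }


config = ovs_bridge_config("ens4", "52:54:00:00:00:01")
print(json.dumps(config, indent=2))
```

With a single NIC, the address lives on the bridge only, so the machine has no working network until the OVS bridge itself is up.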

=== SRU/Focal/netplan.io ===
[Impact]
This update contains bug-fixes and packaging improvements and we would like to make sure all of our supported customers have access to these improvements.

The notable ones are:

   * Setup OVS early in network-pre.target to avoid delays (LP: #1898997)

See the changelog entry below for a full list of changes and bugs.

[Test Case]
The following development and SRU process was followed:
https://wiki.ubuntu.com/NetplanUpdates

Netplan contains an extensive integration test suite that is run using
the SRU package for each release. This test suite's results are available here:
http://autopkgtest.ubuntu.com/packages/n/netplan.io

A successful run is required before the proposed netplan package
can be let into -updates.

The netplan team will be in charge of attaching the artifacts and console
output of the appropriate run to the bug. Netplan team members will not
mark ‘verification-done’ until this has happened.

[Regression Potential]
In order to mitigate the regression potential, the results of the
aforementioned integration tests are attached to this bug.

Focal:
https://git.launchpad.net/~slyon/+git/files/tree/LP1898997/focal_amd64.log
https://git.launchpad.net/~slyon/+git/files/tree/LP1898997/focal_arm64.log
https://git.launchpad.net/~slyon/+git/files/tree/LP1898997/focal_armhf.log
https://git.launchpad.net/~slyon/+git/files/tree/LP1898997/focal_ppc64el.log
https://git.launchpad.net/~slyon/+git/files/tree/LP1898997/focal_s390x.log

[Discussion]
To fully fix the MAAS/OVS problem, cloud-init needs to be updated as well. The fixes to netplan.io and cloud-init can be applied independently, though.

[Changelog]
- Setup OVS early in network-pre.target to avoid delays (LP: #1898997)
- Suggest openvswitch-switch runtime dependency
- Improve stability of autopkgtests

Lukas Märdian (slyon) on 2020-10-08
description: updated
Lukas Märdian (slyon) wrote :

I found a way to execute the netplan-OVS systemd units earlier in the boot process, which fixes the broken networking after boot on single-NIC systems.

Packages from this PPA should now enable booting of Focal images with a single OVS bridge PXE interface:
https://launchpad.net/~slyon/+archive/ubuntu/ovs

But even though the deploy succeeds, there still seem to be some related issues with SSH not being set up correctly after the deploy.

Upstream work is being tracked here:
https://github.com/CanonicalLtd/netplan/pull/165

Changed in netplan:
status: New → In Progress
assignee: nobody → Lukas Märdian (slyon)
Lukas Märdian (slyon) wrote :

The SSH failure seems to be related to cloud-init not detecting the OVS bridge + slaves correctly. Therefore, the cloud-init 'init' stage fails with an exception:
"RuntimeError: Not all expected physical devices present: {'52:54:00:d9:08:1c'}"

I'm working on a pull request here:
https://github.com/canonical/cloud-init/pull/608

In combination with the netplan PR, this should solve the issue described here.
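
The RuntimeError quoted above reports a set of MAC addresses that were expected but not found. The following is a simplified sketch of that kind of check, not cloud-init's actual code: the function names are hypothetical, and the "present" MAC address is a made-up placeholder standing in for whatever the system reported when the OVS bridge and its member interface were not detected correctly.

```python
def find_missing_macs(expected_macs, present_macs):
    """Return the expected MAC addresses that are not present (case-insensitive)."""
    present = {mac.lower() for mac in present_macs}
    return {mac.lower() for mac in expected_macs} - present


def require_physdevs(expected_macs, present_macs):
    """Raise RuntimeError if any expected MAC address is missing."""
    missing = find_missing_macs(expected_macs, present_macs)
    if missing:
        raise RuntimeError(
            "Not all expected physical devices present: %s" % missing
        )


# MAC address from the error message in this report.
expected = {"52:54:00:d9:08:1c"}
# Hypothetical: the detected interfaces do not include the expected MAC.
present_incomplete = {"aa:bb:cc:00:00:01"}

message = None
try:
    require_physdevs(expected, present_incomplete)
except RuntimeError as err:
    message = str(err)
    print(message)
```

If the expected MAC address is among the detected ones, the check passes silently and the 'init' stage can continue.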

Changed in cloud-init:
status: New → In Progress
Lukas Märdian (slyon) wrote :

This should now be working using the test packages from this PPA:
https://launchpad.net/~slyon/+archive/ubuntu/ovs

Lukas Märdian (slyon) on 2020-10-14
description: updated
tags: added: fr-721
Lukas Märdian (slyon) on 2020-10-15
Changed in netplan:
status: In Progress → Fix Committed
Lukas Märdian (slyon) wrote :

The netplan part is fixed in Groovy as of: netplan.io 0.100-0ubuntu5

Changed in netplan:
status: Fix Committed → Fix Released
Lukas Märdian (slyon) on 2020-10-15
description: updated
Lukas Märdian (slyon) on 2020-10-19
Changed in netplan.io (Ubuntu Groovy):
status: New → Fix Released
Lukas Märdian (slyon) wrote :

The netplan.io SRU for Focal is ready and looking for a sponsor:
https://code.launchpad.net/~slyon/netplan/+git/ubuntu/+merge/392290

Lukas Märdian (slyon) on 2020-10-22
Changed in cloud-init:
status: In Progress → Confirmed

Hello Lukas, or anyone else affected,

Accepted netplan.io into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/netplan.io/0.100-0ubuntu4~20.04.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in netplan.io (Ubuntu Focal):
status: New → Fix Committed
tags: added: verification-needed verification-needed-focal
Lukas Märdian (slyon) wrote :

Hello Brian,

I have been testing netplan.io 0.100-0ubuntu4~20.04.3 on Focal according to https://wiki.ubuntu.com/NetplanUpdates, i.e. having all the integration tests run on autopkgtest.u.c and appending the logs to the bug report (see above).

All tests run successfully, except for the known FLAKY wifi tests, which are marked as such.

description: updated
tags: added: verification-done-focal
removed: verification-needed-focal
Changed in netplan.io (Ubuntu):
assignee: nobody → Lukas Märdian (slyon)
Changed in netplan.io (Ubuntu Groovy):
assignee: nobody → Lukas Märdian (slyon)
Changed in netplan.io (Ubuntu Focal):
assignee: nobody → Lukas Märdian (slyon)
Lukas Märdian (slyon) wrote :

Merged at upstream cloud-init: https://github.com/canonical/cloud-init/commit/3c432b32de1bdce2699525201396a8bbc6a41f3e

Will make its way into the packages eventually.

Changed in cloud-init:
status: Confirmed → Fix Committed

The verification of the Stable Release Update for netplan.io has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package netplan.io - 0.100-0ubuntu4~20.04.3

---------------
netplan.io (0.100-0ubuntu4~20.04.3) focal; urgency=medium

  * debian/control:netplan.io: Suggest openvswitch-switch runtime dependency
    - Do not suggest on riscv64, where OVS isn't available in Focal
  * Add d/p/0003-tests-tunnels-improve-WG-handshake-regex.patch
    and d/p/0004-tests-ovs-fix-OVS-timeouts.patch
    - Improve stability of autopkgtests
  * Add d/p/0005-Fix-MAAS-OVS-first-boot-for-single-NIC-PXE-systems-1.patch
    - Setup OVS early in network-pre.target to avoid delays (LP: #1898997)

 -- Lukas Märdian <email address hidden> Mon, 19 Oct 2020 14:49:52 +0200

Changed in netplan.io (Ubuntu Focal):
status: Fix Committed → Fix Released

This bug is believed to be fixed in cloud-init in version 20.4. If this is still a problem for you, please add a comment and set the state back to New.

Thank you.

Changed in cloud-init:
status: Fix Committed → Fix Released
Lukas Märdian (slyon) wrote :

I can confirm this is now working with hirsute images on MAAS 2.9.0~rc3, not using any additional PPAs.
