cloud-init 19.2.36 fails with python exception "Not all expected physical devices present ..." during bionic image deployment from MAAS

Bug #1846535 reported by Nikolay Vinogradov on 2019-10-03
This bug affects 11 people
Affects              Importance  Assigned to
cloud-init           Critical    Dan Watkins
cloud-init (Ubuntu)  Critical    Dan Watkins
  Xenial             Undecided   Dan Watkins
  Bionic             Undecided   Dan Watkins
  Disco              Undecided   Dan Watkins
  Eoan               Critical    Dan Watkins

Bug Description

[Impact]

Any instances launched with bridges or bonds in their network configuration will fail to bring up networking.

[Test Case]
# A Juju bootstrap of a machine on MAAS sets up a network bridge, which triggers a failure in the cloud-init init stage.
# This results in a MAAS machine deployment failure, and the machine gets released.

Procedure:

# Alternate steps on a maas machine with a bridge already created
A1. confirm a bridge interface is configured for the target machine on interface eno1, name it broam, attach it to a subnet and select auto-assign for IP
A2. click deploy -> bionic
A3. Once manual deployment fails go to step 2 below

# Alternative 2 juju bootstrap failure on maas
B1: juju bootstrap mymaas --no-gui
B2: Once bootstrap fails go to step 2

2. After the deployment fails and the machine is powered off, click on the failed/released node in the MAAS UI
3. Click Rescue Mode from the 'Take Action' drop down in the MAAS UI
4. Grab the IP from the interfaces tab
5. ssh ubuntu@<theRescueMachineIP> -- cloud-init status --long
# Expect failure message
6. Click Exit Rescue Mode on the node in MAAS UI.

7. ssh to the MAAS server and add the following lines to /etc/maas/preseeds/curtin_userdata to test the official *-proposed packages:

system_upgrade: {enabled: True}
apt:
  sources:
    proposed.list:
       source: deb $MIRROR $RELEASE-proposed main universe # upstream -proposed

8. Repeat step 1 and expect bootstrap success
# expect to see MAASDatasource from bootstrapped machine and no errors
9. juju ssh 0 -- cloud-init status --long

Additional verification checks to avoid regression
 - DONE oracle
 - DONE ec2
 - DONE openstack
 - DONE gce
 - DONE azure
 - DONE nocloud kvm
 - DONE nocloud lxd

[Regression Potential]

The change being SRU'd adds more conditions to an existing conditional. There is potential to regress the cases that the existing conditional was introduced to cover, so we will be testing those specifically. Other than that, there was some minor refactoring of the existing conditional statement (which did not change the logic it checks), which could cause issues for Oracle netfailover interfaces. We will also specifically test on Oracle.

[Original Report]

Symptoms
========

After deploying an Ubuntu Bionic image with the MAAS provider (to a bare metal server), juju cannot access any deployed machine due to missing SSH host keys, and the machines are stuck in the pending state:

$ juju ssh 0
ERROR retrieving SSH host keys for "0": keys not found

$ juju machines
Machine  State    DNS            Inst id   Series  AZ   Message
0        pending  172.20.10.125  block-3   bionic  AZ3  Deployed
1        pending  172.20.10.124  block-2   bionic  AZ2  Deployed
2        pending  172.20.10.126  block-1   bionic  AZ1  Deployed
3        pending  172.20.10.127  object-2  bionic  AZ1  Deployed
4        pending  172.20.10.128  object-1  bionic  AZ2  Deployed
5        pending  172.20.10.129  object-3  bionic  AZ3  Deployed

It is worth mentioning that pods can be deployed successfully with MAAS; only bare metal deployment fails.

We checked different bionic images: cloud-init 19.2.24 works, and cloud-init 19.2.36 doesn't.


Michał Ajduk (majduk) wrote :

The issue was introduced in cloud-init v. 19.2-36-g059d049c-0ubuntu1~18.04.1.
It was not present in cloud-init v. 19.2-24-ge7881d5c-0ubuntu1~18.04.1.

Symptoms:
2019-10-03 13:10:59,100 - __init__.py[WARNING]: Not all expected physical devices present: {'3c:fd:fe:d5:7a:42', '3c:fd:fe:d5:70:d9', '3c:fd:fe:d5:7a:41', '3c:fd:fe:d5:70:d8', '3c:fd:fe:d5:7a:40', '3c:fd:fe:d5:70:da'}

It seems that the following change causes this:
   * New upstream snapshot. (LP: #1844334)
     - net: add is_master check for filtering device list

Deployments with bonds seem to be affected. I think that, since the slaves' MAC addresses change after a bond is created, cloud-init fails to find the interfaces by their original MAC addresses (they are present in netplan, though).

The cloud-init commit that introduced the change: https://github.com/cloud-init/cloud-init/commit/b3a87fc0a2c88585cf77fa9d2756e96183c838f7

It affects branches: 19.2-36 and 19.2-25
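
When triaging a failed node, the warning quoted above can be picked out of /var/log/cloud-init.log mechanically. The following is a minimal sketch, assuming the message keeps the exact wording shown in this report and that the device list is printed as a Python set literal; the names `WARNING_RE` and `missing_macs` are made up for illustration and are not part of cloud-init.

```python
import ast
import re

# Pattern for the warning text quoted in this report. That other cloud-init
# versions use the same wording is an assumption.
WARNING_RE = re.compile(r"Not all expected physical devices present: (\{.*\})")


def missing_macs(log_line):
    """Return the set of MAC addresses named in the warning, or an empty set."""
    match = WARNING_RE.search(log_line)
    if match is None:
        return set()
    # The payload is printed as a Python set literal, so literal_eval can read it.
    return set(ast.literal_eval(match.group(1)))


# Shortened copy of the log line from this report (two of the six MACs).
line = (
    "2019-10-03 13:10:59,100 - __init__.py[WARNING]: Not all expected physical "
    "devices present: {'3c:fd:fe:d5:7a:42', '3c:fd:fe:d5:70:d9'}"
)
print(sorted(missing_macs(line)))
```

Comparing the extracted addresses with the MACs of the bond or bridge members on the node is a quick way to check whether a failure matches this bug.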

Chad Smith (chad.smith) wrote :

I can confirm that I can successfully install bionic on bare metal with 19.2-36 directly in MAAS without seeing this issue. I'll now try using Juju to deploy the bare metal machine to see if I can reproduce the failure.

Nikolay, do you have access to the output of cloud-init collect-logs (or minimally /var/log/cloud-init.log) from the failed system?

David Coronel (davecore) wrote :

subscribed ~field-high

The issue seems to affect deploys with network bonds. The workaround is to use an older cloud image with an older cloud-init version.

cloud-init-output.log attached

Note mac addresses of bond slaves

Note mac addresses again. See also the error in attached cloud-init-output.log.

Jason Hobbs (jason-hobbs) wrote :

Subscribed to field-critical. This is blocking all of our tests.

Robie Basak (racb) on 2019-10-03
tags: added: regression-update
tags: added: cpe-onsite

EOD update from the cloud-init team:

We've identified what the problem is: in the problematic code path, we have started filtering out network devices that have a "master", which accidentally also includes physical interfaces that are members of a bridge or bond. As suggested in comment #1, it's the "net: add is_master check for filtering device list" commit that introduced the issue. Unfortunately, that commit was a fix for a critical issue (bug 1844191) from another source, so we cannot simply revert it and take our time to land a different fix.

We are currently pursuing three potential options for fixes: (a) completely circumvent the now-incorrect function in the code path that is raising the exception, (b) specifically avoid excluding bridge and bond member interfaces from the list, and (c) match more closely on the type of interface the 'master' check was intended to exclude. In order of preference, we would prefer (c) over (b) over (a). (Because (a) does not obviate the need for something along the lines of (b) or (c) for other code paths, and because we may have to play whack-a-mole with other cases that (b) needs to expand to include.)

We have initial implementations of (a) and (b) which we are testing (though they are not yet ready to land in trunk in their current form). We are investigating (c), but it's likely that we won't reach a conclusion on that investigation fast enough to warrant waiting for its completion before addressing this issue.
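
To make the difference between the regressed behaviour and option (b) concrete, here is a small, self-contained model. It is a sketch only: it does not read sysfs and is not cloud-init's actual code. The `Iface` class, both function names, the interface names, and the third MAC address are hypothetical; the first two MACs are taken from the warning quoted earlier in this report.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Iface:
    name: str
    mac: str
    kind: str = "physical"        # "physical", "bond", "bridge" or "other"
    master: Optional[str] = None  # name of the enslaving device, if any


def regressed_physical_macs(ifaces: List[Iface]) -> Set[str]:
    """Model of the regression: every interface that has a master is dropped."""
    return {i.mac for i in ifaces if i.kind == "physical" and i.master is None}


def option_b_physical_macs(ifaces: List[Iface]) -> Set[str]:
    """Model of option (b): keep members whose master is a bond or a bridge."""
    kinds: Dict[str, str] = {i.name: i.kind for i in ifaces}
    return {
        i.mac
        for i in ifaces
        if i.kind == "physical"
        and (i.master is None or kinds.get(i.master) in ("bond", "bridge"))
    }


# Hypothetical host: two NICs enslaved to bond0 plus one standalone NIC.
HOST = [
    Iface("bond0", "3c:fd:fe:d5:7a:40", kind="bond"),
    Iface("eno1", "3c:fd:fe:d5:7a:40", master="bond0"),
    Iface("eno2", "3c:fd:fe:d5:7a:41", master="bond0"),
    Iface("eno3", "00:16:3e:00:00:01"),
]
# MACs that the network configuration expects to find on physical devices.
EXPECTED = {"3c:fd:fe:d5:7a:40", "3c:fd:fe:d5:7a:41", "00:16:3e:00:00:01"}

print(sorted(EXPECTED - regressed_physical_macs(HOST)))  # bond members go missing
print(sorted(EXPECTED - option_b_physical_macs(HOST)))   # nothing is missing
```

In this model, interfaces whose master is neither a bond nor a bridge are still filtered out, which is the case the original "is_master" check was meant to cover.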

Bumped up the importance due to more reports of production environments
hitting this problem (apparently; not fully confirmed).

Changed in cloud-init (Ubuntu):
importance: Undecided → Critical
tags: added: sts
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in cloud-init (Ubuntu):
status: New → Confirmed
Changed in cloud-init (Ubuntu):
status: Confirmed → In Progress
Changed in cloud-init:
status: New → In Progress
assignee: nobody → Dan Watkins (daniel-thewatkins)
importance: Undecided → Critical
Changed in cloud-init (Ubuntu):
assignee: nobody → Dan Watkins (daniel-thewatkins)

This bug is fixed with commit a7d8d032 to cloud-init on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=a7d8d032

Changed in cloud-init:
status: In Progress → Fix Committed
Changed in cloud-init (Ubuntu Bionic):
status: New → In Progress
Changed in cloud-init (Ubuntu Disco):
status: New → In Progress
Changed in cloud-init (Ubuntu Xenial):
status: New → In Progress
Changed in cloud-init (Ubuntu Bionic):
assignee: nobody → Dan Watkins (daniel-thewatkins)
Changed in cloud-init (Ubuntu Disco):
assignee: nobody → Dan Watkins (daniel-thewatkins)
description: updated
Chad Smith (chad.smith) on 2019-10-04
description: updated

Hello Nikolay, or anyone else affected,

Accepted cloud-init into disco-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/19.2-36-g059d049c-0ubuntu2~19.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-disco to verification-done-disco. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-disco. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in cloud-init (Ubuntu Disco):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-disco
Changed in cloud-init (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
Steve Langasek (vorlon) wrote :

Hello Nikolay, or anyone else affected,

Accepted cloud-init into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/19.2-36-g059d049c-0ubuntu2~18.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

David Coronel (davecore) wrote :

As an FYI, my workaround for this is to grab the squashfs file from https://images.maas.io/ephemeral-v3/daily/bionic/amd64/20190930/ and copy it over the file /var/lib/maas/boot-resources/current/ubuntu/amd64/ga-18.04/bionic/daily/squashfs on my MAAS nodes (assuming you deploy Ubuntu 18.04 with the GA kernel).

Chad Smith (chad.smith) on 2019-10-04
description: updated
Steve Langasek (vorlon) wrote :

Hello Nikolay, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/19.2-36-g059d049c-0ubuntu2~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in cloud-init (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 19.2-36-g059d049c-0ubuntu2

---------------
cloud-init (19.2-36-g059d049c-0ubuntu2) eoan; urgency=medium

  * cherry-pick a7d8d032: get_interfaces: don't exclude bridge and bond
    members (LP: #1846535)

 -- Daniel Watkins <email address hidden> Fri, 04 Oct 2019 11:42:12 -0400

Changed in cloud-init (Ubuntu Eoan):
status: In Progress → Fix Released
Chad Smith (chad.smith) on 2019-10-04
description: updated
Jason Hobbs (jason-hobbs) wrote :

I successfully verified the bionic fix on MAAS. Here's what I did:

1. deployed a machine with a bridge via maas
2. machine went to deployed mode, couldn't ssh in
3. switched to rescue mode, ssh'd in
4. mounted /, captured cloud-init-output.log with error: http://paste.ubuntu.com/p/Gg53xf9wtZ/
5. I then enabled proposed via curtin_userdata and repeated; ssh worked and the error was gone: http://paste.ubuntu.com/p/gXVGmKvWHY/

David Coronel (davecore) wrote :

@darmadoo: The cloud-init package is baked into the cloud images that MAAS uses to deploy instances. You have to wait for the next cloud image that will contain this fixed cloud-init, or you can use the workaround in my comment #16 to use a previous image in the meantime.

Mark Darmadi (darmadoo) wrote :

I have also successfully verified the bionic fix on MAAS.

Enabling the *-proposed repo via curtin_userdata pulled in the latest cloud-init package, which fixed the issue.

Chad Smith (chad.smith) wrote :

Oracle SRU verification logs

Chad Smith (chad.smith) wrote :

gce sru verification logs

Chad Smith (chad.smith) wrote :

Attach file openstack-sru-19.2.36.ubuntu2.txt.

Chad Smith (chad.smith) wrote :

Attach file ec2-sru-19.2.36.ubuntu2.txt.

description: updated
Chad Smith (chad.smith) wrote :

Attach file nocloud-lxd-sru-19.2.36.ubuntu2.txt.

Chad Smith (chad.smith) wrote :

Attach file nocloud-kvm-sru-19.2.36.ubuntu2.txt.

Chad Smith (chad.smith) wrote :

Attach file azure-sru-19.2.36.ubuntu2.txt.

description: updated
tags: added: verification-done verification-done-bionic verification-done-disco verification-done-xenial
removed: verification-needed verification-needed-bionic verification-needed-disco verification-needed-xenial
Lee Trager (ltrager) wrote :

I manually tested the new cloud-init using the MAAS CI, as we don't have automated tests for bonds or bridges. Xenial, Bionic, and Disco can all commission and deploy fine using static IPs, bonds, and bridges.

Nobuto Murata (nobuto) wrote :

It looks like the 20191004 image on images.maas.io (the latest as of right now) contains the older cloud-init 19.2-24-ge7881d5c-0ubuntu1~18.04.1 (the version before the regression was introduced).

$ curl -s https://images.maas.io/ephemeral-v3/daily/bionic/amd64/20191003/squashfs.manifest | grep -w cloud-init
cloud-init 19.2-36-g059d049c-0ubuntu1~18.04.1

$ curl -s https://images.maas.io/ephemeral-v3/daily/bionic/amd64/20191004/squashfs.manifest | grep -w cloud-init
cloud-init 19.2-24-ge7881d5c-0ubuntu1~18.04.1

diff --git a/squashfs.manifest.20191003 b/squashfs.manifest.20191004
index 524873c..9a64019 100644
--- a/squashfs.manifest.20191003
+++ b/squashfs.manifest.20191004
@@ -25,7 +25,7 @@ byobu 5.125-0ubuntu1
 bzip2 1.0.6-8.1ubuntu0.2
 ca-certificates 20180409
 cloud-guest-utils 0.30-0ubuntu5
-cloud-init 19.2-36-g059d049c-0ubuntu1~18.04.1
+cloud-init 19.2-24-ge7881d5c-0ubuntu1~18.04.1
 cloud-initramfs-copymods 0.40ubuntu1.1
 cloud-initramfs-dyn-netconf 0.40ubuntu1.1
 command-not-found 18.04.5
@@ -336,9 +336,9 @@ ncurses-term 6.1-1ubuntu1.18.04
 net-tools 1.60+git20161116.90da8a0-1ubuntu1
 netbase 5.4
 netcat-openbsd 1.187-1ubuntu0.1
-netplan.io 0.98-0ubuntu1~18.04.1
+netplan.io 0.97-0ubuntu1~18.04.1
 networkd-dispatcher 1.7-0ubuntu3.3
-nplan 0.98-0ubuntu1~18.04.1
+nplan 0.97-0ubuntu1~18.04.1
 ntfs-3g 1:2017.3.23-2ubuntu0.18.04.2
 open-iscsi 2.0.874-5ubuntu2.7
 open-vm-tools 2:10.3.10-1~ubuntu0.18.04.1
@@ -447,7 +447,7 @@ tcpdump 4.9.2-3
 telnet 0.17-41
 time 1.7-25.1build1
 tmux 2.6-3ubuntu0.2
-tzdata 2019c-0ubuntu0.18.04
+tzdata 2019b-0ubuntu0.18.04
 ubuntu-advantage-tools 17
 ubuntu-keyring 2018.09.18.1~18.04.0
 ubuntu-minimal 1.417.3
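
The same comparison can be scripted when checking which cloud-init a given image carries. This is a minimal sketch, assuming each manifest line has the form "name version" separated by whitespace, as in the excerpts above; `parse_manifest` and `changed_packages` are illustrative names, and the sample data is a four-line excerpt of the two manifests rather than the full files.

```python
def parse_manifest(text):
    """Map package name to version from 'name version' lines."""
    versions = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            versions[parts[0]] = parts[1].strip()
    return versions


def changed_packages(old_text, new_text):
    """Return {name: (old_version, new_version)} for packages whose version differs."""
    old, new = parse_manifest(old_text), parse_manifest(new_text)
    return {
        name: (old[name], new[name])
        for name in sorted(old.keys() & new.keys())
        if old[name] != new[name]
    }


# Excerpts of the two manifests compared in the comment above.
MANIFEST_20191003 = """\
cloud-guest-utils 0.30-0ubuntu5
cloud-init 19.2-36-g059d049c-0ubuntu1~18.04.1
netplan.io 0.98-0ubuntu1~18.04.1
tzdata 2019c-0ubuntu0.18.04
"""
MANIFEST_20191004 = """\
cloud-guest-utils 0.30-0ubuntu5
cloud-init 19.2-24-ge7881d5c-0ubuntu1~18.04.1
netplan.io 0.97-0ubuntu1~18.04.1
tzdata 2019b-0ubuntu0.18.04
"""

for name, (old, new) in changed_packages(MANIFEST_20191003, MANIFEST_20191004).items():
    print(f"{name}: {old} -> {new}")
```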

That's correct, that was done to work around the cloud-init issue while we land the fix in the archive. (That specific change was tracked in bug 1846845, but that's in a private project so odds are people won't be able to view it.)

Changed in cloud-init (Ubuntu Xenial):
assignee: nobody → Dan Watkins (daniel-thewatkins)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 19.2-36-g059d049c-0ubuntu2~19.04.1

---------------
cloud-init (19.2-36-g059d049c-0ubuntu2~19.04.1) disco; urgency=medium

  * cherry-pick a7d8d032: get_interfaces: don't exclude bridge and bond
    members (LP: #1846535)

 -- Daniel Watkins <email address hidden> Fri, 04 Oct 2019 11:46:15 -0400

Changed in cloud-init (Ubuntu Disco):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for cloud-init has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 19.2-36-g059d049c-0ubuntu2~18.04.1

---------------
cloud-init (19.2-36-g059d049c-0ubuntu2~18.04.1) bionic; urgency=medium

  * cherry-pick a7d8d032: get_interfaces: don't exclude bridge and bond
    members (LP: #1846535)

 -- Daniel Watkins <email address hidden> Fri, 04 Oct 2019 11:35:54 -0400

Changed in cloud-init (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 19.2-36-g059d049c-0ubuntu2~16.04.1

---------------
cloud-init (19.2-36-g059d049c-0ubuntu2~16.04.1) xenial; urgency=medium

  * cherry-pick a7d8d032: get_interfaces: don't exclude bridge and bond
    members (LP: #1846535)

 -- Daniel Watkins <email address hidden> Fri, 04 Oct 2019 12:01:19 -0400

Changed in cloud-init (Ubuntu Xenial):
status: Fix Committed → Fix Released

Using 19.2-36-g059d049c-0ubuntu1~18.04.1, none of our new AWS instances can get networking. A standard package-update AMI build moved us from 19.2.24 to 19.2.36, and now our instances come up inaccessible.

Our cloudconfig does not make any adjustments to the networking.

The specific environment where we have the issue is AWS, US-West-2, m5.xlarge instances, on a private subnet, within a VPC.

Please let me know what information I can provide that may help to troubleshoot. I'm unable to access any instances running the new cloud-init, so the information I can retrieve from them is limited.
-------

-------

[ 30.990869] cloud-init[729]: Cloud-init v. 19.2-36-g059d049c-0ubuntu1~18.04.1 running 'init' at Wed, 09 Oct 2019 05:09:30 +0000. Up 30.80 seconds.
[ 13.017666] cloud-init[736]: ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
[ 13.019271] cloud-init[736]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 13.020911] cloud-init[736]: ci-info: | Device |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
[ 13.022470] cloud-init[736]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 13.024021] cloud-init[736]: ci-info: |  ens5  | False |     .     |     .     |   .   | 06:57:5b:c1:24:52 |
[ 13.025576] cloud-init[736]: ci-info: |   lo   |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
[ 13.027182] cloud-init[736]: ci-info: |   lo   |  True |  ::1/128  |     .     |  host |         .         |
[ 13.028763] cloud-init[736]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+

Alexander Weber (deshke) wrote :

The issue Richard describes also applies to GCP and Azure instances.

The netplan configuration is not regenerated if the MAC address changes, and it is hardcoded with a `match: { mac: address}`. So if an image is created, newly spawned instances do not have networking enabled on their main interface.

The only workaround I've found is:

```
# Disable cloud-init's default network configuration and enable DHCP on the known interfaces
echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
echo 'network:
    version: 2
    renderer: networkd
    ethernets:
        ens5:
            dhcp4: true
            dhcp6: true
            optional: true
        ens4:
            dhcp4: true
            dhcp6: true
            optional: true
' > /etc/netplan/50-cloud-init.yaml
```

Hi Richard and Alexander,
Dan Watkins (driving the recent upload) will be around any minute and take a look.

Until then, if you both could try to get to a failing instance (or its disk) somehow and get the log [1], which is usually at /var/log/cloud-init-output.log, that would be great.
I know logging in is hard due to the issue, but [1] also describes options such as sending logs to an rsyslog service if you have one. Another alternative is to add a late_command that pushes the log somewhere else.

[1]: https://cloudinit.readthedocs.io/en/latest/topics/logging.html

Hi Richard, Alexander,

Richard: We've published new daily images to AWS containing the newer version of cloud-init. Can you try to reproduce using the most recent daily image (which is ami-0802dbc378772aca8 in us-west-2), please? If you can still reproduce, then please file a new bug (because we'll need to triage and track the fix separately).

Alexander: That sounds like a distinct issue, as that has been the behaviour of cloud-init for quite some time now. Could you file a separate bug so we can follow up on it?

Thanks!

Dan

Thanks! I'll work to capture logs and reproduce. We worked around it for now by using apt to hold the package back (we use an older base image from July that then does an apt-get update of everything while building our image).

I was able to capture the logs by making a snapshot of the volume and mounting it to a different host, however with the previous version of cloud-init it had the same problem, sorry for the false note but it looks like this issue is not related to the upgrade.

On Wed, Oct 09, 2019 at 05:18:27PM -0000, Richard Maynard wrote:
> I was able to capture the logs by making a snapshot of the volume and
> mounting it to a different host, however with the previous version of
> cloud-init it had the same problem, sorry for the false note but it
> looks like this issue is not related to the upgrade.

That's a relief! Thanks for digging into this to get confirmation. If
there's another cloud-init issue causing the problem, please do file a
separate bug!

Joshua Powers (powersj) wrote :

This is fixed in Ubuntu and as such I am unsubscribing field-critical
