18.04 minimal images on GCE intermittently fail to set up networking

Bug #1766287 reported by Philip Roche on 2018-04-23
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cloud-init
High
Ryan Harper
cloud-init (Ubuntu)
Undecided
Chad Smith
Nominated for Bionic by Chad Smith
Nominated for Cosmic by Chad Smith

Bug Description

When running tests on 18.04 Minimal daily images on GCE we are noticing intermittent failure to set up networking.

One test run launched 4 instances of image daily-ubuntu-minimal-1804-bionic-v20180420 in GCE project ubuntu-os-cloud-devel. Only one of the four successfully set up networking.

There appears to be only loopback devices set up.

cloud-init 18.2-14-g6d48d26 is installed on these images.

I have attached the console log from the failed launch.

I can provide access to the successfully launched instance if required.

Related branches

Philip Roche (philroche) wrote :
Philip Roche (philroche) wrote :

Also see attached GCE console of successfully launched instance for comparison

Philip Roche (philroche) wrote :

To help with debugging I will try and collect logs from the failed instance launch.

Changed in cloud-init:
assignee: nobody → Philip Roche (philroche)
Philip Roche (philroche) wrote :

I have gathered as much as I can from attaching a snapshot of the boot disk from the instance that failed to boot.

I was unable to run `cloud-init collect-logs` from inside the chroot (See filed bug https://bugs.launchpad.net/cloud-init/+bug/1766335).

The attached archive contains:

* /var/log/cloud-init*log
* cloud-init analyze show output
* cloud-init analyze dump output
* cloud-init package version
* journalctl output
* /var/lib/cloud/instance/user-data.txt

I hope this helps.

Changed in cloud-init:
assignee: Philip Roche (philroche) → nobody
Ryan Harper (raharper) wrote :

Here's what I think is happening.

In the success case, the virtio nic is renamed by the kernel to a "stable" name prior to cloud-init local enumerating the system nics and picking a fallback device.

$ journalctl -o short-precise | egrep "(Cloud-init|rename)"
Apr 23 16:19:45.517627 ubuntu kernel: virtio_net virtio1 ens4: renamed from eth0
Apr 23 16:19:47.427137 ubuntu cloud-init[163]: Cloud-init v. 18.2 running 'init-local' at Mon, 23 Apr 2018 16:19:47 +0000. Up 6.12 seconds.

On the failing case, we see that the rename happens *after* cloud-init-local has started

Apr 23 10:33:24 ubuntu kernel: [ 3.334493] virtio_net virtio1 ens4: renamed from eth0
Apr 23 10:33:24 ubuntu cloud-init[165]: Cloud-init v. 18.2 running 'init-local' at Mon, 23 Apr 2018 10:33:21 +0000. Up 3.19 seconds.

Note here cloud-init's uptime value 3.19 seconds, is *before* the rename kernel time 3.33, about 14 milliseconds before.

When this race happens, cloud-init local reads /sys/class/net for interfaces and picks eth0; as it has not yet been renamed, then generates a config for eth0, and when rendered to netplan; it contains a Name=eth0 as part of the match section, so networkd does not apply the config as the interface is actually ens4 at this time.

There is a possibility that systemd-networkd isn't doing the rename properly; that is, in the failure path, the files will look like:

% cat /run/systemd/network$ cat 10-netplan-ens4.link
[Match]
MACAddress=42:01:0a:80:00:03

[Link]
Name=eth0
WakeOnLan=off

% cat 10-netplan-ens4.network
[Match]
MACAddress=42:01:0a:80:00:03
Name=eth0

[Network]
DHCP=ipv4

[DHCP]
UseMTU=true

The .link file should have forced ens4 back to eth0; and looks like this was happening with this log message:

Apr 23 10:33:24 ubuntu systemd-networkd[359]: ens4: Interface name change detected, ens4 has been renamed to eth0.
Apr 23 10:33:24 ubuntu systemd-networkd[359]: eth0: Interface name change detected, eth0 has been renamed to ens4.

But somehow it's moved back; when then means the .network config won't appy.

Philip Roche (philroche) wrote :

As requested in IRC, please find attached the collect-logs archive from a successful boot.

Philip Roche (philroche) wrote :
Ryan Harper (raharper) on 2018-04-24
Changed in cloud-init:
importance: Undecided → High
Ryan Harper (raharper) wrote :

I'm able to recreate by launching the specified image in europe-west1 region;

The race is between udev coldplug which triggers systemd persistent naming events, and the start of cloud-init-local.service; if the nic has not yet been renamed by the time that cloud-init-local runs, then it will render a config for eth0 which won't match up with the nic which get's renamed to ens4.

On key to for cloud-init is that at local time, we are expecting that udev renaming will have already completed. Systemd provides a 'systemd-udev-settle.service' which can be invoked after the 'systemd-udev-trigger.service' aka, the Coldplug.

Currently nothing in the cloud-image (minimal, or regular) provides a Wants=systemd-udev-settle.service; which means that nothing is waiting for udev events to have completed. There are a number of reasons why not to wait for things; in some cases usb or other devices take quite a while to come up and this blocks boot. Currently in ubuntu at least LVM and zfs will ensure that systemd-udev-settle.service is wanted and run before sysinit.target is reached.

We would like to have cloud-init-local.service both Want and run After systemd-udev-settle.service has complete; this ensure that any persistent name rules will have fired (systemd-udev-trigger.service starts the events) and the settle blocks until the kernel uevent queue is empty. At this point no other entity is issuing network device renames and cloud-init-local can rely on whatever names of the devices that are present.

To verify this; I've setup a reboot loop on an instance wherew we've just added:

% git diff
diff --git a/systemd/cloud-init-local.service.tmpl b/systemd/cloud-init-local.service.tmpl
index ff9c644..2babf05 100644
--- a/systemd/cloud-init-local.service.tmpl
+++ b/systemd/cloud-init-local.service.tmpl
@@ -3,6 +3,8 @@
 Description=Initial cloud-init job (pre-networking)
 {% if variant in ["ubuntu", "unknown", "debian"] %}
 DefaultDependencies=no
+Wants=systemd-udev-settle.service
+After=systemd-udev-settle.service
 {% endif %}
 Wants=network-pre.target
 After=systemd-remount-fs.service

I will run this overnight to see how successful this approach is. It requires further discussion w.r.t whether we can generally enable this service without impacting other use-cases.

Ryan Harper (raharper) wrote :

To recreate:

- install the google-cloud-sdk and configure your gcloud command line
- cat >user-data.yaml <<EOF
#cloud-config
ssh_import_id: lp:<your lp id>
EOF

# launch instance that can recreate
- gcloud compute instances create recreate-lp1766287 --zone=europe-west1 --image daily-ubuntu-minimal-1804-bionic-v20180420 --metadata-from-file user-data=user-data.yaml --metadata=serial-port-enable=1

# get the ip of the instance into variable
- IP=`gcloud compute instances list | awk '/recreate-lp1766287/ {print $5}'`

# connect to instance
- ssh ubuntu@$IP;

# on the instance, set root password for serial console login
- sudo bash; passwd

# update cloud-init-local.service config to Want/After systemd-udev-settle.service
- sudo sed -i -e '/DefaultDependencies/i Wants=systemd-udev-settle.service\nAfter=systemd-udev-settle.service' /lib/systemd/system/cloud-init-local.service

# reset cloud-init
- sudo cloud-init clean --logs --reboot

# in separate terminal, fire up serial console
- cloud compute connect-to-serial-port recreate-lp1766287

# In the shell with the IP variable defined,
# Run this loop to watch and trigger reboots if network comes up:

- COUNT=0; while true; do echo "---"; echo "COUNT=$COUNT"; ssh -o ConnectTimeout=5s ubuntu@$IP -- "sudo cloud-init status --wait; sudo cloud-init clean --logs; sudo shutdown --reboot +1; exit" 2>/dev/null; if [ "$?" = "0" ]; then echo "Boot Success!"; COUNT=$(($COUNT + 1)); fi; echo "waiting 4s"; sleep 4; done

Philip Roche (philroche) wrote :

@raharper I hadn't realised you had been working on a reproducer.

Attached is a test script which I have written which I have used to successfully reproduce the issue.

A summary of test results is provided at the end of the test run.

Usage

`reproduce-failed-boot.sh`

OR

`reproduce-failed-boot.sh --image-serial 20180420`

Update TOTAL_LAUNCHES for fewer tests (currently set to 20).
Update DELETE_FAILED_LAUNCHES=true to delete all instances including failed launches

Philip Roche (philroche) wrote :

After applying the systemd-udev-settle.service changes to cloud-init
on my instance, I've got 484 successful reboots with no interruption
of networking.

On Wed, Apr 25, 2018 at 6:19 AM, Philip Roche <email address hidden> wrote:
> ** Attachment removed: "reproduce-failed-boot.sh"
> https://bugs.launchpad.net/cloud-init/+bug/1766287/+attachment/5126998/+files/reproduce-failed-boot.sh
>
> ** Attachment added: "reproduce-failed-boot.sh"
> https://bugs.launchpad.net/cloud-init/+bug/1766287/+attachment/5127008/+files/reproduce-failed-boot.sh
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1766287
>
> Title:
> 18.04 minimal images on GCE intermittently fail to set up networking
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/cloud-init/+bug/1766287/+subscriptions

Scott Moser (smoser) on 2018-04-26
Changed in cloud-init:
status: New → Confirmed
Changed in cloud-init (Ubuntu):
status: New → In Progress
Changed in cloud-init:
assignee: nobody → Ryan Harper (raharper)
Scott Moser (smoser) wrote :

An upstream commit landed for this bug.

To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=4731c8da

Changed in cloud-init:
status: Confirmed → Fix Committed
Philip Roche (philroche) wrote :

I too have verified the GCE bionic images with cloud-init 18.2-27 (currently in bionic-proposed) in europe-west region. 20 of 20 launched successfully.

Nice work.

This bug is believed to be fixed in cloud-init in version18.2-27-g6ef92c98-0ubuntu1~18.04.1. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in cloud-init:
status: Fix Committed → Fix Released
Changed in cloud-init (Ubuntu):
assignee: nobody → Chad Smith (chad.smith)

This bug is believed to be fixed in cloud-init in version 18.3. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in cloud-init:
status: Fix Released → Fix Committed
status: Fix Committed → Fix Released
Philip Roche (philroche) wrote :

I'd like to reopen this following Disco minimal images failing to set up networking due to similar reasons to this bug with the only difference being that no nic was found.

A workaround was found to set up cloud-init service config:

/etc/systemd/system/cloud-init-local.service.d/gcp.conf
```
[Unit]
After=systemd-udev-trigger.service

[Service]
ExecStartPre=/bin/udevadm settle
```

The goal of this workaround is to:

1) ensure that cloud-init-local.service runs after
   systemd-udev-trigger.service starts (this is what triggers
   udev coldplug events, like plugging in the nic)
2) Run udevadm settle before we start cloud-init local so that any
   nic processing is completed before cloud-init starts looking for
   a nic.

Currently this is only required on minimal images but there is a
chance it could occur in base images too should they boot quick
enough. Minimal disco does not have snap preseeding as base images do and
due to this running before cloud-init it makes it extremely unlikely to
happen in base images.

I understand that cloud-init might not be the place to fix the issue for all images but I'd like to re-open this bug to start that discussion.

I have attached cloud-init logs, netplan yaml, image manifest and sosreports from an instance that failed to set up networking.

Philip Roche (philroche) wrote :

On guidance from raharper I have opened new bug for this @ https://bugs.launchpad.net/cloud-init/+bug/1822353

tags: added: id-5d0a33dc7c02f24574ae04aa
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers