cloud-init overriding set-name in netplan file

Bug #2006106 reported by Patrik Lundin
Affects: cloud-init
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

After creating an Ubuntu 22.04 instance in OpenStack the following netplan file is generated:
```
# cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    version: 2
    ethernets:
        ens3:
            accept-ra: true
            dhcp4: true
            dhcp6: true
            match:
                macaddress: fa:16:3e:c7:f9:7e
            mtu: 1500
            set-name: ens3
```

With the matching links:
```
# ip -br l
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
ens3 UP fa:16:3e:c7:f9:7e <BROADCAST,MULTICAST,UP,LOWER_UP>
```

I then tried to rename the interface from "ens3" to "eth0", updating the file like so:
```
# cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    version: 2
    ethernets:
        eth0:
            accept-ra: true
            dhcp4: true
            dhcp6: true
            match:
                macaddress: fa:16:3e:c7:f9:7e
            mtu: 1500
            set-name: eth0
```

Applying the config works, and the interface is renamed without dropping my SSH connection:
```
# netplan apply

# ip -br l
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
eth0 UP fa:16:3e:c7:f9:7e <BROADCAST,MULTICAST,UP,LOWER_UP>
```

So far so good, but now I reboot the machine, and it will not come back online:
```
# reboot
Connection to XXX.XXX.XXX.XXX closed by remote host.
Connection to XXX.XXX.XXX.XXX closed.
```

Logging in via a locally connected console I can see the following:
```
# ip -br l
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
ens3 DOWN fa:16:3e:c7:f9:7e <BROADCAST,MULTICAST>
```

So for some reason the interface comes up as "ens3" again. It also has no address configuration assigned, which is why I cannot reach it. If I then run "netplan apply" manually, I can get it online again:

```
# netplan apply
# ip -br l
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
eth0 UP fa:16:3e:c7:f9:7e <BROADCAST,MULTICAST,UP,LOWER_UP>
```

Logged in over SSH again, checking the dmesg log for renames shows the following:
```
# dmesg | grep rename
[ 2.142770] virtio_net virtio0 ens3: renamed from eth0
[ 6.089816] virtio_net virtio0 eth0: renamed from ens3
[ 7.253661] virtio_net virtio0 ens3: renamed from eth0
[ 278.607558] virtio_net virtio0 eth0: renamed from ens3
```

So the interface name has been flapping back and forth between "ens3" and "eth0".

After digging around I think this is what happens:
```
[ 2.142770] virtio_net virtio0 ens3: renamed from eth0 <- systemd-networkd, as part of initramfs
[ 6.089816] virtio_net virtio0 eth0: renamed from ens3 <- systemd-networkd, as part of booted OS, using the files generated by my initial "netplan apply".
[ 7.253661] virtio_net virtio0 ens3: renamed from eth0 <- cloud-init, for some reason
[ 278.607558] virtio_net virtio0 eth0: renamed from ens3 <- my manual "netplan apply" after logging in to the console
```
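To make the flapping easier to follow, the rename lines can be turned into an ordered list of events. This is a minimal sketch that assumes the kernel message format shown above; the helper name `parse_renames` is made up for illustration.

```python
import re

# Matches kernel lines such as:
#   [ 2.142770] virtio_net virtio0 ens3: renamed from eth0
RENAME_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\S+\s+\S+\s+(?P<new>\S+): renamed from (?P<old>\S+)"
)


def parse_renames(dmesg_text):
    """Return a list of (timestamp, old_name, new_name) tuples in log order."""
    events = []
    for line in dmesg_text.splitlines():
        match = RENAME_RE.search(line)
        if match:
            events.append(
                (float(match.group("ts")), match.group("old"), match.group("new"))
            )
    return events


# The four lines from the dmesg output in this report.
SAMPLE = """\
[ 2.142770] virtio_net virtio0 ens3: renamed from eth0
[ 6.089816] virtio_net virtio0 eth0: renamed from ens3
[ 7.253661] virtio_net virtio0 ens3: renamed from eth0
[ 278.607558] virtio_net virtio0 eth0: renamed from ens3
"""

events = parse_renames(SAMPLE)
for ts, old, new in events:
    print(f"{ts:>10.3f}s  {old} -> {new}")
```

The name after the last event is the one the interface currently carries, and any event that moves away from the name set in the netplan file points at something other than netplan doing the rename.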

Looking at /var/log/cloud-init.log the following message is seen:
```
2023-02-06 07:57:27,270 - __init__.py[DEBUG]: Detected interfaces {'eth0': {'downable': True, 'device_id': '0x0001', 'driver': 'virtio_net', 'mac': 'fa:16:3e:c7:f9:7e', 'name': 'eth0', 'up': False}, 'lo': {'downable': False, 'device_id': None, 'driver': None, 'mac': '00:00:00:00:00:00', 'name': 'lo', 'up': True}}
2023-02-06 07:57:27,270 - __init__.py[DEBUG]: achieving renaming of [['fa:16:3e:c7:f9:7e', 'ens3', None, None]] with ops [('rename', 'fa:16:3e:c7:f9:7e', 'ens3', ('eth0', 'ens3'))]
2023-02-06 07:57:27,270 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'eth0', 'name', 'ens3'] with allowed return codes [0] (shell=False, capture=True)
```
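The log lines above suggest that the rename is planned by matching the MAC address against the detected interfaces. As a rough illustration only (this is not cloud-init's code, which handles more cases, and `plan_renames` is a hypothetical name), the idea can be sketched like this:

```python
def plan_renames(detected, wanted):
    """Plan renames by MAC address.

    detected: {current_name: {"mac": ..., ...}}, shaped like the
              "Detected interfaces" log line.
    wanted:   list of (mac, desired_name) pairs.
    Returns a list of ("rename", mac, desired_name, (current_name, desired_name)).
    Simplified illustration, NOT cloud-init's implementation.
    """
    name_by_mac = {info["mac"].lower(): name for name, info in detected.items()}
    ops = []
    for mac, desired in wanted:
        current = name_by_mac.get(mac.lower())
        # Skip MACs that are not present and interfaces already named correctly.
        if current is None or current == desired:
            continue
        ops.append(("rename", mac, desired, (current, desired)))
    return ops


# Values taken from the cloud-init.log excerpt in this report.
detected = {
    "eth0": {"mac": "fa:16:3e:c7:f9:7e", "driver": "virtio_net", "up": False},
    "lo": {"mac": "00:00:00:00:00:00", "driver": None, "up": True},
}
wanted = [("fa:16:3e:c7:f9:7e", "ens3")]

ops = plan_renames(detected, wanted)
print(ops)
```

With the interface already named "eth0" by the booted system and the cached config still asking for "ens3", such a plan contains exactly one rename back to "ens3", which matches the `ip link set eth0 name ens3` command in the log.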

At first I had a hard time understanding how cloud-init knew about the previous "ens3" name, but I now think it was persisted in obj.pkl during the first boot after install and is picked up on subsequent boots. From that same log:
```
2023-02-06 07:57:27,211 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
```

Taking a look in the file:
```
# cat p.py
#!/usr/bin/env python3

import pickle

# Load cloud-init's cached datasource object; the cloudinit package must be
# importable for the unpickling to work
with open('/var/lib/cloud/instance/obj.pkl', 'rb') as file:
    data = pickle.load(file)

print(data.network_config)
```

```
# ./p.py
{'version': 1, 'config': [{'mtu': 1500, 'type': 'physical', 'accept-ra': True, 'subnets': [{'type': 'dhcp4'}, {'type': 'dhcp6'}], 'mac_address': 'fa:16:3e:c7:f9:7e', 'name': 'ens3'}, {'type': 'nameserver', 'address': 'XXX.XXX.XXX.XXX'}, {'type': 'nameserver', 'address': 'YYYY:YYYY:YYYY::YYYY:YYYY:YYYY'}]}
```

From what I can tell this "name" is picked up in the openstack helper at https://github.com/canonical/cloud-init/blob/483f79cb3b94c8c7d176e748892a040c71132cb3/cloudinit/sources/helpers/openstack.py#L715
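To make the cached name explicit, the MAC-to-name mapping can be pulled out of a version 1 network config like the one printed above. This is a small sketch that only assumes the dict layout shown in that output; `physical_names` is a hypothetical helper, not part of cloud-init.

```python
def physical_names(network_config):
    """Return {mac_address: name} for 'physical' entries of a version 1 config."""
    if network_config.get("version") != 1:
        raise ValueError("only version 1 network config is handled in this sketch")
    return {
        entry["mac_address"]: entry["name"]
        for entry in network_config.get("config", [])
        if entry.get("type") == "physical"
        and "mac_address" in entry
        and "name" in entry
    }


# The cached config from obj.pkl as printed in this report (addresses redacted).
cached = {
    "version": 1,
    "config": [
        {
            "mtu": 1500,
            "type": "physical",
            "accept-ra": True,
            "subnets": [{"type": "dhcp4"}, {"type": "dhcp6"}],
            "mac_address": "fa:16:3e:c7:f9:7e",
            "name": "ens3",
        },
        {"type": "nameserver", "address": "XXX.XXX.XXX.XXX"},
        {"type": "nameserver", "address": "YYYY:YYYY:YYYY::YYYY:YYYY:YYYY"},
    ],
}

names = physical_names(cached)
print(names)
```

Comparing this mapping with the "set-name" value in the netplan file shows the mismatch directly: the cache still says "ens3" while the edited netplan file says "eth0".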

So... the question then is, how should this work? Right now it seems cloud-init is renaming the interface back even though the netplan file asks for a different name than the machine had at initial install.

One thing that occurred to me is that maybe I am expected to feed cloud-init user-data so it knows from the start that I want the interface called "eth0". However, https://cloudinit.readthedocs.io/en/22.4.2/topics/network-config.html states "User-data cannot change an instance’s network configuration.", so that does not seem to be the expected approach.

For now I guess the simplest workaround is to just disable the network management parts as mentioned in the generated netplan file. This works:
```
# echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
# reboot
```

Now the machine comes up by itself, and there are fewer renames happening:
```
# dmesg | grep rename
[ 2.165152] virtio_net virtio0 ens3: renamed from eth0
[ 6.108291] virtio_net virtio0 eth0: renamed from ens3
```

It feels strange to have to disable the network management parts... What would be the correct way to deal with this situation?

Chad Smith (chad.smith) wrote :

Thanks for filing this bug and helping make cloud-init better. Let's see if we can get to the root of the problem.

This may involve us asking you to run `cloud-init collect-logs` and attach the resulting tar file.

Please check instance-data-sensitive.json in that tarfile before attaching it, because it could contain sensitive information if you provided passwords or user credentials in user-data on the affected VM.

Minimally I think we need to see the output of `journalctl -b 0 -o short-precise` and the full cloud-init.log (both are grabbed by `cloud-init collect-logs` anyway).

Generally, I don't think the OpenStack datasource default behavior should be for cloud-init to be actively rewriting or re-applying network config across reboot. It generally should be inert unless the datasource IMDS (instance metadata) either changes the instance-id in meta-data to a new UUID (telling cloud-init it needs to reconfigure the world) or if OpenStack was configured to re-render network per-boot.
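The expectation in the paragraph above can be written down as a small predicate. This is only a paraphrase of that paragraph, not cloud-init's actual logic, and the function and parameter names are hypothetical.

```python
def should_apply_network_config(cached_instance_id, current_instance_id,
                                per_boot_network_updates=False):
    """Paraphrase of the expected behaviour described in this comment.

    All names are hypothetical; this is NOT cloud-init's implementation.
    """
    # A changed instance-id means "new instance": reconfigure everything.
    if cached_instance_id != current_instance_id:
        return True
    # Otherwise stay inert unless per-boot network re-rendering was enabled.
    return per_boot_network_updates


# Normal reboot of the same instance: nothing should be touched.
print(should_apply_network_config("uuid-a", "uuid-a"))
# The instance-id changed: network config is expected to be re-applied.
print(should_apply_network_config("uuid-a", "uuid-b"))
```

Under this reading, the reboot described in the bug falls into the first case, so no rename should have been attempted.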

So we might have a bug where LinuxNetworking.apply_network_config_names runs more often than it should across normal system reboots, even when DataSourceOpenStack hasn't told cloud-init to re-render and re-apply new networking config due to a BOOT_NEW_INSTANCE event.

I would have expected cloud-init to exit and do nothing with network renames across normal reboots due to these checks
https://github.com/canonical/cloud-init/blob/main/cloudinit/stages.py#L905-L916

I think it will help to see the full cloud-init.log here to work out what really happened with all the PER_BOOT, PER_INSTANCE_REBOOT, datasource cache validation, instance-id and event checks, so we can better determine why cloud-init thinks it should be touching anything related to network renames across subsequent boots.

I'll set this to 'incomplete' status above, but please set it back to 'new' status when you get a chance to attach logs.

Changed in cloud-init:
status: New → Incomplete
Patrik Lundin (eest) wrote :

Attached is the result of running "cloud-init collect-logs" at the point where the netplan file has been modified to state "eth0", "netplan apply" has been run (replacing "ens3" with "eth0" at runtime), the machine has been rebooted and then ends up with an unconfigured "ens3" interface instead of the expected "eth0".

Changed in cloud-init:
status: Incomplete → New
Chad Smith (chad.smith) wrote :

Thank you much for the logs Patrik.

I can see logs indicating what you suggested: renames are applied on every reboot regardless of whether the datasource and network config have been actively detected and applied. We shouldn't see the "applying net names" logs when "No network config applied. Neither a new instance nor datasource network update allowed" is logged. I agree that this is undesirable behavior: cloud-init should remain inert on renames because sysadmins could have gone in and changed the static /etc/netplan/*yaml to represent something other than cloud-init's original config.

That said, editing /etc/netplan/50-cloud-init.yaml is also a recipe for problems in the future if the instance-id that OpenStack presents to this VM via /sys/class/dmi/id/product_uuid changes. When that happens, cloud-init will recrawl the OpenStack IMDS endpoints @ 169.254.169.254 and rewrite all network and system configuration, blowing away changes to the 50-cloud-init.yaml file.

# Confirmation of logs applying net names on 2nd reboot in the 'init-local' and 'init' stages, leading to name thrashing.

$ egrep -i 'applying net|Cloud-init v.|netplan' YOUR_LOGS
2023-02-20 16:19:42,270 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at Mon, 20 Feb 2023 16:19:42 +0000. Up 6.51 seconds.
2023-02-20 16:19:43,780 - stages.py[DEBUG]: applying net config names for {'version': 1, 'config': [{'mtu': 1500, 'type': 'physical', 'accept-ra': True, 'subnets': [{'type': 'dhcp4'}, {'type': 'dhcp6'}], 'mac_address': 'fa:16:3e:19:ed:75', 'name': 'ens3'}, {'type': 'nameserver', 'address': '89.32.32.32'}, {'type': 'nameserver', 'address': '2001:6b0:89::32:32:32'}]}
2023-02-20 16:19:43,786 - stages.py[INFO]: Applying network configuration from ds bringup=False: {'version': 1, 'config': [{'mtu': 1500, 'type': 'physical', 'accept-ra': True, 'subnets': [{'type': 'dhcp4'}, {'type': 'dhcp6'}], 'mac_address': 'fa:16:3e:19:ed:75', 'name': 'ens3'}, {'type': 'nameserver', 'address': '89.32.32.32'}, {'type': 'nameserver', 'address': '2001:6b0:89::32:32:32'}]}
2023-02-20 16:19:43,788 - __init__.py[DEBUG]: Selected renderer 'netplan' from priority list: ['netplan', 'eni', 'sysconfig']
2023-02-20 16:19:43,791 - subp.py[DEBUG]: Running command ['netplan', 'info'] with allowed return codes [0] (shell=False, capture=True)
2023-02-20 16:19:43,974 - util.py[DEBUG]: Writing to /etc/netplan/50-cloud-init.yaml - wb: [644] 555 bytes
2023-02-20 16:19:43,975 - subp.py[DEBUG]: Running command ['netplan', 'generate'] with allowed return codes [0] (shell=False, capture=True)
2023-02-20 16:19:46,359 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init' at Mon, 20 Feb 2023 16:19:46 +0000. Up 10.60 seconds.
2023-02-20 16:19:46,540 - stages.py[DEBUG]: applying net config names for {'version': 1, 'config': [{'mtu': 1500, 'type': 'physical', 'accept-ra': True, 'subnets': [{'type': 'dhcp4'}, {'type': 'dhcp6'}], 'mac_address': 'fa:16:3e:19:ed:75', 'name': 'ens3'}, {'type': 'nameserver', 'address': '89.32.32.32'}, {'type': 'nameserver', 'address': '2001:6b0:89::32:32:32'}]}
2023-02-20 16:19:51,905 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running ...
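The same check can be scripted so that each "applying net config names" line is attributed to the stage that logged it. This is a sketch that assumes the log format shown above; `stages_applying_net_names` is a hypothetical helper, and the sample lines are abbreviated from the excerpt, with the config dict shortened to `{...}`.

```python
import re

# Matches the stage banner, e.g. "Cloud-init v. <version> running 'init-local' at ..."
STAGE_RE = re.compile(r"Cloud-init v\. \S+ running '(?P<stage>[^']+)'")


def stages_applying_net_names(log_lines):
    """Return the stage name for every 'applying net config names' log line."""
    stage = None
    hits = []
    for line in log_lines:
        match = STAGE_RE.search(line)
        if match:
            # Remember which stage the following lines belong to.
            stage = match.group("stage")
        elif "applying net config names" in line and stage is not None:
            hits.append(stage)
    return hits


# Abbreviated from the log excerpt above; the config dict is replaced by {...}.
LOG = [
    "2023-02-20 16:19:42,270 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 "
    "running 'init-local' at Mon, 20 Feb 2023 16:19:42 +0000. Up 6.51 seconds.",
    "2023-02-20 16:19:43,780 - stages.py[DEBUG]: applying net config names for {...}",
    "2023-02-20 16:19:46,359 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 "
    "running 'init' at Mon, 20 Feb 2023 16:19:46 +0000. Up 10.60 seconds.",
    "2023-02-20 16:19:46,540 - stages.py[DEBUG]: applying net config names for {...}",
]

hits = stages_applying_net_names(LOG)
print(hits)
```

On a real system the lines could be read from /var/log/cloud-init.log instead of the inline sample.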


Changed in cloud-init:
status: New → Triaged
importance: Undecided → Medium
Patrik Lundin (eest) wrote :

Hello,

Thanks for the follow-up. If editing "50-cloud-init.yaml" is not a good idea, what is the appropriate way to deal with a rename then? Even if we remove "50-cloud-init.yaml" and create another file, if the "instance-id" changes (is this likely?) will this not just result in "50-cloud-init.yaml" being recreated, now leading to having two conflicting netplan files instead?

If the correct thing is "feed cloud-init this information from the start" then I am not sure how to properly do that: as I stated initially it is documented that you are not allowed to configure network stuff via user-data (https://cloudinit.readthedocs.io/en/22.4.2/topics/network-config.html)

James Falcon (falcojr) wrote :
Changed in cloud-init:
status: Triaged → Expired