Secondary network interface left unconfigured after reboot of Core 18

Bug #1905983 reported by Michał Sawicz
This bug affects 2 people
Affects     Status        Importance  Assigned to   Milestone
cloud-init  Expired       Undecided   Unassigned    (none)
snapd       Fix Released  High        Ian Johnson   2.48.2

Bug Description

We're implementing support for extra network interfaces in Multipass, relying on cloud-init to configure them.

On Ubuntu Core 18 images, the extra interface's configuration gets purged after rebooting a couple of times.

On first boot:

$ cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        default:
            dhcp4: true
            match:
                macaddress: 52:54:00:f3:9f:51
        extra0:
            dhcp4: true
            dhcp4-overrides:
                route-metric: 200
            match:
                macaddress: 52:54:00:1a:4f:f9
            optional: true
    version: 2

But after a reboot or two (automatic, due to a refresh):

$ cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        eth0:
            dhcp4: true
            match:
                macaddress: 52:54:00:f3:9f:51
            set-name: eth0
    version: 2

Attached is the result of `collect-logs`.

Dan Watkins (oddbloke) wrote:

Thanks for the bug report, Michał! I'm going to break down what I'm seeing happen in the log in this comment, and then leave another comment with analysis of what I think may be happening.

On the first reboot, we see:

2020-11-27 15:10:03,596 - handlers.py[DEBUG]: start: init-local/check-cache: attempting to read from cache [check]
2020-11-27 15:10:03,596 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2020-11-27 15:10:03,597 - util.py[DEBUG]: Read 10974 bytes from /var/lib/cloud/instance/obj.pkl
2020-11-27 15:10:03,612 - util.py[DEBUG]: Reading from /var/lib/cloud/seed/nocloud/meta-data (quiet=False)
2020-11-27 15:10:03,613 - util.py[DEBUG]: Reading from /var/lib/cloud/seed/nocloud-net/meta-data (quiet=False)
2020-11-27 15:10:03,613 - stages.py[DEBUG]: cache invalid in datasource: DataSourceNoCloud [seed=/dev/sr0][dsmode=net]
2020-11-27 15:10:03,613 - handlers.py[DEBUG]: finish: init-local/check-cache: SUCCESS: cache invalid in datasource: DataSourceNoCloud [seed=/dev/sr0][dsmode=net]

So cloud-init goes to find metadata for this boot, and detects this as a first boot:

2020-11-27 15:10:03,617 - __init__.py[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloud'>
2020-11-27 15:10:03,617 - __init__.py[DEBUG]: Update datasource metadata and network config due to events: New instance first boot

However, it then finds metadata which indicates that the instance ID _hasn't_ changed (it's still august-guppy):

2020-11-27 15:10:03,838 - main.py[DEBUG]: [local] init will now be targeting instance id: august-guppy. new=False

And so we revert to the expected behaviour:

2020-11-27 15:10:03,863 - __init__.py[DEBUG]: Datasource DataSourceNoCloud [seed=/dev/sr0][dsmode=net] not updated for events: System boot
2020-11-27 15:10:03,863 - stages.py[DEBUG]: No network config applied. Neither a new instance nor datasource network update on 'System boot' event

Then, on the second reboot, things go sideways. We see the same invalid cache (at 2020-11-27 15:12:17,336) and the same first boot detection (at 2020-11-27 15:12:17,339). Where things change, however, is that cloud-init does not find any NoCloud metadata:

2020-11-27 15:12:17,351 - __init__.py[DEBUG]: Datasource DataSourceNoCloud [seed=None][dsmode=net] not updated for events: New instance first boot
2020-11-27 15:12:17,352 - handlers.py[DEBUG]: finish: init-local/search-NoCloud: SUCCESS: no local data found from DataSourceNoCloud

This means that it treats this as a fresh boot; as there is no network_data available from a data source, cloud-init generates the fallback configuration and applies it to the system:

2020-11-27 15:12:17,470 - stages.py[INFO]: Applying network configuration from fallback bringup=False: {'ethernets': {'eth0': {'dhcp4': True, 'set-name': 'eth0', 'match': {'macaddress': '52:54:00:f3:9f:51'}}}, 'version': 2}

As we can see, this is the configuration you're seeing rendered into the instance.

Michał Sawicz (saviq) wrote:

Right, what we can say is that the configuration of the instance doesn't change between reboots; it even still has the cloud-config ISO attached. Not sure why it wouldn't find the NoCloud data, but IIUC it shouldn't even be looking?

Michał Sawicz (saviq) wrote:

Another point is that those images are somewhat outdated… the Core 18 image itself is from February:

http://cdimage.ubuntu.com/ubuntu-core/18/stable/current/

The appliance images are newer, but still from August:

http://cdimage.ubuntu.com/ubuntu-core/appliances/

Dan Watkins (oddbloke) wrote:

So: cloud-init needs some external source of truth about what the current instance's ID is; otherwise we can't tell whether this is a first boot or not (see https://cloudinit.readthedocs.io/en/latest/topics/boot.html#first-boot-determination for full details [0]).

In this instance's case, that external source of truth is the NoCloud configuration presented to the instance at /dev/sr0. I strongly suspect that by the third boot in this sequence, /dev/sr0 is no longer available to the instance, so it cannot determine whether this is a first boot or not; the default behaviour is to treat it as a first boot (because, generally speaking, datasources are consistently available: if we don't find, e.g., the EC2 API, then it almost certainly means this instance is no longer in EC2). Does this match what you expect Multipass to be doing?

On further reflection, I think the second boot behaviour is expected: the check that's failing is intended as a quick shortcut around datasource discovery, and mounting an ISO to read config is not "quick". So I _think_ we expect the cache check to fail, and then the full discovery (which _does_ mount the ISO) to find it again and proceed as normal: that's what we see happening.

Lastly, I know that snapd does have some custom cloud-init handling, as a result of https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1879530. I don't recall exactly how it behaves, so it's probably worth looping in someone from snapd (perhaps Ian?) to chime in on what it's doing, in case that has a bearing on the issue at hand.

[0] Please let me know if you think that doc could be improved!
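
One way to see both sides of that comparison from inside the instance (a diagnostic sketch; the cached-ID path is cloud-init's standard layout, the mount point is arbitrary):

```
# instance ID cloud-init cached on a previous boot
$ cat /var/lib/cloud/data/instance-id
august-guppy
# instance ID the NoCloud seed is presenting now
$ sudo mount -o ro /dev/sr0 /mnt
$ grep instance-id /mnt/meta-data
instance-id: august-guppy
```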

Michał Sawicz (saviq) wrote:

@oddbloke the ISO continues to be available, it's just a `mount /dev/cdrom…` away:

$ sudo mount /dev/cdrom /mnt/cdrom/
mount: /mnt/cdrom: WARNING: device write-protected, mounted read-only.
$ ls /mnt/cdrom/
meta-data network-config user-data vendor-data
$ cat /mnt/cdrom/meta-data
#cloud-config
instance-id: august-guppy
local-hostname: august-guppy

$ cat /mnt/cdrom/network-config
#cloud-config
version: 2
ethernets:
  default:
    match:
      macaddress: 52:54:00:f3:9f:51
    dhcp4: true
  extra0:
    match:
      macaddress: 52:54:00:1a:4f:f9
    dhcp4: true
    dhcp4-overrides:
      route-metric: 200
    optional: true

Michał Sawicz (saviq) wrote:

Oh, and one more data point: we don't see this behaviour _outside_ of Ubuntu Core images, so indeed the `snapd` lead may be worth following.

Dan Watkins (oddbloke) wrote:

> the ISO continues to be available

OK, well, that's annoying. ;) Can you run the following on the instance (this should indicate whether or not cloud-init is detecting the device as available):

python3 -c "from cloudinit.stages import _pkl_load; obj = _pkl_load('/var/lib/cloud/instance/obj.pkl'); print(obj._get_devices(obj.ds_cfg.get('fs_label', 'cidata')))"

And also `ls /dev/disk/by-label/ -lah`; we're looking for a "cidata" labelled drive. This is what it looks like on a LXD VM:

total 0
drwxr-xr-x 2 root root 100 Nov 27 15:02 .
drwxr-xr-x 7 root root 140 Nov 27 15:02 ..
lrwxrwxrwx 1 root root 11 Nov 27 15:02 UEFI -> ../../sda15
lrwxrwxrwx 1 root root 9 Nov 27 15:02 cidata -> ../../sdb
lrwxrwxrwx 1 root root 10 Nov 27 15:02 cloudimg-rootfs -> ../../sda1

> the `snapd` lead may be worth following.

Yeah, that specific bug was to do with ignoring drives which are inserted after first boot: that sounds very pertinent to what's happening here. (I would _expect_ cloud-init to just be straight up disabled if snapd were doing something though, so that may be a dead end; perhaps I'm misremembering the snapd behaviour, however.)

Michał Sawicz (saviq) wrote:

$ sudo python3 -c "from cloudinit.stages import _pkl_load; obj = _pkl_load('/var/lib/cloud/instance/obj.pkl'); print(obj._get_devices(obj.ds_cfg.get('fs_label', 'cidata')))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: 'DataSourceNone' object has no attribute '_get_devices'

$ ls /dev/disk/by-label/ -lah
total 0
drwxr-xr-x 2 root root 100 Nov 27 15:12 .
drwxr-xr-x 8 root root 160 Nov 27 15:12 ..
lrwxrwxrwx 1 root root 9 Nov 27 15:12 cidata -> ../../sr0
lrwxrwxrwx 1 root root 10 Nov 27 15:12 system-boot -> ../../sda2
lrwxrwxrwx 1 root root 10 Nov 27 15:12 writable -> ../../sda3

Dan Watkins (oddbloke) wrote:

> AttributeError: 'DataSourceNone' object has no attribute '_get_devices'

Aha, right; should have seen that coming, as we already knew we'd failed to detect the datasource which has the `_get_devices` method.

> lrwxrwxrwx 1 root root 9 Nov 27 15:12 cidata -> ../../sr0

OK, well, that looks right!

Is there any way I can reproduce this locally? And, if I can, do you know if I would be able to modify the cloud-init source code in the image, or is it immutable?

Michał Sawicz (saviq) wrote:

Easiest to reproduce:

$ lxc remote add --protocol simplestreams saviq-appliances https://people.canonical.com/\~msawicz/appliance-lxd/
$ lxc init saviq-appliances:nextcloud --vm ncloud
$ lxc config edit ncloud

You want these snippets to be reflected in there:

config:
  volatile.eth0.hwaddr: 00:16:3e:70:ae:ff
  volatile.eth1.hwaddr: 00:16:3e:70:ae:ef
  user.user-data: |
    users:
    - default
    ssh_import_id:
    - lp:oddbloke
  user.network-config: |
    version: 2
      ethernets:
        eth0:
          match:
            macaddress: 00:16:3e:70:ae:ff
          dhcp4: true
        extra0:
          match:
            macaddress: 00:16:3e:70:ae:ef
          dhcp4: true
          dhcp4-overrides:
            route-metric: 200
          optional: true

devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth1:
    name: eth1
    nictype: bridged
    parent: lxdbr0
    type: nic

$ lxc start ncloud

You can then monitor what it's doing:

$ lxc console ncloud

And you _should_ be able to SSH in to the IPs listed in `lxc ls`.

Michał Sawicz (saviq) wrote:

You probably don't want to complete console-conf network config, as that will overwrite the cloud-init one.

Michał Sawicz (saviq) wrote:

Hmm, the SSH authorized keys don't seem to get populated, so I can't SSH in… But you get the drift :)

Michał Sawicz (saviq) wrote:

OK, some additional context: snapd is (by design) writing this out on first boot:

```
$ cat /etc/cloud/cloud.cfg.d/zzzz_snapd.cfg
datasource_list: [NoCloud]
datasource:
  NoCloud:
    fs_label: null
```

So that explains why cloud-init would not find the ISO volume when this is in place. The problem then seems to be that cloud-init treats subsequent boots as first ones?
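
For context on why that setting hides the seed: NoCloud's device discovery is driven by the filesystem label (the `cidata` symlink seen earlier), so `fs_label: null` leaves it with no label to scan for. A quick way to confirm the label the seed carries (a diagnostic sketch; exact blkid output varies by image):

```
$ sudo blkid /dev/sr0
/dev/sr0: LABEL="cidata" TYPE="iso9660"
```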

Dan Watkins (oddbloke) wrote:

> So that explains why cloud-init would not find the ISO volume when this is in place.

Yep, that would explain it.

> The problem then seems to be that cloud-init treats subsequent boots as first ones?

The problem is that snapd is configuring cloud-init in a way that ensures that cloud-init will detect all subsequent boots as first ones if the instance ID is only provided by a configuration ISO. snapd _is_ doing this to avoid a security issue with cloud-init on physical devices (which is: anyone can rock up with a cloud-config drive on a USB and own your device if fs_label is not null: https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1879530), but cloud-init is behaving exactly as we would expect given this configuration.

I expect you can work around this by specifying the `manual_cache_clean` configuration option per https://cloudinit.readthedocs.io/en/latest/topics/boot.html#first-boot-determination so that cloud-init will _never_ detect a first boot on this instance (unless its state is blown away).

That said, I don't know if there's a more general snapd fix: is there a class of appliance like multipass which has a different threat model (i.e. if someone can attach a malicious ISO to your multipass VM, they can probably run `multipass shell`) for which this configuration _should not_ be written out?

(I'm going to move this to Incomplete; please move it back to New if the manual_cache_clean workaround... doesn't. ;)

Changed in cloud-init:
status: New → Incomplete
Ian Johnson (anonymouse67) wrote:

> The problem is that snapd is configuring cloud-init in a way that ensures that cloud-init will detect all subsequent boots as first ones if the instance ID is only provided by a configuration ISO

What if snapd also recorded, in the zzzz_snapd.cfg file, the same instance_id as from first boot? Would cloud-init then do the right thing on reboots?

I admit I'm a bit unclear what the right thing for cloud-init to do here is, because the behavior is confusing to me. It really seems to me like cloud-init should cache or otherwise process the data it gets from the first-boot so that other things like netplan or systemd apply the configuration on subsequent boots without needing cloud-init to run. Or is the issue that when cloud-init runs on subsequent boots it _undoes_ the processing from cloud-init on the first boot?

Dan Watkins (oddbloke) wrote:

> It really seems to me like cloud-init should cache or otherwise process the data it gets from the first-boot so that other things like netplan or systemd apply the configuration on subsequent boots

cloud-init does cache its data, and will only apply per-instance configuration once per instance.
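
Concretely, that cache is the /var/lib/cloud tree seen in the logs above: the pickled datasource plus per-instance semaphores that gate once-per-instance modules. A sketch of what that looks like (module names illustrative):

```
$ ls /var/lib/cloud/instance/
boot-finished  datasource  obj.pkl  sem  ...
$ ls /var/lib/cloud/instance/sem/
config_ssh  config_users_groups  ...
```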

> without needing cloud-init to run.

cloud-init doesn't need to run to _configure_ the system on subsequent boots (though it _can_ be configured to do things on every boot), it needs to run on every boot to determine whether this is a first boot or not. Without this, captured images from instances in clouds would not behave as fresh instances as people expect.

> Or is the issue that when cloud-init runs on subsequent boots it _undoes_ the processing from cloud-init on the first boot?

The issue is that snapd is configuring cloud-init to not be able to detect its current instance ID, so it (correctly, for clouds) treats this as a first boot. It isn't undoing processing per se, but it is performing the processing again (and, of course, as snapd has configured cloud-init to not find the multipass metadata, it applies a default configuration; this certainly would look like an undo in some respects, particularly with regards to networking).

As laid out in https://cloudinit.readthedocs.io/en/latest/topics/boot.html#first-boot-determination, `manual_cache_clean` is the configuration option to use to indicate that cloud-init should always trust the current instance ID (and therefore _never_ detect a first boot again).

Ian Johnson (anonymouse67) wrote:

@oddbloke thanks for the explanation. Could you provide some input on my question, though: if snapd saved the instance_id to the zzzz_snapd.cfg file, would that fix the issue? I.e., would writing something like this instead fix things here?

```
datasource_list: [NoCloud]
datasource:
  NoCloud:
    fs_label: null
instance_id: <whatever was generated the first time>
```

(To be fair, I'm not sure this is the right place in the config, or whether it needs to sit underneath some other config item, etc., but hopefully this demonstrates the point.)

What I would really like to avoid is having to somehow parse the full effective cloud-init config to save it into the zzzz_snapd.cfg file, as that seems rather tedious, and it doesn't seem that cloud-init has a nice way to tell us what its total "effective config" is for all possible datasources and configuration inputs.

Ian Johnson (anonymouse67) wrote:

Or are you saying that the correct thing for snapd to write is:

```
datasource_list: [NoCloud]
datasource:
  NoCloud:
    fs_label: null
manual_cache_clean: false
```

?

Sorry, what you were saying about manual_cache_clean only clicked after I posted my previous comment.

Michał Sawicz (saviq) wrote:

@anonymouse67 IIUC you want `true`:

https://cloudinit.readthedocs.io/en/latest/topics/boot.html?highlight=manual_cache_clean#first-boot-determination

> When false (the default), cloud-init will check and clean the cache if the instance IDs do not match (this is the default, as discussed above). When true, cloud-init will trust the existing cache (and therefore not clean it).

Michał Sawicz (saviq) wrote:

I can confirm that adding `manual_cache_clean: true` to zzzz_snapd.cfg makes things work for us.
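
For reference, that makes the full file the snippet quoted above plus the new key (a sketch, not a capture from a real image):

```
$ cat /etc/cloud/cloud.cfg.d/zzzz_snapd.cfg
datasource_list: [NoCloud]
datasource:
  NoCloud:
    fs_label: null
manual_cache_clean: true
```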

Changed in snapd:
assignee: nobody → Ian Johnson (anonymouse67)
importance: Undecided → Medium
status: New → Triaged
Michał Sawicz (saviq) wrote:

@oddbloke can you please confirm that this is the approach you'd recommend?

Dan Watkins (oddbloke) wrote:

Given my understanding of the requirements here, yes, I think that is the right path forward: once a VM is launched, it will be locked to the first given instance ID (and therefore cloud-init will never detect a new first boot, even when the config ISO "disappears" from its view, and will never reapply first-boot config).

Michał Sawicz (saviq) wrote:

@ijohnson I wonder if Multipass should write out the `zzzz_snapd.cfg` contents itself, while the snapd situation (new release, images rebuilt etc.) settles down?

Changed in snapd:
importance: Medium → High
status: Triaged → In Progress
Ian Johnson (anonymouse67) wrote:

Thanks for confirming @oddbloke, I will work on a PR to snapd adding this key for new installs only.

To date this has been the only regression from the work we did in June for the CVE, so we're inclined to say that nobody else has had issues with the existing implementation. If other bugs are found, or there is good reason to apply this to already-existing installs, we can do that, but it will take more time.

Ian Johnson (anonymouse67) wrote:

@saviq could you provide an example cloud-init configuration drive that exhibits the problem? I tried the one generated from the reproducer you provided above, but when I boot it with UC18, cloud-init complains about it:

[ 21.724669] cloud-init[827]: 2020-12-08 23:07:17,165 - util.py[WARNING]: Failed loading yaml blob. Invalid format at line 2 column 12: "mapping values are not allowed here
[ 21.740363] cloud-init[827]: in "<unicode string>", line 2, column 12:
[ 21.749154] cloud-init[827]: ethernets:
[ 21.755861] cloud-init[827]: ^"
[ 21.763337] cloud-init[827]: 2020-12-08 23:07:17,174 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloudNet'> failed

We can propose a fix without the regression test added, but it would be good to have a regression spread test on top of the behavioral unit tests.

Ian Johnson (anonymouse67) wrote:
Changed in snapd:
milestone: none → 2.48.2
Ian Johnson (anonymouse67) wrote:

This fix is now in the candidate channel of core/snapd, and is being phased into stable now.

We still do not have a way to regression test this, so if someone can provide us an example way to reproduce the problem (or just some kind of cloud-init config that would be broken by not having this fix) that would be awesome.
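
For anyone picking this up: presumably a seed matching the files quoted in the `/mnt/cdrom` listing above, built with `cloud-localds` from cloud-image-utils, would do it (a sketch, assuming that tool; `user-data` can be a bare `#cloud-config` stub):

```
$ cloud-localds --network-config=network-config seed.img user-data meta-data
# attach seed.img to the UC18 VM as a CD-ROM, then boot a couple of times
```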

Changed in snapd:
status: In Progress → Fix Committed
status: Fix Committed → Fix Released
Michał Sawicz (saviq) wrote:

Hey @sil2100, the appliance images will need a rebuild with this fix.

James Falcon (falcojr) wrote:
Changed in cloud-init:
status: Incomplete → Expired