_get_data() is invoked twice in init-local for Azure

Bug #1989651 reported by Chris Patterson
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cloud-init (Ubuntu)
Triaged
Undecided
Unassigned

Bug Description

On provisioning boot _get_data() is called once in init-local.

On subsequent boots for Ubuntu 18.04 and 20.04 it is called twice when datasource update is considered for "boot-new-instance" and 'boot".

grep -e Cloud-init -e get_data -e Update /var/log/cloud-init.log
2022-09-15 04:04:44,981 - util.py[DEBUG]: Cloud-init v. 22.2-0ubuntu1~18.04.3 running 'init-local' at Thu, 15 Sep 2022 04:04:44 +0000. Up 16.54 seconds.
2022-09-15 04:04:45,155 - __init__.py[DEBUG]: Update datasource metadata and network config due to events: boot-new-instance
2022-09-15 04:04:45,156 - handlers.py[DEBUG]: start: azure-ds/_get_data: _get_data
2022-09-15 04:04:45,930 - handlers.py[DEBUG]: finish: azure-ds/_get_data: SUCCESS: _get_data
2022-09-15 04:04:48,030 - util.py[DEBUG]: Cloud-init v. 22.2-0ubuntu1~18.04.3 running 'init' at Thu, 15 Sep 2022 04:04:48 +0000. Up 19.59 seconds.
2022-09-15 04:04:57,975 - util.py[DEBUG]: Cloud-init v. 22.2-0ubuntu1~18.04.3 running 'modules:config' at Thu, 15 Sep 2022 04:04:57 +0000. Up 29.49 seconds.
2022-09-15 04:04:59,856 - util.py[DEBUG]: Cloud-init v. 22.2-0ubuntu1~18.04.3 running 'modules:final' at Thu, 15 Sep 2022 04:04:59 +0000. Up 31.38 seconds.
2022-09-15 04:05:00,036 - util.py[DEBUG]: Cloud-init v. 22.2-0ubuntu1~18.04.3 finished at Thu, 15 Sep 2022 04:05:00 +0000. Datasource DataSourceAzure [seed=/dev/sr0]. Up 31.62 seconds

<reboot>

2022-09-15 04:06:11,100 - util.py[DEBUG]: Cloud-init v. 22.2-0ubuntu1~18.04.3 running 'init-local' at Thu, 15 Sep 2022 04:06:11 +0000. Up 6.98 seconds.
2022-09-15 04:06:11,128 - __init__.py[DEBUG]: Update datasource metadata and network config due to events: boot-new-instance
2022-09-15 04:06:11,128 - handlers.py[DEBUG]: start: azure-ds/_get_data: _get_data
2022-09-15 04:06:12,062 - handlers.py[DEBUG]: finish: azure-ds/_get_data: SUCCESS: _get_data
2022-09-15 04:06:12,111 - __init__.py[DEBUG]: Update datasource metadata and network config due to events: boot
2022-09-15 04:06:12,111 - handlers.py[DEBUG]: start: azure-ds/_get_data: _get_data
2022-09-15 04:06:12,415 - handlers.py[DEBUG]: finish: azure-ds/_get_data: SUCCESS: _get_data
2022-09-15 04:06:14,700 - util.py[DEBUG]: Cloud-init v. 22.2-0ubuntu1~18.04.3 running 'init' at Thu, 15 Sep 2022 04:06:14 +0000. Up 10.58 seconds.
2022-09-15 04:06:16,934 - util.py[DEBUG]: Cloud-init v. 22.2-0ubuntu1~18.04.3 running 'modules:config' at Thu, 15 Sep 2022 04:06:16 +0000. Up 12.73 seconds.
2022-09-15 04:06:17,774 - util.py[DEBUG]: Cloud-init v. 22.2-0ubuntu1~18.04.3 running 'modules:final' at Thu, 15 Sep 2022 04:06:17 +0000. Up 13.57 seconds.
2022-09-15 04:06:17,858 - util.py[DEBUG]: Cloud-init v. 22.2-0ubuntu1~18.04.3 finished at Thu, 15 Sep 2022 04:06:17 +0000. Datasource DataSourceAzure [seed=/var/lib/waagent]. Up 13.76 seconds

Chris Patterson (cjp256)
description: updated
summary: - _get_data() is invoked twice in init-local for Azure
+ _get_data() is invoked twice in init-local for Azure on 18.04 & 20.04
Chris Patterson (cjp256)
summary: - _get_data() is invoked twice in init-local for Azure on 18.04 & 20.04
+ _get_data() is invoked twice in init-local for Azure
Chris Patterson (cjp256)
description: updated
Revision history for this message
Chad Smith (chad.smith) wrote :

Agreed this is a problem on Bionic and Focal due to obj.pkl being removed per boot and the datasource also being set to generate network config PER_BOOT. I'll add more context on the potential issue later today.

Changed in cloud-init (Ubuntu):
status: New → Triaged
Revision history for this message
Chad Smith (chad.smith) wrote :

Given that Bionic/Focal remove the /var/lib/cloud/instance/obj.pkl cloud-init calls find_source to rediscover the Azure datasource and persist a discovered datasource cache by crawling get_data due to the event BOOT_NEW_INSTANCE in order to persist said obj.pkl.[1]

The DatasourceAzure is also configured to run get_data to update network config per EventType.BOOT as well[2] during apply_network_config calls. This is an event update which is only checks whether the "apply_network_config" semaphore has run on this boot, and if the datasource supports per EventType.BOOT configuration.[2] It doesn't currently have visibility to whether get_data ran for another event reason (BOOT_NEW_INSTANCE) which is unique to a condition introduced by Azure's removal of obj.pkl triggering the find_source call again.

When init-local ties to run apply_network config, and Azure is set to allow EventType.BOOT both conditions will be true in [2] and result in that 2nd get_data call.

To avoid this waste, we could either:
 1- make `update_metadata_if_supported` aware that both EventType.BOOT_NEW_INSTANCE, and EventType.BOOT fired by preserving the per boot events that have already been processed so we don't invoke a 2nd BOOT-related event when another has already run get_data
 2. make `update_metadata_if_supported` aware of last get_data and if it happened already during boot, avoid calling get_data a 2nd time during the same boot
 3. avoid calling update_metadata_if_supported for EventType.BOOT in apply_network_config if we have called the method due to an EventType.BOOT_NEW_INSTANCE

[1] https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/__init__.py#L940
[2] https://github.com/canonical/cloud-init/blob/main/cloudinit/stages.py#L903

Revision history for this message
Chad Smith (chad.smith) wrote :

And finally, as brought up in an Azure sync call by Chris, we could also avoid automatic removal of the cached obj.pkl file per boot via /lib/systemd/system/cloud-init-local.service.d/50-azure-clear-persistent-obj-pkl.conf.

Cloud-init 21.4 has already SRU'd functionality to Focal that has capacity to re-write network configuration per EventTypeBOOT (which is where one of the _get_data calls we see in the bug description). The risk of the removal of /lib/systemd/system/cloud-init-local.service.d/50-azure-clear-persistent-obj-pkl.conf from focal images should be limited. We have already done this in Jammy and not received any bugs related to the absence of 50-azure-clear-persistent-obj-pkl.conf.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.