Activity log for bug #1998655

Date Who What changed Old value New value Message
2022-12-02 22:41:54 Mark Rogers bug added bug
2022-12-03 16:28:20 Mark Rogers description We're running RHEL 8.7 VMs on Apache CloudStack. The cloud-init service waits for Network Manager to start, but it runs before Network Manager is done. Because the network isn't ready yet, the CloudStack data source is unable to determine the virtual router address. This causes cloud-init to fall back to using DataSourceNone. That leaves the VM without the proper configuration. I see one bug, and one potential bug here. The bug is that cloud-init is starting too early in the boot process. The potential bug is that the CloudStack data source doesn't have any retry logic when it looks for the virtual router. To fix the timing but, I recommend adding `After=NetworkManager-wait-online.service` to the cloud-init.service file. That will make it wait until after the network interface is ready. Note that it's already waiting for systemd-networkd-wait-online.service, but that service doesn't exist in RHEL 8.7, so systemd ignores that line. You can see the timing issue in this excerpt from /var/log/messages: Dec 2 18:27:04 mark-test-fs NetworkManager[723]: <info> [1670005624.5790] device (ens3): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed') Dec 2 18:27:04 mark-test-fs NetworkManager[723]: <info> [1670005624.6212] device (ens3): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed') Dec 2 18:27:04 mark-test-fs NetworkManager[723]: <info> [1670005624.6213] device (ens3): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed') Dec 2 18:27:04 mark-test-fs NetworkManager[723]: <info> [1670005624.6215] manager: NetworkManager state is now CONNECTED_SITE Dec 2 18:27:04 mark-test-fs NetworkManager[723]: <info> [1670005624.6217] device (ens3): Activation: successful, device activated. Dec 2 18:27:04 mark-test-fs NetworkManager[723]: <info> [1670005624.6220] manager: NetworkManager state is now CONNECTED_GLOBAL Dec 2 18:27:04 mark-test-fs NetworkManager[723]: <info> [1670005624.6222] manager: startup complete Dec 2 18:27:04 mark-test-fs systemd[1]: Started Network Manager Wait Online. Dec 2 18:27:04 mark-test-fs systemd[1]: Starting Initial cloud-init job (metadata service crawler)... Dec 2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,027 - util.py[WARNING]: Failed to fetch password from virtual router 10.11.8.0 Dec 2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,043 - stages.py[INFO]: Loaded datasource DataSourceCloudStack - DataSourceCloudStack Dec 2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,093 - stages.py[INFO]: Applying network configuration from fallback bringup=True: {'ethernets': {'ens3': {'dhcp4': True, 'set-name': 'ens3', 'match': {'macaddress': '1e:00:29:00:02:b4'}}}, 'version': 2} Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.1480] agent-manager: agent[f74dfcab0129a270,:1.13/nmcli-connect/0]: agent registered Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.1487] device (ens3): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed') Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.1489] manager: NetworkManager state is now DISCONNECTING Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.1494] device (ens3): disconnecting for new activation request. Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.1494] audit: op="connection-activate" uuid="21d47e65-8523-1a06-af22-6f121086f085" name="System ens3" pid=975 uid=0 result="success" Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2131] device (ens3): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed') Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2247] dhcp4 (ens3): canceled DHCP transaction Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2247] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds) Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2247] dhcp4 (ens3): state changed no lease Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2770] manager: NetworkManager state is now DISCONNECTED Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2771] device (ens3): Activation: starting connection 'System ens3' (21d47e65-8523-1a06-af22-6f121086f085) Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2777] device (ens3): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed') Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2778] manager: NetworkManager state is now CONNECTING Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2779] device (ens3): state change: prepare -> config (reason 'none', sys-iface-state: 'managed') Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2790] device (ens3): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed') Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2791] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds) Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2855] dhcp4 (ens3): state changed new lease, address=10.11.8.179 Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2858] policy: set 'System ens3' (ens3) as default for IPv4 routing and DNS We're running RHEL 8.7 VMs on Apache CloudStack. The cloud-init service waits for Network Manager to start, but it runs before Network Manager is done. Because the network isn't ready yet, the CloudStack data source is unable to determine the virtual router address. This causes cloud-init to fall back to using DataSourceNone. That leaves the VM without the proper configuration. I see one bug, and one potential bug here. The bug is that cloud-init is starting too early in the boot process. The potential bug is that the CloudStack data source doesn't have any retry logic when it looks for the virtual router. To fix the timing bug, I recommend adding `After=NetworkManager-wait-online.service` to the cloud-init.service file. That will make it wait until after the network interface is ready. Note that it's already waiting for systemd-networkd-wait-online.service, but that service doesn't exist in RHEL 8.7, so systemd ignores that line. You can see the timing issue in this excerpt from /var/log/messages: Dec 2 18:27:04 mark-test-fs NetworkManager[723]: <info> [1670005624.5790] device (ens3): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed') Dec 2 18:27:04 mark-test-fs NetworkManager[723]: <info> [1670005624.6212] device (ens3): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed') Dec 2 18:27:04 mark-test-fs NetworkManager[723]: <info> [1670005624.6213] device (ens3): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed') Dec 2 18:27:04 mark-test-fs NetworkManager[723]: <info> [1670005624.6215] manager: NetworkManager state is now CONNECTED_SITE Dec 2 18:27:04 mark-test-fs NetworkManager[723]: <info> [1670005624.6217] device (ens3): Activation: successful, device activated. Dec 2 18:27:04 mark-test-fs NetworkManager[723]: <info> [1670005624.6220] manager: NetworkManager state is now CONNECTED_GLOBAL Dec 2 18:27:04 mark-test-fs NetworkManager[723]: <info> [1670005624.6222] manager: startup complete Dec 2 18:27:04 mark-test-fs systemd[1]: Started Network Manager Wait Online. Dec 2 18:27:04 mark-test-fs systemd[1]: Starting Initial cloud-init job (metadata service crawler)... Dec 2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,027 - util.py[WARNING]: Failed to fetch password from virtual router 10.11.8.0 Dec 2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,043 - stages.py[INFO]: Loaded datasource DataSourceCloudStack - DataSourceCloudStack Dec 2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,093 - stages.py[INFO]: Applying network configuration from fallback bringup=True: {'ethernets': {'ens3': {'dhcp4': True, 'set-name': 'ens3', 'match': {'macaddress': '1e:00:29:00:02:b4'}}}, 'version': 2} Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.1480] agent-manager: agent[f74dfcab0129a270,:1.13/nmcli-connect/0]: agent registered Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.1487] device (ens3): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed') Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.1489] manager: NetworkManager state is now DISCONNECTING Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.1494] device (ens3): disconnecting for new activation request. Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.1494] audit: op="connection-activate" uuid="21d47e65-8523-1a06-af22-6f121086f085" name="System ens3" pid=975 uid=0 result="success" Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2131] device (ens3): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed') Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2247] dhcp4 (ens3): canceled DHCP transaction Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2247] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds) Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2247] dhcp4 (ens3): state changed no lease Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2770] manager: NetworkManager state is now DISCONNECTED Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2771] device (ens3): Activation: starting connection 'System ens3' (21d47e65-8523-1a06-af22-6f121086f085) Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2777] device (ens3): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed') Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2778] manager: NetworkManager state is now CONNECTING Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2779] device (ens3): state change: prepare -> config (reason 'none', sys-iface-state: 'managed') Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2790] device (ens3): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed') Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2791] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds) Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2855] dhcp4 (ens3): state changed new lease, address=10.11.8.179 Dec 2 18:27:05 mark-test-fs NetworkManager[723]: <info> [1670005625.2858] policy: set 'System ens3' (ens3) as default for IPv4 routing and DNS
2022-12-06 19:12:35 James Falcon cloud-init: status New Triaged
2022-12-13 16:55:52 James Falcon cloud-init: status Triaged Fix Committed
2023-02-22 16:52:30 Alberto Contreras cloud-init: status Fix Committed Fix Released
2023-05-12 20:07:11 James Falcon bug watch added https://github.com/canonical/cloud-init/issues/4050