BACKGROUND:
cloud-init-local.service runs before networking has started. On non-Oracle platforms, before networking has come up, cloud-init will create an ephemeral connection to the cloud's IMDS using DHCP to retrieve instance metadata. On Oracle, this normally isn't necessary as we boot with connectivity to the IMDS out of the box. This can be seen in the following Jammy instance using an SR-IOV NIC:
2024-03-05 14:09:05,351 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/opc/v2/instance/' with {'url': 'http://169.254.169.254/opc/v2/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0, 'headers': {'User-Agent': 'Cloud-Init/23.3.3-0ubuntu0~22.04.1', 'Authorization': 'Bearer Oracle'}} configuration
2024-03-05 14:09:05,362 - url_helper.py[DEBUG]: Read from http://169.254.169.254/opc/v2/instance/ (200, 2663b) after 1 attempts
2024-03-05 14:09:05,362 - ephemeral.py[DEBUG]: Skip ephemeral DHCP setup, instance has connectivity to {'url': 'http://169.254.169.254/opc/v2/instance/', 'headers': {'Authorization': 'Bearer Oracle'}, 'timeout': 5}
2024-03-05 14:09:05,362 - url_helper.py[DEBUG]: [0/3] open 'http://169.254.169.254/opc/v2/instance/' with {'url': 'http://169.254.169.254/opc/v2/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'headers': {'User-Agent': 'Cloud-Init/23.3.3-0ubuntu0~22.04.1', 'Authorization': 'Bearer Oracle'}} configuration
2024-03-05 14:09:05,368 - url_helper.py[DEBUG]: Read from http://169.254.169.254/opc/v2/instance/ (200, 2663b) after 1 attempts
Notice the "Skip ephemeral DHCP setup, instance has connectivity" message. It means that cloud-init has determined that it already has connectivity and doesn't need to do any additional setup to retrieve data from the IMDS.
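For readers unfamiliar with this step, the decision can be sketched roughly as follows. This is only an illustration of the probe-then-fallback idea, not cloud-init's actual code: `metadata_network_context`, `fake_ephemeral_dhcp` and `fetch_with` are hypothetical names, and the real logic lives in cloud-init's ephemeral.py.

```python
from contextlib import contextmanager, nullcontext

# Hypothetical stand-ins for illustration only; none of these names are
# cloud-init APIs.
events = []


@contextmanager
def fake_ephemeral_dhcp():
    """Pretend to bring up, then tear down, a temporary DHCP lease."""
    events.append("dhcp up")
    try:
        yield
    finally:
        events.append("dhcp down")


def metadata_network_context(has_connectivity, ephemeral_dhcp):
    """Return a context manager under which the IMDS can be queried.

    If the probe already reaches the IMDS, no extra setup is needed, so a
    no-op context is returned. Otherwise fall back to ephemeral DHCP.
    """
    if has_connectivity():
        return nullcontext()
    return ephemeral_dhcp()


def fetch_with(has_connectivity):
    """Record a metadata fetch under whichever context was selected."""
    with metadata_network_context(has_connectivity, fake_ephemeral_dhcp):
        events.append("fetch")
```

With a probe that succeeds (the case in the logs above), only the fetch happens. With a probe that fails, the fetch is wrapped in DHCP setup and teardown.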
We can also see the same behavior on a Noble paravirtualized instance:
2024-03-01 20:51:33,482 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/opc/v2/instance/' with {'url': 'http://169.254.169.254/opc/v2/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0, 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1', 'Authorization': 'Bearer Oracle'}} configuration
2024-03-01 20:51:33,488 - url_helper.py[DEBUG]: Read from http://169.254.169.254/opc/v2/instance/ (200, 3067b) after 1 attempts
2024-03-01 20:51:33,488 - ephemeral.py[DEBUG]: Skip ephemeral DHCP setup, instance has connectivity to {'url': 'http://169.254.169.254/opc/v2/instance/', 'headers': {'Authorization': 'Bearer Oracle'}, 'timeout': 5}
2024-03-01 20:51:33,489 - url_helper.py[DEBUG]: [0/3] open 'http://169.254.169.254/opc/v2/instance/' with {'url': 'http://169.254.169.254/opc/v2/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1', 'Authorization': 'Bearer Oracle'}} configuration
2024-03-01 20:51:33,500 - url_helper.py[DEBUG]: Read from http://169.254.169.254/opc/v2/instance/ (200, 3067b) after 1 attempts
2024-03-01 20:51:33,501 - util.py[DEBUG]: Writing to /run/cloud-init/cloud-id-oracle - wb: [644] 7 bytes
PROBLEM:
On a Noble instance using Hardware-assisted (SR-IOV) networking, this is not working. cloud-init-local.service no longer has immediate connectivity to the IMDS. Since it cannot connect, it then attempts to create an ephemeral connection to the IMDS using DHCP. It is able to obtain a DHCP lease, but when it then tries to connect to the IMDS, the call just hangs. The call has no timeout, so this results in an instance that cannot be logged into, even via the serial console, because cloud-init is blocking the rest of boot. A simple cloud-init workaround is to add something along the lines of `timeout=2` to https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/DataSourceOracle.py#L349 . This allows the instance to finish booting. With the workaround applied (note the 'timeout': 2.0 in the requests), the logs show that cloud-init is unable to connect to the IMDS:
2024-03-05 14:23:54,836 - ephemeral.py[DEBUG]: Received dhcp lease on ens3 for 10.0.0.133/255.255.255.0
2024-03-05 14:23:54,837 - url_helper.py[DEBUG]: [0/3] open 'http://169.254.169.254/opc/v2/instance/' with {'url': 'http://169.254.169.254/opc/v2/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 2.0, 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1', 'Authorization': 'Bearer Oracle'}} configuration
2024-03-05 14:23:56,841 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again
2024-03-05 14:23:57,842 - url_helper.py[DEBUG]: [1/3] open 'http://169.254.169.254/opc/v2/instance/' with {'url': 'http://169.254.169.254/opc/v2/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 2.0, 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1', 'Authorization': 'Bearer Oracle'}} configuration
2024-03-05 14:23:59,847 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again
2024-03-05 14:24:00,847 - url_helper.py[DEBUG]: [2/3] open 'http://169.254.169.254/opc/v2/instance/' with {'url': 'http://169.254.169.254/opc/v2/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 2.0, 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1', 'Authorization': 'Bearer Oracle'}} configuration
2024-03-05 14:24:02,852 - url_helper.py[DEBUG]: [0/3] open 'http://169.254.169.254/opc/v1/instance/' with {'url': 'http://169.254.169.254/opc/v1/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 2.0, 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1'}} configuration
2024-03-05 14:24:04,855 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again
2024-03-05 14:24:05,855 - url_helper.py[DEBUG]: [1/3] open 'http://169.254.169.254/opc/v1/instance/' with {'url': 'http://169.254.169.254/opc/v1/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 2.0, 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1'}} configuration
2024-03-05 14:24:07,859 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again
2024-03-05 14:24:08,859 - url_helper.py[DEBUG]: [2/3] open 'http://169.254.169.254/opc/v1/instance/' with {'url': 'http://169.254.169.254/opc/v1/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 2.0, 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1'}} configuration
2024-03-05 14:24:10,863 - handlers.py[DEBUG]: finish: init-local/search-Oracle: FAIL: no local data found from DataSourceOracle
2024-03-05 14:24:10,863 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceOracle.DataSourceOracle'> failed
2024-03-05 14:24:10,863 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceOracle.DataSourceOracle'> failed
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOracle.py", line 370, in read_opc_metadata
instance_data = _fetch(metadata_version, path="instance")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOracle.py", line 346, in _fetch
return readurl(
^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 370, in readurl
raise excps[-1]
cloudinit.url_helper.UrlError: HTTPConnectionPool(host='169.254.169.254', port=80): Read timed out. (read timeout=2.0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 1028, in find_source
if s.update_metadata_if_supported(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 914, in update_metadata_if_supported
result = self.get_data()
^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 460, in get_data
return_value = self._check_and_get_data()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 392, in _check_and_get_data
return self._get_data()
^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOracle.py", line 165, in _get_data
fetched_metadata = read_opc_metadata(
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOracle.py", line 373, in read_opc_metadata
instance_data = _fetch(metadata_version, path="instance")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOracle.py", line 346, in _fetch
return readurl(
^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 370, in readurl
raise excps[-1]
cloudinit.url_helper.UrlError: HTTPConnectionPool(host='169.254.169.254', port=80): Read timed out. (read timeout=2.0)
2024-03-05 14:24:10,898 - main.py[DEBUG]: No local datasource found
Despite this, cloud-init is still able to read and render the networking configuration sourced from initramfs:
2024-03-05 14:24:10,899 - util.py[DEBUG]: Read 272 bytes from /run/net-ens3.conf
...
2024-03-05 14:24:10,914 - stages.py[INFO]: Applying network configuration from initramfs bringup=False: {'config': [{'type': 'physical', 'name': 'ens3', 'subnets': [{'type': 'dhcp', 'control': 'manual', 'netmask': '255.255.255.0', 'broadcast': '10.0.0.255', 'gateway': '10.0.0.1', 'dns_nameservers': ['169.254.169.254']}], 'mac_address': '02:00:17:0f:50:8d'}], 'version': 1}
2024-03-05 14:24:10,914 - util.py[DEBUG]: Writing to /run/cloud-init/sem/apply_network_config.once - wb: [644] 23 bytes
2024-03-05 14:24:10,915 - distros[DEBUG]: Selected renderer 'netplan' from priority list: ['netplan', 'eni', 'sysconfig']
2024-03-05 14:24:10,918 - subp.py[DEBUG]: Running command ['netplan', 'info'] with allowed return codes [0] (shell=False, capture=True)
2024-03-05 14:24:11,109 - subp.py[DEBUG]: command ['netplan', 'info'] took 0.1s to run
2024-03-05 14:24:11,109 - util.py[DEBUG]: Attempting to load yaml from string of length 332 with allowed root types (<class 'dict'>,)
2024-03-05 14:24:11,111 - util.py[DEBUG]: Writing to /etc/netplan/50-cloud-init.yaml - wb: [600] 481 bytes
2024-03-05 14:24:11,111 - subp.py[DEBUG]: Running command ['netplan', 'generate'] with allowed return codes [0] (shell=False, capture=True)
2024-03-05 14:24:11,300 - subp.py[DEBUG]: command ['netplan', 'generate'] took 0.1s to run
This allows networking to come up as expected on the primary interface, but cloud-init has been unable to fetch userdata/metadata or retrieve information about any secondary interfaces.
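To make the effect of the `timeout=2` workaround concrete, here is a minimal sketch of a bounded fetch with retries, plus the worst-case delay it implies. It is not cloud-init's `readurl` implementation: `read_with_retries`, `worst_case_seconds` and the injected `open_url` callable are hypothetical names, and the retry count and 1 second wait are read off the log above rather than taken from the source.

```python
import time
from typing import Callable


def read_with_retries(
    open_url: Callable[[str, float], bytes],
    url: str,
    timeout: float = 2.0,
    retries: int = 2,
    sec_between: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Try the URL up to retries + 1 times, each bounded by `timeout`.

    `open_url` is a hypothetical callable taking (url, timeout); it is
    injected so the sketch does not depend on any particular HTTP library.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return open_url(url, timeout)
        except OSError as err:  # TimeoutError is a subclass of OSError
            last_error = err
            if attempt < retries:
                sleep(sec_between)  # no wait after the final attempt
    raise last_error


def worst_case_seconds(
    timeout: float, retries: int, sec_between: float, endpoints: int
) -> float:
    """Upper bound on time spent when every attempt on every endpoint times out."""
    attempts = retries + 1
    return endpoints * (attempts * timeout + retries * sec_between)
```

With a 2 second timeout, three attempts, 1 second waits and two endpoints (v2, then v1), `worst_case_seconds(2.0, 2, 1.0, 2)` gives 16 seconds, which matches the gap between 14:23:54 and 14:24:10 in the log above. Without a timeout there is no such bound, which is why boot blocks indefinitely.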
SUMMARY:
I see two separate issues here:
1. Cloud-init should be able to deal with the lack of network in early boot. This can be fixed on the cloud-init side.
2. Early boot network connectivity works on every series and instance type except Noble using Hardware-assisted (SR-IOV) networking.
I am unsure of the cause of #2.
@bdrung, I realize that initramfs-tools may not be the culprit, but given how Oracle uses iSCSI along with the recent dhcpcd changes, I added it here too.