On first reboot after install, the machine fails to bring up its network interfaces

Bug #2097302 reported by maasuser1
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
|---------|--------|------------|-------------|-----------|
| MAAS | Incomplete | Undecided | Unassigned | |

Bug Description

Describe the bug:

This reproduces reliably on Ubuntu 20.04/22.04/24.04 LTS with MAAS 3.5.3 stable. I don’t know the cause, and it is a horrible experience to sit and watch the IPMI console through several power cycles.

I tried:

- Release the machine.
- Delete the machine from MAAS and re-add it to MAAS.
- Do commissioning before deployment.
- Set DHCP on eno1 or assign static IP address on eno1.

Similar to [this post](https://discourse.maas.io/t/on-first-reboot-after-install-node-fails-to-bring-up-its-network-interface/6307/6)

Steps to reproduce:

Deploy any machine with any Ubuntu release.

Expected behavior (what should have happened?):
The machine is up, continues its preseeding process, and completes the deployment.

Actual behavior (what actually happened?):
The machine is up, but it has lost network access back to MAAS.

MAAS version and installation type (deb, snap):
MAAS 3.5.3 stable from snap.

MAAS setup (HA, single node, multiple regions/racks):
primary + secondary

Host OS distro and version:
Ubuntu 22.04 LTS

Additional context:

Attached is a screen recording of the machine from the IPMI session.

maasuser1 (maasuser1) wrote :

After removing the secondary MAAS controller from the cluster, and rebooting the primary controller, the problem still exists.

Jacopo Rota (r00ta) wrote :

I’d suggest enabling password login and looking into the machine to understand what went wrong.

Changed in maas:
status: New → Incomplete
maasuser1 (maasuser1) wrote :

I got into the machine by resetting the password of the `root` user in Ubuntu recovery mode.

Some findings:

- `/etc/netplan` is an empty folder
- Both `eno1` and `eno2` are in state `DOWN`
- User `ubuntu` doesn't exist in `/etc/shadow`
- `dmesg` seems fine
- `ip link set eno1np0 up` and then `dhcpd eno1np0` can bring up the network, but SSH service fails due to `start request repeated too quickly`
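The first two findings can be collected in one pass with a short script. This is only a minimal sketch: it assumes a Linux host where sysfs exposes `/sys/class/net/<iface>/operstate` and `/sys/class/net/<iface>/address`, and the helper names `link_states` and `netplan_files` are made up for illustration.

```python
from pathlib import Path


def link_states(sys_net: Path = Path("/sys/class/net")) -> dict[str, dict[str, str]]:
    """Return {interface: {"operstate": ..., "address": ...}} read from sysfs."""
    states: dict[str, dict[str, str]] = {}
    if not sys_net.is_dir():
        return states  # not a Linux host, or sysfs is not mounted
    for iface in sorted(sys_net.iterdir()):
        if not iface.is_dir():
            continue  # skip plain files such as bonding_masters
        entry = {}
        for attr in ("operstate", "address"):
            attr_file = iface / attr
            entry[attr] = attr_file.read_text().strip() if attr_file.is_file() else "unknown"
        states[iface.name] = entry
    return states


def netplan_files(netplan_dir: Path = Path("/etc/netplan")) -> list[str]:
    """List netplan YAML files; an empty list matches the empty-folder finding."""
    if not netplan_dir.is_dir():
        return []
    return sorted(p.name for p in netplan_dir.glob("*.yaml"))


if __name__ == "__main__":
    print("netplan files:", netplan_files() or "none")
    for name, info in link_states().items():
        print(f"{name}: {info['operstate']} ({info['address']})")
```

Run from the recovery shell, this prints each interface with its link state and MAC address, which makes it easy to compare the names the kernel actually assigned against the ones configured in MAAS.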

Attached is the log obtained by MAAS, before the machine rebooted during OS installation.

Jacopo Rota (r00ta) wrote :

That's weird. You sure the machine is not booting from another disk where you have another OS installed?

maasuser1 (maasuser1) wrote :

Hello r00ta, yes I'm sure. I installed different Ubuntu releases (20.04/22.04/24.04) and checked the output of the login screen. This issue now affects all machines in our deployment no matter which Ubuntu version is chosen.

Bruno Hildenbrand (bhildenbrand) wrote :

Same here. Can also be reproduced with a LXD VM.

Repro steps:
1. Create empty LXD VM, PXE-boot it with MAAS
2. Commission
3. Add an Ethernet device and associate it with a VLAN (e.g., vlan=2, subnet = 10.0.2.0/24, No DHCP, ip address = 10.0.2.122)
4. Deploy Ubuntu (the logs below are from 22.04)

On 1st boot just after deploy, cloud-init fails with:

Feb 15 02:18:44 ubuntu cloud-init[785]: Cloud-init v. 24.4-0ubuntu1~22.04.1 running 'init-local' at Sat, 15 Feb 2025 02:18:44 +0000. Up 10.28 seconds.
Feb 15 02:18:44 ubuntu cloud-init[785]: 2025-02-15 02:18:44,287 - handlers.py[WARNING]: Failed posting event: {"name": "init-local/check-cache", "description": "attempting to read from cache [trust]", "event_type": "start", "origin": "c>
Feb 15 02:18:44 ubuntu cloud-init[785]: 2025-02-15 02:18:44,302 - handlers.py[WARNING]: Failed posting event: {"name": "init-local/check-cache", "description": "no cache found", "event_type": "finish", "origin": "cloudinit", "timestamp">
Feb 15 02:18:44 ubuntu cloud-init[785]: 2025-02-15 02:18:44,321 - handlers.py[WARNING]: Failed posting event: {"name": "init-local/search-MAASLocal", "description": "searching for local data from DataSourceMAASLocal", "event_type": "sta>
Feb 15 02:18:44 ubuntu cloud-init[785]: 2025-02-15 02:18:44,327 - handlers.py[WARNING]: Failed posting event: {"name": "init-local/search-MAASLocal", "description": "no local data found from DataSourceMAASLocal", "event_type": "finish",>
Feb 15 02:18:44 ubuntu audit[780]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="snap.lxd.daemon" pid=780 comm="apparmor_parser"
Feb 15 02:18:44 ubuntu cloud-init[785]: 2025-02-15 02:18:44,411 - networking.py[WARNING]: Not all expected physical devices present: {'00:16:3e:e0:49:ef'}
Feb 15 02:18:44 ubuntu cloud-init[785]: 2025-02-15 02:18:44,413 - main.py[ERROR]: failed stage init-local
Feb 15 02:18:44 ubuntu cloud-init[785]: Traceback (most recent call last):
Feb 15 02:18:44 ubuntu cloud-init[785]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 509, in main_init
Feb 15 02:18:44 ubuntu cloud-init[785]: init.fetch(existing=existing)
Feb 15 02:18:44 ubuntu cloud-init[785]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 552, in fetch
Feb 15 02:18:44 ubuntu cloud-init[785]: return self._get_data_source(existing=existing)
Feb 15 02:18:44 ubuntu cloud-init[785]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 403, in _get_data_source
Feb 15 02:18:44 ubuntu cloud-init[785]: raise e
Feb 15 02:18:44 ubuntu cloud-init[785]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 390, in _get_data_source
Feb 15 02:18:44 ubuntu cloud-init[785]: ds, dsname = sources.find_source(
Feb 15 02:18:44 ubuntu cloud-init[785]: File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 1072, in find_source
Feb 15 02:18:44 ubuntu cloud-init[785]: raise DataSourceNotFoundException(msg)
Feb 15 02:18:44 ubuntu cloud-init[785]: cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes: (DataSourceMAASLocal)
Feb 15 02:18:44 ubuntu cloud-init[785]: During handling of the above except...
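The key line in this log is the `networking.py` warning: as far as I understand it, cloud-init is reporting MAC addresses that the network configuration refers to but that no device in the booted system has. A minimal sketch for pulling those MACs out of a saved journal follows. The function name `missing_macs` is made up for illustration, and the parsing assumes the message format shown above.

```python
import re

# Marker text copied from the cloud-init warning in the log above.
MARKER = "Not all expected physical devices present:"
MAC_RE = re.compile(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}", re.IGNORECASE)


def missing_macs(journal_text: str) -> set[str]:
    """Collect the MAC addresses cloud-init reported as not present."""
    macs: set[str] = set()
    for line in journal_text.splitlines():
        _, found, tail = line.partition(MARKER)
        if found:
            macs.update(mac.lower() for mac in MAC_RE.findall(tail))
    return macs


if __name__ == "__main__":
    sample = (
        "Feb 15 02:18:44 ubuntu cloud-init[785]: 2025-02-15 02:18:44,411 - "
        "networking.py[WARNING]: Not all expected physical devices present: "
        "{'00:16:3e:e0:49:ef'}"
    )
    print(missing_macs(sample))  # prints {'00:16:3e:e0:49:ef'}
```

Comparing this set with the MAC addresses actually present on the machine shows which configured interface has no matching device.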


Jacopo Rota (r00ta) wrote :

I think that's expected. If the user configures an interface that does not exist, the machine will fail to deploy after a timeout.

Björn Tillenius (bjornt) wrote :

Please attach the output of `maas $profile machine read $system_id`, as well as the curtin installation logs for the machine.
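For anyone reading that output, the relevant part is the interface list. Below is a minimal sketch for summarizing it, assuming the JSON has been saved to a file and that the machine object carries an `interface_set` list whose entries have `name`, `mac_address`, and `links` (each link with a `mode`). Those field names are my assumption about the MAAS API output, so check them against your own dump. The sample data in the usage part is hypothetical.

```python
import json
import sys


def summarize_interfaces(machine: dict) -> list[tuple[str, str, list[str]]]:
    """Return (name, mac, link modes) for each interface MAAS knows about."""
    rows = []
    for iface in machine.get("interface_set", []):
        modes = [link.get("mode", "?") for link in iface.get("links", [])]
        mac = (iface.get("mac_address") or "").lower()
        rows.append((iface.get("name", "?"), mac, modes))
    return rows


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # e.g. maas $profile machine read $system_id > machine.json
        with open(sys.argv[1]) as handle:
            machine = json.load(handle)
    else:
        # Hypothetical sample, only to show the expected shape.
        machine = {
            "interface_set": [
                {"name": "eth1", "mac_address": "00:16:3E:E0:49:EF",
                 "links": [{"mode": "static"}]},
            ]
        }
    for name, mac, modes in summarize_interfaces(machine):
        print(f"{name}\t{mac}\t{','.join(modes) or 'unconfigured'}")
```

Any interface listed here whose MAC does not exist on the deployed machine is a candidate for the failure described above.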

Changed in maas:
status: Incomplete → New
status: New → Incomplete
maasuser1 (maasuser1) wrote :

The issue is still observed on 3.5.4. I couldn't deploy any machine, and sometimes even commissioning fails with an HTTP 500.

I think such an error should be handled gracefully rather than trigger an HTTP 500 error...

```
Mar 10 14:22:58 maas maas-machine[258545]: [server2] cloud-init[1676]: Finished maas-lshw (id: 10900, script_version_id: 48): 0
Mar 10 14:22:58 maas maas-regiond[258342]: maasserver: [error] ################################ Exception: Status for scriptresult 10915 is not running or pending (2) ################################
Mar 10 14:22:58 maas maas-regiond[258342]: maasserver: [error] Traceback (most recent call last):
Mar 10 14:22:58 maas maas-regiond[258342]: File "/snap/maas/38907/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 181, in _get_response
Mar 10 14:22:58 maas maas-regiond[258342]: response = wrapped_callback(request, *callback_args, **callback_kwargs)
Mar 10 14:22:58 maas maas-regiond[258342]: File "/snap/maas/38907/lib/python3.10/site-packages/maasserver/utils/views.py", line 298, in view_atomic_with_post_commit_savepoint
Mar 10 14:22:58 maas maas-regiond[258342]: return view_atomic(*args, **kwargs)
Mar 10 14:22:58 maas maas-regiond[258342]: File "/usr/lib/python3.10/contextlib.py", line 79, in inner
Mar 10 14:22:58 maas maas-regiond[258342]: return func(*args, **kwds)
Mar 10 14:22:58 maas maas-regiond[258342]: File "/snap/maas/38907/lib/python3.10/site-packages/maasserver/api/support.py", line 62, in __call__
Mar 10 14:22:58 maas maas-regiond[258342]: response = super().__call__(request, *args, **kwargs)
Mar 10 14:22:58 maas maas-regiond[258342]: File "/snap/maas/38907/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 20, in inner_func
Mar 10 14:22:58 maas maas-regiond[258342]: response = func(*args, **kwargs)
Mar 10 14:22:58 maas maas-regiond[258342]: File "/snap/maas/38907/usr/lib/python3/dist-packages/piston3/resource.py", line 196, in __call__
Mar 10 14:22:58 maas maas-regiond[258342]: result = self.error_handler(e, request, meth, em_format)
Mar 10 14:22:58 maas maas-regiond[258342]: File "/snap/maas/38907/usr/lib/python3/dist-packages/piston3/resource.py", line 194, in __call__
Mar 10 14:22:58 maas maas-regiond[258342]: result = meth(request, *args, **kwargs)
Mar 10 14:22:58 maas maas-regiond[258342]: File "/snap/maas/38907/lib/python3.10/site-packages/maasserver/api/support.py", line 371, in dispatch
Mar 10 14:22:58 maas maas-regiond[258342]: return function(self, request, *args, **kwargs)
Mar 10 14:22:58 maas maas-regiond[258342]: File "/snap/maas/38907/lib/python3.10/site-packages/metadataserver/api.py", line 880, in signal
Mar 10 14:22:58 maas maas-regiond[258342]: target_status = process(node, request, status)
Mar 10 14:22:58 maas maas-regiond[258342]: File "/snap/maas/38907/lib/python3.10/site-packages/metadataserver/api.py", line 683, in _process_commissioning
Mar 10 14:22:58 maas maas-regiond[258342]: self._store_results(
Mar 10 14:22:58 maas maas-regiond[258342]: File "/snap/maas/38907/lib/python3.10/site-packages/metadataserver/api.py", line 565, in _store_results
Mar 10 14:22:58 maas maas-regiond...
```

maasuser1 (maasuser1) wrote :

I think it might have some connection with [this](https://github.com/canonical/cloud-init/issues/6065).
@r00ta Do you have any thoughts? Thanks in advance

Jacopo Rota (r00ta) wrote :

Yes, I think that's the case
