Provisioning LXD vmhost fails in 3.3-rc3

Bug #2003925 reported by John Terrell
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MAAS
Triaged
Medium
Unassigned
3.3
Won't Fix
Medium
Unassigned
3.4
Won't Fix
Medium
Unassigned

Bug Description

LXD vmhost provisioning is still failing in 3.3.0~rc3-13147-g.6fcfdaf5e. There's another bug (https://bugs.launchpad.net/maas/+bug/1988759) which claims this was fixed for 3.2 but it is still failing.

Details: When provisioning, the provision actually succeeds (the machine is alive and responsive) but is stuck in the "rebooting" phase in Maas. It stays in this phase until it times out and fails. The installation log indicates no errors (ends with "curtin: Installation finished").

Attached is the installation log.

-John

Revision history for this message
John Terrell (coolpeople) wrote :
Revision history for this message
John Terrell (coolpeople) wrote :

Output yaml.

Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

Can you login in the machine after deployment completes? if so, please share with us the contents of `/var/log/cloud-init-output.log` and `/var/log/cloud-init.log`

Changed in maas:
status: New → Incomplete
Revision history for this message
John Terrell (coolpeople) wrote :

Logs enclosed.

Revision history for this message
John Terrell (coolpeople) wrote :

BTW, this is now on RC4.

Revision history for this message
John Terrell (coolpeople) wrote :

Looking at the logs, a few things stand out.

I see this repeated:

2023-01-26 17:05:29,179 - handlers.py[WARNING]: Failed posting event: {"name": "init-local", "description": "searching for local datasources", "event_type": "finish", "origin": "cloudinit", "timestamp": 1674752729.1758957, "result": "SUCCESS"}. This was caused by: HTTPConnectionPool(host='10.10.200.5', port=5248): Max retries exceeded with url: /MAAS/metadata/status/m7d7he (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f64797fb2e0>:
2023-01-26 17:05:29,180 - handlers.py[WARNING]: Multiple consecutive failures in WebHookHandler. Cancelling all queued events.

Why can't the machine contact the controller?

Revision history for this message
Alvin Cura (alvinc) wrote :

I see similar. It's due to Open vSwitch.

2023-01-20 06:34:52,106 - handlers.py[WARNING]: Failed posting event: {"name": "init-local/check-cache", "description": "attempting to read from cache [trust]", "event_type": "start", "origin": "cloudinit", "timestamp": 1674196492.0927331}. This was caused by: HTTPConnectionPool(host='10.100.127.223', port=5248): Max retries exceeded with url: /MAAS/metadata/status/yxrest (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e612f4790>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
2023-01-20 06:34:52,107 - handlers.py[WARNING]: Failed posting event: {"name": "init-local/check-cache", "description": "no cache found", "event_type": "finish", "origin": "cloudinit", "timestamp": 1674196492.0928519, "result": "SUCCESS"}. This was caused by: HTTPConnectionPool(host='10.100.127.223', port=5248): Max retries exceeded with url: /MAAS/metadata/status/yxrest (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e613141c0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
2023-01-20 06:34:52,107 - __init__.py[INFO]: Open vSwitch is not yet up; no interfaces will be detected as OVS-internal

Will try again without Open vSwitch to rule it out as cause

Revision history for this message
Alvin Cura (alvinc) wrote :

Here on a different host without Open vSwitch. I'm guessing a race condition, because there is no reason for connection problem otherwise.

023-01-24 20:50:42,983 - handlers.py[DEBUG]: Queuing POST to http://10.100.127.223:5248/MAAS/metadata/status/6th3hw, data: {'name': 'init-local', 'description': 'searching for loc
al datasources', 'event_type': 'finish', 'origin': 'cloudinit', 'timestamp': 1674593442.9835653, 'result': 'SUCCESS'}
2023-01-24 20:50:42,983 - handlers.py[DEBUG]: WebHookHandler flushing remaining events
2023-01-24 20:50:42,985 - handlers.py[WARNING]: Failed posting event: {"name": "init-local", "description": "searching for local datasources", "event_type": "finish", "origin": "cl
oudinit", "timestamp": 1674593442.9835653, "result": "SUCCESS"}. This was caused by: HTTPConnectionPool(host='10.100.127.223', port=5248): Max retries exceeded with url: /MAAS/meta
data/status/6th3hw (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fea47ffe160>: Failed to establish a new connection: [Errno 101] Network is unreach
able'))
2023-01-24 20:50:42,985 - handlers.py[WARNING]: Multiple consecutive failures in WebHookHandler. Cancelling all queued events.
2023-01-24 20:50:48,208 - util.py[DEBUG]: Cloud-init v. 22.4.2-0ubuntu0~20.04.2 running 'init' at Tue, 24 Jan 2023 20:50:48 +0000. Up 17.87 seconds.
2023-01-24 20:50:48,208 - main.py[DEBUG]: No kernel command line url found.
2023-01-24 20:50:48,208 - main.py[DEBUG]: Closing stdin.

Now, a few lines down the logfile, it does connect:
2023-01-24 20:50:48,218 - handlers.py[DEBUG]: start: init-network/check-cache: attempting to read from cache [trust]
2023-01-24 20:50:48,218 - handlers.py[DEBUG]: Queuing POST to http://10.100.127.223:5248/MAAS/metadata/status/6th3hw, data: {'name': 'init-network/check-cache', 'description': 'attempting to read from cache [trust]', 'event_type': 'start', 'origin': 'cloudinit', 'timestamp': 1674593448.2180347}
2023-01-24 20:50:48,218 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2023-01-24 20:50:48,218 - stages.py[DEBUG]: no cache found
2023-01-24 20:50:48,218 - handlers.py[DEBUG]: finish: init-network/check-cache: SUCCESS: no cache found
2023-01-24 20:50:48,218 - handlers.py[DEBUG]: Queuing POST to http://10.100.127.223:5248/MAAS/metadata/status/6th3hw, data: {'name': 'init-network/check-cache', 'description': 'no cache found', 'event_type': 'finish', 'origin': 'cloudinit', 'timestamp': 1674593448.2188902, 'result': 'SUCCESS'}
2023-01-24 20:50:48,219 - util.py[DEBUG]: Attempting to remove /var/lib/cloud/instance

Revision history for this message
John Terrell (coolpeople) wrote :

I don't have OpenVSwitch in my environment unless MAAS is installing it.

Revision history for this message
John Terrell (coolpeople) wrote :

Any further actions here?

Revision history for this message
Alvin Cura (alvinc) wrote :

I also have had no further success. Even without Open vSwitch, no success.

As of this writing, I am having to `lxd init` from the failed deployed host to try to connect them.

Changed in maas:
status: Incomplete → Triaged
Revision history for this message
Igor Brovtsin (igor-brovtsin) wrote :

2023-02-16 12:02:37 metadataserver.api_twisted: [info] Trying to create vmhost for deployment for node fptpn3 (chief-orca)
2023-02-16 12:02:40 metadataserver.api_twisted: [critical] Failed to process status message instantly.
        Traceback (most recent call last):
          <...twisted and orm...>
          File "/snap/maas/x1/lib/python3.10/site-packages/metadataserver/api_twisted.py", line 505, in _processMessage
            _create_vmhost_for_deployment(node)
          File "/snap/maas/x1/lib/python3.10/site-packages/metadataserver/api_twisted.py", line 265, in _create_vmhost_for_deployment
            discover_and_sync_vmhost(pod, node.owner)
          File "/snap/maas/x1/lib/python3.10/site-packages/maasserver/vmhost.py", line 88, in discover_and_sync_vmhost
            raise PodProblem(str(error)) from error
        maasserver.exceptions.PodProblem: Failed talking to pod: Pod 10: Failed to connect to the LXD REST API: HTTPSConnectionPool(host='172.16.1.4', port=8443): Max retries exceeded with url: /1.0?project=maas (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f52921f9db0>: Failed to establish a new connection: [Errno 113] No route to host'))

Meanwhile, `172.16.1.48` is reported in MAAS UI for `chief-orca`, while `172.16.1.4` was the machine's IP during commissioning.

Revision history for this message
John Terrell (coolpeople) wrote :

FYI: this still repros in the latest 3.3 builds.

Changed in maas:
importance: Undecided → Medium
milestone: none → 3.4.0
Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.x
Changed in maas:
milestone: 3.4.x → 3.5.x
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.