Image download on MAAS 3.6.1 is slow

Bug #2121474 reported by Mathieu Marchand
Affects   Status          Importance   Assigned to    Milestone
MAAS      Fix Committed   High         Jacopo Rota
3.6       Won't Fix       High         Unassigned
3.7       Fix Released    High         Jacopo Rota

Bug Description

Describe the bug:
On a freshly installed MAAS 3.6.1 on Ubuntu 24.04 LTS, downloading an image is extremely slow (it did not get past 1% in 30 minutes), while downloading the same image with curl takes under 1 s.

Steps to reproduce:
- Fresh install of 3.6.1 HA on 3 nodes
- Select a new image to download.
- Wait, and after 30 minutes, progress is still at 1%
- journalctl shows the following:

Aug 26 17:31:21 maas-013 maas-regiond[3367]: temporalio.activity: [warn] Completing activity as failed ({'activity_id': '1', 'activity_type': 'download-bootresourcefile', 'attempt': 8, 'namespace': 'default', 'task_queue': 'region-internal', 'workflow_id': 'download-bootresource:upstream:f82754bc29f7', 'workflow_run_id': '3cd5524e-8fe1-4cc9-a7c2-915cf2266389', 'workflow_type': 'download-bootresource'})
Aug 26 17:31:21 ca3-maas-013 maas-regiond[3367]: Traceback (most recent call last):
Aug 26 17:31:21 ca3-maas-013 maas-regiond[3367]: File "/snap/maas/40143/usr/lib/python3/dist-packages/temporalio/worker/_activity.py", line 453, in _run_activity
Aug 26 17:31:21 maas-013 maas-regiond[3367]: result = await impl.execute_activity(input)
Aug 26 17:31:21 maas-013 maas-regiond[3367]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Aug 26 17:31:21 maas-013 maas-regiond[3367]: File "/snap/maas/40143/usr/lib/python3/dist-packages/temporalio/worker/_activity.py", line 711, in execute_activity
Aug 26 17:31:21 maas-013 maas-regiond[3367]: return await input.fn(*input.args)
Aug 26 17:31:21 maas-013 maas-regiond[3367]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Aug 26 17:31:21 maas-013 maas-regiond[3367]: File "/snap/maas/40143/lib/python3.12/site-packages/maasserver/workflow/bootresource.py", line 261, in download_bootresourcefile
Aug 26 17:31:21 maas-013 maas-regiond[3367]: raise ApplicationError(
Aug 26 17:31:21 maas-013 maas-regiond[3367]: temporalio.exceptions.ApplicationError: ClientPayloadError: Response payload is not completed
Aug 26 17:31:22 maas-013 maas-regiond[6796]: twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.2.3.4', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.2.43.4', port=58632, flowInfo=0, scopeID=0))
Aug 26 17:31:22 maas-013 maas-rackd[3551]: Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.2.3.4', port=58632, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.2.3.4, port=5253, flowInfo=0, scopeID=0))
Aug 26 17:31:22 maas-013 maas-regiond[6796]: twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.2.3.4', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.2.3.4', port=58644, flowInfo=0, scopeID=0))
Aug 26 17:31:22 maas-013 maas-rackd[3551]: Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.2.3.4', port=58644, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.2.3.4', port=5253, flowInfo=0, scopeID=0))

Expected behavior (what should have happened?):

Images are downloaded within a few minutes.

Actual behavior (what actually happened?):

Image download is stuck and does not progress.

MAAS version and installation type (deb, snap):

3.6.1 snap

MAAS setup (HA, single node, multiple regions/racks):

HA, 3 nodes.

Host OS distro and version:

Ubuntu 24.04.2 LTS

Additional context:

Running inside VMware VMs.

Alexsander de Souza (alexsander-souza) wrote :

The error `ClientPayloadError: Response payload is not completed` is raised by the aiohttp framework due to some internal race condition. This framework is used by MAAS as an async HTTP client, and this exception is observed only when doing chunked HTTP transfers like the Image Sync does.

There was an upstream bug (https://github.com/aio-libs/aiohttp/issues/4581) to investigate this, and some attempts to fix it were made. It was closed in August 2024 after the maintainer could no longer reproduce the issue. Although there are still reports of this error on newer releases, the issue stays closed for lack of a better reproducer.
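To make the symptom concrete, here is a minimal standard-library sketch of what an incomplete chunked payload looks like on the wire. It does not reproduce the aiohttp race condition; it only simulates a server that closes the connection before sending the terminating zero-length chunk, and uses `http.client` (not aiohttp) as the client, where the analogous error is `IncompleteRead`. The names `serve_once`, `fetch`, and `demo` are made up for this illustration.

```python
import http.client
import socket
import threading

# One complete 5-byte chunk, then the connection is closed without the
# terminating zero-length chunk, so the body is deliberately incomplete.
TRUNCATED_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"5\r\nhello\r\n"
)


def serve_once(listener: socket.socket) -> None:
    """Accept one connection, send the truncated response, then close."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(65536)  # read and ignore the request
        conn.sendall(TRUNCATED_RESPONSE)


def fetch(port: int) -> str:
    """Return 'complete' or 'incomplete' for a GET against the local server."""
    client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        client.request("GET", "/image")
        response = client.getresponse()
        response.read()  # raises IncompleteRead when the chunked body is cut short
        return "complete"
    except http.client.IncompleteRead:
        return "incomplete"
    finally:
        client.close()


def demo() -> str:
    """Start the one-shot server on a free local port and fetch from it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_once, args=(listener,), daemon=True)
    server.start()
    try:
        return fetch(port)
    finally:
        server.join(timeout=5)
        listener.close()
```

A client that treats this condition as a hard failure and restarts the transfer from scratch would show the pattern in the logs above: repeated activity attempts with little visible progress.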

MAAS options:

1) Backport the python-aiohttp package from Questing (v3.11.16).
Noble has version 3.9.1 and the bug was closed at version 3.10.5, so this package should contain all of the upstream maintainer's attempts to fix this issue.

2) Drop aiohttp and use httpx instead.
httpx is used by the API v3 tests, so it's already a dependency of MAAS. Async HTTP clients are used in a few places in MAAS, and in most cases replacing the framework is trivial. The image download workflow is the most advanced use case, but site-manager already implements it with httpx, so there's a reference implementation we can use.

tags: added: bug-council
Thorsten Merten (thorsten-merten) wrote :

FYI: here is a nice comparison of the clients https://oxylabs.io/blog/httpx-vs-requests-vs-aiohttp

Their TL;DR is that httpx can be considered more of a superset and should serve more use cases (at least as a client), while aiohttp is trying to make some things simpler from a developer's perspective.

So transitioning to the httpx client would only take work and should not give us blockers.

However, it is potentially harder to backport: it means either a lot of cherry-picking, or different solutions for 3.5-3.7 than for current master, and potentially both clients as dependencies in the backports.

Jerzy Husakowski (jhusakowski) wrote :

Without a reproducer for the issue, it will be impossible to determine whether the problem is actually solved, so we should consider creating a reproducer first to evaluate both the new versions of aiohttp and httpx for the likelihood of running into the same issue.

If httpx is free from the issue, we can consider switching over to it, which will reduce the number of dependencies in MAAS as an added benefit.
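One possible shape for such an evaluation, sketched with hypothetical names: a small harness that runs any async download callable many times and tallies how often it succeeds or fails, by exception type. An aiohttp-based and an httpx-based downloader could each be wrapped as the callable and compared under the same conditions. `FlakyDownload` below is only a stand-in that fails on every third call; it is not a real reproducer.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable


async def count_outcomes(
    download: Callable[[], Awaitable[int]],
    attempts: int,
) -> Counter:
    """Run `download` `attempts` times; tally 'ok' and exception class names."""
    outcomes: Counter = Counter()
    for _ in range(attempts):
        try:
            await download()
            outcomes["ok"] += 1
        except Exception as exc:  # record every failure type, keep the run going
            outcomes[type(exc).__name__] += 1
    return outcomes


class FlakyDownload:
    """Stand-in downloader for demonstration: fails on every third call."""

    def __init__(self) -> None:
        self.calls = 0

    async def __call__(self) -> int:
        self.calls += 1
        if self.calls % 3 == 0:
            raise ConnectionError("payload not completed (simulated)")
        return 1024  # pretend number of bytes downloaded
```

For example, `asyncio.run(count_outcomes(FlakyDownload(), 9))` tallies six successes and three `ConnectionError` failures. A real comparison would need a server or network setup that actually triggers the incomplete-payload condition, which is the part still missing.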

Changed in maas:
importance: Undecided → High
milestone: none → 3.7.x
status: New → Triaged
Jacopo Rota (r00ta)
Changed in maas:
assignee: nobody → Jacopo Rota (r00ta)
Changed in maas:
status: Triaged → Fix Committed
Changed in maas:
milestone: 3.7.x → 3.8.x
Jacopo Rota (r00ta) wrote :

Setting to Won't Fix in 3.6 unless it turns out this is a very hot bug that needs fixing there. We do not want to introduce additional dependencies in point releases.

tags: removed: bug-council
Mostafa Abdelwahab (mostafaabdelwahab) wrote :

Hi, I found this bug by searching for the journalctl logs.

I am doing a deployment where MAAS is importing the images from an offline on-prem mirror. The import of an 800 MB image (noble amd64) takes 90 minutes. Off the top of your head, would this slowness be attributable to this bug, or is that too long for an offline mirror and something else must be wrong?

(snap MAAS 3.6.2 in HA on 3 libvirt jammy VMs on the same bare-metal node.
The offline mirror is on another jammy libvirt VM on the same subnet and the same bare-metal node.)

Update: I contacted the MAAS team directly, and they answered that this bug is the likely reason for the slowness, even for the offline mirror.
