Conflict error during w3 request

Bug #2034014 reported by Marian Gasparovic
This bug affects 1 person
Affects     Status         Importance   Assigned to    Milestone
MAAS        Fix Released   High         Jacopo Rota
MAAS 3.2    Fix Released   High         Jacopo Rota
MAAS 3.3    Fix Released   High         Jacopo Rota
MAAS 3.4    Fix Released   High         Jacopo Rota
MAAS 3.5    Fix Released   High         Jacopo Rota

Bug Description

Since Aug 31 we have been receiving this error when we deploy machines with the Terraform MAAS provider:

Error: ServerError: 409 Conflict (<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Error: Conflict Error</title>
  </head>
  <body>
    <h2 class="p-heading--four">
      Conflict error. Try your request again, as it will most likely succeed.
    </h2>
  </body>
</html>
)

It happens only in our long-running MAAS, which serves six "silos". We have another two silos where we deploy a fresh MAAS every time; we did not encounter this error there.
Both the long-running and the freshly deployed MAAS installations are 3.2.9.
This seems to be an issue that appears after MAAS has been up for some time. We did not encounter it before Aug 31, but have hit it many times since. I will update with the exact number after I triage all the runs.
A typical example is here:
https://oil-jenkins.canonical.com/artifacts/d356f7bc-91ad-4a86-a5e8-30d16af4f379/index.html
I will upload MAAS logs later.

I was observing one of the runs when it was happening. It was supposed to deploy six machines, using a tag as the constraint. I could see six machines with the tag available, but during the TF run it started to deploy only five of them; it did not touch the sixth.

We hit this bug 34 times in 81 test runs over four days.

Tags: cdo-qa

Related branches

description: updated
Revision history for this message
Jacopo Rota (r00ta) wrote :

Hi Marian,

thanks for the report. Could you also attach the logs or an sos report?

Revision history for this message
Marian Gasparovic (marosg) wrote :

2023-09-04 10:13:10 maasserver.utils.views: [error] Attempt #5 for /MAAS/api/2.0/machines/ failed; giving up (91.8s elapsed in total)
Traceback (most recent call last):
  File "/snap/maas/29774/usr/lib/python3/dist-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "maasserver_filesystem_partition_id_acquired_5fe51ba7_uniq"
DETAIL: Key (partition_id, acquired)=(140, t) already exists.

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/snap/maas/29774/lib/python3.8/site-packages/maasserver/utils/views.py", line 245, in handle_uncaught_exception
    return self.process_exception_by_middleware(exc, request)
  File "/snap/maas/29774/lib/python3.8/site-packages/maasserver/utils/views.py", line 243, in handle_uncaught_exception
    raise exc from exc.__cause__
  File "/snap/maas/29774/lib/python3.8/site-packages/maasserver/utils/views.py", line 309, in get_response
    response = django_get_response(request)
  File "/snap/maas/29774/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 75, in get_response
    response = self._middleware_chain(request)
  File "/snap/maas/29774/lib/python3.8/site-packages/maasserver/prometheus/middleware.py", line 52, in __call__
    response = self.get_response(request)
  File "/snap/maas/29774/lib/python3.8/site-packages/maasserver/middleware.py", line 331, in __call__
    response = self.get_response(request)
  File "/snap/maas/29774/usr/lib/python3/dist-packages/django/utils/deprecation.py", line 94, in __call__
    response = response or self.get_response(request)
  File "/snap/maas/29774/usr/lib/python3/dist-packages/django/utils/deprecation.py", line 94, in __call__
    response = response or self.get_response(request)
  File "/snap/maas/29774/lib/python3.8/site-packages/maasserver/middleware.py", line 172, in __call__
    return self.get_response(request)
  File "/snap/maas/29774/lib/python3.8/site-packages/maasserver/middleware.py", line 517, in __call__
    result = self.get_response(request)
  File "/snap/maas/29774/lib/python3.8/site-packages/maasserver/middleware.py", line 363, in __call__
    return self.get_response(request)
  File "/snap/maas/29774/lib/python3.8/site-packages/maasserver/middleware.py", line 363, in __call__
    return self.get_response(request)
  File "/snap/maas/29774/lib/python3.8/site-packages/maasserver/middleware.py", line 444, in __call__
    return self.get_response(request)
  File "/snap/maas/29774/lib/python3.8/site-packages/maasserver/middleware.py", line 501, in __call__
    return self.get_response(request)
  File "/snap/maas/29774/usr/lib/python3/dist-packages/django/utils/deprecation.py", line 94, in __call__
    response = response or self.get_response(request)
  File "/snap/maas/29774/usr/lib/python3/dist-packages/django/utils/deprecation.py", line 94, in __call__
    response = response or self.get_response(request)
  File "/snap/maas/29774/lib/python3.8/site-packages/maasserver/middleware.py", line 103, in __call__
    return self.get_response(request)
  File "...


Jacopo Rota (r00ta)
Changed in maas:
assignee: nobody → Jacopo Rota (r00ta)
Revision history for this message
Jacopo Rota (r00ta) wrote :

I've managed to reproduce the issue locally. In order to reproduce it, you need to:
1) set up a VM with 4 cores and 32G of RAM (or similar, just ensure the RAM is enough)
2) restore the DB taken from the reporter's environment
3) install MAAS 3.2.9 (or a newer version)
4) hammer MAAS with a script like the one below for at least 1 minute (in order to overload MAAS) and keep it running

```
import asyncio
import aiohttp
import time
import uuid
import random
import numpy as np
import json

base_url = 'http://10.0.2.46:5240' # Replace with your MAAS URL

async def make_http_request(session, http_method, url):
    headers = {"Authorization": "OAuth oauth_version=1.0, oauth_signature_method=PLAINTEXT, oauth_consumer_key=<REDACTED>, oauth_token=<REDACTED>, oauth_signature=&<REDACTED>, oauth_nonce=" + str(uuid.uuid4()) + ", oauth_timestamp=" + str(int(time.time()))}
    full_url = f"{base_url}{url}"
    print(full_url)
    try:
        body = {}
        if http_method == "POST":
            body = {'mode': 'AUTO', 'subnet': 2}
        async with session.request(http_method, full_url, data=json.dumps(body), headers=headers) as response:
            while True:
                chunk = await response.content.read(512) # Adjust the chunk size as needed
                if not chunk:
                    break
                await asyncio.sleep(0.5) # just slow down the download speed
            print("OK")
            return full_url, response.status
    except Exception as e:
        print("EXCEPTION")
        print(str(e))
        return full_url, str(e)

# Some of the HTTP calls from the reporter's environment
urls = [
("GET", "/MAAS/api/2.0/machines/?agent_name=9a7502fa-3149-4d63-8b92-aec26ae8d7eb&id=hydcaa"), # This is the most problematic one, as usual.
("GET", "/MAAS/api/2.0/version/"),
("GET", "/MAAS/api/2.0/users/?op=whoami"),
("GET", "/MAAS/api/2.0/zones/"),
("GET", "/MAAS/api/2.0/spaces/"),
("GET", "/MAAS/api/2.0/static-routes/"),
("GET", "/MAAS/rpc/"),
("GET", "/MAAS/api/2.0/devices/?hostname=juju-ad2531-5-lxd-8"),
("POST", "/MAAS/api/2.0/nodes/mdyndn/interfaces/519149/?op=link_subnet")
]

async def main():
    async with aiohttp.ClientSession() as session:
        while True:
            # pick a request to shoot according to the probabilities calculated on the reporter's logs
            bullet = np.random.choice(range(len(urls)), p=[
                0.207619047619047616,
                0.2161904761904762,
                0.14761904761904763,
                0.02857142857142857,
                0.0761904761904762,
                0.03333333333333333,
                0.047619047619047616,
                0.09047619047619047,
                0.1523809523809524
            ])
            method, url = urls[bullet]
            # sleep a bit between every request.
            # the less we sleep, the more you'll hit the issue.
            await asyncio.sleep(0.07)
            asyncio.ensure_future(make_http_request(session, method, url))

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
```
5) run the following bash script
```
#!/bin/bash

for i in {1..6}; do
  curl --header "Authorization: OAuth...
```
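The bash script in step 5 is truncated above. As an illustration only (this is not the original script), the following Python sketch does the equivalent thing: it fires several allocate calls against the machines endpoint concurrently, which is the request that ends up as a 409 once MAAS's internal retries give up. The URL, OAuth header values and tag name are placeholders, not values from the reporter's environment.

```
# Illustration only, not the original step-5 script: fire several concurrent
# "allocate" calls. BASE_URL, the OAuth values and the tag are placeholders.
import concurrent.futures
import time
import uuid

import requests

BASE_URL = "http://10.0.2.46:5240"  # replace with your MAAS URL

def oauth_header():
    # PLAINTEXT OAuth header, same shape as in the script above (redacted).
    return (
        "OAuth oauth_version=1.0, oauth_signature_method=PLAINTEXT, "
        "oauth_consumer_key=<REDACTED>, oauth_token=<REDACTED>, "
        "oauth_signature=&<REDACTED>, "
        f"oauth_nonce={uuid.uuid4()}, oauth_timestamp={int(time.time())}"
    )

def allocate(i):
    # POST /MAAS/api/2.0/machines/?op=allocate is the call that comes back
    # as "409 Conflict" when the server-side retries give up after ~90 s.
    resp = requests.post(
        f"{BASE_URL}/MAAS/api/2.0/machines/?op=allocate",
        headers={"Authorization": oauth_header()},
        data={"tags": "my-tag"},  # hypothetical tag constraint
    )
    return i, resp.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=6) as pool:
    for i, status in pool.map(allocate, range(6)):
        print(f"allocate #{i}: HTTP {status}")
```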

Revision history for this message
Jacopo Rota (r00ta) wrote :

I've managed to debug the issue and I can confirm that it takes ages for the transaction to commit: this is why the lock is not released and the other "allocate" calls are stuck waiting at the lock entry.

2023-09-05 14:23:59 maasserver.api.machines: [info] TRYING TO GET LOCK
2023-09-05 14:23:59 maasserver.api.machines: [info] LOCK ACQUIRED
2023-09-05 14:23:59 maasserver.api.machines: [info] TRYING TO GET LOCK
2023-09-05 14:23:59 maasserver.api.machines: [info] TRYING TO GET LOCK
2023-09-05 14:23:59 maasserver.api.machines: [info] TRYING TO GET LOCK
2023-09-05 14:23:59 maasserver.utils.dblocks: [info] MAAS BUSINESS DONE, TRANSACTION TO BE COMMITTED
... SOME HTTP CALLS IN THE MEANWHILE
2023-09-05 14:24:22 maasserver.utils.views: [info] TRANSACTION COMMITTED
2023-09-05 14:24:22 maasserver.api.machines: [info] LOCK ACQUIRED
....

As you can see, it takes more than 20 seconds to commit the transaction for the first request. All the retries hit the same issue, and only one at a time commits its transaction, very slowly. At the end of the timeout (90 seconds) all the remaining requests get a 409.

Unfortunately there is not much that we can do about this bug right now: the performance issue will be addressed properly in the next roadmap cycles. Once the performance improvements land, your current machine should be enough to handle this load. Until then, I think the only way is to upgrade to a more powerful machine and/or add a region controller and set up HAProxy.
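
To illustrate the mechanism (a simplified sketch, not the actual maasserver code): if the allocate path takes a transaction-scoped PostgreSQL advisory lock, that lock is only released when the transaction commits, so a slow commit keeps every other allocate request parked at the lock and eats into its 90-second retry budget.

```
# Simplified sketch of the behaviour described above, NOT the actual MAAS
# code. It assumes a transaction-scoped advisory lock (similar in spirit to
# maasserver.utils.dblocks); the DSN and lock id are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=maasdb")  # hypothetical connection string
with conn:  # opens a transaction; commits when the block exits
    with conn.cursor() as cur:
        # Every concurrent "allocate" blocks here ("TRYING TO GET LOCK")
        # until the current holder's transaction finishes.
        cur.execute("SELECT pg_advisory_xact_lock(%s)", (12345,))
        # ... allocation work happens here ("MAAS BUSINESS DONE") ...
# The lock is released only at COMMIT, so when the commit itself takes 20+
# seconds (as in the log above), the waiters burn through the 90-second
# retry window and the leftover requests are answered with 409.
```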

Revision history for this message
Jacopo Rota (r00ta) wrote :

For example, I just tried with a VM with 6 cores and 6 MAAS workers, and it is able to handle the load produced by the script above and allocate the 6 machines.

Revision history for this message
Jacopo Rota (r00ta) wrote :

So, rethinking this bug, I think there are two main suboptimal design points:
1) MAAS retries the request internally for up to 90 seconds -> this makes the original request **synchronous** for the client, which is wrong.
2) the client does not retry the request.

Right now MAAS includes a `Retry-After` header only if it would return 503 (and our gomaasclient is looking for it to retry the requests).

My proposal is that MAAS should also include a `Retry-After` header on 409 responses, and gomaasclient should handle it and retry the request properly.
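
On the client side, the retry could look roughly like this (a generic sketch in Python, not the actual gomaasclient change; the helper name is made up):

```
# Generic sketch of the proposed client-side behaviour (not the actual
# gomaasclient change): retry 409 responses, honouring Retry-After when
# the server provides it. The function name is hypothetical.
import time

import requests

def request_with_retry(method, url, max_attempts=5, **kwargs):
    for attempt in range(1, max_attempts + 1):
        resp = requests.request(method, url, **kwargs)
        if resp.status_code not in (409, 503) or attempt == max_attempts:
            return resp
        # Prefer the server's hint; fall back to a small exponential backoff.
        retry_after = resp.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else min(2 ** attempt, 30)
        time.sleep(delay)
    return resp
```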

Revision history for this message
Jacopo Rota (r00ta) wrote :

The fixes for Terraform and Juju still need to land. I'll keep you posted here when it's done.

Revision history for this message
Jacopo Rota (r00ta) wrote :

Fix committed to Terraform and Juju.

Changed in maas:
milestone: 3.5.0 → 3.5.0-beta1
Changed in maas:
status: Fix Committed → Fix Released