Intermittent DB failure when creating VM pods via POST /MAAS/api/2.0/machines/?op=allocate

Bug #1843493 reported by Pedro Guimarães
This bug affects 1 person
Affects: MAAS
Status: Expired
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

MAAS: 2.6.0 (7802-g59416a869-0ubuntu1~18.04.1)

Whenever I boot multiple VM pods at the same time using the "allocate" operation, some VMs fail, and Juju reports the following status:
3 down pending bionic suitable availability zone for machine 3 not found

Looking at the MAAS source code, I can see that the decision on which pod receives the corresponding VM is made in this method:
https://github.com/maas/maas/blob/5fe288985249afedadf4656b595238856b13ce4d/src/maasserver/forms/pods.py#L726

I added more logging to the "except" statement and found that an OperationalError is being raised. I believe it corresponds to contention on a database resource, since the error message is "could not serialize access due to concurrent update", which is also documented by Postgres: https://www.postgresql.org/docs/9.1/transaction-iso.html

Full traceback of OperationalError:

Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/django/db/backends/utils.py", line 64, in execute
     return self.cursor.execute(sql, params)
 psycopg2.extensions.TransactionRollbackError: could not serialize access due to concurrent update

 The above exception was the direct cause of the following exception:

 Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/maasserver/forms/pods.py", line 745, in compose
     creation_type=NODE_CREATION_TYPE.DYNAMIC)
   File "/usr/lib/python3/dist-packages/maasserver/forms/pods.py", line 664, in compose
     return create_and_sync((requested_machine, result))
   File "/usr/lib/python3/dist-packages/maasserver/forms/pods.py", line 606, in create_and_sync
     self.pod.sync_hints(pod_hints)
   File "/usr/lib/python3/dist-packages/maasserver/models/bmc.py", line 622, in sync_hints
     hints.save()
   File "/usr/lib/python3/dist-packages/maasserver/models/cleansave.py", line 252, in save
     return super(CleanSave, self).save(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/django/db/models/base.py", line 808, in save
     force_update=force_update, update_fields=update_fields)
   File "/usr/lib/python3/dist-packages/django/db/models/base.py", line 838, in save_base
     updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
   File "/usr/lib/python3/dist-packages/maasserver/models/cleansave.py", line 278, in _save_table
     update_fields=update_fields)
   File "/usr/lib/python3/dist-packages/django/db/models/base.py", line 905, in _save_table
     forced_update)
   File "/usr/lib/python3/dist-packages/django/db/models/base.py", line 955, in _do_update
     return filtered._update(values) > 0
   File "/usr/lib/python3/dist-packages/django/db/models/query.py", line 664, in _update
     return query.get_compiler(self.db).execute_sql(CURSOR)
   File "/usr/lib/python3/dist-packages/django/db/models/sql/compiler.py", line 1204, in execute_sql
     cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
   File "/usr/lib/python3/dist-packages/django/db/models/sql/compiler.py", line 899, in execute_sql
     raise original_exception
   File "/usr/lib/python3/dist-packages/django/db/models/sql/compiler.py", line 889, in execute_sql
     cursor.execute(sql, params)
   File "/usr/lib/python3/dist-packages/maasserver/prometheus/middleware.py", line 21, in execute
     return super().execute(sql, params=params)
   File "/usr/lib/python3/dist-packages/django/db/backends/utils.py", line 64, in execute
     return self.cursor.execute(sql, params)
   File "/usr/lib/python3/dist-packages/django/db/utils.py", line 94, in __exit__
     six.reraise(dj_exc_type, dj_exc_value, traceback)
   File "/usr/lib/python3/dist-packages/django/utils/six.py", line 685, in reraise
     raise value.with_traceback(tb)
   File "/usr/lib/python3/dist-packages/django/db/backends/utils.py", line 64, in execute
     return self.cursor.execute(sql, params)
 django.db.utils.OperationalError: could not serialize access due to concurrent update
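
The chained traceback above suggests one way to tell this failure apart from other OperationalErrors. PostgreSQL reports "could not serialize access" with SQLSTATE 40001 (serialization_failure), psycopg2 exposes the SQLSTATE as `pgcode` on the exceptions it raises, and Django attaches the driver exception as `__cause__`. The following is a minimal sketch under those assumptions: `is_serialization_failure` is a hypothetical helper (not part of MAAS or Django), and the Fake* classes only stand in for the real exceptions so the snippet runs without a database.

```python
SERIALIZATION_FAILURE = "40001"  # PostgreSQL SQLSTATE for serialization_failure


def is_serialization_failure(exc):
    """Return True if exc, or anything in its __cause__ chain, has SQLSTATE 40001.

    Hypothetical helper: relies on psycopg2 setting `pgcode` on its
    exceptions and on Django chaining the driver error via `__cause__`.
    """
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        if getattr(exc, "pgcode", None) == SERIALIZATION_FAILURE:
            return True
        exc = exc.__cause__
    return False


# Stand-ins for psycopg2.extensions.TransactionRollbackError and
# django.db.utils.OperationalError, used only to exercise the helper.
class FakeDriverError(Exception):
    pgcode = SERIALIZATION_FAILURE


class FakeOperationalError(Exception):
    pass


def make_chained_error():
    """Build a wrapped error shaped like the one in the traceback."""
    try:
        try:
            raise FakeDriverError(
                "could not serialize access due to concurrent update"
            )
        except FakeDriverError as driver_exc:
            raise FakeOperationalError(str(driver_exc)) from driver_exc
    except FakeOperationalError as wrapped:
        return wrapped
```

A check like this would let a handler retry only on serialization failures and re-raise every other OperationalError unchanged.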

Pedro Guimarães (pguimaraes) wrote :

I can reproduce this issue using Juju with the following bundle: https://pastebin.ubuntu.com/p/NywWq5SPTm/ (it requires creating an OAM network with the oam-space name)

description: updated
Pedro Guimarães (pguimaraes) wrote :

I've tried modifying the compose method in maasserver/forms/pods.py to something like: https://pastebin.canonical.com/p/vVZQN5GKPs/

The idea was to wait 10s every time I receive an OperationalError exception. However, I am now seeing TransactionManagementError popping up in the logs: https://pastebin.canonical.com/p/WNcvjXqRDg/
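
One possible explanation for the TransactionManagementError, stated as an assumption since the pastebin patch is not reproduced here: when a database error is caught while still inside a `transaction.atomic()` block, Django refuses further queries until that block exits. If so, a retry has to re-run the whole transaction from the start, with the retry loop sitting outside the transaction. A minimal sketch of that shape follows; `retry_transaction`, `SerializationError`, and the parameter names are hypothetical, and `SerializationError` stands in for the OperationalError seen above.

```python
import random
import time


class SerializationError(Exception):
    """Stand-in for django.db.utils.OperationalError with SQLSTATE 40001."""


def retry_transaction(run_transaction, attempts=5, base_delay=0.1,
                      sleep=time.sleep):
    """Call run_transaction() until it succeeds or attempts are exhausted.

    run_transaction must open and close its own transaction on every call
    (for example by wrapping its body in Django's transaction.atomic()), so
    each retry starts from a fresh snapshot.
    """
    for attempt in range(1, attempts + 1):
        try:
            return run_transaction()
        except SerializationError:
            if attempt == attempts:
                raise  # give up and surface the last failure
            # Exponential backoff with jitter so competing requests spread out.
            delay = base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5)
            sleep(delay)
```

The `sleep` parameter is injectable only so the loop can be exercised without real delays; a short randomized backoff is used here instead of a fixed 10s wait so that parallel requests are less likely to collide again on the retry.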

I believe there is a race condition. Building a VM means updating the resource consumption of its host, so two parallel requests for VMs can interleave like this:

1) First request for VM
2) First request SELECTs all commit-related info
3) Second request for VM
4) Second request SELECTs all commit-related info
5) Node is picked for first request -> committed resources info is updated for first request
6) Second request fails because it was based on old info
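
The interleaving above can be reproduced in miniature without a database. The sketch below is a toy model, not MAAS code: `PodHints`, `Snapshot`, and `ConcurrentUpdate` are hypothetical names, and a version counter stands in for PostgreSQL's check that a row has not changed since the transaction took its snapshot.

```python
class ConcurrentUpdate(Exception):
    """Stand-in for 'could not serialize access due to concurrent update'."""


class PodHints:
    """Toy shared row: free cores on a pod plus a version bumped on each write."""

    def __init__(self, free_cores):
        self.free_cores = free_cores
        self.version = 0


class Snapshot:
    """What one request saw when it ran its SELECT."""

    def __init__(self, row):
        self.row = row
        self.free_cores = row.free_cores
        self.version = row.version

    def commit_allocation(self, cores):
        # Reject the write if another request updated the row after our read.
        if self.row.version != self.version:
            raise ConcurrentUpdate("row changed since snapshot")
        self.row.free_cores = self.free_cores - cores
        self.row.version += 1


def run_interleaving():
    hints = PodHints(free_cores=8)
    first = Snapshot(hints)       # steps 1-2: first request reads
    second = Snapshot(hints)      # steps 3-4: second request reads the same state
    first.commit_allocation(4)    # step 5: first request updates the hints
    try:
        second.commit_allocation(4)  # step 6: second request writes from stale data
    except ConcurrentUpdate:
        return hints.free_cores, "second request failed"
    return hints.free_cores, "both succeeded"
```

In this model the second request would succeed if it were restarted from its read, because a new `Snapshot` would pick up the updated hints.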

Changed in maas:
status: New → In Progress
importance: Undecided → High
milestone: none → 2.7.0alpha1
assignee: nobody → Blake Rouse (blake-rouse)
Changed in maas:
milestone: 2.7.0b1 → 2.7.0b2
Changed in maas:
milestone: 2.7.0b2 → none
Changed in maas:
assignee: Blake Rouse (blake-rouse) → nobody
status: In Progress → Triaged
Jerzy Husakowski (jhusakowski) wrote :

Is this reproducible on a more recent MAAS (3.3+)?

Changed in maas:
status: Triaged → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired