Intermittent DB failure when creating VM pods via POST /MAAS/api/2.0/machines/?op=allocate

Bug #1843493 reported by Pedro Guimarães on 2019-09-10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Blake Rouse

Bug Description

MAAS: 2.6.0 (7802-g59416a869-0ubuntu1~18.04.1)

Whenever I boot multiple VM pods at the same time using the "allocate" operation, some VMs fail, and Juju reports the following status message:
3 down pending bionic suitable availability zone for machine 3 not found

Looking into the MAAS source code, I can see that the decision on which pod will receive the corresponding VM is made in the method:

I added more logging to the "except" statement and found that an OperationalError is being thrown. I believe it corresponds to resource contention in the database, since the error message is "could not serialize access due to concurrent update", which is also documented by Postgres:

Full traceback of OperationalError:

Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/django/db/backends/", line 64, in execute
     return self.cursor.execute(sql, params)
 psycopg2.extensions.TransactionRollbackError: could not serialize access due to concurrent update

 The above exception was the direct cause of the following exception:

 Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/maasserver/forms/", line 745, in compose
   File "/usr/lib/python3/dist-packages/maasserver/forms/", line 664, in compose
     return create_and_sync((requested_machine, result))
   File "/usr/lib/python3/dist-packages/maasserver/forms/", line 606, in create_and_sync
   File "/usr/lib/python3/dist-packages/maasserver/models/", line 622, in sync_hints
   File "/usr/lib/python3/dist-packages/maasserver/models/", line 252, in save
     return super(CleanSave, self).save(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/django/db/models/", line 808, in save
     force_update=force_update, update_fields=update_fields)
   File "/usr/lib/python3/dist-packages/django/db/models/", line 838, in save_base
     updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
   File "/usr/lib/python3/dist-packages/maasserver/models/", line 278, in _save_table
   File "/usr/lib/python3/dist-packages/django/db/models/", line 905, in _save_table
   File "/usr/lib/python3/dist-packages/django/db/models/", line 955, in _do_update
     return filtered._update(values) > 0
   File "/usr/lib/python3/dist-packages/django/db/models/", line 664, in _update
     return query.get_compiler(self.db).execute_sql(CURSOR)
   File "/usr/lib/python3/dist-packages/django/db/models/sql/", line 1204, in execute_sql
     cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
   File "/usr/lib/python3/dist-packages/django/db/models/sql/", line 899, in execute_sql
     raise original_exception
   File "/usr/lib/python3/dist-packages/django/db/models/sql/", line 889, in execute_sql
     cursor.execute(sql, params)
   File "/usr/lib/python3/dist-packages/maasserver/prometheus/", line 21, in execute
     return super().execute(sql, params=params)
   File "/usr/lib/python3/dist-packages/django/db/backends/", line 64, in execute
     return self.cursor.execute(sql, params)
   File "/usr/lib/python3/dist-packages/django/db/", line 94, in __exit__
     six.reraise(dj_exc_type, dj_exc_value, traceback)
   File "/usr/lib/python3/dist-packages/django/utils/", line 685, in reraise
     raise value.with_traceback(tb)
   File "/usr/lib/python3/dist-packages/django/db/backends/", line 64, in execute
     return self.cursor.execute(sql, params)
 django.db.utils.OperationalError: could not serialize access due to concurrent update

Pedro Guimarães (pguimaraes) wrote :

I can reproduce this issue using Juju with the following bundle (an OAM network with the space name oam-space needs to be created first):

description: updated
Pedro Guimarães (pguimaraes) wrote :

I tried modifying the compose method in maasserver/forms/ to something like:

The idea was to wait 10s every time an OperationalError is raised. However, I am now seeing TransactionManagementError popping up in the logs:

I believe there is a race condition: building a VM means updating the resource consumption of its host, so two parallel requests for VMs can interleave like this:

1) First request for VM
2) SELECT all commit-related info
3) Second request for VM
4) SELECT all commit-related info
5) Picked node for first request -> update committed resources info for first request
6) Second request fails because it was based on old info
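
The sequence above can be sketched as a toy, in-memory simulation. This is not MAAS code: PodRow, allocate, allocate_with_retry and SerializationFailure are hypothetical names used only for illustration, and the version counter merely stands in for the database rejecting a write that was based on a stale read. It also suggests that a retry probably only helps if the whole read-decide-write cycle is rerun with fresh data, rather than repeating just the failed write.

```python
import threading


class SerializationFailure(Exception):
    """Hypothetical stand-in for 'could not serialize access due to concurrent update'."""


class PodRow:
    """Toy in-memory 'row' holding a pod's committed cores plus a version counter."""

    def __init__(self, total_cores):
        self.total_cores = total_cores
        self.committed_cores = 0
        self.version = 0
        self._lock = threading.Lock()

    def snapshot(self):
        # Steps 2 and 4: read the commit-related info as it is right now.
        with self._lock:
            return self.version, self.committed_cores

    def update(self, seen_version, new_committed):
        # Steps 5 and 6: reject the write if the row changed after it was read.
        with self._lock:
            if self.version != seen_version:
                raise SerializationFailure("concurrent update")
            self.committed_cores = new_committed
            self.version += 1


def allocate(pod, cores, snapshot=None):
    """One read-decide-write cycle; `snapshot` lets the caller replay a stale read."""
    version, committed = snapshot if snapshot is not None else pod.snapshot()
    if committed + cores > pod.total_cores:
        raise ValueError("pod has no capacity left")
    pod.update(version, committed + cores)


def allocate_with_retry(pod, cores, attempts=3):
    """Rerun the whole cycle on failure, re-reading the state on every attempt."""
    for attempt in range(attempts):
        try:
            allocate(pod, cores)
            return attempt + 1  # number of attempts that were needed
        except SerializationFailure:
            if attempt == attempts - 1:
                raise


pod = PodRow(total_cores=8)
first_read = pod.snapshot()     # steps 1-2
second_read = pod.snapshot()    # steps 3-4
allocate(pod, 2, first_read)    # step 5: first request commits its resources
try:
    allocate(pod, 2, second_read)   # step 6: based on old info, so it is rejected
    second_failed = False
except SerializationFailure:
    second_failed = True
tries = allocate_with_retry(pod, 2)  # a fresh read makes the retry succeed
```

In this sketch the second request fails exactly as in step 6, and it goes through once it starts over from a fresh read.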

Changed in maas:
status: New → In Progress
importance: Undecided → High
milestone: none → 2.7.0alpha1
assignee: nobody → Blake Rouse (blake-rouse)