Same node handed out in response to two parallel 'acquire' requests

Bug #1382575 reported by Jason Hobbs on 2014-10-17
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Critical
Jason Hobbs

Bug Description

I have a script that uses the cli to acquire nodes in quick succession in parallel, in two separate processes. I do 30 acquire node requests, and all return successfully. However, less than 30 unique system-ids are returned, so sometimes MAAS is handing out the same system ID in response to multiple node acquire requests.

This was initially reported by a user and I have confirmed it on my own setup, on MAAS 1.7.0~beta6+bzr3260-0ubuntu1~trusty1

Related branches

I've attached a script that reproduces the issue.

You can work around this issue by editing the first line of "/etc/maas/maas-http.conf" changing 'processes=2' to 'processes=1'.

summary: - Same node handed out in two parallel 'acquire' requests from the same
- user
+ Same node handed out in response to two parallel 'acquire' requests from
+ the same user
Jason Hobbs (jason-hobbs) wrote :

This also happens when there are two users making the requests.

summary: - Same node handed out in response to two parallel 'acquire' requests from
- the same user
+ Same node handed out in response to two parallel 'acquire' requests
Changed in maas:
status: Triaged → In Progress
assignee: nobody → Jason Hobbs (jason-hobbs)
Graham Binns (gmb) wrote :

Node.acquire() doesn't commit after save(), so the state change won't be committed until the request completes. That's where the race lies. We could also perhaps take out a lock on the table when acquiring the node; Gavin, can you weigh in on that?

Jason Hobbs (jason-hobbs) wrote :

Adding a transaction.commit() in node.acquire(), after the save, doesn't fix this, so I don't think django's transaction commit timing is the full story.

Gavin Panella (allenap) wrote :

At the isolation level we're at — read committed — the database doesn't
protect us from concurrent updates to the same object. Two threads can
read the object at the same time, update the status and user, then save
it back, with the last one "winning".

Running with serialisable isolation would cause the last of these to
commit to fail, so we'd eliminate the race. However, we'd want to retry
the failed operation.

PostgreSQL also offers repeatable read isolation. This may be a useful
middle-ground between read committed and serialisable, but I don't know
enough about it to comment more.

Django makes it simple enough to set the isolation level, but doesn't
provide retry support, as far as I can tell. It makes things somewhat
harder to do ourselves too, because it tries to normalise exceptions
coming from the database, semi-smothering the exception we'd want to
trigger a retry on.

Post 1.7 my thoughts are leaning towards greater isolation between
transactions, and more explicit transaction handling in our code:

- Increase the isolation level.

- Do not automatically wrap the request in a transaction.

- Forbid database access unless a transaction is explicitly begun.

- Create a context manager that runs the wrapped function within a
  transaction, but also detects serialisation errors and retries a
  finite number of times. Use this in preference to Django's atomic()
  decorator.

Andres Rodriguez (andreserl) wrote :

Gavin,

Your ISOLATION changes to the DB are the ones that caused this issue. We had come across this issue before and https://bugs.launchpad.net/juju-core/+bug/1314409 was the fix for it. This, to me, seems like a critical regression and changes need to be reverted.

Thanks

Andres Rodriguez (andreserl) wrote :

This also needs to be backported to 1.7

Changed in maas:
status: In Progress → Fix Committed
Christian Reis (kiko) wrote :

This is a redux of bug 1314409.

Just for completeness sake, we discussed this at length this week and we can't find where this has regressed; in fact, the isolation level as far as we can tell has never changed; it was erroneously thought to have been changed in bug 1314409, but we don't think that change ever actually did anything.

If that's true, and this indeed has regressed from 1.5, it's not a factor of changing the isolation level, but something else in the lifecycle transitions that is making this prone to races. I'm not sure what that could be.

Christian Reis (kiko) wrote :

The real, deeper fix is to change the isolation level in the database, work which Gavin currently has underway: https://code.launchpad.net/~allenap/maas/transactions-redux-the-revenge

On Tuesday 21 Oct 2014 18:57:28 you wrote:
> If that's true, and this indeed has regressed from 1.5, it's not a
> factor of changing the isolation level, but something else in the
> lifecycle transitions that is making this prone to races. I'm not sure
> what that could be.

I rather suspect it's blind luck or the calling environment that originally
experienced the bug had some subtle change.

Changed in maas:
status: Fix Committed → Fix Released
Andres Rodriguez (andreserl) wrote :

This bug was filed in upstream MAAS and not in Ubuntu. It was fixed as part of 1.7. This was fixed and verified to be working in all Ubuntu releases. Ubuntu 1.7 is being SRU'd. Marking this as verification-done, as it seems to be blocking SRU.

tags: added: verification-done
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers