Bulk operation leaves nodes in inconsistent state

Bug #1377099 reported by Raphaël Badin
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
Critical
Graham Binns

Bug Description

I tested this in the UI but I think the same goes for the API: when performing a bulk operation, the operation only succeeds if all the sub operations succeed. This leads to:
- poor user experience: if I perform 4 operations and one of them fails all the other operations will be rolled back
- possibly leaving nodes in an inconsistent state: if only one in 4 operations fails, the power operations of the remaining 3 operations will still be performed although the DB side of things will be rolled back. (This is because the RPC operations —the power actions in this instance— are not part of the transaction and thus don't get rolled back).

= How to reproduce =

- Get two nodes 'READY'
- Commission one node
- Wait 20 seconds: enough so that the power operation is done (and the lock removed) but not enough to get the commissioning operation done
- Commission the second node
- Quickly after this (i.e. while the lock of the second node is still being held): abort the commissioning operation of the two nodes
- The bulk operation will fail (because of the lock held on the second node) but the first node will also be powered down.

=> the second node is now in an inconsistent state: it's commissioning but it has been powered down.

Related branches

Revision history for this message
Raphaël Badin (rvb) wrote :

I think the semantics of a bulk operation should be: similar to what deferredList does (with consumeErrors=True): a bulk operation should always succeed and report back about the number of suboperations that succeeded/failed.

Under the hood, we can probably achieve this by using sub transactions.

description: updated
Revision history for this message
Gavin Panella (allenap) wrote :

How about:

- Prepare nodes for commissioning.

- Commit transaction.

- Fire off RPC calls to power nodes on.

- Collect failures, then for each:

  - If the failure is because there's an existing power change in
    progress, transition back to previous status.

  - Otherwise transition node to a failed state.

?

Revision history for this message
Gavin Panella (allenap) wrote :

Actually, there's nothing in the PowerOn call currently that would
warrant transitioning a node to a failed state. Here's a revised
process:

- Prepare nodes for commissioning.

- Commit transaction.

- Fire off RPC calls to power nodes on.

- Collect failures, then for each:

  - Transition node back to previous node.

  - Tell the user.

?

Revision history for this message
Raphaël Badin (rvb) wrote :

> - Prepare nodes for commissioning.
> - Commit transaction.
> - Fire off RPC calls to power nodes on.
> - Collect failures, then for each:
> - Transition node back to previous node.
> - Tell the user.

Sounds like a good plan (but note that it means changing each method that does things in bulk).

> I think the semantics of a bulk operation should be: similar to what deferredList does (with consumeErrors=True): a bulk operation > should always succeed and report back about the number of suboperations that succeeded/failed.

(replying to myself here) Actually, it would probably be more consistent to have a bulk operation fail only if all the sub operations fail. If a single one of the sub operation succeeds, the bulk operation is considered a success (albeit partial) but should report back about the failures.

Changed in maas:
milestone: none → next
Graham Binns (gmb)
Changed in maas:
status: Triaged → Fix Committed
assignee: nobody → Graham Binns (gmb)
milestone: next → 1.7.0
Changed in maas:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.