MAAS

Bulk operation leaves nodes in inconsistent state

Bug #1377099 reported by Raphaël Badin on 2014-10-03

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	MAAS	Fix Released	Critical	Graham Binns	MAAS 1.7.0

Bug Description

I tested this in the UI but I think the same goes for the API: when performing a bulk operation, the operation only succeeds if all the sub operations succeed. This leads to:
- poor user experience: if I perform 4 operations and one of them fails all the other operations will be rolled back
- possibly leaving nodes in an inconsistent state: if only one in 4 operations fails, the power operations of the remaining 3 operations will still be performed although the DB side of things will be rolled back. (This is because the RPC operations —the power actions in this instance— are not part of the transaction and thus don't get rolled back).

= How to reproduce =

- Get two nodes 'READY'
- Commission one node
- Wait 20 seconds: enough so that the power operation is done (and the lock removed) but not enough to get the commissioning operation done
- Commission the second node
- Quickly after this (i.e. while the lock of the second node is still being held): abort the commissioning operation of the two nodes
- The bulk operation will fail (because of the lock held on the second node) but the first node will also be powered down.

=> the second node is now in an inconsistent state: it's commissioning but it has been powered down.

See original description

Tags:

Related branches

lp:~gmb/maas/catch-errors-in-power-callsites

Merged into lp:~maas-committers/maas/trunk at revision 3228

Raphaël Badin (community): Approve on 2014-10-09

Gavin Panella (community): Approve on 2014-10-09

Revision history for this message

Raphaël Badin (rvb) wrote on 2014-10-03:

I think the semantics of a bulk operation should be: similar to what deferredList does (with consumeErrors=True): a bulk operation should always succeed and report back about the number of suboperations that succeeded/failed.

Under the hood, we can probably achieve this by using sub transactions.

description:

updated

Revision history for this message

Gavin Panella (allenap) wrote on 2014-10-03:

How about:

- Prepare nodes for commissioning.

- Commit transaction.

- Fire off RPC calls to power nodes on.

- Collect failures, then for each:

- If the failure is because there's an existing power change in
progress, transition back to previous status.

- Otherwise transition node to a failed state.

Revision history for this message

Gavin Panella (allenap) wrote on 2014-10-03:

Actually, there's nothing in the PowerOn call currently that would
warrant transitioning a node to a failed state. Here's a revised
process:

- Prepare nodes for commissioning.

- Commit transaction.

- Fire off RPC calls to power nodes on.

- Collect failures, then for each:

- Transition node back to previous node.

- Tell the user.

Revision history for this message

Raphaël Badin (rvb) wrote on 2014-10-03:

> - Prepare nodes for commissioning.
> - Commit transaction.
> - Fire off RPC calls to power nodes on.
> - Collect failures, then for each:
> - Transition node back to previous node.
> - Tell the user.

Sounds like a good plan (but note that it means changing each method that does things in bulk).

> I think the semantics of a bulk operation should be: similar to what deferredList does (with consumeErrors=True): a bulk operation > should always succeed and report back about the number of suboperations that succeeded/failed.

(replying to myself here) Actually, it would probably be more consistent to have a bulk operation fail only if all the sub operations fail. If a single one of the sub operation succeeds, the bulk operation is considered a success (albeit partial) but should report back about the failures.

Julian Edwards (julian-edwards) on 2014-10-08

Changed in maas:
milestone:	none → next

Graham Binns (gmb) on 2014-10-15

Changed in maas:
status:	Triaged → Fix Committed
assignee:	nobody → Graham Binns (gmb)
milestone:	next → 1.7.0

Julian Edwards (julian-edwards) on 2014-11-19

Changed in maas:
status:	Fix Committed → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.