Default waiting policy for power command retries might lead to incorrectly reported failures

Bug #1384758 reported by Andres Rodriguez on 2014-10-23
This bug affects 1 person
Affects   Importance   Assigned to
MAAS      Critical     Graham Binns
1.7       Critical     Graham Binns

Bug Description

The default waiting policy for power command retries is too short and can cause machines that are still in the process of being powered on or off to be marked as failed.

Currently, the default waiting policy is:

 (1, 1, 1, 1, 1, 3, 5)

In some scenarios, this can cause BMC lockup, or can cause MAAS to mistake a node that is still powering down for one that failed to power down, which makes the release fail.

For example, some IPMI-based BMCs do not power off the system right away. The BMC can be busy carrying out the request, or unresponsive, for a few seconds (for example, 10-15 seconds).

So when the power off request comes in, the machine does not power off right away. Sometimes it takes longer than the total wait time to actually power off.

As a result, in some cases the machine is being powered off, but MAAS thinks the power off failed because MAAS waits 13 seconds at most, whereas the BMC can take 15 seconds to report that the node is actually off.
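To make the failure mode concrete, here is a minimal sketch of a retry loop driven by the current waiting policy. It is not the MAAS implementation: `wait_for_power_state`, `FakeBMC` and the choice to sleep before each query are assumptions made for illustration, and the BMC is simulated so the example runs instantly.

```python
def wait_for_power_state(query_state, expected, waiting_policy, sleep):
    """Poll until the expected state is seen or the policy is exhausted.

    Hypothetical helper: waits `wait` seconds, then queries, once per
    entry in `waiting_policy`. Returns True if the state was reached.
    """
    for wait in waiting_policy:
        sleep(wait)
        if query_state() == expected:
            return True
    return False


class FakeBMC:
    """Simulated BMC that keeps reporting 'on' until `off_after` seconds pass."""

    def __init__(self, off_after):
        self.off_after = off_after
        self.elapsed = 0

    def sleep(self, seconds):
        # Advance a fake clock instead of really sleeping.
        self.elapsed += seconds

    def query_state(self):
        return "off" if self.elapsed >= self.off_after else "on"


# A BMC that needs 15 seconds to report the node as off.
bmc = FakeBMC(off_after=15)
reached = wait_for_power_state(
    bmc.query_state, "off", (1, 1, 1, 1, 1, 3, 5), bmc.sleep)

print(reached, bmc.elapsed)
```

In this simulation the loop gives up after 13 seconds of waiting, two seconds before the simulated BMC would have reported the node as off, so the power off is treated as a failure even though it is still in progress.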

I'd suggest we increase the waiting time to something like:

(2, 2, 2, 2, 2, 4, 6)

This would give a total wait of 20 seconds before we decide whether the node failed to power on or off.
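The arithmetic behind the two policies can be checked with a short sketch. The variable names and the 15-second BMC delay used for comparison are illustrative, taken from the example in this report rather than from MAAS code.

```python
from itertools import accumulate

policies = {
    "current": (1, 1, 1, 1, 1, 3, 5),
    "proposed": (2, 2, 2, 2, 2, 4, 6),
}

# Example delay from this report: a BMC that takes 15 seconds to report "off".
bmc_delay = 15

totals = {}
checkpoints = {}
for name, policy in policies.items():
    # Cumulative seconds elapsed at the end of each wait.
    checkpoints[name] = list(accumulate(policy))
    totals[name] = sum(policy)
    covers = totals[name] >= bmc_delay
    print(name, checkpoints[name], "total:", totals[name], "covers delay:", covers)
```

The current policy tops out at 13 seconds, short of the 15-second delay, while the proposed policy reaches 20 seconds, which leaves some margin for slow BMCs.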

Changed in maas:
importance: Undecided → Critical
Graham Binns (gmb) on 2014-10-23
Changed in maas:
status: New → Triaged
tags: added: trivial
Christian Reis (kiko) on 2014-10-23
Changed in maas:
assignee: nobody → Andres Rodriguez (andreserl)
milestone: none → next
Graham Binns (gmb) on 2014-10-23
Changed in maas:
assignee: Andres Rodriguez (andreserl) → Graham Binns (gmb)
status: Triaged → In Progress
Christian Reis (kiko) wrote :

Is this suitable for 1.7?

Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: next → none
Changed in maas:
status: Fix Committed → Fix Released

Hello Andres, or anyone else affected,

Accepted maas into utopic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/maas/1.7.5+bzr3369-0ubuntu1~14.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-needed
Andres Rodriguez (andreserl) wrote :

This issue has been verified to work both on upgrade and fresh install, and has been QA'd. Marking verification-done.

tags: added: verification-done
removed: verification-needed