[2.3+] Testing Fails due to Power Issue with Manual Power Type

Bug #1768659 reported by KingJ on 2018-05-02
This bug affects 1 person
Affects: MAAS; Importance: High; Assigned to: Lee Trager
Affects: 2.3; Importance: High; Assigned to: Lee Trager

Bug Description

I have an old machine that does not have any form of IPMI or remote power control. I created a new machine in MAAS with the "Manual" power type, expecting that I would still be able to use it with MAAS but that I would need to manually power the machine on and off.

However, while MAAS successfully performs commissioning, all hardware tests fail with "BMC never transitioned from unknown to on.". It appears that when executing tests, MAAS expects to control the machine's power but cannot, so it eventually times out and the test fails. Consequently, even though I power on the machine manually, as soon as it completes a PXE boot to the login screen it shuts down again because there are no tests to run.

May 02 21:32:08 maas maas.drivers.power.manual[929]: [info] You need to check power state of sfsh84 manually.
May 02 21:32:08 maas maas.drivers.power.manual[929]: [info] You need to power on sfsh84 manually.
May 02 21:32:20 maas maas.drivers.power.manual[929]: [info] You need to check power state of sfsh84 manually.
May 02 21:32:20 maas maas.power[929]: [error] Error changing power state (cycle) of node: nagisa (sfsh84)
May 02 21:32:20 maas maas.node[1089]: [info] nagisa: Status transition from TESTING to FAILED_TESTING
May 02 21:32:20 maas maas.node[1089]: [error] nagisa: Marking node failed: Power cycle for the node failed: Failed talking to node's BMC: Failed to power sfsh84. BMC never transitioned from unknown to on.
May 02 21:32:20 maas sh[927]: 2018-05-02 21:32:20 provisioningserver.rpc.power: [critical] nagisa: Power cycle failed.
May 02 21:32:20 maas sh[927]: Traceback (most recent call last):
May 02 21:32:20 maas sh[927]: File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 459, in callback
May 02 21:32:20 maas sh[927]: self._startRunCallbacks(result)
May 02 21:32:20 maas sh[927]: File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 567, in _startRunCallbacks
May 02 21:32:20 maas sh[927]: self._runCallbacks()
May 02 21:32:20 maas sh[927]: File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
May 02 21:32:20 maas sh[927]: current.result = callback(current.result, *args, **kw)
May 02 21:32:20 maas sh[927]: File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1442, in gotResult
May 02 21:32:20 maas sh[927]: _inlineCallbacks(r, g, deferred)
May 02 21:32:20 maas sh[927]: --- <exception caught here> ---
May 02 21:32:20 maas sh[927]: File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
May 02 21:32:20 maas sh[927]: result = result.throwExceptionIntoGenerator(g)
May 02 21:32:20 maas sh[927]: File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
May 02 21:32:20 maas sh[927]: return g.throw(self.type, self.value, self.tb)
May 02 21:32:20 maas sh[927]: File "/usr/lib/python3/dist-packages/provisioningserver/rpc/power.py", line 287, in change_power_state
May 02 21:32:20 maas sh[927]: system_id, hostname, power_type, power_change, context)
May 02 21:32:20 maas sh[927]: File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
May 02 21:32:20 maas sh[927]: result = result.throwExceptionIntoGenerator(g)
May 02 21:32:20 maas sh[927]: File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
May 02 21:32:20 maas sh[927]: return g.throw(self.type, self.value, self.tb)
May 02 21:32:20 maas sh[927]: File "/usr/lib/python3/dist-packages/provisioningserver/drivers/power/__init__.py", line 326, in cycle
May 02 21:32:20 maas sh[927]: yield self.perform_power(self.power_on, "on", system_id, context)
May 02 21:32:20 maas sh[927]: File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
May 02 21:32:20 maas sh[927]: result = g.send(result)
May 02 21:32:20 maas sh[927]: File "/usr/lib/python3/dist-packages/provisioningserver/drivers/power/__init__.py", line 415, in perform_power
May 02 21:32:20 maas sh[927]: % (system_id, state, state_desired))
May 02 21:32:20 maas sh[927]: provisioningserver.drivers.power.PowerError: Failed to power sfsh84. BMC never transitioned from unknown to on.


KingJ (kj-kingj) wrote :
Changed in maas:
importance: Undecided → Medium
status: New → Triaged
milestone: none → 2.4.0rc1
summary: - Testing Fails due to Power Issue with Manual Power Type
+ [2.3+] Testing Fails due to Power Issue with Manual Power Type
Changed in maas:
assignee: nobody → Lee Trager (ltrager)
KingJ (kj-kingj) wrote :

Are there any potential workarounds for this? This prevents me from running tests after commissioning at the moment.

Lee Trager (ltrager) on 2018-05-07
Changed in maas:
status: Triaged → In Progress
Lee Trager (ltrager) wrote :

I was able to run commissioning with testing with multiple tests running to completion on a machine with manual power state.

@kj-kingj What tests are you trying to run?

Changed in maas:
status: In Progress → Incomplete
KingJ (kj-kingj) wrote :

@ltrager I've been playing around with this a bit more, and if I run the tests as part of commissioning, it works. However, if I commission a machine and then tell MAAS to run extra tests, I trigger this bug. At a guess, when you test as part of commissioning, MAAS doesn't care about the power state because the node reboots itself and starts the tests. When tests are triggered after commissioning, however, MAAS explicitly wants to control the power state and ultimately fails the testing because it can't confirm that the machine has powered on.

The failure is independent of which tests are selected: I can trigger it no matter what I choose (e.g. just the default smartctl-validate, or a huge barrage of badblocks-destructive, smartctl-long, network tests, etc.).

Lee Trager (ltrager) wrote :

Thanks for the clarification. I was able to reproduce this when I started testing on its own, without commissioning. This appears to be a bug in the power cycle code. A Deferred is created which has 30 seconds to complete the power cycle. Since MAAS can't verify that the power cycle occurred on a manually powered machine, the Deferred times out, causing the failure. The power cycle code is also used by rescue mode, so machines with manual power control can't use rescue mode either.
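
To make the failure mode concrete, here is a minimal sketch in plain Python (no Twisted) of a power action that is verified by polling the power state. It is not the MAAS source: `ManualDriver`, `perform_power`, the `attempts` count and the `trust_unqueryable` flag are all hypothetical names chosen for illustration, and the real code waits between checks under a timeout rather than counting attempts. Only the error message format is taken from the log above. The `trust_unqueryable` branch shows one possible way to avoid the failure; it is an assumption and not necessarily the fix that was committed.

```python
class PowerError(Exception):
    """Raised when a power action cannot be confirmed."""


class ManualDriver:
    """Hypothetical stand-in for a power type with no remote control."""

    # Assumption: the driver advertises that it cannot report power state.
    queryable = False

    def power_on(self, system_id):
        print(f"You need to power on {system_id} manually.")

    def power_query(self, system_id):
        print(f"You need to check power state of {system_id} manually.")
        return "unknown"  # a manual driver can never answer "on" or "off"


def perform_power(driver, system_id, state_desired,
                  attempts=3, trust_unqueryable=False):
    """Request power-on, then poll until the desired state is reported.

    With trust_unqueryable=True, drivers that cannot report their state
    skip verification instead of failing (illustrative only).
    """
    driver.power_on(system_id)
    if trust_unqueryable and not driver.queryable:
        return "unknown"  # nothing to verify; leave it to the operator
    state = "unknown"
    for _ in range(attempts):
        state = driver.power_query(system_id)
        if state == state_desired:
            return state
    raise PowerError(
        "Failed to power %s. BMC never transitioned from %s to %s."
        % (system_id, state, state_desired))


if __name__ == "__main__":
    try:
        perform_power(ManualDriver(), "sfsh84", "on")
    except PowerError as error:
        print(error)
```

Because the manual driver always answers "unknown", the strict path can only ever end in `PowerError`, no matter how quickly the operator powers the machine on. That matches the log, where the failure is reported seconds after the "power on manually" message.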

Changed in maas:
status: Incomplete → In Progress
Lee Trager (ltrager) on 2018-05-08
Changed in maas:
importance: Medium → High
KingJ (kj-kingj) wrote :

Apologies for the confusion there in #3, Lee. I only discovered that tests worked during deployment with your previous reply. Thank you for the quick fix!

Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released