[2.5] Machines get stuck (Releasing, Exiting Rescue Mode) due to power management failures

Bug #1789426 reported by Andres Rodriguez on 2018-08-28
This bug affects 1 person
Affects: MAAS
Importance: Critical
Assigned to: Unassigned

Bug Description

Machines get stuck in 'Releasing' or 'Exiting Rescue Mode' due to apparent power management failures.

I have 2 region/rack controllers and 2 rack controllers. Some of the rack controllers do not have the necessary power management tools, yet they are still chosen to do power management even when other rack controllers that do have the tools are available.

Changed in maas:
importance: Undecided → High
milestone: none → 2.5.0beta1
tags: added: track
description: updated
Changed in maas:
status: New → Triaged
importance: High → Critical
Andres Rodriguez (andreserl) wrote :

I also noticed that after various attempts, other racks and region/racks started disconnecting from MAAS.

Dean Henrichsmeyer (dean) wrote :

I think having MAAS regions know which rack controller to use based on power management tools is the wrong approach. We should require the same power management tools across all rack controllers. If we allow divergence we're asking for confusion and a debugging nightmare down the road. Even if MAAS is smart enough to just use the right one - what happens when that rack controller dies and now the remaining one can't do its job because we didn't enforce consistency?

I think at the very least we should continue surfacing warnings about power management tools missing and surface the controllers to which it applies.

Andres Rodriguez (andreserl) wrote :

MAAS doesn't know which rack controller to use based on power management tools. MAAS has a mechanism by which it finds a rack controller that can communicate with the machine via the address of the BMC (e.g. it checks that MAAS can reach the machine via IP).

That said, we cannot enforce consistency because we cannot depend on packages that are not in main. In my particular case, we use AMT-related tooling that is not in main, and hence we cannot install it as part of the default package installation. This issue is solved by using the snap, because the snap includes all required packages.

And yes, I agree that we should surface these warnings. In fact, I think we should run a check per rack controller to ensure it has all expected power tools, and add warnings telling the user which ones are missing.
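
To illustrate the per-rack-controller check proposed above, here is a minimal sketch and not MAAS's actual code. The driver-to-tool mapping (`REQUIRED_TOOLS`) and the function names are hypothetical, chosen only for illustration; the real list of executables would have to come from the power drivers themselves.

```python
import shutil

# Hypothetical mapping of power driver -> executables it needs.
# The names below are illustrative assumptions, not MAAS's authoritative list.
REQUIRED_TOOLS = {
    "amt": ["amttool", "wsman"],
    "ipmi": ["ipmipower"],
}


def missing_power_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return {driver: [missing executables]} for the controller this runs on."""
    missing = {}
    for driver, tools in required.items():
        # shutil.which returns None when the executable is not on PATH.
        absent = [tool for tool in tools if which(tool) is None]
        if absent:
            missing[driver] = absent
    return missing


def format_warnings(controller, missing):
    """Turn the result of missing_power_tools() into user-facing warnings."""
    return [
        f"{controller}: power driver '{driver}' is missing: {', '.join(tools)}"
        for driver, tools in sorted(missing.items())
    ]
```

Each rack controller would run `missing_power_tools()` locally and report the result to the region, which could then show one warning per affected controller.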

Blake Rouse (blake-rouse) wrote :

Actually, the mechanism to determine which rack controllers can communicate with a machine is only for Layer 3 communication. If MAAS determines that the BMC is located on the same Layer 2 network as other rack controllers, it will just pick one of those at random to power control that machine.

I do agree that we should want all rack controllers to be consistent in their access to tools. The snap helps with this because it's all bundled, so it's a non-issue with snaps.
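
The selection behaviour described in this comment can be sketched roughly as follows. This is an assumption-laden illustration, not the MAAS implementation: `pick_rack_controller`, the `racks` mapping and the optional `has_tools` filter are all hypothetical names. The filter shows how tool availability could be taken into account before the random pick.

```python
import ipaddress
import random


def pick_rack_controller(bmc_ip, racks, has_tools=None, rng=random):
    """Pick a rack controller for a BMC address.

    racks: {controller name: [CIDR strings of subnets it is attached to]}
    has_tools: optional callable(name) -> bool (hypothetical tool check)
    Returns a controller name, or None when no controller qualifies.
    """
    bmc = ipaddress.ip_address(bmc_ip)
    # Controllers attached to a subnet that contains the BMC address.
    candidates = [
        name
        for name, cidrs in racks.items()
        if any(bmc in ipaddress.ip_network(cidr) for cidr in cidrs)
    ]
    # Optionally drop controllers that lack the required power tools.
    if has_tools is not None:
        candidates = [name for name in candidates if has_tools(name)]
    # Among the remaining controllers the choice is random.
    return rng.choice(candidates) if candidates else None
```

Without `has_tools`, any controller on the BMC's subnet may be chosen, including one that cannot actually drive the BMC. That is the situation this bug describes.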

Andres Rodriguez (andreserl) wrote :

After further investigation, it seems that the actual issue wasn't that the AMT-related dependencies weren't installed (they were).

It seems AMT reset itself and power management is not working for any of the machines.

It seems I'll need to reset it in the BIOS manually.

That said, the issue is still valid, as I have been able to reproduce it in a working environment.

Andres Rodriguez (andreserl) wrote :

I tried releasing 5 machines for which power management wasn't working (i.e. MAAS just couldn't contact the BMCs). 1 machine failed to release immediately, while the other 4 got stuck in 'Releasing':

==> /var/log/maas/maas.log <==
Aug 28 20:35:42 maas00 maas.node: [info] node01: Releasing node
Aug 28 20:35:42 maas00 maas.node: [info] node03: Releasing node
Aug 28 20:35:42 maas00 maas.node: [info] node01: Status transition from DEPLOYED to RELEASING
Aug 28 20:35:42 maas00 maas.node: [info] node03: Status transition from DEPLOYED to RELEASING
Aug 28 20:35:42 maas00 maas.node: [info] node02: Releasing node
Aug 28 20:35:42 maas00 maas.node: [info] node04: Releasing node
Aug 28 20:35:42 maas00 maas.node: [info] node02: Status transition from DEPLOYED to RELEASING
Aug 28 20:35:42 maas00 maas.node: [info] node04: Status transition from FAILED_RELEASING to RELEASING
Aug 28 20:35:43 maas00 maas.power: [info] Changing power state (off) of node: node03 (ncyhd6)
Aug 28 20:35:43 maas00 maas.power: [info] Changing power state (off) of node: node01 (8r76bg)
Aug 28 20:35:43 maas00 maas.power: [info] Changing power state (off) of node: node04 (mnwpg8)
Aug 28 20:35:43 maas00 maas.power: [info] Changing power state (off) of node: node02 (6y8g6m)
Aug 28 20:35:43 maas00 maas.node: [info] node05: Releasing node
Aug 28 20:35:43 maas00 maas.node: [info] node05: Status transition from FAILED_DEPLOYMENT to RELEASING
Aug 28 20:35:43 maas00 maas.power: [info] Changing power state (off) of node: node05 (d4wqsc)

==> /var/log/maas/regiond.log <==
2018-08-28 20:35:45 regiond: [info] 10.90.90.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)

==> /var/log/maas/maas.log <==
Aug 28 20:35:54 maas00 maas.power: [error] node01: Power state could not be queried: Could not connect to BMC. Check BMC configuration and try again.
Aug 28 20:35:54 maas00 maas.power: [error] node01: Could not query power state: Could not connect to BMC. Check BMC configuration and try again..

==> /var/log/maas/regiond.log <==
2018-08-28 20:36:15 regiond: [info] 10.90.90.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)

==> /var/log/maas/maas.log <==
Aug 28 20:36:36 maas00 maas.power: [error] Error changing power state (off) of node: node01 (8r76bg)
Aug 28 20:36:36 maas00 maas.node: [info] node01: Status transition from RELEASING to FAILED_RELEASING
Aug 28 20:36:36 maas00 maas.node: [error] node01: Marking node failed: Power off for the node failed: Could not contact node's BMC: Could not connect to BMC. Check BMC configuration and try again.

==> /var/log/maas/rackd.log <==
2018-08-28 20:36:36 provisioningserver.rpc.power: [critical] node01: Power off failed.
        Traceback (most recent call last):
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 459, in callback
            self._startRunCallbacks(result)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 567, in _startRunCallbacks
            self._runCallbacks()
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 653, in ...


Andres Rodriguez (andreserl) wrote :

After a while, the other machines were marked as failed releasing because they timed out. That suggests the power management failure/traceback may be blocking other operations from proceeding:

==> /var/log/maas/maas.log <==
Aug 28 20:40:43 maas00 maas.power: [error] Error changing power state (off) of node: node03 (ncyhd6)
Aug 28 20:40:43 maas00 maas.node: [info] node03: Status transition from RELEASING to FAILED_RELEASING
Aug 28 20:40:43 maas00 maas.node: [error] node03: Marking node failed: Timed out

==> /var/log/maas/rackd.log <==
2018-08-28 20:40:43 provisioningserver.rpc.power: [info] node04: Power could not be set to off; timed out.

==> /var/log/maas/maas.log <==
Aug 28 20:40:43 maas00 maas.power: [error] Error changing power state (off) of node: node04 (mnwpg8)

==> /var/log/maas/rackd.log <==
2018-08-28 20:40:43 provisioningserver.rpc.power: [info] node02: Power could not be set to off; timed out.

==> /var/log/maas/maas.log <==
Aug 28 20:40:43 maas00 maas.power: [error] Error changing power state (off) of node: node02 (6y8g6m)
Aug 28 20:40:43 maas00 maas.node: [info] node02: Status transition from RELEASING to FAILED_RELEASING
Aug 28 20:40:43 maas00 maas.node: [info] node04: Status transition from RELEASING to FAILED_RELEASING
Aug 28 20:40:43 maas00 maas.node: [error] node02: Marking node failed: Timed out
Aug 28 20:40:43 maas00 maas.node: [error] node04: Marking node failed: Timed out

==> /var/log/maas/rackd.log <==
2018-08-28 20:40:43 provisioningserver.rpc.power: [info] node05: Power could not be set to off; timed out.

==> /var/log/maas/maas.log <==
Aug 28 20:40:43 maas00 maas.power: [error] Error changing power state (off) of node: node05 (d4wqsc)
Aug 28 20:40:43 maas00 maas.node: [info] node05: Status transition from RELEASING to FAILED_RELEASING
Aug 28 20:40:43 maas00 maas.node: [error] node05: Marking node failed: Timed out
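
For anyone trying to reproduce this, the following is a small sketch of how the "Status transition" lines in maas.log can be reduced to each node's latest status, to spot machines still sitting in RELEASING. The regular expression is based only on the log lines quoted in this report; `last_status` and `stuck_nodes` are hypothetical helper names.

```python
import re

# Matches lines such as:
#   "... maas.node: [info] node01: Status transition from DEPLOYED to RELEASING"
TRANSITION = re.compile(
    r"maas\.node: \[info\] (?P<node>\S+): "
    r"Status transition from (?P<old>\w+) to (?P<new>\w+)"
)


def last_status(lines):
    """Return {node: most recent status} from an iterable of maas.log lines."""
    status = {}
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            # Later lines overwrite earlier ones, so the last transition wins.
            status[match["node"]] = match["new"]
    return status


def stuck_nodes(status, state="RELEASING"):
    """Return the sorted names of nodes whose latest status is `state`."""
    return sorted(node for node, current in status.items() if current == state)
```

Running `stuck_nodes(last_status(open("/var/log/maas/maas.log")))` shortly after a release attempt would list the nodes that have not yet moved on to FAILED_RELEASING or another state.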

tags: added: rack-proxy
Changed in maas:
milestone: 2.5.0beta1 → 2.5.0beta2
Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
tags: added: sprint
Changed in maas:
importance: Critical → High
status: Triaged → Incomplete
importance: High → Critical
assignee: Blake Rouse (blake-rouse) → nobody
Changed in maas:
milestone: 2.5.0beta2 → 2.5.0rc1
Andres Rodriguez (andreserl) wrote :

I think this is the same issue as this: https://bugs.launchpad.net/maas/+bug/1771777

Changed in maas:
milestone: 2.5.0rc1 → 2.5.x
Changed in maas:
milestone: 2.5.x → 2.5.0
Changed in maas:
milestone: 2.5.0 → 2.5.x