After commissioning, nodes are ready but power is red

Bug #1378536 reported by Andreas Hasenack
This bug affects 7 people
Affects    Status     Importance    Assigned to    Milestone
MAAS       Invalid    High          Unassigned

Bug Description

I enlisted 14 nodes by powering them up manually and watching them appear in MAAS. I then selected all of them and chose "enlist".

MAAS took over and enlisted them, and they were powered down at the end. So far, so good.

I then selected all of them and chose "commission".

MAAS powered them up and ran the commissioning scripts. In the end, the machines were "Ready", but several had "error" in the power state. See attached screenshot.

The logs were also suddenly full of messages like:

==> maas.log <==
Oct 7 21:21:21 atlas maas.node: [INFO] shawmut: Stopping monitor: node-8a60b50c-4e65-11e4-91e2-2c59e54ace74
Oct 7 21:21:22 atlas maas.node: [INFO] elkhart: Stopping monitor: node-6b2d304c-4e66-11e4-91e2-2c59e54ace74
Oct 7 21:21:22 atlas maas.node: [INFO] hendel: Stopping monitor: node-79513ec0-4e66-11e4-b73a-2c59e54ace74
Oct 7 21:21:22 atlas maas.node: [INFO] clipper: Stopping monitor: node-62c7fdce-4e66-11e4-b73a-2c59e54ace74
Oct 7 21:21:22 atlas maas.node: [INFO] amco: Stopping monitor: node-6a3e5332-4e66-11e4-b73a-2c59e54ace74
Oct 7 21:21:23 atlas maas.node: [INFO] albany: Stopping monitor: node-6119b332-4e66-11e4-b73a-2c59e54ace74
Oct 7 21:21:26 atlas maas.power: [ERROR] Node could not be queried node-817370d2-4e66-11e4-b73a-2c59e54ace74 (sekine) ipmi failed with return code 2:#012Invalid password
Oct 7 21:21:26 atlas maas.power: [ERROR] sekine: Failed to query power state: ipmi failed with return code 2:#012Invalid password.
Oct 7 21:21:39 atlas maas.power: [ERROR] Node could not be queried node-622d7876-4e66-11e4-91e2-2c59e54ace74 (correja) ipmi failed with return code 2:#012Invalid password
Oct 7 21:21:39 atlas maas.power: [ERROR] correja: Failed to query power state: ipmi failed with return code 2:#012Invalid password.
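
To see at a glance which nodes hit this, the failed power queries can be pulled out of maas.log with a small script. This is only a sketch that assumes the line format shown above; the function name and the regular expression are illustrative and not part of MAAS.

```python
import re

# Matches lines such as:
#   "... maas.power: [ERROR] sekine: Failed to query power state: ipmi failed ..."
# The pattern is an assumption based on the log excerpt above.
FAILED_QUERY = re.compile(
    r"maas\.power: \[ERROR\] (?P<host>[^:\s]+): "
    r"Failed to query power state: (?P<reason>.*)"
)


def failed_power_queries(lines):
    """Return {hostname: [reason, ...]} for every failed power query."""
    failures = {}
    for line in lines:
        match = FAILED_QUERY.search(line)
        if match:
            failures.setdefault(match["host"], []).append(match["reason"].strip())
    return failures


# Sample lines copied from the excerpt above.
sample = [
    "Oct 7 21:21:23 atlas maas.node: [INFO] albany: Stopping monitor: "
    "node-6119b332-4e66-11e4-b73a-2c59e54ace74",
    "Oct 7 21:21:26 atlas maas.power: [ERROR] sekine: Failed to query power "
    "state: ipmi failed with return code 2:#012Invalid password.",
    "Oct 7 21:21:39 atlas maas.power: [ERROR] correja: Failed to query power "
    "state: ipmi failed with return code 2:#012Invalid password.",
]

failures = failed_power_queries(sample)
for host, reasons in sorted(failures.items()):
    print(host, len(reasons), reasons[0])
```

Against a real log, the same function can be fed an open file object, for example `failed_power_queries(open("maas.log"))`, with the path adjusted to wherever the log lives.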

That's very unsettling, because these machines were *just* commissioned by MAAS itself, using the IPMI credentials that MAAS created.

I then clicked on one of the problematic nodes and hit the "check power state" button. To my surprise, it worked and said the machine was off, as it should be. Going back to the node list, the power status was correct and the error condition was reset.

It turns out that, after a while, all of them were fine again.

This bug is about that ERROR state that happened right after commissioning. I can understand the power status in the web ui being a bit delayed, but the password was never incorrect, so I don't understand where the ERROR state came from.

Tags: sts internal
Andreas Hasenack (ahasenack) wrote :
Blake Rouse (blake-rouse) wrote :

The error state comes from the fact that commissioning changes the IPMI password on the BMC. Power monitoring works by the cluster asking the region for all of its nodes and their power information; it then goes through the returned list, querying and updating the nodes one by one. Once commissioning has changed the password, that list contains an invalid IPMI power password for the node, so the query fails. On the next request from the cluster to the region for the nodes and their power information, the cluster gets the new password and the query succeeds.
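
The timing described above can be illustrated with a toy model. This is a rough sketch only: the `BMC` and `Region` classes and the `commission` and `monitor_pass` functions are hypothetical names made up for the illustration and do not correspond to MAAS code.

```python
# Hypothetical model of the stale-password race; none of these names
# come from MAAS itself.

class BMC:
    """A BMC that answers power queries only for the current password."""

    def __init__(self, password):
        self.password = password

    def query_power(self, password):
        if password != self.password:
            raise PermissionError("Invalid password")
        return "off"


class Region:
    """Holds the authoritative power parameters for each node."""

    def __init__(self):
        self.power_params = {}

    def list_power_params(self):
        # The cluster receives a snapshot (a copy), not a live view.
        return dict(self.power_params)


def commission(region, bmcs, name, new_password):
    """Commissioning sets a new BMC password and records it in the region."""
    bmcs[name].password = new_password
    region.power_params[name] = new_password


def monitor_pass(snapshot, bmcs):
    """Query every node using the credentials from the given snapshot."""
    states = {}
    for name, password in snapshot.items():
        try:
            states[name] = bmcs[name].query_power(password)
        except PermissionError:
            states[name] = "error"
    return states


bmcs = {"sekine": BMC("old-secret")}
region = Region()
region.power_params["sekine"] = "old-secret"

snapshot = region.list_power_params()             # cluster fetches the list
commission(region, bmcs, "sekine", "new-secret")  # commissioning finishes meanwhile
first = monitor_pass(snapshot, bmcs)              # uses the stale password
second = monitor_pass(region.list_power_params(), bmcs)  # fresh list

print(first)   # {'sekine': 'error'}
print(second)  # {'sekine': 'off'}
```

In this model the error is transient by construction: nothing is wrong with the stored credentials, only with the copy the monitoring pass was already holding.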

This would be a hard fix. I think the best thing to do would be to update the BMC password only on enlistment and not on commissioning, but that has been under debate for a while.

Changed in maas:
status: New → Triaged
importance: Undecided → Medium
Christian Reis (kiko)
Changed in maas:
milestone: none → next
Christian Reis (kiko)
Changed in maas:
milestone: next → 1.7.1
Changed in maas:
milestone: 1.7.1 → 1.7.2
Changed in maas:
milestone: 1.7.2 → 1.7.3
Changed in maas:
milestone: 1.7.3 → 2.0.0
importance: Medium → Critical
tags: added: sts
Andres Rodriguez (andreserl) wrote :

Hi Ed,

If you are being affected by this issue in 2.0, can you provide logs?

1. cloud-init logs from the commissioning environment itself
2. /var/log/maas/rsyslog/<machine-name>/.../messages.

Thanks.

Changed in maas:
status: Triaged → Incomplete
Sean Wheller (seanwhe) wrote :

Hi Ed,

I found the same thing with HP iLO [IPMI 2], but then I realized that the username and password that MAAS fills in automatically during commissioning are not the username and password I supplied in the iLO setup.

Commissioning is a kind of smoke test, and I don't think it is able to obtain the iLO username and password. It therefore completes the smoke test with the default user and a random password.
I've done this a number of times and the result is the same.

After I changed the values in MAAS to the ones I had supplied in iLO, it worked. Of course, if I commission the host again, the correct iLO values must be supplied again, together with the power MAC.

Fortunately you only need to run the commission once.

tags: added: internal
Changed in maas:
status: Incomplete → Invalid
status: Invalid → Triaged
Andres Rodriguez (andreserl) wrote :

Hi!

We believe this issue has now been fixed in the latest version of MAAS. As such, we are marking this bug report as Invalid. If you believe this issue is still present, please re-open the bug report.

Changed in maas:
status: Triaged → Invalid
importance: Critical → High