After commissioning, nodes are ready but power is red
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Invalid | High | Unassigned |
Bug Description
I enlisted 14 nodes, by powering them up manually and watching them appear in maas. I selected all and chose enlist.
Maas took over and enlisted them, and they were powered down at the end. So far, so good.
I then selected all of them and chose "commission".
Maas powered them up and ran the commissioning scripts. In the end, the machines were "Ready", but several had "error" in the power state. See attached screenshot.
The logs also suddenly filled up with messages like:
==> maas.log <==
Oct 7 21:21:21 atlas maas.node: [INFO] shawmut: Stopping monitor: node-8a60b50c-
Oct 7 21:21:22 atlas maas.node: [INFO] elkhart: Stopping monitor: node-6b2d304c-
Oct 7 21:21:22 atlas maas.node: [INFO] hendel: Stopping monitor: node-79513ec0-
Oct 7 21:21:22 atlas maas.node: [INFO] clipper: Stopping monitor: node-62c7fdce-
Oct 7 21:21:22 atlas maas.node: [INFO] amco: Stopping monitor: node-6a3e5332-
Oct 7 21:21:23 atlas maas.node: [INFO] albany: Stopping monitor: node-6119b332-
Oct 7 21:21:26 atlas maas.power: [ERROR] Node could not be queried node-817370d2-
Oct 7 21:21:26 atlas maas.power: [ERROR] sekine: Failed to query power state: ipmi failed with return code 2:#012Invalid password.
Oct 7 21:21:39 atlas maas.power: [ERROR] Node could not be queried node-622d7876-
Oct 7 21:21:39 atlas maas.power: [ERROR] correja: Failed to query power state: ipmi failed with return code 2:#012Invalid password.
That's very unsettling, because these machines were *just* commissioned by maas itself, using the ipmi credentials that maas created.
I then clicked on one of the problematic nodes and hit the "check power state" button. To my surprise, it worked and said the machine was off, as it should be. Going back to the node list, the power status was correct and the error condition was reset.
Turns out after a while, all of them were good again.
This bug is about that ERROR state that happened right after commissioning. I can understand the power status in the web ui being a bit delayed, but the password was never incorrect, so I don't understand where the ERROR state came from.
Changed in maas:
milestone: none → next
Changed in maas:
milestone: next → 1.7.1
Changed in maas:
milestone: 1.7.1 → 1.7.2
Changed in maas:
milestone: 1.7.2 → 1.7.3
Changed in maas:
milestone: 1.7.3 → 2.0.0
importance: Medium → Critical
tags: added: sts
tags: added: internal
Changed in maas:
status: Incomplete → Invalid
status: Invalid → Triaged
The error state comes from the fact that commissioning changes the IPMI password on the BMC. Power monitoring works by the cluster asking the region for all of its nodes and their power information; it then goes through the returned list, querying and updating each node one by one. If commissioning changes a node's password after the cluster has fetched that list, the list holds a stale (now invalid) IPMI password for that node, and the power query fails with "Invalid password." On the next request from the cluster to the region for the nodes and power information, it gets the new password and the query succeeds.
This would be a hard fix. I think the best thing to do would be to update the BMC password only on enlistment and not on commissioning, but that has been under debate for a while.
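To make the race easier to see, here is a minimal simulation of it. This is only a sketch of the behaviour described in the comment, not MAAS code: the `Region`, `BMC`, `commission` and `monitor_cycle` names are all hypothetical, and the passwords are made up. It assumes the cluster works from a snapshot of credentials taken at the start of a monitoring pass.

```python
# Hypothetical simulation of the stale-password race; none of these
# names are real MAAS APIs.


class Region:
    """Holds the authoritative power credentials for every node."""

    def __init__(self, passwords):
        self.passwords = dict(passwords)

    def list_power_info(self):
        # The cluster receives a snapshot, not a live view.
        return list(self.passwords.items())


class BMC:
    """Stands in for a node's BMC, which checks the IPMI password."""

    def __init__(self, password):
        self.password = password

    def query_power(self, password):
        if password != self.password:
            raise PermissionError("Invalid password.")
        return "off"


def monitor_cycle(snapshot, bmcs):
    """Query each node in turn using the credentials in the snapshot."""
    states = {}
    for name, password in snapshot:
        try:
            states[name] = bmcs[name].query_power(password)
        except PermissionError:
            states[name] = "error"
    return states


region = Region({"sekine": "old-1", "correja": "old-2"})
bmcs = {name: BMC(pw) for name, pw in region.passwords.items()}


def commission(name, new_password):
    # Commissioning sets a new password on the BMC and records it in the region.
    bmcs[name].password = new_password
    region.passwords[name] = new_password


snapshot = region.list_power_info()   # cluster fetches its work list
commission("correja", "new-2")        # commissioning finishes afterwards
first = monitor_cycle(snapshot, bmcs)                   # uses stale password
second = monitor_cycle(region.list_power_info(), bmcs)  # fresh list

print(first)   # {'sekine': 'off', 'correja': 'error'}
print(second)  # {'sekine': 'off', 'correja': 'off'}
```

The first pass reports an error only for the node whose password changed after the snapshot was taken, and the next pass clears it, which matches the "all of them were good again after a while" observation in the bug description.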