Failed power off/status for multiple nodes within a SM15K chassis
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
MAAS | Fix Released | Critical | Unassigned | |
1.8 | Fix Released | Undecided | Unassigned | |
Bug Description
I have an environment with 192 nodes. The majority (180+) were powered on and deployed. All of these nodes are part of 3 SM15k chassis. Each chassis has its own power control. When releasing all of the nodes at once (juju destroy-
Jun 30 03:59:29 maas maas.power: [ERROR] Error changing power state (off) of node: kindhearted-sofa (node-6fe1195a-
After an additional 5 minutes, I marked all 'RELEASING' nodes as 'BROKEN':
Jun 30 04:06:38 maas maas.node: [INFO] kindhearted-sofa: Status transition from RELEASING to BROKEN
I hoped I'd be able to initiate power off again. And I was able to do that:
Jun 30 04:07:16 maas maas.power: [INFO] Changing power state (off) of node: kindhearted-sofa (node-6fe1195a-
But again, 5 minutes later, querying power status still returns errors:
Jun 30 04:12:16 maas maas.power: [ERROR] Error changing power state (off) of node: kindhearted-sofa (node-6fe1195a-
After an additional 5-10 minutes, I decided to reboot maas (for another reason), and now all broken nodes are reported as powered off:
Jun 30 04:20:42 maas maas.power: [INFO] Changing power state (off) of node: kindhearted-sofa (node-6fe1195a-
Unfortunately, I haven't looked at the actual status of the node, but if this happens again (and it happens every time), I'll make sure to check the actual power state of these nodes.
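To follow a sequence like the one above across 180+ nodes, it can help to group the `maas.power` log lines by node. The following is a minimal sketch, assuming the syslog line layout quoted in this report; the function names `power_events_by_node` and `nodes_with_errors` are hypothetical helpers and not part of MAAS.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the lines quoted in this report:
# "<Mon> <day> <HH:MM:SS> <host> maas.<logger>: [<LEVEL>] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"(?P<logger>maas\.\w+):\s+\[(?P<level>\w+)\]\s+(?P<msg>.*)$"
)
# The hostname follows "of node: " in the power messages quoted above.
NODE_RE = re.compile(r"of node: (?P<node>\S+)")


def power_events_by_node(lines):
    """Group maas.power events by node hostname, preserving log order."""
    events = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        # Skip unparseable lines and other loggers such as maas.node.
        if not match or match.group("logger") != "maas.power":
            continue
        node = NODE_RE.search(match.group("msg"))
        if node:
            events[node.group("node")].append(
                (match.group("ts"), match.group("level"))
            )
    return dict(events)


def nodes_with_errors(events):
    """Return the sorted hostnames that logged at least one [ERROR] event."""
    return sorted(
        name
        for name, node_events in events.items()
        if any(level == "ERROR" for _, level in node_events)
    )
```

Feeding the lines of the MAAS log through `power_events_by_node` and then `nodes_with_errors` gives a list of hostnames whose actual power state is worth checking by hand on the chassis.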
Related branches
- Gavin Panella (community): Approve
- Diff: 193 lines (+44/-10), 6 files modified
  src/maasserver/api/nodes.py (+2/-3)
  src/maasserver/plugin.py (+14/-0)
  src/maasserver/tests/test_plugin.py (+14/-2)
  src/maasserver/websockets/handlers/node.py (+3/-4)
  src/provisioningserver/drivers/power/__init__.py (+9/-0)
  src/provisioningserver/rpc/power.py (+2/-1)
- Raphaël Badin (community): Approve
- Diff: 193 lines (+44/-10), 6 files modified
  src/maasserver/api/nodes.py (+2/-3)
  src/maasserver/plugin.py (+14/-0)
  src/maasserver/tests/test_plugin.py (+14/-2)
  src/maasserver/websockets/handlers/node.py (+3/-4)
  src/provisioningserver/drivers/power/__init__.py (+9/-0)
  src/provisioningserver/rpc/power.py (+2/-1)
summary: |
- MAAS 1.8 - failed power off/status for majority of nodes
+ Failed power off/status for multiple nodes within a SM15K chassis |
Changed in maas: | |
status: | New → Triaged |
Changed in maas: | |
status: | Triaged → Fix Committed |
Changed in maas: | |
status: | Fix Committed → Fix Released |
Can you please also attach clusterd.log/regiond.log?