Failed power off/status for multiple nodes within a SM15K chassis

Bug #1470013 reported by Ante Karamatić on 2015-06-30
This bug affects 2 people
Affects: MAAS | Importance: Critical | Assigned to: Unassigned
Affects: 1.8 | Importance: Undecided | Assigned to: Unassigned

Bug Description

I have an environment with 192 nodes. The majority (180+) were powered on and deployed. All these nodes are part of 3 SM15K chassis. Each chassis has its own power control. When releasing all of the nodes at once (juju destroy-environment), MAAS powers off a number of them, but the majority stay in the 'RELEASING' state. 5 minutes after the release was initiated, MAAS reports power state change errors for all nodes that are still RELEASING. Example:

Jun 30 03:59:29 maas maas.power: [ERROR] Error changing power state (off) of node: kindhearted-sofa (node-6fe1195a-1751-11e5-9b0f-002299369007)

After an additional 5 minutes, I marked all 'RELEASING' nodes as 'BROKEN':

Jun 30 04:06:38 maas maas.node: [INFO] kindhearted-sofa: Status transition from RELEASING to BROKEN

I hoped I'd be able to initiate power off again. And I was able to do that:

Jun 30 04:07:16 maas maas.power: [INFO] Changing power state (off) of node: kindhearted-sofa (node-6fe1195a-1751-11e5-9b0f-002299369007)

But again, 5 minutes later, querying power status still returns errors:

Jun 30 04:12:16 maas maas.power: [ERROR] Error changing power state (off) of node: kindhearted-sofa (node-6fe1195a-1751-11e5-9b0f-002299369007)

After an additional 5-10 minutes, I decided to reboot MAAS (for another reason), and now all broken nodes are reported as powered off:

Jun 30 04:20:42 maas maas.power: [INFO] Changing power state (off) of node: kindhearted-sofa (node-6fe1195a-1751-11e5-9b0f-002299369007)

Unfortunately, I haven't looked at the actual status of the nodes, but if this happens again (and it happens every time), I'll make sure to check the actual power state of these nodes.

Ante Karamatić (ivoks) wrote :
Raphaël Badin (rvb) wrote :

Can you please also attach clusterd.log/regiond.log?

Ante Karamatić (ivoks) wrote :

Sure...

Um, regiond is 70MB, so log rotation should have been enforced, shouldn't it?

/var/log/maas/regiond.log {
 rotate 5
 weekly
 compress
 missingok
 # copytruncate may lose log messages at the moment of rotation, but
 # there is no better way to integrate twistd and logrotate.
 copytruncate
 # The logs are all owned by the `maas` user, so drop privs.
 su maas maas
 # Don't rotate unless the log is at least 10MB.
 minsize 10M
 # Force rotation if the log grows beyond 50MB.
 maxsize 50M
}

Ante Karamatić (ivoks) wrote :

I can confirm that nodes are indeed powered off, but MAAS doesn't update the status. Looking at TCP traffic, I can see that MAAS does talk to the power management.

Raphaël Badin (rvb) wrote :

> Um, regiond is 70MB, so log rotation should have been enforced, shouldn't it?

I'm going to ask Andres to confirm, but I think this depends on how often logrotate is configured to run (AFAIK, it's *daily* by default). Logrotate runs periodically; it does not constantly monitor the log file.

Raphaël Badin (rvb) wrote :

The power actions time out after 5 minutes.

MAAS does not throttle or serialize power actions when powering off a large number of nodes that share the same chassis. In other words, MAAS will literally hammer the chassis when performing a power action on a vast number of nodes.

Reading the code in src/provisioningserver/drivers/hardware/seamicro.py I see:
            # Chance that multiple login's are at once, the api
            # only supports one at a time. So lets try again after
            # a second, up to max retry count.

This makes me think that we end up in a deadlock because we have many sessions in parallel.

Changed in maas:
importance: Undecided → Critical
Raphaël Badin (rvb) wrote :

If this is indeed the problem, we could serialize the power actions at the driver level.

Raphaël Badin (rvb) wrote :

Here is a patch that will serialize (using a file lock) access to the API to verify the hypothesis explained above: http://paste.ubuntu.com/11798657/.
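
The idea of serializing API access with a file lock can be sketched roughly as follows. This is not the pasted patch, only a minimal illustration assuming a POSIX system; the lock path, `chassis_api_lock`, `power_off` and `send_command` are all hypothetical names, not part of the MAAS driver.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock path; the real patch may use a different location.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "seamicro-api.lock")


@contextmanager
def chassis_api_lock(path=LOCK_PATH):
    """Hold an exclusive advisory lock for the duration of the block."""
    with open(path, "w") as lock_file:
        # Blocks until no other open file description holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def power_off(node_id, send_command):
    """Run one power action while no other caller talks to the chassis.

    `send_command` is a hypothetical stand-in for the driver's API call.
    """
    with chassis_api_lock():
        return send_command(node_id, "off")
```

Because the lock lives in the filesystem, it would also serialize callers running in separate processes on the same host, at the cost of making every power action on that host wait its turn.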

Raphaël Badin (rvb) on 2015-06-30
summary: - MAAS 1.8 - failed power off/status for majority of nodes
+ Failed power off/status for multiple nodes within a SM15K chassis
Raphaël Badin (rvb) on 2015-07-06
Changed in maas:
status: New → Triaged
Raphaël Badin (rvb) wrote :

After some debugging I found a couple of problems:

- MAAS uses various timeouts to limit the time it takes to check the power state of nodes. The seamicro chassis sometimes takes as much as 35 seconds to reply to a power query and MAAS has a 15s timeout when the check is performed using the UI (using the "Check now" link). We should make the timeout bigger.

- The main problem when releasing a large number of nodes is that power.py:power_state_update ends up in a deadlock (thread starvation). After releasing 64 nodes I could see that the 10 threads allowed per region were busy (stuck in epoll_wait) waiting for something, and the queue kept growing. Increasing the number of threads per region (to 20) solved the problem instantly. Conclusion: there is a deadlock in the code.
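
The thread starvation described above can be reproduced in miniature. The sketch below is not MAAS code: `release_nodes`, `release` and `deallocate_ip` are hypothetical names, and the two barriers exist only to make the demonstration deterministic. Each outer task occupies a worker and then waits for an inner task submitted to the same bounded pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def release_nodes(pool_size, node_count, timeout=0.5):
    """Return how many release tasks got their inner result in time."""
    if pool_size < node_count:
        raise ValueError("this demo needs pool_size >= node_count")
    # Demo-only barriers: every outer task is running before any inner
    # task is submitted, and no worker is freed until all have an outcome.
    all_started = threading.Barrier(node_count)
    all_decided = threading.Barrier(node_count)

    def deallocate_ip(node):
        return node  # stand-in for the inner unit of work

    def release(node):
        all_started.wait()
        inner = pool.submit(deallocate_ip, node)  # needs another worker
        try:
            inner.result(timeout=timeout)
            finished = True
        except TimeoutError:
            finished = False  # starved: no free worker ran the inner task
        all_decided.wait()
        return finished

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return sum(pool.map(release, range(node_count)))
```

Under these assumptions, `release_nodes(10, 10)` should report that no task finished, because all ten workers are blocked waiting for work that only a free worker could run, while `release_nodes(20, 10)` should report all ten finished. That mirrors the observation that going from 10 to 20 threads made the problem disappear without removing the underlying dependency.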

Raphaël Badin (rvb) wrote :

Some more debugging and it's now clear that it's the `deallocate_static_ip_addresses` (called as part of releasing) that's stuck, waiting for a thread.

Raphaël Badin (rvb) wrote :

The two branches that I landed improved the situation a bit but this bug is far from being completely fixed. We can't have an unbounded number of threads because it will make MAAS hit the maximum number of DB connections. Conversely, a very limited number of threads leads to deadlocks as explained above.
Two (non-exclusive) options here:
- go through each asynchronous task that MAAS runs and make sure no task being run in a thread requires another thread to complete (i.e. remove all the potential deadlocks).
- implement an intelligent thread pool which "detects" deadlock situations and allocates more threads when it has to (without going above a certain limit).
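
As a rough illustration of the second option, here is a minimal sketch of a pool that grows when a wait looks stalled, up to a hard limit. It is an assumption-laden toy, not a proposal for the actual MAAS implementation: `GrowingPool`, `stall_seconds` and the stall heuristic (a waiter that has seen no result for a while adds one worker) are all hypothetical.

```python
import queue
import threading


class GrowingPool:
    """Thread pool that adds workers, up to `hard_limit`, when waits stall."""

    def __init__(self, base_size, hard_limit, stall_seconds=0.2):
        self.hard_limit = hard_limit
        self.stall_seconds = stall_seconds
        self.tasks = queue.Queue()
        self.workers = []
        self.lock = threading.Lock()
        for _ in range(base_size):
            self._add_worker()

    def _add_worker(self):
        worker = threading.Thread(target=self._run, daemon=True)
        self.workers.append(worker)
        worker.start()

    def _run(self):
        while True:
            func, args, handle = self.tasks.get()
            try:
                handle["result"] = func(*args)
            except Exception as exc:  # keep the worker alive, report later
                handle["error"] = exc
            finally:
                handle["event"].set()

    def submit(self, func, *args):
        handle = {"event": threading.Event(), "result": None, "error": None}
        self.tasks.put((func, args, handle))
        return handle

    def wait(self, handle):
        """Block for a result; add a worker each time the wait stalls."""
        while not handle["event"].wait(self.stall_seconds):
            with self.lock:
                if len(self.workers) < self.hard_limit:
                    self._add_worker()
        if handle["error"] is not None:
            raise handle["error"]
        return handle["result"]
```

The heuristic cannot tell a deadlock from a task that is merely slow, so it may add workers that were not strictly needed, and once `hard_limit` is reached a genuine circular wait still blocks. The limit is what keeps the pool from exhausting DB connections, which is the constraint mentioned above.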

Ante Karamatić (ivoks) wrote :

Here's another datapoint...

If, instead of releasing all nodes, I mark all nodes as broken, they all power off just fine.

Changed in maas:
status: Triaged → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released