MAAS

Failed power off/status for multiple nodes within a SM15K chassis


Bug #1470013 reported by Ante Karamatić on 2015-06-30

This bug affects 2 people

Affects   Status         Importance   Assigned to   Milestone
MAAS      Fix Released   Critical     Unassigned
1.8       Fix Released   Undecided    Unassigned

Bug Description

I have an environment with 192 nodes. The majority (180+) were powered on and deployed. All these nodes are part of 3 SM15k chassis. Each chassis has its own power control. When releasing all of the nodes at once (juju destroy-environment), MAAS powers off a number of them, but the majority stay in the 'RELEASING' state. 5 minutes after the release was initiated, MAAS reports power state change errors for all (still) RELEASING nodes. Example:

Jun 30 03:59:29 maas maas.power: [ERROR] Error changing power state (off) of node: kindhearted-sofa (node-6fe1195a-1751-11e5-9b0f-002299369007)

After an additional 5 minutes, I marked all 'RELEASING' nodes as 'BROKEN':

Jun 30 04:06:38 maas maas.node: [INFO] kindhearted-sofa: Status transition from RELEASING to BROKEN

I hoped I'd be able to initiate power off again. And I was able to do that:

Jun 30 04:07:16 maas maas.power: [INFO] Changing power state (off) of node: kindhearted-sofa (node-6fe1195a-1751-11e5-9b0f-002299369007)

But again, 5 minutes later, querying power status still returns errors:

Jun 30 04:12:16 maas maas.power: [ERROR] Error changing power state (off) of node: kindhearted-sofa (node-6fe1195a-1751-11e5-9b0f-002299369007)

After an additional 5-10 minutes, I decided to reboot maas (for another reason), and now all broken nodes are reported as powered off:

Jun 30 04:20:42 maas maas.power: [INFO] Changing power state (off) of node: kindhearted-sofa (node-6fe1195a-1751-11e5-9b0f-002299369007)

Unfortunately, I haven't looked at the actual status of the node, but if this happens again (and it happens every time), I'll make sure to check the actual power state of these nodes.


Related branches

lp:~rvb/maas/tweak-timeouts

Merged into lp:~maas-committers/maas/trunk at revision 4072

Gavin Panella (community): Approve on 2015-07-07

lp:~rvb/maas/pool-size-1.8

Merged into lp:maas/1.8 at revision 4021

Raphaël Badin (community): Approve on 2015-07-10


Ante Karamatić (ivoks) wrote on 2015-06-30:

maas.log (143.4 KiB, text/plain)


Raphaël Badin (rvb) wrote on 2015-06-30:

Can you please also attach clusterd.log/regiond.log?


Ante Karamatić (ivoks) wrote on 2015-06-30:

logs.tar.bz2 (2.0 MiB, application/x-tar)

Sure...

Um, regiond is 70MB, so log rotation should have been enforced, shouldn't it?

/var/log/maas/regiond.log {
    rotate 5
    weekly
    compress
    missingok
    # copytruncate may lose log messages at the moment of rotation, but
    # there is no better way to integrate twistd and logrotate.
    copytruncate
    # The logs are all owned by the `maas` user, so drop privs.
    su maas maas
    # Don't rotate unless the log is at least 10MB.
    minsize 10M
    # Force rotation if the log grows beyond 50MB.
    maxsize 50M
}


Ante Karamatić (ivoks) wrote on 2015-06-30:

I can confirm that nodes are indeed powered off, but MAAS doesn't update the status. Looking at TCP traffic, I can see that MAAS does talk to the power management.


Raphaël Badin (rvb) wrote on 2015-06-30:

> Um, regiond is 70MB, so log rotation should have been enforced, shouldn't it?

I'm going to ask Andres to confirm, but I think this depends on how often logrotate is configured to run (AFAIK, it's *daily* by default). Logrotate only runs periodically; it does not constantly monitor the log file, so the size limits are checked only at each run.


Raphaël Badin (rvb) wrote on 2015-06-30:

The power actions time out after 5 minutes.

MAAS does not throttle or serialize power actions when powering off a large number of nodes on the same chassis. In other words, MAAS will literally hammer the chassis when performing a power action on a vast number of nodes.

Reading the code in src/provisioningserver/drivers/hardware/seamicro.py I see:
            # Chance that multiple login's are at once, the api
            # only supports one at a time. So lets try again after
            # a second, up to max retry count.

This makes me think that we end up in a deadlock because we have many sessions in parallel.

Changed in maas:
importance:	Undecided → Critical


Raphaël Badin (rvb) wrote on 2015-06-30:

If this is indeed the problem, we could serialize the power actions at the driver level.


Raphaël Badin (rvb) wrote on 2015-06-30:

Here is a patch that will serialize (using a file lock) access to the API to verify the hypothesis explained above: http://paste.ubuntu.com/11798657/.
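
For readers without access to the paste, here is a minimal sketch of what serializing access with a file lock could look like. It is not the patch itself: the lock path, `chassis_api_lock`, `power_off` and `run_demo` are hypothetical names for illustration, and the sleep stands in for a real chassis API session. It relies on `fcntl.flock`, so it assumes a Unix host.

```python
import fcntl
import os
import tempfile
import threading
import time
from contextlib import contextmanager

# Hypothetical lock file; the location used by the actual patch is not known.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "seamicro-api.lock")


@contextmanager
def chassis_api_lock(path=LOCK_PATH):
    """Hold an exclusive advisory lock on `path` while the block runs."""
    # Each call opens its own file object, so flock() also excludes other
    # threads of the same process, not only other processes.
    with open(path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def power_off(node_id, log):
    """Stand-in for one chassis API session (login, command, logout)."""
    with chassis_api_lock():
        log.append(("start", node_id))
        time.sleep(0.01)  # simulated API round trip
        log.append(("end", node_id))


def run_demo(count=8):
    """Fire `count` power-off requests at once and return the event log."""
    log = []
    threads = [
        threading.Thread(target=power_off, args=(n, log)) for n in range(count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log
```

With the lock in place, every "start" entry in the log is immediately followed by the "end" entry of the same node, i.e. only one session talks to the chassis at a time even though all requests were issued in parallel.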

Raphaël Badin (rvb) on 2015-06-30

summary:

- MAAS 1.8 - failed power off/status for majority of nodes
+ Failed power off/status for multiple nodes within a SM15K chassis

Raphaël Badin (rvb) on 2015-07-06

Changed in maas:
status:	New → Triaged


Raphaël Badin (rvb) wrote on 2015-07-06:

After some debugging I found a couple of problems:

- MAAS uses various timeouts to limit the time it takes to check the power state of nodes. The seamicro chassis sometimes takes as long as 35 seconds to reply to a power query, and MAAS has a 15s timeout when the check is performed using the UI (using the "Check now" link). We should increase the timeout.

- The main problem that's happening when releasing a large number of nodes is that power.py:power_state_update ends up in a deadlock (thread starvation). After releasing 64 nodes I could see that the 10 threads allowed per region were all busy (stuck in epoll_wait) waiting for something, and the queue was growing and growing. Increasing the number of threads per region (to 20) solved the problem instantly. Conclusion: there is a deadlock in the code.
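
The starvation pattern described in the second point can be reproduced outside MAAS. The sketch below is only an illustration under stated assumptions: it uses `concurrent.futures` rather than MAAS's Twisted thread pool, and `release_nodes`, `release` and `deallocate_ip` are hypothetical stand-ins. Each outer task occupies a pool thread while waiting for an inner task queued on the same pool; the timeout exists only so the demo reports the starvation instead of hanging.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def release_nodes(pool_size, node_count, wait=0.5):
    """Return True if every release finishes, False if any of them starves."""
    pool = ThreadPoolExecutor(max_workers=pool_size)
    gate = threading.Event()  # holds outer tasks until all of them are queued

    def deallocate_ip(node):
        return node  # trivial work; all it needs is a free thread

    def release(node):
        gate.wait()
        inner = pool.submit(deallocate_ip, node)
        # The outer task keeps its thread while waiting for the inner task.
        return inner.result(timeout=wait)

    outers = [pool.submit(release, n) for n in range(node_count)]
    gate.set()

    ok = True
    for future in outers:
        try:
            future.result()
        except FutureTimeout:
            ok = False  # no thread was free to run the inner task in time
    pool.shutdown(wait=True)
    return ok
```

Under these assumptions, `release_nodes(pool_size=2, node_count=4)` returns False because both threads are held by outer tasks, while `release_nodes(pool_size=8, node_count=4)` returns True, which mirrors the effect of raising the thread count from 10 to 20.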


Raphaël Badin (rvb) wrote on 2015-07-06:

#10

Some more debugging and it's now clear that it's the `deallocate_static_ip_addresses` (called as part of releasing) that's stuck, waiting for a thread.


Raphaël Badin (rvb) wrote on 2015-07-15:

#11

The two branches that I landed improved the situation a bit but this bug is far from being completely fixed. We can't have an unbounded number of threads because it will make MAAS hit the maximum number of DB connections. Conversely, a very limited number of threads leads to deadlocks as explained above.
Two (non-exclusive) options here:
- go through each asynchronous task that MAAS runs and make sure no task being run in a thread requires another thread to complete (i.e. remove all the potential deadlocks).
- implement an intelligent thread pool which "detects" deadlock situations and allocates more threads when it has to (without going above a certain limit).
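
As a rough illustration of the first option, the sketch below runs the dependent step on the thread that is already executing the task instead of queueing it on the same pool. It assumes `concurrent.futures` rather than MAAS's Twisted machinery, and `release_nodes_inline`, `release` and `deallocate_ip` are hypothetical names, not MAAS functions.

```python
from concurrent.futures import ThreadPoolExecutor


def release_nodes_inline(pool_size, node_count):
    """Release nodes on a small pool without any task needing a second thread."""

    def deallocate_ip(node):
        return node  # stand-in for the real deallocation work

    def release(node):
        # Call the dependent step directly on the current thread, so the
        # task can always complete with the thread it already holds.
        return deallocate_ip(node)

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(release, range(node_count)))
```

Because no task waits on another pool thread, a pool of 2 threads can work through 64 releases; the pool size then only bounds concurrency (and DB connections), not correctness.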


Ante Karamatić (ivoks) wrote on 2015-07-20:

#12

Here's another datapoint...

If, instead of releasing all nodes, I mark all nodes as broken, they all power off just fine.

Andres Rodriguez (andreserl) on 2015-10-27

Changed in maas:
status:	Triaged → Fix Committed

Andres Rodriguez (andreserl) on 2016-01-22

Changed in maas:
status:	Fix Committed → Fix Released
