Bug #1357073 “power state changes are not reflected quickly enough in the UI” : Bugs : MAAS

Andres Rodriguez (andreserl) on 2014-08-14

Changed in maas:
assignee:	nobody → Blake Rouse (blake-rouse)

Gavin Panella (allenap) wrote on 2014-08-14:

#1

> The interval is too long, went through commissioning and it never
> really displayed the actual status of the machine. Interval should
> probably be just 1 minute.

This should probably not be solved by decreasing the interval. Instead,
those parts of MAAS that are able to infer the power status of a machine
(e.g. where an HTTP request comes in from its IP address) should update
the machine's power status with what they know. Times like commissioning
and deploying are those times when power is most in flux and also when
there's the most activity from which to infer the machine's power
status.
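To make the inference idea concrete, here is a minimal sketch, not MAAS code. `Node`, `NODES_BY_IP`, `register`, and `note_request_from` are hypothetical names, and the in-memory dict stands in for whatever lookup MAAS would actually use.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a MAAS node record."""
    system_id: str
    ip_address: str
    power_state: str = "unknown"


# Hypothetical lookup table; a real system would query its database.
NODES_BY_IP: Dict[str, Node] = {}


def register(node: Node) -> None:
    """Make the node findable by its IP address."""
    NODES_BY_IP[node.ip_address] = node


def note_request_from(ip_address: str) -> Optional[Node]:
    """Record that an HTTP request arrived from `ip_address`.

    A machine that can send a request must be powered on, so the
    matching node (if any) is marked "on" without querying its BMC.
    """
    node = NODES_BY_IP.get(ip_address)
    if node is not None:
        node.power_state = "on"
    return node
```

Note that this can only ever infer "on"; detecting that a machine has gone off still needs an explicit power query.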

For extra points, we could modify the interval based on the machine's
lifecycle status. An unallocated machine could be checked once every
15-30 minutes, or even less frequently, for example. I think we're most
interested in the power status of machines going through a
ready->allocated->deploying->deployed transition, followed by an ongoing
power status of deployed machines, followed by the power status of
machines in other lifecycle states.
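A rough sketch of a status-dependent interval follows. The status names and the specific durations are assumptions for illustration; only the 15-30 minute range for unallocated machines comes from the paragraph above.

```python
from datetime import timedelta

# Assumed values: machines in transition are checked most often,
# deployed machines less often, everything else rarely.
CHECK_INTERVALS = {
    "allocated": timedelta(minutes=1),
    "deploying": timedelta(minutes=1),
    "deployed": timedelta(minutes=5),
}

# Fallback for unallocated and other lifecycle states.
DEFAULT_INTERVAL = timedelta(minutes=30)


def check_interval_for(status: str) -> timedelta:
    """Return how long to wait between power checks for `status`."""
    return CHECK_INTERVALS.get(status, DEFAULT_INTERVAL)
```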

For even more points, we may want to actively power off machines in
those lifecycle states where they ought to be dormant.

Raphaël Badin (rvb) wrote on 2014-08-14:

#2

> This should probably not be solved by decreasing the interval. Instead,
> those parts of MAAS that are able to infer the power status of a machine
> (e.g. where an HTTP request comes in from its IP address) should update
> the machine's power status with what they know. Times like commissioning
> and deploying are those times when power is most in flux and also when
> there's the most activity from which to infer the machine's power
> status.

+1. The twisted-based PowerOn/PowerOff already update the node's power state. This means that whenever MAAS is actively changing the power-state of a node, the power state change should be reflected instantly in MAAS (i.e. we don't have to wait for the periodic monitor to run). Now, of course, we need to have the twisted-based PowerOn/PowerOff tasks used instead of the celery tasks ;).

Andres Rodriguez (andreserl) wrote on 2014-08-14: Re: [Bug 1357073] Re: src/provisioningserver/rpc/power.py:NodePowerMonitorService.check_interval too long

#3

Querying for power state only every 5 minutes is too infrequent. It does not
provide an accurate status of the power state of the machine and will mislead
administrators. As we discussed in the sprint, every 60 seconds would be a
good interval.

Andres Rodriguez (andreserl) wrote on 2014-08-14:

#4

Say a machine successfully changes power state and the change is reflected
instantly. Say the machine then dies for whatever reason. MAAS will continue
to report that the machine is on, even though it is not, until the next check.
We need to ensure that the check is frequent enough for us to display an
accurate status of the machine. If we are not able to do so then we will fail
miserably. I have tested this case myself, and if we keep the current interval
we will never be able to accurately display the power status.

Blake Rouse (blake-rouse) wrote on 2014-08-15:

#5

I agree that the frequency of checks needs to be increased. The main issue is
with chassis that hold multiple nodes: each chassis would be queried multiple
times in a row. I think if the power drivers handled multiple nodes per
request, that would remove that issue. With many nodes it would also be
good to just continue to query and not pause.
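As an illustration of one request per chassis, here is a minimal sketch. The `(system_id, chassis_address)` pairs and the `query_chassis` driver hook are hypothetical; the existing power drivers are not assumed to have this interface.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# (system_id, chassis_address) pairs; both fields are hypothetical.
NodeRef = Tuple[str, str]


def group_by_chassis(nodes: Iterable[NodeRef]) -> Dict[str, List[str]]:
    """Collect the system IDs that live behind each chassis address."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for system_id, chassis in nodes:
        groups[chassis].append(system_id)
    return dict(groups)


def query_power_states(
    nodes: Iterable[NodeRef],
    query_chassis: Callable[[str, List[str]], Dict[str, str]],
) -> Dict[str, str]:
    """Query each chassis once for all of its nodes.

    `query_chassis(chassis, system_ids)` is a hypothetical driver hook
    returning a {system_id: power_state} mapping for that chassis.
    """
    states: Dict[str, str] = {}
    for chassis, system_ids in group_by_chassis(nodes).items():
        states.update(query_chassis(chassis, system_ids))
    return states
```

With this shape, a chassis holding many nodes receives a single call per pass instead of one call per node.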

I also like Gavin's idea of marking the node as on if a web request comes
in from that node.

I will work on the implementation to make it update the status quicker.


Julian Edwards (julian-edwards) wrote on 2014-08-15:

#6

Every operation in MAAS, where possible, needs to be set-based. As
Blake points out, not doing this will potentially cause query storms on
chassis controllers.

Also, we need to think carefully about how well this will scale/perform
on large clusters.

summary:	- src/provisioningserver/rpc/power.py:NodePowerMonitorService.check_interval too long
	+ power state changes are not reflected quickly enough in the UI
Changed in maas:
status:	New → Triaged
importance:	Undecided → High

Andres Rodriguez (andreserl) wrote on 2014-09-05:

#7

So I raise the issue again, with latest trunk as of today:

1. Commissioned machine.
2. Machine finished commissioning and turned itself off.
3. MAAS still shows the machine as powered on.

We really need to do power status monitoring in a smarter way, and we need to check more often. It has been almost 5 minutes and MAAS continues to say the machine is ON when in reality it is OFF.

Julian Edwards (julian-edwards) on 2014-09-08

Changed in maas:
milestone:	none → 1.7.0

Raphaël Badin (rvb) wrote on 2014-09-09:

#8

The periodic refresh of a node's power status is mostly designed to cope with deployed nodes that can be powered on or off outside of MAAS. For the stages before this (enlistment, commissioning) I think we should refresh the power state deliberately and not wait for the periodic refresh to kick in.

Right now, we need to refresh the power state *after* enlistment and *after* commissioning. These are the only two places where a machine is powered down without using the RPC powerDown method (that method updates the power state).

Julian Edwards (julian-edwards) wrote on 2014-10-10:

#9

I think this and bug 1370897 are almost one and the same; certainly it's a large part of the lack of response. Once that one's fixed, this one will be alleviated.

Blake Rouse (blake-rouse) on 2014-10-10

Changed in maas:
status:	Triaged → In Progress

Raphaël Badin (rvb) wrote on 2014-10-10:

#10

I think 5 minutes is fine. The periodic checker is only there to detect power changes happening when the machine is deployed or if the machine gets powered up or down by some external thing.

But you (Andres) have a point: although we update the power state when we actively change the power of a machine (for instance when the machine goes from New to Commissioning) we don't when the power state is changed by MAAS but *not* by a power command (for instance when a node finishes commissioning and goes from 'Commissioning' to 'Ready').

What I think we should do is:
- keep the interval to 5 minutes for the periodic checker (and fix bug 1370897)
- check the power of a node each time its status changes (maybe with a 20-second delay, to give the node time to actually power up or down).
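The delayed check on status change could be sketched roughly as below. This is not how MAAS schedules work; `DelayedPowerChecks` and its methods are hypothetical names, and the caller passes in the current time so the logic stays independent of any particular event loop.

```python
import heapq
from typing import List, Tuple

POWER_CHECK_DELAY = 20.0  # seconds, the delay suggested above


class DelayedPowerChecks:
    """Queue a power check some seconds after each status change."""

    def __init__(self, delay: float = POWER_CHECK_DELAY) -> None:
        self.delay = delay
        # Min-heap of (due_time, system_id), earliest check first.
        self._queue: List[Tuple[float, str]] = []

    def on_status_change(self, system_id: str, now: float) -> None:
        """Schedule a check `delay` seconds after the status change."""
        heapq.heappush(self._queue, (now + self.delay, system_id))

    def pop_due(self, now: float) -> List[str]:
        """Return and remove the nodes whose check is due at `now`."""
        due: List[str] = []
        while self._queue and self._queue[0][0] <= now:
            due.append(heapq.heappop(self._queue)[1])
        return due
```

The 5-minute periodic checker would stay as it is; something like this only adds a targeted check shortly after each transition.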

Blake Rouse (blake-rouse) on 2014-10-14

Changed in maas:
status:	In Progress → Triaged

Christian Reis (kiko) on 2014-10-16

Changed in maas:
milestone:	1.7.0 → next

Christian Reis (kiko) wrote on 2014-10-17:

#11

I made a proposal in the duplicate bug 1382272:

When a node transitions states, there is a high likelihood the power status will change very soon afterwards. I understand the process is asynchronous which makes it hard to trigger directly, but here's a suggestion for a way to solve it with the current infrastructure.

AIUI, we currently run a loop to check power status of all nodes, every 5 minutes.
We could have that loop be much faster -- perhaps as low as once every 15 seconds -- AND:

   - Every node, at enlistment, gets a countdown timer of, say, 15 minutes +/- 1 minute
   - Nodes which change state would get marked as "power status dirty"
   - Every loop run:
      - If there is already a loop running, log a DEBUG message and quit
      - Nodes with no power type get skipped
      - Nodes marked as "power status dirty" would actually get probed for power status
      - Nodes whose timer had expired would also get probed for power status

There is a way to estimate what the ideal loop time is based on the time it takes to process the event, but that's a stab at how to do it.
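The proposal above can be sketched as follows. This is an illustration under assumptions, not the eventual fix: `TrackedNode`, `PowerCheckLoop`, and the `probe` callback are hypothetical, and resetting the timer after each probe is a choice the proposal does not spell out.

```python
import logging
import random
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class TrackedNode:
    system_id: str
    power_type: Optional[str]
    next_check: float  # time at which the countdown timer expires
    dirty: bool = False  # "power status dirty"


def next_deadline(now: float) -> float:
    """Return a deadline 15 minutes +/- 1 minute from `now`, in seconds."""
    return now + 15 * 60 + random.uniform(-60, 60)


class PowerCheckLoop:
    def __init__(self, probe: Callable[[str], None]) -> None:
        self.probe = probe  # hypothetical power query for one node
        self.nodes: Dict[str, TrackedNode] = {}
        self._running = threading.Lock()

    def enlist(self, system_id: str, power_type: Optional[str], now: float) -> None:
        self.nodes[system_id] = TrackedNode(system_id, power_type, next_deadline(now))

    def mark_dirty(self, system_id: str) -> None:
        """Call whenever a node changes state."""
        self.nodes[system_id].dirty = True

    def run_once(self, now: float) -> List[str]:
        """One loop run; returns the system IDs that were probed."""
        if not self._running.acquire(blocking=False):
            logging.debug("Power check loop already running; skipping.")
            return []
        try:
            probed: List[str] = []
            for node in self.nodes.values():
                if node.power_type is None:
                    continue  # nodes with no power type get skipped
                if node.dirty or node.next_check <= now:
                    self.probe(node.system_id)
                    node.dirty = False
                    node.next_check = next_deadline(now)  # assumed: restart timer
                    probed.append(node.system_id)
            return probed
        finally:
            self._running.release()
```

Calling `run_once` every 15 seconds is cheap here because most runs probe nothing; only dirty nodes and nodes with an expired timer trigger a real power query.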

Christian Reis (kiko) wrote on 2014-10-17:

#12

And to address Blake's comment over there, this wouldn't actually poll nodes that frequently, but rather more smartly (because we assume that power changes outside of MAAS' control won't happen that often).

MAAS Lander (maas-lander) on 2014-10-23

Changed in maas:
status:	Triaged → Fix Committed

Christian Reis (kiko) on 2014-10-30

Changed in maas:
milestone:	next → 1.7.1

Andres Rodriguez (andreserl) on 2015-02-05

Changed in maas:
status:	Fix Committed → Fix Released
