No retries for AMT power?

Bug #1374102 reported by Adam Collard
This bug affects 2 people
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Raphaël Badin
Milestone: 1.7.0

Bug Description

I had an AMT controlled node which failed to power on - event log in https://pastebin.canonical.com/117626/.

Whilst debugging I looked at the AMT power template in MAAS 1.7.0~beta3+bzr3043-0ubuntu1~ppa7 and noticed what appear to be some oddities.

There is a comment saying that retries are handled by the "core power driver" - is that accurate? Assuming it is, why is there this:

query_state() {
    # Retry the state if it fails because it often fails the first time.
    local state=
    local count=
    state=$(issue_amt_command info | grep '^Powerstate:' | awk '{print $2}')
    if [ -n "$state" ]
    then
        break
    fi
    # Wait 1 second between queries AMT controllers are generally very
    # light and may not be comfortable with more frequent queries.
    sleep 1
    case "$state" in
    S[0-4])
        # Wide awake (S0), or asleep (S1-S4), but not a clean slate that
        # will lead to a fresh boot.
        echo 'on'
        ;;
    S5)
        echo 'off'
        ;;
    *)
        fail 2 "Got unknown power state from node: '$state'"
        ;;
    esac
}

Note the break without any enclosing loop.

Note also the sleeping /after/ querying the AMT controller.
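For comparison, here is a minimal sketch of what a complete retry loop around the query could look like. It is not the original MAAS code: the function name query_state_with_retries, the max_attempts value, the RETRY_DELAY variable and the stub issue_amt_command are all assumptions made for illustration, and it returns non-zero on failure instead of calling the template's fail helper.

```shell
# Hypothetical stand-in for the template's issue_amt_command, so the
# sketch can run without AMT hardware.
issue_amt_command() {
    echo "Powerstate:   S0"
}

# Assumed knob: seconds to wait between attempts.
RETRY_DELAY="${RETRY_DELAY:-1}"

query_state_with_retries() {
    local state=
    local count=0
    local max_attempts=5    # assumed value, not taken from MAAS
    while [ "$count" -lt "$max_attempts" ]
    do
        state=$(issue_amt_command info | awk '/^Powerstate:/ {print $2}')
        if [ -n "$state" ]
        then
            # Here the break has a loop to leave.
            break
        fi
        count=$((count + 1))
        # Sleep only when another query is going to follow.
        if [ "$count" -lt "$max_attempts" ]
        then
            sleep "$RETRY_DELAY"
        fi
    done
    case "$state" in
    S[0-4])
        echo 'on'
        ;;
    S5)
        echo 'off'
        ;;
    *)
        echo "Got unknown power state from node: '$state'" >&2
        return 1
        ;;
    esac
}
```

With the stub above, calling query_state_with_retries prints on. The sleep sits between attempts rather than after the final one, which is the ordering the template's comment seems to intend.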

What is the API between templates and the power drivers to have the template say "please retry me"? Is it just fail?

Revision history for this message
Adam Collard (adam-collard) wrote :

The output at https://pastebin.canonical.com/117626/ is from a failed power on event; it led me to look at the template to see if it was still doing retries.

summary: - AMT power template strangeness
+ No retries for AMT power?
description: updated
Revision history for this message
Gavin Panella (allenap) wrote :

That looks like the retry code was only half removed.

tags: added: tech-debt
Changed in maas:
status: New → Triaged
importance: Undecided → High
Changed in maas:
assignee: nobody → Raphaël Badin (rvb)
importance: High → Critical
Revision history for this message
Raphaël Badin (rvb) wrote :

The contract between the code and template is this:
exit code = 2 : fatal error, cannot be retried.
exit code = 1 : temporary error, can be retried.
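As an illustration of that contract, a caller could honour it roughly as follows. This is only a sketch: run_with_retries is a hypothetical helper, not the MAAS core power driver, and treating every exit code other than 0 and 1 as fatal is an assumption.

```shell
# Usage: run_with_retries MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, fails fatally, or attempts run out.
run_with_retries() {
    local max_attempts="$1"
    shift
    local attempt=1
    local rc=0
    while [ "$attempt" -le "$max_attempts" ]
    do
        rc=0
        "$@" || rc=$?
        case "$rc" in
        0)
            # Success: nothing more to do.
            return 0
            ;;
        1)
            # Temporary error: count the attempt and go round again.
            attempt=$((attempt + 1))
            ;;
        *)
            # 2 is fatal; other codes are assumed fatal too.
            return "$rc"
            ;;
        esac
    done
    # Every attempt ended in a temporary error.
    return 1
}
```

Under this scheme a template that exits with 2 for a transient condition, such as an empty power state, is never retried, however many attempts the caller allows.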

Raphaël Badin (rvb)
Changed in maas:
status: Triaged → Fix Committed
Revision history for this message
Christian Reis (kiko) wrote :

But I thought the original bug was that there was no retrying of AMT -- Adam, can you confirm whether that is the case, or if it's just that you noted the left-over code?

Revision history for this message
Adam Collard (adam-collard) wrote : Re: [Bug 1374102] Re: No retries for AMT power?

Hi Kiko,

On 30 September 2014 14:25, Christian Reis <email address hidden> wrote:

> But I thought the original bug was that there was no retrying of AMT --
> Adam, can you confirm whether that is the case, or if it's just that you
> noted the left-over code?
>

As per Raphaël's comment above, the retries weren't happening because the
power template was exiting with exit code 2 (fatal, don't attempt to
retry). Looking at the diff that was committed, it seems that it would fix
this by making an unknown power state exit with 1, as well as removing the
leftover retry cruft from the template.

Revision history for this message
Christian Reis (kiko) wrote :

I see -- testing the patch and reporting success would be neat. It would be even more useful if the AMT nodes could reliably lag, thus requiring the retry!

Revision history for this message
Raphaël Badin (rvb) wrote :

I just had a look at the code that is supposed to do the retry depending on the template exit code and I don't think it's doing the right thing: a query (i.e. if you're just trying to get the power state of a node) will be retried no matter what the exit code is, and a change won't be retried if the template exits with anything other than zero.

I'll investigate and file another bug for this if I'm right.

Revision history for this message
David Britton (dpb) wrote :

Failed to power on node — Node could not be powered on: amt failed with return code 1: host ., powerup [y/N] ? execute: powerup result: pt_status: success yes: standard output: Broken pipe yes: write error 500 read timeout at /usr/bin/amttool line 126. amttool failure Got unknown power state from node: '' Machine is not powering on. Giving up.

FYI -- got this one on an AMT node that was responsive minutes before and minutes after. Just wanted to make sure this was included in the fix; if not, ping me and I'll file a new bug.

Revision history for this message
David Britton (dpb) wrote :

Sep 30 17:06:17 courage maas.node: [INFO] hawking.local: Stopping monitor: node-24428750-e295-11e3-83e9-782bcb8e566b
Sep 30 17:06:17 courage maas.node: [ERROR] hawking.local: Marking node failed: Node could not be powered off: amt failed with return
 code 1:#012host ., powerdown [y/N] ? execute: powerdown#012result: pt_status: success#012yes: standard output: Broken pipe#012yes:
write error#012Machine is not powering off. Giving up.

I'm afraid there isn't any more in the logs about it.

Changed in maas:
status: Fix Committed → New
Revision history for this message
David Britton (dpb) wrote :

Using: 1.7.0~beta4+bzr3139-0ubuntu1~trusty1

Changed in maas:
milestone: none → 1.7.0
Revision history for this message
David Britton (dpb) wrote :

FYI, I have hit this multiple times now. The node usually seems to power on or power off fine when this state is presented by the AMT.

Revision history for this message
Raphaël Badin (rvb) wrote :

I've re-enabled retries in the template. It seems that what the core power driver does was not enough for AMT :/.

Raphaël Badin (rvb)
Changed in maas:
status: New → Fix Committed
Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Wednesday 01 Oct 2014 21:13:50 you wrote:
> I've re-enabled retries in the template. It seems that what the core power
> driver does was not enough for AMT :/.

This is surprising, because when I originally removed the template retries,
it was because I saw a MASSIVE amount of retries - basically the template did
N and the core did M, resulting in N*M retries, which took many minutes.
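To make the N*M effect concrete, here is a small sketch that only counts attempts when one retry loop is nested inside another. The helper name count_attempts and the numbers in the usage line are illustrative assumptions, not values taken from MAAS.

```shell
# Usage: count_attempts TEMPLATE_TRIES CORE_TRIES
# Prints how many times the AMT controller would be contacted if every
# attempt failed and both layers retried independently.
count_attempts() {
    local template_tries="$1"
    local core_tries="$2"
    local total=0
    local i=0
    local j=0
    while [ "$i" -lt "$core_tries" ]
    do
        # Each core-level attempt runs the template, which retries on its own.
        j=0
        while [ "$j" -lt "$template_tries" ]
        do
            total=$((total + 1))
            j=$((j + 1))
        done
        i=$((i + 1))
    done
    echo "$total"
}
```

For example, count_attempts 5 3 prints 15: three core-level attempts, each of which runs five template-level attempts. Any per-attempt sleep or timeout is multiplied in the same way.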

Changed in maas:
status: Fix Committed → Fix Released