Cannot connect to AMT firmware occasionally

Bug #1504023 reported by Tan Lin
This bug affects 2 people
Affects                  Status     Importance  Assigned to  Milestone
ironic-staging-drivers   Confirmed  High        Unassigned

Bug Description

When NUCs are powered off, the AMT firmware goes to sleep to save power and waits for a wake-up signal. We need to ping its IP address to wake it up before sending a request.

Tan Lin (tan-lin-good)
Changed in ironic:
assignee: nobody → Tan Lin (tan-lin-good)
Dmitry Tantsur (divius)
Changed in ironic:
status: New → Triaged
importance: Undecided → High
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/234190

Changed in ironic:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/234190
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=1ab3c9217edc8ebed7643f8ded7b16c39dd4bc70
Submitter: Jenkins
Branch: master

commit 1ab3c9217edc8ebed7643f8ded7b16c39dd4bc70
Author: Lin Tan <email address hidden>
Date: Thu Nov 5 15:25:26 2015 +0800

    Wake up AMT interface before send request

    AMT interface goes to sleep after a period of time if the host
    is off. Add a new method 'awake_amt_interface' for amt driver
    to awake nodes' AMT interface.
    This method will ping AMT interface to to wake it up before sending
    request. Also set 60s by default as cache time.

    Change-Id: I4d81001c01e7908100a6571b366cb296253f2fc1
    Close-bug: #1504023
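
The commit message above describes pinging the AMT interface before a request and caching the result for 60 seconds. A minimal sketch of that idea follows. It is not the merged code: only the name `awake_amt_interface` and the 60 s value come from the commit message, while `AWAKE_INTERVAL`, `_last_awake`, `_ping`, and the `ping`/`now` parameters are hypothetical, and the `ping` flags assume Linux iputils.

```python
import subprocess
import time

AWAKE_INTERVAL = 60  # seconds; the cache time named in the commit message
_last_awake = {}     # hypothetical cache: address -> time of last good ping


def _ping(address):
    # Assumes Linux iputils: -c is the packet count, -W the reply timeout (s).
    cmd = ["ping", "-c", "1", "-W", "2", address]
    return subprocess.call(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0


def awake_amt_interface(address, ping=None, now=time.monotonic):
    """Ping the AMT interface unless it answered within AWAKE_INTERVAL.

    Returns True if a ping was sent, False if the cached result was used.
    """
    last = _last_awake.get(address)
    if last is not None and now() - last < AWAKE_INTERVAL:
        return False  # still considered awake, skip the ping
    if ping is None:
        ping = _ping
    if not ping(address):
        raise RuntimeError("AMT interface %s did not answer ping" % address)
    _last_awake[address] = now()
    return True
```

A single ping like this is exactly what the later comments in this report find insufficient when the ME needs several seconds to wake.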

Tan Lin (tan-lin-good)
Changed in ironic:
status: In Progress → Fix Committed
aeva black (tenbrae)
Changed in ironic:
status: Fix Committed → In Progress
aeva black (tenbrae) wrote :

I am reopening this bug because I have confirmed that it is not fixed adequately, and I am still getting timeout errors against a 5th-gen NUC.

After some investigation, here is the root cause: it typically takes the AMT ME on my NUC about 3 seconds to wake up from a low power state. Sometimes it takes longer, and when this happens, whatever command was requested (get power state, set power state, etc.) will fail.

According to the Intel AMT ME docs, it can take up to 25 seconds for AMT to wake up from a low power state, and ping shouldn't be used to wake up the interface:

"If the ME is set to respond to Pings, ping the client before the action. Note: There are situations where a ping will not reply for the first 2-3 times. You would not want to use this method if doing this in an automated manner."
- https://software.intel.com/en-us/wake-up-amt

Here are some logs from pinging the AMT ME that demonstrate the behaviour of ME's wakeup.

ping with 0.1s interval: http://paste.openstack.org/show/481861/
ping with 1s interval: http://paste.openstack.org/show/481863/

Note the ARP response in each case -- AMT does not reply to the ICMP echo request until after it has gotten a response to the ARP who-has request.

More importantly, however, AMT ME is *not* caching the ARP table between ICMP sessions; at the start of every ping request, ME issues another ARP who-has, which, at least in my lab today, is taking a few seconds. This is causing the AMT driver to fail repeatedly.

For reference, I am testing with commit 64530a6c5bc8091f4960bc582318350e294fac51.

Suggested fix #1:
- document that deployers must configure their AMT devices *not* to enter a low power state

Suggested fix #2:
- whenever an AMT Node is powered off, begin a background thread which issues a slow ICMP ping to prevent ME from going into a low power state

Suggested fix #3:
- increase the timeouts within the driver for waking up ME to the Intel-recommended 25 seconds
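
As an illustration of suggested fix #3, the sketch below keeps probing the interface until it answers or a 25-second deadline passes, instead of failing after one short wait. It is only a sketch under assumptions: `wait_for_amt`, `WAKE_TIMEOUT`, and the `probe` callable (anything that returns True once the ME responds, such as a ping or a cheap AMT query) are hypothetical names, not part of the ironic driver.

```python
import time

WAKE_TIMEOUT = 25.0  # seconds; the upper bound for ME wake-up cited above


def wait_for_amt(probe, timeout=WAKE_TIMEOUT, interval=1.0,
                 now=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns True or `timeout` seconds have passed.

    Returns True as soon as a probe succeeds, False if the deadline is hit.
    """
    deadline = now() + timeout
    while True:
        if probe():
            return True
        if now() + interval > deadline:
            return False  # another wait would overrun the deadline
        sleep(interval)
```

The clock and sleep functions are parameters only so the loop can be exercised without real delays; a caller would normally pass just `probe`.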

aeva black (tenbrae)
Changed in ironic:
status: In Progress → Triaged
Tan Lin (tan-lin-good) wrote :

Thanks for your report, Devananda. Let me see which one is better.

Tan Lin (tan-lin-good) wrote :

But the conf options [amt]action_wait and [amt]max_attempts should solve this issue. Admins can set the timeouts depending on their environment. If not, hmmm, then we have a new bug here.
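
To judge whether a given pair of settings could cover the wake-up time, a rough calculation helps. The sketch below assumes that `action_wait` is the number of seconds between attempts and `max_attempts` is the total number of tries; that reading of the two options is an assumption, and `retry_window` and `covers_wakeup` are hypothetical helpers, not driver code.

```python
def retry_window(action_wait, max_attempts):
    """Seconds between the first and the last attempt (assumed semantics)."""
    return action_wait * max(max_attempts - 1, 0)


def covers_wakeup(action_wait, max_attempts, wakeup=25):
    """True if the retry window is at least as long as the wake-up time."""
    return retry_window(action_wait, max_attempts) >= wakeup
```

Under these assumptions, 10-second waits with 3 attempts span 20 seconds, which falls short of a 25-second wake-up, while 4 attempts span 30 seconds.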

Tan Lin (tan-lin-good) wrote :

I reproduced this bug. The problem is that we have to wait longer for the ping if AMT is in a deep sleep. Let me see what we can do.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/264106

Changed in ironic:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (master)

Change abandoned by Tan Lin (<email address hidden>) on branch: master
Review: https://review.openstack.org/264106
Reason: Yes, this should be abandoned, thanks for remind

Dmitry Tantsur (divius)
affects: ironic → ironic-staging-drivers
Changed in ironic-staging-drivers:
status: In Progress → Confirmed
assignee: Tan Lin (tan-lin-good) → nobody