ironic-staging-drivers

Cannot connect to AMT firmware occasionally

Bug #1504023 reported by Tan Lin on 2015-10-08

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	ironic-staging-drivers	Confirmed	High	Unassigned

Bug Description

When NUCs are powered off, the AMT firmware goes to sleep to save power and waits for a wake-up signal. We need to ping its IP address to wake it up before sending a request.

Tan Lin (tan-lin-good) on 2015-10-08

Changed in ironic:
assignee:	nobody → Tan Lin (tan-lin-good)

Dmitry Tantsur (divius) on 2015-10-09

Changed in ironic:
status:	New → Triaged
importance:	Undecided → High

John L. Villalovos (happycamp) on 2015-10-09

description:	updated

OpenStack Infra (hudson-openstack) wrote on 2015-10-13: Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/234190

OpenStack Infra (hudson-openstack) on 2015-10-16

Changed in ironic:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2015-12-02: Fix merged to ironic (master)

Reviewed: https://review.openstack.org/234190
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=1ab3c9217edc8ebed7643f8ded7b16c39dd4bc70
Submitter: Jenkins
Branch: master

commit 1ab3c9217edc8ebed7643f8ded7b16c39dd4bc70
Author: Lin Tan <email address hidden>
Date: Thu Nov 5 15:25:26 2015 +0800

Wake up AMT interface before send request

    AMT interface goes to sleep after a period of time if the host
    is off. Add a new method 'awake_amt_interface' for amt driver
    to awake nodes' AMT interface.
    This method will ping AMT interface to wake it up before sending
    request. Also set 60s by default as cache time.

Change-Id: I4d81001c01e7908100a6571b366cb296253f2fc1
Close-bug: #1504023
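
The approach described in the commit message can be sketched roughly as follows. This is not the merged ironic code: the names `ping_once` and `AwakeCache`, the use of the system `ping` command with Linux-style flags, and the injectable `ping`/`clock` parameters are assumptions made for illustration. Only the 60-second cache interval comes from the commit message.

```python
import subprocess
import time

AWAKE_INTERVAL = 60  # seconds; the default cache time named in the commit message


def ping_once(address):
    """Send one ICMP echo request via the system ping (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class AwakeCache:
    """Remember when each AMT address last answered a ping (hypothetical helper)."""

    def __init__(self, interval=AWAKE_INTERVAL, ping=ping_once, clock=time.monotonic):
        self._interval = interval
        self._ping = ping
        self._clock = clock
        self._last_awake = {}

    def awake(self, address):
        """Ping the address unless it answered within the last `interval` seconds."""
        now = self._clock()
        last = self._last_awake.get(address)
        if last is not None and now - last < self._interval:
            return True  # assumed still awake, so the ping is skipped
        if self._ping(address):
            self._last_awake[address] = now
            return True
        return False
```

A caller would invoke `AwakeCache().awake(address)` before each AMT request. Note that a single ping with a short timeout is exactly the pattern the later comments in this report find insufficient.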

Tan Lin (tan-lin-good) on 2015-12-09

Changed in ironic:
status:	In Progress → Fix Committed

aeva black (tenbrae) on 2015-12-14

Changed in ironic:
status:	Fix Committed → In Progress

aeva black (tenbrae) wrote on 2015-12-14:

I am reopening this bug because I have confirmed that it is not fixed adequately, and I am still getting timeout errors against a 5th-gen NUC.

After some investigation, here is the root cause: it typically takes the AMT ME on my NUC about 3 seconds to wake up from a low power state. Sometimes it takes longer, and when this happens, whatever command was requested (get power state, set power state, etc.) will fail.

According to the Intel AMT ME docs, it can take up to 25 seconds for AMT to wake up from a low power state, and ping shouldn't be used to wake up the interface:

"If the ME is set to respond to Pings, ping the client before the action. Note: There are situations where a ping will not reply for the first 2-3 times. You would not want to use this method if doing this in an automated manner."
- https://software.intel.com/en-us/wake-up-amt

Here are some logs from pinging the AMT ME that demonstrate the behaviour of ME's wakeup.

ping with 0.1s interval: http://paste.openstack.org/show/481861/
ping with 1s interval: http://paste.openstack.org/show/481863/

Note the ARP response in each case -- AMT does not reply to the ICMP echo request until after it has gotten a response to the ARP who-has request.

More importantly, however, AMT ME is *not* caching the ARP table between ICMP sessions; at the start of every ping request, ME issues another ARP who-has, which, at least in my lab today, is taking a few seconds. This is causing the AMT driver to fail repeatedly.

For reference, I am testing with commit 64530a6c5bc8091f4960bc582318350e294fac51.

Suggested fix #1:
- document that deployers must configure their AMT devices *not* to enter a low power state

Suggested fix #2:
- whenever an AMT Node is powered off, begin a background thread which issues a slow ICMP ping to prevent ME from going into a low power state

Suggested fix #3:
- increase the timeouts within the driver for waking up ME to the Intel-recommended 25 seconds
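
Suggested fix #3 amounts to retrying the wake-up ping until a deadline rather than giving up after the first failure. A minimal sketch, assuming a caller-supplied `ping` callable that returns True on a reply. The function name, the 1-second retry interval, and the injectable `clock`/`sleep` parameters are hypothetical; only the 25-second ceiling comes from the comment above.

```python
import time

WAKE_TIMEOUT = 25.0  # seconds; the upper bound reported in the comment above


def wait_until_awake(ping, address, timeout=WAKE_TIMEOUT, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Ping `address` repeatedly until it replies or `timeout` seconds elapse.

    Returns the seconds waited before the first reply, or None on timeout.
    """
    start = clock()
    while True:
        if ping(address):
            return clock() - start
        remaining = timeout - (clock() - start)
        if remaining <= 0:
            return None
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

The return value doubles as a measurement: logging it would show how long the ME actually takes to wake in a given lab, which helps in choosing a timeout.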

aeva black (tenbrae) on 2015-12-14

Changed in ironic:
status:	In Progress → Triaged

Tan Lin (tan-lin-good) wrote on 2015-12-15:

Thanks for your report, Devananda. Let me see which one is better.

Tan Lin (tan-lin-good) wrote on 2015-12-15:

But the conf options [amt]action_wait and [amt]max_attempts should solve this issue. Admins can set the timeouts depending on their environment. If not, hmmm, then we have a new bug here.
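
To make the suggestion concrete, the sketch below parses an example `[amt]` fragment with Python's `configparser` and estimates how long the driver would keep retrying. The values are illustrative, not defaults or recommendations, and the budget formula assumes that `action_wait` is the pause in seconds between attempts and `max_attempts` is the total number of attempts; check the option help text for your release before relying on it.

```python
import configparser

EXAMPLE_CONF = """
[amt]
# Illustrative values only, not defaults or recommendations.
action_wait = 10
max_attempts = 5
"""


def retry_budget(conf_text):
    """Return the total seconds spent pausing between attempts (assumed semantics)."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    wait = parser.getint("amt", "action_wait")
    attempts = parser.getint("amt", "max_attempts")
    # One pause between each pair of consecutive attempts.
    return wait * max(attempts - 1, 0)


budget = retry_budget(EXAMPLE_CONF)
print(f"retry budget: {budget}s (compare with the 25s wake-up bound)")
```

Under these assumptions, a deployer would pick values whose budget exceeds the 25-second wake-up bound mentioned earlier in this report.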

Tan Lin (tan-lin-good) wrote on 2016-01-05:

I reproduced this bug. The problem is that we have to wait longer for the ping if AMT is in a deep sleep. Let me see what we can do.

OpenStack Infra (hudson-openstack) wrote on 2016-01-06: Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/264106

OpenStack Infra (hudson-openstack) on 2016-01-18

Changed in ironic:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-03-29: Change abandoned on ironic (master)

Change abandoned by Tan Lin (<email address hidden>) on branch: master
Review: https://review.openstack.org/264106
Reason: Yes, this should be abandoned, thanks for the reminder

Dmitry Tantsur (divius) on 2016-09-14

affects:	ironic → ironic-staging-drivers
Changed in ironic-staging-drivers:
status:	In Progress → Confirmed
assignee:	Tan Lin (tan-lin-good) → nobody
