baremetal instances can get wedged when deleted

Bug #1177584 reported by Robert Collins
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Māris Fogels

Bug Description

Found trying to bring up some odd hardware, we had nodes that stuck after the deploy ramdisk, leaving the state BUILDING.

nova delete of the instance left it:
+--------------------------------------+--------------+--------+------------+-------------+----------+
| b74cbb15-edf8-4dbc-911c-430c0ff60a31 | bmtest3 | BUILD | deleting | NOSTATE | |

restarting nova-compute had no effect, but logged:
2013-05-07 22:39:10,406.406 30255 ERROR nova.compute.manager
                  [req-bd4f5862-0182-4bbc-8823-8573cbfd2eb5 None None] Instance bmtest-cmsj1 found in the
                  hypervisor, but not in the database
...
 2013-05-07 22:39:41,813.813 30255 INFO nova.compute.manager
                  [req-1df4ab37-8beb-4184-bd43-baed73e768e4 None None] [instance:
                  b74cbb15-edf8-4dbc-911c-430c0ff60a31] During sync_power_state the instance has a pending
                  task. Skip.

Tags: baremetal
Revision history for this message
Robert Collins (lifeless) wrote :

oh, IPMI power manager, credentials seem ok - machines power on on demand.

Revision history for this message
Robert Collins (lifeless) wrote :
Download full text (4.3 KiB)

2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 242, in decorated_function
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp e, sys.exc_info())
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp self.gen.next()
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 229, in decorated_function
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp return function(self, context, *args, **kwargs)
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 1480, in terminate_instance
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp do_terminate_instance(instance, bdms)
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/openstack/common/lockutils.py", line 242, in inner
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp retval = f(*args, **kwargs)
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 1472, in do_terminate_instance
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp reservations=reservations)
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/hooks.py", line 85, in inner
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp rv = f(*args, **kwargs)
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 1435, in _delete_instance
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp project_id=project_id)
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp self.gen.next()
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 1408, in _delete_instance
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp self._shutdown_instance(context, instance, bdms)
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 1351, in _shutdown_instance
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp block_device_info)
2013-05-08 02:33:35,387.387 22836 TRACE nova.openstack.common.rpc.amqp File "/opt...

Read more...

summary: - baremetal instances can get wedged when deleted during BUILDING
+ baremetal instances can get wedged when deleted
Revision history for this message
Robert Collins (lifeless) wrote :

Devananda suggested this might be due to slow IPMI commands.

Māris Fogels (mars)
Changed in nova:
assignee: nobody → Māris Fogels (mars)
Māris Fogels (mars)
Changed in nova:
status: Triaged → In Progress
Revision history for this message
Māris Fogels (mars) wrote :

I chased this bug up into the nova-compute manager itself: if it is a slow IPMI call, then that is raising an exception which prevents the VM power state from syncing, effectively wedging the instance. This issue is filed as bug 1212420.

I discussed a solution with jog0, and here is what I have so far:

 * short-term: increase the IPMI system call timeout to something longer than 2.5 seconds
 * short-term: add a log message showing the IPMI timeout being triggered so cloud ops know what's going on
 * long-term: fix bug 1212420 and bug 1212347

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/42680

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/42681

Revision history for this message
Robert Collins (lifeless) wrote :

I'm not sure a longer timeout will actually help: 2.5 seconds is /plenty/ of time for the LOM to handle the request - it's on a different CPU to the machine. Timeouts are much more commonly due to lost network packets and so forth AIUI. A longer timeout will just make the system less responsive when something does go wrong.

Revision history for this message
Māris Fogels (mars) wrote :

I may have found the source of the power-state sync code dying, which would cause the wedge. The patch and a detailed explanation of the fix are here: https://review.openstack.org/#/c/43528/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/42681
Committed: http://github.com/openstack/nova/commit/f312713b40955f266489524905e70ddc57d5a30b
Submitter: Jenkins
Branch: master

commit f312713b40955f266489524905e70ddc57d5a30b
Author: Maris Fogels <email address hidden>
Date: Fri Aug 16 19:10:26 2013 -0400

    baremetal: Log IPMI power on/off timeouts

    Added log messages for cloud operators so that they can tell when an
    attempt to turn a device on or off via IPMI has failed after the
    configured number of attempts.

    This should help operators diagnose issues on devices that respond
    slowly to IPMI commands. The failure to repeatedly change the device
    power state is now logged precisely along with the cause, where
    before operators had to diagnose the issue through trace log captures.

    Partial-Bug: #1177584
    Change-Id: Iecd6fadb2b653f4ccbe75882c2ef85812b334e8d

Revision history for this message
Māris Fogels (mars) wrote :

I'm going to leave the default IPMI timeout as-is, given lifeless' feedback about the hardware. The increased logging should give enough information for cloud operators to identify IPMI timeouts as an issue, and also make it possible to determine that increasing the timeouts is a valid workaround.

Revision history for this message
Māris Fogels (mars) wrote :

Increased logging has landed in trunk, and the power-state synchronization fix in https://review.openstack.org/43528 has landed, too.

lifeless, are you able to verify that applying https://github.com/openstack/nova/commit/1e8de59d250eb8374f977e8008386abe9e7ea3db and http://github.com/openstack/nova/commit/f312713b40955f266489524905e70ddc57d5a30b fixes the issue?

Revision history for this message
Joe Gordon (jogo) wrote :

looks like this has been resolved, marking the bug as such.

Changed in nova:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.