handling EVENT_LIFECYCLE_STOPPED races with removing instance from database

Bug #1180501 reported by Peter Feiner
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Low
Assigned to: Peter Feiner
Milestone: 2013.2

Bug Description

When an instance is deleted (ComputeManager.terminate_instance), the compute driver is told to destroy the instance and wait for its destruction, then the instance is removed from the database. Using the libvirt+kvm compute driver on stable/grizzly and libvirt-1.0.4, when the instance is destroyed, an EVENT_LIFECYCLE_STOPPED is asynchronously delivered to nova-compute by libvirt. If the EVENT_LIFECYCLE_STOPPED is handled after the instance has been removed from the database, then ComputeManager's lifecycle event handler bombs when it tries to look the instance up in the database (see stack trace below).

The error is benign since the lifecycle event handler just syncs the power state of instances and this instance already has the correct power state (i.e., not in the database at all). However, the stack trace is scary and should probably be fixed.

The more that happens in parallel, the easier the bug is to observe:
* It seems easiest to reproduce this bug when lots (10+) of instances are being created and deleted in parallel.
* I'm fairly certain this race exists regardless of libvirt version (i.e., asynchronous delivery of an event, handling of which queries a potentially deleted record); however, I haven't been able to reproduce it with an older version (libvirt-0.9.8). This is probably explained by libvirt-1.0.3+ having a bunch of bottlenecks removed (see Daniel Berrange's patch series on libvirt-devel).
* I'm working on concurrency bottlenecks in OpenStack. With a bunch of these fixed, the bug happens almost all of the time.

I haven't tested this on the master branch, but the faulting bits (ComputeManager.handle_lifecycle_event) haven't changed.
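
For reference, the handler at fault looks roughly like the sketch below. This is reconstructed from the stack trace further down, not copied verbatim from nova/compute/manager.py, and the exact constants and helper names may differ slightly between branches:

    # Rough sketch of ComputeManager.handle_lifecycle_event on
    # stable/grizzly, reconstructed from the trace below. Assumes the
    # usual manager.py imports (nova.context, nova.compute.power_state,
    # nova.virt.event as virtevent).
    def handle_lifecycle_event(self, event):
        context = nova.context.get_admin_context()
        # This conductor lookup is what raises InstanceNotFound when the
        # instance row has already been removed from the database.
        instance = self.conductor_api.instance_get_by_uuid(
            context, event.get_instance_uuid())
        vm_power_state = None
        if event.get_transition() == virtevent.EVENT_LIFECYCLE_STOPPED:
            vm_power_state = power_state.SHUTDOWN
        elif event.get_transition() == virtevent.EVENT_LIFECYCLE_STARTED:
            vm_power_state = power_state.RUNNING
        if vm_power_state is not None:
            self._sync_instance_power_state(context, instance,
                                            vm_power_state)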

This problem could be fixed in a couple of ways:

1) Wait for the EVENT_LIFECYCLE_STOPPED in LibvirtDriver.destroy rather than polling libvirt (see _wait_for_destroy).
2) Ignore events in ComputeManager.handle_lifecycle_event for non-existent instances.
3) Maintain a deleted list (i.e., a list of instance uuids for which events can be ignored).

I think (2) is simplest to implement and most robust. Moreover, (2) also handles the race between any event delivery and the deletion of a domain. Although (3) has the advantage that it won't mask database deletion bugs, it introduces a garbage collection problem.
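
A minimal sketch of approach (2), reusing the handler shape sketched above; this is illustrative only and not necessarily identical to the change that was eventually merged:

    # Sketch of approach (2): ignore lifecycle events whose instance no
    # longer exists. Illustrative only; assumes `from nova import
    # exception` is in scope.
    def handle_lifecycle_event(self, event):
        context = nova.context.get_admin_context()
        try:
            instance = self.conductor_api.instance_get_by_uuid(
                context, event.get_instance_uuid())
        except exception.InstanceNotFound:
            # The row was already deleted (e.g. by terminate_instance),
            # so there is no power state left to sync.
            LOG.debug(_("Ignoring lifecycle event for deleted instance "
                        "%s"), event.get_instance_uuid())
            return
        # ... continue with the power-state sync as before ...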

Relevant blueprint: https://blueprints.launchpad.net/nova/+spec/compute-driver-events

InstanceNotFound: Instance 9b6045c5-4d63-47fd-8df1-5c424c72d5be could not be found.
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver Traceback (most recent call last):
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/virt/driver.py", line 902, in emit_event
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver self._compute_event_callback(event)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/compute/manager.py", line 577, in handle_events
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver self.handle_lifecycle_event(event)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/compute/manager.py", line 556, in handle_lifecycle_event
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver context, event.get_instance_uuid())
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/conductor/api.py", line 402, in instance_get_by_uuid
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver instance_uuid)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/conductor/rpcapi.py", line 122, in instance_get_by_uuid
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return self.call(context, msg, version='1.2')
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/rpc/proxy.py", line 83, in call
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return rpc.call(context, self._get_topic(topic), msg, timeout)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/rpc/__init__.py", line 95, in wrapper
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return fn(*args, **kwargs)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/rpc/__init__.py", line 162, in call
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return _get_impl().call(CONF, context, topic, msg, timeout)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/rpc/impl_kombu.py", line 798, in call
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver rpc_amqp.get_connection_pool(conf, Connection))
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/rpc/amqp.py", line 612, in call
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver rv = list(rv)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/rpc/amqp.py", line 561, in __iter__
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver raise result
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver InstanceNotFound_Remote: Instance 9b6045c5-4d63-47fd-8df1-5c424c72d5be could not be found.
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver Traceback (most recent call last):
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/rpc/common.py", line 407, in catch_client_exception
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return func(*args, **kwargs)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/conductor/manager.py", line 107, in instance_get_by_uuid
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver self.db.instance_get_by_uuid(context, instance_uuid))
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/db/api.py", line 598, in instance_get_by_uuid
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return IMPL.instance_get_by_uuid(context, uuid)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/trace/__init__.py", line 258, in wrapper
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver r = fn(*args, **kwargs)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 115, in wrapper
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return f(*args, **kwargs)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1531, in instance_get_by_uuid
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return _instance_get_by_uuid(context, uuid)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 115, in wrapper
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return f(*args, **kwargs)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1541, in _instance_get_by_uuid
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver raise exception.InstanceNotFound(instance_id=uuid)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver InstanceNotFound: Instance 9b6045c5-4d63-47fd-8df1-5c424c72d5be could not be found.
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.742 3270 DEBUG nova.virt.driver [-] Emitting event <nova.virt.event.LifecycleEvent object at 0x4029c50> emit_event /opt/stack/nova/nova/virt/driver.py:901

Revision history for this message
Peter Feiner (pete5) wrote :

I will submit a patch in the next couple of days to implement approach (2).

Changed in nova:
assignee: nobody → Peter Feiner (pete5)
Michael Still (mikal)
Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/33855

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/33855
Committed: http://github.com/openstack/nova/commit/3d3affc752613f0f12cd25109b1d5be05bfe0c6f
Submitter: Jenkins
Branch: master

commit 3d3affc752613f0f12cd25109b1d5be05bfe0c6f
Author: Peter Feiner <email address hidden>
Date: Wed May 15 14:03:56 2013 -0500

    Ignore lifecycle events for non-existent instances

    Since event delivery is asynchronous, there's an inherent yet benign race in the
    event handler querying the database for a deleted instance.

    Fixes bug #1180501.

    Change-Id: I8e478fb13bedfc86730018ba849ecc71d4286f0d

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → havana-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-2 → 2013.2