xenapi returns HANDLE_INVALID randomly

Bug #887708 reported by Chris Behrens on 2011-11-08
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Johannes Erdfelt

Bug Description

(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/home/cbehrens/openstack/nova/nova.git/nova/rpc/impl_kombu.py", line 620, in _process_data
(nova.rpc): TRACE: rval = node_func(context=ctxt, **node_args)
(nova.rpc): TRACE: File "/home/cbehrens/openstack/nova/nova.git/nova/exception.py", line 115, in wrapped
(nova.rpc): TRACE: return f(*args, **kw)
(nova.rpc): TRACE: File "/home/cbehrens/openstack/nova/nova.git/nova/compute/manager.py", line 495, in run_instance
(nova.rpc): TRACE: self._run_instance(context, instance_id, **kwargs)
(nova.rpc): TRACE: File "/home/cbehrens/openstack/nova/nova.git/nova/compute/manager.py", line 409, in _run_instance
(nova.rpc): TRACE: if instance['name'] in self.driver.list_instances():
(nova.rpc): TRACE: File "/home/cbehrens/openstack/nova/nova.git/nova/virt/xenapi_conn.py", line 191, in list_instances
(nova.rpc): TRACE: return self._vmops.list_instances()
(nova.rpc): TRACE: File "/home/cbehrens/openstack/nova/nova.git/nova/virt/xenapi/vmops.py", line 136, in list_instances
(nova.rpc): TRACE: vm_rec = self._session.call_xenapi("VM.get_record", vm_ref)
(nova.rpc): TRACE: File "/home/cbehrens/openstack/nova/nova.git/nova/virt/xenapi_conn.py", line 482, in call_xenapi
(nova.rpc): TRACE: return tpool.execute(f, *args)
(nova.rpc): TRACE: File "/usr/local/lib/python2.6/dist-packages/eventlet/tpool.py", line 76, in tworker
(nova.rpc): TRACE: rv = meth(*args,**kwargs)
(nova.rpc): TRACE: File "/usr/local/lib/python2.6/dist-packages/XenAPI.py", line 229, in __call__
(nova.rpc): TRACE: return self.__send(self.__name, args)
(nova.rpc): TRACE: File "/usr/local/lib/python2.6/dist-packages/XenAPI.py", line 133, in xenapi_request
(nova.rpc): TRACE: result = _parse_result(getattr(self, methodname)(*full_params))
(nova.rpc): TRACE: File "/usr/local/lib/python2.6/dist-packages/XenAPI.py", line 203, in _parse_result
(nova.rpc): TRACE: raise Failure(result['ErrorDescription'])
(nova.rpc): TRACE: Failure: ['HANDLE_INVALID', 'VM', 'OpaqueRef:be249227-83c4-ff6e-7d70-f1000a857548']
(nova.rpc): TRACE:

Chris Behrens (cbehrens) wrote :


The instance is left in 'BUILD' vm_state... it should go to ERROR when we get unexpected XenAPI responses/exceptions.

Can you attach the rest of the log for creating this instance?

Chris Behrens (cbehrens) wrote :

There's really nothing before this regarding this instance. However, there was a delete operation running at the same time that completed right before this trace.

Attached is the most I could get out of my screen history buffer... but it does go back a ways, if it helps. My logging was to stdout.

I've seen this a few times lately as well. It was always when I was deleting an instance. Looking at the code, it appears to be a race condition in nova/virt/xenapi/vmops.py:VMOps.list_instances.

It calls the XenAPI command VM.get_all and then calls VM.get_record on each of the returned VM refs. If an instance gets deleted between the VM.get_all command and the VM.get_record, then I could easily see a HANDLE_INVALID being returned.

I think the code needs to be tolerant of HANDLE_INVALID exceptions and continue on, assuming we lost the race on that VM.

Thierry Carrez (ttx) on 2011-12-02
Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
tags: added: xenapi
Matt Dietz (cerberus) on 2011-12-06
Changed in nova:
assignee: nobody → Matt Dietz (cerberus)

Reviewed: https://review.openstack.org/2151
Committed: http://github.com/openstack/nova/commit/fb27cc6979dcc8b4c0ac6595dae0c6e3e413e00f
Submitter: Jenkins
Branch: master

 status fixcommitted

commit fb27cc6979dcc8b4c0ac6595dae0c6e3e413e00f
Author: Johannes Erdfelt <email address hidden>
Date: Wed Dec 7 18:20:03 2011 +0000

    Fix race condition in XenAPI when using <object>.get_all

    Fixes bug 887708

    There are a handful of places where <object>.get_all is followed by a
    <object>.get_record calls that are potentially racey. This patch fixes
    all of these cases to use common code that is tolerant of HANDLE_INVALID
    errors that would be indicative of a race between get_all and delete

    Change-Id: Ib94adb6d21b6b55e7b26fc1da52ed46d9dba8275

Changed in nova:
status: Confirmed → Fix Committed
Changed in nova:
assignee: Matt Dietz (cerberus) → Johannes Erdfelt (johannes.erdfelt)
Thierry Carrez (ttx) on 2011-12-14
Changed in nova:
milestone: none → essex-2
status: Fix Committed → Fix Released
Mark McLoughlin (markmc) wrote :

Is this bug applicable to Diablo too? Anyone care to backport the fix, if so?

I'm pretty sure it is applicable to Diablo. It's been in nova for as long as I remember.

Chris Behrens (cbehrens) wrote :

Yeah, this was fixed post-diablo. Should probably be back-ported, but my team @ RAX probably won't have time in the next 2 weeks.

Thierry Carrez (ttx) on 2012-04-05
Changed in nova:
milestone: essex-2 → 2012.1
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers