nova boot fails to attach vmdk in multi-host environments without DRS properly enabled

Bug #1180897 reported by Shawn Hartsock
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Medium
Assigned to: Shawn Hartsock
Milestone: (none)

Bug Description

When a VMware vCenter manages more than one ESXi host, the nova boot command will fail at the point where the image (VMDK) is being attached to the VM. The error is "file is not found". Inspection of the datastore (you have to suspend or halt the nova process before it performs cleanup activities to observe this) shows that the VMDK was properly placed in the shared datastore, but that the host may not be able to see the path at which the VMDK was stored. If you move the VMDK and attach it using vSphere's own management tools, the VM will recover.

* If you have only one host in the cluster, this problem goes away.
* If you have only one host in the vCenter, this problem goes away.
* If you have DRS with automatic placement turned on, the problem goes away.
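
To see whether a given host can actually reach the datastore nova picked, you can ask vCenter which hosts have that datastore mounted and accessible. A minimal sketch using pyVmomi (illustrative only; nova's vmwareapi driver uses its own suds-based bindings, and the connection details here are placeholders):

    # Illustrative pyVmomi sketch: list which ESXi hosts can see each
    # datastore. Host and credentials are placeholders.
    from pyVim.connect import SmartConnect, Disconnect

    si = SmartConnect(host='vcenter.example.com',
                      user='administrator', pwd='secret')
    try:
        content = si.RetrieveContent()
        for dc in content.rootFolder.childEntity:      # Datacenter objects
            for ds in getattr(dc, 'datastore', []):
                # ds.host is a list of DatastoreHostMount entries: every
                # host that has this datastore mounted, with mount state.
                visible = [m.key.name for m in ds.host
                           if m.mountInfo.accessible]
                print('%s visible to: %s' % (ds.name, visible))
    finally:
        Disconnect(si)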

2013-05-16 09:22:29.473 ERROR nova.compute.manager [req-9e61185b-5444-4f97-b711-6c65b716a2a0 demo demo] [instance: aef16488-88c7-4952-99d4-f55377c410e9] Error: ['Traceback (most recent call last):\n', ' File "/opt/stack/nova/nova/compute/manager.py", line 941, in _build_instance\n set_access_ip=set_access_ip)\n', ' File "/opt/stack/nova/nova/compute/manager.py", line 1203, in _spawn\n LOG.exception(_(\'Instance failed to spawn\'), instance=instance)\n', ' File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__\n self.gen.next()\n', ' File "/opt/stack/nova/nova/compute/manager.py", line 1199, in _spawn\n block_device_info)\n', ' File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 176, in spawn\n block_device_info)\n', ' File "/opt/stack/nova/nova/virt/vmwareapi/vmops.py", line 405, in spawn\n vmdk_file_size_in_kb, linked_clone)\n', ' File "/opt/stack/nova/nova/virt/vmwareapi/volumeops.py", line 68, in attach_disk_to_vm\n self._session._wait_for_task(instance_name, reconfig_task)\n', ' File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 559, in _wait_for_task\n ret_val = done.wait()\n', ' File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait\n return hubs.get_hub().switch()\n', ' File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch\n return self.greenlet.switch()\n', "NovaException: Invalid configuration for device '1'.\n"]
2013-05-16 09:22:29.474 DEBUG nova.openstack.common.rpc.amqp [req-9e61185b-5444-4f97-b711-6c65b716a2a0 demo demo] Making synchronous call on conductor ... from (pid=32776) multicall /opt/stack/nova/nova/openstack/common/rpc/amqp.py:586
2013-05-16 09:22:29.474 DEBUG nova.openstack.common.rpc.amqp [req-9e61185b-5444-4f97-b711-6c65b716a2a0 demo demo] MSG_ID is ed357119f3cf4d5c82f1679808a5f185 from (pid=32776) multicall /opt/stack/nova/nova/openstack/common/rpc/amqp.py:589
2013-05-16 09:22:29.474 DEBUG nova.openstack.common.rpc.amqp [req-9e61185b-5444-4f97-b711-6c65b716a2a0 demo demo] UNIQUE_ID is e63169e368eb4b9cb5121c801e640d9d. from (pid=32776) _add_unique_id /opt/stack/nova/nova/openstack/common/rpc/amqp.py:337
2013-05-16 09:22:29.475 DEBUG amqp [-] Closed channel #1 from (pid=32776) _do_close /usr/local/lib/python2.7/dist-packages/amqp/channel.py:88
2013-05-16 09:22:29.475 DEBUG amqp [-] using channel_id: 1 from (pid=32776) __init__ /usr/local/lib/python2.7/dist-packages/amqp/channel.py:70
2013-05-16 09:22:29.476 DEBUG amqp [-] Channel open from (pid=32776) _open_ok /usr/local/lib/python2.7/dist-packages/amqp/channel.py:420
2013-05-16 09:22:29.477 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._poll_rebooting_instances from (pid=32776) run_periodic_tasks /opt/stack/nova/nova/openstack/common/periodic_task.py:175
2013-05-16 09:22:29.478 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._reclaim_queued_deletes from (pid=32776) run_periodic_tasks /opt/stack/nova/nova/openstack/common/periodic_task.py:175
2013-05-16 09:22:29.479 DEBUG nova.compute.manager [-] CONF.reclaim_instance_interval <= 0, skipping... from (pid=32776) _reclaim_queued_deletes /opt/stack/nova/nova/compute/manager.py:3980
2013-05-16 09:22:29.480 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._report_driver_status from (pid=32776) run_periodic_tasks /opt/stack/nova/nova/openstack/common/periodic_task.py:175
2013-05-16 09:22:29.481 INFO nova.compute.manager [-] Updating host status
2013-05-16 09:22:29.496 DEBUG amqp [-] Closed channel #1 from (pid=32776) _do_close /usr/local/lib/python2.7/dist-packages/amqp/channel.py:88
2013-05-16 09:22:29.497 DEBUG amqp [-] using channel_id: 1 from (pid=32776) __init__ /usr/local/lib/python2.7/dist-packages/amqp/channel.py:70
2013-05-16 09:22:29.498 DEBUG amqp [-] Channel open from (pid=32776) _open_ok /usr/local/lib/python2.7/dist-packages/amqp/channel.py:420
2013-05-16 09:22:29.606 WARNING nova.virt.vmwareapi.driver [-] Task [DeleteDatastoreFile_Task] (returnval){
   value = "task-273"
   _type = "Task"
 } status: error File [datastore01] instance-00000009 was not found
2013-05-16 09:22:29.607 WARNING nova.virt.vmwareapi.driver [-] In vmwareapi:_poll_task, Got this error Trying to re-send() an already-triggered event.
2013-05-16 09:22:29.607 ERROR nova.openstack.common.loopingcall [-] in fixed duration looping call
2013-05-16 09:22:29.607 TRACE nova.openstack.common.loopingcall Traceback (most recent call last):
2013-05-16 09:22:29.607 TRACE nova.openstack.common.loopingcall File "/opt/stack/nova/nova/openstack/common/loopingcall.py", line 78, in _inner
2013-05-16 09:22:29.607 TRACE nova.openstack.common.loopingcall self.f(*self.args, **self.kw)
2013-05-16 09:22:29.607 TRACE nova.openstack.common.loopingcall File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 585, in _poll_task
2013-05-16 09:22:29.607 TRACE nova.openstack.common.loopingcall done.send_exception(excep)
2013-05-16 09:22:29.607 TRACE nova.openstack.common.loopingcall File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 208, in send_exception
2013-05-16 09:22:29.607 TRACE nova.openstack.common.loopingcall return self.send(None, args)
2013-05-16 09:22:29.607 TRACE nova.openstack.common.loopingcall File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 150, in send
2013-05-16 09:22:29.607 TRACE nova.openstack.common.loopingcall assert self._result is NOT_USED, 'Trying to re-send() an already-triggered event.'
2013-05-16 09:22:29.607 TRACE nova.openstack.common.loopingcall AssertionError: Trying to re-send() an already-triggered event.
2013-05-16 09:22:29.607 TRACE nova.openstack.common.loopingcall
2013-05-16 09:22:30.867 DEBUG nova.openstack.common.rpc.amqp [-] Making synchronous call on conductor ... from (pid=32776) multicall /opt/stack/nova/nova/openstack/common/rpc/amqp.py:586
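
For context, the failure above happens inside volumeops.attach_disk_to_vm, which issues a ReconfigVM_Task adding a VirtualDisk backed by a datastore path. A rough sketch of that kind of call (illustrative pyVmomi, not nova's actual suds-based code; names and keys are placeholders):

    # Rough sketch of the attach that fails with "Invalid configuration
    # for device '1'": vCenter rejects the disk because the host chosen
    # for the VM cannot resolve the VMDK's datastore path.
    from pyVmomi import vim

    def attach_vmdk(vm, datastore_path, controller_key=1000):
        # Back an existing VMDK; the path is the "[datastore] dir/file.vmdk"
        # form seen in the logs above.
        backing = vim.vm.device.VirtualDisk.FlatVer2BackingInfo(
            fileName=datastore_path, diskMode='persistent')
        disk = vim.vm.device.VirtualDisk(
            controllerKey=controller_key, unitNumber=0,
            capacityInKB=0,          # existing disk; size read from the VMDK
            backing=backing)
        change = vim.vm.device.VirtualDeviceSpec(
            operation=vim.vm.device.VirtualDeviceSpec.Operation.add,
            device=disk)
        return vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change]))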

Tags: vmware
Changed in nova:
assignee: nobody → Shawn Hartsock (hartsock)
Michael Still (mikal)
Changed in nova:
status: New → Confirmed
importance: Undecided → Critical
Revision history for this message
Shawn Hartsock (hartsock) wrote :

Note: I am actively working on this.

Changed in nova:
status: Confirmed → In Progress
summary: - nova compute fails when vmware cluster has more than one ESXi Host
+ nova compute fails when vmware cluster has more than one ESXi Host and
+ NO shared datastores
summary: - nova compute fails when vmware cluster has more than one ESXi Host and
- NO shared datastores
+ nova compute fails when vmware cluster has NO shared datastores
Changed in nova:
assignee: Shawn Hartsock (hartsock) → nobody
Changed in nova:
assignee: nobody → Shawn Hartsock (hartsock)
Changed in nova:
milestone: none → havana-2
summary: - nova compute fails when vmware cluster has NO shared datastores
+ nova boot fails to attach vmdk in multi-host-cluster
Revision history for this message
Shawn Hartsock (hartsock) wrote : Re: nova boot fails to attach vmdk in multi-host-cluster

This only occurs with local storage.

Changed in nova:
importance: Critical → High
importance: High → Medium
summary: - nova boot fails to attach vmdk in multi-host-cluster
+ nova boot fails to attach vmdk in multi-host-cluster without DRS
Changed in nova:
importance: Medium → High
Revision history for this message
Shawn Hartsock (hartsock) wrote : Re: nova boot fails to attach vmdk in multi-host-cluster without DRS

I am currently working on a traversal spec for this problem. My current solution has introduced a new bug, so I'm troubleshooting that before I post a fix.
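
For reference, a "traversal spec" here means a PropertyCollector TraversalSpec that walks the vCenter inventory, e.g. from a cluster to its member hosts, so placement can choose a host that can actually reach the VMDK. A hypothetical sketch in pyVmomi (not the actual patch):

    # Hypothetical traversal-spec sketch: retrieve the HostSystem names
    # belonging to a cluster via the PropertyCollector.
    from pyVmomi import vim, vmodl

    def hosts_in_cluster(si, cluster):
        pc = si.content.propertyCollector

        # From a ComputeResource (ClusterComputeResource inherits it),
        # follow its "host" property to reach the HostSystem objects.
        traversal = vmodl.query.PropertyCollector.TraversalSpec(
            name='cluster_to_hosts',
            type=vim.ComputeResource, path='host', skip=False)

        obj_spec = vmodl.query.PropertyCollector.ObjectSpec(
            obj=cluster, skip=True, selectSet=[traversal])
        prop_spec = vmodl.query.PropertyCollector.PropertySpec(
            type=vim.HostSystem, pathSet=['name'])
        filter_spec = vmodl.query.PropertyCollector.FilterSpec(
            objectSet=[obj_spec], propSet=[prop_spec])

        result = pc.RetrieveContents([filter_spec])
        return [obj.propSet[0].val for obj in result]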

summary: - nova boot fails to attach vmdk in multi-host-cluster without DRS
+ nova boot fails to attach vmdk in multi-host environments
description: updated
Revision history for this message
Shawn Hartsock (hartsock) wrote : Re: nova boot fails to attach vmdk in multi-host environments

We've narrowed this problem down to situations where the vCenter inventory is not composed *only* of clusters with DRS and automatic placement turned on. So, after working this bug for a while, it does not seem as critical as it did.

Revision history for this message
dan wendlandt (danwent) wrote :

Can you update the title to indicate that this is less severe?

Revision history for this message
Shawn Hartsock (hartsock) wrote :

Follow up: Does this issue occur when there are multiple datacenters?

summary: - nova boot fails to attach vmdk in multi-host environments
+ nova boot fails to attach vmdk in multi-host environments without DRS
+ properly enabled
Changed in nova:
milestone: havana-2 → none
importance: High → Medium