VMware: resize fails when there is more than one compute node

Bug #1345460 reported by Gary Kotton
This bug affects 3 people
Affects                    Status         Importance   Assigned to   Milestone
OpenStack Compute (nova)   Fix Released   High         Gary Kotton
Juno                       Fix Released   High         Gary Kotton
VMwareAPI-Team             New            Critical     Gary Kotton

Bug Description

Running a nova resize on an instance when using the vSphere driver causes the instance to go into the error state.

The problem is that the scheduler picks another host on which to spin up the new, resized instance. When the user confirms the resize, nova fails because it is looking for the instance on the old compute node.

Here is the traceback.

2014-07-16 18:14:55.271 13228 DEBUG amqp [-] Closed channel #1 _do_close /usr/lib/python2.7/dist-packages/amqp/channel.py:88
2014-07-16 18:14:55.271 13228 DEBUG amqp [-] using channel_id: 1 __init__ /usr/lib/python2.7/dist-packages/amqp/channel.py:70
2014-07-16 18:14:55.273 13228 DEBUG amqp [-] Channel open _open_ok /usr/lib/python2.7/dist-packages/amqp/channel.py:420
2014-07-16 18:14:55.299 13228 ERROR nova.openstack.common.rpc.amqp [req-3cae0e1d-2cf4-4da7-9aac-0ea5279b829d cherkasj 37af63b6867d4fe38ac312ca626ce186] Exception during message handling
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp Traceback (most recent call last):
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 461, in _process_data
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp **args)
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/dispatcher.py", line 172, in dispatch
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp result = getattr(proxyobj, method)(ctxt, **kwargs)
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 353, in decorated_function
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp return function(self, context, *args, **kwargs)
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 90, in wrapped
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp payload)
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 73, in wrapped
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp return f(self, context, *args, **kw)
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 294, in decorated_function
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp function(self, context, *args, **kwargs)
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 271, in decorated_function
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp e, sys.exc_info())
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 258, in decorated_function
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp return function(self, context, *args, **kwargs)
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2683, in confirm_resize
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp do_confirm_resize(context, instance, migration_id)
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/openstack/common/lockutils.py", line 246, in inner
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp return f(*args, **kwargs)
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2680, in do_confirm_resize
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp migration=migration)
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2707, in _confirm_resize
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp network_info)
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/virt/vmwareapi/driver.py", line 465, in confirm_migration
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp _vmops = self._get_vmops_for_compute_node(instance['node'])
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/virt/vmwareapi/driver.py", line 567, in _get_vmops_for_compute_node
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp resource = self._get_resource_for_node(nodename)
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/virt/vmwareapi/driver.py", line 559, in _get_resource_for_node
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp raise exception.NotFound(msg)
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp NotFound: The resource domain-c1122(compute02) does not exist
2014-07-16 18:14:55.299 13228 TRACE nova.openstack.common.rpc.amqp
2014-07-16 18:14:57.595 13228 DEBUG nova.openstack.common.vmware.api [-] Waiting for function _invoke_api to return. func /usr/lib/python2.7/dist-packages/nova/openstack/common/vmware/api.py:120
2014-07-16 18:14:57.598 13228 DEBUG nova.openstack.common.vmware.api [-] Invoking _invoke_api; retry count is 0. _func /usr/lib/python2.7/dist-packages/nova/openstack/common/vmware/api.py:83
2014-07-16 18:14:57.598 13228 DEBUG nova.openstack.common.vmware.api [-] Invoking method <module 'nova.virt.vmwareapi.vim_util' from '/usr/lib/python2.7/dist-packages/nova/virt/vmwareap

Gary Kotton (garyk)
Changed in nova:
importance: Undecided → Critical
assignee: nobody → Gary Kotton (garyk)
milestone: none → juno-2
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/108225

Changed in nova:
status: New → In Progress
Gary Kotton (garyk) wrote :

The problem is that certain operations update the instance node. The node is then validated against the cluster configuration on the compute node. This is wrong, as the compute node just needs to perform a VC operation.
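
A minimal sketch of the failure mode described above, assuming a compute service that only knows the clusters in its own configuration. It is a toy model, not the real VMwareVCDriver: `SketchDriver`, `confirm_migration_strict`, `confirm_migration_relaxed` and the string stand-ins for the ops objects are hypothetical. Only `_get_vmops_for_compute_node`, the `NotFound` exception and the error text are taken from the traceback.

```python
class NotFound(Exception):
    """Stand-in for nova's exception.NotFound (assumption for this sketch)."""


class SketchDriver:
    """Toy model of one compute service; hypothetical, not nova code."""

    def __init__(self, resources):
        # Maps node name -> per-cluster ops object, for the clusters
        # configured on THIS compute service only.
        self._resources = dict(resources)
        # Base ops object that needs only the instance details.
        self._base_ops = "base-vmops"

    def _get_vmops_for_compute_node(self, nodename):
        # Fails for any node that belongs to another compute service.
        if nodename not in self._resources:
            raise NotFound("The resource %s does not exist" % nodename)
        return self._resources[nodename]

    def confirm_migration_strict(self, instance):
        # Behaviour seen in the traceback: validate the instance's node first.
        return self._get_vmops_for_compute_node(instance["node"])

    def confirm_migration_relaxed(self, instance):
        # Direction described in the comment: a plain VC operation does not
        # need the node-to-cluster lookup, so none is performed.
        return self._base_ops
```

With a driver that knows only one local cluster, an instance whose node is `domain-c1122(compute02)` raises `NotFound` in the strict variant, while the relaxed variant returns the base ops object without consulting the local cluster map.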

tags: added: havana-backport-potential icehouse-backport-potential vmware
John Garbutt (johngarbutt) wrote :

This only affects VMware, so downgrading to High. Please see:
https://wiki.openstack.org/wiki/Bugs#Importance

Changed in nova:
importance: Critical → High
John Garbutt (johngarbutt) wrote :

This doesn't seem like it should block J-2, so I am removing the J-2 target for now. Please shout on IRC if this is the wrong call.

Changed in nova:
milestone: juno-2 → none
Tracy Jones (tjones-i)
Changed in openstack-vmwareapi-team:
importance: Undecided → High
Tracy Jones (tjones-i)
Changed in openstack-vmwareapi-team:
assignee: nobody → Gary Kotton (garyk)
Gary Kotton (garyk)
Changed in nova:
importance: High → Critical
Changed in openstack-vmwareapi-team:
importance: High → Critical
Changed in nova:
milestone: none → juno-3
Kanagaraj Manickam (kanagaraj-manickam) wrote :

Hi Gary,

I have investigated this issue and found a different solution:

In nova.compute.api, in the resize() method, with allow_resize_to_same_host = True, the following code could help to solve the issue:

        if not CONF.allow_resize_to_same_host:
            filter_properties['ignore_hosts'].append(instance['host'])
        else:
            if instance['host'] != instance['node']:
                filter_properties['force_nodes'] = [instance['node']]
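
The snippet above can be exercised outside nova with a small stand-alone sketch. The function name `build_filter_properties` and the plain boolean parameter (standing in for `CONF.allow_resize_to_same_host`) are assumptions made for illustration. The sketch only builds the dictionary and does not call the scheduler.

```python
def build_filter_properties(instance, allow_resize_to_same_host):
    """Build scheduler filter properties as in the proposal (hypothetical helper)."""
    filter_properties = {"ignore_hosts": []}
    if not allow_resize_to_same_host:
        # Resizing to the same host is not allowed: exclude the current host.
        filter_properties["ignore_hosts"].append(instance["host"])
    elif instance["host"] != instance["node"]:
        # Host and node differ (as with a cluster-backed node): pin the
        # resize to the node the instance already lives on.
        filter_properties["force_nodes"] = [instance["node"]]
    return filter_properties
```

For an instance with host `compute02` and node `domain-c1122(compute02)`, enabling the option yields a `force_nodes` entry for that node, while disabling it adds `compute02` to `ignore_hosts`.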

Tracy Jones (tjones-i) wrote :

As this only affects one hypervisor, by definition it cannot be critical.

Changed in nova:
importance: Critical → High
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-3 → juno-rc1
Michael Still (mikal) wrote :

Is anyone working on this bug? If not we might need to untarget it from rc1.

Michael Still (mikal)
Changed in nova:
milestone: juno-rc1 → none
Gary Kotton (garyk) wrote :

A patch has been in review - https://review.openstack.org/108225.

Changed in nova:
importance: High → Critical
Gary Kotton (garyk) wrote :

Hopefully we can have this for RC2.

Changed in nova:
milestone: none → 2014.1.3
Changed in nova:
milestone: 2014.1.3 → none
Yaguang Tang (heut2008)
tags: removed: havana-backport-potential
Changed in nova:
assignee: Gary Kotton (garyk) → Mike Durnosvistov (mdurnosvistov)
Joe Gordon (jogo) wrote :

Please stop marking this as critical.

Changed in nova:
importance: Critical → High
Changed in nova:
assignee: Mike Durnosvistov (mdurnosvistov) → Gary Kotton (garyk)
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/108225
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8e4a9156f4dccf003970848c28b8a9d15c55212d
Submitter: Jenkins
Branch: master

commit 8e4a9156f4dccf003970848c28b8a9d15c55212d
Author: Gary Kotton <email address hidden>
Date: Sat Jul 19 23:24:54 2014 -0700

    VMware: fix exception when multiple compute nodes are running

    A number of operations in the VMwareVCDriver class first validate
    that the instance's node is the cluster that is mapped to the compute
    node. This is problematic when the compute nodes have different
    configurations, for example, each compute node is mapped to a
    different cluster.

    In this case many operations that are just performing instance operations
    will fail. This patch ensures that all instance operations that do not
    require a cluster or volume will make use of the base _vmops class.
    This is due to the fact that it only requires the instance details
    to interface with the VC and there are no specific cluster operations.

    Change-Id: I2bc38a480f2feb12ea41e7d28f80b29dd49a79b8
    Closes-bug: #1345460

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/137386

Thierry Carrez (ttx)
Changed in nova:
milestone: none → kilo-1
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/137386
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=acefbcb20408ec6ee1fe666524a2a20449969f1f
Submitter: Jenkins
Branch: stable/juno

commit acefbcb20408ec6ee1fe666524a2a20449969f1f
Author: Gary Kotton <email address hidden>
Date: Sat Jul 19 23:24:54 2014 -0700

    VMware: fix exception when multiple compute nodes are running

    A number of operations in the VMwareVCDriver class first validate
    that the instance's node is the cluster that is mapped to the compute
    node. This is problematic when the compute nodes have different
    configurations, for example, each compute node is mapped to a
    different cluster.

    In this case many operations that are just performing instance operations
    will fail. This patch ensures that all instance operations that do not
    require a cluster or volume will make use of the base _vmops class.
    This is due to the fact that it only requires the instance details
    to interface with the VC and there are no specific cluster operations.

    Change-Id: I2bc38a480f2feb12ea41e7d28f80b29dd49a79b8
    Closes-bug: #1345460
    (cherry picked from commit 8e4a9156f4dccf003970848c28b8a9d15c55212d)

tags: added: in-stable-juno
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-1 → 2015.1.0