VMware driver: Nova compute fails to start when multiple nova compute services are running per vCenter.

Bug #1272286 reported by Chinmaya Bharadwaj
This bug affects 2 people
Affects                      Status        Importance  Assigned to        Milestone
OpenStack Compute (nova)     Fix Released  High        Sabari Murugesan
  Havana                     Fix Released  Undecided   Unassigned
VMwareAPI-Team               In Progress   High        Sabari Murugesan

Bug Description

Nova Compute fails to start when there are multiple nova compute services running on different VMs (nova compute VMs), each VM managing multiple clusters in a vCenter, and instances have been provisioned on them.

Explanation:

Let's say one nova compute VM (C1) is managing 5 clusters, and another (C2) is managing another 5 clusters, with C1 managing some number of instances. If the compute service on C2 gets restarted, it fails to start.

Reason:
On startup, nova-compute checks that the instances reported by the driver are still associated with this host; if they are not, it destroys them.
The method _destroy_evacuated_instances calls the driver's list_instances, which lists all the instances in the vCenter, even those managed by a different compute service. Instead, it should return only the VMs that are managed by C1/C2 respectively.
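
For illustration, here is a minimal, self-contained sketch of the reconciliation described above. The function name, VM names, and data shapes are assumptions for illustration only and do not match the actual nova code:

# Hypothetical, simplified reconciliation to illustrate the failure mode;
# names do not match the real nova implementation.
def find_instances_not_owned_by_host(driver_instances, host_instances):
    """Return VM names the driver reports that this host does not own.

    driver_instances: names returned by the driver's list_instances()
    host_instances:   names of instances the DB associates with this host
    """
    return set(driver_instances) - set(host_instances)

# With the buggy VMwareVCDriver, list_instances() returned every VM in the
# vCenter, so VMs owned by the other compute service (C1) showed up here
# when C2's compute service restarted.
c2_db_instances = ["vm-c2-01", "vm-c2-02"]
all_vcenter_vms = ["vm-c1-01", "vm-c1-02", "vm-c2-01", "vm-c2-02"]
print(find_instances_not_owned_by_host(all_vcenter_vms, c2_db_instances))
# -> {'vm-c1-01', 'vm-c1-02'}  (C1's VMs, wrongly considered by C2)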

Log file attached.

Revision history for this message
Chinmaya Bharadwaj (acbharadwaj) wrote :
summary: - VMware driver: Nova compute fails to start when two proxies are present
- per vcenter
+ VMware driver: Nova compute fails to start when multiple nova compute
+ services are running per vCenter.
description: updated
description: updated
description: updated
description: updated
Revision history for this message
Gary Kotton (garyk) wrote :

The stacktrace of the failure is:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/queue.py", line 107, in switch
    self.greenlet.switch(value)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 194, in main
    result = function(*args, **kwargs)
  File "/opt/stack/nova/nova/openstack/common/service.py", line 480, in run_service
    service.start()
  File "/opt/stack/nova/nova/service.py", line 172, in start
    self.manager.init_host()
  File "/opt/stack/nova/nova/compute/manager.py", line 805, in init_host
    self._init_instance(context, instance)
  File "/opt/stack/nova/nova/compute/manager.py", line 684, in _init_instance
    self.driver.plug_vifs(instance, net_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 703, in plug_vifs
    _vmops = self._get_vmops_for_compute_node(instance['node'])
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 522, in _get_vmops_for_compute_node
    resource = self._get_resource_for_node(nodename)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 514, in _get_resource_for_node
    raise exception.NotFound(msg)
NotFound: The resource domain-c7(Cluster31) does not exist
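
For context, a simplified illustration of the node lookup that fails above, assuming the multi-cluster driver only registers resources for the clusters it manages; the class, method, and node names below are illustrative, not the exact driver code:

class NodeNotFound(Exception):
    pass

# Illustrative stand-in for the VC driver's per-node bookkeeping: only the
# clusters managed by *this* compute service are registered.
class IllustrativeMultiClusterDriver(object):
    def __init__(self, managed_clusters):
        self._resources = dict.fromkeys(managed_clusters, object())

    def _get_resource_for_node(self, nodename):
        if nodename not in self._resources:
            # Corresponds to the NotFound in the traceback above: the
            # instance's node belongs to a cluster managed elsewhere.
            raise NodeNotFound("The resource %s does not exist" % nodename)
        return self._resources[nodename]

# C2 manages a different cluster, but the instance being re-initialized
# lives on Cluster31 (managed by C1), so the lookup fails during init_host.
driver = IllustrativeMultiClusterDriver(["domain-c10(Cluster40)"])
driver._get_resource_for_node("domain-c7(Cluster31)")  # raises NodeNotFound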

Changed in nova:
status: New → Confirmed
Gary Kotton (garyk)
Changed in nova:
importance: Undecided → High
milestone: none → icehouse-3
assignee: nobody → Gary Kotton (garyk)
Gary Kotton (garyk)
Changed in nova:
importance: High → Critical
Revision history for this message
Gary Kotton (garyk) wrote :
Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Joe Gordon (jogo) wrote :

We are trying to do a better job of making critical bugs mean 'all hands on deck', which means we can't have 13 of them open.

Changed in nova:
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/69209
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d46f4a5b6d17f20a9c1de9367af11c9f3e7a7ef0
Submitter: Jenkins
Branch: master

commit d46f4a5b6d17f20a9c1de9367af11c9f3e7a7ef0
Author: Gary Kotton <email address hidden>
Date: Sun Jan 26 08:57:14 2014 -0800

    VMware: fix exception when using multiple compute nodes

    When there is more than one compute node running and one of the
    nodes restarts it terminates on an exception that the resource is
    not found.

    The cause of the issue is that a vif plug was being attempted for
    a resource that did not exist. The vif plug should have raised an
    "NotImplemented" exception.

    Change-Id: I5a3f1cc73a981173b6c2fa493de3aad10a7e97fd
    Closes-bug: #1272286
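
A minimal sketch of the behaviour the commit message describes (the vif plug raising NotImplementedError rather than failing with NotFound for a node this compute service does not manage); this is illustrative only, not the actual patch:

# Illustrative sketch only -- not the actual patch.
class IllustrativeVCDriver(object):
    def plug_vifs(self, instance, network_info):
        """VIF plugging is not supported by this illustrative driver."""
        raise NotImplementedError()

# The caller can then treat the operation as unsupported rather than
# crashing during startup:
driver = IllustrativeVCDriver()
try:
    driver.plug_vifs(instance=None, network_info=None)
except NotImplementedError:
    pass  # unsupported by this driver; continue initializing the instance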

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/76317

Revision history for this message
Sabari Murugesan (smurugesan) wrote :

This bug also needs the patch in review: https://review.openstack.org/#/c/69262/4

Tracy Jones (tjones-i)
Changed in openstack-vmwareapi-team:
status: New → Fix Committed
importance: Undecided → High
Tracy Jones (tjones-i)
tags: added: havana-backport-potential
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Revision history for this message
Sabari Murugesan (smurugesan) wrote :

To completely fix this bug we also need https://review.openstack.org/#/c/69262/. This patch makes sure that the driver lists only the instances on its own nodes. Currently, it lists all the VMs known to the hypervisor, even the ones not managed by its nodes.
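
For illustration, a minimal sketch of the filtering that patch describes (restricting list_instances to VMs on the driver's own nodes); the function name, node names, and data shapes are assumptions, not the actual driver code:

# Hypothetical sketch of per-node filtering; not the actual VMwareVCDriver code.
def list_instances_for_managed_nodes(vcenter_inventory, managed_nodes):
    """Return only the VMs on clusters this compute service manages.

    vcenter_inventory: mapping of VM name -> cluster (node) it runs on,
                       i.e. everything visible in the vCenter inventory
    managed_nodes:     the cluster node names this driver manages
    """
    return [vm for vm, node in vcenter_inventory.items()
            if node in managed_nodes]

inventory = {
    "vm-c1-01": "domain-c7(Cluster31)",   # managed by the other compute (C1)
    "vm-c2-01": "domain-c10(Cluster40)",  # managed by this compute (C2)
}
print(list_instances_for_managed_nodes(inventory, {"domain-c10(Cluster40)"}))
# -> ['vm-c2-01']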

Changed in nova:
assignee: Gary Kotton (garyk) → Sabari Murugesan (smurugesan)
milestone: icehouse-3 → icehouse-rc1
status: Fix Released → In Progress
Changed in openstack-vmwareapi-team:
assignee: nobody → Sabari Murugesan (smurugesan)
status: Fix Committed → In Progress
Revision history for this message
Russell Bryant (russellb) wrote :

It appears that this is not a regression, so I'm moving it to the "icehouse-rc-potential" list.

tags: added: icehouse-rc-potential
Changed in nova:
milestone: icehouse-rc1 → none
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/69262
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=923c38c51fcd858daa4e909121d0142bd1fc3f08
Submitter: Jenkins
Branch: master

commit 923c38c51fcd858daa4e909121d0142bd1fc3f08
Author: Sabari Kumar Murugesan <email address hidden>
Date: Fri Oct 18 15:42:49 2013 -0700

    VMware: fix list_instances for multi-node driver

    VMwareVCDriver should only list instances in the nodes managed by
    it. Currently, it uses the an implementation that lists instances
    in the vCenter server inventory even if they are not in the nodes
    managed by the driver.

    Closes-bug: #1272286
    Change-Id: I56c81a759eacc8c595e97ac5ca372834b675ebff

Changed in nova:
status: In Progress → Fix Committed
Changed in nova:
milestone: none → icehouse-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/82866

Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/havana)

Reviewed: https://review.openstack.org/82866
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2885173947bb566eacb7b7970b7d6c3953b7d6d4
Submitter: Jenkins
Branch: stable/havana

commit 2885173947bb566eacb7b7970b7d6c3953b7d6d4
Author: Gary Kotton <email address hidden>
Date: Sun Jan 26 08:57:14 2014 -0800

    VMware: fix exception when using multiple compute nodes

    When there is more than one compute node running and one of the
    nodes restarts it terminates on an exception that the resource is
    not found.

    The cause of the issue is that a vif plug was being attempted for
    a resource that did not exist. The vif plug should have raised an
    "NotImplemented" exception.

    Change-Id: I5a3f1cc73a981173b6c2fa493de3aad10a7e97fd
    Closes-bug: #1272286
    (cherry picked from commit d46f4a5b6d17f20a9c1de9367af11c9f3e7a7ef0)

tags: added: in-stable-havana
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-rc1 → 2014.1