Incorrect host stats reported by VMWare VCDriver

Bug #1190515 reported by Sabari Murugesan
This bug affects 6 people
Affects                   Status         Importance  Assigned to       Milestone
OpenStack Compute (nova)  Fix Released   Medium      Sabari Murugesan
Grizzly                   Fix Released   Medium      Sabari Murugesan
VMwareAPI-Team            Fix Committed  High        Unassigned

Bug Description

Host stats for VCDriver should collect aggregate cluster stats
rather than that of a single host in the cluster.

Because it collects stats from each individual host,
nova-compute service fails to start when a cluster contains disconnected ESXi hosts.

AttributeError: 'Text' object has no attribute 'overallMemoryUsage'
Removing descriptor: 6
2013-06-11 18:52:29.184 ERROR nova.openstack.common.threadgroup [-] 'Text' object has no attribute 'overallMemoryUsage'
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup Traceback (most recent call last):
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup File "/opt/stack/nova/nova/openstack/common/threadgroup.py", line 117, in wait
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup x.wait()
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup File "/opt/stack/nova/nova/openstack/common/threadgroup.py", line 49, in wait
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup return self.thread.wait()
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 168, in wait
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup return self._exit_event.wait()
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup return hubs.get_hub().switch()
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup return self.greenlet.switch()
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 194, in main
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup result = function(*args, **kwargs)
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup File "/opt/stack/nova/nova/openstack/common/service.py", line 65, in run_service
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup service.start()
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup File "/opt/stack/nova/nova/service.py", line 155, in start
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup self.manager.init_host()
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup File "/opt/stack/nova/nova/compute/manager.py", line 646, in init_host
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup self._report_driver_status(context)
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup File "/opt/stack/nova/nova/compute/manager.py", line 3816, in _report_driver_status
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup capabilities = self.driver.get_host_stats(refresh=True)
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 305, in get_host_stats
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup return self.host_state.get_host_stats(refresh=refresh)
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 368, in host_state
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup self._cluster)
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup File "/opt/stack/nova/nova/virt/vmwareapi/host.py", line 155, in __init__
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup self.update_status()
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup File "/opt/stack/nova/nova/virt/vmwareapi/host.py", line 201, in update_status
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup summary.quickStats.overallMemoryUsage
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup AttributeError: 'Text' object has no attribute 'overallMemoryUsage'
2013-06-11 18:52:29.184 TRACE nova.openstack.common.threadgroup
2013-06-11 18:52:29.249 DEBUG amqp [-] Closed channel #1 from (pid=28770) _do_close /usr/local/lib/python2.7/dist-packages/amqp/channel.py:88

Tags: vmware
Sabari Murugesan (smurugesan) wrote :

Update:

When a cluster contains hosts in the disconnected state, some of their dynamic properties are not available, but VCHost still tries to read them. One possible fix is to check whether a host is disconnected before reading those properties.
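
A minimal sketch of the disconnected-host check described above, using plain Python stand-ins rather than real vSphere objects. The `connectionState` value names follow the vSphere host runtime states; the `host_memory_usage` helper and the object layout are assumptions made for illustration, not the driver's actual code.

```python
from types import SimpleNamespace


def host_memory_usage(host):
    """Return overallMemoryUsage in MB, or None if the host is not connected.

    `host` is a hypothetical stand-in exposing `runtime.connectionState` and
    `summary.quickStats`; it is not the object the driver actually receives.
    """
    if host.runtime.connectionState != "connected":
        # Dynamic properties such as quickStats are not populated for
        # disconnected hosts, so do not touch them.
        return None
    return host.summary.quickStats.overallMemoryUsage


connected = SimpleNamespace(
    runtime=SimpleNamespace(connectionState="connected"),
    summary=SimpleNamespace(quickStats=SimpleNamespace(overallMemoryUsage=2048)),
)
disconnected = SimpleNamespace(
    runtime=SimpleNamespace(connectionState="disconnected"),
    # A text value instead of an object: reading .overallMemoryUsage on it
    # would raise an AttributeError like the one in the traceback.
    summary=SimpleNamespace(quickStats=""),
)

print(host_memory_usage(connected))
print(host_memory_usage(disconnected))
```

A guard like this only avoids the crash; it does not by itself make the reported stats represent the whole cluster.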

Sabari Murugesan (smurugesan) wrote :

I can provide a fix for this issue.

Changed in nova:
assignee: nobody → Sabari Kumar Murugesan (smurugesan)
Sabari Murugesan (smurugesan) wrote :

The following page lists the properties that are not available when a host is in the disconnected state:

http://pubs.vmware.com/vsphere-50/index.jsp#com.vmware.wssdk.apiref.doc_50/vim.host.Summary.QuickStats.html

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/33100

Changed in nova:
status: New → In Progress
Sabari Murugesan (smurugesan) wrote : Re: disconnected ESXi Hosts cause VMWare driver failure

Initial triage: VCState tries to retrieve a property (overallMemoryUsage) from a host, and that property is not available while the host is disconnected. Ref: http://pubs.vmware.com/vsphere-50/index.jsp#com.vmware.wssdk.apiref.doc_50/vim.host.Summary.QuickStats.html

Updated triage: On a closer look, VCDriver was reporting incorrect host stats altogether. Rather than reporting the stats of one particular host within the cluster, it should report the aggregate stats of the cluster.

Making this change also fixes the disconnected-hosts issue, because the cluster provides an 'effectiveMemoryUsage' property that is always available. This property is set to 0 if all the hosts are disconnected.

Changed in nova:
importance: Undecided → High
milestone: none → havana-2
Shawn Hartsock (hartsock) wrote :

Note: this could cause the scheduler to over-commit vCenter/ESX

Changed in nova:
milestone: havana-2 → havana-3
tags: added: grizzly-backport-potential
summary: - disconnected ESXi Hosts cause VMWare driver failure
+ Incorrect host stats reported by VMWare VCDriver
description: updated
Sabari Murugesan (smurugesan) wrote :

For new reviewers:

What was broken?
VCDriver connects to a vCenter cluster, which is an aggregate of physical hosts. VCDriver reported incorrect resource statistics to the tracker: it picked a host in the cluster at random and reported that host's resource stats.

How was it fixed?
The metrics are now calculated as follows:

vcpus = sum(pCPUs of hosts in the cluster)
vcpus_used : 0 (Not fixed by the patch)

host_memory_total = Effective Memory of a cluster, which is defined as the memory available to run virtual machines. This is the aggregated effective resource level from all running hosts. Hosts that are in maintenance mode or are unresponsive are not counted. Resources used by the VMware Service Console are not included in the aggregate.

host_memory_free = effectiveMemory - consumedMemory, where consumedMemory is the current memory usage of all VMs across the cluster.

Note: Currently, the nova scheduler does not honor the vcpus_used value reported by the driver. It logs the hypervisor's view of resource consumption and computes vcpus_used directly from the instances provisioned on the compute node. Because reporting vcpus_used was not critical, it is not fixed here.
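
The calculation above can be sketched as follows. This is an illustration with plain Python data, not the merged patch: the `Host` and `ClusterSummary` types, their field names, and the `cluster_stats` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    num_cpu_cores: int  # physical CPUs on this host (hypothetical field)


@dataclass
class ClusterSummary:
    effective_memory_mb: int  # memory available to run VMs across the cluster
    consumed_memory_mb: int   # current memory usage of all VMs in the cluster


def cluster_stats(hosts: List[Host], summary: ClusterSummary) -> dict:
    """Aggregate cluster-level stats instead of reading a single host."""
    return {
        "vcpus": sum(h.num_cpu_cores for h in hosts),
        "vcpus_used": 0,  # not fixed by the patch
        "host_memory_total": summary.effective_memory_mb,
        "host_memory_free": (
            summary.effective_memory_mb - summary.consumed_memory_mb
        ),
    }


stats = cluster_stats(
    [Host(8), Host(16)],
    ClusterSummary(effective_memory_mb=65536, consumed_memory_mb=20480),
)
print(stats)
```

Because the memory figures come from the cluster summary rather than from any one host, a disconnected host lowers the totals instead of raising an error.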

Tracy Jones (tjones-i)
tags: added: vmware-co-preferred
Shawn Hartsock (hartsock) wrote :

This bug can cause the nova scheduler to accidentally over-provision a vCenter. That can cause all kinds of mayhem.

tags: removed: vmware-co-preferred
Changed in openstack-vmwareapi-team:
status: New → In Progress
importance: Undecided → Critical
Changed in openstack-vmwareapi-team:
importance: Critical → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/grizzly)

Fix proposed to branch: stable/grizzly
Review: https://review.openstack.org/43582

Thierry Carrez (ttx)
Changed in nova:
milestone: havana-3 → havana-rc1
Changed in nova:
importance: High → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/33100
Committed: http://github.com/openstack/nova/commit/92983257bb8e19dff54cc8a2188cc692dcafe5b8
Submitter: Jenkins
Branch: master

commit 92983257bb8e19dff54cc8a2188cc692dcafe5b8
Author: Sabari Kumar Murugesan <email address hidden>
Date: Thu Jun 13 01:30:47 2013 -0700

    Fixes host stats for VMWareVCDriver

    Host stats for VCDriver should collect aggregate cluster stats
    rather than that of a single host in the cluster.

    Fixes: bug #1190515
    Change-Id: I37e46995c5da2e3052e8178098afee7c8061bb3c

Changed in nova:
status: In Progress → Fix Committed
Changed in openstack-vmwareapi-team:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/grizzly)

Fix proposed to branch: stable/grizzly
Review: https://review.openstack.org/50147

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/grizzly)

Reviewed: https://review.openstack.org/50147
Committed: http://github.com/openstack/nova/commit/7b013a59a625944980ccee9d933b5fd59d7970ec
Submitter: Jenkins
Branch: stable/grizzly

commit 7b013a59a625944980ccee9d933b5fd59d7970ec
Author: Sabari Kumar Murugesan <email address hidden>
Date: Thu Jun 13 01:30:47 2013 -0700

    Fixes host stats for VMWareVCDriver

    Host stats for VCDriver should collect aggregate cluster stats
    rather than that of a single host in the cluster.

    Fixes: bug #1190515
    Change-Id: I37e46995c5da2e3052e8178098afee7c8061bb3c
    (cherry picked from commit 92983257bb8e19dff54cc8a2188cc692dcafe5b8)

tags: added: in-stable-grizzly
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-rc1 → 2013.2
Alan Pevec (apevec)
tags: removed: grizzly-backport-potential in-stable-grizzly