vmware driver fails when setting vmware.maximum_objects to small value

Bug #1940399 reported by Fabian Wiesel
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========

Two functions in the code (indirectly) call RetrievePropertiesEx with at most vmware.maximum_objects objects per result, but do not iterate over the remaining results if there are more.
The default value is 100, which is usually sufficient, but if someone sets the value lower, it causes the following problems.

Both are in the module nova.virt.vmwareapi.vm_util:
1. get_all_cluster_mors
   The function is called to find the cluster reference for the named cluster. If there are more clusters than maximum_objects, there is a chance that the driver won't find the cluster even though it is configured in the vCenter.

2. get_stats_from_cluster
   The function gets the statistics for all the hosts in a cluster. Since vSphere 7.0U1, the limit is 96 hosts per cluster, and quite likely the limit will increase past the default of 100, leaving the stats inaccurate. On top of that, the function does not call `cancel_retrieval`, causing a leak in the vCenter.
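
To illustrate the first problem, here is a minimal, self-contained sketch. It does not use the real vSphere or oslo.vmware API: `FakePropertyCollector`, `Page`, `find_cluster_first_page_only` and `find_cluster_all_pages` are hypothetical names made up for this illustration, and the paging token is simply a list offset. It only shows why inspecting the first page alone can miss a cluster once there are more clusters than maximum_objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    objects: List[str]
    token: Optional[int]  # None means there are no further pages


class FakePropertyCollector:
    """Hypothetical stand-in for a paged retrieval API (not the vSphere API)."""

    def __init__(self, names, maximum_objects):
        self._names = list(names)
        self._max = maximum_objects

    def retrieve(self):
        return self._page(0)

    def continue_retrieval(self, token):
        return self._page(token)

    def _page(self, start):
        end = start + self._max
        token = end if end < len(self._names) else None
        return Page(self._names[start:end], token)


def find_cluster_first_page_only(collector, name):
    # Mirrors the reported behaviour: only the first page is inspected.
    page = collector.retrieve()
    return name if name in page.objects else None


def find_cluster_all_pages(collector, name):
    # Keeps requesting pages until the cluster is found or no pages remain.
    page = collector.retrieve()
    while True:
        if name in page.objects:
            return name
        if page.token is None:
            return None
        page = collector.continue_retrieval(page.token)


collector = FakePropertyCollector(["cluster-a", "cluster-b"], maximum_objects=1)
print(find_cluster_first_page_only(collector, "cluster-b"))  # None
print(find_cluster_all_pages(collector, "cluster-b"))        # cluster-b
```

With maximum_objects=1 and two clusters, the first variant never sees cluster-b, which matches the "was not found in vCenter" failure described below.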

Steps to reproduce
==================

Take any vCenter version with more than one cluster and more than one host (I would suggest three or more), and any release of Nova configured to run against the vCenter.
I am not 100% sure in which order the clusters are returned, possibly in the order they were created, or alphabetically. I would suggest first creating a cluster-a, then a cluster-b, and adding the hosts to cluster-b. That reflects how our clusters are created (creation order matches alphabetical order).

* Additionally configure:
  [vmware]
  maximum_objects=1
* Try to start nova-compute

Expected result
===============
The nova-compute service would start up and get the stats from all the hosts in the configured cluster.

Actual result
=============

nova-compute fails to start with the error message:
> The specified cluster '<clustername>' was not found in vCenter

If you have two clusters and three hosts, increasing maximum_objects to two gets you around that failure and triggers the second problem.

You can verify that by checking the resources of the nova-compute node, which will report only the resources (CPUs, RAM, ...) of two of the ESXi hosts in the cluster.
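
The under-reporting can be sketched in a few lines. The host figures are made up, and `pages`, `sum_stats`, `stats_first_page_only` and `stats_all_pages` are hypothetical helpers, not Nova functions. The sketch assumes three hosts and maximum_objects=2, as in the scenario above.

```python
from typing import Dict, Iterator, List

HostStats = Dict[str, int]

# Made-up numbers for three ESXi hosts in one cluster.
HOSTS: List[HostStats] = [
    {"vcpus": 16, "memory_mb": 65536},
    {"vcpus": 16, "memory_mb": 65536},
    {"vcpus": 32, "memory_mb": 131072},
]


def pages(items, maximum_objects) -> Iterator[List[HostStats]]:
    # Split the host list into pages of at most maximum_objects entries.
    for start in range(0, len(items), maximum_objects):
        yield items[start:start + maximum_objects]


def sum_stats(hosts) -> HostStats:
    total = {"vcpus": 0, "memory_mb": 0}
    for host in hosts:
        total["vcpus"] += host["vcpus"]
        total["memory_mb"] += host["memory_mb"]
    return total


def stats_first_page_only(items, maximum_objects) -> HostStats:
    # Mirrors the reported behaviour: hosts beyond the first page are ignored.
    first_page = next(pages(items, maximum_objects), [])
    return sum_stats(first_page)


def stats_all_pages(items, maximum_objects) -> HostStats:
    return sum_stats(h for page in pages(items, maximum_objects) for h in page)


print(stats_first_page_only(HOSTS, 2))  # {'vcpus': 32, 'memory_mb': 131072}
print(stats_all_pages(HOSTS, 2))        # {'vcpus': 64, 'memory_mb': 262144}
```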

Environment
===========
1. Exact version of OpenStack you are running.
   370830e944 Merge "libvirt: Enable 'vmcoreinfo' feature by default"

   As far as I can see, all Nova releases are affected by this behavior.

2. Which hypervisor did you use?
   VMware vSphere
   What's the version of that?
   7.0.1-17327586

3. Which networking type did you use?
   Neutron with NSX-T (https://github.com/sapcc/networking-nsx-t)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/804968

Changed in nova:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/804968
Committed: https://opendev.org/openstack/nova/commit/d31b26e09d469748f2ae2cc1e5f6f57401834adc
Submitter: "Zuul (22348)"
Branch: master

commit d31b26e09d469748f2ae2cc1e5f6f57401834adc
Author: Fabian Wiesel <email address hidden>
Date: Tue Aug 17 16:36:06 2021 +0200

    VMWare: Use WithRetrieval to get all results

    In various places, the code implements its own iteration over the results,
    sometimes incorrectly.
    The following functions were only getting up to vmware.maximum_objects objects (100 by default):
    vm_util.get_all_cluster_mors, vm_util.get_stats_from_cluster.

    Previously, the results were fetched in batches of up to vmware.maximum_objects items.
    Using WithRetrieval yields an iterator to the results, which pages transparently to
    the next request.
    Consumers of the results were changed to work on an iterator where this was easily
    possible.

    Replaced the quadratic algorithm in `ds_util._filter_datastores_matching_storage_policy`
    with one of O(n log(n)) runtime

    Closes-Bug: #1940399
    Change-Id: I8283c3e76c595cb32527d1b8745933d044e22734
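
The idea described in the commit message, an iterator that pages transparently and releases the server-side retrieval when the consumer stops early, could look roughly like the sketch below. This is not the oslo.vmware implementation of WithRetrieval: `FakeSession` and `PagedRetrieval` are hypothetical classes written for this illustration, and the token is again just a list offset.

```python
class FakeSession:
    """Hypothetical paged backend that records whether a retrieval was cancelled."""

    def __init__(self, items, maximum_objects):
        self._items = list(items)
        self._max = maximum_objects
        self.cancelled = False
        self.requests = 0

    def fetch(self, start=0):
        self.requests += 1
        end = start + self._max
        token = end if end < len(self._items) else None
        return self._items[start:end], token

    def cancel(self, token):
        self.cancelled = True


class PagedRetrieval:
    """Context manager that iterates over all pages and cancels on early exit."""

    def __init__(self, session):
        self._session = session
        self._token = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # A pending token means the consumer stopped before the last page.
        if self._token is not None:
            self._session.cancel(self._token)
            self._token = None
        return False

    def __iter__(self):
        objects, self._token = self._session.fetch()
        while True:
            yield from objects
            if self._token is None:
                return
            objects, self._token = self._session.fetch(self._token)


full = FakeSession(["host-%d" % i for i in range(5)], maximum_objects=2)
with PagedRetrieval(full) as hosts:
    all_hosts = list(hosts)
print(len(all_hosts), full.requests, full.cancelled)  # 5 3 False

partial = FakeSession(["host-%d" % i for i in range(5)], maximum_objects=2)
with PagedRetrieval(partial) as hosts:
    first = next(iter(hosts))
print(first, partial.cancelled)  # host-0 True
```

Putting the cancellation into `__exit__` means callers cannot forget it, which is the kind of omission reported for get_stats_from_cluster.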

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.0.0.0rc1

This issue was fixed in the openstack/nova 24.0.0.0rc1 release candidate.
