Instance multi-create doesn't support available resources spread between children RPs

Bug #1874664 reported by Brin Zhang
This bug affects 1 person
Affects                    Status        Importance   Assigned to    Milestone
OpenStack Compute (nova)   In Progress   Medium       Wenping Song
Ussuri                     New           Medium       Unassigned

Bug Description

If a flavor asks for resources that are provided by nested Resource Provider inventories (e.g. VGPU) and the user wants multi-create (e.g. --max 2), the scheduler can return a NoValidHosts exception even though each nested Resource Provider could support at least one instance, because the total requested capacity cannot be satisfied by a single nested RP.

For example, if two child RPs each have an inventory of 4 VGPUs:
 - you can ask for a flavor with 2 VGPU with --max 2
 - but you can't ask for a flavor with 4 VGPU and --max 2
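
The two cases above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not nova's scheduler code: `ChildRP` and `multi_create_first_candidate` are hypothetical names, and the model assumes that every instance in the batch is claimed against the first child RP, which is the behaviour the original report below describes for alloc_reqs[0].

```python
from dataclasses import dataclass


@dataclass
class ChildRP:
    """Hypothetical stand-in for a nested resource provider holding VGPUs."""
    name: str
    total: int
    used: int = 0

    def can_fit(self, amount: int) -> bool:
        return self.used + amount <= self.total


def multi_create_first_candidate(vgpus_per_instance, num_instances,
                                 child_totals=(4, 4)):
    """Return True if all instances fit when only the first child RP is tried.

    Assumption: the claim for every instance targets the same (first)
    candidate, so the capacity of the second child RP is never used.
    """
    children = [ChildRP(f"child-{i}", total)
                for i, total in enumerate(child_totals)]
    first = children[0]
    for _ in range(num_instances):
        if not first.can_fit(vgpus_per_instance):
            # Modeled as the scheduler giving up on the request.
            return False
        first.used += vgpus_per_instance
    return True


print(multi_create_first_candidate(2, 2))  # 2 VGPU per instance, --max 2
print(multi_create_first_candidate(4, 2))  # 4 VGPU per instance, --max 2
```

In this model the first call succeeds because both instances fit on one child RP (2 + 2 = 4), while the second fails even though the other child RP still has 4 free VGPUs.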

======
Original report:

When booting more than one instance with accelerators, and the accelerators are all on one compute node, two problems occur:

First, because we always take the first item (alloc_reqs[0]) in alloc_reqs, iterating to the second instance throws a conflict exception when putting the allocations.

Second, because we always take the first item in alloc_reqs_by_rp_uuid.get(selected_host.uuid), selected_alloc_req never changes, so all the values in selections_to_return are the same. That is not correct for subsequent operations.
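
One way to picture the alternative is to try each allocation request in turn for every instance instead of always taking the first one. The sketch below is illustrative only and is not the proposed fix: `pick_alloc_reqs` is a hypothetical helper, allocation requests are reduced to plain `{rp_name: amount}` dicts, and free capacity is tracked in a local dict, whereas the real scheduler claims resources through the placement service.

```python
def pick_alloc_reqs(alloc_reqs, capacity, num_instances):
    """Pick one allocation request per instance, skipping exhausted RPs.

    alloc_reqs: list of {rp_name: amount} dicts (simplified candidates).
    capacity:   {rp_name: free_units} for the selected host's child RPs.
    """
    free = dict(capacity)  # work on a copy so the caller's dict is untouched
    selections = []
    for _ in range(num_instances):
        for req in alloc_reqs:
            # Accept the first request that still fits the remaining capacity.
            if all(free.get(rp, 0) >= amount for rp, amount in req.items()):
                for rp, amount in req.items():
                    free[rp] -= amount
                selections.append(req)
                break
        else:
            raise RuntimeError("no allocation request fits the remaining capacity")
    return selections


# Two child RPs with 4 VGPUs each, flavor asking for 4 VGPUs, two instances.
reqs = [{"child-0": 4}, {"child-1": 4}]
print(pick_alloc_reqs(reqs, {"child-0": 4, "child-1": 4}, 2))
```

With this selection loop the two instances end up with different allocation requests, one per child RP, instead of both reusing the first entry.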

For more details, see: https://etherpad.opendev.org/p/filter_scheduler_issue_with_accelerators

Brin Zhang (zhangbailin)
Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
assignee: nobody → Wenping Song (wenping1)
tags: added: schedul
tags: added: cyborg scheduler
removed: schedul
Sylvain Bauza (sylvain-bauza) wrote :

Given we are after RC1 (which means that we only accept regression bugfixes for RC2 and later versions), I think we should just document the current caveat in https://docs.openstack.org/api-guide/compute/accelerator-support.html and try to backport the bugfix to a later Ussuri release (say 21.0.1).

Balazs Gibizer (balazs-gibizer) wrote :

I confirmed with the reporter that this is an issue with instance multi create (creating more than one server with a single POST /servers request)

Balazs Gibizer (balazs-gibizer) wrote :

I agree with Sylvain that this is not an Ussuri GA stopper. But let's document the limitation before the Ussuri GA.

Balazs Gibizer (balazs-gibizer) wrote :

Also Sylvain in the meantime reproduced the same issue with VGPUs

summary: - Boot more than one instances failed with accelerators in its flavor
+ Instance multi-create doesn't support available resources spread between
+ children RPs
tags: added: vgpu
description: updated
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/723858
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=32bbbd698a2a9c5ca6f0b01662d94c64e21422b1
Submitter: Zuul
Branch: master

commit 32bbbd698a2a9c5ca6f0b01662d94c64e21422b1
Author: Sylvain Bauza <email address hidden>
Date: Tue Apr 28 12:17:08 2020 +0200

    Test multi create with vGPUs

    We had a bug in Rocky where multicreate wasn't working correctly, but given
    in Stein we provided Resource Providers for each pGPU, this is fixed now.

    NOTE: We have a related bug #1874664 because multicreate doesn't work with
    nested Resource Providers.
    We could btw. move the regression test to a specific module in the
    regressions tests subdirectory.

    Change-Id: I8154917ff142987e80dc711e3b2b3965a21f08d0
    Related-Bug: #1780225
    Related-Bug: #1874664

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/723884
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c19de075e1c6dcaeac1af25e77aa87c277fa295a
Submitter: Zuul
Branch: master

commit c19de075e1c6dcaeac1af25e77aa87c277fa295a
Author: zhangbailin <email address hidden>
Date: Tue Apr 28 20:09:00 2020 +0800

    Add nested resource providers limit for multi create

    In 21.0.0 Ussuri we were completed the nova-cyborg interaction feature,
    but there are some issue when multiple create instances.

    Creating servers with accelerators provisioned with the Cyborg service,
    if a flavor asks for resources that are provided by nested Resource
    Provider inventories (eg. VGPU) and the user wants multi-create (ie. say
    --max 2) then the scheduler could be returning a NoValidHosts exception
    even if each nested Resource Provider can support at least one specific
    instance, if the total wanted capacity is not supported by only one
    nested RP.

    For example,creating servers with accelerators provisioned with the
    Cyborg service, if two children RP have 4 VGPU inventories each:
     - you can ask for a flavor with 2 VGPU with --max 2
     - but you can't ask for a flavor with 4 VGPU and --max 2

    Related-Bug: #1874664
    Change-Id: I64647a6ba79c47c891134cedb49f03d3c61e8824

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/845747

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/845757

Changed in nova:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/846786
