Multi instance creation rescheduling fails due to a lack of alternates

Bug #1787606 reported by Lee Yarwood on 2018-08-17
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: Lee Yarwood

Bug Description

Description
===========

When more than one instance is created in a single request, the filter scheduler skips any host that has already been selected for another instance in that request when attempting to find alternates. Without alternates, instances cannot be rescheduled and enter an ERROR state if issues are encountered when spawning on their selected host.

For example, given a simple two-node environment and a request to create 5 instances, the following nested list of selections is returned:

[
[Selection(allocation_request='{"allocations": {"3f4bda1d-13ab-492b-9100-bf585c361170": {"resources": {"VCPU": 1, "MEMORY_MB": 512}}}}',allocation_request_version='1.25',cell_uuid=0e0078e9-420b-4c90-a01f-680477646b84,compute_node_uuid=3f4bda1d-13ab-492b-9100-bf585c361170,limits=SchedulerLimits,nodename='host1.example.com',service_host='host1.example.com')],

[Selection(allocation_request='{"allocations": {"9fa912ad-4b6a-478f-b2dc-aa305b552d64": {"resources": {"VCPU": 1, "MEMORY_MB": 512}}}}',allocation_request_version='1.25',cell_uuid=0e0078e9-420b-4c90-a01f-680477646b84,compute_node_uuid=9fa912ad-4b6a-478f-b2dc-aa305b552d64,limits=SchedulerLimits,nodename='host2.example.com',service_host='host2.example.com')],

[Selection(allocation_request='{"allocations": {"3f4bda1d-13ab-492b-9100-bf585c361170": {"resources": {"VCPU": 1, "MEMORY_MB": 512}}}}',allocation_request_version='1.25',cell_uuid=0e0078e9-420b-4c90-a01f-680477646b84,compute_node_uuid=3f4bda1d-13ab-492b-9100-bf585c361170,limits=SchedulerLimits,nodename='host1.example.com',service_host='host1.example.com')],

[Selection(allocation_request='{"allocations": {"9fa912ad-4b6a-478f-b2dc-aa305b552d64": {"resources": {"VCPU": 1, "MEMORY_MB": 512}}}}',allocation_request_version='1.25',cell_uuid=0e0078e9-420b-4c90-a01f-680477646b84,compute_node_uuid=9fa912ad-4b6a-478f-b2dc-aa305b552d64,limits=SchedulerLimits,nodename='host2.example.com',service_host='host2.example.com')],

[Selection(allocation_request='{"allocations": {"3f4bda1d-13ab-492b-9100-bf585c361170": {"resources": {"VCPU": 1, "MEMORY_MB": 512}}}}',allocation_request_version='1.25',cell_uuid=0e0078e9-420b-4c90-a01f-680477646b84,compute_node_uuid=3f4bda1d-13ab-492b-9100-bf585c361170,limits=SchedulerLimits,nodename='host1.example.com',service_host='host1.example.com')]
]

The above contains a single selection for each instance being created, with no alternates present. Compare that to the following list from a request to create a single instance:

[
[Selection(allocation_request='{"allocations": {"3f4bda1d-13ab-492b-9100-bf585c361170": {"resources": {"VCPU": 1, "MEMORY_MB": 512}}}}',allocation_request_version='1.25',cell_uuid=0e0078e9-420b-4c90-a01f-680477646b84,compute_node_uuid=3f4bda1d-13ab-492b-9100-bf585c361170,limits=SchedulerLimits,nodename='host1.example.com',service_host='host1.example.com'),
Selection(allocation_request='{"allocations": {"9fa912ad-4b6a-478f-b2dc-aa305b552d64": {"resources": {"VCPU": 1, "MEMORY_MB": 512}}}}',allocation_request_version='1.25',cell_uuid=0e0078e9-420b-4c90-a01f-680477646b84,compute_node_uuid=9fa912ad-4b6a-478f-b2dc-aa305b552d64,limits=SchedulerLimits,nodename='host2.example.com',service_host='host2.example.com')]
]

Here we have two selections: the originally selected host and an alternate. AFAICT the following conditional is at fault, as it currently checks whether the potential alternate has been selected for any other instance within the request:

https://github.com/openstack/nova/blob/83574f7e07f6a67b09226971dd8fb0ed5436f86e/nova/scheduler/filter_scheduler.py#L400
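
To make the suspected behaviour concrete, here is a minimal sketch in plain Python. It is a simplified model and not the code in filter_scheduler.py: the function names, the `num_alts` parameter and the use of bare host names instead of Selection objects are all assumptions made for illustration. The first function rejects a candidate if it was selected for any instance in the request. The second only rejects the host chosen for the instance in question, which is one possible alternative and not necessarily the correct fix.

```python
def alternates_excluding_all_selected(selected, candidates, num_alts):
    """Hypothetical model of the suspected behaviour.

    A candidate is rejected as an alternate if it was selected for ANY
    instance in the request.
    """
    result = []
    for chosen in selected:
        group = [chosen]  # the selected host always comes first
        for host in candidates:
            if len(group) >= num_alts + 1:
                break
            if host not in selected:
                group.append(host)
        result.append(group)
    return result


def alternates_excluding_own_host(selected, candidates, num_alts):
    """Hypothetical alternative: only skip the host chosen for THIS instance."""
    result = []
    for chosen in selected:
        group = [chosen]
        for host in candidates:
            if len(group) >= num_alts + 1:
                break
            if host != chosen:
                group.append(host)
        result.append(group)
    return result


hosts = ["host1.example.com", "host2.example.com"]
# Five instances spread across the two hosts, as in the output above.
five = [hosts[i % 2] for i in range(5)]

multi_all = alternates_excluding_all_selected(five, hosts, num_alts=1)
single_all = alternates_excluding_all_selected(hosts[:1], hosts, num_alts=1)
multi_own = alternates_excluding_own_host(five, hosts, num_alts=1)
```

In this model `multi_all` contains five single-entry lists, matching the multi-instance output above, while `single_all` contains one list with the selected host plus one alternate. `multi_own` gives every instance the other host as an alternate.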

Steps to reproduce
==================

* Launch more than one instance in a single request using min_count/max_count.
* Ensure instances are unable to spawn on at least one compute host.
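
For the first step, a rough sketch of a create-server request body that asks for several instances at once. It assumes `min_count` and `max_count` are passed inside the `server` object of the compute API create request. The helper name and the image and flavor values are placeholders, and other fields (for example `networks`) may be required depending on the microversion in use.

```python
import json


def build_multi_create_body(name, image_ref, flavor_ref, count):
    """Hypothetical helper: build a body requesting `count` instances at once."""
    return {
        "server": {
            "name": name,
            "imageRef": image_ref,
            "flavorRef": flavor_ref,
            # Same value for both, so exactly `count` instances are requested.
            "min_count": count,
            "max_count": count,
        }
    }


body = build_multi_create_body(
    name="reschedule-test",
    image_ref="IMAGE_UUID",  # placeholder, replace with a real image UUID
    flavor_ref="FLAVOR_ID",  # placeholder, replace with a real flavor ID
    count=5,
)
print(json.dumps(body, indent=2))
```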

Expected result
===============

Instances that are unable to spawn on one compute host are rescheduled elsewhere.

Actual result
=============

Instances that are unable to spawn on one compute host are not rescheduled and end up in an ERROR state.
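
The difference between the expected and actual results can be modelled with a small sketch. This is a simplification under stated assumptions and not nova's actual reschedule flow: `try_build` and `broken_hosts` are hypothetical names, and a failed spawn is modelled as simple membership in a set of broken hosts.

```python
def try_build(host_list, broken_hosts):
    """Try the selected host first, then each alternate in order.

    Returns a (state, host) tuple. With no usable host left in the list
    the instance ends up in ERROR.
    """
    for host in host_list:
        if host not in broken_hosts:
            return "ACTIVE", host
    return "ERROR", None


broken = {"host1.example.com"}

# Multi-instance case from this report: only the selected host is listed.
no_alternates = try_build(["host1.example.com"], broken)

# Single-instance case: the selected host plus one alternate.
with_alternate = try_build(
    ["host1.example.com", "host2.example.com"], broken
)
```

With host1 unable to spawn, `no_alternates` is `("ERROR", None)` while `with_alternate` is `("ACTIVE", "host2.example.com")`.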

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   master

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt + KVM

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Related fix proposed to branch: master
Review: https://review.openstack.org/593073

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: New → In Progress
Chris Dent (cdent) on 2018-08-17
tags: added: placement scheduler
Changed in nova:
importance: Undecided → Medium

Change abandoned by Lee Yarwood (<email address hidden>) on branch: master
Review: https://review.opendev.org/593073

Change abandoned by Lee Yarwood (<email address hidden>) on branch: master
Review: https://review.opendev.org/593074
