Nova won't reschedule when specific hypervisor is set and request failed

Bug #1717916 reported by Vasyl Saienko
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Ironic CI is blocked due to frequent failures of
tempest.scenario.test_server_multinode.TestServerMultinode.test_schedule_to_all_nodes

The cause is that nova will not reschedule a failed instance when a specific hypervisor is requested [0].

[0] https://github.com/openstack/nova/blob/master/nova/scheduler/utils.py#L375-L381
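
For readers who do not want to follow the link, the check at [0] behaves roughly like the sketch below. This is a simplified, self-contained paraphrase and not the nova source: the function name `retry_enabled`, the `max_attempts` default, and the plain-dict `filter_properties` are assumptions made for illustration.

```python
def retry_enabled(filter_properties, max_attempts=3):
    """Return True if a failed build may be rescheduled (illustrative only)."""
    force_hosts = filter_properties.get("force_hosts", [])
    force_nodes = filter_properties.get("force_nodes", [])

    # A single forced host or a single forced node is treated as
    # "there is nowhere else to go", so retries are switched off.
    # For Ironic this is too strict: one host can manage many nodes.
    if max_attempts == 1 or len(force_hosts) == 1 or len(force_nodes) == 1:
        return False
    return True


# No forced target: a failed build can be rescheduled.
print(retry_enabled({}))
# One forced host: rescheduling is disabled, even if that host has many nodes.
print(retry_enabled({"force_hosts": ["ironic-compute-1"]}))
```

With exactly one forced host the function returns False regardless of how many nodes that host exposes, which is the behaviour this report complains about.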

Tags: ironic
Matt Riedemann (mriedem) wrote :

To be clear, the issue here is that force_hosts is set but force_nodes is not, correct? So what you want is for the reschedule to try other nodes on the forced host. The linked code is not accounting for the 1:M relationship between host:node for Ironic.

tags: added: ironic
Changed in nova:
status: New → Triaged
Matt Riedemann (mriedem) wrote :

The problem in this utility code is that we don't know whether the host has one node or several, and we don't want to look that up every time. Maybe that could be optimized to do the lookup only when force_hosts is specified and force_nodes isn't.
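
A minimal sketch of that optimization, under stated assumptions: `retry_enabled_node_aware` and the `count_nodes` callback are hypothetical names, and `count_nodes(host)` stands in for whatever lookup would return the number of nodes managed by a host. It is not an existing nova API.

```python
def retry_enabled_node_aware(filter_properties, count_nodes, max_attempts=3):
    """Decide whether retries stay enabled, looking up nodes only when needed."""
    force_hosts = filter_properties.get("force_hosts", [])
    force_nodes = filter_properties.get("force_nodes", [])

    # A single attempt or a single forced node leaves nothing to retry on.
    if max_attempts == 1 or len(force_nodes) == 1:
        return False

    # Only in the "one forced host, no forced node" case do we pay for
    # the node lookup (count_nodes is a hypothetical callback).
    if len(force_hosts) == 1 and not force_nodes:
        return count_nodes(force_hosts[0]) > 1

    return True


# Example with a fake inventory: one Ironic host managing three nodes.
inventory = {"ironic-compute-1": 3, "kvm-compute-1": 1}
print(retry_enabled_node_aware({"force_hosts": ["ironic-compute-1"]}, inventory.get))
print(retry_enabled_node_aware({"force_hosts": ["kvm-compute-1"]}, inventory.get))
```

The lookup cost is paid only on the forced-host path, so ordinary requests without force_hosts are unaffected.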

Matt Riedemann (mriedem) wrote :

tempest.scenario.test_server_multinode.TestServerMultinode.test_schedule_to_all_nodes has existed in Tempest for almost 2 years now. Why is this only a recent issue? Or was the test always blacklisted for Ironic CI jobs before this, and is only now being investigated?

Vasyl Saienko (vsaienko) wrote :

@Matt, thanks for looking into this.

I confirm this test was working before, but recently (I can't say for sure, probably around the time the Pike release was cut) we started experiencing problems with races in the scheduler. We recently increased scheduler/host_subset_size to 9999 (https://review.openstack.org/#/q/I0874fe3b3628cb3e662ee01f24c4599247fdc82d) to stabilize concurrent tests, but now test_schedule_to_all_nodes is failing frequently, and it looks like the same race in the scheduler.
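
To illustrate why a large host_subset_size reduces collisions between concurrent requests: as I understand the option, the scheduler picks randomly among the N best weighed hosts instead of always taking the top one. The sketch below is only a model of that idea; `pick_host` and its arguments are hypothetical names, not nova code.

```python
import random


def pick_host(weighed_hosts, host_subset_size, rng=random):
    """Pick one host at random from the best `host_subset_size` candidates.

    `weighed_hosts` is assumed to be sorted best-first.
    """
    # Clamp the subset to the list length, and to at least one host.
    subset_size = max(1, min(host_subset_size, len(weighed_hosts)))
    return rng.choice(weighed_hosts[:subset_size])


hosts = ["node-a", "node-b", "node-c"]

# Subset size 1: every concurrent request lands on the same best host.
print(pick_host(hosts, 1))

# Subset size 9999: the choice is spread over all available hosts.
rng = random.Random(0)
print({pick_host(hosts, 9999, rng) for _ in range(50)})
```

Spreading the choice makes collisions less likely but does not rule them out, so a request that loses the race still depends on rescheduling being enabled.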
