nova-scheduler does not honor max_instances_per_host set on a host aggregate

Bug #1740320 reported by Supreeth Shivanand
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Undecided
Assigned to: vinay harsha mitta

Bug Description

Description: nova-scheduler schedules more instances than the max_instances_per_host value set on a host aggregate.

Root cause: nova-scheduler has a NumInstancesFilter which filters out hosts that already have max_instances_per_host instances. To do that it relies on host_state.num_instances, which is retrieved from the stats of the compute_node object. Due to a race condition, the stats of the compute_node object are sometimes retrieved as {}, which sets host_state.num_instances to 0. Because of this, nova-scheduler schedules more than max_instances_per_host instances, thinking the current host has 0 instances.
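
For context, the filter's pass check is essentially a comparison of host_state.num_instances against the configured limit. The following is a simplified sketch of that logic, not the exact Newton source; the hardcoded limit stands in for the filter_scheduler.max_instances_per_host config option:

# Simplified sketch of NumInstancesFilter's core check; the real filter
# lives in nova/scheduler/filters/num_instances_filter.py.
class NumInstancesFilter(object):

    def _get_max_instances_per_host(self, host_state, spec_obj):
        # In nova this is read from configuration; hardcoded here only
        # to keep the sketch self-contained.
        return 50

    def host_passes(self, host_state, spec_obj):
        # host_state.num_instances is populated from the compute_node
        # stats; when those stats race to {}, this reads as 0 and a
        # full host wrongly passes the filter.
        num_instances = host_state.num_instances
        max_instances = self._get_max_instances_per_host(host_state, spec_obj)
        return num_instances < max_instances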

Workaround fix: changed the NumInstancesFilter to fall back to len(host_state.instances) when host_state.num_instances is 0.

Steps to reproduce:
1. I created 3 Heat autoscaling stacks, forced to create instances on one host aggregate.
2. Load the CPU on the instances of each of these stacks, and set the autoscale cooldown value to 30 seconds (typically some low value, so that we force the race condition).
3. Once the number of instances crosses max_instances_per_host on all the hosts in the host aggregate, instances start ending up in the error state (filtered by AggregateNumInstancesFilter; see the sketch after this list for how the aggregate limit is resolved).
4. But sometimes (due to the race condition), the stats in the compute_node objects become {} and AggregateNumInstancesFilter doesn't filter out hosts with num instances >= max_instances_per_host.
5. I start seeing more than max_instances_per_host instances on these hosts.
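
For reference, AggregateNumInstancesFilter takes the per-host limit from aggregate metadata rather than from global configuration. Below is a simplified, hypothetical sketch of that limit resolution; the helper name and the min() tie-break are assumptions here, not the exact nova source:

# Hypothetical, simplified sketch of resolving the aggregate-scoped
# limit; not the exact nova implementation.
def get_max_instances_per_host(host_state, default_limit):
    values = set()
    for aggregate in host_state.aggregates:
        value = aggregate.metadata.get('max_instances_per_host')
        if value is not None:
            values.add(int(value))
    # If the host is in several aggregates that all set the key, take
    # the strictest limit; otherwise fall back to the config default.
    return min(values) if values else default_limit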

Expected result:
Hosts in host aggregates that have the max_instances_per_host tag should never be scheduled with more than max_instances_per_host instances.

Actual Result:
Hosts in host aggregates that have the max_instances_per_host tag are scheduled with more than max_instances_per_host instances.

Environment:
openstack nova version: Newton release
Hypervisor: Libvirt + KVM

Tags: scheduler
Revision history for this message
Matt Riedemann (mriedem) wrote :

You might be the person I was talking with about this issue in IRC. I pushed a couple of debug patches to see if we can see the same thing in our CI testing, but I haven't dug into the logs yet:

https://review.openstack.org/#/q/status:open+branch:master+topic:num-instances-filter
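
For context, the warning quoted in the next comment would come from a consistency check along these lines. This is an assumed sketch of what the debug patches do, not the exact code under review:

import logging

LOG = logging.getLogger(__name__)

# Assumed sketch of a check inside the filter's host_passes(): compare
# the stats-reported count against the instance dict the HostState
# actually tracks.
def check_instance_counts(host_state):
    num_instances = host_state.num_instances
    tracked = len(host_state.instances)
    if num_instances != tracked:
        LOG.warning('Reported number of instances (%d) does not match '
                    'the tracked number of instances (%d).',
                    num_instances, tracked)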

tags: added: scheduler
Revision history for this message
Matt Riedemann (mriedem) wrote :

The test does show the issue:

http://logs.openstack.org/67/529867/1/check/tempest-full/23d2919/controller/logs/screen-n-sch.txt.gz#_Dec_22_18_14_13_693048

Dec 22 18:14:13.693048 ubuntu-xenial-ovh-bhs1-0001579658 nova-scheduler[9206]: WARNING nova.scheduler.filters.num_instances_filter [None req-75ecdad4-ef94-46c0-bd60-293038eadbd2 tempest-ServerRescueTestJSON-754723157 tempest-ServerRescueTestJSON-754723157] Reported number of instances (0) does not match the tracked number of instances (1).

That shows up twice in that one Tempest run.

I don't know what the root cause is, and I'm not sure we should just paper over it by checking for 0 and then using the other value if they aren't the same.

Changed in nova:
status: New → Confirmed
Revision history for this message
Matt Riedemann (mriedem) wrote :

In this case it's not even close:

http://logs.openstack.org/67/529867/1/check/legacy-tempest-dsvm-neutron-multinode-full/35b6e42/logs/screen-n-sch.txt.gz#_Dec_22_18_32_11_248662

Dec 22 18:32:11.248662 ubuntu-xenial-ovh-bhs1-0001579701 nova-scheduler[32076]: WARNING nova.scheduler.filters.num_instances_filter [None req-683d676a-ca4d-408a-8e9e-6ac7ff7c06d8 tempest-ServerRescueNegativeTestJSON-1276252261 tempest-ServerRescueNegativeTestJSON-1276252261] Reported number of instances (0) does not match the tracked number of instances (3).

Revision history for this message
Supreeth Shivanand (supreeth90) wrote :

Yes, I think you are the one I spoke to in IRC. I added the patch you suggested in our environment and so far it's been good. I am sharing it below:

It goes like:
num_instances = host_state.num_instances or len(host_state.instances)
if len(host_state.instances) != host_state.num_instances:
    LOG.warning("NumInstancesFilter: num_instances doesn't match "
                "host_state.instances for host: %s, num_instances: %s, "
                "len_instances: %s", host_state.host,
                host_state.num_instances, len(host_state.instances))

Revision history for this message
vinay harsha mitta (vinay7) wrote :

Hi,
I'm new to bug fixing and I'd like to assign this bug to myself. Before that, can I get the log URLs posted in the above comments? I can't open them now.

Changed in nova:
assignee: nobody → vinay harsha mitta (vinay7)
Revision history for this message
Matt Riedemann (mriedem) wrote :

> Before that, can I get the log URLs posted in the above comments? I can't open them now.

The logs are only retained for a limited amount of time (I'm not sure how long that is anymore), so those are gone. You could recreate them by getting these changes restored and rebased:

https://review.opendev.org/#/q/topic:num-instances-filter

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Here is a local reproduction of the warning log:

http://paste.openstack.org/show/795651/

Revision history for this message
vinay harsha mitta (vinay7) wrote :

I tried to reproduce the bug using the same script but was not able to reproduce it in my setup.

Revision history for this message
vinay harsha mitta (vinay7) wrote :

After running the script a number of times, I could reproduce it only once.

http://paste.openstack.org/show/795943/
