nova-scheduler does not honor max_instances_per_host set on a host aggregate

Bug #1740320 reported by Supreeth Shivanand
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Undecided
Assigned to: vinay harsha mitta

Bug Description

Description: nova-scheduler schedules more instances than the max_instances_per_host value set on a host aggregate.

Root cause: nova-scheduler has a NumInstancesFilter which filters out hosts that already have max_instances_per_host instances. To do that it relies on host_state.num_instances, which is retrieved from the stats of the compute_node object. Due to a race condition, the stats of the compute_node object are sometimes retrieved as {}, which sets host_state.num_instances to 0. Because of this, nova-scheduler schedules more than max_instances_per_host instances, thinking the current host has 0 instances.
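
For context, the filter's pass check is essentially a comparison of host_state.num_instances against the configured limit. The following is a simplified sketch of that logic, not the exact Newton source; the hardcoded limit stands in for the filter_scheduler.max_instances_per_host config option:

# Simplified sketch of NumInstancesFilter's core check; the real filter
# lives in nova/scheduler/filters/num_instances_filter.py.
class NumInstancesFilter(object):

    def _get_max_instances_per_host(self, host_state, spec_obj):
        # In nova this is read from configuration; hardcoded here only
        # to keep the sketch self-contained.
        return 50

    def host_passes(self, host_state, spec_obj):
        # host_state.num_instances is populated from the compute_node
        # stats; when those stats race to {}, this reads as 0 and a
        # full host wrongly passes the filter.
        num_instances = host_state.num_instances
        max_instances = self._get_max_instances_per_host(host_state, spec_obj)
        return num_instances < max_instances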

Workaround fix: changed the NumInstancesFilter to fall back to len(host_state.instances) when host_state.num_instances is 0.

Steps to reproduce:
1. I created 3 Heat autoscaling stacks, forced to create instances on one host aggregate.
2. Load the CPU on the instances of each of these stacks, and set the autoscale cooldown value to 30 seconds (typically some low value, so that we force the race condition).
3. Once the number of instances crosses max_instances_per_host on all the hosts in the host aggregate, instances start ending up in the error state (filtered by AggregateNumInstancesFilter; see the sketch after this list for how the aggregate limit is resolved).
4. But sometimes (due to the race condition), the stats in the compute_node objects become {} and AggregateNumInstancesFilter doesn't filter out hosts with num instances >= max_instances_per_host.
5. I start seeing more than max_instances_per_host instances on these hosts.
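
For reference, AggregateNumInstancesFilter takes the per-host limit from aggregate metadata rather than from global configuration. Below is a simplified, hypothetical sketch of that limit resolution; the helper name and the min() tie-break are assumptions here, not the exact nova source:

# Hypothetical, simplified sketch of resolving the aggregate-scoped
# limit; not the exact nova implementation.
def get_max_instances_per_host(host_state, default_limit):
    values = set()
    for aggregate in host_state.aggregates:
        value = aggregate.metadata.get('max_instances_per_host')
        if value is not None:
            values.add(int(value))
    # If the host is in several aggregates that all set the key, take
    # the strictest limit; otherwise fall back to the config default.
    return min(values) if values else default_limit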

Expected result:
Hosts in host aggregates that have the max_instances_per_host tag should never be scheduled with more than max_instances_per_host instances.

Actual Result:
Hosts in host aggregates that have the max_instances_per_host tag are scheduled with more than max_instances_per_host instances.

Environment:
openstack nova version: Newton release
Hypervisor: Libvirt + KVM

Tags: scheduler
Revision history for this message
Matt Riedemann (mriedem) wrote :

You might be the person I was talking with about this issue in IRC. I pushed a couple of debug patches to see if we can see the same thing in our CI testing, but I haven't dug into the logs yet:

https://review.openstack.org/#/q/status:open+branch:master+topic:num-instances-filter
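
For context, the warning quoted in the next comment would come from a consistency check along these lines. This is an assumed sketch of what the debug patches do, not the exact code under review:

import logging

LOG = logging.getLogger(__name__)

# Assumed sketch of a check inside the filter's host_passes(): compare
# the stats-reported count against the instance dict the HostState
# actually tracks.
def check_instance_counts(host_state):
    num_instances = host_state.num_instances
    tracked = len(host_state.instances)
    if num_instances != tracked:
        LOG.warning('Reported number of instances (%d) does not match '
                    'the tracked number of instances (%d).',
                    num_instances, tracked)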

tags: added: scheduler
Revision history for this message
Matt Riedemann (mriedem) wrote :

The test does show the issue:

http://logs.openstack.org/67/529867/1/check/tempest-full/23d2919/controller/logs/screen-n-sch.txt.gz#_Dec_22_18_14_13_693048

Dec 22 18:14:13.693048 ubuntu-xenial-ovh-bhs1-0001579658 nova-scheduler[9206]: WARNING nova.scheduler.filters.num_instances_filter [None req-75ecdad4-ef94-46c0-bd60-293038eadbd2 tempest-ServerRescueTestJSON-754723157 tempest-ServerRescueTestJSON-754723157] Reported number of instances (0) does not match the tracked number of instances (1).

That shows up twice in that one Tempest run.

I don't know what the root cause is, and I'm not sure we should just paper over it by checking for 0 and then using the other value if they aren't the same.

Changed in nova:
status: New → Confirmed
Revision history for this message
Matt Riedemann (mriedem) wrote :

In this case it's not even close:

http://logs.openstack.org/67/529867/1/check/legacy-tempest-dsvm-neutron-multinode-full/35b6e42/logs/screen-n-sch.txt.gz#_Dec_22_18_32_11_248662

Dec 22 18:32:11.248662 ubuntu-xenial-ovh-bhs1-0001579701 nova-scheduler[32076]: WARNING nova.scheduler.filters.num_instances_filter [None req-683d676a-ca4d-408a-8e9e-6ac7ff7c06d8 tempest-ServerRescueNegativeTestJSON-1276252261 tempest-ServerRescueNegativeTestJSON-1276252261] Reported number of instances (0) does not match the tracked number of instances (3).

Revision history for this message
Supreeth Shivanand (supreeth90) wrote :

Yes, I think you are the one I spoke to in IRC. I added the patch you suggested in our environment and so far it's been good. I am sharing it below:

It goes like:
num_instances = host_state.num_instances or len(host_state.instances)
if len(host_state.instances) != host_state.num_instances:
    LOG.warning("NumInstancesFilter: num_instances doesn't match "
                "host_state.instances for host: %s, num_instances: %s, "
                "len_instances: %s", host_state.host,
                host_state.num_instances, len(host_state.instances))

Revision history for this message
vinay harsha mitta (vinay7) wrote :

Hi,
I'm new to bug fixing and I'd like to assign this bug to myself. Before that, can I get the log URLs posted in the above comments? I can't open them now.

Changed in nova:
assignee: nobody → vinay harsha mitta (vinay7)
Revision history for this message
Matt Riedemann (mriedem) wrote :

> Before that, can I get the log URLs posted in the above comments? I can't open them now.

The logs are only retained for a limited amount of time (I'm not sure how long that is anymore), so those are gone. You could recreate them by getting these changes restored and rebased:

https://review.opendev.org/#/q/topic:num-instances-filter

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Here is a local reproduction of the warning log:

http://paste.openstack.org/show/795651/

Revision history for this message
vinay harsha mitta (vinay7) wrote :

I tried to reproduce the bug using the same script but was not able to reproduce it in my setup.

Revision history for this message
vinay harsha mitta (vinay7) wrote :

After running the script a number of times, I could reproduce it only once.

http://paste.openstack.org/show/795943/
