Steps to reproduce with devstack, on devstack master commit 9be4ceeaa10f6ed92291e77ec52794acfb67c147. The `AggregateInstanceExtraSpecsFilter` is only added to trigger a log message and/or scheduling failures from the stale aggregate info; extra debug logging in `_update_aggregate` will show the inconsistent state even without the added filter.

### Adding logging to the host_manager helps to see what's going on:

```
diff --git a/nova/scheduler/host_manager.py b/nova/scheduler/host_manager.py
index 8cb775a923..c9894c79fa 100644
--- a/nova/scheduler/host_manager.py
+++ b/nova/scheduler/host_manager.py
@@ -392,6 +392,8 @@ class HostManager(object):

     def _update_aggregate(self, aggregate):
         self.aggs_by_id[aggregate.id] = aggregate
+
+        LOG.debug(f"update for {aggregate.id} called with {aggregate.hosts}")
         for host in aggregate.hosts:
             self.host_aggregates_map[host].add(aggregate.id)
         # Refreshing the mapping dict to remove all hosts that are no longer
```

### Local.conf:

```
[[local|localrc]]
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
VIRT_DRIVER=fake
NUMBER_FAKE_NOVA_COMPUTE=10

[[post-config|$NOVA_CONF]]
# just addition of AggregateInstanceExtraSpecsFilter to exercise the issue
[filter_scheduler]
enabled_filters = ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,SameHostFilter,DifferentHostFilter,AggregateInstanceExtraSpecsFilter
```

### aggregate and flavor setup for AggregateInstanceExtraSpecsFilter

```
openstack aggregate create test_agg
openstack aggregate set --property "test=true" test_agg
openstack flavor create --ram 512 --disk 1 --vcpus 1 test_flavor
openstack flavor set --property "aggregate_instance_extra_specs:test=true" test_flavor
```

### add hosts to aggregate in parallel

A single run is not guaranteed to trigger the issue, so several attempts may be needed. The host manager debug logs will show whether the last applied RPC update carried an incomplete list of hosts for the aggregate. The issue seems easier to trigger the more closely spaced in time the requests are, e.g. by issuing them via openstacksdk with a reused session, avoiding Python startup time; a sketch of that approach follows.
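A minimal openstacksdk sketch of the parallel approach. The cloud name `devstack` (a `clouds.yaml` entry) and the thread count are assumptions to mirror the setup above; sharing one connection across threads is assumed to be good enough for a reproducer:

```
# Sketch: add all hypervisors to the aggregate in parallel, reusing a
# single authenticated session so the requests land closely spaced in time.
# Assumes a clouds.yaml entry named "devstack".
from concurrent.futures import ThreadPoolExecutor

import openstack

conn = openstack.connect(cloud="devstack")
agg = conn.compute.find_aggregate("test_agg")
hosts = [hv.name for hv in conn.compute.hypervisors()]

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [
        pool.submit(conn.compute.add_host_to_aggregate, agg, host)
        for host in hosts
    ]
    for f in futures:
        # Each call returns the aggregate as the API saw it at that moment.
        print(f.result().hosts)
```

The CLI equivalent, which produced the output below: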
```
openstack hypervisor list -c "Hypervisor Hostname" -f value \
    | xargs -I {} -P 10 -n 1 \
    openstack aggregate add host test_agg -c hosts -f value {}
```

This will show responses like the following:

```
['devstack8']
['devstack8', 'devstack1']
['devstack8', 'devstack1', 'devstack2']
['devstack8', 'devstack3']
['devstack8', 'devstack1', 'devstack7']
['devstack8', 'devstack4']
['devstack8', 'devstack6']
['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack10']
['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack2', 'devstack6', 'devstack9']
['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack2', 'devstack6', 'devstack5']
```

At this point, viewing the aggregate info directly does show the correct membership:

```
$ openstack aggregate show test_agg --max-width=80
+-------------------+----------------------------------------------------------+
| Field             | Value                                                    |
+-------------------+----------------------------------------------------------+
| availability_zone | None                                                     |
| created_at        | 2024-04-25T15:43:45.000000                               |
| deleted_at        | None                                                     |
| hosts             | devstack1, devstack10, devstack2, devstack3, devstack4,  |
|                   | devstack5, devstack6, devstack7, devstack8, devstack9    |
| id                | 1                                                        |
| is_deleted        | False                                                    |
| name              | test_agg                                                 |
| properties        | test='true'                                              |
| updated_at        | None                                                     |
| uuid              | 6700b896-34fb-4e49-9057-e1d40ce185ec                     |
+-------------------+----------------------------------------------------------+
```

With the extra logging applied, we will now see the following in the nova-scheduler debug logs:

```
...
Apr 25 15:48:01 devstack nova-scheduler[172360]: DEBUG nova.scheduler.host_manager [None req-37a6f8bd-f1b1-4fb9-af2a-b0f66aff54cf admin admin] update for 1 called with ['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack2', 'devstack6', 'devstack5'] {{(pid=172360) _update_aggregate /opt/stack/nova/nova/scheduler/host_manager.py:396}}
Apr 25 15:48:01 devstack nova-scheduler[172320]: DEBUG nova.scheduler.host_manager [None req-3126c7d9-b7f3-4408-aaec-800a78236bb6 admin admin] update for 1 called with ['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack2', 'devstack6', 'devstack9'] {{(pid=172320) _update_aggregate /opt/stack/nova/nova/scheduler/host_manager.py:396}}
Apr 25 15:48:01 devstack nova-scheduler[172316]: DEBUG nova.scheduler.host_manager [None req-37a6f8bd-f1b1-4fb9-af2a-b0f66aff54cf admin admin] update for 1 called with ['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack2', 'devstack6', 'devstack5'] {{(pid=172316) _update_aggregate /opt/stack/nova/nova/scheduler/host_manager.py:396}}
Apr 25 15:48:01 devstack nova-scheduler[172326]: DEBUG nova.scheduler.host_manager [None req-6d438ed6-e35d-48c3-b618-d3c62de50ac0 admin admin] update for 1 called with ['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack10'] {{(pid=172326) _update_aggregate /opt/stack/nova/nova/scheduler/host_manager.py:396}}
Apr 25 15:48:01 devstack nova-scheduler[172273]: DEBUG nova.scheduler.host_manager [None req-6d438ed6-e35d-48c3-b618-d3c62de50ac0 admin admin] update for 1 called with ['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack10'] {{(pid=172273) _update_aggregate /opt/stack/nova/nova/scheduler/host_manager.py:396}}
Apr 25 15:48:01 devstack nova-scheduler[172360]: DEBUG nova.scheduler.host_manager [None req-3126c7d9-b7f3-4408-aaec-800a78236bb6 admin admin] update for 1 called with ['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack2', 'devstack6', 'devstack9'] {{(pid=172360) _update_aggregate /opt/stack/nova/nova/scheduler/host_manager.py:396}}
Apr 25 15:48:01 devstack nova-scheduler[172320]: DEBUG nova.scheduler.host_manager [None req-37a6f8bd-f1b1-4fb9-af2a-b0f66aff54cf admin admin] update for 1 called with ['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack2', 'devstack6', 'devstack5'] {{(pid=172320) _update_aggregate /opt/stack/nova/nova/scheduler/host_manager.py:396}}
Apr 25 15:48:01 devstack nova-scheduler[172273]: DEBUG nova.scheduler.host_manager [None req-3126c7d9-b7f3-4408-aaec-800a78236bb6 admin admin] update for 1 called with ['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack2', 'devstack6', 'devstack9'] {{(pid=172273) _update_aggregate /opt/stack/nova/nova/scheduler/host_manager.py:396}}
Apr 25 15:48:01 devstack nova-scheduler[172326]: DEBUG nova.scheduler.host_manager [None req-3126c7d9-b7f3-4408-aaec-800a78236bb6 admin admin] update for 1 called with ['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack2', 'devstack6', 'devstack9'] {{(pid=172326) _update_aggregate /opt/stack/nova/nova/scheduler/host_manager.py:396}}
Apr 25 15:48:01 devstack nova-scheduler[172320]: DEBUG nova.scheduler.host_manager [None req-6d438ed6-e35d-48c3-b618-d3c62de50ac0 admin admin] update for 1 called with ['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack10'] {{(pid=172320) _update_aggregate /opt/stack/nova/nova/scheduler/host_manager.py:396}}
Apr 25 15:48:01 devstack nova-scheduler[172273]: DEBUG nova.scheduler.host_manager [None req-37a6f8bd-f1b1-4fb9-af2a-b0f66aff54cf admin admin] update for 1 called with ['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack2', 'devstack6', 'devstack5'] {{(pid=172273) _update_aggregate /opt/stack/nova/nova/scheduler/host_manager.py:396}}
```

If we now schedule some instances, we'll see log entries indicating that the host_state is still inconsistent.

```
openstack server create \
    --image cirros-0.6.2-x86_64-disk \
    --network private \
    --min=10 --max=10 \
    --flavor test_flavor \
    instance1
```

```
Apr 25 15:50:56 devstack nova-scheduler[172268]: DEBUG nova.filters [None req-4afd37c0-ec15-4aae-8f2b-a5ae7920aac8 admin admin] Filter AggregateInstanceExtraSpecsFilter returned 7 host(s) {{(pid=172268) get_filtered_objects /opt/stack/nova/nova/filters.py:102}}
```

There's still something I'm not understanding about the nova fake driver and/or the `AggregateInstanceExtraSpecsFilter`, as all 10 hosts still become "active", even though the filter is excluding 3 of them.
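The interleaving above is the crux: each scheduler worker applies the aggregate update casts in whatever order they arrive, and (per the diff above) `_update_aggregate` simply overwrites `aggs_by_id[aggregate.id]` with the payload it received. A toy illustration of the hazard, not nova code, using the three host-list snapshots visible in the logs:

```
# Toy illustration (not nova code): three concurrent "add host" requests
# each broadcast a snapshot of the aggregate's host list. Applying the
# casts in arrival order with no versioning is last-writer-wins, so a
# worker ends up holding whichever snapshot happened to land last.
import random

snapshots = [
    ["devstack8", "devstack1", "devstack3", "devstack4", "devstack10"],
    ["devstack8", "devstack1", "devstack3", "devstack4", "devstack2",
     "devstack6", "devstack9"],
    ["devstack8", "devstack1", "devstack3", "devstack4", "devstack2",
     "devstack6", "devstack5"],
]

aggs_by_id = {}
# Each worker sees its own arrival order; simulate one with a shuffle.
for snapshot in random.sample(snapshots, len(snapshots)):
    aggs_by_id[1] = snapshot  # what _update_aggregate effectively does

print(aggs_by_id[1])  # 5 or 7 hosts -- never the full 10 that are in the DB
```

None of the in-flight snapshots contains all 10 hosts, so whichever one a worker applies last, that worker's view stays inconsistent with the database until something repopulates it (e.g. a scheduler restart).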
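For context on the "returned 7 host(s)" line, this is roughly the check the filter performs, as I understand it; a simplified sketch, not the actual nova implementation (the real filter supports richer operators and legacy unscoped keys). A host missing from the scheduler's stale aggregate view contributes no aggregate metadata, so it fails the scoped extra-spec requirement:

```
# Simplified sketch of the AggregateInstanceExtraSpecsFilter check (not the
# actual nova implementation). host_agg_metadata is the union of metadata
# across the aggregates the scheduler *believes* the host belongs to.
SCOPE = "aggregate_instance_extra_specs:"

def host_passes(host_agg_metadata: dict, flavor_extra_specs: dict) -> bool:
    for key, req in flavor_extra_specs.items():
        if not key.startswith(SCOPE):
            continue  # keys with other scopes are not this filter's concern
        values = host_agg_metadata.get(key[len(SCOPE):], set())
        if req not in values:  # real filter supports operators like "<in>"
            return False
    return True

specs = {"aggregate_instance_extra_specs:test": "true"}
# Host the scheduler still sees as a member of test_agg:
print(host_passes({"test": {"true"}}, specs))  # True
# Host dropped from the stale host_aggregates_map snapshot:
print(host_passes({}, specs))                  # False -> filtered out
```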