Scheduler update_aggregates race causes incorrect aggregate information

Bug #1542491 reported by James Dennis
This bug affects 6 people
Affects                   Status   Importance  Assigned to  Milestone
OpenStack Compute (nova)  Triaged  Medium      Unassigned
Ubuntu                    Invalid  Undecided   Unassigned

Bug Description

It appears that if nova-api receives simultaneous requests to add a server to a host aggregate, then a race occurs that can lead to nova-scheduler having incorrect aggregate information in memory.

One observed effect is that nova-scheduler sometimes believes fewer hosts are members of the aggregate than the nova database records, and will filter out a host that should not be filtered.

Restarting nova-scheduler fixes the issue, as it reloads the aggregate information on startup.

Nova package versions: 1:2015.1.2-0ubuntu2~cloud0

Reproduce steps:

Create a new os-aggregate, then populate it with simultaneous API POSTs (note the timestamps):

2016-02-04 20:17:08.538 13648 INFO nova.osapi_compute.wsgi.server [req-d07a006e-134a-46d8-9815-6becec5b185c 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.3 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates HTTP/1.1" status: 200 len: 439 time: 0.1865470
2016-02-04 20:17:09.204 13648 INFO nova.osapi_compute.wsgi.server [req-a0402297-9337-46d6-96d2-066e230e45e1 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.2995598
2016-02-04 20:17:09.243 13648 INFO nova.osapi_compute.wsgi.server [req-0f543525-c34e-418a-91a9-894d714ee95b 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 519 time: 0.3140590
2016-02-04 20:17:09.273 13649 INFO nova.osapi_compute.wsgi.server [req-2f8d80b0-726f-4126-a8ab-a2eae3f1a385 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.3759601
2016-02-04 20:17:09.275 13649 INFO nova.osapi_compute.wsgi.server [req-80ab6c86-e521-4bf0-ab67-4de9d0eccdd3 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.1 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.3433032

Schedule a VM

Expected Result:
nova-scheduler Availability Zone filter returns all members of the aggregate

Actual Result:
nova-scheduler believes there is only one hypervisor in the aggregate. The number will vary as it is a race:

2016-02-05 07:48:04.411 13600 DEBUG nova.filters [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Starting with 4 host(s) get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:70
2016-02-05 07:48:04.411 13600 DEBUG nova.filters [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Filter RetryFilter returned 4 host(s) get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:84
2016-02-05 07:48:04.412 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. (oshv0, oshv0) ram:122691 disk:13404160 io_ops:0 instances:0 has AZs: nova host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62
2016-02-05 07:48:04.412 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. (oshv2, oshv2) ram:122691 disk:13403136 io_ops:0 instances:0 has AZs: nova host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62
2016-02-05 07:48:04.413 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. (oshv1, oshv1) ram:122691 disk:13404160 io_ops:0 instances:0 has AZs: nova host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62
2016-02-05 07:48:04.413 13600 DEBUG nova.filters [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Filter AvailabilityZoneFilter returned 1 host(s) get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:84

Nova API calls show the correct number of members.

I suspect that it is caused by the simultaneous processing or out-of-order receipt of update_aggregates RPC calls.
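As an illustration of the suspected mechanism (toy code, not nova's actual RPC path): each update_aggregates call carries a full snapshot of the aggregate's host list, so if a stale snapshot is applied after a newer one, the cache keeps the stale view.

```python
import threading

class ToyHostManager:
    """Stand-in for the scheduler's in-memory aggregate cache (not nova code)."""

    def __init__(self):
        self.aggs_by_id = {}
        self._lock = threading.Lock()

    def update_aggregate(self, agg_id, hosts):
        # Each update carries a full snapshot of the aggregate's host list;
        # whichever snapshot is applied last wins, even if it is older.
        with self._lock:
            self.aggs_by_id[agg_id] = set(hosts)

hm = ToyHostManager()
# Two concurrent "add host" API calls leave the DB with both hosts,
# but the snapshots reach the scheduler out of order:
hm.update_aggregate(1, ['oshv0', 'oshv1'])  # newer snapshot arrives first
hm.update_aggregate(1, ['oshv0'])           # stale snapshot arrives last
print(sorted(hm.aggs_by_id[1]))             # ['oshv0'] -- cache disagrees with the DB
```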

Tags: api scheduler
James Dennis (jtmes)
tags: added: race-condition scheduler
tags: removed: race-condition
Changed in nova:
importance: Undecided → Medium
importance: Medium → Undecided
Changed in ubuntu:
status: New → Invalid
Changed in nova:
status: New → Incomplete
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Could you please tell us which Nova version corresponds to the Ubuntu package 1:2015.1.2-0ubuntu2~cloud0?

Also, could you please tell us whether a subsequent request would get the accurate number of hosts within the aggregate? In general, you don't need to restart the scheduler service, because updates are RPC'd (fanout) to the scheduler, which should get the update anyway.

Revision history for this message
James Dennis (jtmes) wrote :

It's nova version 2015.1.2

All requests to create a VM are scheduled using the wrong aggregate members until nova-scheduler is restarted.

My understanding is that the scheduler only receives an update via RPC when an aggregate's members change. Therefore if no further changes are made it will continue to use the incorrect aggregate members.

Revision history for this message
yuntongjin (yuntongjin) wrote :

@James Dennis (jtmes)
I believe your understanding on #2 is correct.

The root cause of this issue is that the sequence of aggregate operations applied to the DB may not be exactly the same as the sequence seen by the scheduler.

For example, the operations on the DB are:
1. add host
2. del host

However, on the scheduler the sequence arrives reversed:
1. del host
2. add host

This causes the DB and the scheduler to go out of sync.
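The reordering can be shown concretely (toy Python, illustrative only):

```python
def apply(ops):
    """Replay add/del membership operations in the given order."""
    members = set()
    for op, host in ops:
        if op == 'add':
            members.add(host)
        else:
            members.discard(host)
    return members

db_view    = apply([('add', 'hostX'), ('del', 'hostX')])  # order applied to the DB
sched_view = apply([('del', 'hostX'), ('add', 'hostX')])  # order seen by the scheduler
print(db_view)     # set() -- host removed
print(sched_view)  # {'hostX'} -- host still present: out of sync with the DB
```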

Changed in nova:
assignee: nobody → yuntongjin (yuntongjin)
Revision history for this message
yuntongjin (yuntongjin) wrote :

Proposed solution:
Pass the DB operation timestamp to the scheduler and store it as the scheduler's update timestamp. If the timestamp of a new update is older than the scheduler's update timestamp, drop that update.
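A minimal sketch of this drop-stale-updates rule (hypothetical names, not a nova patch):

```python
class AggregateCache:
    """Toy scheduler-side cache guarded by the last applied DB timestamp."""

    def __init__(self):
        self.hosts = set()
        self.updated_at = float('-inf')  # timestamp of the last applied update

    def apply_update(self, hosts, db_timestamp):
        # Drop any update stamped older than what we already applied.
        if db_timestamp < self.updated_at:
            return False
        self.hosts = set(hosts)
        self.updated_at = db_timestamp
        return True

cache = AggregateCache()
cache.apply_update(['hostA', 'hostB'], db_timestamp=2.0)  # applied
stale = cache.apply_update(['hostA'], db_timestamp=1.0)   # stale, dropped
print(sorted(cache.hosts), stale)  # ['hostA', 'hostB'] False
```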

Revision history for this message
James Dennis (jtmes) wrote :

Timestamps do have the benefit of being simple to implement, but also have some flaws:

1. Relying on the standard system clock means the timestamps aren't monotonic; i.e., if the clock is adjusted backwards, no updates may be accepted by nova-scheduler
2. If you have nova-api running on two different hosts, it requires the hosts' clocks to be synchronized, or updates could be lost
3. Very unlikely with microsecond precision, but two updates could have the same identifier.

While more complex, and requiring a database change, an integer that is updated by nova-api in the database, and sent every time with the scheduler message would avoid the above problems. The use of the database would guarantee monotonic increase for every update, and also the durability of the value across nova-api restarts. Without durability some notification to nova-scheduler that it needs to reset its expected sequence number would be needed.

It would probably make sense to make this sequence number larger in scope than just host aggregation changes so that it can be reused for other future scheduler messages.
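The DB-backed sequence number described above could look like the following sketch (illustrative schema using SQLite; nova would use its own tables and DB API):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE scheduler_seq (id INTEGER PRIMARY KEY, seq INTEGER)")
conn.execute("INSERT INTO scheduler_seq (id, seq) VALUES (1, 0)")

def next_seq():
    # The increment happens inside the DB, so the value is monotonic across
    # any number of nova-api processes and durable across restarts.
    with conn:  # one atomic transaction
        conn.execute("UPDATE scheduler_seq SET seq = seq + 1 WHERE id = 1")
        return conn.execute(
            "SELECT seq FROM scheduler_seq WHERE id = 1").fetchone()[0]

# nova-api would attach next_seq() to every scheduler message; the scheduler
# then drops any message whose seq is <= the last one it applied.
print(next_seq(), next_seq(), next_seq())  # 1 2 3
```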

Revision history for this message
yuntongjin (yuntongjin) wrote :

Your concerns make a lot of sense, and they are common issues in distributed systems.
OpenStack uses the timestamp-update method throughout the code, for example:
https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L185
Regarding your concerns:
#1, synchronized timing is a basic assumption in a distributed system like OpenStack, even though it makes the system prone to going wrong.
#2, the timestamp comes from the DB operation, so as long as multiple nova-api services operate on the same single DB, this won't be a problem.
#3, very helpful heads-up; the logic will be to update the cache as long as the DB timestamp >= the cache timestamp.

Revision history for this message
James Dennis (jtmes) wrote :

Is it correct that the example use of a timestamp is for a message that will only be sent by one agent, i.e. the nova-compute instance? If so then I don't think it is a comparable scenario.

#1 I agree that synchronized timing is typically expected, but I wouldn't expect timing to be used as the basis for versioning data that can be updated at high frequency, as it can be non-deterministic due to clock jitter etc.
#2 I missed the detail of a DB-derived timestamp, which means there is a "master lock". I have not tested this but use of MariaDB Galera Cluster might return the timestamp of whatever node executes the query?

I am not familiar with the nova code; if the use of timestamps is prevalent in this scenario then this particular bug is unlikely to be severe enough to warrant a change in approach.

yuntongjin (yuntongjin)
Changed in nova:
status: Incomplete → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/289207

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/289207
Reason: This patch is quite old, so I am abandoning it to keep the review queue manageable. Feel free to restore the change if you're still interested in working on it.

Revision history for this message
Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → Confirmed
assignee: yuntongjin (yuntongjin) → nobody
jingtao (liang888)
Changed in nova:
assignee: nobody → jingtao (liang888)
Revision history for this message
jingtao (liang888) wrote :

Is anybody else concerned about this issue?

Maybe we can try to synchronize add_host_to_aggregate and remove_host_from_aggregate.
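For a single nova-api process, that synchronization could look like the sketch below (hypothetical; a real fix would need a cross-process guard, since nova-api typically runs on several hosts):

```python
import threading

_agg_lock = threading.Lock()

def add_host_to_aggregate(agg, host, send_update):
    # Mutate membership and emit the snapshot under one lock, so updates
    # are sent in the same order they are applied.
    with _agg_lock:
        agg['hosts'].add(host)
        send_update(sorted(agg['hosts']))

def remove_host_from_aggregate(agg, host, send_update):
    with _agg_lock:
        agg['hosts'].discard(host)
        send_update(sorted(agg['hosts']))

sent = []
agg = {'hosts': set()}
add_host_to_aggregate(agg, 'hostX', sent.append)
remove_host_from_aggregate(agg, 'hostX', sent.append)
print(sent)  # [['hostX'], []] -- snapshots emitted in the order applied
```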

Changed in nova:
status: Confirmed → Opinion
Revision history for this message
Michael Sherman (msherman-uchicago) wrote (last edit ):

I believe this is the same bug here: https://bugs.launchpad.net/nova/+bug/1947813

and related to this one? https://bugs.launchpad.net/nova/+bug/1699054

Revision history for this message
Michael Sherman (msherman-uchicago) wrote (last edit ):

Steps to reproduce with devstack, on devstack master commit 9be4ceeaa10f6ed92291e77ec52794acfb67c147

The `AggregateInstanceExtraSpecsFilter` is only added to trigger a log message and/or scheduling failures from the stale aggregate info, extra debug logging in _update_aggregates will show the inconsistent state even without the added filter.

### Adding logging to the host_manager helps to see what's going on:

```
diff --git a/nova/scheduler/host_manager.py b/nova/scheduler/host_manager.py
index 8cb775a923..c9894c79fa 100644
--- a/nova/scheduler/host_manager.py
+++ b/nova/scheduler/host_manager.py
@@ -392,6 +392,8 @@ class HostManager(object):

     def _update_aggregate(self, aggregate):
         self.aggs_by_id[aggregate.id] = aggregate
+
+        LOG.debug(f"update for {aggregate.id} called with {aggregate.hosts}")
         for host in aggregate.hosts:
             self.host_aggregates_map[host].add(aggregate.id)
         # Refreshing the mapping dict to remove all hosts that are no longer
```

### Local.conf:

```
[[local|localrc]]
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD

VIRT_DRIVER=fake
NUMBER_FAKE_NOVA_COMPUTE=10

[[post-config|$NOVA_CONF]]

# just addition of AggregateInstanceExtraSpecsFilter to exercise the issue
[filter_scheduler]
enabled_filters = ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,SameHostFilter,DifferentHostFilter,AggregateInstanceExtraSpecsFilter
```

### aggregate and flavor setup for AggregateInstanceExtraSpecsFilter

```
openstack aggregate create test_agg
openstack aggregate set --property "test=true" test_agg

openstack flavor create --ram 512 --disk 1 --vcpus 1 test_flavor
openstack flavor set --property "aggregate_instance_extra_specs:test=true" test_flavor
```

### add hosts to aggregate in parallel
It is not guaranteed to trigger the issue, so several attempts may be needed.
Looking at the debug logs from host manager will show if the last applied RPC has an incomplete list of hosts in the aggregate.
The issue seems easier to trigger the more closely spaced in time the requests are, e.g. by issuing them via openstacksdk with a reused session, avoiding the Python startup time.

```
openstack hypervisor list -c "Hypervisor Hostname" -f value \
| xargs -I {} -P 10 -n 1 \
 openstack aggregate add host test_agg -c hosts -f value {}
```

This will show responses like the following:
```
['devstack8']
['devstack8', 'devstack1']
['devstack8', 'devstack1', 'devstack2']
['devstack8', 'devstack3']
['devstack8', 'devstack1', 'devstack7']
['devstack8', 'devstack4']
['devstack8', 'devstack6']
['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack10']
['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack2', 'devstack6', 'devstack9']
['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack2', 'devstack6', 'devstack5']
```

At this point, viewing the aggregate info directly does show the correct membership:
```
$ openstack aggregate show test_agg --max-width=80
+-------------------+----------------------------------------------------------+
| Field...


Revision history for this message
sean mooney (sean-k-mooney) wrote :

setting this to medium severity

there is an existing race in how the cache is updated.
the workaround is to periodically restart the scheduler to clear the cache.

this looks like it affects all stable releases of OpenStack.
however it's unlikely but not impossible that a fix for this can be backported.

given the above I'm marking this as medium as there is a relatively simple workaround even if detection of the issue is not trivial.

Changed in nova:
assignee: jingtao (liang888) → nobody
importance: Undecided → Medium
status: Opinion → Triaged
tags: added: api