Scheduler update_aggregates race causes incorrect aggregate information

Bug #1542491 reported by James Dennis
This bug affects 6 people
Affects                   Status   Importance  Assigned to  Milestone
OpenStack Compute (nova)  Triaged  Medium      Unassigned
Ubuntu                    Invalid  Undecided   Unassigned

Bug Description

It appears that if nova-api receives simultaneous requests to add a server to a host aggregate, then a race occurs that can lead to nova-scheduler having incorrect aggregate information in memory.

One observed effect is that nova-scheduler sometimes believes fewer hosts are members of the aggregate than the nova database records, and will filter out a host that should not be filtered.

Restarting nova-scheduler fixes the issue, as it reloads the aggregate information on startup.

Nova package versions: 1:2015.1.2-0ubuntu2~cloud0

Reproduce steps:

Create a new os-aggregate, then populate it with simultaneous API POSTs (note the timestamps):

2016-02-04 20:17:08.538 13648 INFO nova.osapi_compute.wsgi.server [req-d07a006e-134a-46d8-9815-6becec5b185c 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.3 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates HTTP/1.1" status: 200 len: 439 time: 0.1865470
2016-02-04 20:17:09.204 13648 INFO nova.osapi_compute.wsgi.server [req-a0402297-9337-46d6-96d2-066e230e45e1 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.2995598
2016-02-04 20:17:09.243 13648 INFO nova.osapi_compute.wsgi.server [req-0f543525-c34e-418a-91a9-894d714ee95b 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 519 time: 0.3140590
2016-02-04 20:17:09.273 13649 INFO nova.osapi_compute.wsgi.server [req-2f8d80b0-726f-4126-a8ab-a2eae3f1a385 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.3759601
2016-02-04 20:17:09.275 13649 INFO nova.osapi_compute.wsgi.server [req-80ab6c86-e521-4bf0-ab67-4de9d0eccdd3 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.1 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.3433032

Schedule a VM

Expected Result:
nova-scheduler Availability Zone filter returns all members of the aggregate

Actual Result:
nova-scheduler believes there is only one hypervisor in the aggregate. The number will vary as it is a race:

2016-02-05 07:48:04.411 13600 DEBUG nova.filters [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Starting with 4 host(s) get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:70
2016-02-05 07:48:04.411 13600 DEBUG nova.filters [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Filter RetryFilter returned 4 host(s) get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:84
2016-02-05 07:48:04.412 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. (oshv0, oshv0) ram:122691 disk:13404160 io_ops:0 instances:0 has AZs: nova host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62
2016-02-05 07:48:04.412 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. (oshv2, oshv2) ram:122691 disk:13403136 io_ops:0 instances:0 has AZs: nova host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62
2016-02-05 07:48:04.413 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. (oshv1, oshv1) ram:122691 disk:13404160 io_ops:0 instances:0 has AZs: nova host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62
2016-02-05 07:48:04.413 13600 DEBUG nova.filters [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Filter AvailabilityZoneFilter returned 1 host(s) get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:84

Nova API calls show the correct number of members.

I suspect that it is caused by the simultaneous processing or out-of-order receipt of update_aggregates RPC calls.
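As an illustration of the suspected mechanism (toy code, not nova's actual RPC path): each update_aggregates call carries a full snapshot of the aggregate's host list, so if a stale snapshot is applied after a newer one, the cache keeps the stale view.

```python
import threading

class ToyHostManager:
    """Stand-in for the scheduler's in-memory aggregate cache (not nova code)."""

    def __init__(self):
        self.aggs_by_id = {}
        self._lock = threading.Lock()

    def update_aggregate(self, agg_id, hosts):
        # Each update carries a full snapshot of the aggregate's host list;
        # whichever snapshot is applied last wins, even if it is older.
        with self._lock:
            self.aggs_by_id[agg_id] = set(hosts)

hm = ToyHostManager()
# Two concurrent "add host" API calls leave the DB with both hosts,
# but the snapshots reach the scheduler out of order:
hm.update_aggregate(1, ['oshv0', 'oshv1'])  # newer snapshot arrives first
hm.update_aggregate(1, ['oshv0'])           # stale snapshot arrives last
print(sorted(hm.aggs_by_id[1]))             # ['oshv0'] -- cache disagrees with the DB
```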

Tags: api scheduler
James Dennis (jtmes)
tags: added: race-condition scheduler
tags: removed: race-condition
Changed in nova:
importance: Undecided → Medium
importance: Medium → Undecided
Changed in ubuntu:
status: New → Invalid
Changed in nova:
status: New → Incomplete
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Could you please tell us which Nova version corresponds to the Ubuntu package 1:2015.1.2-0ubuntu2~cloud0?

Also, could you please tell us whether a subsequent request would get the accurate number of hosts within the aggregate? In general, you don't need to restart the scheduler service, because updates are RPC'd (fanout) to the scheduler, which should get the update anyway.

Revision history for this message
James Dennis (jtmes) wrote :

It's nova version 2015.1.2

All requests to create a VM are scheduled using the wrong aggregate members until nova-scheduler is restarted.

My understanding is that the scheduler only receives an update via RPC when an aggregate's members change. Therefore if no further changes are made it will continue to use the incorrect aggregate members.

Revision history for this message
yuntongjin (yuntongjin) wrote :

@James Dennis (jtmes)
I believe your understanding on #2 is correct.

The root cause of this issue is that the sequence of aggregate operations applied to the DB may not be exactly the same as the sequence seen by the scheduler.

For example, the operations on the DB are:
1. add host
2. del host

However, on the scheduler the sequence arrives reversed:
1. del host
2. add host

This causes the DB and the scheduler to go out of sync.
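The reordering can be shown concretely (toy Python, illustrative only):

```python
def apply(ops):
    """Replay add/del membership operations in the given order."""
    members = set()
    for op, host in ops:
        if op == 'add':
            members.add(host)
        else:
            members.discard(host)
    return members

db_view    = apply([('add', 'hostX'), ('del', 'hostX')])  # order applied to the DB
sched_view = apply([('del', 'hostX'), ('add', 'hostX')])  # order seen by the scheduler
print(db_view)     # set() -- host removed
print(sched_view)  # {'hostX'} -- host still present: out of sync with the DB
```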

Changed in nova:
assignee: nobody → yuntongjin (yuntongjin)
Revision history for this message
yuntongjin (yuntongjin) wrote :

Proposed solution:
Pass the DB operation timestamp to the scheduler and store it as the scheduler's update timestamp. If the timestamp of a new update is older than the scheduler's update timestamp, drop that update.
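A minimal sketch of this drop-stale-updates rule (hypothetical names, not a nova patch):

```python
class AggregateCache:
    """Toy scheduler-side cache guarded by the last applied DB timestamp."""

    def __init__(self):
        self.hosts = set()
        self.updated_at = float('-inf')  # timestamp of the last applied update

    def apply_update(self, hosts, db_timestamp):
        # Drop any update stamped older than what we already applied.
        if db_timestamp < self.updated_at:
            return False
        self.hosts = set(hosts)
        self.updated_at = db_timestamp
        return True

cache = AggregateCache()
cache.apply_update(['hostA', 'hostB'], db_timestamp=2.0)  # applied
stale = cache.apply_update(['hostA'], db_timestamp=1.0)   # stale, dropped
print(sorted(cache.hosts), stale)  # ['hostA', 'hostB'] False
```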

Revision history for this message
James Dennis (jtmes) wrote :

Timestamps do have the benefit of being simple to implement, but also have some flaws:

1. Relying on the standard system clock means the timestamps aren't monotonic; i.e., if the clock is adjusted backwards, no updates may be accepted by nova-scheduler
2. If you have nova-api running on two different hosts, it requires the hosts' clocks to be synchronized, or updates could be lost
3. Very unlikely with microsecond precision, but two updates could have the same identifier.

While more complex, and requiring a database change, an integer that is updated by nova-api in the database, and sent every time with the scheduler message would avoid the above problems. The use of the database would guarantee monotonic increase for every update, and also the durability of the value across nova-api restarts. Without durability some notification to nova-scheduler that it needs to reset its expected sequence number would be needed.

It would probably make sense to make this sequence number larger in scope than just host aggregation changes so that it can be reused for other future scheduler messages.
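The DB-backed sequence number described above could look like the following sketch (illustrative schema using SQLite; nova would use its own tables and DB API):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE scheduler_seq (id INTEGER PRIMARY KEY, seq INTEGER)")
conn.execute("INSERT INTO scheduler_seq (id, seq) VALUES (1, 0)")

def next_seq():
    # The increment happens inside the DB, so the value is monotonic across
    # any number of nova-api processes and durable across restarts.
    with conn:  # one atomic transaction
        conn.execute("UPDATE scheduler_seq SET seq = seq + 1 WHERE id = 1")
        return conn.execute(
            "SELECT seq FROM scheduler_seq WHERE id = 1").fetchone()[0]

# nova-api would attach next_seq() to every scheduler message; the scheduler
# then drops any message whose seq is <= the last one it applied.
print(next_seq(), next_seq(), next_seq())  # 1 2 3
```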

Revision history for this message
yuntongjin (yuntongjin) wrote :

Your concerns make a lot of sense, and they are common issues in distributed systems.
OpenStack uses the timestamp-update method throughout the code, for example:
https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L185
Regarding your concerns:
#1, synchronized timing is a basic assumption in a distributed system like OpenStack, even though it makes the system prone to going wrong.
#2, the timestamp comes from the DB operation, so as long as multiple nova-api services operate on the same single DB, this won't be a problem.
#3, very helpful heads-up; the logic will be to update the cache as long as the DB timestamp >= the cache timestamp.

Revision history for this message
James Dennis (jtmes) wrote :

Is it correct that the example use of a timestamp is for a message that will only be sent by one agent, i.e. the nova-compute instance? If so then I don't think it is a comparable scenario.

#1 I agree that synchronized timing is typically expected, but I wouldn't expect timing to be used as the basis for versioning data that can be updated at high frequency, as it can be non-deterministic due to clock jitter etc.
#2 I missed the detail of a DB-derived timestamp, which means there is a "master lock". I have not tested this but use of MariaDB Galera Cluster might return the timestamp of whatever node executes the query?

I am not familiar with the nova code; if the use of timestamps is prevalent in this scenario then this particular bug is unlikely to be severe enough to warrant a change in approach.

yuntongjin (yuntongjin)
Changed in nova:
status: Incomplete → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/289207

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/289207
Reason: This patch is quite old, so I am abandoning it to keep the review queue manageable. Feel free to restore the change if you're still interested in working on it.

Revision history for this message
Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → Confirmed
assignee: yuntongjin (yuntongjin) → nobody
jingtao (liang888)
Changed in nova:
assignee: nobody → jingtao (liang888)
Revision history for this message
jingtao (liang888) wrote :

Is anybody else concerned about this issue?

Maybe we can try to synchronize add_host_to_aggregate and remove_host_from_aggregate.
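For a single nova-api process, that synchronization could look like the sketch below (hypothetical; a real fix would need a cross-process guard, since nova-api typically runs on several hosts):

```python
import threading

_agg_lock = threading.Lock()

def add_host_to_aggregate(agg, host, send_update):
    # Mutate membership and emit the snapshot under one lock, so updates
    # are sent in the same order they are applied.
    with _agg_lock:
        agg['hosts'].add(host)
        send_update(sorted(agg['hosts']))

def remove_host_from_aggregate(agg, host, send_update):
    with _agg_lock:
        agg['hosts'].discard(host)
        send_update(sorted(agg['hosts']))

sent = []
agg = {'hosts': set()}
add_host_to_aggregate(agg, 'hostX', sent.append)
remove_host_from_aggregate(agg, 'hostX', sent.append)
print(sent)  # [['hostX'], []] -- snapshots emitted in the order applied
```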

Changed in nova:
status: Confirmed → Opinion
Revision history for this message
Michael Sherman (msherman-uchicago) wrote (last edit ):

I believe this is the same bug here: https://bugs.launchpad.net/nova/+bug/1947813

and related to this one? https://bugs.launchpad.net/nova/+bug/1699054

Revision history for this message
Michael Sherman (msherman-uchicago) wrote (last edit ):

Steps to reproduce with devstack, on devstack master commit 9be4ceeaa10f6ed92291e77ec52794acfb67c147

The `AggregateInstanceExtraSpecsFilter` is only added to trigger a log message and/or scheduling failures from the stale aggregate info, extra debug logging in _update_aggregates will show the inconsistent state even without the added filter.

### Adding logging to the host_manager helps to see what's going on:

```
diff --git a/nova/scheduler/host_manager.py b/nova/scheduler/host_manager.py
index 8cb775a923..c9894c79fa 100644
--- a/nova/scheduler/host_manager.py
+++ b/nova/scheduler/host_manager.py
@@ -392,6 +392,8 @@ class HostManager(object):

     def _update_aggregate(self, aggregate):
         self.aggs_by_id[aggregate.id] = aggregate
+
+        LOG.debug(f"update for {aggregate.id} called with {aggregate.hosts}")
         for host in aggregate.hosts:
             self.host_aggregates_map[host].add(aggregate.id)
         # Refreshing the mapping dict to remove all hosts that are no longer
```

### Local.conf:

```
[[local|localrc]]
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD

VIRT_DRIVER=fake
NUMBER_FAKE_NOVA_COMPUTE=10

[[post-config|$NOVA_CONF]]

# just addition of AggregateInstanceExtraSpecsFilter to exercise the issue
[filter_scheduler]
enabled_filters = ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,SameHostFilter,DifferentHostFilter,AggregateInstanceExtraSpecsFilter
```

### aggregate and flavor setup for AggregateInstanceExtraSpecsFilter

```
openstack aggregate create test_agg
openstack aggregate set --property "test=true" test_agg

openstack flavor create --ram 512 --disk 1 --vcpus 1 test_flavor
openstack flavor set --property "aggregate_instance_extra_specs:test=true" test_flavor
```

### add hosts to aggregate in parallel
It is not guaranteed to trigger the issue, so several attempts may be needed.
Looking at the debug logs from host manager will show if the last applied RPC has an incomplete list of hosts in the aggregate.
The issue seems easier to trigger the more closely spaced in time the requests are, e.g. by issuing them via openstacksdk with a reused session, avoiding the Python startup time.

```
openstack hypervisor list -c "Hypervisor Hostname" -f value \
| xargs -I {} -P 10 -n 1 \
 openstack aggregate add host test_agg -c hosts -f value {}
```

This will show responses like the following:
```
['devstack8']
['devstack8', 'devstack1']
['devstack8', 'devstack1', 'devstack2']
['devstack8', 'devstack3']
['devstack8', 'devstack1', 'devstack7']
['devstack8', 'devstack4']
['devstack8', 'devstack6']
['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack10']
['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack2', 'devstack6', 'devstack9']
['devstack8', 'devstack1', 'devstack3', 'devstack4', 'devstack2', 'devstack6', 'devstack5']
```

At this point, viewing the aggregate info directly does show the correct membership:
```
$ openstack aggregate show test_agg --max-width=80
+-------------------+----------------------------------------------------------+
| Field...


Revision history for this message
sean mooney (sean-k-mooney) wrote :

setting this to medium severity

there is an existing race in how the cache is updated.
the workaround is to periodically restart the scheduler to clear the cache.

this looks like it affects all stable releases of OpenStack.
however it's unlikely but not impossible that a fix for this can be backported.

given the above I'm marking this as medium as there is a relatively simple workaround even if detection of the issue is not trivial.

Changed in nova:
assignee: jingtao (liang888) → nobody
importance: Undecided → Medium
status: Opinion → Triaged
tags: added: api