Scheduler update_aggregates race causes incorrect aggregate information
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
OpenStack Compute (nova) | Triaged | Medium | Unassigned | |
Ubuntu | Invalid | Undecided | Unassigned | |
Bug Description
It appears that if nova-api receives simultaneous requests to add a server to a host aggregate, a race occurs that can leave nova-scheduler with incorrect aggregate information in memory.
One observed effect is that nova-scheduler sometimes believes the aggregate has fewer member hosts than are recorded in the nova database, and so filters out a host that should have passed.
Restarting nova-scheduler fixes the issue, because it reloads the aggregate information on startup.
Nova package versions: 1:2015.
Reproduce steps:
Create a new os-aggregate, then populate it with simultaneous API POSTs (note the timestamps):
2016-02-04 20:17:08.538 13648 INFO nova.osapi_
2016-02-04 20:17:09.204 13648 INFO nova.osapi_
2016-02-04 20:17:09.243 13648 INFO nova.osapi_
2016-02-04 20:17:09.273 13649 INFO nova.osapi_
2016-02-04 20:17:09.275 13649 INFO nova.osapi_
Schedule a VM
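The simultaneous POSTs in the reproduce steps could be issued with something like the following sketch. It is an illustration only, not the script used for this report. It assumes the standard os-aggregates `add_host` action body; `post` is a hypothetical caller-supplied function that sends an authenticated POST (the base URL, token handling, and host names are all assumptions).

```python
import json
from concurrent.futures import ThreadPoolExecutor


def build_add_host_request(base_url, aggregate_id, host):
    """Return (url, body) for one os-aggregates add_host action."""
    url = f"{base_url}/os-aggregates/{aggregate_id}/action"
    body = json.dumps({"add_host": {"host": host}})
    return url, body


def add_hosts_concurrently(post, base_url, aggregate_id, hosts):
    """Send one add_host request per host, all at roughly the same time.

    `post` is a hypothetical callable taking (url, body); supply your own
    implementation that adds authentication headers and performs the POST.
    `hosts` must be non-empty.
    """
    reqs = [build_add_host_request(base_url, aggregate_id, h) for h in hosts]
    # One worker per request so the POSTs overlap rather than run serially.
    with ThreadPoolExecutor(max_workers=len(reqs)) as pool:
        return list(pool.map(lambda r: post(*r), reqs))
```

Passing `post` in as a parameter keeps the sketch independent of any particular HTTP client or authentication scheme.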
Expected Result:
nova-scheduler Availability Zone filter returns all members of the aggregate
Actual Result:
nova-scheduler believes there is only one hypervisor in the aggregate. The number will vary as it is a race:
2016-02-05 07:48:04.411 13600 DEBUG nova.filters [req-c24338b5-
2016-02-05 07:48:04.411 13600 DEBUG nova.filters [req-c24338b5-
2016-02-05 07:48:04.412 13600 DEBUG nova.scheduler.
2016-02-05 07:48:04.412 13600 DEBUG nova.scheduler.
2016-02-05 07:48:04.413 13600 DEBUG nova.scheduler.
2016-02-05 07:48:04.413 13600 DEBUG nova.filters [req-c24338b5-
Nova API calls show the correct number of members.
I suspect that it is caused by the simultaneous processing or out-of-order receipt of update_aggregates RPC calls.
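To make the suspected mechanism concrete, here is a toy simulation. It is not nova code. It assumes that each update carries a full snapshot of the aggregate's hosts and that the scheduler simply replaces its in-memory copy with whatever arrives last; `SchedulerCache` and `simulate` are hypothetical names used only for this illustration.

```python
class SchedulerCache:
    """Toy stand-in for the scheduler's in-memory aggregate view."""

    def __init__(self):
        self.hosts = set()

    def update_aggregate(self, snapshot):
        # Assumption: last writer wins, the snapshot replaces prior state.
        self.hosts = set(snapshot)


def simulate(delivery_order):
    """Add three hosts to the 'database', then deliver the resulting
    snapshots to the scheduler cache in the given order.

    Returns (hosts in database, hosts the scheduler believes in).
    """
    db_hosts = []
    snapshots = []
    for host in ["compute1", "compute2", "compute3"]:
        db_hosts.append(host)
        # Snapshot of the aggregate as seen right after this add.
        snapshots.append(list(db_hosts))

    cache = SchedulerCache()
    for index in delivery_order:
        cache.update_aggregate(snapshots[index])
    return sorted(db_hosts), sorted(cache.hosts)
```

Under these assumptions, `simulate([0, 1, 2])` leaves the cache matching the database, while `simulate([2, 1, 0])` leaves the scheduler believing only `compute1` is in the aggregate, even though the database lists all three hosts. That matches the symptom described above, where the number of hosts seen varies with the ordering.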
tags: | added: race-condition scheduler |
tags: | removed: race-condition |
Changed in nova: | |
importance: | Undecided → Medium |
importance: | Medium → Undecided |
Changed in ubuntu: | |
status: | New → Invalid |
Changed in nova: | |
status: | New → Incomplete |
Changed in nova: | |
status: | Incomplete → Confirmed |
Changed in nova: | |
assignee: | nobody → jingtao (liang888) |
Changed in nova: | |
status: | Confirmed → Opinion |
Could you please tell us which Nova version corresponds to the Ubuntu package 1:2015.1.2-0ubuntu2~cloud0?
Also, could you please tell us whether a subsequent request would see the accurate number of hosts in the aggregate? In general, you should not need to restart the scheduler service, because updates are sent over RPC (fanout) to the scheduler, which should receive them anyway.