gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit

Bug #1341420 reported by Robert Collins on 2014-07-14
This bug affects 4 people
Affects: OpenStack Compute (nova) - Importance: Wishlist - Assigned to: Yingxin
Affects: tripleo - Importance: Undecided - Assigned to: Unassigned

Bug Description

There is a race between the scheduler in select_destinations, which selects a set of hosts, and the nova compute manager, which claims resources on those hosts when building the instance. The race is particularly noticeable with Ironic, where every request will consume a full host, but it can turn up with libvirt etc. too. Multiple schedulers will likely exacerbate this, unless they are on a version of python with randomised dictionary ordering, in which case they will make it better :).

I've put https://review.openstack.org/106677 up to remove a comment which comes from before we introduced this race.

There is one mitigating aspect to the race: the filter scheduler's _schedule method attempts to randomly select hosts, to avoid returning the same host for repeated requests, but the default size of the subset it selects from is 1 - so when heat requests a single instance, the same candidate is chosen every time. Setting that number higher can stop all concurrent requests hitting the same host, but it is still a race, and still likely to fail fairly hard in near-capacity situations (e.g. deploying all machines in a cluster with Ironic and Heat).
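The behaviour of that subset selection can be modelled with a short sketch (this is an illustration of the idea, not nova's actual _schedule code):

```python
import random

def pick_host(weighed_hosts, host_subset_size=1):
    """Pick a host at random from the top `host_subset_size` candidates.

    `weighed_hosts` is assumed sorted best-first. With the default
    subset size of 1, every concurrent request gets the same best host,
    which is exactly the collision described in this bug.
    """
    subset = weighed_hosts[:max(1, host_subset_size)]
    return random.choice(subset)

hosts = ["node1", "node2", "node3"]  # sorted by weight, best first
assert pick_host(hosts, host_subset_size=1) == "node1"  # always the same host
assert pick_host(hosts, host_subset_size=3) in hosts    # spread, but still racy
```

Even with a larger subset, nothing prevents two near-simultaneous requests from picking the same member of the subset; the race is narrowed, not closed.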

Folk wanting to reproduce this: take a decent sized cloud - e.g. 5 or 10 hypervisor hosts (KVM is fine). Deploy until each hypervisor has capacity for exactly one more VM. Then deploy a bunch of VMs one at a time but very close together - e.g. use the python API to get cached keystone credentials, and boot 5 in a loop.

If using Ironic you will want https://review.openstack.org/106676 to let you see which host is being returned from the selection.

Possible fixes:
 - have the scheduler be a bit smarter about returning hosts - e.g. track destination selection counts since the last refresh and weight hosts by that count as well
 - reinstate actioning claims into the scheduler, allowing the audit to correct any claimed-but-not-started resource counts asynchronously
 - special case the retry behaviour if there are lots of resources available elsewhere in the cluster.
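As a rough illustration of the first suggestion (tracking destination selection counts and weighting by them), here is a sketch with hypothetical names; it is not a real nova weigher:

```python
from collections import Counter

class SelectionCountTracker:
    """Demote hosts already handed out since the last host-state refresh,
    so concurrent requests spread across the cluster.

    Minimal sketch of the first suggested fix; all names are illustrative.
    """
    def __init__(self):
        self._selected = Counter()

    def reset(self):
        # Call when fresh host state arrives from the database.
        self._selected.clear()

    def reweigh(self, weighed_hosts):
        # weighed_hosts: list of (host, weight). Subtract a penalty per
        # prior selection, re-sort best-first, and record the pick.
        rescored = sorted(
            ((host, weight - self._selected[host])
             for host, weight in weighed_hosts),
            key=lambda hw: hw[1], reverse=True)
        best = rescored[0][0]
        self._selected[best] += 1
        return best

tracker = SelectionCountTracker()
hosts = [("node1", 1.0), ("node2", 1.0), ("node3", 1.0)]
picks = [tracker.reweigh(hosts) for _ in range(3)]
assert sorted(picks) == ["node1", "node2", "node3"]  # no repeats until reset
```

With equal weights (the homogeneous Ironic case), three back-to-back requests land on three different hosts instead of all hitting the first one.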

Stats wise, I just tested a 29 instance deployment with Ironic and a heat stack, with 45 machines to deploy onto (so 45 hosts in the scheduler set), and 4 failed with this race - which means they rescheduled and failed 3 times each - or 12 cases of the scheduler racing *at minimum*.

background chat

15:43 < lifeless> mikal: around? I need to sanity check something
15:44 < lifeless> ulp, nope, am sure of it. filing a bug.
15:45 < mikal> lifeless: ok
15:46 < lifeless> mikal: oh, you're here, I will run it past you :)
15:46 < lifeless> mikal: if you have ~5m
15:46 < mikal> Sure
15:46 < lifeless> so, symptoms
15:46 < lifeless> nova boot <...> --num-instances 45 -> works fairly reliably. Some minor timeout related things to fix but nothing dramatic.
15:47 < lifeless> heat create-stack <...> with a stack with 45 instances in it -> about 50% of instances fail to come up
15:47 < lifeless> this is with Ironic
15:47 < mikal> Sure
15:47 < lifeless> the failure on all the instances is the retry-three-times failure-of-death
15:47 < lifeless> what I believe is happening is this
15:48 < lifeless> the scheduler is allocating the same weighed list of hosts for requests that happen close enough together
15:49 < lifeless> and I believe its able to do that because the target hosts (from select_destinations) need to actually hit the compute node manager and have
15:49 < lifeless> with rt.instance_claim(context, instance, limits):
15:49 < lifeless> happen in _build_and_run_instance
15:49 < lifeless> before the resource usage is assigned
15:49 < mikal> Is heat making 45 separate requests to the nova API?
15:49 < lifeless> eys
15:49 < lifeless> yes
15:49 < lifeless> thats the key difference
15:50 < lifeless> same flavour, same image
15:50 < mikal> And you have enough quota for these instances, right?
15:50 < lifeless> yes
15:51 < mikal> I'd have to dig deeper to have an answer, but it sure does seem worth filing a bug for
15:51 < lifeless> my theory is that there is enough time between select_destinations in the conductor, and _build_and_run_instance in compute for another request to come in the front door and be scheduled to the same host
15:51 < mikal> That seems possible to me
15:52 < lifeless> I have no idea right now about how to fix it (other than to have the resources provisionally allocated by the scheduler before it sends a reply), but I am guessing that might be contentious
15:52 < mikal> I can't instantly think of a fix though -- we've avoided queue like behaviour for scheduling
15:52 < mikal> How big is the clsuter compared with 45 instances?
15:52 < mikal> Is it approximately the same size as that?
15:52 < lifeless> (by provisionally allocated, I mean 'claim them and let the audit in 60 seconds fix it up if they are not actually used')
15:53 < lifeless> sorry, not sure what yoy mean by that last question
15:53 < mikal> So, if you have 45 ironic instances to schedule, and 45 identical machines to do it, then the probability of picking the same machine more than once to schedule on is very high
15:53 < mikal> Wehereas if you had 500 machines, it would be low
15:53 < lifeless> oh yes, all the hardware is homogeneous
15:54 < lifeless> we believe this is common in clouds :)
15:54 < mikal> And the cluster is sized at approximately 45 machines?
15:54 < lifeless> the cluster is 46 machines but one is down for maintenance
15:54 < lifeless> so 45 machines available to schedule onto.
15:54 < mikal> Its the size of the cluster compared to the size of the set of instances which I'm most interested in
15:54 < lifeless> However - and this is the interesting thing
15:54 < lifeless> I tried a heat stack of 20 machines.
15:54 < lifeless> same symptoms
15:54 < mikal> Yeah, that's like the worst possible case for this algorithm
15:54 < lifeless> about 30% failed due to scheduler retries.
15:54 < mikal> Hmmm
15:54 < mikal> That is unexpected to me
15:55 < lifeless> that is when I dived into the code.
15:55 < lifeless> the patch I pushed above will make it possible to see if my theory is correct
15:55 < mikal> you were going to file a bug, right?
15:56 < lifeless> I have the form open to file one with tasks on ironic and nova
15:56 < mikal> I vote you do that thing
15:56 < lifeless> seconded
15:56 < lifeless> I might copy this transcript in as well
15:57 < mikal> Works for me

tags: added: ironic scheduler
Changed in nova:
status: New → Triaged
importance: Undecided → High
Steve Kowalik (stevenk) on 2014-07-14
description: updated
Robert Collins (lifeless) wrote :

Ok, I've now confirmed that the weighed list order is the same each time, which means the hypothesis isn't disproved.

Robert Collins (lifeless) wrote :

Here is a snippet showing the scheduler thinking that cc0a5455-0978-4687-988e-8f3161cbc521 has no instances on it [host: (undercloud, cc0a5455-0978-4687-988e-8f3161cbc521) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], returning it in a list, and then an instance rescheduling because it actually has no resources:

scheduler.log:
2014-07-14 07:59:34.478 3915 DEBUG nova.scheduler.filter_scheduler [req-d701a25b-7bbc-44ed-9b31-57245e291ae1 None] Weighed [WeighedHost [host: (undercloud, aa4c5e32-46a5-46f2-b312-262a2df23980) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, c63bc9b2-5340-4d46-837a-a4f9a77a00d3) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 1b44fa97-a41f-4aa1-a578-08e82b6ed624) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, a5104c42-37e1-48ef-a5b6-0ba2d8808d8e) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 75e03645-caa6-4d16-9f9d-eb66bf0df769) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, cc0a5455-0978-4687-988e-8f3161cbc521) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 47bbaf38-9e0d-43a0-932b-eb0c9ec7af26) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 9686995e-3ff0-4a10-8c3c-bf3e09ffa770) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 578cf11e-3d79-45ae-883e-583a83f46e23) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, a3f05b92-e693-4338-843b-62c8ccaad397) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 69dc8c40-dd79-4ed6-83a9-374dcb18c39b) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, a76e82f6-b454-4b24-accb-b260fa361f27) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 0270cb74-4784-46ef-8521-38697daafb1d) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, d69b7468-9b3a-44b1-b7b0-4fcf7e3452d5) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, dc86f4ac-b416-410b-ab03-7912f1b9a1b8) 
ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, a9d71ef3-3a7f-4932-a952-534c1128a1e6) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, b07295ee-1c09-484c-9447-10b9efee340c) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, c37db96f-e689-4789-929c-4d45067ba654) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 101a1fba-acd7-4c9e-9e3c-8436c35e4cef) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 7097eae3-c049-4e5a-ad90-dcb54d82d575) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 17fdb7b4-a7cf-4f67-a736-cb84ec3d49e9) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (unde...


Alexis Lee (alexisl) wrote :

It's possible to request multiple instances at once. They all have to be the same flavor IIRC. If you do this, the scheduler will ensure the whole request can be met. However if you make multiple requests then as you've seen, it kind of assumes all prior requests failed unless the DB has been updated to say they succeeded.

Two obvious fixes occur to me. One, sync the scheduler with the compute hosts. Generally unacceptable but might be OK in an early TripleO context. Two, make the scheduler more optimistic. Either way is going to be tricky, work on converting to use Nova objects and splitting out Gantt has basically frozen all other work on the scheduler.

Robert Collins (lifeless) wrote :

Right, N at once works because the list of hosts is returned and zipped across the count of instances requested, but we can't make heat do that (yet, if ever) :(.

Not sure what you mean by syncing the scheduler with the compute hosts - do you mean moving the claim code into the scheduler, or something else?

In terms of optimism, that seems like a variation on the "
 - have the scheduler be a bit smarter about returning hosts - e.g. track destination selection counts since the last refresh and weight hosts by that count as well" suggestion I made in the bug ?

Alexis Lee (alexisl) wrote :

By syncing I mean synchronizing, i.e. make the scheduler wait for the instance to succeed or fail creation before scheduling another instance. There'd have to be a timeout too because messages can go missing (unsure how theoretical that is). Not pretty, and it could slow down stack standup significantly.

You're right, similar suggestions. Rather than destination selection counts, it should track resources consumed by outstanding requests. Worst case, resources are double-counted for a very short period if the DB gets updated before the claim is released. I've asked my colleague Paul Murray to provide his input, he knows this issue intimately.
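A minimal sketch of such an outstanding-claims tracker (hypothetical names, not nova code; the TTL plays the role of "until the DB catches up"):

```python
import threading
import time

class OutstandingClaims:
    """Remember resources handed out but not yet visible in the DB, and
    subtract them from the DB view before scheduling. Double-counting is
    tolerated briefly; expired claims drop out once the DB catches up."""
    def __init__(self, ttl=60.0):
        self._claims = []          # (timestamp, host, ram_mb)
        self._ttl = ttl
        self._lock = threading.Lock()

    def add(self, host, ram_mb):
        with self._lock:
            self._claims.append((time.monotonic(), host, ram_mb))

    def adjusted_free_ram(self, host, db_free_ram_mb):
        now = time.monotonic()
        with self._lock:
            # Drop claims old enough that the DB should reflect them.
            self._claims = [c for c in self._claims if now - c[0] < self._ttl]
            pending = sum(ram for ts, h, ram in self._claims if h == host)
        return db_free_ram_mb - pending

claims = OutstandingClaims()
claims.add("node4", 400)
claims.add("node4", 400)
assert claims.adjusted_free_ram("node4", 1489) == 689   # 2 pending claims
assert claims.adjusted_free_ram("node3", 1489) == 1489  # untouched host
```

The numbers mirror the scenario reported later in the thread: a host with 1489MB free and a 400MB flavor.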

Robert Collins (lifeless) wrote :

Cool, thank you! Having read more of the code I see that the db scheduler is reading the entire hosts list on every schedule request, which seems bonkers, but that's *clearly* a different discussion. However, it does simplify the choices we have, since we don't need to consider skewed views of the data, we just need to ensure that the resources are claimed immediately.

Robert Collins (lifeless) wrote :

This is horrid, but let me deploy the 29 node scenario perfectly first-time. https://review.openstack.org/106716

Robert Collins (lifeless) wrote :

It seems quite spiky as to triggering - I just had one test pass (29 deploys on a pool of 45, all worked), then the next run failed with 14 ERRORs due to this :/.

Phil Day (philip-day) wrote :

I believe it's an inherent part of the Nova design that compute nodes have to be able to reject and send requests back to the Scheduler (back to the conductor in the new way of doing things) if they get a request for resources that they no longer have.

For sure you don't want a lot of concurrency in the scheduler (try rpc_thread_pool_size=2), but I think compute managers in an Ironic setup will need to be able to follow the same basic approach.

Robert Collins (lifeless) wrote :

Sure, but this is pathological, we're scheduling incredibly poorly if requests come in at a fast enough rate. Ironic doesn't change *anything* about the basic approach, and the problem isn't unique to Ironic - KVM with machine sized VMs will behave identically, for instance.

IMO we need to consider this bug a result of design decisions, and we can revisit those after learning about the issues they have....

The issue here is that there is a data structure maintained by the compute process (perhaps via conductor for actual DB writes) but consulted by the scheduler process(es).

If we take a leaf from the BASE design guidelines, a small change to fix things might look something like the following, using a read-after-write pattern (which is in some ways heinous...)

add a scheduler grants table (timestamp, host, memory_gb, cpu_count, disk_gb)

scheduler receives a request:
 - gets a baseline in-memory view using the current read-all-rows approach
 - updates that by reading outstanding grants (or all and discard-in-memory)
 - schedules
 - writes a grant to the DB (using DB set timestamp)
 - reads from the DB to see if there were other grants made for the same host(s), and if they would have invalidated the grant.
   - if and only if they invalidate it and the timestamp on the invalidating grant was less than ours, fall over to the next viable host and remove the grant (we lost), otherwise we won the conflict.
 - return results

compute process receives an instance to build (or migrate):
 - takes its local claim-lock
 - reads the grant from the scheduler table (select from ... where host=self.hostname and cpu_count=... order by timestamp asc limit 1)
 - updates its host row in the DB with the now available resources
 - also sets the last-grant timestamp with the timestamp from that grant

And we add a periodic job to the scheduler to delete rows from the grant table where the timestamp is less than the timestamp in the compute table (or there is no matching compute row).

This would have the following properties:
 - min two more DB writes per scheduled instance (one to write the grant, one to delete it later)
 - min three more DB reads per scheduled instance (compute host has to read the grant, and the scheduler has to read all the unhandled grants and subtract them before scheduling, and scheduler has to look for conflicting grants after scheduling before returning)
 - up to viable-hosts extra writes and reads to handle falling over to the next viable host in the event of a conflict. Assume there are 10 schedulers and someone is flooding the system with identical requests. In each round one will win, but the losers will rejoin with the next request and presumably come up with the same next host that all the others have fallen over to. In this scenario scheduling will end up single threaded (with 9 threads of spinlock failures at any point in time), but it won't deadlock or halt.
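The grant write and conflict check described above might be sketched like this, using SQLite in place of the real database and a logical clock in place of DB-set timestamps (purely illustrative, not a nova patch; only RAM is tracked to keep it short):

```python
import itertools
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE grants (ts INTEGER, host TEXT, ram_mb INTEGER)")
db.execute("CREATE TABLE hosts (host TEXT PRIMARY KEY, free_ram_mb INTEGER)")
db.execute("INSERT INTO hosts VALUES ('node1', 1024)")
clock = itertools.count(1)  # stand-in for DB-set timestamps

def try_grant(host, ram_mb):
    """Write a grant, then re-read to see whether earlier grants on the
    same host invalidate ours. If they do, we lost the conflict: withdraw
    the grant so the caller can fall over to the next viable host."""
    ts = next(clock)
    db.execute("INSERT INTO grants VALUES (?, ?, ?)", (ts, host, ram_mb))
    free = db.execute(
        "SELECT free_ram_mb FROM hosts WHERE host=?", (host,)).fetchone()[0]
    earlier = db.execute(
        "SELECT COALESCE(SUM(ram_mb), 0) FROM grants WHERE host=? AND ts<?",
        (host, ts)).fetchone()[0]
    if earlier + ram_mb > free:
        db.execute("DELETE FROM grants WHERE ts=?", (ts,))
        return False
    return True

assert try_grant("node1", 600)       # first claim fits
assert not try_grant("node1", 600)   # second claim loses the conflict
assert try_grant("node1", 400)       # a smaller claim still fits
```

The compute side (reading its oldest matching grant, updating its host row, and advancing the last-grant timestamp) and the periodic cleanup job are left out of the sketch.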

Dmitry Tantsur (divius) on 2014-07-14
Changed in ironic:
status: New → Triaged
importance: Undecided → High
Sylvain Bauza (sylvain-bauza) wrote :

I personally dislike any change in the scheduler where it would wait for the end of booting an instance.
That would generate a locking mechanism in the scheduler while we currently have a chance to get one lock-free.

The scheduler makes decisions based on its internal representation. If that representation is not up to date, there is a good chance the request will eventually fail, but in that case it is the role of the compute node (formerly), and now the conductor, to issue a retry. Yes, it is by design that the scheduler can generate race conditions, but that also allows us to have an external scheduler without it being necessary to add bits of locking mechanism.

Long story short, we have to find out how to improve the retry mechanism in the conductor, not in the scheduler.

Robert Collins (lifeless) wrote :

@sylvain There are plenty of other proposals than 'interlock with instance booting' - so I disagree with you that improving the retry mechanism is the only option.

This is basically a consensus-view-of-resources problem, so any of the broad set of consensus answers will work.

Paul Murray (pmurray) wrote :

@lifeless - Do any of those consensus answers scale well and handle failure gracefully? ;)

I think the issue here is that there are two sets of problems at opposite ends of scale, in both the number of machines and what can fit on a single machine. At the "we have so many resources you can consider them infinite" end of the world, an approximate view is the optimal one, because it scales and there is a low likelihood of being wrong. In the case presented in this bug report, where the system is approaching resource starvation, the optimal solution would be a globally consistent view of resources.

The current design is aimed at the first of these two.

Improving the retry mechanism does make sense, but I would make sure that we do not sacrifice scalability and graceful failure handling in the process.

So anything that can shorten the delay in getting up to date information to the scheduler is good. Anything that introduces any kind of synchronisation is bad - including additional lookups in the database.

Sylvain Bauza (sylvain-bauza) wrote :

+1 with @PaulMurray. IMHO we have to work on making sure we can reduce the delay between the computes and the scheduler, but still without a locking mechanism, because the opposite would impact scalability.

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/106716
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

Sean Dague (sdague) on 2015-03-30
Changed in nova:
status: Triaged → Confirmed
Robert Collins (lifeless) wrote :

@PaulMurray, @Sylvain

This particular bug occurs with *one* scheduler instance, and 2 resources under subscription, and it's a clear example of bad data hygiene - eventual consistency doesn't imply the poor behaviour we currently exhibit.

Trivially, caching the schedule operations we've made locally, and folding those into our view until the hypervisor reports them, would fix this bug with no additional synchronisation overhead.

I appreciate that everyone is concerned about the scheduler, but before saying we can't use a technique, we need to be clear about what our success criteria are. One of the big scheduling problems we have is that we have no specific success criteria, and it's broken with bugs like this, so we get design pushback not on the basis of actual success or failure, but on fear that it will be worse.

For instance, if we define success as 'be able to place up to 10000 uses/second on up to 1M resources and no more than 10 seconds downtime in the event of scheduler failure', then we have a metric we can examine to assess scheduling implementations. E.g. A single fast scheduler with HA via fail-over on a 5s heartbeat with 15s warmup time could meet this. As could a distributed scheduler with sharding, or possibly a consensus scheduler with sync-on-over-subscribe [e.g. only one retry ever triggered].

Michael Davies (mrda) wrote :

This is not an Ironic bug. It's just triggered by using Ironic. This is a Nova scheduling bug (and a PITA to fix :(

no longer affects: ironic
tags: removed: ironic
Nikola Đipanov (ndipanov) wrote :

There was a spec proposed and accepted for Liberty that aims at fixing, or at least mostly mitigating, the described issues

http://specs.openstack.org/openstack/nova-specs/specs/liberty/approved/host-state-level-locking.html

However it did not make it into Liberty - I plan to re-propose it for M, and it would be great to get feedback from Ironic folks of course!

Fix proposed to branch: master
Review: https://review.openstack.org/226235

Changed in nova:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
status: Confirmed → In Progress
Changed in nova:
assignee: Lucas Alvares Gomes (lucasagomes) → nobody
Changed in nova:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)

Reviewed: https://review.openstack.org/226235
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=608af87541240793425760c16e528d9b3ec02fee
Submitter: Jenkins
Branch: master

commit 608af87541240793425760c16e528d9b3ec02fee
Author: Lucas Alvares Gomes <email address hidden>
Date: Mon Sep 21 16:36:04 2015 +0100

    Ironic: Workaround to mitigate bug #1341420

    The bug #1341420 can cause the driver to try to deploy onto a node which
    already has an instance associated with it, the way Ironic reserve a
    node to a specific instance is associating the UUID of that instance
    with the node via the instance_uuid field. This patch just makes the
    request to associate that field non-retriable when the api returns HTTP
    409 (Conflict) (that's the HTTP error indicating that the node is
    already associated with another instance), so it can fail fast in
    case the node is already pick and the RetryFilter could then pick
    another node for that instance.

    Partial-Bug: #1341420
    Change-Id: I697c5129893b18562182bbb3ceb8cb160e84c312

Yingxin (cyx1231st) wrote :

Did anyone notice the problem that `select_destination()` doesn't reuse host state cache to make decisions at all?

That is to say:
Scheduler will discard its host state and refresh it from DB each time at the beginning of `select_destination()`, instead of reusing the recent updated host state cache.

### explanation ###

Robert Collins pointed out in the bug description that the scheduler works well in situation [1], booting 45 instances with one command. But when he chose to boot those 45 instances with 45 concurrent commands in the second situation [2], up to 50% of the instances failed, which is unacceptable.

The real difference between [1] and [2] is that:
- In the first situation [1], the scheduler reuses the host state cache in the `for` loop [3], so the following 44 schedule decisions are made on the INCREMENTALLY UPDATED host state cache. Thus the result turns out to be accurate.
- However in [2], each of the 45 requests refreshes the host state in `get_all_host_states()` via logic [4] at the beginning. Thus these 45 concurrent schedule decisions are all made based on the SAME db state. No wonder 50% of them fail due to conflicts. Worse, the failure rate could reach 97.77% if CONF.scheduler_host_subset_size is 1, even under the most ideal conditions.

Another thing to point out:
Currently, there is no `sleep(0)` or any asynchronous request(except for the experimental trusted filter) during filtering and weighing. So schedule operations including 'refresh host state from db data', 'filtering', 'weighing', 'consume host states', 'return decision' can be treat as an atomic operation as a whole. This strengthens my opinion that in situation[2], scheduler will use the 'almost' same host state cache in 45 concurrent requests. Thus definitely those decisions will conflict.

[1] 15:46 < lifeless> nova boot <...> --num-instances 45 -> works fairly reliably. Some minor timeout related things to fix but nothing dramatic.
[2] 15:47 < lifeless> heat create-stack <...> with a stack with 45 instances in it -> about 50% of instances fail to come up
[3] https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L117-L149
[4] https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L160-L222
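The contrast between [1] and [2] can be modelled with a toy sketch (all names and numbers are illustrative, not nova code):

```python
# Toy model contrasting case [1] (one request for N instances, consuming
# the cached host state incrementally) with case [2] (N concurrent
# requests, each re-reading the same stale DB snapshot).
DB_FREE_RAM = {"node2": 1489, "node3": 1489, "node4": 1489}
FLAVOR_RAM = 400

def schedule_one_request(num_instances):
    # Case [1]: host state is refreshed once, then consumed in the loop.
    cache = dict(DB_FREE_RAM)
    placements = []
    for _ in range(num_instances):
        host = max(cache, key=cache.get)
        cache[host] -= FLAVOR_RAM          # incremental cache update
        placements.append(host)
    return placements

def schedule_concurrently(num_requests):
    # Case [2]: every request refreshes from the same DB snapshot, so
    # they all see identical free RAM and pick the same "best" host.
    return [max(DB_FREE_RAM, key=DB_FREE_RAM.get)
            for _ in range(num_requests)]

assert len(set(schedule_one_request(3))) == 3   # spread across hosts
assert len(set(schedule_concurrently(3))) == 1  # all collide on one host
```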

### solution ###

The reasonable solution is to refresh the host state only if the fetched data is newer than the db data the host state is based on.

The current timestamp in the compute_node table is not accurate enough, because the record is only precise to the second. It should be at least in milliseconds, or even use update counts, to determine whether a host state is outdated. This might deserve a bp to implement.
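A quick sketch of why seconds-level precision cannot order two updates landing within the same second (the times are illustrative, loosely echoing the 17:19:44-17:19:45 window analysed below):

```python
from datetime import datetime

# Two host-state updates within the same second are indistinguishable if
# the DB timestamp is truncated to seconds, so a "newer than my cache"
# comparison cannot tell which view is actually fresher.
t1 = datetime(2014, 7, 14, 17, 19, 45, 120000)  # first decision persisted
t2 = datetime(2014, 7, 14, 17, 19, 45, 935000)  # cache refreshed later

assert t1.replace(microsecond=0) == t2.replace(microsecond=0)  # seconds: "equal"
assert t2 > t1                                                 # sub-second: ordered
```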

Yingxin (cyx1231st) wrote :

Hmm... rephrased because I lost the code snippet[1].

But the same problem still exists because of line [2]: why should the host state be updated when it is already up to date with the db state carrying the same timestamp? This means concurrent schedule requests cannot benefit from the host state cache, so there are conflicts.

[1] https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L162-L164
[2] https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L163

Yingxin (cyx1231st) wrote :

<please forget about my previous comments>

I've reproduced this bug with *one* scheduler instance based on the latest nova code.

----- reproduce -----

My openstack environment:
There are 3 compute nodes (node2, node3, node4), each of which has 1489MB free RAM and plenty of other resources to boot instances. The scheduler is configured with "ram_allocation_ratio = 1.0" to prevent memory overcommit. And there is only one flavor, requesting 400MB RAM, so at most 3 instances are allowed to boot onto each compute node: 9 instances at most in this 3-node environment.

Then I requested to boot 9 instances named p1, p2, p3, p4, p5, p6, p7, p8, p9 using 9 separate RESTful API requests. I sent those requests in 3 groups; each group sent 3 requests in parallel, and the groups were sent every 0.5 seconds. It took about a second to finish all those requests.

Those 9 requests are all expected to succeed, but the full log [2] shows that they actually do not.
I also made some efforts to make those logs look more friendly [3].
As the log shows, the request to schedule p8 failed with 2 retries. Although it finally succeeded in selecting node3, it would likely fail outright if there were more nodes. Imagine up to 1000 nodes with much busier and more complex requests.

I modified scheduler logging a little to better understand what is happening inside the scheduler: added logs to indicate the beginning and end of `select_destination()` with the instance name from request_spec; added logs to indicate the start and end time around `objects.ComputeNodeList.get_all()` in `get_all_host_states()`; and most importantly, added logs to the code [1] to show whether a host state is successfully updated from the compute node db object.

----- log analysis -----

Things go well during #L1-L94, until request 8582 schedules instance p8 to node4. Note that node4 already has 3 instances at that time. The request will certainly fail because node4 cannot claim another 400MB of memory. Request 8582 still makes this wrong decision because the cached host state says node4 can fit another instance within 689MB of memory (#L92). It looks like it is all the host state's fault!

So back again to check how the host state of node4 went wrong... and here it is. After scheduling two instances to node4, #L66 shows that the host state was updated from the database, which told the scheduler that node4 has 1089MB RAM. That is to say, the db contained an outdated view but the scheduler still accepted it. According to logic [1], the cache is assumed outdated because the update time of the db is newer, so why is it still wrong?

Because there is a time GAP between the scheduler making a decision and the compute node persisting that decision. For example in log #L66 [2], the first request 132f was persisted to the db at 17:19:45, but the second request 8de7 still hadn't been persisted by 17:19:45.242. However, the node4 state cache was updated by the third request 8582 at 17:19:44.935. So the host state was refreshed from a db whose updated time is newer, but the refresh lost the consumption from request 8de7, and bad things happened!

The same bad happened again at the first retry of request 8582(the 10th column of [3...


Fix proposed to branch: master
Review: https://review.openstack.org/261920

Changed in nova:
assignee: Lucas Alvares Gomes (lucasagomes) → Yingxin (cyx1231st)
John Garbutt (johngarbutt) wrote :

Given the discussion around this in the scheduler meeting, I'm moving this to a wishlist item.

Much of the behaviour here is as designed, and the retry loop should minimise the effects seen by API users. The caching_scheduler also helps for cases where you only run one scheduler, reducing the races even further in most cases. The randomisation weigher also helps with that, somewhat.

The issue tends to be very deployment specific, and certainly the current scheduler is optimised for a cloud that generally has some free space and gets expanded before it becomes full, which was considered the typical cloud use case at the time the scheduler was written.

Changed in nova:
importance: High → Wishlist
Mark Goddard (mgoddard) wrote :

We have a 15 node test cluster in our lab, which is managed by Nova and Ironic. We have run into this issue when attempting to boot all 15 nodes, and in general a number of nodes will fail to boot. This can be seen when using heat or nova directly.

We have developed a couple of changes that drastically improve the behaviour.

The first change, which I believe fixes a bug, is to ensure appropriate resource limits are set by the Exact* filters, in the same way as is done by the other filters. For instance, DiskFilter writes the calculated disk_gb resource limit to a HostState object in host_passes if the host passes the filter (see https://github.com/openstack/nova/blob/master/nova/scheduler/filters/disk_filter.py#L59). Conversely, ExactDiskFilter does not set the disk_gb limit on the host state (see https://github.com/openstack/nova/blob/master/nova/scheduler/filters/exact_disk_filter.py#L23). I think this limit should be set in the exact filters too. The limit, if present, is later checked in the compute service during the claim, which allows the claim to verify that the requested resources are available, with synchronisation provided by doing the check in compute. What this achieves is to force invalid claims to fail in the compute service, rather than succeeding and causing strange problems with multiple instances trying to provision a single Ironic node.
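The essence of that first change, sketched with a stand-in HostState rather than nova's real filter classes (illustrative only, not the merged patch):

```python
class FakeHostState:
    """Stand-in for nova's HostState: just the fields this sketch uses."""
    def __init__(self, free_disk_mb):
        self.free_disk_mb = free_disk_mb
        self.limits = {}

def exact_disk_host_passes(host_state, requested_disk_mb):
    """Pass only on an exact disk match, and record the limit so the
    compute-side claim re-checks it under its own lock (the proposed fix:
    without the limit, an over-committed claim would silently succeed)."""
    if host_state.free_disk_mb != requested_disk_mb:
        return False
    host_state.limits['disk_gb'] = host_state.free_disk_mb / 1024.0
    return True

host = FakeHostState(free_disk_mb=1638400)
assert exact_disk_host_passes(host, 1638400)
assert host.limits['disk_gb'] == 1600.0          # claim can now verify this
assert not exact_disk_host_passes(FakeHostState(512000), 1638400)
```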

The second change we have developed is to recognise that when scheduling onto homogeneous Ironic nodes, we will typically have a large number of compute hosts pass through the scheduler filters, all with equal weight. In the default configuration, with scheduler_host_subset_size = 1, concurrent requests will all schedule onto the same node: the first in the list. As discussed earlier, increasing this number reduces the chance of a collision, but how big should it be? If we go too high, we undermine the weighting system, giving lower weight hosts a chance to be scheduled when they otherwise would not be. Our solution is to extend the host subset to include all top weight hosts. Therefore we can set scheduler_host_subset_size to a sensible number, but allow scheduling onto all of the 'equally best' nodes. Although the race still exists, the chance of hitting it is reduced.
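That second change might be sketched like this (illustrative, not the merged patch):

```python
import random

def top_weight_subset(weighed_hosts, host_subset_size=1):
    """Return the candidate pool: at least `host_subset_size` hosts, but
    always including every host tied at the maximum weight, so identical
    Ironic nodes all stay in play for the random pick."""
    if not weighed_hosts:
        return []
    top = weighed_hosts[0][1]  # list is sorted best-first
    tied = [h for h, w in weighed_hosts if w == top]
    pool = [h for h, _ in weighed_hosts[:host_subset_size]]
    return tied if len(tied) > len(pool) else pool

hosts = [("n1", 1.0), ("n2", 1.0), ("n3", 1.0), ("n4", 0.5)]
subset = top_weight_subset(hosts, host_subset_size=1)
assert subset == ["n1", "n2", "n3"]  # all equally-best hosts are candidates
assert random.choice(subset) in subset
```

The lower-weight n4 stays excluded, so the weighting system is preserved while concurrent requests spread across the tied hosts.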

Any thoughts on these approaches?

Fix proposed to branch: master
Review: https://review.openstack.org/273154

Changed in nova:
assignee: Yingxin (cyx1231st) → Mark Goddard (mgoddard)
Mark Goddard (mgoddard) wrote :

I ran the attached script on a devstack instance setup with 3 Ironic nodes. I used the fake Ironic driver because it should not affect Nova scheduling behaviour. First, I ran the script on unmodified master. Second, I modified the Exact* filters to set the resource limits on the HostState objects.

Unmodified:
Total runs 100 scheds 300 rescheds 203 errors 544

With limits applied to Exact* filters:
Total runs 100 scheds 300 rescheds 223 errors 0

Interestingly, although this change removed all errors, the number of reschedules slightly increased. This is not hugely surprising: the race is now caught earlier, at claim time, and each resulting reschedule is another opportunity for the race to occur.
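The claim-side behaviour that turns these races into reschedules rather than errors can be sketched as below. This is a hypothetical simplification of the compute resource tracker, not Nova's real code; the lock stands in for the tracker's synchronisation and the disk_gb limit is the value published by the filter:

```python
import threading

class ComputeNode:
    """Toy compute host tracking claimed disk against a scheduler limit."""
    def __init__(self, disk_gb_limit):
        self.disk_gb_limit = disk_gb_limit
        self.claimed_gb = 0
        self._lock = threading.Lock()

    def claim(self, requested_gb):
        # Claims on one node are serialised; with a limit in place the
        # loser of a scheduler race is rejected here, and the conductor
        # reschedules it instead of building a doomed instance.
        with self._lock:
            if self.claimed_gb + requested_gb > self.disk_gb_limit:
                return False
            self.claimed_gb += requested_gb
            return True
```

Two racing requests for a whole Ironic node thus result in one success and one reschedule, never two instances on one node.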

Mark Goddard (mgoddard) wrote :

I just ran the script again, this time with the second change described above, in which the host subset is extended to include all top weight hosts. The fix for the Exact* filter was also included.

With all top weight hosts change:
Total runs 100 scheds 300 rescheds 67 errors 0 fails 0

This result shows a dramatic reduction in the rate of reschedule events (67/300, about 22%) versus the unmodified code (203/300, about 68%).

Changed in nova:
assignee: Mark Goddard (mgoddard) → Yingxin (cyx1231st)
Changed in nova:
assignee: Yingxin (cyx1231st) → Mark Goddard (mgoddard)
Mark Goddard (mgoddard) on 2016-02-01
Changed in nova:
assignee: Mark Goddard (mgoddard) → nobody

Reviewed: https://review.openstack.org/273154
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3471cc8b74444d04348d8e6dbc9c980641d4b0bc
Submitter: Jenkins
Branch: master

commit 3471cc8b74444d04348d8e6dbc9c980641d4b0bc
Author: Mark Goddard <email address hidden>
Date: Wed Jan 27 03:49:07 2016 +0000

    Apply scheduler limits to Exact* filters

    The DiskFilter, RamFilter and CoreFilter scheduler filters set the
    resource limit on the HostState object during filtering. This is
    later checked in compute during the claim, and used to enforce
    the allocation ratios in a safe manner. The Exact* filters do not
    set the resource limits, so scheduler race conditions can result
    in multiple bare metal instances claiming a single host.

    This change sets the resource limits on the HostState object in
    the ExactCoreFilter, ExactDiskFilter and ExactRamFilter, ensuring
    that only a single baremetal instance can claim a compute host.

    Change-Id: I31d0331afc4698046a4568935a95f70f30e335dd
    Partial-Bug: #1341420
    Co-Authored-By: Will Miller <email address hidden>

Reviewed: https://review.openstack.org/276191
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=19e7abad96c5c8d19a442a41f799dabe9d600db8
Submitter: Jenkins
Branch: stable/liberty

commit 19e7abad96c5c8d19a442a41f799dabe9d600db8
Author: Mark Goddard <email address hidden>
Date: Wed Jan 27 03:49:07 2016 +0000

    Apply scheduler limits to Exact* filters

    The DiskFilter, RamFilter and CoreFilter scheduler filters set the
    resource limit on the HostState object during filtering. This is
    later checked in compute during the claim, and used to enforce
    the allocation ratios in a safe manner. The Exact* filters do not
    set the resource limits, so scheduler race conditions can result
    in multiple bare metal instances claiming a single host.

    This change sets the resource limits on the HostState object in
    the ExactCoreFilter, ExactDiskFilter and ExactRamFilter, ensuring
    that only a single baremetal instance can claim a compute host.

     Conflicts:
     nova/tests/unit/scheduler/filters/test_exact_core_filter.py
     nova/tests/unit/scheduler/filters/test_exact_disk_filter.py
     nova/tests/unit/scheduler/filters/test_exact_ram_filter.py

    Change-Id: I31d0331afc4698046a4568935a95f70f30e335dd
    Partial-Bug: #1341420
    Co-Authored-By: Will Miller <email address hidden>
    (cherry picked from commit 3471cc8b74444d04348d8e6dbc9c980641d4b0bc)

tags: added: in-stable-liberty
Changed in nova:
assignee: nobody → Yingxin (cyx1231st)
Sylvain Bauza (sylvain-bauza) wrote :

This bug report is still accurate, but given that we're working towards new resource providers for the scheduler and actively discussing the implementation details, I'd far prefer to close this bug and leave a spec that better describes the problem.

Changed in nova:
status: In Progress → Invalid

Change abandoned by Lee Yarwood (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/307219

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/261920
Reason: This review is > 6 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Vasyl Saienko (vsaienko) wrote :

The issue was reproduced in the ironic multinode job http://logs.openstack.org/75/427675/11/check/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial-nv/a66f08f/logs

Several instances were scheduled to the same ironic node.

http://logs.openstack.org/75/427675/11/check/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial-nv/a66f08f/logs/screen-n-sch.txt.gz#_2017-02-03_18_46_35_112

2017-02-03 18:46:35.112 21443 DEBUG nova.scheduler.filter_scheduler [req-f8358730-fc5a-4f56-959c-af8742c8c3c2 tempest-ServersTestJSON-264305795 tempest-ServersTestJSON-264305795] Selected host: WeighedHost [host: (ubuntu-xenial-2-node-osic-cloud1-s3500-7106433, e85b2653-5563-4742-b23f-7ceb15b2032f) ram: 384MB disk: 10240MB io_ops: 0 instances: 0, weight: 2.0] _schedule /opt/stack/new/nova/nova/scheduler/filter_scheduler.py:127

http://logs.openstack.org/75/427675/11/check/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial-nv/a66f08f/logs/screen-n-sch.txt.gz#_2017-02-03_18_46_35_204

2017-02-03 18:46:35.204 21443 DEBUG nova.scheduler.filter_scheduler [req-3ff28238-f02b-4914-9ee1-dc07a15f6c9d tempest-ServerActionsTestJSON-1934209030 tempest-ServerActionsTestJSON-1934209030] Selected host: WeighedHost [host: (ubuntu-xenial-2-node-osic-cloud1-s3500-7106433, e85b2653-5563-4742-b23f-7ceb15b2032f) ram: 384MB disk: 10240MB io_ops: 0 instances: 0, weight: 2.0] _schedule /opt/stack/new/nova/nova/scheduler/filter_scheduler.py:127
2017-02-03 18:46:35.204 21443 DEBUG oslo_concurrency.lockutils [req-3ff28238-f02b-4914-9ee1-dc07a15f6c9d tempest-ServerActionsTestJSON-1934209030 tempest-ServerActionsTestJSON-1934209030] Lock "(u'ubuntu-xenial-2-node-osic-cloud1-s3500-7106433', u'e85b2653-5563-4742-b23f-7ceb15b2032f')" acquired by "nova.scheduler.host_manager._locked" :: waited 0.000s inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:270

Changed in nova:
status: Invalid → New
Dan Smith (danms) wrote :

What you describe is fundamental to how nova works right now. We speculate in the scheduler, and if two requests race, we handle it with a reschedule. Nova specifically states that scheduling every last resource is out of scope. When trying to do that (which is often the use case for ironic), you're likely to hit this race as you run out of capacity:

https://github.com/openstack/nova/blob/master/doc/source/project_scope.rst#iaas-not-batch-processing

In the next few cycles we plan to move the claim process to the placement engine, which will eliminate most of these race-to-claim type issues, and in that situation things will be better for this kind of arrangement.

Until that point, this is not a bug though, because it's specifically how nova is designed to work.

Changed in nova:
status: New → Invalid
Alex Schultz (alex-schultz) wrote :

This is a really old bug and I don't think it applies to tripleo anymore (if ever). Setting to invalid for tripleo.

Changed in tripleo:
status: New → Invalid