gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit

Bug #1341420 reported by Robert Collins on 2014-07-14
This bug affects 4 people
Affects: OpenStack Compute (nova) - Importance: Wishlist - Assigned to: Yingxin
Affects: tripleo - Importance: Undecided - Assigned to: Unassigned

Bug Description

There is a race between the scheduler in select_destinations, which selects a set of hosts, and the nova compute manager, which claims resources on those hosts when building the instance. The race is particularly noticeable with Ironic, where every request will consume a full host, but it can turn up with libvirt etc. too. Multiple schedulers will likely exacerbate this, unless they are on a version of python with randomised dictionary ordering, in which case they will make it better :).

I've put https://review.openstack.org/106677 up to remove a comment which comes from before we introduced this race.

There is one mitigating aspect to the race: the filter scheduler's _schedule method attempts to randomly select hosts, to avoid returning the same host for repeated requests, but the default size of the subset it selects from is 1 - so when heat requests a single instance, the same candidate is chosen every time. Setting that number higher can stop all concurrent requests hitting the same host, but it is still a race, and still likely to fail fairly hard in near-capacity situations (e.g. deploying all machines in a cluster with Ironic and Heat).
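The behaviour of that subset selection can be modelled with a short sketch (this is an illustration of the idea, not nova's actual _schedule code):

```python
import random

def pick_host(weighed_hosts, host_subset_size=1):
    """Pick a host at random from the top `host_subset_size` candidates.

    `weighed_hosts` is assumed sorted best-first. With the default
    subset size of 1, every concurrent request gets the same best host,
    which is exactly the collision described in this bug.
    """
    subset = weighed_hosts[:max(1, host_subset_size)]
    return random.choice(subset)

hosts = ["node1", "node2", "node3"]  # sorted by weight, best first
assert pick_host(hosts, host_subset_size=1) == "node1"  # always the same host
assert pick_host(hosts, host_subset_size=3) in hosts    # spread, but still racy
```

Even with a larger subset, nothing prevents two near-simultaneous requests from picking the same member of the subset; the race is narrowed, not closed.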

Folk wanting to reproduce this: take a decent sized cloud - e.g. 5 or 10 hypervisor hosts (KVM is fine). Deploy until each hypervisor has capacity for exactly one more VM. Then deploy a bunch of VMs one at a time but very close together - e.g. use the python API to get cached keystone credentials, and boot 5 in a loop.

If using Ironic you will want https://review.openstack.org/106676 to let you see which host is being returned from the selection.

Possible fixes:
 - have the scheduler be a bit smarter about returning hosts - e.g. track destination selection counts since the last refresh and weight hosts by that count as well
 - reinstate actioning claims into the scheduler, allowing the audit to correct any claimed-but-not-started resource counts asynchronously
 - special case the retry behaviour if there are lots of resources available elsewhere in the cluster.
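As a rough illustration of the first suggestion (tracking destination selection counts and weighting by them), here is a sketch with hypothetical names; it is not a real nova weigher:

```python
from collections import Counter

class SelectionCountTracker:
    """Demote hosts already handed out since the last host-state refresh,
    so concurrent requests spread across the cluster.

    Minimal sketch of the first suggested fix; all names are illustrative.
    """
    def __init__(self):
        self._selected = Counter()

    def reset(self):
        # Call when fresh host state arrives from the database.
        self._selected.clear()

    def reweigh(self, weighed_hosts):
        # weighed_hosts: list of (host, weight). Subtract a penalty per
        # prior selection, re-sort best-first, and record the pick.
        rescored = sorted(
            ((host, weight - self._selected[host])
             for host, weight in weighed_hosts),
            key=lambda hw: hw[1], reverse=True)
        best = rescored[0][0]
        self._selected[best] += 1
        return best

tracker = SelectionCountTracker()
hosts = [("node1", 1.0), ("node2", 1.0), ("node3", 1.0)]
picks = [tracker.reweigh(hosts) for _ in range(3)]
assert sorted(picks) == ["node1", "node2", "node3"]  # no repeats until reset
```

With equal weights (the homogeneous Ironic case), three back-to-back requests land on three different hosts instead of all hitting the first one.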

Stats wise, I just tested a 29 instance deployment with Ironic and a heat stack, with 45 machines to deploy onto (so 45 hosts in the scheduler set), and 4 failed with this race - which means they rescheduled and failed 3 times each - or 12 cases of the scheduler racing *at minimum*.

background chat

15:43 < lifeless> mikal: around? I need to sanity check something
15:44 < lifeless> ulp, nope, am sure of it. filing a bug.
15:45 < mikal> lifeless: ok
15:46 < lifeless> mikal: oh, you're here, I will run it past you :)
15:46 < lifeless> mikal: if you have ~5m
15:46 < mikal> Sure
15:46 < lifeless> so, symptoms
15:46 < lifeless> nova boot <...> --num-instances 45 -> works fairly reliably. Some minor timeout related things to fix but nothing dramatic.
15:47 < lifeless> heat create-stack <...> with a stack with 45 instances in it -> about 50% of instances fail to come up
15:47 < lifeless> this is with Ironic
15:47 < mikal> Sure
15:47 < lifeless> the failure on all the instances is the retry-three-times failure-of-death
15:47 < lifeless> what I believe is happening is this
15:48 < lifeless> the scheduler is allocating the same weighed list of hosts for requests that happen close enough together
15:49 < lifeless> and I believe its able to do that because the target hosts (from select_destinations) need to actually hit the compute node manager and have
15:49 < lifeless> with rt.instance_claim(context, instance, limits):
15:49 < lifeless> happen in _build_and_run_instance
15:49 < lifeless> before the resource usage is assigned
15:49 < mikal> Is heat making 45 separate requests to the nova API?
15:49 < lifeless> eys
15:49 < lifeless> yes
15:49 < lifeless> thats the key difference
15:50 < lifeless> same flavour, same image
15:50 < mikal> And you have enough quota for these instances, right?
15:50 < lifeless> yes
15:51 < mikal> I'd have to dig deeper to have an answer, but it sure does seem worth filing a bug for
15:51 < lifeless> my theory is that there is enough time between select_destinations in the conductor, and _build_and_run_instance in compute for another request to come in the front door and be scheduled to the same host
15:51 < mikal> That seems possible to me
15:52 < lifeless> I have no idea right now about how to fix it (other than to have the resources provisionally allocated by the scheduler before it sends a reply), but I am guessing that might be contentious
15:52 < mikal> I can't instantly think of a fix though -- we've avoided queue like behaviour for scheduling
15:52 < mikal> How big is the clsuter compared with 45 instances?
15:52 < mikal> Is it approximately the same size as that?
15:52 < lifeless> (by provisionally allocated, I mean 'claim them and let the audit in 60 seconds fix it up if they are not actually used')
15:53 < lifeless> sorry, not sure what yoy mean by that last question
15:53 < mikal> So, if you have 45 ironic instances to schedule, and 45 identical machines to do it, then the probability of picking the same machine more than once to schedule on is very high
15:53 < mikal> Wehereas if you had 500 machines, it would be low
15:53 < lifeless> oh yes, all the hardware is homogeneous
15:54 < lifeless> we believe this is common in clouds :)
15:54 < mikal> And the cluster is sized at approximately 45 machines?
15:54 < lifeless> the cluster is 46 machines but one is down for maintenance
15:54 < lifeless> so 45 machines available to schedule onto.
15:54 < mikal> Its the size of the cluster compared to the size of the set of instances which I'm most interested in
15:54 < lifeless> However - and this is the interesting thing
15:54 < lifeless> I tried a heat stack of 20 machines.
15:54 < lifeless> same symptoms
15:54 < mikal> Yeah, that's like the worst possible case for this algorithm
15:54 < lifeless> about 30% failed due to scheduler retries.
15:54 < mikal> Hmmm
15:54 < mikal> That is unexpected to me
15:55 < lifeless> that is when I dived into the code.
15:55 < lifeless> the patch I pushed above will make it possible to see if my theory is correct
15:55 < mikal> you were going to file a bug, right?
15:56 < lifeless> I have the form open to file one with tasks on ironic and nova
15:56 < mikal> I vote you do that thing
15:56 < lifeless> seconded
15:56 < lifeless> I might copy this transcript in as well
15:57 < mikal> Works for me

tags: added: ironic scheduler
Changed in nova:
status: New → Triaged
importance: Undecided → High
Steve Kowalik (stevenk) on 2014-07-14
description: updated
Robert Collins (lifeless) wrote :

Ok, I've now confirmed that the weighed list order is the same each time, which means the hypothesis isn't disproved.

Robert Collins (lifeless) wrote :

Here is a snippet showing the scheduler thinking that cc0a5455-0978-4687-988e-8f3161cbc521 has no instances on it [host: (undercloud, cc0a5455-0978-4687-988e-8f3161cbc521) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], returning it in a list, and then an instance rescheduling because it actually has no resources:

scheduler.log:
2014-07-14 07:59:34.478 3915 DEBUG nova.scheduler.filter_scheduler [req-d701a25b-7bbc-44ed-9b31-57245e291ae1 None] Weighed [WeighedHost [host: (undercloud, aa4c5e32-46a5-46f2-b312-262a2df23980) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, c63bc9b2-5340-4d46-837a-a4f9a77a00d3) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 1b44fa97-a41f-4aa1-a578-08e82b6ed624) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, a5104c42-37e1-48ef-a5b6-0ba2d8808d8e) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 75e03645-caa6-4d16-9f9d-eb66bf0df769) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, cc0a5455-0978-4687-988e-8f3161cbc521) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 47bbaf38-9e0d-43a0-932b-eb0c9ec7af26) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 9686995e-3ff0-4a10-8c3c-bf3e09ffa770) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 578cf11e-3d79-45ae-883e-583a83f46e23) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, a3f05b92-e693-4338-843b-62c8ccaad397) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 69dc8c40-dd79-4ed6-83a9-374dcb18c39b) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, a76e82f6-b454-4b24-accb-b260fa361f27) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 0270cb74-4784-46ef-8521-38697daafb1d) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, d69b7468-9b3a-44b1-b7b0-4fcf7e3452d5) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, dc86f4ac-b416-410b-ab03-7912f1b9a1b8) 
ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, a9d71ef3-3a7f-4932-a952-534c1128a1e6) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, b07295ee-1c09-484c-9447-10b9efee340c) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, c37db96f-e689-4789-929c-4d45067ba654) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 101a1fba-acd7-4c9e-9e3c-8436c35e4cef) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 7097eae3-c049-4e5a-ad90-dcb54d82d575) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (undercloud, 17fdb7b4-a7cf-4f67-a736-cb84ec3d49e9) ram:98304 disk:1638400 io_ops:0 instances:0, weight: 1.0], WeighedHost [host: (unde...


Alexis Lee (alexisl) wrote :

It's possible to request multiple instances at once. They all have to be the same flavor IIRC. If you do this, the scheduler will ensure the whole request can be met. However if you make multiple requests then as you've seen, it kind of assumes all prior requests failed unless the DB has been updated to say they succeeded.

Two obvious fixes occur to me. One, sync the scheduler with the compute hosts. Generally unacceptable but might be OK in an early TripleO context. Two, make the scheduler more optimistic. Either way is going to be tricky, work on converting to use Nova objects and splitting out Gantt has basically frozen all other work on the scheduler.

Robert Collins (lifeless) wrote :

Right, N at once works because the list of hosts is returned and zipped across the count of instances requested, but we can't make heat do that (yet, if ever) :(.

Not sure what you mean by syncing the scheduler with the compute hosts - do you mean moving the claim code into the scheduler, or something else?

In terms of optimism, that seems like a variation on the "
 - have the scheduler be a bit smarter about returning hosts - e.g. track destination selection counts since the last refresh and weight hosts by that count as well" suggestion I made in the bug ?

Alexis Lee (alexisl) wrote :

By syncing I mean synchronizing, i.e. make the scheduler wait for the instance to succeed or fail creation before scheduling another instance. There'd have to be a timeout too because messages can go missing (unsure how theoretical that is). Not pretty, and it could slow down stack standup significantly.

You're right, similar suggestions. Rather than destination selection counts, it should track resources consumed by outstanding requests. Worst case, resources are double-counted for a very short period if the DB gets updated before the claim is released. I've asked my colleague Paul Murray to provide his input, he knows this issue intimately.
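A minimal sketch of such an outstanding-claims tracker (hypothetical names, not nova code; the TTL plays the role of "until the DB catches up"):

```python
import threading
import time

class OutstandingClaims:
    """Remember resources handed out but not yet visible in the DB, and
    subtract them from the DB view before scheduling. Double-counting is
    tolerated briefly; expired claims drop out once the DB catches up."""
    def __init__(self, ttl=60.0):
        self._claims = []          # (timestamp, host, ram_mb)
        self._ttl = ttl
        self._lock = threading.Lock()

    def add(self, host, ram_mb):
        with self._lock:
            self._claims.append((time.monotonic(), host, ram_mb))

    def adjusted_free_ram(self, host, db_free_ram_mb):
        now = time.monotonic()
        with self._lock:
            # Drop claims old enough that the DB should reflect them.
            self._claims = [c for c in self._claims if now - c[0] < self._ttl]
            pending = sum(ram for ts, h, ram in self._claims if h == host)
        return db_free_ram_mb - pending

claims = OutstandingClaims()
claims.add("node4", 400)
claims.add("node4", 400)
assert claims.adjusted_free_ram("node4", 1489) == 689   # 2 pending claims
assert claims.adjusted_free_ram("node3", 1489) == 1489  # untouched host
```

The numbers mirror the scenario reported later in the thread: a host with 1489MB free and a 400MB flavor.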

Robert Collins (lifeless) wrote :

Cool, thank you! Having read more of the code I see that the db scheduler is reading the entire hosts list on every schedule request, which seems bonkers, but that's *clearly* a different discussion. However, it does simplify the choices we have, since we don't need to consider skewed views of the data, we just need to ensure that the resources are claimed immediately.

Robert Collins (lifeless) wrote :

This is horrid, but let me deploy the 29 node scenario perfectly first-time. https://review.openstack.org/106716

Robert Collins (lifeless) wrote :

It seems quite spiky as to triggering - I just had one test pass (29 deploys on a pool of 45, all worked), then the next run failed with 14 ERRORs due to this :/.

Phil Day (philip-day) wrote :

I believe it's an inherent part of the Nova design that compute nodes have to be able to reject and send requests back to the Scheduler (back to the conductor in the new way of doing things) if they get a request for resources that they no longer have.

For sure you don't want a lot of concurrency in the scheduler (try rpc_thread_pool_size=2), but I think compute managers in an Ironic setup will need to be able to follow the same basic approach.

Robert Collins (lifeless) wrote :

Sure, but this is pathological, we're scheduling incredibly poorly if requests come in at a fast enough rate. Ironic doesn't change *anything* about the basic approach, and the problem isn't unique to Ironic - KVM with machine sized VMs will behave identically, for instance.

IMO we need to consider this bug a result of design decisions, and we can revisit those after learning about the issues they have....

The issue here is that there is a data structure maintained by the compute process (perhaps via conductor for actual DB writes) but consulted by the scheduler process(es).

If we take a leaf from the BASE design guidelines, a small change to fix things might look something like the following, using a read-after-write pattern (which is in some ways heinous...)

add a scheduler grants table (timestamp, host, memory_gb, cpu_count, disk_gb)

scheduler receives a request:
 - gets a baseline in-memory view using the current read-all-rows approach
 - updates that by reading outstanding grants (or all and discard-in-memory)
 - schedules
 - writes a grant to the DB (using DB set timestamp)
 - reads from the DB to see if there were other grants made for the same host(s), and if they would have invalidated the grant.
   - if and only if they invalidate it and the timestamp on the invalidating grant was less than ours, fall over to the next viable host and remove the grant (we lost), otherwise we won the conflict.
 - return results

compute process receives an instance to build (or migrate):
 - takes its local claim-lock
 - reads the grant from the scheduler table (select from ... where host=self.hostname and cpu_count=... order by timestamp asc limit 1)
 - updates its host row in the DB with the now available resources
 - also sets the last-grant timestamp with the timestamp from that grant

And we add a periodic job to the scheduler to delete rows from the grant table where the timestamp is less than the timestamp in the compute table (or there is no matching compute row).

This would have the following properties:
 - min two more DB writes per scheduled instance (one to write the grant, one to delete it later)
 - min three more DB reads per scheduled instance (compute host has to read the grant, and the scheduler has to read all the unhandled grants and subtract them before scheduling, and scheduler has to look for conflicting grants after scheduling before returning)
 - up to viable-hosts extra writes and reads to handle falling over to the next viable host in the event of a conflict. Assume there are 10 schedulers and someone is flooding the system with identical requests. In each round one will win, but the losers will rejoin with the next request and presumably come up with the same next host that all the others have fallen over to. In this scenario scheduling will end up single threaded (with 9 threads of spinlock failures at any point in time), but it won't deadlock or halt.
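The grant write and conflict check described above might be sketched like this, using SQLite in place of the real database and a logical clock in place of DB-set timestamps (purely illustrative, not a nova patch; only RAM is tracked to keep it short):

```python
import itertools
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE grants (ts INTEGER, host TEXT, ram_mb INTEGER)")
db.execute("CREATE TABLE hosts (host TEXT PRIMARY KEY, free_ram_mb INTEGER)")
db.execute("INSERT INTO hosts VALUES ('node1', 1024)")
clock = itertools.count(1)  # stand-in for DB-set timestamps

def try_grant(host, ram_mb):
    """Write a grant, then re-read to see whether earlier grants on the
    same host invalidate ours. If they do, we lost the conflict: withdraw
    the grant so the caller can fall over to the next viable host."""
    ts = next(clock)
    db.execute("INSERT INTO grants VALUES (?, ?, ?)", (ts, host, ram_mb))
    free = db.execute(
        "SELECT free_ram_mb FROM hosts WHERE host=?", (host,)).fetchone()[0]
    earlier = db.execute(
        "SELECT COALESCE(SUM(ram_mb), 0) FROM grants WHERE host=? AND ts<?",
        (host, ts)).fetchone()[0]
    if earlier + ram_mb > free:
        db.execute("DELETE FROM grants WHERE ts=?", (ts,))
        return False
    return True

assert try_grant("node1", 600)       # first claim fits
assert not try_grant("node1", 600)   # second claim loses the conflict
assert try_grant("node1", 400)       # a smaller claim still fits
```

The compute side (reading its oldest matching grant, updating its host row, and advancing the last-grant timestamp) and the periodic cleanup job are left out of the sketch.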

Dmitry Tantsur (divius) on 2014-07-14
Changed in ironic:
status: New → Triaged
importance: Undecided → High
Sylvain Bauza (sylvain-bauza) wrote :

I personally dislike any change in the scheduler where it would wait for the end of booting an instance.
That would generate a locking mechanism in the scheduler while we currently have a chance to get one lock-free.

The scheduler makes decisions based on its internal representation. If that representation is not up to date, there is a good chance the request will eventually fail, but in that case it is the role of the compute node (formerly), and now the conductor, to issue a retry. Yes, it is by design that the scheduler can generate race conditions, but that also allows us to have an external scheduler without it being necessary to add bits of locking mechanism.

Long story short, we have to find out how to improve the retry mechanism in the conductor, not in the scheduler.

Robert Collins (lifeless) wrote :

@sylvain There are plenty of other proposals than 'interlock with instance booting' - so I disagree with you that improving the retry mechanism is the only option.

This is basically a consensus-view-of-resources problem, so any of the broad set of consensus answers will work.

Paul Murray (pmurray) wrote :

@lifeless - Do any of those consensus answers scale well and handle failure gracefully? ;)

I think the issue here is that there are two sets of problems at opposite ends of scale, in both the number of machines and what can fit on a single machine. At the "we have so many resources you can consider them infinite" end of the world, an approximate view is the optimal one, because it scales and there is a low likelihood of being wrong. In the case presented in this bug report, where the system is approaching resource starvation, the optimal solution would be a globally consistent view of resources.

The current design is aimed at the first of these two.

Improving the retry mechanism does make sense, but I would make sure that we do not sacrifice scalability and graceful failure handling in the process.

So anything that can shorten the delay in getting up to date information to the scheduler is good. Anything that introduces any kind of synchronisation is bad - including additional lookups in the database.

Sylvain Bauza (sylvain-bauza) wrote :

+1 with @PaulMurray. IMHO we have to work on making sure we can reduce the delay between the computes and the scheduler, but still without a locking mechanism, because the opposite would impact scalability.

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/106716
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

Sean Dague (sdague) on 2015-03-30
Changed in nova:
status: Triaged → Confirmed
Robert Collins (lifeless) wrote :

@PaulMurray, @Sylvain

This particular bug occurs with *one* scheduler instance, and 2 resources under subscription, and it's a clear example of bad data hygiene - eventual consistency doesn't imply the poor behaviour we currently exhibit.

Trivially, caching the schedule operations we've made locally, and folding those into our view until the hypervisor reports them, would fix this bug with no additional synchronisation overhead.

I appreciate that everyone is concerned about the scheduler, but before saying we can't use a technique, we need to be clear about what our success criteria are. One of the big scheduling problems we have is that we have no specific success criteria, and it's broken with bugs like this, so we get design pushback not on the basis of actual success or failure, but on fear that it will be worse.

For instance, if we define success as 'be able to place up to 10000 uses/second on up to 1M resources and no more than 10 seconds downtime in the event of scheduler failure', then we have a metric we can examine to assess scheduling implementations. E.g. A single fast scheduler with HA via fail-over on a 5s heartbeat with 15s warmup time could meet this. As could a distributed scheduler with sharding, or possibly a consensus scheduler with sync-on-over-subscribe [e.g. only one retry ever triggered].

Michael Davies (mrda) wrote :

This is not an Ironic bug. It's just triggered by using Ironic. This is a Nova scheduling bug (and a PITA to fix :(

no longer affects: ironic
tags: removed: ironic
Nikola Đipanov (ndipanov) wrote :

There was a spec proposed and accepted for Liberty that aims at fixing, or at least mostly mitigating, the described issues

http://specs.openstack.org/openstack/nova-specs/specs/liberty/approved/host-state-level-locking.html

However it did not make it into Liberty - I plan to re-propose it for M, and it would be great to get feedback from Ironic folks of course!

Fix proposed to branch: master
Review: https://review.openstack.org/226235

Changed in nova:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
status: Confirmed → In Progress
Changed in nova:
assignee: Lucas Alvares Gomes (lucasagomes) → nobody
Changed in nova:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)

Reviewed: https://review.openstack.org/226235
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=608af87541240793425760c16e528d9b3ec02fee
Submitter: Jenkins
Branch: master

commit 608af87541240793425760c16e528d9b3ec02fee
Author: Lucas Alvares Gomes <email address hidden>
Date: Mon Sep 21 16:36:04 2015 +0100

    Ironic: Workaround to mitigate bug #1341420

    The bug #1341420 can cause the driver to try to deploy onto a node which
    already has an instance associated with it, the way Ironic reserve a
    node to a specific instance is associating the UUID of that instance
    with the node via the instance_uuid field. This patch just makes the
    request to associate that field non-retriable when the api returns HTTP
    409 (Conflict) (that's the HTTP error indicating that the node is
    already associated with another instance), so it can fail fast in
    case the node is already pick and the RetryFilter could then pick
    another node for that instance.

    Partial-Bug: #1341420
    Change-Id: I697c5129893b18562182bbb3ceb8cb160e84c312

Yingxin (cyx1231st) wrote :

Did anyone notice the problem that `select_destination()` doesn't reuse host state cache to make decisions at all?

That is to say:
Scheduler will discard its host state and refresh it from DB each time at the beginning of `select_destination()`, instead of reusing the recent updated host state cache.

### explanation ###

Robert Collins pointed out in the bug description that the scheduler works well in situation [1], booting 45 instances with one command. But when he chose to boot those 45 instances with 45 concurrent commands in the second situation [2], up to 50% of the instances failed, which is unacceptable.

The real difference between [1] and [2] is that:
- In the first situation [1], the scheduler reuses the host state cache in the `for` loop [3], so the following 44 schedule decisions are made on the INCREMENTALLY UPDATED host state cache. Thus the result turns out to be accurate.
- However in [2], each of the 45 requests refreshes the host state in `get_all_host_states()` via logic [4] at the beginning. Thus these 45 concurrent schedule decisions are all made based on the SAME db state. No wonder 50% of them fail due to conflicts. Worse, the failure rate could reach 97.77% if CONF.scheduler_host_subset_size is 1, even under the most ideal conditions.

Another thing to point out:
Currently, there is no `sleep(0)` or any asynchronous request(except for the experimental trusted filter) during filtering and weighing. So schedule operations including 'refresh host state from db data', 'filtering', 'weighing', 'consume host states', 'return decision' can be treat as an atomic operation as a whole. This strengthens my opinion that in situation[2], scheduler will use the 'almost' same host state cache in 45 concurrent requests. Thus definitely those decisions will conflict.

[1] 15:46 < lifeless> nova boot <...> --num-instances 45 -> works fairly reliably. Some minor timeout related things to fix but nothing dramatic.
[2] 15:47 < lifeless> heat create-stack <...> with a stack with 45 instances in it -> about 50% of instances fail to come up
[3] https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L117-L149
[4] https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L160-L222
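The contrast between [1] and [2] can be modelled with a toy sketch (all names and numbers are illustrative, not nova code):

```python
# Toy model contrasting case [1] (one request for N instances, consuming
# the cached host state incrementally) with case [2] (N concurrent
# requests, each re-reading the same stale DB snapshot).
DB_FREE_RAM = {"node2": 1489, "node3": 1489, "node4": 1489}
FLAVOR_RAM = 400

def schedule_one_request(num_instances):
    # Case [1]: host state is refreshed once, then consumed in the loop.
    cache = dict(DB_FREE_RAM)
    placements = []
    for _ in range(num_instances):
        host = max(cache, key=cache.get)
        cache[host] -= FLAVOR_RAM          # incremental cache update
        placements.append(host)
    return placements

def schedule_concurrently(num_requests):
    # Case [2]: every request refreshes from the same DB snapshot, so
    # they all see identical free RAM and pick the same "best" host.
    return [max(DB_FREE_RAM, key=DB_FREE_RAM.get)
            for _ in range(num_requests)]

assert len(set(schedule_one_request(3))) == 3   # spread across hosts
assert len(set(schedule_concurrently(3))) == 1  # all collide on one host
```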

### solution ###

The reasonable solution is to refresh the host state only if the fetched data is newer than the db data the host state is based on.

The current timestamp in the compute_node table is not accurate enough, because the record is only precise to the second. It should be at least in milliseconds, or even use update counts, to determine whether a host state is outdated. This might deserve a bp to implement.
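A quick sketch of why seconds-level precision cannot order two updates landing within the same second (the times are illustrative, loosely echoing the 17:19:44-17:19:45 window analysed below):

```python
from datetime import datetime

# Two host-state updates within the same second are indistinguishable if
# the DB timestamp is truncated to seconds, so a "newer than my cache"
# comparison cannot tell which view is actually fresher.
t1 = datetime(2014, 7, 14, 17, 19, 45, 120000)  # first decision persisted
t2 = datetime(2014, 7, 14, 17, 19, 45, 935000)  # cache refreshed later

assert t1.replace(microsecond=0) == t2.replace(microsecond=0)  # seconds: "equal"
assert t2 > t1                                                 # sub-second: ordered
```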

Yingxin (cyx1231st) wrote :

Hmm... rephrased because I lost the code snippet[1].

But the same problem still exists because of line [2]: why should the host state be updated when it is already up to date with the db state carrying the same timestamp? This means concurrent schedule requests cannot benefit from the host state cache, so there are conflicts.

[1] https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L162-L164
[2] https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L163

Yingxin (cyx1231st) wrote :

<please forget about my previous comments>

I've reproduced this bug with *one* scheduler instance based on the latest nova code.

----- reproduce -----

My openstack environment:
There are 3 compute nodes (node2, node3, node4), each of which has 1489MB free RAM and plenty of other resources to boot instances. The scheduler is configured with "ram_allocation_ratio = 1.0" to prevent memory overcommit. And there is only one flavor, requesting 400MB RAM, so at most 3 instances are allowed to boot onto each compute node: 9 instances at most in this 3-node environment.

Then I requested to boot 9 instances named p1, p2, p3, p4, p5, p6, p7, p8, p9 using 9 separate RESTful API requests. I sent those requests in 3 groups; each group sent 3 requests in parallel, and the groups were sent every 0.5 seconds. It took about a second to finish all those requests.

Those 9 requests are all expected to succeed, but the full log [2] shows that they actually do not.
I also made some efforts to make those logs look more friendly [3].
As the log shows, the request to schedule p8 failed with 2 retries. Although it finally succeeded in selecting node3, it would likely fail outright if there were more nodes. Imagine up to 1000 nodes with much busier and more complex requests.

I modified scheduler logging a little to better understand what is happening inside the scheduler: added logs to indicate the beginning and end of `select_destination()` with the instance name from request_spec; added logs to indicate the start and end time around `objects.ComputeNodeList.get_all()` in `get_all_host_states()`; and most importantly, added logs to the code [1] to show whether a host state is successfully updated from the compute node db object.

----- log analysis -----

Things go well during #L1-L94, until request 8582 schedules instance p8 to node4. Note that node4 already has 3 instances at that time. The request will certainly fail because node4 cannot claim another 400MB of memory. Request 8582 still makes this wrong decision because the cached host state says node4 can fit another instance within 689MB of memory (#L92). It looks like it is all the host state's fault!

So back again to check how the host state of node4 went wrong... and here it is. After scheduling two instances to node4, #L66 shows that the host state was updated from the database, which told the scheduler that node4 has 1089MB RAM. That is to say, the db contained an outdated view but the scheduler still accepted it. According to logic [1], the cache is assumed outdated because the update time of the db is newer, so why is it still wrong?

Because there is a time GAP between the scheduler making a decision and the compute node persisting that decision. For example in log #L66 [2], the first request 132f was persisted to the db at 17:19:45, but the second request 8de7 still hadn't been persisted by 17:19:45.242. However, the node4 state cache was updated by the third request 8582 at 17:19:44.935. So the host state was refreshed from a db whose updated time is newer, but the refresh lost the consumption from request 8de7, and bad things happened!

The same bad happened again at the first retry of request 8582(the 10th column of [3...


Fix proposed to branch: master
Review: https://review.openstack.org/261920

Changed in nova:
assignee: Lucas Alvares Gomes (lucasagomes) → Yingxin (cyx1231st)
John Garbutt (johngarbutt) wrote :

Given the discussion around this in the scheduler meeting, I'm moving this to a wishlist item.

Much of the behaviour here is as designed, and the retry loop should minimise the effects seen by API users. The caching_scheduler also helps for cases where you only run one scheduler, reducing the races even further in most cases. The randomisation weigher also helps with that, somewhat.

The issue tends to be very deployment specific, and certainly the current scheduler is optimised for a cloud that generally has some free space and gets expanded before it becomes full, which was considered the typical cloud use case at the time the scheduler was written.

Changed in nova:
importance: High → Wishlist
Mark Goddard (mgoddard) wrote :

We have a 15 node test cluster in our lab, which is managed by Nova and Ironic. We have run into this issue when attempting to boot all 15 nodes, and in general a number of nodes will fail to boot. This can be seen when using heat or nova directly.

We have developed a couple of changes that drastically improve the behaviour.

The first change, which I believe fixes a bug, is to ensure appropriate resource limits are set by the Exact* filters, in the same way as is done by the other filters. For instance, DiskFilter writes the calculated disk_gb resource limit to a HostState object in host_passes if the host passes the filter (see https://github.com/openstack/nova/blob/master/nova/scheduler/filters/disk_filter.py#L59). Conversely, ExactDiskFilter does not set the disk_gb limit on the host state (see https://github.com/openstack/nova/blob/master/nova/scheduler/filters/exact_disk_filter.py#L23). I think this limit should be set in the exact filters too. The limit, if present, is later checked in the compute service during the claim, which allows the claim to verify that the requested resources are available, with synchronisation provided by doing the check in compute. What this achieves is to force invalid claims to fail in the compute service, rather than succeeding and causing strange problems with multiple instances trying to provision a single Ironic node.
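The essence of that first change, sketched with a stand-in HostState rather than nova's real filter classes (illustrative only, not the merged patch):

```python
class FakeHostState:
    """Stand-in for nova's HostState: just the fields this sketch uses."""
    def __init__(self, free_disk_mb):
        self.free_disk_mb = free_disk_mb
        self.limits = {}

def exact_disk_host_passes(host_state, requested_disk_mb):
    """Pass only on an exact disk match, and record the limit so the
    compute-side claim re-checks it under its own lock (the proposed fix:
    without the limit, an over-committed claim would silently succeed)."""
    if host_state.free_disk_mb != requested_disk_mb:
        return False
    host_state.limits['disk_gb'] = host_state.free_disk_mb / 1024.0
    return True

host = FakeHostState(free_disk_mb=1638400)
assert exact_disk_host_passes(host, 1638400)
assert host.limits['disk_gb'] == 1600.0          # claim can now verify this
assert not exact_disk_host_passes(FakeHostState(512000), 1638400)
```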

The second change we have developed is to recognise that when scheduling onto homogeneous Ironic nodes, we will typically have a large number of compute hosts pass through the scheduler filters, all with equal weight. In the default configuration, with scheduler_host_subset_size = 1, concurrent requests will all schedule onto the same node: the first in the list. As discussed earlier, increasing this number reduces the chance of a collision, but how big should it be? If we go too high, we undermine the weighting system, giving lower weight hosts a chance to be scheduled when they otherwise would not be. Our solution is to extend the host subset to include all top weight hosts. Therefore we can set scheduler_host_subset_size to a sensible number, but allow scheduling onto all of the 'equally best' nodes. Although the race still exists, the chance of hitting it is reduced.
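That second change might be sketched like this (illustrative, not the merged patch):

```python
import random

def top_weight_subset(weighed_hosts, host_subset_size=1):
    """Return the candidate pool: at least `host_subset_size` hosts, but
    always including every host tied at the maximum weight, so identical
    Ironic nodes all stay in play for the random pick."""
    if not weighed_hosts:
        return []
    top = weighed_hosts[0][1]  # list is sorted best-first
    tied = [h for h, w in weighed_hosts if w == top]
    pool = [h for h, _ in weighed_hosts[:host_subset_size]]
    return tied if len(tied) > len(pool) else pool

hosts = [("n1", 1.0), ("n2", 1.0), ("n3", 1.0), ("n4", 0.5)]
subset = top_weight_subset(hosts, host_subset_size=1)
assert subset == ["n1", "n2", "n3"]  # all equally-best hosts are candidates
assert random.choice(subset) in subset
```

The lower-weight n4 stays excluded, so the weighting system is preserved while concurrent requests spread across the tied hosts.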

Any thoughts on these approaches?

Fix proposed to branch: master
Review: https://review.openstack.org/273154

Changed in nova:
assignee: Yingxin (cyx1231st) → Mark Goddard (mgoddard)
Mark Goddard (mgoddard) wrote :

I ran the attached script on a devstack instance setup with 3 Ironic nodes. I used the fake Ironic driver because it should not affect Nova scheduling behaviour. First, I ran the script on unmodified master. Second, I modified the Exact* filters to set the resource limits on the HostState objects.

Unmodified:
Total runs 100 scheds 300 rescheds 203 errors 544

With limits applied to Exact* filters:
Total runs 100 scheds 300 rescheds 223 errors 0

Interestingly, although this change removed all errors, the number of reschedules slightly increased. This is not hugely surprising: the race is now caught earlier, at claim time, and each resulting reschedule is another opportunity for the race to occur.
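The claim-side behaviour that turns these races into reschedules rather than errors can be sketched as below. This is a hypothetical simplification of the compute resource tracker, not Nova's real code; the lock stands in for the tracker's synchronisation and the disk_gb limit is the value published by the filter:

```python
import threading

class ComputeNode:
    """Toy compute host tracking claimed disk against a scheduler limit."""
    def __init__(self, disk_gb_limit):
        self.disk_gb_limit = disk_gb_limit
        self.claimed_gb = 0
        self._lock = threading.Lock()

    def claim(self, requested_gb):
        # Claims on one node are serialised; with a limit in place the
        # loser of a scheduler race is rejected here, and the conductor
        # reschedules it instead of building a doomed instance.
        with self._lock:
            if self.claimed_gb + requested_gb > self.disk_gb_limit:
                return False
            self.claimed_gb += requested_gb
            return True
```

Two racing requests for a whole Ironic node thus result in one success and one reschedule, never two instances on one node.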

Mark Goddard (mgoddard) wrote :

I just ran the script again, this time with the second change described above, in which the host subset is extended to include all top weight hosts. The fix for the Exact* filter was also included.

With all top weight hosts change:
Total runs 100 scheds 300 rescheds 67 errors 0 fails 0

This result shows a dramatic reduction in the rate of reschedule events (67/300, about 22%) versus the unmodified code (203/300, about 68%).

Changed in nova:
assignee: Mark Goddard (mgoddard) → Yingxin (cyx1231st)
Changed in nova:
assignee: Yingxin (cyx1231st) → Mark Goddard (mgoddard)
Mark Goddard (mgoddard) on 2016-02-01
Changed in nova:
assignee: Mark Goddard (mgoddard) → nobody

Reviewed: https://review.openstack.org/273154
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3471cc8b74444d04348d8e6dbc9c980641d4b0bc
Submitter: Jenkins
Branch: master

commit 3471cc8b74444d04348d8e6dbc9c980641d4b0bc
Author: Mark Goddard <email address hidden>
Date: Wed Jan 27 03:49:07 2016 +0000

    Apply scheduler limits to Exact* filters

    The DiskFilter, RamFilter and CoreFilter scheduler filters set the
    resource limit on the HostState object during filtering. This is
    later checked in compute during the claim, and used to enforce
    the allocation ratios in a safe manner. The Exact* filters do not
    set the resource limits, so scheduler race conditions can result
    in multiple bare metal instances claiming a single host.

    This change sets the resource limits on the HostState object in
    the ExactCoreFilter, ExactDiskFilter and ExactRamFilter, ensuring
    that only a single baremetal instance can claim a compute host.

    Change-Id: I31d0331afc4698046a4568935a95f70f30e335dd
    Partial-Bug: #1341420
    Co-Authored-By: Will Miller <email address hidden>

Reviewed: https://review.openstack.org/276191
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=19e7abad96c5c8d19a442a41f799dabe9d600db8
Submitter: Jenkins
Branch: stable/liberty

commit 19e7abad96c5c8d19a442a41f799dabe9d600db8
Author: Mark Goddard <email address hidden>
Date: Wed Jan 27 03:49:07 2016 +0000

    Apply scheduler limits to Exact* filters

    The DiskFilter, RamFilter and CoreFilter scheduler filters set the
    resource limit on the HostState object during filtering. This is
    later checked in compute during the claim, and used to enforce
    the allocation ratios in a safe manner. The Exact* filters do not
    set the resource limits, so scheduler race conditions can result
    in multiple bare metal instances claiming a single host.

    This change sets the resource limits on the HostState object in
    the ExactCoreFilter, ExactDiskFilter and ExactRamFilter, ensuring
    that only a single baremetal instance can claim a compute host.

     Conflicts:
     nova/tests/unit/scheduler/filters/test_exact_core_filter.py
     nova/tests/unit/scheduler/filters/test_exact_disk_filter.py
     nova/tests/unit/scheduler/filters/test_exact_ram_filter.py

    Change-Id: I31d0331afc4698046a4568935a95f70f30e335dd
    Partial-Bug: #1341420
    Co-Authored-By: Will Miller <email address hidden>
    (cherry picked from commit 3471cc8b74444d04348d8e6dbc9c980641d4b0bc)

tags: added: in-stable-liberty
Changed in nova:
assignee: nobody → Yingxin (cyx1231st)
Sylvain Bauza (sylvain-bauza) wrote :

This bug report is still accurate, but given that we're working towards new resource providers for the scheduler and actively discussing the implementation details, I'd far prefer to close this bug and leave a spec that better describes the problem.

Changed in nova:
status: In Progress → Invalid

Change abandoned by Lee Yarwood (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/307219

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/261920
Reason: This review is > 6 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Vasyl Saienko (vsaienko) wrote :

The issue was reproduced in the ironic multinode job http://logs.openstack.org/75/427675/11/check/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial-nv/a66f08f/logs

Several instances were scheduled to the same ironic node.

http://logs.openstack.org/75/427675/11/check/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial-nv/a66f08f/logs/screen-n-sch.txt.gz#_2017-02-03_18_46_35_112

2017-02-03 18:46:35.112 21443 DEBUG nova.scheduler.filter_scheduler [req-f8358730-fc5a-4f56-959c-af8742c8c3c2 tempest-ServersTestJSON-264305795 tempest-ServersTestJSON-264305795] Selected host: WeighedHost [host: (ubuntu-xenial-2-node-osic-cloud1-s3500-7106433, e85b2653-5563-4742-b23f-7ceb15b2032f) ram: 384MB disk: 10240MB io_ops: 0 instances: 0, weight: 2.0] _schedule /opt/stack/new/nova/nova/scheduler/filter_scheduler.py:127

http://logs.openstack.org/75/427675/11/check/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial-nv/a66f08f/logs/screen-n-sch.txt.gz#_2017-02-03_18_46_35_204

2017-02-03 18:46:35.204 21443 DEBUG nova.scheduler.filter_scheduler [req-3ff28238-f02b-4914-9ee1-dc07a15f6c9d tempest-ServerActionsTestJSON-1934209030 tempest-ServerActionsTestJSON-1934209030] Selected host: WeighedHost [host: (ubuntu-xenial-2-node-osic-cloud1-s3500-7106433, e85b2653-5563-4742-b23f-7ceb15b2032f) ram: 384MB disk: 10240MB io_ops: 0 instances: 0, weight: 2.0] _schedule /opt/stack/new/nova/nova/scheduler/filter_scheduler.py:127
2017-02-03 18:46:35.204 21443 DEBUG oslo_concurrency.lockutils [req-3ff28238-f02b-4914-9ee1-dc07a15f6c9d tempest-ServerActionsTestJSON-1934209030 tempest-ServerActionsTestJSON-1934209030] Lock "(u'ubuntu-xenial-2-node-osic-cloud1-s3500-7106433', u'e85b2653-5563-4742-b23f-7ceb15b2032f')" acquired by "nova.scheduler.host_manager._locked" :: waited 0.000s inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:270

Changed in nova:
status: Invalid → New
Dan Smith (danms) wrote :

What you describe is fundamental to how nova works right now. We speculate in the scheduler, and if two requests race, we handle it with a reschedule. Nova specifically states that scheduling every last resource is out of scope. When trying to do that (which is often the use case for ironic), you're likely to hit this race as you run out of capacity:

https://github.com/openstack/nova/blob/master/doc/source/project_scope.rst#iaas-not-batch-processing

In the next few cycles we plan to move the claim process to the placement engine, which will eliminate most of these race-to-claim type issues, and in that situation things will be better for this kind of arrangement.

Until that point, this is not a bug though, because it's specifically how nova is designed to work.

Changed in nova:
status: New → Invalid
Alex Schultz (alex-schultz) wrote :

This is a really old bug and I don't think it applies to tripleo anymore (if ever). Setting to invalid for tripleo.

Changed in tripleo:
status: New → Invalid