Activity log for bug #1341420

Date Who What changed Old value New value Message
2014-07-14 04:15:03 Robert Collins bug added bug
2014-07-14 04:15:19 Robert Collins bug task added nova
2014-07-14 04:18:08 Robert Collins tags ironic scheduler
2014-07-14 04:28:49 Robert Collins nova: status New Triaged
2014-07-14 04:28:52 Robert Collins nova: importance Undecided High
2014-07-14 04:54:36 Steve Kowalik description (typo fix only: "where ever request will consume a full host" changed to "where every request will consume a full host"; the rest of the description is unchanged - full new value below)

There is a race between the scheduler in select_destinations, which selects a set of hosts, and the nova compute manager, which claims resources on those hosts when building the instance. The race is particularly noticeable with Ironic, where every request will consume a full host, but it can turn up with libvirt etc. too. Multiple schedulers will likely exacerbate this unless they are on a version of python with randomised dictionary ordering, in which case they will make it better :). I've put https://review.openstack.org/106677 up to remove a comment which dates from before we introduced this race.

One mitigating aspect to the race: the filter scheduler's _schedule method attempts to randomly select hosts to avoid returning the same host in repeated requests, but the default size of the set it selects from is 1 - so when heat requests a single instance, the same candidate is chosen every time. Setting that number higher can avoid all concurrent requests hitting the same host, but it will still be a race, and still likely to fail fairly hard in near-capacity situations (e.g. deploying all machines in a cluster with Ironic and Heat).

Folks wanting to reproduce this: take a decent-size cloud - e.g. 5 or 10 hypervisor hosts (KVM is fine). Deploy VMs until each hypervisor has capacity for only 1 more. Then deploy a bunch of VMs one at a time but very close together - e.g. use the python API to get cached keystone credentials, and boot 5 in a loop. If using Ironic you will want https://review.openstack.org/106676 to let you see which host is being returned from the selection.

Possible fixes:
- have the scheduler be a bit smarter about returning hosts - e.g. track destination selection counts since the last refresh and weight hosts by that count as well
- reinstate actioning claims in the scheduler, allowing the audit to correct any claimed-but-not-started resource counts asynchronously
- special-case the retry behaviour if there are lots of resources available elsewhere in the cluster.

Stats-wise, I just tested a 29-instance deployment with Ironic and a heat stack, with 45 machines to deploy onto (so 45 hosts in the scheduler set), and 4 failed with this race - which means they rescheduled and failed 3 times each, or 12 cases of scheduler racing *at minimum*.

background chat:

15:43 < lifeless> mikal: around? I need to sanity check something
15:44 < lifeless> ulp, nope, am sure of it. filing a bug.
15:45 < mikal> lifeless: ok
15:46 < lifeless> mikal: oh, you're here, I will run it past you :)
15:46 < lifeless> mikal: if you have ~5m
15:46 < mikal> Sure
15:46 < lifeless> so, symptoms
15:46 < lifeless> nova boot <...> --num-instances 45 -> works fairly reliably. Some minor timeout related things to fix but nothing dramatic.
15:47 < lifeless> heat create-stack <...> with a stack with 45 instances in it -> about 50% of instances fail to come up
15:47 < lifeless> this is with Ironic
15:47 < mikal> Sure
15:47 < lifeless> the failure on all the instances is the retry-three-times failure-of-death
15:47 < lifeless> what I believe is happening is this
15:48 < lifeless> the scheduler is allocating the same weighed list of hosts for requests that happen close enough together
15:49 < lifeless> and I believe its able to do that because the target hosts (from select_destinations) need to actually hit the compute node manager and have
15:49 < lifeless> with rt.instance_claim(context, instance, limits):
15:49 < lifeless> happen in _build_and_run_instance
15:49 < lifeless> before the resource usage is assigned
15:49 < mikal> Is heat making 45 separate requests to the nova API?
15:49 < lifeless> eys
15:49 < lifeless> yes
15:49 < lifeless> thats the key difference
15:50 < lifeless> same flavour, same image
15:50 < openstackgerrit> Sam Morrison proposed a change to openstack/nova: Remove cell api overrides for lock and unlock https://review.openstack.org/89487
15:50 < mikal> And you have enough quota for these instances, right?
15:50 < lifeless> yes
15:51 < mikal> I'd have to dig deeper to have an answer, but it sure does seem worth filing a bug for
15:51 < lifeless> my theory is that there is enough time between select_destinations in the conductor, and _build_and_run_instance in compute for another request to come in the front door and be scheduled to the same host
15:51 < mikal> That seems possible to me
15:52 < lifeless> I have no idea right now about how to fix it (other than to have the resources provisionally allocated by the scheduler before it sends a reply), but I am guessing that might be contentious
15:52 < mikal> I can't instantly think of a fix though -- we've avoided queue like behaviour for scheduling
15:52 < mikal> How big is the clsuter compared with 45 instances?
15:52 < mikal> Is it approximately the same size as that?
15:52 < lifeless> (by provisionally allocated, I mean 'claim them and let the audit in 60 seconds fix it up if they are not actually used')
15:53 < lifeless> sorry, not sure what yoy mean by that last question
15:53 < mikal> So, if you have 45 ironic instances to schedule, and 45 identical machines to do it, then the probability of picking the same machine more than once to schedule on is very high
15:53 < mikal> Wehereas if you had 500 machines, it would be low
15:53 < lifeless> oh yes, all the hardware is homogeneous
15:54 < lifeless> we believe this is common in clouds :)
15:54 < mikal> And the cluster is sized at approximately 45 machines?
15:54 < lifeless> the cluster is 46 machines but one is down for maintenance
15:54 < lifeless> so 45 machines available to schedule onto.
15:54 < mikal> Its the size of the cluster compared to the size of the set of instances which I'm most interested in
15:54 < lifeless> However - and this is the interesting thing
15:54 < lifeless> I tried a heat stack of 20 machines.
15:54 < lifeless> same symptoms
15:54 < mikal> Yeah, that's like the worst possible case for this algorithm
15:54 < lifeless> about 30% failed due to scheduler retries.
15:54 < mikal> Hmmm
15:54 < mikal> That is unexpected to me
15:55 < lifeless> that is when I dived into the code.
15:55 < lifeless> the patch I pushed above will make it possible to see if my theory is correct
15:55 < mikal> you were going to file a bug, right?
15:56 < lifeless> I have the form open to file one with tasks on ironic and nova
15:56 < mikal> I vote you do that thing
15:56 < lifeless> seconded
15:56 < lifeless> I might copy this transcript in as well
15:57 < mikal> Works for me
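The selection behaviour described in the entry above - the filter scheduler ranking hosts by weight and then choosing randomly from only the top few, with a default subset size of 1 - can be illustrated with a small standalone sketch. This is not the nova code itself: the host names, weights, and the pick_host/subset_size names below are made up for illustration, and the real subset size was governed by a scheduler configuration option (believed to be scheduler_host_subset_size at the time).

    import random
    from collections import Counter

    def pick_host(weighed_hosts, subset_size=1):
        # Model of the step described above: rank hosts by weight, then
        # choose randomly from only the top `subset_size` entries.
        ranked = sorted(weighed_hosts, key=lambda hw: hw[1], reverse=True)
        return random.choice(ranked[:subset_size])[0]

    # Hypothetical cloud: five hosts, one slightly preferred by the weighers.
    hosts = [("node-1", 1.0), ("node-2", 0.9), ("node-3", 0.9),
             ("node-4", 0.8), ("node-5", 0.8)]

    # Five near-simultaneous single-instance requests. Because no claim has
    # landed on any compute node yet, every request sees identical weights.
    print(Counter(pick_host(hosts, subset_size=1) for _ in range(5)))
    # subset_size=1: all five requests get "node-1", so the race is certain
    # to play out on that one host.

    print(Counter(pick_host(hosts, subset_size=3) for _ in range(5)))
    # subset_size=3: requests spread over the top three hosts; collisions
    # become less likely but remain possible, so the race is mitigated,
    # not removed - matching the description above.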
2014-07-14 08:11:02 Frank O'Neill bug added subscriber Frank O'Neill
2014-07-14 09:13:08 Lucas Alvares Gomes bug added subscriber Lucas Alvares Gomes
2014-07-14 13:25:26 Dmitry Tantsur ironic: status New Triaged
2014-07-14 13:25:32 Dmitry Tantsur ironic: importance Undecided High
2014-07-14 21:53:28 Adam Gandelman bug added subscriber Adam Gandelman
2014-07-17 08:15:18 Sylvain Bauza bug added subscriber Sylvain Bauza
2014-08-27 05:47:55 haruka tanizawa bug added subscriber haruka tanizawa
2015-03-30 14:37:38 Sean Dague nova: status Triaged Confirmed
2015-04-09 07:41:37 gustavo panizzo bug added subscriber gustavo panizzo
2015-05-12 22:10:07 John L. Villalovos bug added subscriber John L. Villalovos
2015-06-02 23:24:10 Tony Breeds bug added subscriber Tony Breeds
2015-06-02 23:25:47 Michael Davies bug task deleted ironic
2015-06-02 23:27:23 Michael Davies tags ironic scheduler scheduler
2015-08-26 11:31:14 Dmitry Tantsur bug added subscriber Dmitry Tantsur
2015-09-22 10:15:46 OpenStack Infra nova: status Confirmed In Progress
2015-09-22 10:15:46 OpenStack Infra nova: assignee Lucas Alvares Gomes (lucasagomes)
2015-09-22 10:28:40 Lucas Alvares Gomes nova: assignee Lucas Alvares Gomes (lucasagomes)
2015-11-26 12:29:55 OpenStack Infra nova: assignee Lucas Alvares Gomes (lucasagomes)
2015-12-23 08:16:24 Yingxin bug added subscriber Yingxin
2015-12-28 08:04:33 OpenStack Infra nova: assignee Lucas Alvares Gomes (lucasagomes) Yingxin (cyx1231st)
2016-01-04 14:39:03 Chris Dent bug added subscriber Chris Dent
2016-01-04 15:49:51 John Garbutt nova: importance High Wishlist
2016-01-26 11:54:26 Mark Goddard bug added subscriber Mark Goddard
2016-01-27 17:03:14 OpenStack Infra nova: assignee Yingxin (cyx1231st) Mark Goddard (mgoddard)
2016-01-27 17:11:58 Mark Goddard attachment added Script to test rescheduling https://bugs.launchpad.net/nova/+bug/1341420/+attachment/4557718/+files/reschedule_test.sh
2016-01-29 03:04:19 OpenStack Infra nova: assignee Mark Goddard (mgoddard) Yingxin (cyx1231st)
2016-02-01 09:42:24 OpenStack Infra nova: assignee Yingxin (cyx1231st) Mark Goddard (mgoddard)
2016-02-01 09:51:48 Mark Goddard nova: assignee Mark Goddard (mgoddard)
2016-03-01 02:53:17 OpenStack Infra tags scheduler in-stable-liberty scheduler
2016-03-07 07:42:05 OpenStack Infra nova: assignee Yingxin (cyx1231st)
2016-03-08 15:24:03 James Slagle bug task added tripleo
2016-04-18 10:12:22 Sylvain Bauza nova: status In Progress Invalid
2016-05-23 07:13:30 Shinobu KINJO bug added subscriber Shinobu KINJO
2017-02-03 20:24:31 Vasyl Saienko nova: status Invalid New
2017-02-03 22:38:49 Diana Clarke bug added subscriber Diana Clarke
2017-02-03 22:41:58 Dan Smith nova: status New Invalid
2017-02-04 01:47:58 Arata Notsu bug added subscriber Arata Notsu
2017-09-22 17:28:41 Alex Schultz tripleo: status New Invalid