Reschedule of failed instance doesn't happen when scheduler places two instances on the same ironic node

Bug #1670319 reported by Vasyl Saienko
Affects: OpenStack Compute (nova)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

There is a known bug https://bugs.launchpad.net/tripleo/+bug/1341420 caused by nova's scheduling/claim design. In short, the scheduler may schedule different instances to the same ironic node, and the second instance will always fail because the resource claim is done on the nova-compute side.
The second instance should be rescheduled once it fails, but that doesn't happen.

1. First instance is placed on c4d5e326-7ad3-4c25-bfe5-3cab211a723e
http://logs.openstack.org/71/441271/1/gate/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial/b0037ad/logs/screen-n-sch.txt.gz#_2017-03-06_09_08_08_343

2017-03-06 09:08:08.343 20337 DEBUG nova.scheduler.filter_scheduler [req-d7c167ea-4bd9-40fe-bfa6-452695a40fa9 tempest-ServersTestJSON-207710543 tempest-ServersTestJSON-207710543] Selected host: WeighedHost [host: (ubuntu-xenial-2-node-osic-cloud1-s3500-7711232-456798, c4d5e326-7ad3-4c25-bfe5-3cab211a723e) ram: 384MB disk: 10240MB io_ops: 0 instances: 0, weight: 2.0] _schedule /opt/stack/new/nova/nova/scheduler/filter_scheduler.py:126

2. Second instance is placed on c4d5e326-7ad3-4c25-bfe5-3cab211a723e
http://logs.openstack.org/71/441271/1/gate/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial/b0037ad/logs/screen-n-sch.txt.gz#_2017-03-06_09_08_08_421
2017-03-06 09:08:08.421 20337 DEBUG nova.scheduler.filter_scheduler [req-f903ab7f-7525-4567-82f7-8bf2f2b53c86 tempest-ServerActionsTestJSON-1730451988 tempest-ServerActionsTestJSON-1730451988] Selected host: WeighedHost [host: (ubuntu-xenial-2-node-osic-cloud1-s3500-7711232-456798, c4d5e326-7ad3-4c25-bfe5-3cab211a723e) ram: 384MB disk: 10240MB io_ops: 0 instances: 0, weight: 2.0] _schedule

3. nova-compute doesn't reschedule the failed instance

http://logs.openstack.org/71/441271/1/gate/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial/b0037ad/logs/subnode-2/screen-n-cpu.txt.gz#_2017-03-06_09_08_09_137

2017-03-06 09:08:09.137 31801 DEBUG nova.compute.manager [req-f903ab7f-7525-4567-82f7-8bf2f2b53c86 tempest-ServerActionsTestJSON-1730451988 tempest-ServerActionsTestJSON-1730451988] [instance: bef43a32-f310-4ef4-8264-c7bc064856b1] Retry info not present, will not reschedule _do_build_and_run_instance /opt/stack/new/nova/nova/compute/manager.py:1788
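
To illustrate the decision behind that log line, here is a minimal, self-contained Python sketch (not nova's actual code, just an illustration): _do_build_and_run_instance() only reschedules a failed build when a 'retry' dict is present in filter_properties.

    # Illustration (not nova code) of the decision that produced the log
    # line above: no 'retry' dict in filter_properties means no reschedule.
    def handle_failed_build(filter_properties):
        retry = filter_properties.get('retry')
        if not retry:
            print("Retry info not present, will not reschedule")
            return "FAILED"
        retry['num_attempts'] = retry.get('num_attempts', 0) + 1
        print("Rescheduling, attempt %d" % retry['num_attempts'])
        return "RESCHEDULED"

    # What the second instance apparently hit: no 'retry' key at all.
    handle_failed_build({})                                            # FAILED
    handle_failed_build({'retry': {'num_attempts': 1, 'hosts': []}})   # RESCHEDULED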

Matt Riedemann (mriedem) wrote:

What is [scheduler]/max_attempts set to in nova.conf? By default that's 3.
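
For reference, the option lives in nova.conf like this (the value shown here is just the default, not taken from this CI run):

    [scheduler]
    # Maximum number of attempts to schedule an instance (nova's default is 3).
    max_attempts = 3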

tags: added: ironic scheduler
Matt Riedemann (mriedem) wrote:

Looks like in this CI run max_attempts is the default of 3:

2017-03-06 09:05:40.796 31801 DEBUG oslo_service.service [req-3f161188-1da5-4fef-b651-d1783a386daa - -] scheduler.max_attempts = 3 log_opt_values /usr/local/lib/python2.7/dist-packages/oslo_config/cfg.py:2744

This is where the retry filter property gets set:

https://github.com/openstack/nova/blob/5cf6bbf374a8b877d2e158aa8802b31d14ceb121/nova/scheduler/utils.py#L153

Is force_hosts or force_nodes set?
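
For anyone following along, a paraphrased, simplified sketch of what that populate_retry() logic does (see the link above for the real code; max_attempts is a parameter here instead of coming from CONF):

    # Simplified paraphrase of nova/scheduler/utils.py:populate_retry().
    def populate_retry(filter_properties, instance_uuid, max_attempts=3):
        force_hosts = filter_properties.get('force_hosts', [])
        force_nodes = filter_properties.get('force_nodes', [])

        # Retries are disabled when max_attempts is 1 or exactly one forced
        # host/node is given; then no 'retry' dict is added and the compute
        # node later logs "Retry info not present, will not reschedule".
        if max_attempts == 1 or len(force_hosts) == 1 or len(force_nodes) == 1:
            return

        retry = filter_properties.setdefault(
            'retry', {'num_attempts': 0, 'hosts': []})
        retry['num_attempts'] += 1

        if retry['num_attempts'] > max_attempts:
            raise RuntimeError('Exceeded max scheduling attempts %d for '
                               'instance %s' % (max_attempts, instance_uuid))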

Matt Riedemann (mriedem) wrote:

It doesn't look like force_hosts or force_nodes is set in this case, so I'm not sure why the scheduler isn't performing retries.

Sylvain Bauza (sylvain-bauza) wrote:

Just a note that the dictionary is set by the conductor and passed through RPC to the compute node, so the option values need to be verified in the n-cond.txt log.

That said, max_attempts is also 3 in n-cond on the primary node (not the subnode), so I still don't know why the retry info isn't being passed.
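
To make that concrete, a heavily simplified sketch (not the actual nova code) of the conductor side: the retry dict is built in nova-conductor and rides along in filter_properties over RPC, which is why the conductor's configuration is what matters.

    # Heavily simplified sketch of nova-conductor's build_instances() path
    # (illustration only): the 'retry' dict is created here, then handed to
    # the chosen compute node via an RPC cast.
    def conductor_build_instances(compute_rpcapi, instance, host,
                                  filter_properties, max_attempts=3):
        # populate_retry() equivalent; skipped when retries are disabled,
        # which later shows up as "Retry info not present" on the compute side.
        if max_attempts > 1:
            retry = filter_properties.setdefault(
                'retry', {'num_attempts': 0, 'hosts': []})
            retry['num_attempts'] += 1

        # filter_properties travels with the RPC cast to nova-compute, so
        # whether retries work is decided by the conductor's config.
        compute_rpcapi.build_and_run_instance(
            instance=instance, host=host,
            filter_properties=filter_properties)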

Vasyl Saienko (vsaienko) wrote:
Shunli Zhou (shunliz) wrote:

I think it's due to this bug: https://bugs.launchpad.net/nova/+bug/1671648. I have submitted a patch.

Could someone help review the code?
