OpenStack Compute (nova)

No retry for removing instance in case of ironic service down

Bug #1685590 reported by Kaifeng Wang on 2017-04-23

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Confirmed	Medium	Unassigned

Bug Description

When the ironic service is briefly down (e.g. the ironic conductor is down), deleting an instance immediately puts the instance into the error state, without any status polling.

Investigation points to this code segment: https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L977-L984

When the conductor is down, the exception is raised, so the ironic driver in nova does not apply the CONF.ironic.api_max_retries and CONF.ironic.api_retry_interval settings.
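
The failure mode described above can be modelled roughly as follows. This is only an illustrative sketch, not the actual nova code: `unprovision`, `send_request`, `poll_state`, the `"done"` state and the `ConductorDown` exception are hypothetical names. It shows that when the initial request raises, the polling loop that would use the retry settings is never reached.

```python
import time


class ConductorDown(Exception):
    """Hypothetical stand-in for the error raised while ironic is unreachable."""


def unprovision(send_request, poll_state, max_retries, retry_interval,
                sleep=time.sleep):
    """Illustrative model only, not the actual nova driver code."""
    # Step 1: a single, unretried request. If it raises, the exception
    # propagates to the caller and nothing below this line runs.
    send_request()

    # Step 2: polling. The retry settings are only consulted here.
    for _ in range(max_retries):
        if poll_state() == "done":
            return True
        sleep(retry_interval)
    return False
```

In this model, a `send_request` that raises `ConductorDown` makes `unprovision` fail at once, regardless of the values passed for `max_retries` and `retry_interval`.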

Reproduce:
1. Boot a baremetal instance with nova boot.
2. Reboot the ironic conductor node (or stop the conductor service).
3. Delete the instance during spawn.
4. The instance goes into the error state immediately, not after 2 minutes (the default value).

As a comparison, simply comment out L983-984 and reproduce. It seems that with L983-984 commented out, if the ironic conductor comes back up before nova marks the instance as being in the error state, then a second nova delete also deletes the ironic instance info. If it does not come back in time, the instance on the ironic node is not removed when the instance is removed from nova.

Still needs investigation.



Kaifeng Wang (kaifeng) on 2017-04-23

Changed in nova:
assignee:	nobody → Wang KaiFeng (kaifeng)

Kaifeng Wang (kaifeng) on 2017-04-23

description:	updated

Sean Dague (sdague) on 2017-06-23

Changed in nova:
assignee:	Wang KaiFeng (kaifeng) → nobody

Sean Dague (sdague) on 2017-07-27

Changed in nova:
status:	New → Confirmed
importance:	Undecided → Medium


Kaifeng Wang (kaifeng) wrote on 2017-07-28:

Thanks for confirming this bug. Since I do not have deep insight into nova, the following statements may not be accurate or true, but this is what I found, and I am putting it here for reference.

I think the case is that when the virt driver raises an exception during instance destroy, nova marks the instance as being in the error state, and when the user then deletes the instance, nova never calls the virt driver, so ironic has no chance to get cleaned up.

The provision state polling does not cause a major problem; it is the outcome of the first issue. If the driver can't successfully send the request to the ironic API, waiting for 2 minutes is meaningless.

Possibly there are two ways to address this bug:

1. nova should not simply remove an instance in the error state. When the user deletes the instance, the virt driver should get a chance to be called, so that the provisioning request can be sent to the ironic API again. nova should never delete an instance without a success acknowledgement from the virt driver.
2. Add a retry mechanism to the provisioning request in the ironic driver.

I don't know if method 1 is reasonable, but it seems logical to me based on my current knowledge.
Method 2 is definitely a workaround, but it is easy to adopt and works when the service is unavailable only for a short time; this is what I do downstream.
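
A minimal sketch of what method 2 could look like, assuming the provisioning request is wrapped in a small retry helper. The helper `call_with_retry` and the exception `ServiceUnavailable` are hypothetical names used for illustration, not part of nova or python-ironicclient, and this is not the downstream patch mentioned above.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for a transient ironic API failure."""


def call_with_retry(request, max_retries, retry_interval, sleep=time.sleep):
    """Call request() until it succeeds or max_retries attempts have failed."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return request()
        except ServiceUnavailable as exc:
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < max_retries:
                sleep(retry_interval)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

With a helper like this, the driver could pass in values taken from CONF.ironic.api_max_retries and CONF.ironic.api_retry_interval, so that a short outage is ridden out instead of putting the instance into the error state at once. Only the transient error is retried; any other exception still propagates immediately.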
