No retry for removing instance in case of ironic service down

Bug #1685590 reported by Kaifeng Wang
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: none listed

Bug Description

When the ironic service is briefly down (e.g. the ironic conductor is down), removing an instance immediately puts the instance into the error state, without any status polling.

Investigation points to this code segment: https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L977-L984

When the conductor is down, the exception is raised right away, so the ironic driver never applies the CONF.ironic.api_max_retries and CONF.ironic.api_retry_interval settings.
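The following is a minimal sketch of the pattern described above, not the actual nova code. It assumes a destroy path made of one unguarded request followed by a polling loop. All names here (`ServiceUnavailable`, `destroy`, `send_delete_request`, `get_provision_state`) are hypothetical, and the values 60 and 2 are assumptions chosen to match the 2-minute window mentioned in this report.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error standing in for a failed ironic API request."""


def destroy(send_delete_request, get_provision_state,
            max_retries=60, retry_interval=2, sleep=time.sleep):
    """Illustrative only: one unguarded request, then a polling loop."""
    # Step 1: a single request. Any exception propagates to the caller
    # right here, so the polling loop below is never reached.
    send_delete_request()

    # Step 2: polling, the only part governed by the retry settings.
    for _ in range(max_retries):
        if get_provision_state() == "available":
            return True
        sleep(retry_interval)
    return False


def failing_request():
    # Simulates the conductor being down when the request is sent.
    raise ServiceUnavailable("conductor is down")


polls = []


def recording_poll():
    # Records each poll so we can see whether polling ever started.
    polls.append(1)
    return "available"


try:
    destroy(failing_request, recording_poll)
    failed_immediately = False
except ServiceUnavailable:
    failed_immediately = True
```

In this sketch the failure surfaces before any poll is made, which is consistent with the instance going to the error state without waiting for the retry window.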

Steps to reproduce:
1. Boot a baremetal instance with nova boot.
2. Reboot the ironic conductor node (or stop the conductor service).
3. Remove the instance while it is spawning.
4. The instance goes into the error state immediately, rather than after 2 minutes (the default value).

For comparison, comment out L983-984 and reproduce again. With those lines commented out, if the ironic conductor comes back up before nova marks the instance as being in the error state, a second nova delete also deletes the ironic instance info. Otherwise, the instance on the ironic node is not removed when the instance is removed from nova.

This still needs investigation.

Tags: ironic
Kaifeng Wang (kaifeng)
Changed in nova:
assignee: nobody → Wang KaiFeng (kaifeng)
Kaifeng Wang (kaifeng)
description: updated
Sean Dague (sdague)
Changed in nova:
assignee: Wang KaiFeng (kaifeng) → nobody
Sean Dague (sdague)
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Kaifeng Wang (kaifeng) wrote :

Thanks for confirming this bug. Since I do not have deep insight into nova, the following statements may not be accurate, but this is what I found, and I am putting it here for reference.

I think the case is this: when the virt driver raises an exception during instance destroy, nova marks the instance as being in the error state. When the user then deletes this instance, nova never calls the virt driver, so ironic has no chance to get cleaned up.

Polling the provision state does not cause a major problem; it is a consequence of the first issue. If the driver can't successfully send the request to the ironic API, waiting for 2 minutes is meaningless.

Possibly there are two ways to address this bug:

1. nova does not simply remove an instance that is in the error state. When the user deletes the instance, the virt driver still gets a chance to be called, so the provisioning request can be sent to the ironic API again. nova never deletes an instance without a success acknowledgement from the virt driver.
2. Add a retry mechanism to the provisioning request in the ironic driver.

I don't know whether method 1 is reasonable, but it seems logical to me based on my current knowledge.
Method 2 is definitely a workaround, but it is easy to adopt and works when the service is unavailable only for a short time. This is what I do downstream.
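
As a rough illustration of method 2, the sketch below wraps a single request in a bounded retry loop. It is not the downstream patch and not nova's API: `ServiceUnavailable` and `call_with_retries` are hypothetical names, and the defaults of 60 attempts and a 2-second interval are assumptions chosen to match the 2-minute window mentioned above.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error standing in for a failed ironic API request."""


def call_with_retries(request, max_retries=60, retry_interval=2,
                      sleep=time.sleep):
    """Call request() until it succeeds or the attempts are exhausted."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return request()
        except ServiceUnavailable:
            if attempt == max_retries:
                # Attempts exhausted: let the caller see the failure.
                raise
            # Give the service time to come back before trying again.
            sleep(retry_interval)
```

Only the hypothetical `ServiceUnavailable` error is retried here; any other exception propagates at once. As noted above, this only helps when the outage is shorter than the total retry window.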
