Instance reschedule failure leaves orphaned neutron ports

Bug #1510979 reported by Boden R
This bug affects 1 person
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  High        Wenzhi Yu
Liberty                   Fix Released  Undecided   jichenjc

Bug Description

During the instance boot (spawn/run) process, neutron ports are allocated for the instance if necessary. If the instance fails to spawn (say, as a result of a compute host failure), the default behavior is to reschedule the instance and leave its networking resources intact for potential reuse on the rescheduled host (as per deallocate_networks_on_reschedule() [1], which returns False for most compute drivers).

All is good if the instance is successfully rescheduled, but if the reschedule fails (say no more applicable hosts) the allocated ports are left as-is and effectively orphaned.
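
The failure path described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: FakeNeutron, boot() and the deallocate_on_reschedule flag are hypothetical stand-ins invented for illustration, not nova or neutron code. The flag merely mimics a driver whose deallocate_networks_on_reschedule() returns False.

```python
class FakeNeutron:
    """Hypothetical stand-in for neutron: maps port id -> owning instance id."""

    def __init__(self):
        self.ports = {}
        self._counter = 0

    def allocate_port(self, instance_id):
        self._counter += 1
        port_id = "port-%d" % self._counter
        self.ports[port_id] = instance_id
        return port_id

    def deallocate_ports(self, instance_id):
        owned = [p for p, owner in self.ports.items() if owner == instance_id]
        for port_id in owned:
            del self.ports[port_id]


def boot(instance_id, hosts, neutron, deallocate_on_reschedule=False):
    """Try each candidate host in order and return the final instance state.

    `hosts` is a list of booleans: True means spawn succeeds on that host.
    """
    allocated = False
    for spawn_succeeds in hosts:
        if not allocated:
            neutron.allocate_port(instance_id)
            allocated = True
        if spawn_succeeds:
            return "ACTIVE"
        # Spawn failed, so the instance is rescheduled to the next host.
        if deallocate_on_reschedule:
            neutron.deallocate_ports(instance_id)
            allocated = False
        # Otherwise the ports are kept for reuse on the next host.
    # No hosts left: the reschedule failed and nothing cleans up the ports.
    return "ERROR"


neutron = FakeNeutron()
state = boot("inst-1", [False], neutron)  # single host, spawn fails
print(state, neutron.ports)
```

With a single failing host the model ends in ERROR while the port is still recorded against the instance, which is the orphaned-port situation this bug describes.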

There are some related defects ([2] and [3]), but they don't quite touch on the particular behavior described herein.

There are a number of ways to address this issue, but perhaps the most obvious is for nova to be aware of the reschedule failure and deallocate any resources that were left intact for the reschedule.

I'm running a devstack all-in-one setup from the OpenStack master branches.

nova --version
2.32.0
neutron --version
3.1.0

The easiest way to reproduce this is to use an all-in-one devstack (only 1 compute host) and simulate a host spawn failure by editing the spawn() method of your compute driver to raise an exception at the end of the method, then simply try to boot a server. In this setup there is only 1 host, so the reschedule will fail, and you can verify that the port allocated for the instance still exists after the boot attempt.
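
As a rough illustration of that reproduction recipe, the sketch below injects a failure at the end of a spawn() method and then checks for leftover ports. Everything in it is an assumption made for illustration: StandInDriver is not nova's ComputeDriver, fail_at_end and ports_for_instance are hypothetical helpers, and the port dictionaries are sample data shaped like neutron ports (which record the owning instance in device_id).

```python
import functools


class InjectedSpawnFailure(Exception):
    """Raised on purpose to simulate a host-side spawn failure."""


def fail_at_end(method):
    """Run the wrapped method to completion, then raise."""
    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        method(*args, **kwargs)
        raise InjectedSpawnFailure(
            "injected failure at end of %s" % method.__name__)
    return wrapper


class StandInDriver:
    """Hypothetical stand-in for a compute driver (not the real base class)."""

    def __init__(self):
        self.spawned = []

    @fail_at_end
    def spawn(self, instance_id):
        # All of the normal spawn work happens before the injected failure.
        self.spawned.append(instance_id)


def ports_for_instance(ports, instance_id):
    """Return the ports whose device_id matches the given instance id."""
    return [p for p in ports if p.get("device_id") == instance_id]


# Sample data only; in a real setup this list would come from neutron.
sample_ports = [
    {"id": "port-a", "device_id": "inst-1"},
    {"id": "port-b", "device_id": "inst-2"},
]
leftover = ports_for_instance(sample_ports, "inst-1")
```

If the boot attempt ended in an error state and the list of leftover ports for that instance is non-empty, the ports were orphaned.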

[1] https://github.com/openstack/nova/blob/master/nova/virt/driver.py#L1273
[2] https://bugs.launchpad.net/nova/+bug/1410739
[3] https://bugs.launchpad.net/nova/+bug/1327124

Wenzhi Yu (yuywz)
Changed in nova:
assignee: nobody → Wen Zhi Yu (yuywz)
Wenzhi Yu (yuywz) wrote :

For the scenario in the description, if the reschedule fails, a "NoValidHost_Remote(u'No valid host was found. There are not enough hosts available.',)" exception is caught in the nova conductor manager.
At this point, the conductor.build_instances method sets the instance state to vm_states.ERROR, sends the related notification, and returns without cleaning up the allocated network resources; see [1].
I think one way to fix this bug is to add code to the exception handling section that cleans up the allocated network resources (like we do in the compute manager; see [2]).

[1] https://github.com/openstack/nova/blob/12.0.0.0rc3/nova/conductor/manager.py#L740-L746
[2] https://github.com/openstack/nova/blob/12.0.0.0rc3/nova/compute/manager.py#L1968
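
The shape of the fix suggested in this comment can be sketched as follows. This is a simplified model, not the code from the linked review: NoValidHost, FakeNetworkAPI and build_instance are stand-ins defined here, and the method signatures are deliberately reduced for illustration.

```python
class NoValidHost(Exception):
    """Stand-in for the scheduler's 'no valid host' error."""


class FakeNetworkAPI:
    """Hypothetical network API: maps instance uuid -> list of port ids."""

    def __init__(self):
        self.ports = {}

    def allocate_for_instance(self, uuid):
        self.ports.setdefault(uuid, []).append("port-for-%s" % uuid)

    def deallocate_for_instance(self, uuid):
        self.ports.pop(uuid, None)


def build_instance(instance, select_host, network_api):
    """Pick a host for the instance; clean up networking if none is found."""
    try:
        host = select_host(instance)
    except NoValidHost:
        # Reschedule failed: mark the instance as errored AND release the
        # ports that were kept around for the reschedule attempt.
        instance["vm_state"] = "error"
        network_api.deallocate_for_instance(instance["uuid"])
        return None
    instance["host"] = host
    return host


def no_hosts(instance):
    raise NoValidHost("No valid host was found.")


api = FakeNetworkAPI()
inst = {"uuid": "inst-1", "vm_state": "building"}
api.allocate_for_instance(inst["uuid"])  # ports left over from first attempt
build_instance(inst, no_hosts, api)
print(inst["vm_state"], api.ports)
```

The design point is that the cleanup sits in the same exception handler that sets the error state, so a failed reschedule cannot leave ports behind.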

Changed in nova:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/243477

Changed in nova:
assignee: Wen Zhi Yu (yuywz) → Boden R (boden)
Changed in nova:
assignee: Boden R (boden) → Wen Zhi Yu (yuywz)
Gary Kotton (garyk)
Changed in nova:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/243477
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=08d24b733ee9f4da44bfbb8d6d3914924a41ccdc
Submitter: Jenkins
Branch: master

commit 08d24b733ee9f4da44bfbb8d6d3914924a41ccdc
Author: Wen Zhi Yu <email address hidden>
Date: Tue Nov 10 17:16:36 2015 +0800

    Clean up network resources when reschedule fails

    During the instance boot (spawn/run) process, neutron ports are
    allocated for the instance if necessary. If the instance fails
    to spawn (say as a result of a compute host failure), the default
    behaviour is to reschedule the instance and leave its networking
    resources in-tact for potential reuse on the rescheduled host.
    All is good if the instance is successfully rescheduled, but if
    the reschedule fails (say no more applicable hosts) the allocated
    ports are left as-is and effectively orphaned.

    This commit add code to clean up allocated network resources
    when the reschedule fails.

    Change-Id: Ic670dd4dc192603c2faecf18e14ef59ebca9e420
    Closes-Bug: #1510979

Changed in nova:
status: In Progress → Fix Released
Thierry Carrez (ttx) wrote : Fix included in openstack/nova 13.0.0.0b2

This issue was fixed in the openstack/nova 13.0.0.0b2 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/367316
