Port detach fails when compute host is unreachable

Bug #1827746 reported by Adam Harwell
This bug affects 2 people
Affects                   Status     Importance  Assigned to  Milestone
OpenStack Compute (nova)  Confirmed  Medium      Unassigned
octavia                   Invalid    Undecided   Unassigned

Bug Description

When a compute host is unreachable, a port detach for a VM on that host will not complete until the host is reachable again. In some cases, this may persist for an extended period or even indefinitely (for example, when a host is powered down for hardware maintenance and possibly needs to be removed from the fleet entirely). This is problematic for multiple reasons:

1) The port should not be deleted in this state (it can be, but for reasons outside the scope of this bug, that is not recommended). Thus, the quota cannot be reclaimed by the project.
2) The port cannot be reassigned to another VM. This means that for projects that rely heavily on maintaining a published IP (or possibly even a published port ID), there is no way to proceed. For example, if Octavia wanted to allow failing over from one VM to another in a VM down event (as would happen if the host was powered off) without using AAP, it would be unable to do so, leading to an extended downtime.

Nova will supposedly clean up such resources after the host has been powered up, but that could take hours or possibly never happen. So, there should be a way to force the port to detach regardless of ability to reach the compute host, and simply allow the cleanup to happen on that host in the future (if possible) but immediately release the port for delete or rebinding.
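As an interim operator-side workaround, an admin could clear the port's binding directly in Neutron so the port can be deleted or rebound while the compute host is down. This is only a sketch, not an officially supported path: the `force_unbind` helper and `FakeNetworkClient` below are hypothetical, and with a real cloud you would pass openstacksdk's `openstack.connect().network` proxy instead of the fake. It also deliberately leaves the vif cleanup on the host for later, which is exactly the gap this bug asks nova to handle.

```python
# Sketch: clear a port's binding in Neutron while nova-compute is down.
# force_unbind and FakeNetworkClient are illustrative stand-ins; with
# openstacksdk you would pass openstack.connect().network as `network`.

def force_unbind(network, port_id):
    """Clear binding/device info so the port can be deleted or rebound.

    NOTE: nova-compute never saw the detach, so the vif must still be
    cleaned up on the host later (the core concern of this bug).
    """
    return network.update_port(
        port_id,
        binding_host_id="",   # maps to neutron's binding:host_id
        device_id="",
        device_owner="",
    )


class FakeNetworkClient:
    """Stand-in for openstacksdk's Connection.network proxy."""

    def __init__(self):
        self.updates = {}

    def update_port(self, port_id, **attrs):
        self.updates[port_id] = dict(attrs)
        return dict(attrs)


if __name__ == "__main__":
    net = FakeNetworkClient()
    result = force_unbind(net, "11111111-2222-3333-4444-555555555555")
    print(repr(result["binding_host_id"]))  # -> '' (binding cleared)
```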

Revision history for this message
Adam Harwell (adam-harwell) wrote :

If nova would allow an admin to `force` an unbind, but still queue all the standard cleanup in nova, would that solve this? Is that unreasonable?

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

The detach-interface is an RPC cast (async) to the nova-compute. When the nova-compute is down, the RPC message waits in the queue to be processed. So when the compute comes up, it will get the message and detach the interface successfully (I've tested this in devstack). In theory, nova-api could detect that the compute is down and unbind the port in neutron (and maybe also de-allocate the resources in placement). Then nova-api would still cast to the compute to make sure that, when the compute comes up, it can detach the vif from the server.
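The control flow proposed here could be sketched roughly as follows. Every name in this snippet (the helper and the fake clients) is a hypothetical stand-in, not nova's real internals; it only illustrates "unbind locally when the compute is down, but always still cast to the compute".

```python
# Sketch of a "local detach" in nova-api, per the proposal above.
# All names here are hypothetical stand-ins, not nova's real internals.

class FakeClient:
    """Records calls; stands in for the neutron/placement/RPC clients."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        return lambda *args: self.calls.append((name, args))


def detach_interface(instance, port_id, compute_is_up, neutron, placement, rpc):
    if not compute_is_up:
        # Host is down: unbind in neutron and free placement allocations
        # now, so the port/quota are released immediately.
        neutron.unbind_port(port_id)
        placement.deallocate(instance, port_id)
    # Always cast to the compute so the vif is cleaned up from the guest
    # whenever the host returns (the cast waits in the RPC queue).
    rpc.cast_detach(instance, port_id)


if __name__ == "__main__":
    neutron, placement, rpc = FakeClient(), FakeClient(), FakeClient()
    detach_interface("vm-1", "port-1", False, neutron, placement, rpc)
    print(neutron.calls)  # port unbound locally because the host is down
    print(rpc.calls)      # cast still queued for when the host comes back
```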

Changed in nova:
status: New → Confirmed
tags: added: network
Revision history for this message
melanie witt (melwitt) wrote :

I agree with gibi, it sounds like we could do something similar to "local delete" (when nova-compute is down and delete is requested, we deallocate ports, volumes, and placement resources) in nova-api for the detach-interface API. There might be some resistance to the idea as "local delete" has been a popular source of bugs in the past. But as always, it's a tradeoff, and if detach-interface while nova-compute is down is a greater pain point, it might be worth adding a "local detach" ability.

I think the only concern here will be the case where nova-compute is down and comes back up, but somehow the RPC message is lost and it never detaches the vif but the port is free to be bound to a new server. Would that be a problem or would it work OK because the port was unbound (and not cause two servers to potentially respond to the same IP, if the port gets bound to a new server)?

Revision history for this message
Adam Harwell (adam-harwell) wrote :

Yeah, I think it'd work fine, since neutron would correctly track where the port should be bound.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

It depends on how the compute node went down and whether the queue still exists, i.e. whether the message would get cleaned up eventually. Until https://bugs.launchpad.net/oslo.messaging/+bug/1661510 is fixed, I don't think we can assume it will definitely be received and acted on when the compute node comes back up.

On master at least, if we detach the port, the network info cache should get updated with the correct value due to the recent force-heal change. That should mean that if the VM was stopped and then started, it would be started without the port.

To add support for local delete, we would need a periodic or startup task that force-refreshes the info cache from neutron, checks every VM, and determines whether the XML should be modified to match the interface list built from the neutron info cache. It can be done, but I don't think we can rely on the RPC to do it.
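The reconciliation step suggested above can be sketched as a pure diff: compare the ports actually attached to the guest against the info cache freshly built from neutron, and detach whatever neutron no longer knows about. The function name and inputs below are hypothetical; real nova would derive them from the libvirt XML and the refreshed network info cache.

```python
# Sketch: on nova-compute startup (or periodically), reconcile the vifs
# attached to a guest against the network info cache rebuilt from neutron.
# Names and inputs are hypothetical stand-ins for nova's real data.

def vifs_to_detach(attached_port_ids, cached_port_ids):
    """Return ports still attached to the guest but absent from the
    neutron-built info cache (e.g. detached while the host was down and
    the RPC message was lost), in a stable order."""
    return sorted(set(attached_port_ids) - set(cached_port_ids))


if __name__ == "__main__":
    # Guest still has two vifs, but neutron only knows about one.
    print(vifs_to_detach(["port-a", "port-b"], ["port-a"]))  # -> ['port-b']
```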

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Given that a workaround for the quota issue in this case would be to delete the server and/or load balancer and recreate it, I would normally set this to Low. Given the implication that that would be problematic in an Octavia context, I'll mark it as Medium, but it would be good if you could add context as to why that is not recommended for Octavia-related instances. I would have assumed you could just delete the port, remove it from the load balancer, and then re-add the port somewhere else.

Changed in nova:
importance: Undecided → Medium
Revision history for this message
Michael Johnson (johnsom) wrote :

Sean, I'm not sure I follow your last comments.

The issue we have is that when someone powers down a host with instances on it, the port detach doesn't complete until the host is brought back up.
This "locks" not only quota but also the fixed IP address on the failed instance. It blocks us from creating a new port on a new instance using that quota and/or fixed IP address.
Delete and detach both get "stuck" on the instance until the host comes back, which it may never do.
This is why this is a high-severity issue for users like Octavia.

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote : auto-abandon-script

Abandoned after re-enabling the Octavia launchpad.

Changed in octavia:
status: New → Invalid
tags: added: auto-abandon