OpenStack Compute (nova)

Evacuating after a network issue leaves the VM running on two hosts

Bug #1968555 reported by shews on 2022-04-11

This bug affects 1 person

Affects                   Status   Importance  Assigned to  Milestone
OpenStack Compute (nova)  Invalid  Undecided   Unassigned

Bug Description

Environment
===========
OpenStack Queens + libvirt 4.5.0 + QEMU 2.12 running on CentOS 7, with Ceph RBD storage

Description
===========
If the management network of a compute host fails, nova-compute may be reported as down even though openstack-nova-compute.service is still running on that host. If you then evacuate a VM from that host, the evacuation succeeds and the VM ends up running on both the old host and the new host. It stays that way even after the management network of the old host recovers, which may cause VM errors.

Steps to reproduce
==================
1. Manually bring down the management network port of the compute host, e.g. ifconfig eth0 down
2. Once openstack compute service list shows the nova-compute service of that host as down, evacuate one VM from that host:
nova evacuate <vm's uuid>
3. After the evacuation succeeds, the VM is running on two hosts.
4. Manually bring the management network port of the old compute host back up, e.g. ifconfig eth0 up. The VM is still running on this host, and it is not destroyed automatically unless you restart openstack-nova-compute.service on that host.

Expected result
===============
Maybe a periodic task could be added to destroy this stale VM automatically?
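
To make the proposal concrete, here is a minimal sketch of the check such a periodic task might perform. It is not nova code: find_stale_guests, its arguments, and the host and VM names are hypothetical, and the caller is assumed to already have the list of guests running locally and the host that the database records for each instance. A real task would also have to ignore instances that are legitimately present on two hosts, for example during a live migration.

```python
from typing import Dict, Iterable, List


def find_stale_guests(
    local_host: str,
    local_guest_uuids: Iterable[str],
    db_host_by_uuid: Dict[str, str],
) -> List[str]:
    """Return UUIDs of guests running here that the database places elsewhere.

    All names are hypothetical; this only illustrates the comparison.
    """
    stale = []
    for uuid in local_guest_uuids:
        owner = db_host_by_uuid.get(uuid)
        # Guests unknown to the database are skipped: only guests that the
        # database explicitly assigns to another host are reported as stale.
        if owner is not None and owner != local_host:
            stale.append(uuid)
    return stale


if __name__ == "__main__":
    running_here = ["vm-a", "vm-b"]
    db_view = {"vm-a": "compute-1", "vm-b": "compute-2"}
    # vm-b was evacuated to compute-2 but is still running on compute-1.
    print(find_stale_guests("compute-1", running_here, db_view))
```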

Sylvain Bauza (sylvain-bauza) wrote on 2022-04-19:

If you see a compute service flapping because of a network issue, you can force it down:
https://docs.openstack.org/api-ref/compute/?expanded=update-forced-down-detail#update-forced-down
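
As a rough illustration of that call, the sketch below only builds the HTTP request and does not send it. It assumes the form of the API used with microversions 2.11 to 2.52 (PUT /os-services/force-down with host, binary and forced_down in the body); newer microversions use PUT /os-services/{service_id} instead. The endpoint URL, token and host name are placeholders, and build_force_down_request is a hypothetical helper.

```python
import json
import urllib.request


def build_force_down_request(endpoint, token, host, forced_down=True):
    """Build (but do not send) a force-down request for a nova-compute service.

    endpoint, token and host are caller-supplied placeholders.
    """
    body = json.dumps(
        {"host": host, "binary": "nova-compute", "forced_down": forced_down}
    ).encode("utf-8")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/os-services/force-down",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,
            # forced_down was introduced in compute API microversion 2.11.
            "X-OpenStack-Nova-API-Version": "2.11",
        },
    )


if __name__ == "__main__":
    # Placeholder endpoint and token; nothing is sent over the network.
    req = build_force_down_request(
        "http://controller:8774/v2.1", "TOKEN", "compute-1"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```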

Once the compute service is down (either because it was forced down or because the service group API reports it as down), you can indeed evacuate the instance, and you would then have two instances: the original one on the old host and the other one on the new host.

That said, given that the original host is down, you should restart the compute service once it's back up, right? If so, we then check for evacuated instances and delete them:
https://github.com/openstack/nova/blob/a1f006d799d2294234d381395a9ae9c22a2d80b9/nova/compute/manager.py#L1531

Changed in nova:
status:	New → Invalid

shews (shews) on 2022-05-10

Changed in nova:
status:	Invalid → New

shews (shews) wrote on 2022-05-10:

Thank you, Sylvain Bauza.
"That said, given the original host is down, you should restart the compute service then once it's back up, right?" I don't think so.
If a network issue causes the compute service to go down, the service comes back up automatically once the network recovers, so I have no way of knowing that I should restart it. This leaves two instances running on two hosts forever.

Balazs Gibizer (balazs-gibizer) wrote on 2022-06-08:

The evacuate API reference [1] states:

Preconditions

The failed host must be fenced and no longer running the original server.

The failed host must be reported as down or marked as forced down using Update Forced Down.

So when you detect the control network failure, you have to make sure that the host is fenced before you evacuate the instance. This precondition exists precisely to prevent the VM from being duplicated by the evacuation.

The most common fencing method is power fencing, i.e. when the issue is detected, the problematic compute host is powered off via out-of-band management. Then the VMs can be safely evacuated.

[1] https://docs.openstack.org/api-ref/compute/?expanded=evacuate-server-evacuate-action-detail#evacuate-server-evacuate-action
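
The ordering described above can be sketched as follows. This is only an illustration of "fence first, then evacuate": safe_evacuate and the three callables passed to it are hypothetical stand-ins for whatever out-of-band power control and API client a deployment actually uses.

```python
from typing import Callable


def safe_evacuate(
    host: str,
    server_id: str,
    fence_host: Callable[[str], bool],
    force_down: Callable[[str], None],
    evacuate: Callable[[str], None],
) -> bool:
    """Evacuate server_id only after host has been fenced successfully.

    fence_host must return True only when the host is confirmed powered off.
    All callables are hypothetical and supplied by the caller.
    """
    if not fence_host(host):
        # The original guest may still be running, so evacuating now could
        # leave the VM running on two hosts.
        return False
    force_down(host)
    evacuate(server_id)
    return True


if __name__ == "__main__":
    calls = []
    ok = safe_evacuate(
        "compute-1",
        "vm-uuid",
        fence_host=lambda h: calls.append(("fence", h)) or True,
        force_down=lambda h: calls.append(("force_down", h)),
        evacuate=lambda s: calls.append(("evacuate", s)),
    )
    print(ok, calls)
```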

Changed in nova:
status:	New → Invalid
