Evacuated instances are not removed from the source
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
OpenStack Compute (nova) | Opinion | Wishlist | Unassigned | |
Bug Description
Instance "evacuation" is a great feature and we are trying to take advantage of it.
But it has some limitations, depending on how "broken" the node is.
Let me give some context...
In the scenario where the compute node loses connectivity (broken switch port, loose network cable, ...) or nova-compute is stuck (filesystem issue), evacuating instances can have unexpected consequences and lead to data corruption in the application (for example, in a DB application).
If a compute node (or an entire set of compute nodes) loses connectivity, nova-compute and the instances are "not available".
If the node runs critical applications (let's suppose a MySQL DB), the cloud operator could be tempted to "evacuate" the instance to recover the critical application for the user. At this point the cloud operator may not yet know what the compute node issue is, and it may not be possible to shut the node down (management network affected?, ...), or the operator may simply not want to interfere with the work of the repair team.
The repair team fixes the issue (it can take a few minutes or hours...) and nova-compute and the instances are available again.
The problem is that nova-compute doesn't destroy the evacuated instances on the source.
```
2021-10-19 11:17:51.519 3050 WARNING nova.compute.
```
At this point we have 2 instances sharing the same IP and possibly writing into the same volume.
Only when nova-compute is restarted are the evacuated instances on the affected node removed (I guess that was always the assumption: the compute node was really broken).
```
2021-10-19 15:39:49.257 21189 INFO nova.compute.
2021-10-19 15:39:52.949 21189 INFO nova.virt.
```
I would expect nova-compute to periodically check for evacuated instances and remove them.
Otherwise, this requires a lot of coordination between different support teams.
Should this be moved to a periodic task?
https:/
I'm running Stein, but looking into the code, we have the same behaviour in master.
Maybe your evacuation status is not in ['accepted', 'pre-migrating', 'done']; see bug https://bugs.launchpad.net/nova/+bug/1947812, which I reported.
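
To make the periodic-task idea more concrete, here is a minimal, self-contained sketch of what such a check could look like. It is not nova code: the `Migration` dataclass, `find_stale_evacuated`, `periodic_cleanup` and the callables passed to them are hypothetical stand-ins, and the status list is simply the one quoted above. The only point is the selection logic: find evacuations whose source is this host and whose instance is still present locally.

```python
from dataclasses import dataclass

# Assumption: the migration statuses quoted in this report.
EVACUATION_STATUSES = {"accepted", "pre-migrating", "done"}


@dataclass(frozen=True)
class Migration:
    """Hypothetical stand-in for a migration record; not nova's own object."""
    instance_uuid: str
    source_compute: str
    migration_type: str
    status: str


def find_stale_evacuated(host, local_instance_uuids, migrations):
    """Return UUIDs still present on `host` that were evacuated away from it."""
    evacuated = {
        m.instance_uuid
        for m in migrations
        if m.migration_type == "evacuation"
        and m.source_compute == host
        and m.status in EVACUATION_STATUSES
    }
    return sorted(evacuated & set(local_instance_uuids))


def periodic_cleanup(host, list_local, list_migrations, destroy):
    """One pass of the hypothetical periodic task; returns what it destroyed."""
    stale = find_stale_evacuated(host, list_local(), list_migrations())
    for uuid in stale:
        destroy(uuid)  # hypothetical callback that removes the local guest
    return stale
```

In this sketch, an instance whose evacuation has a status outside the list (for example a failed one) is left alone, which matches the concern raised in the comment above. A real implementation would also need the periodic task to run at all, which is exactly what a stuck nova-compute or a disconnected node cannot guarantee.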