OpenStack Compute (nova)

Bug #1947753
Comment #3

Sylvain Bauza (sylvain-bauza) wrote on 2021-10-27:

OK, let me get it right.

You say that when you want to evacuate an instance, you don't really know whether the original compute service is running correctly, right?
That's basically why Nova checks whether the host is down (i.e. has somehow 'failed') before allowing an evacuation.
You're right that sometimes Nova thinks the compute service isn't faulty, and then you can't evacuate. Other times, Nova thinks the compute service *is* faulty, and then you can evacuate.

If you evacuate in that second case, then you could indeed have problems *if* the host is actually still running.
That's why we generally recommend that operators "fence" the original host that Nova detected as faulty before evacuating.
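
To make that ordering concrete, here is a minimal sketch of the check I have in mind. It is not Nova code: `ComputeServiceView` and `safe_to_evacuate` are hypothetical names, and the `fenced` flag stands for whatever out-of-band confirmation you have (power-off, network isolation) that the host really is gone.

```python
from dataclasses import dataclass


@dataclass
class ComputeServiceView:
    """What the operator knows about the source compute host (hypothetical model)."""

    host: str
    reported_down: bool  # Nova currently reports the compute service as down
    fenced: bool         # operator has confirmed the host is powered off or isolated


def safe_to_evacuate(view: ComputeServiceView) -> bool:
    """Allow evacuation only when Nova sees the service down AND the host is fenced."""
    return view.reported_down and view.fenced


if __name__ == "__main__":
    # Service looks down to Nova, but nobody has fenced the host yet: do not evacuate.
    print(safe_to_evacuate(ComputeServiceView("compute-1", True, False)))
```

The point of requiring both conditions is that "Nova thinks it's down" alone does not rule out a host that is still running its guests.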

Either way, if the service continues to run, it checks the evacuation status periodically and deletes the evacuated instances from the host. So maybe you're hitting a race: you evacuate while the compute fault is only transient, and then you see a problem.

If so, as I said, I'd recommend that you 'fence' the host before evacuating instances... or wait a little before evacuating them if the issue is transient.
Maybe this is related to the healthchecks we want to work on: if you had a more accurate status for a faulty compute service, you wouldn't issue evacuations unless you were sure it had gone down.
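
For the "wait a little bit" option, a rough sketch of what that could look like on the operator side: only treat the fault as real if the service stays down for a whole grace period. Everything here is an assumption for illustration. `down_for_grace_period` is a hypothetical helper, and `is_down` is whatever callable you use to ask Nova for the service state; the timings are yours to pick.

```python
import time
from typing import Callable


def down_for_grace_period(
    is_down: Callable[[], bool],
    grace_seconds: float,
    poll_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True only if is_down() stays True for the whole grace period."""
    deadline = clock() + grace_seconds
    while True:
        if not is_down():
            # The service came back: the fault was transient, so do not evacuate.
            return False
        if clock() >= deadline:
            # Down on every poll until the deadline: treat the fault as persistent.
            return True
        sleep(poll_seconds)
```

This does not replace fencing. It only reduces the chance of evacuating during a short blip.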

I'm setting the bug report to Opinion, but I'm more than happy to discuss this with you, Belmiro, on #openstack-nova if you wish.