task_state stuck in "powering_on" when starting a server and the nova-compute host service is down
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned
Bug Description
Description
===========
After stopping the nova-compute service on a node, powering on a VM that was in the shutoff state but still assigned to that node sends the start request to the host where nova-compute is no longer running, leaving the VM's task_state stuck at "powering-on" forever.
Steps to reproduce
==================
Example:
Disable the nova-compute service on host compute001
VM001 was previously running on this host, but is now shut down
Power on VM001
nova-scheduler schedules VM001 on compute001
Expected result
===============
The scheduler filter ComputeFilter should see that compute001 is not "operational and enabled", as described here:
https:/
and should *not* schedule this VM on host compute001.
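The expected behavior can be sketched with a minimal host filter. This is a simplified, hypothetical stand-in for nova's ComputeFilter (the names HostState and compute_filter are illustrative, not nova's actual code): a host passes only if its compute service is both enabled and "up".

```python
from dataclasses import dataclass

@dataclass
class HostState:
    name: str
    disabled: bool   # operator disabled the compute service via the API
    alive: bool      # service heartbeat seen recently, i.e. status is "up"

def compute_filter(host: HostState) -> bool:
    """Pass only hosts whose compute service is enabled AND up."""
    return not host.disabled and host.alive

hosts = [
    HostState("compute001", disabled=False, alive=False),  # nova-compute stopped
    HostState("compute002", disabled=False, alive=True),
]
candidates = [h.name for h in hosts if compute_filter(h)]
print(candidates)  # compute001 is filtered out
```

With such a check in the path of the start operation, compute001 would never be handed the request while its service is down.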
Actual result
=============
compute001 is chosen to power on VM001, leaving it in a bad task state of "powering-on" forever.
Environment
===========
stable/rocky using Kolla-Ansible 7.0.0.0rc3devXX and Kolla 7.0.0.0rc3devXX.
CentOS 7.5 with latest updates
Kernel: Linux 4.18.14-
Hypervisor: KVM
Storage: Ceph
Networking: DVR
summary:
- Nova scheduler schedules VMs on nodes where nova-compute is down
+ task_state stuck in "powering_on" when starting a server and the nova-compute host service is down
When you say "Disable the nova-compute service on host compute001", do you mean using the disable API for the compute service, or simply stopping the nova-compute process on the host? If the latter, are you waiting for its status to be 'down' in the nova service-list output?
And when you say "Power on VM001", what specifically are you doing? Issuing "nova start VM001"? Because starting a stopped VM does not go through the scheduler at all.
I think the issue is that the API changes the task state on the server and then RPC casts to the compute service, which is down, and then the task_state is stuck. Since it's an RPC cast, we don't know if the compute is down or not when we make the cast, and the API isn't checking that every operation performed on a VM has the underlying compute service running (except delete, and any move operations).
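The stuck state described above can be illustrated with a toy fire-and-forget cast. All names here (Server, api_start, compute_service) are hypothetical; this is a sketch of the cast semantics, not nova's actual RPC layer:

```python
import queue

class Server:
    def __init__(self):
        self.task_state = None
        self.vm_state = "stopped"

def api_start(server: Server, compute_queue: queue.Queue) -> None:
    # API layer: record the in-progress task, then cast (fire-and-forget).
    server.task_state = "powering-on"
    compute_queue.put(("start", server))  # no reply expected, no error if unconsumed

def compute_service(compute_queue: queue.Queue, running: bool) -> None:
    if not running:
        return  # service is down: the cast message is simply never consumed
    method, server = compute_queue.get_nowait()
    if method == "start":
        server.vm_state = "active"
        server.task_state = None  # only the compute side ever clears it

q = queue.Queue()
vm = Server()
api_start(vm, q)
compute_service(q, running=False)  # nova-compute is down
print(vm.task_state)  # stays "powering-on" forever
```

Because the caller never waits for a reply, nothing on the API side notices that the message was never processed, and nothing ever resets task_state.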
You can reset the state via the nova reset-state CLI:
https://docs.openstack.org/python-novaclient/latest/cli/nova.html#nova-reset-state
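The effect of reset-state is essentially to overwrite the stored states: it clears task_state and forces vm_state to "error" (or "active" with the --active flag). A minimal sketch of that effect, using a hypothetical helper rather than the real CLI:

```python
def reset_state(server: dict, active: bool = False) -> dict:
    """Mimic 'nova reset-state': clear task_state and force vm_state.

    By default the real command sets the server to ERROR; with --active
    it sets it to ACTIVE instead.
    """
    server["task_state"] = None
    server["vm_state"] = "active" if active else "error"
    return server

vm = {"name": "VM001", "vm_state": "stopped", "task_state": "powering-on"}
reset_state(vm, active=True)
print(vm)  # task_state cleared, vm_state forced to "active"
```

After resetting, the operator can retry the start once nova-compute is back up.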