RPC failure makes server stuck in a task
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) | Confirmed | Wishlist | Unassigned | | |
Bug Description
When RPC between services fails (because of a bug in the RPC service, a network fault, etc.), the server gets stuck in a task and never exits it.
For example, when powering up a server, nova-api sets the task_state of the server to 'powering-on' and then sends an RPC message to nova-compute. nova-compute is responsible for setting the task_state back to NULL after the server has been powered up, but if the RPC message fails to reach nova-compute, the server is stuck in 'powering-on' forever. This can prevent further API operations on the VM and leave an otherwise healthy VM in a non-operable state. The user can usually work around this manually by running a 'reset-state'.
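To make the failure mode concrete, here is a minimal toy model of the flow described above. It is only a sketch: `LossyCastTransport`, `api_start_server`, `compute_power_on` and the dict-based server record are hypothetical names invented for illustration, and none of this is nova's actual code.

```python
# Toy model of the flow described above. All names are hypothetical;
# this is not nova's code or its RPC layer.

class LossyCastTransport:
    """Fire-and-forget transport that can silently drop messages."""

    def __init__(self, deliver):
        self.deliver = deliver      # when False, every cast is lost
        self.handlers = {}

    def register(self, method, handler):
        self.handlers[method] = handler

    def cast(self, method, **kwargs):
        # A cast returns immediately; the sender never learns about a loss.
        if self.deliver:
            self.handlers[method](**kwargs)


def compute_power_on(server):
    # Compute side: do the work, then clear the task.
    server["power_state"] = "running"
    server["task_state"] = None


def api_start_server(server, transport):
    # API side: mark the task first, then hand off to compute.
    server["task_state"] = "powering-on"
    transport.cast("power_on", server=server)


def run(deliver):
    server = {"power_state": "shutdown", "task_state": None}
    transport = LossyCastTransport(deliver)
    transport.register("power_on", compute_power_on)
    api_start_server(server, transport)
    return server


healthy = run(deliver=True)   # message arrives, task is cleared
stuck = run(deliver=False)    # message is lost, task is never cleared
print(healthy)
print(stuck)
```

In the second run nothing ever clears `task_state`, because the only code that does so lives on the side that never received the message.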
To reproduce:
1. Make RPC messages hang
2. Issue a 'server start' request
3. After request, do a 'server show' - the server is stuck in 'powering-on' forever.
This issue was previously reported to the mailing list. Please see this:
http://
and this
http://
tags: |
added: api removed: compute |
Changed in nova: | |
status: | New → Confirmed |
tags: | removed: api |
Changed in nova: | |
assignee: | nobody → Prateek Arora (parora) |
Changed in nova: | |
assignee: | Prateek Arora (parora) → nobody |
Please see a previous bug on a similar issue:
https://bugs.launchpad.net/nova/+bug/1276214
There, the RPC message for a migrate is issued using a 'call' rather than a 'cast', so the caller waits for nova-conductor to acknowledge the message. Catching the RPC timeout exception was therefore enough.
We should either consider always using the RPC 'call' function or, if we want to free the REST request as fast as possible, move the migrate RPC request to 'cast' as well.
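A rough sketch of the 'call' approach, to show why catching the timeout is enough in that case. The names here (`RpcTimeout`, `ToyTransport`, `api_start_server`) are hypothetical stand-ins, not nova or oslo.messaging APIs, and the rollback policy is an assumption made for illustration.

```python
# Toy model of a blocking 'call' with rollback on timeout.
# All names are hypothetical; this is not nova's code.

class RpcTimeout(Exception):
    """Stand-in for whatever timeout error the RPC layer raises."""


class ToyTransport:
    def __init__(self, deliver):
        self.deliver = deliver

    def call(self, handler, **kwargs):
        # A call blocks for a reply; a lost message surfaces as a timeout.
        if not self.deliver:
            raise RpcTimeout("no reply from compute")
        return handler(**kwargs)


def compute_power_on(server):
    server["power_state"] = "running"
    server["task_state"] = None
    return "ok"


def api_start_server(server, transport):
    previous_task = server["task_state"]
    server["task_state"] = "powering-on"
    try:
        transport.call(compute_power_on, server=server)
    except RpcTimeout:
        # Roll back so the server is not left in 'powering-on'.
        server["task_state"] = previous_task
        raise


server = {"power_state": "shutdown", "task_state": None}
try:
    api_start_server(server, ToyTransport(deliver=False))
except RpcTimeout:
    timed_out = True
else:
    timed_out = False
print(server, timed_out)
```

The cost is the one mentioned above: the REST request stays blocked until the reply or the timeout. Also note that in a real deployment a timeout does not prove the message was never delivered, so a rollback like this could race with a late-arriving message.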