RPC failure makes server stuck in a task

Bug #1571175 reported by Shoham Peller
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

When RPC between services fails (because of a bug in the RPC service, a network fault, etc.), the server gets stuck in a task state and never leaves it.

For example, when powering up a server, nova-api sets the task_state of the server to 'powering-on' and then sends an RPC message to nova-compute. nova-compute is responsible for setting the task_state back to NULL after the server has powered up, but if the RPC message fails to reach nova-compute, the server is stuck in 'powering-on' forever. This can block further API operations on the VM and leave a healthy VM in a non-operable state. The user can usually work around this manually by running a 'reset-state'.
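The flow described above can be sketched as a toy model. This is only an illustration: `Instance`, `api_start`, `compute_power_on`, and `lost_cast` are hypothetical names invented for this sketch and are not nova's real API. It assumes that a set task_state blocks new operations, as the description implies.

```python
# Toy model of the reported flow; none of these names are nova's real API.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Instance:
    vm_state: str = "stopped"
    task_state: Optional[str] = None


def compute_power_on(instance: Instance) -> None:
    """Stand-in for nova-compute: do the work, then clear task_state."""
    instance.vm_state = "active"
    instance.task_state = None


def lost_cast(instance: Instance) -> None:
    """Simulates an RPC message that never reaches nova-compute."""


def api_start(instance: Instance, cast: Callable[[Instance], None]) -> None:
    """Stand-in for nova-api: mark the task, then fire and forget."""
    if instance.task_state is not None:
        raise RuntimeError(f"instance is busy: task_state={instance.task_state}")
    instance.task_state = "powering-on"
    cast(instance)  # a cast returns immediately; delivery is not confirmed


healthy = Instance()
api_start(healthy, compute_power_on)
print(healthy)  # task_state was cleared by the compute stand-in

stuck = Instance()
api_start(stuck, lost_cast)
print(stuck)  # task_state is still 'powering-on'

try:
    api_start(stuck, compute_power_on)  # a retry is rejected
except RuntimeError as exc:
    print(exc)
```

In this model, nothing on the API side ever learns that the cast was lost, which is why only a manual reset clears the state.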

To reproduce:
1. Make RPC messages hang
2. Issue a 'server start' request
3. After request, do a 'server show' - the server is stuck in 'powering-on' forever.

This issue was previously reported to the mailing list. Please see this:
http://lists.openstack.org/pipermail/openstack-dev/2016-April/092239.html
and this
http://lists.openstack.org/pipermail/openstack-dev/2016-April/092240.html

Shoham Peller (shoham-peller) wrote :

Please see a previous bug on a similar issue:
https://bugs.launchpad.net/nova/+bug/1276214

There, the RPC message for a migrate is issued using a 'call' and not a 'cast', so the caller waits for nova-conductor to acknowledge the message before returning. Catching the RPC timeout exception was therefore enough.
We should either always use the RPC 'call' function or, if we want to release the REST request as quickly as possible, move the migrate RPC request to 'cast' as well.
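To make the 'call' versus 'cast' distinction concrete, here is a minimal sketch under stated assumptions. `RpcTimeout`, `start_with_call`, `start_with_cast`, and the transport functions are hypothetical stand-ins written for this example; they are not the code from either bug.

```python
# Sketch only: a blocking 'call' surfaces a timeout the sender can handle,
# while a fire-and-forget 'cast' gives the sender nothing to catch.
from dataclasses import dataclass
from typing import Callable, Optional


class RpcTimeout(Exception):
    """Hypothetical stand-in for an RPC timeout error."""


@dataclass
class Instance:
    task_state: Optional[str] = None


def unreachable_call(instance: Instance) -> None:
    """A 'call' whose reply never arrives, so the sender times out."""
    raise RpcTimeout("no reply from the remote service")


def unreachable_cast(instance: Instance) -> None:
    """A 'cast' that is silently lost; the sender sees no error."""


def start_with_call(instance: Instance, call: Callable[[Instance], None]) -> None:
    instance.task_state = "powering-on"
    try:
        call(instance)
    except RpcTimeout:
        instance.task_state = None  # roll back so the instance stays usable
        raise


def start_with_cast(instance: Instance, cast: Callable[[Instance], None]) -> None:
    instance.task_state = "powering-on"
    cast(instance)  # no reply is awaited, so no rollback can be triggered


via_call = Instance()
try:
    start_with_call(via_call, unreachable_call)
except RpcTimeout:
    pass
print(via_call.task_state)

via_cast = Instance()
start_with_cast(via_cast, unreachable_cast)
print(via_cast.task_state)
```

The trade-off is the one raised above: the 'call' path holds the REST request open until a reply or timeout, while the 'cast' path returns quickly but cannot detect the loss.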

tags: added: compute
tags: added: api
removed: compute
Changed in nova:
status: New → Confirmed
Pushkar Umaranikar (pushkar-umaranikar) wrote :

Can you please confirm the "vm state" of an instance in the scenario you mentioned? If the vm state is "building", then I believe we move the VM to the error state after the build timeout.

Shoham Peller (shoham-peller) wrote :

When starting a VM, if the RPC message is lost, the VM's status is not "building" but "shutoff". To reproduce, in the file compute/rpcapi.py, in the function start_instance, I replaced the cctxt.cast() call with a "pass".

(openstack) server show a
+--------------------------------------+------------------------------------------------------------------------+
| Field | Value |
+--------------------------------------+------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-STS:power_state | 4 |
| OS-EXT-STS:task_state | powering-on |
| OS-EXT-STS:vm_state | stopped |
| OS-SRV-USG:launched_at | 2016-04-18T17:43:48.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | private=10.0.0.3 |
| config_drive | True |
| created | 2016-04-18T17:43:42Z |
| flavor | m1.tiny (1) |
| hostId | 65821c34e6318b27333cd8df3c106507f119a0d474a8d05af3333646 |
| id | 0c06b0ac-fd48-47dc-b032-723191bc83a5 |
| image | cirros-0.3.4-x86_64-uec-ramdisk (b8f076fb-f213-41ce-bd10-c3a7574fb9bd) |
| key_name | None |
| name | a |
| os-extended-volumes:volumes_attached | [] |
| project_id | 5e2ff5f669c94b3aa24a260c6f572358 |
| properties | |
| security_groups | [{u'name': u'default'}] |
| status | SHUTOFF ...


Sean Dague (sdague)
tags: removed: api
Prateek Arora (parora)
Changed in nova:
assignee: nobody → Prateek Arora (parora)
Prateek Arora (parora)
Changed in nova:
assignee: Prateek Arora (parora) → nobody
John Garbutt (johngarbutt) wrote :

The proper fix for this is the proposed tasks framework mentioned here:
http://specs.openstack.org/openstack/nova-specs/priorities/liberty-priorities.html#priorities-without-a-clear-plan

Currently, if you restart nova-compute, it will largely try to clean up the failed states you describe.
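The idea of cleaning up on restart can be illustrated with a toy sketch. This is not nova's actual startup logic: `cleanup_on_startup` and the `IN_FLIGHT_TASK_STATES` set are assumptions made for this example, and which states the real service handles is not stated in this report.

```python
# Toy illustration of startup cleanup; not nova's real implementation.
from dataclasses import dataclass
from typing import List, Optional

# Assumed set of interrupted task states, chosen for this sketch only.
IN_FLIGHT_TASK_STATES = {"powering-on", "powering-off"}


@dataclass
class Instance:
    name: str
    vm_state: str
    task_state: Optional[str] = None


def cleanup_on_startup(instances: List[Instance]) -> List[str]:
    """Clear task states left over from interrupted operations.

    Returns the names of the instances that were cleaned up.
    """
    cleaned = []
    for instance in instances:
        if instance.task_state in IN_FLIGHT_TASK_STATES:
            instance.task_state = None
            cleaned.append(instance.name)
    return cleaned


instances = [
    Instance("a", "stopped", "powering-on"),  # stuck, as in this report
    Instance("b", "active"),                  # idle, left untouched
]
print(cleanup_on_startup(instances))
print(instances)
```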

Changed in nova:
importance: Undecided → Wishlist