Bug #1571175 “RPC failure makes server stuck in a task” : Bugs : OpenStack Compute (nova)


Shoham Peller (shoham-peller) wrote on 2016-04-16:

#1

Please see a previous bug on a similar issue:
https://bugs.launchpad.net/nova/+bug/1276214

There, the RPC message for a migrate is issued using a 'call' rather than a 'cast', so the API waits for nova-conductor to respond before returning. Catching the RPC timeout exception was therefore enough.
We should either consider always using the RPC 'call' function or, if we want to free the REST request as fast as possible, move the migrate RPC request to 'cast' as well.
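
To make the difference concrete, here is a minimal sketch of the two styles. It does not use oslo.messaging or nova code: FakeRPCClient, MessagingTimeout and start_action are hypothetical stand-ins written only for this illustration, and a "lost" message is simulated with a flag.

```python
class MessagingTimeout(Exception):
    """Hypothetical stand-in for an RPC timeout error."""


class FakeRPCClient:
    """Toy client: 'call' waits for a reply, 'cast' is fire-and-forget."""

    def __init__(self, deliver=True):
        self.deliver = deliver  # False simulates a lost RPC message
        self.handled = []       # methods that actually reached the "server"

    def cast(self, method):
        # No reply is expected, so a lost message goes unnoticed.
        if self.deliver:
            self.handled.append(method)

    def call(self, method):
        # A reply is expected, so a lost message surfaces as a timeout.
        if not self.deliver:
            raise MessagingTimeout(method)
        self.handled.append(method)
        return "ok"


def start_action(client, instance, method, task_state, use_call):
    """Set a task state, send the RPC, and revert the state on timeout."""
    instance["task_state"] = task_state
    try:
        if use_call:
            client.call(method)
        else:
            client.cast(method)
    except MessagingTimeout:
        instance["task_state"] = None  # revert, as catching the timeout allows
        return False
    return True


lost = FakeRPCClient(deliver=False)

via_cast = {"task_state": None}
start_action(lost, via_cast, "start_instance", "powering-on", use_call=False)
print(via_cast)  # task_state stays set, nothing noticed the loss

via_call = {"task_state": None}
start_action(lost, via_call, "start_instance", "powering-on", use_call=True)
print(via_call)  # task_state reverted after the simulated timeout
```

With 'cast' the sender has nothing to catch, so the task state is never reverted. With 'call' the timeout gives the API a point at which to undo the change, at the cost of holding the REST request open.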

tags:

added: compute

Shoham Peller (shoham-peller) on 2016-04-16

tags:

added: api
removed: compute

Pushkar Umaranikar (pushkar-umaranikar) on 2016-04-18

Changed in nova:
status:	New → Confirmed


Pushkar Umaranikar (pushkar-umaranikar) wrote on 2016-04-18:

#2

Can you please confirm the "vm state" of the instance in the scenario you mentioned? If the vm state is "building", then I believe the VM is moved to the error state after the build timeout.


Shoham Peller (shoham-peller) wrote on 2016-04-18:

#3

When starting a VM and the RPC message is lost, the VM's status is not "building" but "shutoff". To reproduce, in the file compute/rpcapi.py, in the function start_instance, I replaced the cctxt.cast() call with "pass".

(openstack) server show a
+--------------------------------------+------------------------------------------------------------------------+
| Field                                | Value                                                                  |
+--------------------------------------+------------------------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                                                 |
| OS-EXT-AZ:availability_zone          | nova                                                                   |
| OS-EXT-STS:power_state               | 4                                                                      |
| OS-EXT-STS:task_state                | powering-on                                                            |
| OS-EXT-STS:vm_state                  | stopped                                                                |
| OS-SRV-USG:launched_at               | 2016-04-18T17:43:48.000000                                             |
| OS-SRV-USG:terminated_at             | None                                                                   |
| accessIPv4                           |                                                                        |
| accessIPv6                           |                                                                        |
| addresses                            | private=10.0.0.3                                                       |
| config_drive                         | True                                                                   |
| created                              | 2016-04-18T17:43:42Z                                                   |
| flavor                               | m1.tiny (1)                                                            |
| hostId                               | 65821c34e6318b27333cd8df3c106507f119a0d474a8d05af3333646               |
| id                                   | 0c06b0ac-fd48-47dc-b032-723191bc83a5                                   |
| image                                | cirros-0.3.4-x86_64-uec-ramdisk (b8f076fb-f213-41ce-bd10-c3a7574fb9bd) |
| key_name                             | None                                                                   |
| name                                 | a                                                                      |
| os-extended-volumes:volumes_attached | []                                                                     |
| project_id                           | 5e2ff5f669c94b3aa24a260c6f572358                                       |
| properties                           |                                                                        |
| security_groups                      | [{u'name': u'default'}]                                                |
| status                               | SHUTOFF                                                                |
| updated                              | 2016-04-18T17:45:05Z                                                   |
| user_id                              | ccc6e83814364e1badd37cd0b36db08d                                       |
+--------------------------------------+------------------------------------------------------------------------+
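
As a rough way to spot this condition from the outside, the fields in the output above can be checked for a task state that is set while the record has not been updated for some time. The sketch below is only an illustration: is_stuck, parse_ts and the 10-minute threshold are assumptions made for this example, not part of nova or the openstack client.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse a timestamp in the format used by the 'updated' field above."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_stuck(server, now, max_age=timedelta(minutes=10)):
    """Return True if a task state is set and the record is older than max_age.

    The threshold is an arbitrary value chosen for this sketch.
    """
    task_state = server.get("OS-EXT-STS:task_state")
    if task_state in (None, "", "None"):
        return False
    return now - parse_ts(server["updated"]) > max_age


# Values copied from the 'server show' output above.
server = {
    "OS-EXT-STS:task_state": "powering-on",
    "OS-EXT-STS:vm_state": "stopped",
    "status": "SHUTOFF",
    "updated": "2016-04-18T17:45:05Z",
}

print(is_stuck(server, parse_ts("2016-04-18T18:00:00Z")))
```

A check like this only detects the symptom. It cannot tell a lost message from a compute node that is simply slow, which is why the threshold has to be chosen conservatively.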

Sean Dague (sdague) on 2016-04-18

tags:

removed: api

Prateek Arora (parora) on 2016-04-28

Changed in nova:
assignee:	nobody → Prateek Arora (parora)

Prateek Arora (parora) on 2016-05-04

Changed in nova:
assignee:	Prateek Arora (parora) → nobody


John Garbutt (johngarbutt) wrote on 2016-07-14:

#4

The proper fix for this is the proposed tasks framework mentioned here:
http://specs.openstack.org/openstack/nova-specs/priorities/liberty-priorities.html#priorities-without-a-clear-plan

Currently, if you restart nova-compute, it will largely try to clean up the failed states you describe.
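
The general idea of cleaning up on restart can be sketched as follows. This is not nova's actual startup code: cleanup_on_startup, the dictionary layout and the set of task states are hypothetical choices made for illustration only.

```python
# Hypothetical selection of task states treated as transient in this sketch.
TRANSIENT_TASK_STATES = {"powering-on", "powering-off", "rebooting"}


def cleanup_on_startup(instances, host):
    """Clear leftover transient task states for instances on this host.

    Returns the ids of the instances that were reset.
    """
    reset_ids = []
    for inst in instances:
        if inst["host"] != host:
            continue  # only touch instances owned by the restarting service
        if inst["task_state"] in TRANSIENT_TASK_STATES:
            inst["task_state"] = None
            reset_ids.append(inst["id"])
    return reset_ids


instances = [
    {"id": "a", "host": "compute1", "task_state": "powering-on"},
    {"id": "b", "host": "compute1", "task_state": None},
    {"id": "c", "host": "compute2", "task_state": "rebooting"},
]
print(cleanup_on_startup(instances, "compute1"))
```

A restart-time sweep like this only helps once the service is restarted, so an instance can stay in its task state for an arbitrary time before that.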

Changed in nova:
importance:	Undecided → Wishlist
