Nova rescheduling failure

Bug #1300288 reported by Lucas Alvares Gomes
This bug affects 3 people
Affects: Ironic
Status: Fix Released
Importance: High
Assigned to: Lucas Alvares Gomes
Milestone: 2014.1

Bug Description

If something fails in the spawn() method (e.g. the Nova Ironic driver couldn't add the PXE parameters in Ironic), Nova will try to reschedule the instance. As part of the rescheduling process Nova calls destroy() on that instance, which sends a request to Ironic to undeploy the node. The problem is that the node was never deployed, so another error is raised, which overwrites the previous one and makes it very difficult to debug the real cause of the problem. The instance also can't be deleted from Nova afterwards.
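
Roughly, the failing flow looks like this (a minimal sketch only; spawn, destroy, icli and node_uuid are illustrative names, not the driver's actual code):

def spawn(icli, node_uuid, instance):
    # Deploy preparation fails, e.g. the PXE parameters can't be added
    # to the node in Ironic.
    raise Exception('could not add PXE parameters to node %s' % node_uuid)


def destroy(icli, node_uuid, instance):
    # Called by Nova while rescheduling the failed instance. The node was
    # never deployed (its provision_state is still None), so Ironic rejects
    # the tear down with "RPC do_node_tear_down not allowed ... in state
    # None", and that new error masks the original spawn() failure.
    icli.node.set_provision_state(node_uuid, 'deleted')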

LOGS:

$ nova boot...

stack@stack-virtual-machine:~/devstack$ nova list
+--------------------------------------+------------+--------+------------+-------------+-------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------------+--------+------------+-------------+-------------------+
| e155d6c7-f4ba-4726-a892-3db2d8ee9c0d | ironictest | ERROR | - | NOSTATE | public=172.24.4.4 |
+--------------------------------------+------------+--------+------------+-------------+-------------------+

stack@stack-virtual-machine:~/devstack$ nova show e155d6c7-f4ba-4726-a892-3db2d8ee9c0d
+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | stack-virtual-machine |
| OS-EXT-SRV-ATTR:hypervisor_hostname | db2f5627-cb55-44d7-99dd-88c0a0eccd77 |
| OS-EXT-SRV-ATTR:instance_name | instance-00000002 |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | error |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2014-03-31T15:04:10Z |
| fault | {"message": "RPC do_node_tear_down not allowed for node db2f5627-cb55-44d7-99dd-88c0a0eccd77 in state None |
| | Traceback (most recent call last): |
| | |
| | File \"/usr/local/lib/python2.7/dist-packages/wsmeext/pecan.py\", line 77, in callfunction |
| | result = f(self, *args, **kwar", "code": 500, "details": " File \"/opt/stack/nova/nova/compute/manager.py\", line 293, in decorated_function |
| | return function(self, context, *args, **kwargs) |
| | File \"/opt/stack/nova/nova/compute/manager.py\", line 2003, in run_instance |
| | do_run_instance() |
| | File \"/opt/stack/nova/nova/openstack/common/lockutils.py\", line 249, in inner |
| | return f(*args, **kwargs) |
| | File \"/opt/stack/nova/nova/compute/manager.py\", line 2002, in do_run_instance |
| | legacy_bdm_in_spec) |
| | File \"/opt/stack/nova/nova/compute/manager.py\", line 1150, in _run_instance |
| | notify(\"error\", fault=e) # notify that build failed |
| | File \"/opt/stack/nova/nova/openstack/common/excutils.py\", line 68, in __exit__ |
| | six.reraise(self.type_, self.value, self.tb) |
| | File \"/opt/stack/nova/nova/compute/manager.py\", line 1134, in _run_instance |
| | instance, image_meta, legacy_bdm_in_spec) |
| | File \"/opt/stack/nova/nova/compute/manager.py\", line 1287, in _build_instance |
| | filter_properties, bdms, legacy_bdm_in_spec) |
| | File \"/opt/stack/nova/nova/compute/manager.py\", line 1333, in _reschedule_or_error |
| | self._log_original_error(exc_info, instance_uuid) |
| | File \"/opt/stack/nova/nova/openstack/common/excutils.py\", line 68, in __exit__ |
| | six.reraise(self.type_, self.value, self.tb) |
| | File \"/opt/stack/nova/nova/compute/manager.py\", line 1328, in _reschedule_or_error |
| | bdms, requested_networks) |
| | File \"/opt/stack/nova/nova/compute/manager.py\", line 2053, in _shutdown_instance |
| | requested_networks) |
| | File \"/opt/stack/nova/nova/openstack/common/excutils.py\", line 68, in __exit__ |
| | six.reraise(self.type_, self.value, self.tb) |
| | File \"/opt/stack/nova/nova/compute/manager.py\", line 2043, in _shutdown_instance |
| | block_device_info) |
| | File \"/opt/stack/ironic/ironic/nova/virt/ironic/driver.py\", line 500, in destroy |
| | icli.node.set_provision_state, node_uuid, 'deleted') |
| | File \"/opt/stack/ironic/ironic/nova/virt/ironic/driver.py\", line 183, in _retry_if_service_is_unavailable |
| | timer.start(interval=CONF.ironic.api_retry_interval).wait() |
| | File \"/usr/local/lib/python2.7/dist-packages/eventlet/event.py\", line 116, in wait |
| | return hubs.get_hub().switch() |
| | File \"/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py\", line 187, in switch |
| | return self.greenlet.switch() |
| | File \"/opt/stack/nova/nova/openstack/common/loopingcall.py\", line 78, in _inner |
| | self.f(*self.args, **self.kw) |
| | File \"/opt/stack/ironic/ironic/nova/virt/ironic/driver.py\", line 171, in _request_api |
| | func(*args) |
| | File \"/opt/stack/python-ironicclient/ironicclient/v1/node.py\", line 102, in set_provision_state |
| | return self._update(self._path(path), target, method='PUT') |
| | File \"/opt/stack/python-ironicclient/ironicclient/common/base.py\", line 68, in _update |
| | resp, body = self.api.json_request(method, url, body=body) |
| | File \"/opt/stack/python-ironicclient/ironicclient/common/http.py\", line 186, in json_request |
| | resp, body_iter = self._http_request(url, method, **kwargs) |
| | File \"/opt/stack/python-ironicclient/ironicclient/common/http.py\", line 169, in _http_request |
| | error_json.get('debuginfo')) |
| | ", "created": "2014-03-31T15:04:18Z"} |
| flavor | baremetal (ae187c1a-3604-413f-b5e8-d8fd0b687ddd) |
| hostId | ea888db9bf4d0ee6a8188f9c70413951a362a970b297d0d59ec31cac |
| id | e155d6c7-f4ba-4726-a892-3db2d8ee9c0d |
| image | cirros-0.3.1-x86_64-uec (29d1326b-f5a2-42c0-a712-a81057612521) |
| key_name | default |
| metadata | {} |
| name | ironictest |
| os-extended-volumes:volumes_attached | [] |
| public network | 172.24.4.4 |
| status | ERROR |
| tenant_id | 6fa62fa758e645659f5b2519c981a0b6 |
| updated | 2014-03-31T15:04:18Z |
| user_id | be6a18d4c46c452782eeea06754627a2 |
+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+

2014-03-31 14:11:46.580 13461 ERROR wsme.api [-] Server-side error: "RPC do_node_tear_down not allowed for node 632244aa-a2ff-40e7-b2dd-37c983691397 in state None
Traceback (most recent call last):

  File "/opt/stack/ironic/ironic/openstack/common/rpc/common.py", line 423, in catch_client_exception
    return func(*args, **kwargs)

  File "/opt/stack/ironic/ironic/conductor/manager.py", line 428, in do_node_tear_down
    task.release_resources()

  File "/opt/stack/ironic/ironic/openstack/common/excutils.py", line 70, in __exit__
    six.reraise(self.type_, self.value, self.tb)

  File "/opt/stack/ironic/ironic/conductor/manager.py", line 405, in do_node_tear_down
    % {'node': node_id, 'state': node.provision_state})

InstanceDeployFailure: RPC do_node_tear_down not allowed for node 632244aa-a2ff-40e7-b2dd-37c983691397 in state None
". Detail:
Traceback (most recent call last):

  File "/usr/local/lib/python2.7/dist-packages/wsmeext/pecan.py", line 77, in callfunction
    result = f(self, *args, **kwargs)

  File "/opt/stack/ironic/ironic/api/controllers/v1/node.py", line 228, in provision
    pecan.request.context, node_uuid, topic)

  File "/opt/stack/ironic/ironic/conductor/rpcapi.py", line 205, in do_node_tear_down
    topic=topic or self.topic)

  File "/opt/stack/ironic/ironic/openstack/common/rpc/proxy.py", line 125, in call
    result = rpc.call(context, real_topic, msg, timeout)

  File "/opt/stack/ironic/ironic/openstack/common/rpc/__init__.py", line 112, in call
    return _get_impl().call(CONF, context, topic, msg, timeout)

  File "/opt/stack/ironic/ironic/openstack/common/rpc/impl_kombu.py", line 815, in call
    rpc_amqp.get_connection_pool(conf, Connection))

  File "/opt/stack/ironic/ironic/openstack/common/rpc/amqp.py", line 575, in call
    rv = list(rv)

  File "/opt/stack/ironic/ironic/openstack/common/rpc/amqp.py", line 540, in __iter__
    raise result

InstanceDeployFailure_Remote: RPC do_node_tear_down not allowed for node 632244aa-a2ff-40e7-b2dd-37c983691397 in state None
Traceback (most recent call last):

  File "/opt/stack/ironic/ironic/openstack/common/rpc/common.py", line 423, in catch_client_exception
    return func(*args, **kwargs)

  File "/opt/stack/ironic/ironic/conductor/manager.py", line 428, in do_node_tear_down
    task.release_resources()

  File "/opt/stack/ironic/ironic/openstack/common/excutils.py", line 70, in __exit__
    six.reraise(self.type_, self.value, self.tb)

  File "/opt/stack/ironic/ironic/conductor/manager.py", line 405, in do_node_tear_down
    % {'node': node_id, 'state': node.provision_state})

InstanceDeployFailure: RPC do_node_tear_down not allowed for node 632244aa-a2ff-40e7-b2dd-37c983691397 in state None

Changed in ironic:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
importance: Undecided → High
milestone: none → juno-1
description: updated
Dmitry Tantsur (divius) wrote :

My steps to reproduce:
1. Put nodes to maintenance mode
2. Try to deploy anything
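
The same steps driven from Python, as a hedged sketch (the credentials are placeholders and set_maintenance on the v1 node manager is an assumption; adjust to your client version):

from ironicclient import client as ironic_client

# Placeholder credentials for a devstack-like setup.
ironic = ironic_client.get_client('1',
                                  os_username='admin',
                                  os_password='secret',
                                  os_tenant_name='admin',
                                  os_auth_url='http://127.0.0.1:5000/v2.0')

# Step 1: put the node into maintenance mode.
ironic.node.set_maintenance('db2f5627-cb55-44d7-99dd-88c0a0eccd77', 'true')

# Step 2: deploy anything to it, e.g. `nova boot --flavor baremetal ...` as
# in the logs above; spawn() fails and the reschedule/destroy() path then
# hits the tear down error reported here.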

aeva black (tenbrae)
Changed in ironic:
status: New → Triaged
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/84497

Changed in ironic:
status: Triaged → In Progress
aeva black (tenbrae)
tags: added: icehouse-backport-potential
Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → Dmitry "Divius" Tantsur (divius)
Changed in ironic:
assignee: Dmitry "Divius" Tantsur (divius) → Lucas Alvares Gomes (lucasagomes)
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/84497
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=1b1d6b0cec64311ae62443b8647fd528d3c91f1d
Submitter: Jenkins
Branch: master

commit 1b1d6b0cec64311ae62443b8647fd528d3c91f1d
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Apr 1 16:16:02 2014 +0100

    Fix Nova rescheduling tear down problem

    If something fails in the spawn() method (e.g. driver parameters, node
    in maintenance mode) Nova will try to reschedule the instance (in case
    the retry filter is enabled). As part of the rescheduling process Nova
    will call destroy() on that instance, which would cause another problem
    when Ironic tried to undeploy a node that wasn't deployed. This patch
    fixes the tear down problem by checking the provision_state of the
    node before requesting Ironic to undeploy the node.

    Also, the relevant code and tests underwent some refactoring to make
    the code shorter and less nested.

    Closes-Bug: #1300288
    Co-Authored-By: Dmitry Tantsur <email address hidden>
    Change-Id: I4c7037d6c7b99548a646ef6665b29479277fb4d5
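
In outline, the change makes destroy() look at the node's provision_state before asking Ironic to tear the node down (a hedged sketch, not the exact patch; the state list and names are illustrative):

# States in which a tear down is meaningful; illustrative, not Ironic's
# canonical list.
TEAR_DOWN_STATES = ('active', 'deploying', 'deploywait', 'deployfail', 'error')


def destroy(icli, node, instance):
    if node.provision_state in TEAR_DOWN_STATES:
        # The node reached (or at least attempted) a deploy, so undeploy it.
        icli.node.set_provision_state(node.uuid, 'deleted')
    # Nodes that were never deployed (provision_state None) are skipped, so
    # the tear down no longer fails and the original spawn() error is kept.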

Changed in ironic:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (milestone-proposed)

Fix proposed to branch: milestone-proposed
Review: https://review.openstack.org/86722

aeva black (tenbrae)
Changed in ironic:
milestone: juno-1 → icehouse-rc2
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (milestone-proposed)

Reviewed: https://review.openstack.org/86722
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=80e26634af4d9e0ab943abe06583c4c1d0466c65
Submitter: Jenkins
Branch: milestone-proposed

commit 80e26634af4d9e0ab943abe06583c4c1d0466c65
Author: Devananda van der Veen <email address hidden>
Date: Tue Apr 1 16:16:02 2014 +0100

    Fix Nova rescheduling tear down problem

    If something fails in the spawn() method (e.g. driver parameters, node
    in maintenance mode) Nova will try to reschedule the instance (in case
    the retry filter is enabled). As part of the rescheduling process Nova
    will call destroy() on that instance, which would cause another problem
    when Ironic tried to undeploy a node that wasn't deployed. This patch
    fixes the tear down problem by checking the provision_state of the
    node before requesting Ironic to undeploy the node.

    Cherry picked from 1b1d6b0cec64311ae62443b8647fd528d3c91f1d

    Change-Id: Ic3a502167103e84be041ca1ee06fd17d8140529c
    Closes-Bug: #1300288

Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: icehouse-rc2 → 2014.1