Evacuate Fails 'Invalid state of instance files' using Ceph Ephemeral RBD

Bug #1340411 reported by hifieli
This bug affects 15 people
Affects                    Status         Importance   Assigned to    Milestone
OpenStack Compute (nova)   Fix Released   Medium       Feilong Wang
Icehouse                   Fix Released   Undecided    Unassigned
Juno                       Fix Released   Undecided    Unassigned

Bug Description

Greetings,

We are unable to evacuate instances from a failed compute node when using shared storage. We are using Ceph Ephemeral RBD as the storage medium.

Steps to reproduce:

nova evacuate --on-shared-storage 6e2081ec-2723-43c7-a730-488bb863674c node-24
or
POST to http://ip-address:port/v2/tenant_id/servers/server_id/action with
{"evacuate":{"host":"node-24","onSharedStorage":1}}
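
For anyone scripting the API variant of the reproduction, here is a minimal sketch that only assembles the URL and JSON body shown above. The function name and its parameters are hypothetical, nothing is sent over the network, and the authentication header (normally X-Auth-Token) is left out on purpose.

```python
import json


def build_evacuate_request(endpoint, tenant_id, server_id, host, on_shared_storage):
    """Return (url, headers, payload) for the evacuate action from the report.

    Hypothetical helper: it builds the request pieces but does not send them.
    """
    url = "{}/v2/{}/servers/{}/action".format(
        endpoint.rstrip("/"), tenant_id, server_id)
    # Body layout copied from the report; the flag value is passed through as given.
    body = {"evacuate": {"host": host, "onSharedStorage": on_shared_storage}}
    # An auth token header would also be needed for a real request.
    headers = {"Content-Type": "application/json"}
    return url, headers, json.dumps(body)


if __name__ == "__main__":
    url, headers, payload = build_evacuate_request(
        "http://ip-address:port", "tenant_id",
        "6e2081ec-2723-43c7-a730-488bb863674c", "node-24", 1)
    print(url)
    print(payload)
```

The pieces can be handed to whichever HTTP client is already in use.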

Here is what shows up in the logs:

<180>Jul 10 20:36:48 node-24 nova-nova.compute.manager AUDIT: Rebuilding instance
<179>Jul 10 20:36:48 node-24 nova-nova.compute.manager ERROR: Setting instance vm_state to ERROR
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 5554, in _error_out_instance_on_exception
    yield
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2434, in rebuild_instance
    _("Invalid state of instance files on shared"
InvalidSharedStorage: Invalid state of instance files on shared storage
<179>Jul 10 20:36:49 node-24 nova-oslo.messaging.rpc.dispatcher ERROR: Exception during message handling: Invalid state of instance files on shared storage
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 133, in _dispatch_and_reply
    incoming.message))
  File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 176, in _dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 122, in _do_dispatch
    result = getattr(endpoint, method)(ctxt, **new_args)
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 393, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/server.py", line 139, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 88, in wrapped
    payload)
  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 68, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 71, in wrapped
    return f(self, context, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 274, in decorated_function
    pass
  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 68, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 260, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 327, in decorated_function
    function(self, context, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 303, in decorated_function
    e, sys.exc_info())
  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 68, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 290, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2434, in rebuild_instance
    _("Invalid state of instance files on shared"
InvalidSharedStorage: Invalid state of instance files on shared storage

Tyler Wilson (loth) wrote :

I was able to work around this by:

1. Editing nova.instances and replacing all references to the old node with the destination node
2. Resetting the status of the instance to active
3. Issuing a hard reboot of the instance

This re-creates the XML and console log on the destination node and boots the instance using the existing Ceph RBD.
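
The three steps above can be sketched as a plan that is printed rather than executed. This is only an illustration under stated assumptions: the function name is hypothetical, the SQL assumes that the "references" to the old node live in the host and node columns of the instances table and that both hold the compute node name, and the CLI lines assume the standard nova client commands reset-state and reboot. Review everything against your own deployment before running any of it.

```python
def workaround_plan(instance_uuid, dest_host):
    """Return (sql, params, commands) describing the manual workaround.

    Hypothetical helper: it executes nothing and only describes the steps.
    """
    # Step 1: repoint the instance record at the destination node.
    # Assumption: 'host' and 'node' are the columns that reference the old node.
    sql = "UPDATE instances SET host = %s, node = %s WHERE uuid = %s"
    params = (dest_host, dest_host, instance_uuid)
    commands = [
        # Step 2: reset the instance state to active.
        ["nova", "reset-state", "--active", instance_uuid],
        # Step 3: hard reboot, which regenerates the domain XML on the new host.
        ["nova", "reboot", "--hard", instance_uuid],
    ]
    return sql, params, commands


if __name__ == "__main__":
    sql, params, commands = workaround_plan(
        "6e2081ec-2723-43c7-a730-488bb863674c", "node-24")
    print(sql, params)
    for argv in commands:
        print(" ".join(argv))
```

Keeping the SQL parameterized and the commands as argument lists avoids quoting mistakes if the plan is later fed to a database driver or to subprocess.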

Dmitry Borodaenko (angdraug) wrote :
Sean Dague (sdague)
Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Feilong Wang (flwang)
Changed in nova:
assignee: nobody → Fei Long Wang (flwang)
Feilong Wang (flwang) wrote :

hifieli and Tyler, I suspect this is a configuration issue. Can you put the nova instance path on CephFS and try again? You can follow the document below. Cheers.

http://www.ibm.com/developerworks/cloud/library/cl-openstackceph/

Feilong Wang (flwang) wrote :

Meanwhile, I will investigate if we can improve the check to cover the case without CephFS.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/121745

Changed in nova:
status: Confirmed → In Progress
Dmitry Borodaenko (angdraug) wrote :

Looks like a duplicate of bug #1249319.

Dan Smith (danms)
Changed in nova:
importance: Low → Medium
Matt Riedemann (mriedem)
tags: added: juno-backport-potential
Ante Karamatić (ivoks)
tags: added: cts
Nobuyoshi NIHONGI (nihongi) wrote :

I confirmed that the patch also fixes bug #1372472.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/131613

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/121745
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=91d3272b975572d9866b7d959547e438142dc4fb
Submitter: Jenkins
Branch: master

commit 91d3272b975572d9866b7d959547e438142dc4fb
Author: Fei Long Wang <email address hidden>
Date: Tue Sep 16 15:43:37 2014 +1200

    Fix nova evacuate issues for RBD

    For RBD scenario, there are some issues in Nova code
    now against evacuate function:

    1. Based on current implementation, nova evacuate and
    nova rebuild are sharing some code. When user enables
    the on_shared_storage option for nova evacuate, nova
    will check if the instance path is accessible. For
    the RBD scenario, the volume(block) is shared between
    different hosts, though the path isn't shared at the
    filesystem level. This patch fixes this issue and adds
    test cases for that.

    2. Missing the 'recreate' parameter for rebuild method.
    Though the libvirt driver doesn't implement rebuild
    method(only Ironic driver implements it), but we really
    need to set 'recreate' in kwargs so it gets passed to
    _rebuild_default_impl so we don't call driver.destroy
    on evacuate for shared filesystem/block storage cases.
    It is fixed in this patch and test case is added as well.

    Closes-Bug: 1249319
    Closes-Bug: 1340411

    Change-Id: Idc8c45b055e986cf85730235d5d25777632ad1c1
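
The two points in the commit message can be illustrated with a simplified, self-contained sketch. It is not the nova code: the function names, parameters and step labels are stand-ins chosen for illustration, and only the exception name and message are taken from the traceback in the bug description.

```python
import os


class InvalidSharedStorage(Exception):
    """Stand-in for the exception named in the traceback."""


def instance_on_disk(instance_path, backend_is_shared_block):
    """Point 1: treat the instance as reachable if its directory is writable
    on this host OR its disk lives on shared block storage such as RBD."""
    shared_instance_path = os.access(instance_path, os.W_OK)
    return shared_instance_path or backend_is_shared_block


def check_evacuate(on_shared_storage, instance_path, backend_is_shared_block):
    """Raise if the caller's on_shared_storage flag disagrees with the host."""
    on_disk = instance_on_disk(instance_path, backend_is_shared_block)
    if on_shared_storage != on_disk:
        raise InvalidSharedStorage(
            "Invalid state of instance files on shared storage")
    return on_disk


def plan_rebuild_steps(recreate):
    """Point 2: when 'recreate' is set (evacuate), skip destroying the guest
    so a shared disk is not removed before the instance is spawned again."""
    steps = []
    if not recreate:
        steps.append("destroy")
    steps.append("spawn")
    return steps


if __name__ == "__main__":
    missing = "/nonexistent/nova/instances/demo"
    # The path is absent on the target host, but the RBD-backed disk is shared.
    print(check_evacuate(True, missing, backend_is_shared_block=True))
    print(plan_rebuild_steps(recreate=True))
```

With a filesystem-only check, the first call would raise the error from the report, because the instance directory does not exist on the destination host even though the RBD image is reachable.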

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/131629

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/131613
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7920cfdab2fb10e01544eeb713a1e3bc79bc4996
Submitter: Jenkins
Branch: stable/juno

commit 7920cfdab2fb10e01544eeb713a1e3bc79bc4996
Author: Fei Long Wang <email address hidden>
Date: Tue Sep 16 15:43:37 2014 +1200

    Fix nova evacuate issues for RBD

    For RBD scenario, there are some issues in Nova code
    now against evacuate function:

    1. Based on current implementation, nova evacuate and
    nova rebuild are sharing some code. When user enables
    the on_shared_storage option for nova evacuate, nova
    will check if the instance path is accessible. For
    the RBD scenario, the volume(block) is shared between
    different hosts, though the path isn't shared at the
    filesystem level. This patch fixes this issue and adds
    test cases for that.

    2. Missing the 'recreate' parameter for rebuild method.
    Though the libvirt driver doesn't implement rebuild
    method(only Ironic driver implements it), but we really
    need to set 'recreate' in kwargs so it gets passed to
    _rebuild_default_impl so we don't call driver.destroy
    on evacuate for shared filesystem/block storage cases.
    It is fixed in this patch and test case is added as well.

    Closes-Bug: 1249319
    Closes-Bug: 1340411

    Change-Id: Idc8c45b055e986cf85730235d5d25777632ad1c1
    (cherry picked from commit 91d3272b975572d9866b7d959547e438142dc4fb)

tags: added: in-stable-juno
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/icehouse)

Reviewed: https://review.openstack.org/131629
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3de3f1066fa47312b8c3075abf790631034d67a3
Submitter: Jenkins
Branch: stable/icehouse

commit 3de3f1066fa47312b8c3075abf790631034d67a3
Author: Fei Long Wang <email address hidden>
Date: Tue Sep 16 15:43:37 2014 +1200

    Fix nova evacuate issues for RBD

    For RBD scenario, there are some issues in Nova code
    now against evacuate function:

    1. Based on current implementation, nova evacuate and
    nova rebuild are sharing some code. When user enables
    the on_shared_storage option for nova evacuate, nova
    will check if the instance path is accessible. For
    the RBD scenario, the volume(block) is shared between
    different hosts, though the path isn't shared at the
    filesystem level. This patch fixes this issue and adds
    test cases for that.

    2. Missing the 'recreate' parameter for rebuild method.
    Though the libvirt driver doesn't implement rebuild
    method(only Ironic driver implements it), but we really
    need to set 'recreate' in kwargs so it gets passed to
    _rebuild_default_impl so we don't call driver.destroy
    on evacuate for shared filesystem/block storage cases.
    It is fixed in this patch and test case is added as well.

    Closes-Bug: 1249319
    Closes-Bug: 1340411

    Conflicts:
            nova/tests/compute/test_compute_mgr.py
            nova/tests/virt/libvirt/test_libvirt.py
            nova/virt/libvirt/driver.py

    Change-Id: Idc8c45b055e986cf85730235d5d25777632ad1c1
    (cherry picked from commit 91d3272b975572d9866b7d959547e438142dc4fb)
    (cherry picked from commit 7920cfdab2fb10e01544eeb713a1e3bc79bc4996)

tags: added: in-stable-icehouse
Yaguang Tang (heut2008)
tags: removed: juno-backport-potential
Thierry Carrez (ttx)
Changed in nova:
milestone: none → kilo-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-1 → 2015.1.0