no unquiesce for volume backed on quiesce failure

Bug #1754360 reported by Eric M Gonzalez
This bug affects 1 person
Affects                   Status         Importance  Assigned to      Milestone
OpenStack Compute (nova)  Fix Released   Medium      Eric M Gonzalez
  Ocata                   Fix Committed  Medium      Matt Riedemann
  Pike                    Fix Committed  Medium      Matt Riedemann
  Queens                  Fix Committed  Medium      Matt Riedemann

Bug Description

This is an extension of bug #1731986.

The fix for the above bug catches errors that occur during the snapshot of an instance's volumes. I later discovered that a failure can also occur during the call to quiesce_instance(): it raises an exception that propagates uncaught through snapshot_volume_backed(), which can leave the instance frozen / quiesced.

Reproducing this is tricky: my failures occur during the RPC call to the compute host, which ends in a MessagingTimeout while waiting for a reply. I have not found a reliable way to reproduce it on demand. My environment is Nova Mitaka, libvirt 1.3.1, and Ceph Jewel.

Similar to the above bug, this condition was discovered in Mitaka and the issue remains in Queens.

My proposed patch adds a blanket Exception catch around the call to rpcapi.quiesce_instance(), logs the caught exception, and issues an immediate rpcapi.unquiesce_instance() in order to thaw the instance.
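As a rough illustration of the pattern described above (not the actual nova patch), the following self-contained sketch uses hypothetical stand-ins: FakeComputeRPCAPI, the local MessagingTimeout class, and quiesce_for_snapshot() are assumptions made for the example and do not exist in nova under those names. It only shows the catch, log, unquiesce, re-raise sequence.

```python
import logging

LOG = logging.getLogger(__name__)


class MessagingTimeout(Exception):
    """Local stand-in for the RPC timeout seen in the traceback (assumption)."""


class FakeComputeRPCAPI:
    """Hypothetical stand-in for the compute RPC client; records calls made."""

    def __init__(self, fail_quiesce=False):
        self.fail_quiesce = fail_quiesce
        self.calls = []

    def quiesce_instance(self, context, instance):
        self.calls.append("quiesce")
        if self.fail_quiesce:
            # Simulate the compute host never replying to the RPC call.
            raise MessagingTimeout("Timed out waiting for a reply")

    def unquiesce_instance(self, context, instance, mapping=None):
        self.calls.append("unquiesce")


def quiesce_for_snapshot(rpcapi, context, instance):
    """Quiesce the instance; on any failure, thaw it and re-raise."""
    try:
        rpcapi.quiesce_instance(context, instance)
    except Exception:
        # The guest may already be frozen even though the reply was lost,
        # so attempt to thaw it before propagating the original error.
        LOG.exception("Quiesce failed for instance %s; unquiescing", instance)
        rpcapi.unquiesce_instance(context, instance, mapping=None)
        raise
```

With FakeComputeRPCAPI(fail_quiesce=True), the helper records a quiesce call followed by an unquiesce call and then re-raises the timeout to the caller. One limitation of this sketch: if unquiesce_instance() itself raises, that second exception is what the caller sees.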

Stack trace from nova-api-os container, responsible for quiesce / unquiesce of instance during snapshot:

[req-6229d689-dcc3-41ca-99b5-3dfc04e1e994 50505ffa89754660b4e6f7ebf69532b5 24bfcdab70714b85b5cb9f5f8270a414 - - -] Unexpected exception in API method
Traceback (most recent call last):
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/openstack/extensions.py", line 478, in wrapped
    return f(*args, **kwargs)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/openstack/common.py", line 391, in inner
    return f(*args, **kwargs)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 73, in wrapper
    return func(*args, **kwargs)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 73, in wrapper
    return func(*args, **kwargs)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/openstack/compute/servers.py", line 1108, in _action_create_image
    metadata)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/compute/api.py", line 140, in inner
    return f(self, context, instance, *args, **kw)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/compute/api.py", line 2389, in snapshot_volume_backed
    mapping=None)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/compute/api.py", line 2368, in snapshot_volume_backed
    self.compute_rpcapi.quiesce_instance(context, instance)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/compute/rpcapi.py", line 1041, in quiesce_instance
    return cctxt.call(ctxt, 'quiesce_instance', instance=instance)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call
    retry=self.retry)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
    timeout=timeout, retry=retry)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
    retry=retry)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send
    result = self._waiter.wait(msg_id, timeout)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get
    'to message ID %s' % msg_id)
MessagingTimeout: Timed out waiting for a reply to message ID 70ee5f80284b4b68a289bf232b89325c

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/550865

Changed in nova:
assignee: nobody → Eric M Gonzalez (egrh3)
status: New → In Progress
Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → Medium
Changed in nova:
assignee: Eric M Gonzalez (egrh3) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Eric M Gonzalez (egrh3)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/581451

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/581454

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/550865
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1e77faaa412ab9909dd9491cab4a819b5c84d3e8
Submitter: Zuul
Branch: master

commit 1e77faaa412ab9909dd9491cab4a819b5c84d3e8
Author: Eric M Gonzalez <email address hidden>
Date: Thu Mar 8 09:11:25 2018 -0600

    unquiesce instance after quiesce failure

    If the call to compute_rpcapi.quiesce_instance() raises an exception,
    any uncaught exception will break out of the function
    snapshot_volume_backed(). This can leave the instance in frozen state.

    This patch adds a blanket Exception catch to the try block and calls
    compute_rpcapi.unquiesce_instance() before reraising.

    This has been seen in the wild with RPC timeouts, but this is not the
    only possible genesis for an unknown error from quiesce_instance.

    Change-Id: Idca5998da8bb42b29a8fffdf52b4af3a043c6326
    Closes-Bug: #1754360

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b3

This issue was fixed in the openstack/nova 18.0.0.0b3 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/581451
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bcae081c4672a13eb1a75d1d3d81a439b5fdb796
Submitter: Zuul
Branch: stable/queens

commit bcae081c4672a13eb1a75d1d3d81a439b5fdb796
Author: Eric M Gonzalez <email address hidden>
Date: Thu Mar 8 09:11:25 2018 -0600

    unquiesce instance after quiesce failure

    If the call to compute_rpcapi.quiesce_instance() raises an exception,
    any uncaught exception will break out of the function
    snapshot_volume_backed(). This can leave the instance in frozen state.

    This patch adds a blanket Exception catch to the try block and calls
    compute_rpcapi.unquiesce_instance() before reraising.

    This has been seen in the wild with RPC timeouts, but this is not the
    only possible genesis for an unknown error from quiesce_instance.

    Change-Id: Idca5998da8bb42b29a8fffdf52b4af3a043c6326
    Closes-Bug: #1754360
    (cherry picked from commit 1e77faaa412ab9909dd9491cab4a819b5c84d3e8)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.6

This issue was fixed in the openstack/nova 17.0.6 release.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/605884

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/581454
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1487ea7abb974650d5658b29e52e0b0603751556
Submitter: Zuul
Branch: stable/pike

commit 1487ea7abb974650d5658b29e52e0b0603751556
Author: Eric M Gonzalez <email address hidden>
Date: Thu Mar 8 09:11:25 2018 -0600

    unquiesce instance after quiesce failure

    If the call to compute_rpcapi.quiesce_instance() raises an exception,
    any uncaught exception will break out of the function
    snapshot_volume_backed(). This can leave the instance in frozen state.

    This patch adds a blanket Exception catch to the try block and calls
    compute_rpcapi.unquiesce_instance() before reraising.

    This has been seen in the wild with RPC timeouts, but this is not the
    only possible genesis for an unknown error from quiesce_instance.

    Conflicts:
          nova/tests/unit/compute/test_compute_api.py

    NOTE(mriedem): The conflict is due to not having change
    I4e7b46deb43c0c2430b480f1a498a52fc4a9daf0, and its dependencies,
    in Pike.

    Change-Id: Idca5998da8bb42b29a8fffdf52b4af3a043c6326
    Closes-Bug: #1754360
    (cherry picked from commit 1e77faaa412ab9909dd9491cab4a819b5c84d3e8)
    (cherry picked from commit bcae081c4672a13eb1a75d1d3d81a439b5fdb796)

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/605884
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1535d42ea14503db63d34ac47dc8ae231e0edf59
Submitter: Zuul
Branch: stable/ocata

commit 1535d42ea14503db63d34ac47dc8ae231e0edf59
Author: Eric M Gonzalez <email address hidden>
Date: Thu Mar 8 09:11:25 2018 -0600

    unquiesce instance after quiesce failure

    If the call to compute_rpcapi.quiesce_instance() raises an exception,
    any uncaught exception will break out of the function
    snapshot_volume_backed(). This can leave the instance in frozen state.

    This patch adds a blanket Exception catch to the try block and calls
    compute_rpcapi.unquiesce_instance() before reraising.

    This has been seen in the wild with RPC timeouts, but this is not the
    only possible genesis for an unknown error from quiesce_instance.

    Conflicts:
          nova/tests/unit/compute/test_compute_api.py

    NOTE(mriedem): The conflict is due to not having change
    I4e7b46deb43c0c2430b480f1a498a52fc4a9daf0, and its dependencies,
    in Pike.

    Change-Id: Idca5998da8bb42b29a8fffdf52b4af3a043c6326
    Closes-Bug: #1754360
    (cherry picked from commit 1e77faaa412ab9909dd9491cab4a819b5c84d3e8)
    (cherry picked from commit bcae081c4672a13eb1a75d1d3d81a439b5fdb796)
    (cherry picked from commit 1487ea7abb974650d5658b29e52e0b0603751556)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.6

This issue was fixed in the openstack/nova 16.1.6 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.1.5

This issue was fixed in the openstack/nova 15.1.5 release.
