Rescheduled boot from volume instances fail due to the premature removal of their attachments

Bug #1784353 reported by Lee Yarwood on 2018-07-30
Affects                    Status  Importance  Assigned to     Milestone
OpenStack Compute (nova)           Medium      Lee Yarwood
Queens                             Medium      Matt Riedemann
Rocky                              Medium      Matt Riedemann

Bug Description

Description
===========
This is caused by the cleanup code within the compute layer (_shutdown_instance) removing all volume attachments associated with an instance after a failed spawn, with no attempt made to recreate them before the instance is rescheduled.
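
The failure mode can be sketched with a minimal, self-contained simulation. This is illustrative plain Python, not nova code: the FakeCinder class and its attachment ids are stand-ins for the Cinder attachments API.

```python
# Illustrative sketch of the bug, not nova code: a fake Cinder
# attachments API shows how deleting an attachment during cleanup
# breaks the subsequent reschedule.

class FakeCinder:
    """Stand-in for the Cinder volume attachments API."""
    def __init__(self):
        self.attachments = {}
        self._counter = 0

    def attachment_create(self, volume_id):
        self._counter += 1
        attachment_id = "attachment-%d" % self._counter
        self.attachments[attachment_id] = volume_id
        return attachment_id

    def attachment_delete(self, attachment_id):
        self.attachments.pop(attachment_id, None)

    def attachment_update(self, attachment_id, connector):
        # Mirrors the VolumeAttachmentNotFound in the Logs section.
        if attachment_id not in self.attachments:
            raise LookupError(
                "Volume attachment %s could not be found." % attachment_id)


cinder = FakeCinder()
bdm_attachment_id = cinder.attachment_create("vol-1")

# Host 1: spawn() fails and _shutdown_instance cleans up, deleting
# the attachment in Cinder ...
cinder.attachment_delete(bdm_attachment_id)

# ... but the BDM still references it, so _prep_block_device on the
# reschedule target fails when it tries to update the attachment.
try:
    cinder.attachment_update(bdm_attachment_id, {"host": "host2"})
except LookupError as exc:
    print(exc)
```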

Steps to reproduce
==================
- Attempt to boot an instance with volumes attached.
- Ensure spawn() fails, for example by stopping the L2 network agent service on the compute host.
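
The steps above can be sketched roughly as follows, assuming a devstack-style environment with the openstack CLI; the volume, flavor, image, network, and service names are examples and will differ per deployment.

```shell
# Create a volume to attach at boot.
openstack volume create --size 1 test-vol

# Break spawn() on the first target compute host, e.g. by stopping
# the L2 agent there (service name varies by distro/deployment).
# Run on the compute host:
sudo systemctl stop neutron-openvswitch-agent

# Boot an instance with the volume attached; spawn fails on the
# first host and the instance is rescheduled.
openstack server create --flavor m1.small --network private \
    --image cirros --volume test-vol test-vm

# Without the fix the instance goes to ERROR on the second host,
# with VolumeAttachmentNotFound in the compute log.
openstack server show test-vm -f value -c status
```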

Expected result
===============
The instance is rescheduled to another compute host and boots correctly.

Actual result
=============
The instance fails to boot on all hosts it is rescheduled to, due to a missing volume attachment.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   bf497cc47497d3a5603bf60de652054ac5ae1993

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt + KVM; however, this shouldn't matter.

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Logs & Configs
==============

    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] Traceback (most recent call last):
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1579, in _prep_block_device
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] wait_func=self._await_block_device_map_created)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/nova/virt/block_device.py", line 837, in attach_block_devices
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] _log_and_attach(device)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/nova/virt/block_device.py", line 834, in _log_and_attach
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] bdm.attach(*attach_args, **attach_kwargs)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/nova/virt/block_device.py", line 46, in wrapped
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] ret_val = method(obj, context, *args, **kwargs)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/nova/virt/block_device.py", line 617, in attach
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] virt_driver, do_driver_attach)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 274, in inner
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] return f(*args, **kwargs)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/nova/virt/block_device.py", line 614, in _do_locked_attach
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] self._do_attach(*args, **_kwargs)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/nova/virt/block_device.py", line 599, in _do_attach
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] do_driver_attach)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/nova/virt/block_device.py", line 513, in _volume_attach
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] self['mount_device'])['connection_info']
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/nova/volume/cinder.py", line 379, in wrapper
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] res = method(self, ctx, *args, **kwargs)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/nova/volume/cinder.py", line 418, in wrapper
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] attachment_id=attachment_id))
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/nova/volume/cinder.py", line 450, in _reraise
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] six.reraise(type(desired_exc), desired_exc, sys.exc_info()[2])
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/nova/volume/cinder.py", line 415, in wrapper
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] res = method(self, ctx, attachment_id, *args, **kwargs)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/nova/volume/cinder.py", line 824, in attachment_update
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] 'code': getattr(ex, 'code', None)})
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] self.force_reraise()
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] six.reraise(self.type_, self.value, self.tb)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/nova/volume/cinder.py", line 814, in attachment_update
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] attachment_id, _connector)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/cinderclient/v3/attachments.py", line 67, in update
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] resp = self._update('/attachments/%s' % id, body)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/cinderclient/base.py", line 344, in _update
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] resp, body = self.api.client.put(url, body=body, **kwargs)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/cinderclient/client.py", line 206, in put
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] return self._cs_request(url, 'PUT', **kwargs)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/cinderclient/client.py", line 191, in _cs_request
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] return self.request(url, method, **kwargs)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] File "/usr/lib/python2.7/site-packages/cinderclient/client.py", line 177, in request
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] raise exceptions.from_response(resp, body)
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1] VolumeAttachmentNotFound: Volume attachment 11d518a9-16d4-4ccb-9487-ec2b35834945 could not be found.
    2018-07-04 15:19:43.191 1 ERROR nova.compute.manager [instance: d48c9894-2ba2-4752-bae5-36c437933ff1]

Related fix proposed to branch: master
Review: https://review.openstack.org/587014

Fix proposed to branch: master
Review: https://review.openstack.org/587071

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: New → In Progress
melanie witt (melwitt) on 2018-08-01
tags: added: compute volumes
Changed in nova:
importance: Undecided → Medium
Matt Riedemann (mriedem) wrote :

I'm not sure how you're hitting a reschedule because any failures coming out of attach_block_devices should result in the build getting aborted:

https://github.com/openstack/nova/blob/7125dcb9cb821faf3c68526ac34365a28141e480/nova/compute/manager.py#L1682

https://github.com/openstack/nova/blob/7125dcb9cb821faf3c68526ac34365a28141e480/nova/compute/manager.py#L2328

Matt Riedemann (mriedem) wrote :

Bug 1488111 is what I was thinking about, but Lee clarified the issue for me. The scenario is like:

1. spawn on host1 fails, reschedule to host2
2. prep_block_devices fails on host2 because of the volume attachment issue mentioned

Changed in nova:
assignee: Lee Yarwood (lyarwood) → Stephen Finucane (stephenfinucane)
Matt Riedemann (mriedem) on 2018-10-22
Changed in nova:
assignee: Stephen Finucane (stephenfinucane) → Lee Yarwood (lyarwood)
Changed in nova:
assignee: Lee Yarwood (lyarwood) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) on 2018-10-22
Changed in nova:
assignee: Matt Riedemann (mriedem) → Lee Yarwood (lyarwood)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/612486

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/612495

Reviewed: https://review.openstack.org/587014
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a8629e5800ed95c0e6bac04ca711e835606370bf
Submitter: Zuul
Branch: master

commit a8629e5800ed95c0e6bac04ca711e835606370bf
Author: Lee Yarwood <email address hidden>
Date: Mon Jul 30 11:19:15 2018 +0100

    Add regression test for bug#1784353

    Related-Bug: #1784353
    Change-Id: I46511b5f3e3c850ba61add41e8ca053897618be6

Reviewed: https://review.openstack.org/587071
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=41452a5c6adb8cae54eef24803f4adc468131b34
Submitter: Zuul
Branch: master

commit 41452a5c6adb8cae54eef24803f4adc468131b34
Author: Lee Yarwood <email address hidden>
Date: Mon Jul 30 13:41:35 2018 +0100

    conductor: Recreate volume attachments during a reschedule

    When an instance with attached volumes fails to spawn, cleanup code
    within the compute manager (_shutdown_instance called from
    _build_resources) will delete the volume attachments referenced by
    the bdms in Cinder. As a result we should check and if necessary
    recreate these volume attachments when rescheduling an instance.

    Note that there are a few different ways to fix this bug by
    making changes to the compute manager code, either by not deleting
    the volume attachment on failure before rescheduling [1] or by
    performing the get/create check during each build after the
    reschedule [2].

    The problem with *not* cleaning up the attachments is if we don't
    reschedule, then we've left orphaned "reserved" volumes in Cinder
    (or we have to add special logic to tell compute when to cleanup
    attachments).

    The problem with checking the existence of the attachment on every
    new host we build on is that we'd be needlessly checking that for
    initial creates even if we don't ever need to reschedule, unless
    again we have special logic against that (like checking to see if
    we've rescheduled at all).

    Also, either case involves changes to the compute service, which
    means that older computes might not have the fix.

    So ultimately it seems that the best way to handle this is:

    1. Only deal with this on reschedules.
    2. Let the cell conductor orchestrate it since it's already dealing
       with the reschedule. Then the compute logic doesn't need to change.

    [1] https://review.openstack.org/#/c/587071/3/nova/compute/manager.py@1631
    [2] https://review.openstack.org/#/c/587071/4/nova/compute/manager.py@1667

    Change-Id: I739c06bd02336bf720cddacb21f48e7857378487
    Closes-bug: #1784353
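
The conductor-side approach described in the commit message above can be sketched as follows. This is a hedged, self-contained illustration, not the merged change: the helper name ensure_volume_attachments and the FakeVolumeAPI class are invented for this sketch, though attachment_get/attachment_create mirror the style of nova's Cinder API wrapper.

```python
# Hedged sketch of the fix: on a reschedule, check each volume BDM's
# attachment against Cinder and recreate it if the earlier cleanup
# deleted it. Illustrative code, not the merged nova change.

class FakeVolumeAPI:
    """Minimal stand-in for nova's Cinder attachments calls."""
    def __init__(self):
        self._attachments = {}
        self._counter = 0

    def attachment_create(self, volume_id, instance_uuid):
        self._counter += 1
        attachment_id = "attachment-%d" % self._counter
        self._attachments[attachment_id] = (volume_id, instance_uuid)
        return {'id': attachment_id}

    def attachment_delete(self, attachment_id):
        self._attachments.pop(attachment_id, None)

    def attachment_get(self, attachment_id):
        if attachment_id not in self._attachments:
            raise LookupError("attachment %s not found" % attachment_id)
        return {'id': attachment_id}


def ensure_volume_attachments(instance_uuid, bdms, volume_api):
    """Recreate attachments that a failed build's cleanup deleted."""
    for bdm in bdms:
        if bdm.get('attachment_id') is None:
            continue
        try:
            volume_api.attachment_get(bdm['attachment_id'])
        except LookupError:
            # The attachment was deleted; make a new "reserved" one so
            # the target compute can update it during _prep_block_device.
            new = volume_api.attachment_create(
                bdm['volume_id'], instance_uuid)
            bdm['attachment_id'] = new['id']  # would be bdm.save() in nova


# Demonstration: host1's cleanup deletes the attachment, then the
# conductor recreates it before rescheduling to host2.
api = FakeVolumeAPI()
bdm = {'volume_id': 'vol-1',
       'attachment_id': api.attachment_create('vol-1', 'inst-1')['id']}
api.attachment_delete(bdm['attachment_id'])      # cleanup on host1
ensure_volume_attachments('inst-1', [bdm], api)  # conductor-side fix
print(bdm['attachment_id'])                      # a fresh attachment id
```

The key design point, as the commit message argues, is that only the conductor knows a reschedule is happening, so putting the check there avoids both orphaned reserved volumes and redundant checks on every initial build.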

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/612485
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3452b603a109d05c091246e4ff37e8f1997c0d50
Submitter: Zuul
Branch: stable/rocky

commit 3452b603a109d05c091246e4ff37e8f1997c0d50
Author: Lee Yarwood <email address hidden>
Date: Tue Jul 31 11:09:45 2018 +0100

    fixtures: Track volume attachments within CinderFixtureNewAttachFlow

    Previously volume attachment ids were not tracked at all within the
    fixture, with only the instance_uuid and volume_id stashed. This change
    should allow future functional tests to exercise bugs where attachments
    are deleted but not recreated, such as bug #1784353.

    Change-Id: Ib30144596fe6a8d8ffbb4ebd695ebcf38ef828a4
    Co-authored-by: Matthew Booth <email address hidden>
    Related-Bug: #1784353
    (cherry picked from commit bfdc6a0d5293e29d91d4755370c08769394f3ba7)

tags: added: in-stable-rocky

Reviewed: https://review.openstack.org/612486
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=96889ad06d2e4841527e5b9d7c1b4ca71a7ae189
Submitter: Zuul
Branch: stable/rocky

commit 96889ad06d2e4841527e5b9d7c1b4ca71a7ae189
Author: Lee Yarwood <email address hidden>
Date: Mon Jul 30 11:19:15 2018 +0100

    Add regression test for bug#1784353

    Related-Bug: #1784353
    Change-Id: I46511b5f3e3c850ba61add41e8ca053897618be6
    (cherry picked from commit a8629e5800ed95c0e6bac04ca711e835606370bf)

Reviewed: https://review.openstack.org/612487
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2a741be8f53e4e6b3a5be6cdc2fe32f050c870a9
Submitter: Zuul
Branch: stable/rocky

commit 2a741be8f53e4e6b3a5be6cdc2fe32f050c870a9
Author: Lee Yarwood <email address hidden>
Date: Mon Jul 30 13:41:35 2018 +0100

    conductor: Recreate volume attachments during a reschedule

    When an instance with attached volumes fails to spawn, cleanup code
    within the compute manager (_shutdown_instance called from
    _build_resources) will delete the volume attachments referenced by
    the bdms in Cinder. As a result we should check and if necessary
    recreate these volume attachments when rescheduling an instance.

    Note that there are a few different ways to fix this bug by
    making changes to the compute manager code, either by not deleting
    the volume attachment on failure before rescheduling [1] or by
    performing the get/create check during each build after the
    reschedule [2].

    The problem with *not* cleaning up the attachments is if we don't
    reschedule, then we've left orphaned "reserved" volumes in Cinder
    (or we have to add special logic to tell compute when to cleanup
    attachments).

    The problem with checking the existence of the attachment on every
    new host we build on is that we'd be needlessly checking that for
    initial creates even if we don't ever need to reschedule, unless
    again we have special logic against that (like checking to see if
    we've rescheduled at all).

    Also, either case involves changes to the compute service, which
    means that older computes might not have the fix.

    So ultimately it seems that the best way to handle this is:

    1. Only deal with this on reschedules.
    2. Let the cell conductor orchestrate it since it's already dealing
       with the reschedule. Then the compute logic doesn't need to change.

    [1] https://review.openstack.org/#/c/587071/3/nova/compute/manager.py@1631
    [2] https://review.openstack.org/#/c/587071/4/nova/compute/manager.py@1667

    Change-Id: I739c06bd02336bf720cddacb21f48e7857378487
    Closes-bug: #1784353
    (cherry picked from commit 41452a5c6adb8cae54eef24803f4adc468131b34)

This issue was fixed in the openstack/nova 18.0.3 release.

Reviewed: https://review.openstack.org/612494
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6fa5aee86b7a8a94bf4644c1aff3ac0c70079811
Submitter: Zuul
Branch: stable/queens

commit 6fa5aee86b7a8a94bf4644c1aff3ac0c70079811
Author: Lee Yarwood <email address hidden>
Date: Tue Jul 31 11:09:45 2018 +0100

    fixtures: Track volume attachments within CinderFixtureNewAttachFlow

    Previously volume attachment ids were not tracked at all within the
    fixture, with only the instance_uuid and volume_id stashed. This change
    should allow future functional tests to exercise bugs where attachments
    are deleted but not recreated, such as bug #1784353.

    Change-Id: Ib30144596fe6a8d8ffbb4ebd695ebcf38ef828a4
    Co-authored-by: Matthew Booth <email address hidden>
    Related-Bug: #1784353
    (cherry picked from commit bfdc6a0d5293e29d91d4755370c08769394f3ba7)
    (cherry picked from commit cb9996b45f4c04e8b064448f31442ecb7aea10bc)

tags: added: in-stable-queens

Reviewed: https://review.openstack.org/612495
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=51345d57a1b726dad08445a5c5a47c4e640c2a43
Submitter: Zuul
Branch: stable/queens

commit 51345d57a1b726dad08445a5c5a47c4e640c2a43
Author: Lee Yarwood <email address hidden>
Date: Mon Jul 30 11:19:15 2018 +0100

    Add regression test for bug#1784353

    Related-Bug: #1784353
    Change-Id: I46511b5f3e3c850ba61add41e8ca053897618be6
    (cherry picked from commit a8629e5800ed95c0e6bac04ca711e835606370bf)
    (cherry picked from commit c013a276fffa612bf8742de502f698279f7ec54b)

Reviewed: https://review.openstack.org/612496
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3a0f26c822912dbaad7c8a1d6a3c599025afab37
Submitter: Zuul
Branch: stable/queens

commit 3a0f26c822912dbaad7c8a1d6a3c599025afab37
Author: Lee Yarwood <email address hidden>
Date: Mon Jul 30 13:41:35 2018 +0100

    conductor: Recreate volume attachments during a reschedule

    When an instance with attached volumes fails to spawn, cleanup code
    within the compute manager (_shutdown_instance called from
    _build_resources) will delete the volume attachments referenced by
    the bdms in Cinder. As a result we should check and if necessary
    recreate these volume attachments when rescheduling an instance.

    Note that there are a few different ways to fix this bug by
    making changes to the compute manager code, either by not deleting
    the volume attachment on failure before rescheduling [1] or by
    performing the get/create check during each build after the
    reschedule [2].

    The problem with *not* cleaning up the attachments is if we don't
    reschedule, then we've left orphaned "reserved" volumes in Cinder
    (or we have to add special logic to tell compute when to cleanup
    attachments).

    The problem with checking the existence of the attachment on every
    new host we build on is that we'd be needlessly checking that for
    initial creates even if we don't ever need to reschedule, unless
    again we have special logic against that (like checking to see if
    we've rescheduled at all).

    Also, either case involves changes to the compute service, which
    means that older computes might not have the fix.

    So ultimately it seems that the best way to handle this is:

    1. Only deal with this on reschedules.
    2. Let the cell conductor orchestrate it since it's already dealing
       with the reschedule. Then the compute logic doesn't need to change.

    [1] https://review.openstack.org/#/c/587071/3/nova/compute/manager.py@1631
    [2] https://review.openstack.org/#/c/587071/4/nova/compute/manager.py@1667

    Conflicts:

      nova/tests/unit/conductor/test_conductor.py

    NOTE(mriedem): There was a minor conflict due to not having change
    I56fb1fd984f06a58c3a7e8c2596471991950433a in Queens.

    Change-Id: I739c06bd02336bf720cddacb21f48e7857378487
    Closes-bug: #1784353
    (cherry picked from commit 41452a5c6adb8cae54eef24803f4adc468131b34)
    (cherry picked from commit d3397788fe2d9267c34698d9459b0abe3f215046)

This issue was fixed in the openstack/nova 17.0.9 release.

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
