BDMNotFound raised and stale block devices left over when simultaneously rebooting and deleting an instance

Bug #1838392 reported by Lee Yarwood on 2019-07-30
This bug affects 1 person
Affects                   Status  Importance  Assigned to  Milestone
OpenStack Compute (nova)          Undecided   Lee Yarwood
Queens                            Undecided   Lee Yarwood
Rocky                             Undecided   Lee Yarwood
Stein                             Undecided   Lee Yarwood
Train                             Undecided   Lee Yarwood

Bug Description

Description
===========
Simultaneous requests to reboot and delete an instance _will_ race as only the call to delete takes a lock against the instance.uuid.

One possible outcome of this, seen in the wild with the Libvirt driver, is that the request to soft reboot eventually turns into a hard reboot that reconnects volumes the delete request has already disconnected. The delete request then unmaps these volumes on the Cinder side, leaving stale devices on the host. Additionally, BDMNotFound is raised by the reboot operation because the delete operation has already deleted the BDMs.
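
The interleaving described above can be reproduced in a small, self-contained simulation. This is only an illustrative sketch and not nova code: the `simulate_race` function, the in-memory sets standing in for BDM records, host devices and backend mappings, and the local `BDMNotFound` class are all assumptions made for the example. Events are used to force the problematic ordering so the result is deterministic.

```python
import threading


class BDMNotFound(Exception):
    """Stand-in for the nova exception of the same name (hypothetical)."""


def simulate_race():
    bdms = {"vol-1"}        # block device mappings recorded for the instance
    connected = {"vol-1"}   # block devices currently present on the host
    mapped = {"vol-1"}      # volumes mapped on the storage backend
    errors = []

    instance_lock = threading.Lock()
    disconnected = threading.Event()
    reconnected = threading.Event()
    deleted = threading.Event()

    def delete():
        with instance_lock:               # only delete takes the instance lock
            connected.clear()             # disconnect volumes from the host
            disconnected.set()
            reconnected.wait(timeout=5)   # force the problematic interleaving
            mapped.clear()                # unmap the volumes on the backend
            bdms.clear()                  # delete the BDM records
        deleted.set()

    def reboot():
        # No lock is taken here, so this runs while delete holds instance_lock.
        disconnected.wait(timeout=5)
        connected.update(bdms)            # hard reboot reconnects the volumes
        reconnected.set()
        deleted.wait(timeout=5)
        try:
            if "vol-1" not in bdms:       # BDM lookup after delete has finished
                raise BDMNotFound("vol-1")
        except BDMNotFound as exc:
            errors.append(exc)

    threads = [threading.Thread(target=delete), threading.Thread(target=reboot)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    stale = connected - mapped            # devices pointing at unmapped volumes
    return stale, errors


if __name__ == "__main__":
    stale, errors = simulate_race()
    print(sorted(stale), [type(e).__name__ for e in errors])
```

In this sketch the reboot thread never touches `instance_lock`, which is why it can reconnect `vol-1` in the window between the delete thread disconnecting it and unmapping it.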

Steps to reproduce
==================
$ nova reboot $instance && nova delete $instance

Expected result
===============
The instance reboots and is then deleted without any errors raised.

Actual result
=============
BDMNotFound raised and stale block devices left over.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

1599e3cf68779eafaaa2b13a273d3bebd1379c19 / 19.0.0.0rc1-992-g1599e3cf68

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt + QEMU/KVM

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Logs & Configs
==============

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: New → In Progress

Reviewed: https://review.opendev.org/673463
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9ad54f3dacbd372271f441baea5380f913072dde
Submitter: Zuul
Branch: master

commit 9ad54f3dacbd372271f441baea5380f913072dde
Author: Lee Yarwood <email address hidden>
Date: Mon Jul 29 16:25:45 2019 +0100

    compute: Take an instance.uuid lock when rebooting

    Previously simultaneous requests to reboot and delete an instance could
    race as only the latter took a lock against the uuid of the instance.

    With the Libvirt driver this race could potentially result in attempts
    being made to reconnect previously disconnected volumes on the host.
    Depending on the volume backend being used this could then result in
    stale block devices pointing to unmapped volumes being left on the host
    that in turn could cause failures later on when connecting newly mapped
    volumes.

    This change avoids this race by ensuring any request to reboot an
    instance takes an instance.uuid lock within the compute manager,
    serialising requests to reboot and then delete the instance.

    Closes-Bug: #1838392
    Change-Id: Ieb59de10c63bb067f92ec054535766cdd722dae2
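
The serialisation described in this commit message can be sketched with a per-instance lock shared by both operations. The code below is a minimal standalone illustration and does not reproduce nova's compute manager: the `synchronized_on_uuid` decorator, the `run_serialised` helper and the placeholder uuid are hypothetical names introduced for the example, and plain `threading.Lock` objects stand in for whatever locking primitive nova actually uses.

```python
import threading
import time
from collections import defaultdict
from functools import wraps

_locks = defaultdict(threading.Lock)   # one lock per instance uuid
_locks_guard = threading.Lock()        # protects the _locks dictionary


def synchronized_on_uuid(func):
    """Run func while holding the lock for its first argument (a uuid)."""
    @wraps(func)
    def wrapper(instance_uuid, *args, **kwargs):
        with _locks_guard:
            lock = _locks[instance_uuid]
        with lock:
            return func(instance_uuid, *args, **kwargs)
    return wrapper


def run_serialised():
    log = []
    in_reboot = threading.Event()
    finish_reboot = threading.Event()

    @synchronized_on_uuid
    def reboot(instance_uuid):
        log.append("reboot:start")
        in_reboot.set()
        finish_reboot.wait(timeout=5)   # keep the reboot "in progress"
        log.append("reboot:end")

    @synchronized_on_uuid
    def delete(instance_uuid):
        log.append("delete:start")
        log.append("delete:end")

    uuid = "hypothetical-instance-uuid"
    t_reboot = threading.Thread(target=reboot, args=(uuid,))
    t_delete = threading.Thread(target=delete, args=(uuid,))

    t_reboot.start()
    in_reboot.wait(timeout=5)   # reboot now holds the per-instance lock
    t_delete.start()            # delete must wait for that lock
    time.sleep(0.1)             # give delete a chance to run if it could
    finish_reboot.set()

    t_reboot.join()
    t_delete.join()
    return log


if __name__ == "__main__":
    print(run_serialised())
```

Because both functions take the same lock keyed on the instance uuid, the delete cannot start until the reboot has finished, so the log never shows the two operations interleaved.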

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/696151
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=939cd9b177db8f12952e72145a5c00a0574959eb
Submitter: Zuul
Branch: stable/train

commit 939cd9b177db8f12952e72145a5c00a0574959eb
Author: Lee Yarwood <email address hidden>
Date: Mon Jul 29 16:25:45 2019 +0100

    compute: Take an instance.uuid lock when rebooting

    Previously simultaneous requests to reboot and delete an instance could
    race as only the latter took a lock against the uuid of the instance.

    With the Libvirt driver this race could potentially result in attempts
    being made to reconnect previously disconnected volumes on the host.
    Depending on the volume backend being used this could then result in
    stale block devices pointing to unmapped volumes being left on the host
    that in turn could cause failures later on when connecting newly mapped
    volumes.

    This change avoids this race by ensuring any request to reboot an
    instance takes an instance.uuid lock within the compute manager,
    serialising requests to reboot and then delete the instance.

    Closes-Bug: #1838392
    Change-Id: Ieb59de10c63bb067f92ec054535766cdd722dae2
    (cherry picked from commit 9ad54f3dacbd372271f441baea5380f913072dde)

Reviewed: https://review.opendev.org/696152
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=304d3f62a4e3bdbaab6fe7dd665174bc5b696d08
Submitter: Zuul
Branch: stable/stein

commit 304d3f62a4e3bdbaab6fe7dd665174bc5b696d08
Author: Lee Yarwood <email address hidden>
Date: Mon Jul 29 16:25:45 2019 +0100

    compute: Take an instance.uuid lock when rebooting

    Previously simultaneous requests to reboot and delete an instance could
    race as only the latter took a lock against the uuid of the instance.

    With the Libvirt driver this race could potentially result in attempts
    being made to reconnect previously disconnected volumes on the host.
    Depending on the volume backend being used this could then result in
    stale block devices pointing to unmapped volumes being left on the host
    that in turn could cause failures later on when connecting newly mapped
    volumes.

    This change avoids this race by ensuring any request to reboot an
    instance takes an instance.uuid lock within the compute manager,
    serialising requests to reboot and then delete the instance.

    Closes-Bug: #1838392
    Change-Id: Ieb59de10c63bb067f92ec054535766cdd722dae2
    (cherry picked from commit 9ad54f3dacbd372271f441baea5380f913072dde)
    (cherry picked from commit 939cd9b177db8f12952e72145a5c00a0574959eb)

Reviewed: https://review.opendev.org/696153
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7d14b6a5170821c65e55d7b39ccf4419a81640f8
Submitter: Zuul
Branch: stable/rocky

commit 7d14b6a5170821c65e55d7b39ccf4419a81640f8
Author: Lee Yarwood <email address hidden>
Date: Mon Jul 29 16:25:45 2019 +0100

    compute: Take an instance.uuid lock when rebooting

    Previously simultaneous requests to reboot and delete an instance could
    race as only the latter took a lock against the uuid of the instance.

    With the Libvirt driver this race could potentially result in attempts
    being made to reconnect previously disconnected volumes on the host.
    Depending on the volume backend being used this could then result in
    stale block devices pointing to unmapped volumes being left on the host
    that in turn could cause failures later on when connecting newly mapped
    volumes.

    This change avoids this race by ensuring any request to reboot an
    instance takes an instance.uuid lock within the compute manager,
    serialising requests to reboot and then delete the instance.

    Closes-Bug: #1838392
    Change-Id: Ieb59de10c63bb067f92ec054535766cdd722dae2
    (cherry picked from commit 9ad54f3dacbd372271f441baea5380f913072dde)
    (cherry picked from commit 939cd9b177db8f12952e72145a5c00a0574959eb)
    (cherry picked from commit 304d3f62a4e3bdbaab6fe7dd665174bc5b696d08)

Reviewed: https://review.opendev.org/696154
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=16fb8ac3f4c2fe94ae83d65fbcf6f49665a0dd60
Submitter: Zuul
Branch: stable/queens

commit 16fb8ac3f4c2fe94ae83d65fbcf6f49665a0dd60
Author: Lee Yarwood <email address hidden>
Date: Mon Jul 29 16:25:45 2019 +0100

    compute: Take an instance.uuid lock when rebooting

    Previously simultaneous requests to reboot and delete an instance could
    race as only the latter took a lock against the uuid of the instance.

    With the Libvirt driver this race could potentially result in attempts
    being made to reconnect previously disconnected volumes on the host.
    Depending on the volume backend being used this could then result in
    stale block devices pointing to unmapped volumes being left on the host
    that in turn could cause failures later on when connecting newly mapped
    volumes.

    This change avoids this race by ensuring any request to reboot an
    instance takes an instance.uuid lock within the compute manager,
    serialising requests to reboot and then delete the instance.

    Closes-Bug: #1838392
    Change-Id: Ieb59de10c63bb067f92ec054535766cdd722dae2
    (cherry picked from commit 9ad54f3dacbd372271f441baea5380f913072dde)
    (cherry picked from commit 939cd9b177db8f12952e72145a5c00a0574959eb)
    (cherry picked from commit 304d3f62a4e3bdbaab6fe7dd665174bc5b696d08)
    (cherry picked from commit 7d14b6a5170821c65e55d7b39ccf4419a81640f8)

This issue was fixed in the openstack/nova 20.1.0 release.

This issue was fixed in the openstack/nova 19.1.0 release.

This issue was fixed in the openstack/nova 18.3.0 release.
