unshelving an instance doesn't rollback volumes connections on failure

Bug #1627694 reported by Shoham Peller
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Matt Riedemann
Ocata
Fix Committed
Medium
Matt Riedemann
Pike
Fix Committed
Medium
Matt Riedemann
Queens
Fix Committed
Medium
Lee Yarwood

Bug Description

When creating and instance and it is being spawned for the first time fails, the instance's volumes' initialize_connection is being rolled-back, but when unshelving an shelved-offloaded instance, there's no rollback in case of failure.

The reason is that when spawning an instance for the first time, nova-compute calls initialize_connection using the _build_resources() method:
https://github.com/openstack/nova/blob/93e689516da0302b06c2760bb82c5004ae057913/nova/compute/manager.py#L1902
That context-aware method will also take care of rollback in case of a failure in spawning, and will terminate_connection of the volumes:
https://github.com/openstack/nova/blob/93e689516da0302b06c2760bb82c5004ae057913/nova/compute/manager.py#L2095

But, when unshelving an instance, initialize_connection is not called in a context aware method, and no rollback in happening when spawning fails.
https://github.com/openstack/nova/blob/93e689516da0302b06c2760bb82c5004ae057913/nova/compute/manager.py#L4330

This makes the volumes stay connected to the node even-though the instance is shelved-offloaded.
If you want to see this problem, replace the driver.spawn() call with a "raise Exception", and see the volumes have connection to the node even though they shouldn't have.

I'm using openstack Liberty, but I can see this problem in current master (pre-Newton)

Matt Riedemann (mriedem)
tags: added: compute shelve unshelve
Changed in nova:
status: New → Confirmed
summary: - unshelving an instance doesn't rollback volumes connections
+ unshelving an instance doesn't rollback volumes connections on failure
Revision history for this message
Matt Riedemann (mriedem) wrote :

Sounds like _unshelve_instance in the compute manager just needs to call _shutdown_instance in case of a failure from the virt driver.spawn() method, like in _build_resources.

Changed in nova:
importance: Undecided → Low
tags: added: low-hanging-fruit
Revision history for this message
Matt Riedemann (mriedem) wrote :

Actually this is probably more complicated since it has to account for rescheduling or not, so it might require handling teardown like some of the rescheduling code in the _do_build_and_run_instance method in the compute manager.

tags: removed: low-hanging-fruit
Revision history for this message
Matt Riedemann (mriedem) wrote :

Ignore comment 2, unshelve never reschedules since the compute manager code would have to handle casting back to conductor to reschedule, like here for build_instances:

https://github.com/openstack/nova/blob/93e689516da0302b06c2760bb82c5004ae057913/nova/compute/manager.py#L1822-L1830

Revision history for this message
Matt Riedemann (mriedem) wrote :

So I'm actually not sure about this bug. When shelve-offloading an instance, the instance is destroyed from the hypervisor but the network and volumes are left intact for the instance in the database. When we unshelve the instance, network and volume information is still available in neutron and cinder, respectively. So if the unshelve fails, we don't actually want to delete the network and volume information, because then you can't try to unshelve them again later (on a different host for example). So I think this is actually working as designed.

Changed in nova:
status: Confirmed → Invalid
no longer affects: nova/newton
no longer affects: nova/mitaka
Revision history for this message
Shoham Peller (shoham-peller) wrote :

@mriedem Thank you for your response.
We don't have to detach the volume from the server, or deleting anything from the DB. The problem is simply that the volumes are stayed connected to the node, because initialize_connection was called, but when spawning failed, and the VM is not on the node, the volumes stay connected to the node.
If terminate_connection is called on rollback, the user can safely retry the unshelving, and the volumes won't be connected to more than 1 node. Nothing will be changed in the DB regarding the VM's attached volumes or networks.

Changed in nova:
status: Invalid → Won't Fix
status: Won't Fix → Confirmed
Revision history for this message
Shoham Peller (shoham-peller) wrote :

What if, in case of a spawning failure, instead of calling _shutdown_instance(), that deallocates networks, and detaches volumes, we'll just call _terminate_connection, to simply disconnect the volumes from the node?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/378009

Changed in nova:
assignee: nobody → Shoham Peller (shoham-peller)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/378009
Reason: This patch has been sitting unchanged for more than 12 weeks. I am therefore going to abandon it to keep the review queue sane. Please feel free to restore the change if you're still working on it.

Changed in nova:
assignee: Shoham Peller (shoham-peller) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Shoham Peller (shoham-peller)
importance: Low → Medium
Changed in nova:
assignee: Shoham Peller (shoham-peller) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/378009
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dcdd2c9832c7c60fe9163cd744ca2b5acfe16bcc
Submitter: Zuul
Branch: master

commit dcdd2c9832c7c60fe9163cd744ca2b5acfe16bcc
Author: Shoham Peller <email address hidden>
Date: Tue Sep 27 20:25:05 2016 +0300

    Handle spawning error on unshelving

    If spawning fails when unshelving, terminate the volumes' connections
    with the node, and remove the node reference from the instance entry.

    Co-Authored-By: Matt Riedemann <email address hidden>

    Closes-Bug: 1627694
    Change-Id: I8cfb2280d956d452ccad1fc711bd814b7258147f

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/548415

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/548526

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/548622

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/548415
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=02aac48b742ac611ee140152ba781ea6ee57f599
Submitter: Zuul
Branch: stable/queens

commit 02aac48b742ac611ee140152ba781ea6ee57f599
Author: Shoham Peller <email address hidden>
Date: Tue Sep 27 20:25:05 2016 +0300

    Handle spawning error on unshelving

    If spawning fails when unshelving, terminate the volumes' connections
    with the node, and remove the node reference from the instance entry.

    Co-Authored-By: Matt Riedemann <email address hidden>

    Closes-Bug: 1627694
    Change-Id: I8cfb2280d956d452ccad1fc711bd814b7258147f
    (cherry picked from commit dcdd2c9832c7c60fe9163cd744ca2b5acfe16bcc)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.1

This issue was fixed in the openstack/nova 17.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/548526
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=06946b7f8ad310e01eff895362ad0f3a164636f5
Submitter: Zuul
Branch: stable/pike

commit 06946b7f8ad310e01eff895362ad0f3a164636f5
Author: Shoham Peller <email address hidden>
Date: Tue Sep 27 20:25:05 2016 +0300

    Handle spawning error on unshelving

    If spawning fails when unshelving, terminate the volumes' connections
    with the node, and remove the node reference from the instance entry.

    Co-Authored-By: Matt Riedemann <email address hidden>

    Conflicts:
          nova/compute/manager.py

    NOTE(mriedem): Conflicts are due to the following changes not in Pike:

      I391554d3904a5a60b921ef4714a1cfd0a64a25c2

      If38e4a6d49910f0aa5016e1bcb61aac2be416fa7

      I00eab47edf1150788777300680e853a872c1db40

    Closes-Bug: 1627694
    Change-Id: I8cfb2280d956d452ccad1fc711bd814b7258147f
    (cherry picked from commit dcdd2c9832c7c60fe9163cd744ca2b5acfe16bcc)
    (cherry picked from commit 73c3e4969c108f2eefc77b57021d9b3e17afdc8e)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.1

This issue was fixed in the openstack/nova 16.1.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/548622
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=01c4230c3b821bf1b5c46c0157974ae84d66d6e2
Submitter: Zuul
Branch: stable/ocata

commit 01c4230c3b821bf1b5c46c0157974ae84d66d6e2
Author: Shoham Peller <email address hidden>
Date: Tue Sep 27 20:25:05 2016 +0300

    Handle spawning error on unshelving

    If spawning fails when unshelving, terminate the volumes' connections
    with the node, and remove the node reference from the instance entry.

    Co-Authored-By: Matt Riedemann <email address hidden>

    Conflicts:
          nova/compute/manager.py
          nova/tests/unit/compute/test_shelve.py

    NOTE(mriedem): Conflicts are due to the following changes not in Ocata:

      I2394b0588bc7210efd16456af2fc12dc7071681c

      Id2c7b7b3b4abda8a3b878fdee6806bcfe096e12e

      I3a3caa4c566ecc132aa2699f8c7e5987bbcc863a

    Closes-Bug: 1627694
    Change-Id: I8cfb2280d956d452ccad1fc711bd814b7258147f
    (cherry picked from commit dcdd2c9832c7c60fe9163cd744ca2b5acfe16bcc)
    (cherry picked from commit 73c3e4969c108f2eefc77b57021d9b3e17afdc8e)
    (cherry picked from commit 06946b7f8ad310e01eff895362ad0f3a164636f5)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b1

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.1.1

This issue was fixed in the openstack/nova 15.1.1 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.