Instance deletion is prevented when another component locks up

Bug #1248563 reported by Stanislaw Pitucha
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Stanislaw Pitucha

Bug Description

Instance deletion can be delayed (even effectively prevented) if one of the components locks up. This happens because other operations hold the lock on the instance uuid, which terminate_instances also needs. As a side effect, the delete messages start queueing up, which blocks other operations on the same compute manager.
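
The failure mode described above can be reproduced in miniature. The following is only a sketch, not nova code: the lock registry, the two-worker pool, and the timeout are assumptions made for illustration (nova uses greenthreads and waits without a timeout; the timeout is here only so the example terminates).

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-uuid lock registry standing in for the uuid lock.
uuid_locks = {}
registry_guard = threading.Lock()


def lock_for(uuid):
    """Return the single lock shared by every operation on this uuid."""
    with registry_guard:
        return uuid_locks.setdefault(uuid, threading.Lock())


def terminate_instance(uuid, timeout):
    """Delete needs the same uuid lock as every other operation."""
    lock = lock_for(uuid)
    # The real wait is unbounded; the timeout only keeps the sketch finite.
    if not lock.acquire(timeout=timeout):
        return "blocked"
    try:
        return "deleted"
    finally:
        lock.release()


def demo():
    stuck = lock_for("uuid-1")
    stuck.acquire()  # simulates an operation hung while holding the lock
    try:
        # Two workers stand in for a small, exhaustible thread pool.
        with ThreadPoolExecutor(max_workers=2) as pool:
            deletes = [pool.submit(terminate_instance, "uuid-1", 0.2)
                       for _ in range(2)]
            # This request is for a healthy instance, yet it has to wait
            # until a worker is freed by one of the blocked deletes.
            other = pool.submit(terminate_instance, "uuid-2", 0.2)
            return [f.result() for f in deletes], other.result()
    finally:
        stuck.release()
```

Repeated delete requests for the stuck instance occupy every worker, so the unrelated request for `uuid-2` is served only after one of them gives up. Without a timeout it would never be served.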

This issue has been discussed on the mailing list, starting with this post: http://lists.openstack.org/pipermail/openstack-dev/2013-October/017454.html

Revision history for this message
Stanislaw Pitucha (stanislaw-pitucha) wrote :

The first point of the solution (using separate locks) is potentially dangerous and should be looked at in the context of issues like bug 998117 and possible migration complications.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/55444
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b3c9cc504903eccbc68c441a81b0a727a83117fa
Submitter: Jenkins
Branch: master

commit b3c9cc504903eccbc68c441a81b0a727a83117fa
Author: Stanislaw Pitucha <email address hidden>
Date: Wed Nov 6 16:57:10 2013 +0000

    Ignore duplicate delete requests

    Stop delete requests from queueing up in the compute manager if the
    task_state is already set to DELETING. This prevents exhausting thread
    pools when the service is stuck on a lock/external request.

    Dealing with delete requests that never got executed is not in scope of
    this change and will be submitted separately.

    Change-Id: I2f97f93bd714e0ea3b6d4fa3ac457ab43eed00e1
    Partial-Bug: #1248563
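
The idea in this commit message can be sketched as a guard on task_state. This is a minimal illustration, not the merged patch: the `Instance` class, the `request_delete` helper, and the plain list used as a queue are hypothetical, and the `"deleting"` string is assumed to correspond to the DELETING task state.

```python
DELETING = "deleting"  # assumed value of the DELETING task state


class Instance:
    """Hypothetical stand-in for an instance record."""

    def __init__(self, uuid, task_state=None):
        self.uuid = uuid
        self.task_state = task_state


def request_delete(instance, queue):
    """Queue a delete unless one is already in flight.

    Returns True if a request was queued, False if it was ignored.
    """
    if instance.task_state == DELETING:
        # A delete is already pending; adding another would only
        # consume one more worker waiting on the same lock.
        return False
    instance.task_state = DELETING
    queue.append(instance.uuid)
    return True
```

With this guard, a user retrying the delete of a stuck instance adds at most one message to the queue, so the pool is not exhausted by duplicates.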

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/55660
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0252d0981af9daaf9370a50c28d8baf4a9d29619
Submitter: Jenkins
Branch: master

commit 0252d0981af9daaf9370a50c28d8baf4a9d29619
Author: Stanislaw Pitucha <email address hidden>
Date: Fri Nov 8 13:38:08 2013 +0000

    Cleanup 'deleting' instances on restart

    In case some instance was marked as deleting but for some reason didn't
    finish (for example request was stuck in libvirt), retry the delete at
    startup. This happens at startup and on the host owning the instance,
    so there's no reason to use the lock.

    Change-Id: Iad18e9a7c6cb8e272e67a82284127ad895441dcf
    Partial-Bug: #1248563
    Closes-Bug: 1247174
    Implements: blueprint recover-stuck-state
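
A rough sketch of the startup cleanup described in this commit message follows. The dictionary-based instance records, the `retry_stuck_deletes` name, and the `delete` callback are assumptions for illustration; the actual change lives in nova's compute manager and is not reproduced here.

```python
import logging

LOG = logging.getLogger(__name__)


def retry_stuck_deletes(instances, host, delete):
    """Retry deletes that were requested but never finished.

    Only instances owned by this host and still marked as deleting are
    retried. No uuid lock is taken: at startup nothing else on this
    host is operating on them yet.
    """
    retried = []
    for inst in instances:
        if inst["host"] != host or inst["task_state"] != "deleting":
            continue
        try:
            delete(inst)
        except Exception:
            # One failed cleanup should not stop the service from starting.
            LOG.exception("Retrying delete of %s failed", inst["uuid"])
            continue
        retried.append(inst["uuid"])
    return retried
```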

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/55419
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=05f4f7170ab67c20e7f7b3f9f304ccc7ca163490
Submitter: Jenkins
Branch: master

commit 05f4f7170ab67c20e7f7b3f9f304ccc7ca163490
Author: Stanislaw Pitucha <email address hidden>
Date: Wed Nov 6 14:15:51 2013 +0000

    Allow deleting instances while uuid lock is held

    Avoid issues where locked up operations on the instance prevent it
    from being deleted by the user. Reasons for the lock up are a separate
    issue and will be handled elsewhere, but terminating the instance
    should not affect any tasks anyway. Any modification should already
    gracefully handle the instance going away.

    For more discussion about the issue see:
    http://lists.openstack.org/pipermail/openstack-dev/2013-October/017454.html

    Partial-Bug: #1248563

    Change-Id: Ife26ffefccea11d6e9ca5fbd970192b322fc04fd
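
One way to picture this change, assuming the separate-lock approach mentioned earlier in this report, is sketched below. The lock tables, the in-memory `instances` dict, and `rename_instance` are hypothetical; the point is only that delete no longer waits on the lock held by a stuck operation, and that other operations tolerate the instance disappearing.

```python
import threading
from collections import defaultdict

# Hypothetical in-memory state; not nova's data model.
instances = {"uuid-1": {"name": "vm"}}
op_locks = defaultdict(threading.Lock)      # taken by ordinary operations
delete_locks = defaultdict(threading.Lock)  # taken only by delete


def terminate_instance(uuid):
    """Delete without waiting on the ordinary per-uuid lock."""
    with delete_locks[uuid]:
        instances.pop(uuid, None)


def rename_instance(uuid, name):
    """An ordinary operation that handles the instance going away."""
    with op_locks[uuid]:
        if uuid not in instances:
            return "skipped: instance is gone"
        instances[uuid]["name"] = name
        return "renamed"


def demo():
    op_locks["uuid-1"].acquire()  # simulates a locked-up operation
    try:
        terminate_instance("uuid-1")  # proceeds despite the held lock
        deleted = "uuid-1" not in instances
    finally:
        op_locks["uuid-1"].release()
    return deleted, rename_instance("uuid-1", "new-name")
```

The trade-off is that every operation guarded by the ordinary lock must check that the instance still exists, which is the assumption the commit message states.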

Changed in nova:
status: New → In Progress
importance: Undecided → High
tags: added: havana-backport-potential
Changed in nova:
assignee: nobody → Stanislaw Pitucha (stanislaw-pitucha)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/70187
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=08beabc5e0fc3fffc074a634040f5821d4d1d5f2
Submitter: Jenkins
Branch: master

commit 08beabc5e0fc3fffc074a634040f5821d4d1d5f2
Author: Stanislaw Pitucha <email address hidden>
Date: Wed Nov 6 14:15:51 2013 +0000

    Allow deleting instances while uuid lock is held

    Avoid issues where locked up operations on the instance prevent it
    from being deleted by the user. Reasons for the lock up are a separate
    issue and will be handled elsewhere, but terminating the instance
    should not affect any tasks anyway. Any modification should already
    gracefully handle the instance going away.

    For more discussion about the issue see:
    http://lists.openstack.org/pipermail/openstack-dev/2013-October/017454.html

    This patch was merged once before and had to be reverted in
    6b7304d8d38c0f04643bdcd031d682c688c91b28. It was causing an extremely
    large number of errors in gate jobs. I believe that the case being hit
    should be expected with this patch in place, and that it's actually OK.
    We expect that the rest of the code will handle this case gracefully.
    Simply remove this error message from the code.

    Change-Id: I83deae464518fef5abe8fc00bfd0a186de01527b
    Partial-Bug: #1248563
    Co-authored-by: Russell Bryant <email address hidden>

Revision history for this message
Matt Riedemann (mriedem) wrote :

What's going on with this bug? In https://review.openstack.org/#/c/55444/ it says "Dealing with delete requests that never got executed is not in scope of this change and will be submitted separately." but I never see that happen in this same bug report.

Then later we had this revert of a revert https://review.openstack.org/#/c/70187/, which got reverted itself again later because it was causing race failures in hyper-v CI:

https://review.openstack.org/#/c/71363/

Revision history for this message
Tracy Jones (tjones-i) wrote :

Can someone give an update on this bug? It has not been touched in over 8 months.

Revision history for this message
Stanislaw Pitucha (stanislaw-pitucha) wrote :

I believe it's fixed in the 3 patches above, but since they're all marked partial, the bug was not closed.

@tjones-i, what was the context for the question - did you run into another case of this bug? If not, I'm happy to have it closed.

@mriedem, I'm not in the team dealing with Nova anymore, so I'm not sure what happened with the second part.

Revision history for this message
Tracy Jones (tjones-i) wrote :

I'm doing bug triage and looking at bugs which are not closed but have merged reviews.

Should someone else take this over since you are not working with nova any more?

Revision history for this message
Matt Riedemann (mriedem) wrote :

The separate bug where you can't ever delete an instance stuck in error state without restarting the compute service has been fixed under a separate bug that Joe Gordon worked on. That fix basically reverts the task state in the error case so you don't get stuck. I'm having a hard time finding the bugs linked to that fix though.
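
Reverting the task state in the error case can be sketched roughly as follows. This is not the fix referred to above; the `delete_with_revert` helper and the dictionary record are assumptions made for illustration.

```python
def delete_with_revert(instance, do_delete):
    """Mark the instance as deleting, and undo the mark if delete fails."""
    instance["task_state"] = "deleting"
    try:
        do_delete(instance)
    except Exception:
        # Clear the state so a later delete request is accepted
        # instead of being treated as a duplicate.
        instance["task_state"] = None
        raise
```

If the state were left as deleting after a failure, every later delete request would look like a duplicate and the instance could only be cleaned up by restarting the service.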

Revision history for this message
Matt Riedemann (mriedem) wrote :

Bug 1329559 cleans up the instances stuck in deleting state issue, so I think we can close this bug.

Changed in nova:
status: In Progress → Fix Committed
Changed in nova:
milestone: none → juno-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-2 → 2014.2