unable to re-issue confirm/revert of resize

Bug #1536703 reported by Chris Friesen on 2016-01-21
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: Low
Assigned to: Unassigned

Bug Description

Calling confirm_resize() sets migration.status to 'confirming' and sends an RPC cast to the compute node.

If there's a glitch and that cast is received but never processed, there's no way to confirm the resize, since confirm_resize() only looks for migrations with a status of "finished". It looks like it should be safe as-is to allow calling confirm_resize() on a migration in the "confirming" state, since it's already synchronized on the instance.

A similar problem holds for an interrupted revert_resize(), but in that case there's no synchronization currently. Not sure if that's a problem or not.
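To make the stuck state concrete, here is a minimal in-memory sketch of the confirm flow. None of it is nova code: `Migration`, `get_by_status`, `confirm_resize` and `lost_cast` are hypothetical stand-ins, and the cast is just a function call that does nothing.

```python
# Toy in-memory model of the flow described above; this is not nova code.
class MigrationNotFoundByStatus(Exception):
    """Local stand-in for a 'no migration with that status' error."""


class Migration:
    def __init__(self, status):
        self.status = status


def get_by_status(migration, status):
    """Hypothetical stand-in for a status-filtered migration lookup."""
    if migration.status != status:
        raise MigrationNotFoundByStatus(status)
    return migration


def confirm_resize(migration, cast):
    """API-side confirm that only accepts a migration in 'finished'."""
    found = get_by_status(migration, 'finished')
    found.status = 'confirming'
    cast(found)  # fire-and-forget, like an RPC cast


def lost_cast(migration):
    """Simulates a compute node that took the message but never acted."""


migration = Migration('finished')
confirm_resize(migration, lost_cast)

# Re-issuing the confirm now fails, because the status is 'confirming'.
try:
    confirm_resize(migration, lost_cast)
    retry_result = 'accepted'
except MigrationNotFoundByStatus:
    retry_result = 'rejected'

print(migration.status, retry_result)
```

In this model the migration stays in 'confirming' and every retry is rejected, which mirrors the situation where only a manual database edit helps.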

Tags: api

@Chris Friesen:
> "If there's a glitch [...]"

So it's more of a hardening of the current workflow, isn't it? We can't even test it in an automated way, since we don't have a fault injection system (that I'm aware of).

Chris Friesen (cbf123) wrote :

In order for the problem to occur you need to have nova-compute pull the message off the queue but then never process it. (Crash, kill, reboot, etc.)

Once that happens, there's no way to clean it up short of manually editing the database.

In the "confirming" case it seems very low-risk since the logic is basically already there to handle it in nova-compute.

We could test it in the unit tests simply by specifying a starting migration status of "confirming". Testing it in tempest would be a lot harder.
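A rough sketch of what such a unit test could look like is below. It assumes a relaxed confirm that accepts both 'finished' and 'confirming'; the `confirm_resize` function, the `lookup` callable and the `compute_rpc` mock are hypothetical stand-ins, not the real nova API or test fixtures.

```python
import unittest
from unittest import mock


class MigrationNotFoundByStatus(Exception):
    """Local stand-in for the nova exception of the same name."""


def confirm_resize(lookup, compute_rpc, instance_uuid):
    """Hypothetical relaxed confirm that also accepts 'confirming'."""
    for status in ('finished', 'confirming'):
        try:
            migration = lookup(instance_uuid, status)
            break
        except MigrationNotFoundByStatus:
            continue
    else:
        # Neither status matched, so there is nothing to confirm.
        raise MigrationNotFoundByStatus(instance_uuid)
    migration.status = 'confirming'
    compute_rpc.confirm_resize(instance_uuid, migration)
    return migration


class ConfirmResizeTestCase(unittest.TestCase):
    def test_confirm_from_confirming(self):
        # Start from the interrupted state instead of injecting a fault.
        migration = mock.Mock(status='confirming')

        def lookup(instance_uuid, status):
            if status != migration.status:
                raise MigrationNotFoundByStatus(status)
            return migration

        compute_rpc = mock.Mock()
        confirm_resize(lookup, compute_rpc, 'fake-uuid')
        compute_rpc.confirm_resize.assert_called_once_with(
            'fake-uuid', migration)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(
    ConfirmResizeTestCase)
result = unittest.TextTestRunner().run(suite)
```

The point is only that the interrupted state can be set up directly as the starting migration status, so no fault injection is needed at the unit level.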

Eli Qiao (taget-9) wrote :

It seems to make sense to also query for migrations whose status is in an "-ing" state (such as "confirming") when doing revert_resize, confirm_resize, etc. The lookup in question is:

        migration = objects.Migration.get_by_instance_and_status(
            elevated, instance.uuid, 'finished')

Changed in nova:
assignee: nobody → Eli Qiao (taget-9)

Eli Qiao (taget-9) wrote :

I will start fixing this once someone else confirms that this bug needs to be fixed.

Eli Qiao (taget-9) wrote :

One more update: there is existing logic in the instance delete path that checks whether a migration is still unconfirmed, so it seems that we can do something similar.

    def _confirm_resize_on_deleting(self, context, instance):
        # If in the middle of a resize, use confirm_resize to
        # ensure the original instance is cleaned up too
        migration = None
        for status in ('finished', 'confirming'):
            try:
                migration = objects.Migration.get_by_instance_and_status(
                        context.elevated(), instance.uuid, status)
                LOG.info(_LI('Found an unconfirmed migration during delete, '
                             'id: %(id)s, status: %(status)s'),
                         {'id': migration.id,
                          'status': migration.status},
                         context=context, instance=instance)
                break
            except exception.MigrationNotFoundByStatus:
                pass

....

Sean Dague (sdague) wrote :

I believe allowing a resize_confirm to be idempotent should be fine. Eli provides a pretty good roadmap of how to get there above.
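As a rough illustration of what idempotent means here, the sketch below shows a compute-side handler that can safely receive the same confirm request twice. It is a toy model under stated assumptions: `handle_confirm_resize`, the `cleanup` callback and the per-instance lock table are hypothetical, and the early return on 'confirmed' is an assumed guard, not a quote of the nova-compute code.

```python
import threading
from collections import defaultdict

# Hypothetical per-instance locks, standing in for whatever per-instance
# synchronization the compute service applies.
_instance_locks = defaultdict(threading.Lock)


class Migration:
    def __init__(self, instance_uuid, status):
        self.instance_uuid = instance_uuid
        self.status = status


def handle_confirm_resize(migration, cleanup):
    """Toy compute-side handler that tolerates duplicate confirm requests."""
    with _instance_locks[migration.instance_uuid]:
        if migration.status == 'confirmed':
            # An earlier request already finished the work; do nothing.
            return False
        cleanup(migration)
        migration.status = 'confirmed'
        return True


cleanups = []
migration = Migration('fake-uuid', 'confirming')

# The same confirm request arrives twice, e.g. after a re-issued cast.
first = handle_confirm_resize(migration, cleanups.append)
second = handle_confirm_resize(migration, cleanups.append)

print(first, second, len(cleanups), migration.status)
```

Because the two requests are serialized on the instance and the second one sees the final status, the cleanup runs only once in this model.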

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
tags: added: api
Changed in nova:
status: Confirmed → Triaged

Fix proposed to branch: master
Review: https://review.openstack.org/285931

Changed in nova:
status: Triaged → In Progress

Change abandoned by Sean Dague on branch: master
Review: https://review.openstack.org/285931
Reason: This review is > 6 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → Triaged
assignee: Eli Qiao (taget-9) → nobody