unable to re-issue confirm/revert of resize

Bug #1536703 reported by Chris Friesen
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

Calling confirm_resize() sets migration.status to 'confirming' and sends an RPC cast to the compute node.

If there's a glitch and that cast is received but never processed, there's no way to confirm the resize, since the lookup only matches migrations with a status of "finished". It looks like it should be safe as-is to allow calling confirm_resize() on a migration in the "confirming" state, since the operation is already synchronized on the instance.
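
The stuck state can be illustrated with a small self-contained model. This is only a sketch: FakeMigration, FakeComputeAPI and the local MigrationNotFoundByStatus are hypothetical stand-ins rather than nova's actual classes, and the lost cast is simulated by simply never running a compute-side handler.

```python
class MigrationNotFoundByStatus(Exception):
    """Toy stand-in for the nova exception of the same name."""


class FakeMigration:
    def __init__(self, status):
        self.status = status


class FakeComputeAPI:
    """Toy model of the API-side confirm path described in the report."""

    def __init__(self, migration):
        self._migration = migration
        self.casts = []  # RPC casts "sent" to the compute node

    def _get_migration(self, status):
        if self._migration.status != status:
            raise MigrationNotFoundByStatus(status)
        return self._migration

    def confirm_resize(self):
        # Only a 'finished' migration is accepted, as the report describes.
        migration = self._get_migration('finished')
        migration.status = 'confirming'
        # Fire-and-forget: nothing tells the API if the compute node dies
        # after taking the message off the queue.
        self.casts.append('confirm_resize')


api = FakeComputeAPI(FakeMigration('finished'))
api.confirm_resize()          # first attempt; assume the cast is then lost

try:
    api.confirm_resize()      # retry after the glitch
    retry_ok = True
except MigrationNotFoundByStatus:
    retry_ok = False          # stuck: status is 'confirming', not 'finished'

print(retry_ok)
```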

A similar problem holds for an interrupted revert_resize(), but in that case there's no synchronization currently. Not sure if that's a problem or not.

Tags: api
Markus Zoeller (markus_z) (mzoeller) wrote :

@Chris Friesen:
> "If there's a glitch [...]"

So it's more a hardening of the current workflow, isn't it? We can't even test it automatically, since we don't have a fault injection system (that I'm aware of).

Chris Friesen (cbf123) wrote :

In order for the problem to occur you need to have nova-compute pull the message off the queue but then never process it. (Crash, kill, reboot, etc.)

Once that happens, there's no way to clean it up short of manually editing the database.

In the "confirming" case it seems very low-risk since the logic is basically already there to handle it in nova-compute.

We could test it in the unit tests simply by specifying a starting migration status of "confirming". Testing it in tempest would be a lot harder.
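
A rough sketch of what such a unit test could look like, using only the standard library. The confirm_resize() function and FakeMigration class below are hypothetical toy stand-ins, not nova code, and the 'cast-sent' return value is just a placeholder for the RPC cast; the point is only that the test starts from a migration that is already in the "confirming" state.

```python
import unittest


class MigrationNotFoundByStatus(Exception):
    """Toy stand-in for the nova exception of the same name."""


class FakeMigration:
    def __init__(self, status):
        self.status = status


def confirm_resize(migration, allowed=('finished', 'confirming')):
    """Toy API-side confirm that also accepts an interrupted confirm."""
    if migration.status not in allowed:
        raise MigrationNotFoundByStatus(migration.status)
    migration.status = 'confirming'
    return 'cast-sent'  # placeholder for the RPC cast


class ConfirmResizeRetryTest(unittest.TestCase):
    def test_confirm_from_confirming(self):
        # Start from the state left behind by a lost cast.
        migration = FakeMigration('confirming')
        self.assertEqual('cast-sent', confirm_resize(migration))
        self.assertEqual('confirming', migration.status)

    def test_confirm_rejects_other_states(self):
        with self.assertRaises(MigrationNotFoundByStatus):
            confirm_resize(FakeMigration('reverted'))


# Run the suite explicitly so the result object can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(
    ConfirmResizeRetryTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```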

Eli Qiao (taget-9) wrote :

It seems to make sense to allow querying migrations whose status ends in "ing" (e.g. "confirming") when doing revert_resize, confirm_resize, etc.

        migration = objects.Migration.get_by_instance_and_status(
            elevated, instance.uuid, 'finished')

Changed in nova:
assignee: nobody → Eli Qiao (taget-9)
Eli Qiao (taget-9) wrote :

I will start fixing this once someone else confirms that this bug needs to be fixed.

Eli Qiao (taget-9) wrote :

One more update: there is existing logic in the instance delete path that checks whether a migration is still unconfirmed, so it seems we can do something similar.

    def _confirm_resize_on_deleting(self, context, instance):
        # If in the middle of a resize, use confirm_resize to
        # ensure the original instance is cleaned up too
        migration = None
        for status in ('finished', 'confirming'):
            try:
                migration = objects.Migration.get_by_instance_and_status(
                        context.elevated(), instance.uuid, status)
                LOG.info(_LI('Found an unconfirmed migration during delete, '
                             'id: %(id)s, status: %(status)s'),
                         {'id': migration.id,
                          'status': migration.status},
                         context=context, instance=instance)
                break
            except exception.MigrationNotFoundByStatus:
                pass

....
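
The same "try each status in turn" pattern could be factored into a helper and reused by the confirm path. The sketch below is an assumption about how that might look, not the proposed patch: find_migration_by_statuses(), the in-memory MIGRATIONS table and the lookup() callable are all hypothetical, and the exception class is a local stand-in for nova's.

```python
class MigrationNotFoundByStatus(Exception):
    """Toy stand-in for the nova exception of the same name."""


def find_migration_by_statuses(lookup, instance_uuid, statuses):
    """Return the first migration found for any of ``statuses``.

    ``lookup(instance_uuid, status)`` is expected to raise
    MigrationNotFoundByStatus when nothing matches.
    """
    for status in statuses:
        try:
            return lookup(instance_uuid, status)
        except MigrationNotFoundByStatus:
            continue
    raise MigrationNotFoundByStatus(
        '%s: no migration in %s' % (instance_uuid, ', '.join(statuses)))


# In-memory table standing in for the database: uuid -> migration dict.
MIGRATIONS = {'uuid-1': {'id': 7, 'status': 'confirming'}}


def lookup(instance_uuid, status):
    migration = MIGRATIONS.get(instance_uuid)
    if migration is None or migration['status'] != status:
        raise MigrationNotFoundByStatus(status)
    return migration


# A confirm that was interrupted is still found on retry.
found = find_migration_by_statuses(
    lookup, 'uuid-1', ('finished', 'confirming'))
print(found['id'], found['status'])
```

Listing 'finished' first keeps the normal case a single lookup; the 'confirming' lookup is only attempted when an earlier confirm was interrupted.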

Sean Dague (sdague) wrote :

I believe allowing a resize_confirm to be idempotent should be fine. Eli provides a pretty good roadmap of how to get there above.

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
tags: added: api
Changed in nova:
status: Confirmed → Triaged
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/285931

Changed in nova:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/285931
Reason: This review is > 6 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → Triaged
assignee: Eli Qiao (taget-9) → nobody
Pablo (pescobar001) wrote :

I was hitting this problem and could work around it by running "nova reset-state --active {uuid-of-instance}".

I guess I could also use "openstack server set --state active {uuid-of-instance}", but I haven't tried that one.
