Instance's vm_state becomes error when cold migrating an instance to the same host fails

Bug #1811235 reported by David Li on 2019-01-10
This bug affects 1 person
Affects                    Importance   Assigned to
OpenStack Compute (nova)   Medium       Matt Riedemann
Queens                     Medium       Unassigned
Rocky                      Medium       Unassigned
Stein                      Medium       Matt Riedemann

Bug Description

step 1: boot a simple instance
step 2: stop the running instance that was booted in step 1
step 3: migrate the instance to the same host, where the driver on that host doesn't support migrating to the same host
step 4: run nova show uuid (the instance uuid); the instance's vm_state becomes error, and the fault shows {"message": "Unable to migrate instance ( instance uuid) to current host (hostname)
step 5: run nova instance-action-list uuid (the instance uuid); you will see the migrate error.

David Li (ldone) on 2019-01-10
description: updated
Brin Zhang (zhangbailin) wrote :

@David Li, which branch of nova has this issue?

On master, if you cold migrate the server to the same host, it does not set the server to error status; see [1].

[1] https://github.com/openstack/nova/blob/master/nova/compute/api.py#L3530

David Li (ldone) wrote :

Dear Brin, I'm sorry for the late reply.
I found this issue in a private cloud, and after reading the code I found that the Ocata version of nova (15.1.5 release) may also have this issue. The key code is in nova/compute/manager.py --> class ComputeManager --> prep_resize --> _error_out_instance_on_exception.

Besides, I have some questions. First, why does the latest version of nova not support migrating an instance to the same host, even though some drivers support that function? Second, is there any problem when a user migrates an instance without specifying a destination host but there is just one compute node?

tags: added: compute
Balazs Gibizer (balazs-gibizer) wrote :

To answer your question: Nova has a config option, allow_resize_to_same_host, that is False by default. If it is False, nova does not allow migrating or resizing an instance to the host it is already on. So the latest nova still supports migrating to the same host if the configuration allows it.

David Li (ldone) wrote :

@BalazsGibizer, thank you for your reply. I saw the config option allow_resize_to_same_host. However, that option has no effect when I want to migrate to the same host, because exception.CannotMigrateToSameHost is raised at the beginning of the resize function; see [1].

[1] https://github.com/openstack/nova/blob/master/nova/compute/api.py#L3530

melanie witt (melwitt) wrote :

David, thanks for the code link.

It looks like indeed, this change introduced a regression that causes CONF.allow_resize_to_same_host not to work any longer:

https://review.openstack.org/408955

melanie witt (melwitt) wrote :

Actually, looking at the code further [1], the docstring says that the host_name variable will only be set when the user has specified a target host for a cold migration. So, a resize without a target host specified will be able to land on the same host if CONF.allow_resize_to_same_host = True. BUT, a resize with a target host specified will be rejected. Prior to that change, it was not possible to specify a target host during a resize, at all.

David, you are specifying a target host in your resize command, I assume?

I'm unsure if the code was written intentionally to reject self host, given CONF.allow_resize_to_same_host. We should discuss this with the wider team and find out more about the original target host feature.

[1] https://github.com/openstack/nova/blob/9419c3e05499e55beda93d664197a7b0f0011ff7/nova/compute/api.py#L3524-L3525
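
The rule described in this comment can be modelled with a small Python sketch. This is not nova's code: check_target_host, its parameters, and the stand-in exception class are hypothetical names that only mirror the behaviour stated here, namely that an explicitly requested target host equal to the instance's current host is rejected regardless of allow_resize_to_same_host.

```python
class CannotMigrateToSameHost(Exception):
    """Stand-in for the nova exception of the same name (illustration only)."""


def check_target_host(instance_host, host_name=None,
                      allow_resize_to_same_host=False):
    """Simplified model of the API-side check discussed above.

    host_name is only set when the user names a target host for a cold
    migration. Returns True if the instance's current host remains a
    possible destination.
    """
    if host_name is not None:
        if host_name == instance_host:
            # Rejected up front, whatever allow_resize_to_same_host says.
            raise CannotMigrateToSameHost(
                "Cannot migrate to the host where the server exists.")
        # A different host was requested, so the current host is not used.
        return False
    # No explicit target: the config option decides whether the current
    # host may be picked.
    return allow_resize_to_same_host
```

With no target host the option is honoured; with the current host named as the target the request fails even when the option is True.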

David Li (ldone) wrote :

Dear melaniewitt, I am doing a cold migrate with the same host specified as the target in my command, not a resize.

Matt Riedemann (mriedem) wrote :

Resize does not allow specifying a target host at all. Specifying a target host only applies (in that code) for cold migrate:

https://developer.openstack.org/api-ref/compute/#resize-server-resize-action

https://developer.openstack.org/api-ref/compute/#migrate-server-migrate-action

The host parameter was added in the 2.56 microversion.

The allow_resize_to_same_host option really only makes sense for resize when the flavor changes. For cold migrate the flavor does not change, so what is the point of cold migrating a server to the same host?

As noted earlier, the cold migration might just fail in that case anyway if the virt driver doesn't support it:

https://github.com/openstack/nova/blob/9419c3e05499e55beda93d664197a7b0f0011ff7/nova/compute/manager.py#L4172

It looks like the only driver that does support cold migrate to the same host is the vcenter driver (because that is a cluster driver where the single nova-compute manages a single vcenter cluster which could have hundreds of ESXi hosts on it).

Are you using the vcenter driver in your deployment?
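
As a rough illustration of the driver-side check mentioned in this comment, here is a minimal Python sketch. It assumes the driver capabilities are exposed as a plain dict; check_same_host_migration and the stand-in exception class are hypothetical, and only the supports_migrate_to_same_host capability name comes from this report.

```python
class UnableToMigrateToSelf(Exception):
    """Stand-in for the nova exception of the same name (illustration only)."""


def check_same_host_migration(instance_host, compute_host, capabilities):
    """Raise if a cold migration lands on the source host and the driver
    does not advertise support for that.

    capabilities is assumed to be a dict of driver capability flags.
    """
    same_host = instance_host == compute_host
    if same_host and not capabilities.get(
            "supports_migrate_to_same_host", False):
        raise UnableToMigrateToSelf(
            "Unable to migrate instance to current host (%s)" % compute_host)
```

A cluster-style driver would pass {"supports_migrate_to_same_host": True} and the check would succeed; any other driver landing on its own host would raise.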

Matt Riedemann (mriedem) wrote :

I don't see where UnableToMigrateToSelf would eventually set the instance to error status, either in the compute manager prep_resize flow or in conductor if it tried to reschedule to another host. The fault does come from the compute manager recording UnableToMigrateToSelf as a fault.

Matt Riedemann (mriedem) wrote :

Oh yes I see:

https://github.com/openstack/nova/blob/stable/ocata/nova/compute/manager.py#L3798

That catches the UnableToMigrateToSelf because it doesn't reschedule and re-raises UnableToMigrateToSelf from _prep_resize which then sets the instance to error status:

https://github.com/openstack/nova/blob/stable/ocata/nova/compute/manager.py#L6801

That exception handler is kind of a pain - we have similar issues with rebuild failing and setting the instance to error status when it probably shouldn't.

In this case we should probably raise InstanceFaultRollback when we hit UnableToMigrateToSelf but even that would set the instance status back to active which it's not in this case (they resized a stopped VM).

And yes that code still exists on master...so this is a valid bug, but has less to do with the 2.56 microversion.

Changed in nova:
status: New → Triaged
importance: Undecided → Medium
importance: Medium → Low
Matt Riedemann (mriedem) wrote :

I'm going to mark this low severity because it should be rare to get into this scenario if you're doing a cold migrate. The scheduler should filter out the instance.host unless the allow_resize_to_same_host config option is True (which I guess it is in this case), but even then if you had more compute hosts in your deployment then we should be able to reschedule to another host (unless they are all full). And even if you do hit this, nothing changed with the underlying VM so you can reset the state using this API:

https://developer.openstack.org/api-ref/compute/#reset-server-state-os-resetstate-action

That API only allows you to set the instance status to active or error, though. The instance is already in error status, and the guest isn't running so it's not active either. If you reset to active then the compute manager should automatically stop it via the _sync_power_states periodic task. I'm not sure why we don't allow more states in the resetState API (like stopped).
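
For reference, a minimal Python sketch of building the request body for that reset-state action. The helper name build_reset_state_body is hypothetical; the body shape follows the os-resetState action linked above, and sending it (a POST to the server's action URL with a valid token) is left out.

```python
# States the resetState action accepts, per the comment above.
ALLOWED_RESET_STATES = ("active", "error")


def build_reset_state_body(state):
    """Return the JSON-serialisable body for the os-resetState action."""
    if state not in ALLOWED_RESET_STATES:
        raise ValueError(
            "state must be one of %s, got %r" % (ALLOWED_RESET_STATES, state))
    return {"os-resetState": {"state": state}}
```

Calling build_reset_state_body("stopped") raises ValueError, which mirrors the limitation noted in this comment.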

Fix proposed to branch: master
Review: https://review.openstack.org/633227

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Matt Riedemann (mriedem) on 2019-01-25
Changed in nova:
importance: Low → Medium
Matt Riedemann (mriedem) wrote :

Per comment 2:

> Dear Brin, I'm sorry for the late reply.
> I found this issue in a private cloud, and after reading the code I found that the Ocata version
> of nova (15.1.5 release) may also have this issue. The key code is in nova/compute/manager.py -->
> class ComputeManager --> prep_resize --> _error_out_instance_on_exception.

This is the latent bug from Ocata (and probably earlier). My patches above should fix this bug.

> Besides, I have some questions. First, why does the latest version of nova not support migrating
> an instance to the same host, even though some drivers support that function? Second, is there
> any problem when a user migrates an instance without specifying a destination host but there is
> just one compute node?

This looks like a regression with the 2.56 microversion in the API code, which was added in Queens. As we can tell from the vmware driver's supports_migrate_to_same_host=True capability, we know that at least one compute driver supports cold migrating to the same compute service host and the API should probably allow that. It kind of makes sense for vmware since you could have a single nova-compute service hosting a single vcenter cluster which has 1000 ESXi hosts in it, so your only option to cold migrate is to that same compute host (but underlying vcenter would actually cold migrate to another ESXi host in the cluster).

That regression in 2.56 is a separate bug though in my opinion.

Matt Riedemann (mriedem) wrote :

Bug 1819216 is related to this.

melanie witt (melwitt) wrote :

At a high level, I agree it makes sense not to go into ERROR state if a cold migrate to self fails, but ERROR state + fault message is generally "the way" that nova communicates failure of an action upon an instance. The end user sees the ERROR state and looks at the fault message to discover that something was not successful.

If we remove going to ERROR state, how will the end user know that their instance did not succeed in the migration? The instance actions API policy defaults to admin-only. Is there some other way the end user can discover the migration was not successful?

Matt Riedemann (mriedem) wrote :

@melanie, I've replied to your questions from comment 16 in the code review:

https://review.opendev.org/#/c/633227/7/nova/compute/manager.py@4355

Matt Riedemann (mriedem) wrote :

Also note that setting the instance to ERROR just to see the fault means that to reset the status on the instance, if you use the resetState API your only options are ACTIVE or ERROR. But if I'm resizing from a STOPPED server and we go to ERROR state just to see a fault, then resetting the instance to ACTIVE is a lie since the guest is really stopped - the power state sync task in the compute should eventually correct that, but it's still annoyingly wrong.

I think in general we should avoid setting instances to ERROR status if we can help it, i.e. when nothing really changed about the power state of the instance, its data, or related resources (volumes, ports, etc).

The instance actions API is the way to track the details of an operation performed on a server and it's not admin-only.

Reviewed: https://review.opendev.org/633212
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=46cb2fdfe2df333d870aca8e6afc521172b8e061
Submitter: Zuul
Branch: master

commit 46cb2fdfe2df333d870aca8e6afc521172b8e061
Author: Matt Riedemann <email address hidden>
Date: Fri Jan 25 08:58:36 2019 -0500

    Change InstanceFaultRollback handling in _error_out_instance_on_exception

    For some reason, only NotImplementedError in _error_out_instance_on_exception
    would use the $instance_state parameter which can be controlled by the
    caller of the context manager to determine the rollback vm_state. But in the
    case of InstanceFaultRollback, the caller may want to reset the vm_state
    back to something other than ACTIVE, like if the instance is actually
    STOPPED and something like prep_resize fails (you can resize a STOPPED
    instance).

    This change makes _error_out_instance_on_exception handle InstanceFaultRollback
    like NotImplementedError in that the instance_state parameter is used to reset
    the instance.vm_state. It also adds a docstring explaining how this context
    manager works along with some notes/questions about ways to improve it.

    Change-Id: Ie4f9177f4d54cbc7dbcf58bd107fd5f24c60d8bb
    Related-Bug: #1811235
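
The behaviour this commit message describes can be sketched as a standalone Python context manager. This is a simplified model, not the nova method: the Instance class, the state constants, and the vm_state_after_failure helper are hypothetical, the real method's other arguments are omitted, and how the real code propagates the exception is not modelled (the sketch simply re-raises it).

```python
import contextlib

ACTIVE, STOPPED, ERROR = "active", "stopped", "error"


class InstanceFaultRollback(Exception):
    """Stand-in for the nova exception of the same name (illustration only)."""

    def __init__(self, inner_exception=None):
        super().__init__(
            "Instance rollback performed due to: %s" % inner_exception)
        self.inner_exception = inner_exception


class Instance:
    """Minimal stand-in that only tracks vm_state."""

    def __init__(self, vm_state):
        self.vm_state = vm_state


@contextlib.contextmanager
def error_out_instance_on_exception(instance, instance_state=ACTIVE):
    try:
        yield
    except (NotImplementedError, InstanceFaultRollback):
        # Both cases roll back to the state chosen by the caller.
        instance.vm_state = instance_state
        raise
    except Exception:
        # Anything else puts the instance into ERROR.
        instance.vm_state = ERROR
        raise


def vm_state_after_failure(initial_state, error):
    """Run a failing block and report the resulting vm_state."""
    instance = Instance(initial_state)
    try:
        with error_out_instance_on_exception(
                instance, instance_state=initial_state):
            raise error
    except Exception:
        pass
    return instance.vm_state
```

In this model a stopped instance whose operation fails with InstanceFaultRollback stays stopped, while an unexpected exception still moves it to error.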

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/633227
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d1931ac0630415519f9c3d906caba9c83cc8162a
Submitter: Zuul
Branch: master

commit d1931ac0630415519f9c3d906caba9c83cc8162a
Author: Matt Riedemann <email address hidden>
Date: Fri Jan 25 10:55:00 2019 -0500

    Raise InstanceFaultRollback for UnableToMigrateToSelf from _prep_resize

    It is possible to cold migrate a stopped server. If, however, the
    cold migrate is scheduled to the instance's current host and the
    compute driver does not support cold migrating to the same host,
    then UnableToMigrateToSelf was being raised from _prep_resize. If
    _reschedule_resize_or_reraise re-raises that exception, then
    _error_out_instance_on_exception in prep_resize handles it and
    sets the instance vm_state to ACTIVE. This is wrong since the
    instance power state is unchanged at this point and the instance
    is actually stopped.

    This fixes the problem by wrapping UnableToMigrateToSelf in
    InstanceFaultRollback and raises that from _prep_resize, and
    _error_out_instance_on_exception is called with the initial
    vm_state (STOPPED in this case) so when _error_out_instance_on_exception
    handles the InstanceFaultRollback exception it sets the instance
    vm_state to STOPPED (what it already was) rather than ACTIVE.

    There were no existing unit tests for the UnableToMigrateToSelf
    case in _prep_resize so those are added here.

    Change-Id: I17543ecb572934ecc7d0bbc7a4ad2f537fa499bc
    Closes-Bug: #1811235
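
The wrapping step described in this commit message might look roughly like the following Python sketch. The exception classes are stand-ins and call_with_rollback is a hypothetical helper; it only shows UnableToMigrateToSelf being wrapped in InstanceFaultRollback so that an outer handler can roll the vm_state back instead of setting it to error.

```python
class UnableToMigrateToSelf(Exception):
    """Stand-in for the nova exception of the same name (illustration only)."""


class InstanceFaultRollback(Exception):
    """Stand-in carrying the original failure as inner_exception."""

    def __init__(self, inner_exception=None):
        super().__init__(
            "Instance rollback performed due to: %s" % inner_exception)
        self.inner_exception = inner_exception


def call_with_rollback(check):
    """Run check() and wrap UnableToMigrateToSelf in InstanceFaultRollback.

    check is any zero-argument callable standing in for the same-host
    validation done while preparing the resize.
    """
    try:
        return check()
    except UnableToMigrateToSelf as exc:
        # Signal to the caller that the instance state should be rolled
        # back rather than set to ERROR.
        raise InstanceFaultRollback(inner_exception=exc) from exc
```

The original exception stays reachable through inner_exception, so the fault can still report why the migration was refused.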

Reviewed: https://review.opendev.org/666638
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cdaa80078491517af3143690d40d30a5b7591edf
Submitter: Zuul
Branch: stable/stein

commit cdaa80078491517af3143690d40d30a5b7591edf
Author: Matt Riedemann <email address hidden>
Date: Fri Jan 25 08:58:36 2019 -0500

    Change InstanceFaultRollback handling in _error_out_instance_on_exception

    For some reason, only NotImplementedError in _error_out_instance_on_exception
    would use the $instance_state parameter which can be controlled by the
    caller of the context manager to determine the rollback vm_state. But in the
    case of InstanceFaultRollback, the caller may want to reset the vm_state
    back to something other than ACTIVE, like if the instance is actually
    STOPPED and something like prep_resize fails (you can resize a STOPPED
    instance).

    This change makes _error_out_instance_on_exception handle InstanceFaultRollback
    like NotImplementedError in that the instance_state parameter is used to reset
    the instance.vm_state. It also adds a docstring explaining how this context
    manager works along with some notes/questions about ways to improve it.

    Change-Id: Ie4f9177f4d54cbc7dbcf58bd107fd5f24c60d8bb
    Related-Bug: #1811235
    (cherry picked from commit 46cb2fdfe2df333d870aca8e6afc521172b8e061)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/666639
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1a11d5c7f35c17cfafc6b8ba57da88995c35b595
Submitter: Zuul
Branch: stable/stein

commit 1a11d5c7f35c17cfafc6b8ba57da88995c35b595
Author: Matt Riedemann <email address hidden>
Date: Fri Jan 25 10:55:00 2019 -0500

    Raise InstanceFaultRollback for UnableToMigrateToSelf from _prep_resize

    It is possible to cold migrate a stopped server. If, however, the
    cold migrate is scheduled to the instance's current host and the
    compute driver does not support cold migrating to the same host,
    then UnableToMigrateToSelf was being raised from _prep_resize. If
    _reschedule_resize_or_reraise re-raises that exception, then
    _error_out_instance_on_exception in prep_resize handles it and
    sets the instance vm_state to ACTIVE. This is wrong since the
    instance power state is unchanged at this point and the instance
    is actually stopped.

    This fixes the problem by wrapping UnableToMigrateToSelf in
    InstanceFaultRollback and raises that from _prep_resize, and
    _error_out_instance_on_exception is called with the initial
    vm_state (STOPPED in this case) so when _error_out_instance_on_exception
    handles the InstanceFaultRollback exception it sets the instance
    vm_state to STOPPED (what it already was) rather than ACTIVE.

    There were no existing unit tests for the UnableToMigrateToSelf
    case in _prep_resize so those are added here.

    Conflicts:
          nova/tests/unit/compute/test_compute_mgr.py

    NOTE(mriedem): The conflict is due to some unit tests from change
    I734cc01dce13f9e75a16639faf890ddb1661b7eb not being in Stein.

    Change-Id: I17543ecb572934ecc7d0bbc7a4ad2f537fa499bc
    Closes-Bug: #1811235
    (cherry picked from commit d1931ac0630415519f9c3d906caba9c83cc8162a)

This issue was fixed in the openstack/nova 19.0.2 release.
