Cold migration fails when the filter only returns the host where the vm is located and the vm status is set to ERROR

Bug #1748697 reported by yangjie
This bug affects 5 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Low
Assigned to: Matt Riedemann

Bug Description

Description
===========
Configuration: allow_resize_to_same_host=True.
Cold migration fails when the scheduler filters return only the host where the VM is currently located, and the VM status is set to ERROR.

Steps to reproduce
==================
1. Create a VM.
2. Disable all hosts except the one where the VM is located.
3. Cold migrate the VM. exception.UnableToMigrateToSelf appears in the compute log, and the VM status is set to ERROR while the VM is still running.

Expected result
===============
The cold migration should fail and the user should see an HTTP 400 NoValidHost error in the console output, because cold migrating to the same host is meaningless. The VM status should remain ACTIVE.

Actual result
=============
The cold migration fails, nothing is returned to the console, and the VM status is set to ERROR.

Environment
===========
nova 16.0.4, KVM & libvirt

Logs & Configs
==============
allow_resize_to_same_host=True

PS:
If the host where the VM is located is particularly resource-rich, the filters return that host every time the user runs a cold migration. Even if other hosts have enough resources for the virtual machine, the VM can never be cold-migrated.

tangxing (tang-xing)
Changed in nova:
assignee: nobody → tangxing (tang-xing)
yangjie (yang.jie)
Changed in nova:
assignee: tangxing (tang-xing) → yangjie (yang.jie)
tags: added: openstack-version.pike resize
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Some Nova virt drivers support migrating to the same host (VMware at least). That's why it's not meaningless to accept migrating to the same host.

Anyway, since you are asking for different behaviour when cold-migrating, this would need a new API microversion, hence a new spec and a blueprint.

See https://docs.openstack.org/nova/latest/contributor/blueprints.html for details.

Changed in nova:
status: New → Invalid
importance: Undecided → Wishlist
Revision history for this message
tangxing (tang-xing) wrote :

Hi sylvain-bauza, in some cases the instance is set to the ERROR state when a migration fails (for example with UnableToMigrateToSelf). We could simply return the error to the upper layer (operator or user) so that the user does not have to reset the instance state, since nothing has changed on the guest at all.

Changed in nova:
status: Invalid → Confirmed
importance: Wishlist → Low
Revision history for this message
Viktor Tikkanen (viktor-tikkanen) wrote :

Just to confirm: it is not always necessary to disable all the other hosts in order to reproduce the problem.

For example, on my system I got the following error:

Setting instance vm_state to ERROR: UnableToMigrateToSelf: Unable to migrate instance (949de2d7-cd46-402d-8a5c-04f37e3244b7) to current host (compute-14).

[admin@controller-2 ~(admin)]$ nova hypervisor-list|grep compute-14
| 1258b67e-e99a-4719-b905-adf4e8e42848 | compute-14 | up | enabled |

And if I pass the flavor details to the placement service, it happens to return the instance's own host (compute-14 / 1258b67e-e99a-4719-b905-adf4e8e42848) as the first candidate in the list:

[root@vm000949 ~]# curl -k -i -g -X GET "https://<my_IP_address>:8780/allocation_candidates?limit=1000&resources=DISK_GB%3A10%2CMEMORY_MB%3A12288%2CVCPU%3A6" -H 'Content-Type: application/json' -H "X-Auth-Token: gAAAAABbrHvR_eQljtBZaBkWw6QABL1SrZWc5B4jkOYlUpm0ZegmUyJL2j0pgGgmNtv5To3mH-CEomQ_PORFG9YBCzqcaHrSxtZrnRtqIobxzef2Kk3dhkTNG9wioa1nxMw6-EFXBYsQD4fzkuCMRCOmxjB-4on77l7Jjo7kqhatgOb7Mkvecyw" -H 'openstack-api-version: placement 1.17'

{"provider_summaries": {"1258b67e-e99a-4719-b905-adf4e8e42848": {"traits": [], "resources": {"VCPU": {"used": 34, "capacity": 50}, "MEMORY_MB": {"used": 57344, "capacity": 126323}, "DISK_GB": {"used": 60, "capacity": 1400}}}, "88d719b0-0e69-4d85-8ba4-1435ae324b80": {"traits": [], "resources": {"VCPU": {"used": 42, "capacity": 50}, "MEMORY_MB": {"used": 61440, "capacity": 126323}, "DISK_GB": {"used": 70, "capacity": 1400}}},…
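To make the numbers in that response easier to compare, here is a minimal Python sketch that computes the free capacity per provider. It uses only the two provider summaries visible in the truncated output above; the helper name `free_resources` is hypothetical and is not part of placement or nova.

```python
# Only the two provider summaries visible in the truncated response are used.
provider_summaries = {
    "1258b67e-e99a-4719-b905-adf4e8e42848": {  # compute-14, the source host
        "resources": {
            "VCPU": {"used": 34, "capacity": 50},
            "MEMORY_MB": {"used": 57344, "capacity": 126323},
            "DISK_GB": {"used": 60, "capacity": 1400},
        }
    },
    "88d719b0-0e69-4d85-8ba4-1435ae324b80": {
        "resources": {
            "VCPU": {"used": 42, "capacity": 50},
            "MEMORY_MB": {"used": 61440, "capacity": 126323},
            "DISK_GB": {"used": 70, "capacity": 1400},
        }
    },
}


def free_resources(summary):
    """Return capacity minus used for each resource class (hypothetical helper)."""
    return {
        rc: usage["capacity"] - usage["used"]
        for rc, usage in summary["resources"].items()
    }


free = {uuid: free_resources(s) for uuid, s in provider_summaries.items()}
for uuid, resources in free.items():
    print(uuid, resources)
```

In this excerpt the source host has more free VCPU, memory and disk than the other provider shown, which would be consistent with it being offered as a candidate even though other hosts are enabled.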

Revision history for this message
Hu Zhou (hu.zhou) wrote :

In my opinion, there is no point in doing a cold migration to the same host, unless there is a special need to free particular resources such as some pCPUs. Even in that case, a resize with a new flavor can be used instead.

A quick fix would be to skip the current host in the scheduler_client.select_destinations() call in MigrationTask._execute(); rescheduling should also exclude the current host from the selections.

Resize on the same host should not be affected by this fix.
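The idea in this comment can be sketched in a few lines of Python. This is only an illustration of the filtering step, not nova code: the function `exclude_source_host`, its parameters, and the host name compute-15 are all hypothetical, and the real change would have to live in the scheduler request made from MigrationTask._execute().

```python
def exclude_source_host(candidates, source_host, is_resize):
    """Drop the source host from the candidates for a cold migration.

    Hypothetical helper: a resize (flavor change) keeps the source host
    as a candidate, while a cold migration (same flavor) does not.
    """
    if is_resize:
        return list(candidates)
    return [host for host in candidates if host != source_host]


# compute-14 is the source host from the report; compute-15 is made up.
selected = ["compute-14", "compute-15"]

print(exclude_source_host(selected, "compute-14", is_resize=False))
print(exclude_source_host(selected, "compute-14", is_resize=True))
```

With only the source host in the list, a cold migration would end up with an empty candidate list, which is the case where an error could be returned to the caller instead of putting the instance into ERROR.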

Revision history for this message
Jing Zhang (jing.zhang.nokia) wrote :

I have the same issue as described in the ticket and in comment #3:

"If the host which the vm located is particularly resource-rich, filters return this host each time when user executes cold migration. Even if the resources of other hosts can meet the requirements of the virtual machine, the vm can never be cold-migrated."

The other hosts are rejected by the scheduler via the DifferentHostFilter. Removing the DifferentHostFilter resolved the issue.

A VM in ERROR state may be a minor issue; being unable to migrate is a critical issue.

Revision history for this message
Jing Zhang (jing.zhang.nokia) wrote :

My apologies for comment #5. The VMs I had issues with were created with the "different host" rule, which I was not aware of.

So please ignore comment #5.

Changed in nova:
assignee: yangjie (yang.jie) → Jing Zhang (jing.zhang.nokia)
Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675025

Revision history for this message
Matt Riedemann (mriedem) wrote :

> 2、disable all hosts except the host which the vm is located

Of course cold migration is going to fail because there are no alternate hosts to reschedule to when the first host selected by the scheduler, which is the one the instance is running on in this case, fails because of the UnableToMigrateToSelf error.
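A simplified Python simulation of that reschedule behaviour is shown below. It is a sketch under stated assumptions, not the real nova flow (which involves RPC calls and casts between API, conductor, scheduler and compute): `prep_resize` here is a stand-in that only checks the same-host condition, and `NoAlternateHosts` and the host name compute-15 are hypothetical.

```python
class UnableToMigrateToSelf(Exception):
    """Stand-in for the nova exception of the same name."""


class NoAlternateHosts(Exception):
    """Hypothetical error raised once every selected host has failed."""


def prep_resize(source_host, dest_host):
    # Stand-in for the destination compute check: a driver that cannot
    # cold migrate to the same host rejects the source host.
    if dest_host == source_host:
        raise UnableToMigrateToSelf(dest_host)
    return dest_host


def cold_migrate(source_host, selected_hosts):
    """Try each selected host in order, rescheduling on same-host failure."""
    for dest in selected_hosts:
        try:
            return prep_resize(source_host, dest)
        except UnableToMigrateToSelf:
            continue  # reschedule to the next alternate, if any
    raise NoAlternateHosts("no alternate hosts left to reschedule to")


# With an alternate available, the migration lands on it.
print(cold_migrate("compute-14", ["compute-14", "compute-15"]))

# With every other host disabled, only the source host is selected.
try:
    cold_migrate("compute-14", ["compute-14"])
except NoAlternateHosts as exc:
    print("failed:", exc)
```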

> 3、cold migrate the vm, you can see the exception.UnableToMigrateToSelf in compute log, and the vm status is set to ERROR while it still running

I think this has been fixed, at least going back to Stein:

https://review.opendev.org/#/q/Ie4f9177f4d54cbc7dbcf58bd107fd5f24c60d8bb

> cold migrate failed, and user should see HTTP400 NovaildHost from console output, because cold migrate to same host is meaningless. The vm status still keep Active.

While we could technically detect that the scheduler returned the same host that the instance is currently on (because API->conductor->scheduler are all RPC calls, the only time we cast is from conductor to the dest compute's prep_resize method) and return a 400 to the user, it's not really the user's fault, and it would be incorrect for certain types of compute drivers, like the vCenter driver.

A better short-term solution is decoupling the allow_resize_to_same_host configuration from cold migrate operations because compute drivers like libvirt can resize to the same host but cannot cold migrate to the same host, but the vCenter driver can do both.

A better long-term solution is to be smarter about asking placement for the computes that can perform a cold migration to the same host, as described here: https://review.opendev.org/#/c/666604/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/676022

Changed in nova:
assignee: Jing Zhang (jing.zhang.nokia) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Jing Zhang (<email address hidden>) on branch: master
Review: https://review.opendev.org/675025
Reason: replaced by https://review.opendev.org/#/c/676022/1

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/695220

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/676022
Reason: I'm just going to drop this backportable workaround option. If someone really needed this we could think about it as a stable-only change, but this is a really latent bug that not many have cared much about fixing, so it is not a high priority to backport and introduce the complexity that this brings. I'll drop this and focus on the master-only traits-based solution in https://review.opendev.org/#/c/695220/.

Changed in nova:
assignee: Matt Riedemann (mriedem) → melanie witt (melwitt)
melanie witt (melwitt)
Changed in nova:
assignee: melanie witt (melwitt) → Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Stephen Finucane (stephenfinucane)
Matt Riedemann (mriedem)
Changed in nova:
assignee: Stephen Finucane (stephenfinucane) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/695220
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4921e822e73383af0c8da4c5e3acfaa021eafe68
Submitter: Zuul
Branch: master

commit 4921e822e73383af0c8da4c5e3acfaa021eafe68
Author: Matt Riedemann <email address hidden>
Date: Wed Nov 20 10:27:18 2019 -0500

    Use COMPUTE_SAME_HOST_COLD_MIGRATE trait during migrate

    This uses the COMPUTE_SAME_HOST_COLD_MIGRATE trait in the API during a
    cold migration to filter out hosts that cannot support same-host cold
    migration, which is all of them except for the hosts using the vCenter
    driver.

    For any nodes that do not report the trait, we won't know if they don't
    because they don't support it or if they are not new enough to report
    it, so the API has a service version check and will fallback to old
    behavior using the config if the node is old. That compat code can be
    removed in the next release.

    As a result of this the FakeDriver capabilities are updated so the
    FakeDriver no longer supports same-host cold migration and a new fake
    driver is added to support that scenario for any tests that need it.

    Change-Id: I7a4b951f3ab324c666ab924e6003d24cc8e539f5
    Closes-Bug: #1748697
    Related-Bug: #1811235
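
The decision described in this commit message can be sketched as follows. This is only an illustration of the logic as worded above, not the merged code: the function name and its parameters are hypothetical, and only the trait name and the allow_resize_to_same_host option come from this report.

```python
SAME_HOST_TRAIT = "COMPUTE_SAME_HOST_COLD_MIGRATE"


def source_host_allowed(source_traits, source_is_new_enough,
                        allow_resize_to_same_host):
    """Decide whether the source host may be a cold migration target.

    Hypothetical helper mirroring the commit message:
    - a node that reports the trait supports same-host cold migration;
    - a node too old to report the trait falls back to the config option;
    - a new enough node without the trait is filtered out.
    """
    if SAME_HOST_TRAIT in source_traits:
        return True
    if not source_is_new_enough:
        return allow_resize_to_same_host
    return False


# New node without the trait (e.g. libvirt): source host is excluded even
# though allow_resize_to_same_host=True, as in the original report.
print(source_host_allowed(set(), True, True))

# Node reporting the trait (e.g. the vCenter driver): source host is kept.
print(source_host_allowed({SAME_HOST_TRAIT}, True, False))
```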

Changed in nova:
status: In Progress → Fix Released