Nova compute service exception during cold migration: virtual machine stuck in resize state.

Bug #1856925 reported by wang
This bug affects 3 people
Affects                                Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)               Fix Released  Low         Matt Riedemann
OpenStack Compute (nova) train series  Fix Released  Low         Lee Yarwood

Bug Description

Description:
 If the nova-compute service encounters a problem, for example the service being down, an instance undergoing cold migration gets stuck in the resize state and cannot be evacuated. The nova API accepts the request and updates the server status and task state, but the compute service never receives the message, so the server remains in the RESIZE state. Once nova-compute is restarted, the server goes to ERROR. It is recommended to add validation to prevent instances from entering such inoperable states.
  This can also happen with commands such as stop/rebuild/reboot.
Environment:
1. OpenStack Queens; nova --version: 9.1.1

2. hypervisor: Libvirt + KVM

3. One control node, two compute nodes.

Tags: resize
Revision history for this message
Matt Riedemann (mriedem) wrote :

For resize, do you mean when the source compute is down? Because the scheduler should filter out any destination computes that are down.

What error traceback/log messages are you seeing when this happens and where - the destination compute's prep_resize method? Conductor? Other?

Also, what server side release specifically? 9.1.1 looks like the version of python-novaclient but we need to know the server side version (queens?).

https://review.opendev.org/#/c/699291/ sounds similar but that's a case where the source compute is down while confirming a resize.

tags: added: resize
Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

Ignore comment 2, that was meant for another bug.

summary: Nova compute service exception that performs cold migration virtual
- machine card in resize state.
+ machine stuck in resize state.
Revision history for this message
Matt Riedemann (mriedem) wrote :

I was unable to recreate the resize issue in a devstack created from master today.

I had 2 compute services, created a server, then stopped the source compute service on which the instance was running, then tried resizing the server to the other host.

It basically hung because the dest host's prep_resize routine tries to do an asynchronous RPC cast to the source compute to power off the instance and start transferring disks but the source service is down so it doesn't process the message.

The server is still active but the task_state is stuck in resize_prep:

| OS-EXT-STS:task_state | resize_prep
| OS-EXT-STS:vm_state | active

So the server doesn't go to ERROR status (unless eventually the RPC cast failure would result in an exception from oslo.messaging) but obviously there is a problem, I agree with that.

Do you have details on what fails in your recreate and have logs for it? Otherwise it seems the simplest solution here is the API should check that the source compute service is up before initiating the resize/cold migrate operation.
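
For illustration, a minimal sketch of the pre-flight check suggested above, assuming illustrative class and helper names rather than the actual Nova code: before starting the resize/cold migrate, look up the service record for the instance's current host and reject the request if that service is not up.

# A minimal sketch, not the actual Nova code: reject a resize/cold migrate up
# front when the compute service on the instance's current host is down,
# instead of letting the prep_resize cast sit unprocessed on the queue.
# Class and method names here are illustrative.
from nova import exception
from nova import objects
from nova import servicegroup


class ComputeAPISketch(object):

    def __init__(self):
        self.servicegroup_api = servicegroup.API()

    def _assert_source_service_is_up(self, context, instance):
        """Raise if the compute service hosting the instance is not up."""
        service = objects.Service.get_by_compute_host(context, instance.host)
        if not self.servicegroup_api.service_is_up(service):
            # The REST layer would translate this into an error response
            # instead of leaving the instance stuck in resize_prep.
            raise exception.ServiceUnavailable()

    def resize(self, context, instance, flavor_id=None, **kwargs):
        """Resize/cold migrate entry point with the pre-flight check."""
        self._assert_source_service_is_up(context, instance)
        # ...continue with the normal resize/cold migrate path, which will
        # cast prep_resize to the selected destination compute...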

Matt Riedemann (mriedem)
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: New → In Progress
importance: Undecided → Low
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/700062

Revision history for this message
wang (jiajing) wrote :

OpenStack Queens.

Revision history for this message
wang (jiajing) wrote :

After restarting the source compute service, the service is reported as up again, but the instance ends up in ERROR status.

Revision history for this message
wang (jiajing) wrote :

Is it possible to add validation of the compute service state before acting on the instance?

Thank you very much.

Revision history for this message
wang (jiajing) wrote :

root@wangjj-barnican-ctl:~# openstack server resize test-resize1 --flavor test00
root@wangjj-barnican-ctl:~# nova list
+--------------------------------------+--------------+--------+-------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State  | Power State | Networks               |
+--------------------------------------+--------------+--------+-------------+-------------+------------------------+
| cfc87583-a6c9-4d19-bb1b-47023853bb5e | test-resize1 | RESIZE | resize_prep | Running     | selfservice=172.16.1.4 |
+--------------------------------------+--------------+--------+-------------+-------------+------------------------+
root@wangjj-barnican-ctl:~# date
Tue Dec 24 10:46:48 CST 2019
root@wangjj-barnican-ctl:~# date
Tue Dec 24 11:04:05 CST 2019
root@wangjj-barnican-ctl:~# nova list
+--------------------------------------+--------------+--------+-------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State  | Power State | Networks               |
+--------------------------------------+--------------+--------+-------------+-------------+------------------------+
| cfc87583-a6c9-4d19-bb1b-47023853bb5e | test-resize1 | RESIZE | resize_prep | Running     | selfservice=172.16.1.4 |
+--------------------------------------+--------------+--------+-------------+-------------+------------------------+
****
root@wangjj-barnican-cmp02:~# service nova-compute start
****

root@wangjj-barnican-ctl:~# nova list
+--------------------------------------+--------------+--------+------------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State       | Power State | Networks               |
+--------------------------------------+--------------+--------+------------------+-------------+------------------------+
| cfc87583-a6c9-4d19-bb1b-47023853bb5e | test-resize1 | ERROR  | resize_migrating | Running     | selfservice=172.16.1.4 |
+--------------------------------------+--------------+--------+------------------+-------------+------------------------+

Revision history for this message
Matt Riedemann (mriedem) wrote :

Are you restarting the destination or source compute service? In other words, are you restarting the service where the resized instance is running before confirming the resize?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/700062
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ea2ea492a3d046d53d44039206fff69fe7e3ac61
Submitter: Zuul
Branch: master

commit ea2ea492a3d046d53d44039206fff69fe7e3ac61
Author: Matt Riedemann <email address hidden>
Date: Thu Dec 19 13:29:50 2019 -0500

    Ensure source service is up before resizing/migrating

    If the source compute service is down when a resize or
    cold migrate is initiated the prep_resize cast from the
    selected destination compute service to the source will
    fail/hang. The API can validate the source compute service
    is up or fail the operation with a 409 response if the
    source service is down. Note that a host status of
    "MAINTENANCE" means the service is up but disabled by
    an administrator which is OK for resize/cold migrate.

    The solution here works the validation into the
    check_instance_host decorator which surprisingly isn't
    used in more places where the source host is involved
    like reboot, rebuild, snapshot, etc. This change just
    handles the resize method but is done in such a way that
    the check_instance_host decorator could be applied to
    those other methods and perform the is-up check as well.
    The decorator is made backward compatible by default.

    Note that Instance._save_services is added because during
    resize the Instance is updated and the services field
    is set but not actually changed, but Instance.save()
    handles object fields differently so we need to implement
    the no-op _save_services method to avoid a failure.

    Change-Id: I85423c7bcacff3bc465c22686d0675529d211b59
    Closes-Bug: #1856925
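
A minimal sketch of the decorator approach the commit message describes, assuming the flag name check_is_up and other details for illustration rather than reproducing the merged change:

# Illustrative sketch, not the merged Nova code: check_instance_host grows an
# optional is-up check, disabled by default for backward compatibility, and
# the resize method opts in.
import functools

from nova import exception
from nova import objects


def check_instance_host(check_is_up=False):
    """Require instance.host to be set; optionally also require that the
    compute service on that host is up."""
    def outer(fn):
        @functools.wraps(fn)
        def wrapper(self, context, instance, *args, **kwargs):
            if not instance.host:
                raise exception.InstanceNotReady(instance_id=instance.uuid)
            if check_is_up:
                service = objects.Service.get_by_compute_host(
                    context, instance.host)
                if not self.servicegroup_api.service_is_up(service):
                    # Surfaces as a 409 in the REST API layer. A host in
                    # "MAINTENANCE" (up but disabled) still passes, since
                    # service_is_up() reflects liveness, not disabled state.
                    raise exception.ServiceUnavailable()
            return fn(self, context, instance, *args, **kwargs)
        return wrapper
    return outer


class ComputeAPISketch(object):

    @check_instance_host(check_is_up=True)
    def resize(self, context, instance, flavor_id=None, **kwargs):
        # Rejected before any RPC if the source compute service is down.
        pass

Keeping the is-up check off by default means existing callers of the decorator keep their behaviour, while resize (and, potentially, other operations such as reboot or rebuild) can opt in.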

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/701757

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/701757
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=938b499b1f427c3464602a28a63fe96056f1df25
Submitter: Zuul
Branch: stable/train

commit 938b499b1f427c3464602a28a63fe96056f1df25
Author: Matt Riedemann <email address hidden>
Date: Thu Dec 19 13:29:50 2019 -0500

    Ensure source service is up before resizing/migrating

    If the source compute service is down when a resize or
    cold migrate is initiated the prep_resize cast from the
    selected destination compute service to the source will
    fail/hang. The API can validate the source compute service
    is up or fail the operation with a 409 response if the
    source service is down. Note that a host status of
    "MAINTENANCE" means the service is up but disabled by
    an administrator which is OK for resize/cold migrate.

    The solution here works the validation into the
    check_instance_host decorator which surprisingly isn't
    used in more places where the source host is involved
    like reboot, rebuild, snapshot, etc. This change just
    handles the resize method but is done in such a way that
    the check_instance_host decorator could be applied to
    those other methods and perform the is-up check as well.
    The decorator is made backward compatible by default.

    Note that Instance._save_services is added because during
    resize the Instance is updated and the services field
    is set but not actually changed, but Instance.save()
    handles object fields differently so we need to implement
    the no-op _save_services method to avoid a failure.

    Conflicts:
        nova/api/openstack/compute/migrate_server.py
        nova/api/openstack/compute/servers.py
        nova/tests/unit/compute/test_compute_api.py
        nova/tests/functional/wsgi/test_servers.py

    NOTE(lyarwood): Conflicts as I8c96b337f32148f8f5899c9b87af331b1fa41424,
    I711e56bcb4b72605253fa63be230a68e03e45b84,
    I098f91d8c498e5a85266e193ad37c08aca4792b2 and
    I19db48bd03855d1a1edbeff5adf15a28abcb5d92 are not in stable/train.

    Change-Id: I85423c7bcacff3bc465c22686d0675529d211b59
    Closes-Bug: #1856925
    (cherry picked from commit ea2ea492a3d046d53d44039206fff69fe7e3ac61)
