migrate from stopped goes back to ACTIVE state after resize-confirm

Bug #1197514 reported by Matt Riedemann
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Matt Riedemann

Bug Description

On havana master, running with this setup:

[root@ngp02-hv1 ~]# nova host-list
+-----------------------------+-------------+----------+
| host_name | service | zone |
+-----------------------------+-------------+----------+
| CN10 | compute | nova |
| cn12 | compute | nova |
| cn4 | compute | nova |
| cn6 | compute | nova |
| cn8 | compute | nova |
| ngp02-hv1.private.cloud.com | cells | internal |
| ngp02-hv1.private.cloud.com | cert | internal |
| ngp02-hv1.private.cloud.com | compute | nova |
| ngp02-hv1.private.cloud.com | conductor | internal |
| ngp02-hv1.private.cloud.com | console | internal |
| ngp02-hv1.private.cloud.com | consoleauth | internal |
| ngp02-hv1.private.cloud.com | scheduler | internal |
+-----------------------------+-------------+----------+

The compute nodes are all running the hyper-v driver.

1. Boot an instance to CN10:

nova boot --image quantal-server-cloudimg-amd64-disk1 --flavor 2 --availability-zone nova:CN10 --user-data /home/testscripts/myuserdata.txt --config-drive true Ivt_July2_9

2. Then stop it:

[root@ngp02-hv1 testscripts]# nova stop Ivt_July2_9
[root@ngp02-hv1 testscripts]# nova show Ivt_July2_9
+-------------------------------------+----------------------------------------------------------------------------+
| Property | Value |
+-------------------------------------+----------------------------------------------------------------------------+
| status | SHUTOFF |
| updated | 2013-07-02T17:21:36Z |
| OS-EXT-STS:task_state | None |
| OS-EXT-SRV-ATTR:host | CN10 |
| key_name | None |
| image | quantal-server-cloudimg-amd64-disk1 (8495ff12-ee3c-4fe4-b46b-7b3b10641c87) |
| network1 network | 10.0.1.49 |
| hostId | 114d3be3249df7d93321ac50d4672be322989685e6f99d4cd8f87fb1 |
| OS-EXT-STS:vm_state | stopped |
| OS-EXT-SRV-ATTR:instance_name | instance-00000090 |
| OS-SRV-USG:launched_at | 2013-07-02T16:59:30.310000 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | CN10 |
| flavor | m1.small (2) |
| id | 6c9a28c8-c52c-4046-ad83-00e12ee0994d |
| security_groups | [{u'name': u'default'}] |
| OS-SRV-USG:terminated_at | None |
| user_id | b25a17a8fcfa44098e45e0cfe5ae4fee |
| name | Ivt_July2_9 |
| created | 2013-07-02T16:59:00Z |
| tenant_id | 066c47e2d09b440aad1845fcea9959ba |
| OS-DCF:diskConfig | MANUAL |
| metadata | {} |
| accessIPv4 | |
| accessIPv6 | |
| OS-EXT-STS:power_state | 4 |
| OS-EXT-AZ:availability_zone | nova |
| config_drive | 1 |
+-------------------------------------+----------------------------------------------------------------------------+

3. Then migrate it (moves from CN10 to cn4):

[root@ngp02-hv1 testscripts]# nova migrate Ivt_July2_9
[root@ngp02-hv1 testscripts]# nova show Ivt_July2_9
+-------------------------------------+----------------------------------------------------------------------------+
| Property | Value |
+-------------------------------------+----------------------------------------------------------------------------+
| status | VERIFY_RESIZE |
| updated | 2013-07-02T17:22:27Z |
| OS-EXT-STS:task_state | None |
| OS-EXT-SRV-ATTR:host | cn4 |
| key_name | None |
| image | quantal-server-cloudimg-amd64-disk1 (8495ff12-ee3c-4fe4-b46b-7b3b10641c87) |
| network1 network | 10.0.1.49 |
| hostId | aec601f205a7665ea600c0fed93cf50e9bb779c2d7625e4f7da23944 |
| OS-EXT-STS:vm_state | resized |
| OS-EXT-SRV-ATTR:instance_name | instance-00000090 |
| OS-SRV-USG:launched_at | 2013-07-02T17:22:34.143000 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | cn4 |
| flavor | m1.small (2) |
| id | 6c9a28c8-c52c-4046-ad83-00e12ee0994d |
| security_groups | [{u'name': u'default'}] |
| OS-SRV-USG:terminated_at | None |
| user_id | b25a17a8fcfa44098e45e0cfe5ae4fee |
| name | Ivt_July2_9 |
| created | 2013-07-02T16:59:00Z |
| tenant_id | 066c47e2d09b440aad1845fcea9959ba |
| OS-DCF:diskConfig | MANUAL |
| metadata | {} |
| accessIPv4 | |
| accessIPv6 | |
| progress | 0 |
| OS-EXT-STS:power_state | 4 |
| OS-EXT-AZ:availability_zone | nova |
| config_drive | 1 |
+-------------------------------------+----------------------------------------------------------------------------+

4. Then confirm the migration (resize-confirm). This is where the problem happens: the vm_state goes to ACTIVE even though the power_state is still 4 (SHUTDOWN):

[root@ngp02-hv1 testscripts]# nova show Ivt_July2_9
+-------------------------------------+----------------------------------------------------------------------------+
| Property | Value |
+-------------------------------------+----------------------------------------------------------------------------+
| status | ACTIVE |
| updated | 2013-07-02T17:23:03Z |
| OS-EXT-STS:task_state | None |
| OS-EXT-SRV-ATTR:host | cn4 |
| key_name | None |
| image | quantal-server-cloudimg-amd64-disk1 (8495ff12-ee3c-4fe4-b46b-7b3b10641c87) |
| network1 network | 10.0.1.49 |
| hostId | aec601f205a7665ea600c0fed93cf50e9bb779c2d7625e4f7da23944 |
| OS-EXT-STS:vm_state | active |
| OS-EXT-SRV-ATTR:instance_name | instance-00000090 |
| OS-SRV-USG:launched_at | 2013-07-02T17:22:34.143000 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | cn4 |
| flavor | m1.small (2) |
| id | 6c9a28c8-c52c-4046-ad83-00e12ee0994d |
| security_groups | [{u'name': u'default'}] |
| OS-SRV-USG:terminated_at | None |
| user_id | b25a17a8fcfa44098e45e0cfe5ae4fee |
| name | Ivt_July2_9 |
| created | 2013-07-02T16:59:00Z |
| tenant_id | 066c47e2d09b440aad1845fcea9959ba |
| OS-DCF:diskConfig | MANUAL |
| metadata | {} |
| accessIPv4 | |
| accessIPv6 | |
| progress | 0 |
| OS-EXT-STS:power_state | 4 |
| OS-EXT-AZ:availability_zone | nova |
| config_drive | 1 |
+-------------------------------------+----------------------------------------------------------------------------+
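
For reference, the mismatch in the output above can be expressed as a small check. This is only a sketch: the integer-to-name mapping is assumed to mirror the constants in nova.compute.power_state at the time, and the helper name is_consistent is hypothetical, not nova code.

```python
# Assumed mapping of nova power_state integers to names (illustrative only).
POWER_STATE_NAMES = {
    0: "NOSTATE",
    1: "RUNNING",
    3: "PAUSED",
    4: "SHUTDOWN",
    6: "CRASHED",
    7: "SUSPENDED",
}

# Hypothetical expectations: which power_state a given vm_state should pair with.
EXPECTED_POWER_STATE = {"active": "RUNNING", "stopped": "SHUTDOWN"}


def is_consistent(vm_state, power_state):
    """Return False when vm_state and power_state contradict each other."""
    if vm_state not in EXPECTED_POWER_STATE:
        return True  # no expectation recorded for this vm_state
    name = POWER_STATE_NAMES.get(power_state, "UNKNOWN")
    return EXPECTED_POWER_STATE[vm_state] == name


# Values taken from the nova show output in steps 2 and 4.
before_migrate = is_consistent("stopped", 4)
after_confirm = is_consistent("active", 4)
```

Under these assumptions, the state after step 2 is consistent and the state after step 4 is not.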

I checked the compute logs for CN10 and cn4. On the source host, CN10, I found an InstanceNotFound exception coming from the nova.virt.hyperv.vmops.get_info method (which is detailed in bug 1197506).

For some history here, change I19fa61d467edd5a7572040d084824972569ef65a fixed bug 1177811, where a user could resize/migrate an instance from a stopped state and it would go back to ACTIVE state after the resize/migrate was confirmed/rejected. As part of that change, when confirming the migration/resize of an instance that started out stopped, we check the power_state to see whether the user powered the instance on before confirming:

https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2300

For this bug, it looks like when we check the power_state of a migrated instance, the old host is asked for the state, but the instance has already moved, so we get the InstanceNotFound (which is swallowed here):

https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L698
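
The failure mode described above can be sketched roughly as follows. This is a simplified model and not the actual manager.py code: FakeDriver, get_power_state and vm_state_after_confirm are hypothetical stand-ins, and the power_state integers are assumed to match nova.compute.power_state.

```python
NOSTATE, RUNNING, SHUTDOWN = 0, 1, 4  # assumed power_state values


class InstanceNotFound(Exception):
    """Stand-in for nova's InstanceNotFound exception."""


class FakeDriver:
    """Hypothetical hypervisor driver that only knows its local instances."""

    def __init__(self, local_instances):
        self._local = local_instances  # instance name -> power_state

    def get_info(self, name):
        if name not in self._local:
            raise InstanceNotFound(name)
        return {"state": self._local[name]}


def get_power_state(driver, name):
    """Ask the hypervisor; swallow InstanceNotFound and report NOSTATE."""
    try:
        return driver.get_info(name)["state"]
    except InstanceNotFound:
        return NOSTATE


def vm_state_after_confirm(driver, name, old_vm_state):
    """Keep 'stopped' only if the hypervisor reports SHUTDOWN."""
    state = get_power_state(driver, name)
    if old_vm_state == "stopped" and state == SHUTDOWN:
        return "stopped"
    return "active"


# The confirm request is serviced on the source host, where the
# instance no longer exists after the migration.
source_host = FakeDriver({})
dest_host = FakeDriver({"instance-00000090": SHUTDOWN})

on_source = vm_state_after_confirm(source_host, "instance-00000090", "stopped")
on_dest = vm_state_after_confirm(dest_host, "instance-00000090", "stopped")
```

In this model the lookup on the source host falls through to NOSTATE, so the instance ends up "active", while the same lookup on the destination host would have kept it "stopped".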

Tags: compute
Matt Riedemann (mriedem) wrote :

[root@ngp02-hv1 ~]# nova availability-zone-list
+--------------------------------+----------------------------------------+
| Name | Status |
+--------------------------------+----------------------------------------+
| internal | available |
| |- ngp02-hv1.private.cloud.com | |
| | |- nova-cert | enabled :-) 2013-07-03T19:39:24.152893 |
| | |- nova-conductor | enabled :-) 2013-07-03T19:39:16.801537 |
| | |- nova-consoleauth | enabled :-) 2013-07-03T19:39:18.054133 |
| | |- nova-scheduler | enabled :-) 2013-07-03T19:39:17.842316 |
| | |- nova-console | enabled :-) 2013-07-03T19:39:24.467889 |
| | |- nova-cells | enabled :-) 2013-07-03T19:39:18.174124 |
| nova | available |
| |- cn6 | |
| | |- nova-compute | enabled :-) 2013-07-03T19:39:20.555113 |
| |- ngp02-hv1.private.cloud.com | |
| | |- nova-compute | enabled :-) 2013-07-03T19:39:20.805885 |
| |- cn4 | |
| | |- nova-compute | enabled :-) 2013-07-03T19:39:22.751993 |
| |- cn8 | |
| | |- nova-compute | enabled :-) 2013-07-03T19:39:17.992623 |
| |- CN10 | |
| | |- nova-compute | enabled :-) 2013-07-03T19:39:19.901133 |
| |- cn12 | |
| | |- nova-compute | enabled :-) 2013-07-03T19:39:18.504939 |
+--------------------------------+----------------------------------------+

tags: added: compute hyper-v
Matt Riedemann (mriedem)
tags: removed: hyper-v
Matt Riedemann (mriedem) wrote :

This is where the compute API is picking the source host to service the resize-confirm request:

https://github.com/openstack/nova/blob/master/nova/compute/api.py#L1970

Matt Riedemann (mriedem) wrote :

We talked about this in IRC today, and the consensus is that we should never support the case where the user stops the instance, migrates it, powers it on via the hypervisor (below OpenStack), and then confirms it, leaving us to figure out the current power state in the hypervisor. This makes fixing the bug pretty easy, because I can just get the power_state from the database rather than from the hypervisor. I'll post a fix shortly.
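
A rough sketch of that approach, assuming the instance record loaded from the database is a dict with a power_state field. The function name vm_state_from_db_record and the record layout are hypothetical and only illustrate the idea; this is not the proposed patch.

```python
SHUTDOWN = 4  # assumed value of nova's power_state.SHUTDOWN


def vm_state_from_db_record(instance):
    """Pick the post-confirm vm_state from the stored record alone.

    No hypervisor call is made, so it does not matter which compute
    host services the confirm request.
    """
    if instance["power_state"] == SHUTDOWN:
        return "stopped"
    return "active"


# Hypothetical record resembling the instance from this report.
record = {
    "name": "Ivt_July2_9",
    "host": "cn4",
    "vm_state": "resized",
    "power_state": 4,
}
new_vm_state = vm_state_from_db_record(record)
```

With the stored power_state of 4, this sketch keeps the instance "stopped" after the confirm.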

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/35554

Changed in nova:
status: New → In Progress
Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/35554
Committed: http://github.com/openstack/nova/commit/fbe792902476fe7f508204c810050157dcc99c0d
Submitter: Jenkins
Branch: master

commit fbe792902476fe7f508204c810050157dcc99c0d
Author: Matt Riedemann <email address hidden>
Date: Wed Jul 3 14:39:13 2013 -0700

    Fix power_state lookup in confirm_resize

    Previously the compute manager's confirm_resize method was trying to get
    the current power_state of the resized/migrated instance from the
    hypervisor layer to account for a case where the user migrated from a
    stopped instance and then powered it on via the hypervisor (below
    openstack) to test it before confirming the migration. This doesn't
    work in the migrate case because the confirm_resize request is serviced
    on the source compute node, and the instance no longer lives there so
    _get_power_state results in an InstanceNotFound. The vm_state would
    then be set to ACTIVE even though the instance's power_state is still
    SHUTDOWN.

    This patch fixes this problem by just getting the instance power_state
    from the database since the scenario of powering on the migrated
    instance outside of openstack is not supported.

    Fixes bug 1197514

    Change-Id: If2ea83cd317aa8897b7b6bd6b43a2067e1074e33

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → havana-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-2 → 2013.2