Instance stuck at spawning status after rescheduling

Bug #1327124 reported by Alex Xu
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Alex Xu

Bug Description

After the fix for bug https://bugs.launchpad.net/nova/+bug/1326207 , rescheduling works.

But the instance gets stuck in the spawning state after rescheduling.

Environment:
os3 is running the ovs agent
os4 is running the linuxbridge agent

When a new instance is booted, it is scheduled to os3, where the build fails.
Nova then reschedules the instance to os4.

But the instance remains stuck in the spawning state.

os@os3:~/devstack$ nova show vm8
+--------------------------------------+----------------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | os4 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | os4 |
| OS-EXT-SRV-ATTR:instance_name | instance-00000034 |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | spawning |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2014-06-06T09:18:38Z |
| flavor | m1.tiny (1) |
| hostId | bf5039d22c8737b688f12f58afef5cde768f38732c47e48403932b44 |
| id | b6f6d043-1cfc-4c6d-914d-df4a48134589 |
| image | cirros-0.3.2-x86_64-uec (855f5370-8c60-4840-990d-f89b87bd3d19) |
| key_name | - |
| metadata | {} |
| name | vm8 |
| net1 network | 13.0.0.33 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | BUILD |
| tenant_id | 5c76520922254aa0a1459dc687bcbc1d |
| updated | 2014-06-06T09:18:40Z |
| user_id | 9be4442569e54c15ac6bab7d30af7d8f |
+--------------------------------------+----------------------------------------------------------------+

The os4 linuxbridge agent logs this error:

2014-06-06 17:18:46.570 ERROR neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent [-] Error in agent loop. Devices info: {'current': set(['tap15c2b6bc-51', 'tap2723ce44-87', 'tap484a667a-b1', 'tap8641bd9a-64', 'tap78fef3dd-1f', 'tap66d5a70e-0d']), 'removed': set([]), 'added': set(['tap15c2b6bc-51', 'tap2723ce44-87', 'tap484a667a-b1', 'tap8641bd9a-64', 'tap78fef3dd-1f', 'tap66d5a70e-0d'])}
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent Traceback (most recent call last):
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent File "/opt/stack/neutron/neutron/plugins/linuxbridge/agent/linuxbridge_neutron_agent.py", line 1011, in daemon_loop
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent sync = self.process_network_devices(device_info)
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent File "/opt/stack/neutron/neutron/plugins/linuxbridge/agent/linuxbridge_neutron_agent.py", line 908, in process_network_devices
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent resync_a = self.treat_devices_added(device_info['added'])
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent File "/opt/stack/neutron/neutron/plugins/linuxbridge/agent/linuxbridge_neutron_agent.py", line 946, in treat_devices_added
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent details['port_id']):
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent File "/opt/stack/neutron/neutron/plugins/linuxbridge/agent/linuxbridge_neutron_agent.py", line 424, in add_interface
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent tap_device_name)
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent File "/opt/stack/neutron/neutron/plugins/linuxbridge/agent/linuxbridge_neutron_agent.py", line 406, in add_tap_interface
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent root_helper=self.root_helper):
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent File "/opt/stack/neutron/neutron/agent/linux/utils.py", line 76, in execute
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent raise RuntimeError(m)
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent RuntimeError:
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'brctl', 'addif', 'brq4103cd57-70', 'tap8641bd9a-64']
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent Exit code: 1
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent Stdout: ''
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent Stderr: "device tap8641bd9a-64 is already a member of a bridge; can't enslave it to bridge brq4103cd57-70.\n"
2014-06-06 17:18:46.570 TRACE neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent

Check the port status:

os@os3:~/devstack$ neutron port-show 8641bd9a-64b2-439f-a9bd-8644787d25b6
+-----------------------+----------------------------------------------------------------------------------+
| Field | Value |
+-----------------------+----------------------------------------------------------------------------------+
| admin_state_up | True |
| allowed_address_pairs | |
| binding:host_id | os3 |
| binding:profile | {} |
| binding:vif_details | {"port_filter": true, "ovs_hybrid_plug": true} |
| binding:vif_type | ovs |
| binding:vnic_type | normal |
| device_id | b6f6d043-1cfc-4c6d-914d-df4a48134589 |
| device_owner | compute:None |
| extra_dhcp_opts | |
| fixed_ips | {"subnet_id": "c26683fc-122f-4c30-bd05-f27f0499c6af", "ip_address": "13.0.0.33"} |
| id | 8641bd9a-64b2-439f-a9bd-8644787d25b6 |
| mac_address | fa:16:3e:4e:69:3b |
| name | |
| network_id | 4103cd57-70d9-4aa4-9501-5441585278f5 |
| security_groups | 8a836d5b-27fc-499f-9b7d-7d279afaca3d |
| status | BUILD |
| tenant_id | 5c76520922254aa0a1459dc687bcbc1d |
+-----------------------+----------------------------------------------------------------------------------+

The port is still bound to host os3 (binding:host_id), even though the instance was rescheduled to os4.
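In other words, nova moved the instance but nothing moved the port binding. A minimal sketch of that stale state (illustrative names and data structures only, not actual nova/neutron code):

```python
# Minimal model of the stale-binding state shown above; names are
# illustrative, not real nova/neutron internals.

def reschedule_without_rebind(port, new_host):
    """Simulate a reschedule that forgets to update the port binding."""
    # Nova moves the instance to the new host ...
    instance_host = new_host
    # ... but the neutron port binding is never touched.
    return instance_host, port["binding:host_id"]

port = {
    "id": "8641bd9a-64b2-439f-a9bd-8644787d25b6",
    "binding:host_id": "os3",     # written when the port was bound on os3
    "binding:vif_type": "ovs",    # an OVS binding, useless to os4's linuxbridge agent
    "status": "BUILD",
}

instance_host, bound_host = reschedule_without_rebind(port, "os4")
print(instance_host, bound_host)  # os4 os3
```

The instance now lives on os4 while neutron still binds the port to os3, so the agent on os4 can never bring the port to ACTIVE and the instance never leaves the spawning state.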

Alex Xu (xuhj)
Changed in nova:
assignee: nobody → Alex Xu (xuhj)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/98340

Changed in nova:
status: New → In Progress
Alex Xu (xuhj)
Changed in nova:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/116578

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/116579

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.openstack.org/116579
Reason: Merge into https://review.openstack.org/#/c/116578/4

Revision history for this message
Mark Wu (wudx05) wrote :

I also hit this issue, but in the case of evacuating an instance. The symptom I saw is that the VM couldn't get an IP address after being evacuated. The cause is that neutron refused to activate the port on the new host, since the port was still owned by the original host from before the evacuation. Without that activation, the bridge forwarding entries can't be populated on the new host, so the DHCP request packets from the instance can't reach the DHCP server. So in all cases that launch an instance on a different host (unshelve, evacuate, and so on), nova has to call the neutron API to update the port binding with the new host, because neutron has no idea of the change.

I have verified the approach in https://review.openstack.org/#/c/116578/4 by calling the newly added update API in the evacuation code path. In my opinion this is a very severe issue and it should be fixed as soon as possible.
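The fix Mark describes boils down to one port-update call per port whenever an instance lands on a different host. A hedged sketch (the FakeNeutron class below is a stand-in; the real call goes through the neutron client's update_port, i.e. PUT /v2.0/ports/{id}):

```python
class FakeNeutron:
    """Stand-in for the neutron API, modeling update_port at a very
    high level. Real code would use python-neutronclient instead."""
    def __init__(self, ports):
        self.ports = {p["id"]: p for p in ports}

    def update_port(self, port_id, body):
        self.ports[port_id].update(body["port"])
        # Rebinding resets the port so the new host's L2 agent can wire
        # it up and report it ACTIVE again.
        self.ports[port_id]["status"] = "DOWN"
        return self.ports[port_id]

def rebind_ports_on_move(neutron, instance_ports, new_host):
    """What nova must do whenever an instance moves hosts
    (reschedule, evacuate, unshelve, rebuild)."""
    for port in instance_ports:
        neutron.update_port(port["id"],
                            {"port": {"binding:host_id": new_host}})

neutron = FakeNeutron([{"id": "8641bd9a", "binding:host_id": "os3",
                        "status": "BUILD"}])
rebind_ports_on_move(neutron, list(neutron.ports.values()), "os4")
print(neutron.ports["8641bd9a"]["binding:host_id"])  # os4
```

Once the binding points at the new host, that host's agent can activate the port and populate the bridge forwarding entries, which is exactly the step that was missing in the evacuation and reschedule paths.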

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/116578
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7368d9f6e9e94819b199d0f2cc1ac69706870586
Submitter: Jenkins
Branch: master

commit 7368d9f6e9e94819b199d0f2cc1ac69706870586
Author: He Jie Xu <email address hidden>
Date: Thu Aug 21 15:27:11 2014 +0800

    Add setup/cleanup_instance_network_on_host api for neutron/nova-network

    When an instance is rescheduled, unshelved, evacuated, or rebuilt,
    it is moved from one host to another, and its network resources
    must be updated for the new host. The existing APIs
    migrate_instance_start/stop should, as their names suggest, stay
    dedicated to the migration case, so this adds new APIs for
    updating network resources when an instance is moved.

    This patch implements the new APIs for both neutron and
    nova-network.

    Change-Id: I513e0a08b32aa7f38c480488b398cd97d8cdc471
    Related-bug: #1327124
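As a rough sketch of what this pair of calls does (the signatures below are simplified assumptions; real nova passes a request context and Instance objects, and the neutron implementation goes through the port-binding API):

```python
class NetworkAPI:
    """Very simplified sketch of the setup/cleanup pair added by the
    commit above; dicts stand in for nova's Instance objects."""

    def setup_instance_network_on_host(self, instance, host):
        # neutron flavor: point every port binding at the new host
        for port in instance["ports"]:
            port["binding:host_id"] = host

    def cleanup_instance_network_on_host(self, instance, host):
        # nova-network multihost flavor: drop host-local state such as
        # floating IP rules on the host the instance is leaving
        # (modeled here as a set of hosts holding network state)
        instance["hosts_with_network_state"].discard(host)

# Moving an instance from os3 to os4:
instance = {
    "ports": [{"id": "8641bd9a", "binding:host_id": "os3"}],
    "hosts_with_network_state": {"os3"},
}
api = NetworkAPI()
api.cleanup_instance_network_on_host(instance, "os3")
api.setup_instance_network_on_host(instance, "os4")
```

The point of the symmetric pair is that every code path that moves an instance can call cleanup on the host being left and setup on the host being entered, instead of overloading the migration-specific APIs.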

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/98340
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=05cd4824a9749c6067bb3eddc0994126d8940941
Submitter: Jenkins
Branch: master

commit 05cd4824a9749c6067bb3eddc0994126d8940941
Author: He Jie Xu <email address hidden>
Date: Thu Aug 21 21:17:08 2014 +0800

    Update network resource when rescheduling instance

    When an instance build fails, the compute manager tries to
    reschedule the failed instance. If network resources were already
    allocated before the failure, they need to be updated on both the
    current compute node and the next one.

    Before rescheduling the failed instance, for nova-network with
    multihost, the floating IP must be cleaned up on the current
    compute node, because the network was already set up there.

    After rescheduling, since the network is already allocated, the
    compute manager won't allocate it again. For neutron, the port
    binding info must be updated to the new compute node. For
    nova-network with multihost, the floating IP must be set up on the
    new compute node.

    Change-Id: Id4822ca0a902b35396025bde2a16411a02c46e0e
    Closes-Bug: #1327124
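Putting the two commits together, the rescheduling flow roughly becomes the following (a simplified sketch, not the actual compute manager; `spawn` stands in for the hypervisor build and the dict field for the setup/cleanup network calls):

```python
def build_instance(instance, candidate_hosts, spawn):
    """Try each candidate host in order; update the network for the
    current host before each attempt, and clean it up on failure
    before handing the instance to the next host."""
    for host in candidate_hosts:
        instance["network_host"] = host    # setup_instance_network_on_host
        if spawn(host):
            instance["host"] = host
            return host
        # build failed: release host-side network state before retrying
        instance["network_host"] = None    # cleanup_instance_network_on_host
    raise RuntimeError("exhausted all candidate hosts")

# The scenario in this bug: os3 fails, os4 succeeds.
inst = {"network_host": None}
chosen = build_instance(inst, ["os3", "os4"], spawn=lambda h: h == "os4")
print(chosen)  # os4
```

With the fix, by the time the instance spawns on os4 its port bindings already point at os4, so the linuxbridge agent there can activate the ports instead of hitting the "already a member of a bridge" conflict shown above.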

Changed in nova:
status: In Progress → Fix Committed
Paul Murray (pmurray)
tags: added: juno-backport-potential
Thierry Carrez (ttx)
Changed in nova:
milestone: none → kilo-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-2 → 2015.1.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/184796

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/juno)

Related fix proposed to branch: stable/juno
Review: https://review.openstack.org/184801

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/juno)

Change abandoned by Michal Jura (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/184796
Reason: This will not work out with Juno

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Michal Jura (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/184801
Reason: This will not work out with Juno

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/juno)

Related fix proposed to branch: stable/juno
Review: https://review.openstack.org/215160

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/juno)

Change abandoned by Roman Podoliaka (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/215160
