Evacuation fails with VirtualInterfaceCreateException

Bug #1624052 reported by Artom Lifshitz
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Undecided
Assigned to: Artom Lifshitz
Milestone: (none)

Bug Description

Description:

With Neutron, evacuating an instance results in a VirtualInterfaceCreateException and the evacuation fails.

Steps to reproduce:

1. Boot a VM with a Neutron NIC.
2. Cause the underlying host to be down, for example by stopping the compute service.
3. Evacuate the VM.
4. The evacuation fails because the destination compute host waits for a network-vif-plugged event from Neutron that it never receives, since Nova dispatches the event to the source compute instead.

Environment: Neutron, KVM
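
The failure in step 4 is a timeout while waiting for the event on the destination compute. A generic, stdlib-only illustration of that pattern follows; it is not nova's actual implementation, and the names in it are hypothetical:

    # Generic illustration of the failure mode in step 4; NOT nova's actual
    # implementation, just the pattern it follows.
    import threading

    class VirtualInterfaceCreateException(Exception):
        """Raised when the expected network-vif-plugged event never arrives."""

    def wait_for_vif_plugged(plug_event: threading.Event,
                             timeout: float = 300.0) -> None:
        # 'plug_event' would be set by the external-event handler on this host
        # when Neutron reports the VIF as plugged. During evacuation the event
        # is dispatched to the (down) source host instead, so the wait on the
        # destination always times out.
        if not plug_event.wait(timeout):
            raise VirtualInterfaceCreateException(
                'Timed out waiting for network-vif-plugged')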

Revision history for this message
Matt Riedemann (mriedem) wrote :

Might want to talk to Chris Friesen, I thought he worked on a fix for this a long time ago.

tags: added: evacuate neutron
Revision history for this message
Matt Riedemann (mriedem) wrote :

Have you recreated this on master?

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like a duplicate of bug 1430042 which is marked as fixed and the fix was backported through to stable/liberty.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Correction, https://review.openstack.org/#/c/169827/ merged in the Liberty GA, so the fix has been available since 12.0.0.

Revision history for this message
Matt Riedemann (mriedem) wrote :

lyarwood pointed out it might actually be bug 1535918 you want.

Revision history for this message
Lee Yarwood (lyarwood) wrote :
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Can we at least have the stack trace to compare with the proposed fix?
Setting this to Incomplete for now, but if it turns out to be a duplicate of bug 1535918, we'll update that one instead.

Changed in nova:
status: New → Incomplete
Changed in nova:
assignee: nobody → Artom Lifshitz (notartom)
status: Incomplete → In Progress
Revision history for this message
Artom Lifshitz (notartom) wrote :

I spent some time adding LOG.debug statements all over the place while trying to reproduce this bug, and then read bug 1535918. They're the same bug.

To summarise:

a. When spawning a VM, the libvirt driver waits for events from Neutron [1]
b. What events it waits for is decided by _get_neutron_events [2]
c. _get_neutron_events will only return a vif-plugged event if the vif isn't active [3] (see the sketch after the references below)
d. The network_info that _get_neutron_events gets passed in [2] ultimately comes from [4]
e. If the instance's info cache hasn't been refreshed recently when step d runs, the vif will still appear
   active, and _get_neutron_events will return an empty list in step c
f. When it comes time to actually wait for events in [5], no waiting is done because the events list
   is empty.
g. The instance boots up just fine even if the vif-plugged event never makes its way to the
   destination compute. Instead, it's received by the source compute service whenever it comes back
   online.

So really, evacuation was always broken, but we got away with it just enough times for nobody to notice because we sometimes got "lucky" with a stale cache (which is actually a bug).

[1] https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L4824
[2] https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L4817
[3] https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L4787
[4] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2810
[5] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L475
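
As a concrete illustration of steps b and c, here is a minimal sketch of the event filtering; it mirrors the intent of _get_neutron_events at [3] but is not the exact nova source:

    # Sketch of the filtering in steps b-c: only VIFs that the (possibly
    # stale) info cache reports as inactive produce an event to wait for.
    # If the cache still says the VIF is active, the list is empty and
    # spawn() never waits (step f).
    def _get_neutron_events(network_info):
        return [('network-vif-plugged', vif['id'])
                for vif in network_info
                if vif.get('active', True) is False]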

Snippets of logs follow, but with some LOG.debug statements that I put in, so not everything will appear the same in a vanilla devstack:

_get_neutron_events:

2016-10-08 11:31:01.824 DEBUG nova.virt.libvirt.driver [req-6a93955f-cc1e-4918-af95-53a7fa1d81fe admin admin] [{"profile": {}, "ovs_interfaceid": "aaac55e8-92c8-4280-a871-fab899661a30", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.10"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}, {"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8:8000:0:f816:3eff:fe0b:e6c2"}], "version": 6, "meta": {"dhcp_server": "2001:db8:8000:0:f816:3eff:fec1:b2e"}, "dns": [], "routes": [], "cidr": "2001:db8:8000::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8:8000::1"}}], "meta": {"injected": false, "tenant_id": "c145e829d46745bf822fac809223b61e", "mtu": 1450}, "id": "296e8f10-41f9-4b81-ba40-272aa8603d3b", "label": "private"}, "devname": "tapaaac55e8-92", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "ovs_hybrid_plug": true}, "address": "fa:16:3e:0b:e6:c2", "active": true, "type": "ovs", "id": "aaac55e8-92c8-4280-a871-fab899661a30", "qbg_params": null}] from (pid=17819) _get_neutron_events /opt/stack/nova/nova/virt/libvirt/driver.py:4786

2016-10-08 11:31:01.824 DEBUG nova.virt.libvirt.driver [req-6a93955f-cc1e-4918-...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/385086

Revision history for this message
Artom Lifshitz (notartom) wrote :

I thought and expected that refreshing info_cache right before [1] would reliably break evacuations unless external event dispatching was also fixed. This didn't turn out to be the case.

My original idea was that something was updating the network_info in the database - changing the port from active to down - but that the update wasn't being reflected in the info_cache. To catch that elusive "something", I decorated InstanceInfoCache's save method with a wrapper that logged the old and new values of network_info. I ran a couple of evacuations with the new decorator, and every time the save method was called, both the old and new values of network_info indicated the port was active.
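
For reference, a rough sketch of that kind of wrapper, assuming nova's InstanceInfoCache object and oslo.log; the exact decorator used while debugging may have differed:

    # Debug-only sketch: monkey-patch InstanceInfoCache.save() so each call
    # logs the network_info being persisted. Assumes nova's object layout;
    # not the exact wrapper used while investigating this bug.
    import functools

    from oslo_log import log as logging
    from nova.objects.instance_info_cache import InstanceInfoCache

    LOG = logging.getLogger(__name__)

    _original_save = InstanceInfoCache.save

    @functools.wraps(_original_save)
    def _logging_save(self, *args, **kwargs):
        LOG.debug('InstanceInfoCache.save: network_info=%s', self.network_info)
        return _original_save(self, *args, **kwargs)

    InstanceInfoCache.save = _logging_save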

I then modified [1] to log network_info as it appears in the info_cache and as it's returned by the network API. The network API showed the port as down, while it was still up in the info_cache:

2016-10-08 17:22:15.829 DEBUG nova.compute.manager [req-39dec9fa-38b6-4240-a278-cf6880268dbb admin demo] API network info: [{"profile": {}, "ovs_interfaceid": "00750ea9-c8c3-451b-8d28-884f2e26f510", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.9"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}, {"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8:8000:0:f816:3eff:fe42:9355"}], "version": 6, "meta": {"dhcp_server": "2001:db8:8000:0:f816:3eff:fec1:b2e"}, "dns": [], "routes": [], "cidr": "2001:db8:8000::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8:8000::1"}}], "meta": {"injected": false, "tenant_id": "c145e829d46745bf822fac809223b61e", "mtu": 1450}, "id": "296e8f10-41f9-4b81-ba40-272aa8603d3b", "label": "private"}, "devname": "tap00750ea9c8", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "ovs_hybrid_plug": true}, "address": "fa:16:3e:42:93:55", "active": false, "type": "ovs", "id": "00750ea9-c8c3-451b-8d28-884f2e26f510", "qbg_params": null}] from (pid=25905) _do_rebuild_instance /opt/stack/nova/nova/compute/manager.py:2812

2016-10-08 17:22:15.830 DEBUG nova.compute.manager [req-39dec9fa-38b6-4240-a278-cf6880268dbb admin demo] Info cache network info: [{"profile": {}, "ovs_interfaceid": "00750ea9-c8c3-451b-8d28-884f2e26f510", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.9"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}, {"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8:8000:0:f816:3eff:fe42:9355"}], "version": 6, "meta": {"dhcp_server": "2001:db8:8000:0:f816:3eff:fec1:b2e"}, "dns": [], "routes": [], "cidr": "2001:db8:8000::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8:8000::1"}}...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/371048
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a5b920a197c70d2ae08a1e1335d979857f923b4f
Submitter: Jenkins
Branch: master

commit a5b920a197c70d2ae08a1e1335d979857f923b4f
Author: Artom Lifshitz <email address hidden>
Date: Wed Oct 5 14:37:03 2016 -0400

    Send events to all relevant hosts if migrating

    Previously, external events were sent to the instance object's host
    field. This patch fixes the external event dispatching to check for
    migration. If an instance is being migrated, the source and
    destination compute are added to the set of hosts to which the event
    is sent.

    Change-Id: If00736ab36df4a5a3be4f02b0a550e4bcae77b1b
    Closes-bug: 1535918
    Closes-bug: 1624052
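
A rough sketch of the behavior this commit describes; all names below are hypothetical illustrations rather than nova's real objects or API:

    # Sketch of the dispatching change: when an instance has an in-progress
    # migration, the external event goes to both the source and destination
    # compute hosts, not just instance.host. Names are illustrative only.
    from dataclasses import dataclass
    from typing import Optional, Set

    @dataclass
    class Migration:
        source_compute: str
        dest_compute: str

    @dataclass
    class Instance:
        host: str
        migration: Optional[Migration] = None  # in-progress migration, if any

    def hosts_for_external_event(instance: Instance) -> Set[str]:
        hosts = {instance.host}
        if instance.migration is not None:
            hosts.add(instance.migration.source_compute)
            hosts.add(instance.migration.dest_compute)
        return {h for h in hosts if h}  # ignore unset host values

    # An instance being evacuated from compute-1 to compute-2 now has the
    # event delivered to both hosts, so the destination's wait for
    # network-vif-plugged can complete.
    evacuating = Instance(host='compute-1',
                          migration=Migration('compute-1', 'compute-2'))
    assert hosts_for_external_event(evacuating) == {'compute-1', 'compute-2'}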

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/392219

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/392219
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5de902a3163c9c079fab22754388bd4e02981298
Submitter: Jenkins
Branch: stable/newton

commit 5de902a3163c9c079fab22754388bd4e02981298
Author: Artom Lifshitz <email address hidden>
Date: Wed Oct 5 14:37:03 2016 -0400

    Send events to all relevant hosts if migrating

    Previously, external events were sent to the instance object's host
    field. This patch fixes the external event dispatching to check for
    migration. If an instance is being migrated, the source and
    destination compute are added to the set of hosts to which the event
    is sent.

    Change-Id: If00736ab36df4a5a3be4f02b0a550e4bcae77b1b
    Closes-bug: 1535918
    Closes-bug: 1624052
    (cherry picked from commit a5b920a197c70d2ae08a1e1335d979857f923b4f)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.0.0b1

This issue was fixed in the openstack/nova 15.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.3

This issue was fixed in the openstack/nova 14.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Artom Lifshitz (<email address hidden>) on branch: master
Review: https://review.openstack.org/385086
Reason: Nothing is technically broken anymore, since the patch that actually fixes the bug has merged. The race is still present I believe, but it doesn't actually affect anything now that event dispatching is fixed.

Revision history for this message
watanabe.isao (watanabe.isao) wrote :

Hello, all.
We had the same problem in a Mitaka (RHEL7) environment, and we confirmed that applying this fix resolves it.
So, if it is OK, I would like to backport this fix to Mitaka. :)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/461678

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/mitaka)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/461678
Reason: mitaka is basically end of life
