nova should make sure the bridge exists before resuming a VM after an offline snapshot

Bug #1293540 reported by Cristian Tomoiaga
This bug affects 12 people
Affects                   Status       Importance  Assigned to  Milestone
OpenStack Compute (nova)  In Progress  Low         Luo Gangyi   -
neutron                   Confirmed    Undecided   Unassigned   -

Bug Description

My setup is based on icehouse-2: KVM, Neutron with ML2 and the Linux bridge agent, CentOS 6.5, and LVM as the ephemeral backend.
Neither the OS nor LVM should matter here; just make sure the snapshot takes the VM offline.

How to reproduce:
1. Create one VM on a compute node (make sure it is the only VM present).
2. Snapshot the VM (offline).
3. The Linux bridge agent removes the tap interface from the bridge and, since no other interfaces remain, removes the bridge as well.
4. Nova tries to resume the VM and fails because the bridge is gone (libvirt error: cannot get the bridge MTU); a minimal existence check is sketched below.
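
To illustrate what the title asks for: before handing the domain back to libvirt, nova only needs to confirm that the bridge device still exists on the host. A minimal sketch of such a check (the bridge name and the surrounding resume logic are illustrative assumptions, not nova code):

import os

def bridge_exists(bridge_name):
    # Every network device on Linux, including bridges, shows up under
    # /sys/class/net/<name>, so a simple path check is sufficient.
    return os.path.exists('/sys/class/net/%s' % bridge_name)

# Hypothetical guard before resuming the domain:
if not bridge_exists('brqXXXXXXXX-XX'):
    # recreate the bridge / re-plug the VIFs before calling libvirt
    pass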

Side question:
Why do both neutron and nova deal with the bridge?
I can understand the need to remove empty bridges, but I believe nova should be the one to do it, since nova is the component mainly dealing with the bridge itself.

More information:

During the snapshot, the Neutron Linux bridge agent is invoked
(neutron/plugins/linuxbridge/agent/linuxbridge_neutron_agent):
treat_devices_removed removes the tap interface and then calls self.br_mgr.remove_empty_bridges.
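
Conceptually the agent's cleanup amounts to something like the following (a simplified sketch, not the actual neutron code; the helper name and device names are illustrative):

import os
import subprocess

def remove_tap_and_empty_bridge(bridge, tap):
    # Detach the tap device from the bridge.
    subprocess.check_call(['brctl', 'delif', bridge, tap])
    # If the bridge has no ports left, delete the bridge as well.
    if not os.listdir('/sys/class/net/%s/brif' % bridge):
        subprocess.check_call(['ip', 'link', 'set', bridge, 'down'])
        subprocess.check_call(['brctl', 'delbr', bridge])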

On resume:
the snapshot method in nova/virt/libvirt/driver.py fails at:

if CONF.libvirt.virt_type != 'lxc' and not live_snapshot:
    if state == power_state.RUNNING:
        new_dom = self._create_domain(domain=virt_dom)

Having more than one VM on the same bridge works fine since neutron (the linux bridge agent) only removes an empty bridge.
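
The nova-side fix boils down to re-plugging the VIFs before recreating the domain, so that any bridge the agent tore down in the meantime is rebuilt first. A rough sketch of that idea (not the actual proposed patch; plug_vifs is an existing libvirt driver method, but whether network_info is in scope at this point of the snapshot path is an assumption):

if CONF.libvirt.virt_type != 'lxc' and not live_snapshot:
    if state == power_state.RUNNING:
        # Re-plug the VIFs so that a bridge removed by the neutron agent
        # while the domain was shut off is recreated before libvirt
        # tries to query its MTU. network_info is assumed to be available.
        self.plug_vifs(instance, network_info)
        new_dom = self._create_domain(domain=virt_dom)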

description: updated
Aaron Rosen (arosen)
tags: added: network
removed: low-hanging-fruit
Changed in neutron:
status: New → Confirmed
Sean Dague (sdague)
Changed in nova:
status: New → Incomplete
Changed in nova:
status: Incomplete → Confirmed
status: Confirmed → Incomplete
Revision history for this message
Priyanka (priyanka-majeti) wrote :

Can you please provide more details to reproduce the issue?

Revision history for this message
Kalle Happonen (kalle-happonen) wrote :

Responding to this, since I see the same issue.

Our network configuration is ML2 + linuxbridge + VLAN. A configuration like this might be required to reproduce the issue.

For each network (VLAN) a new bridge is created, e.g. brq6497753f-bd. When the last port is removed from the bridge, the bridge is deleted.

To reproduce:
1. Launch a VM, and schedule it to a node which has no other VMs in that network.
2. Verify that you have a bridge brqXXXXXXXX-XX on the hypervisor node with only the VLAN interface and the tap interface.
3. Snapshot the VM. This shuts down the VM, which removes the bridge, since it only had one interface.
4. The snapshot should fail and you should see something like this in the logs:

ERROR oslo.messaging._drivers.common [-] Returning exception Cannot get interface MTU on 'brq6497753f-bd': No such device to caller

This *might* be a race condition, so I'm not sure it triggers every time.

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

I'm pretty sure this is a race condition between the neutron linuxbridge agent, which periodically looks for empty bridges and removes them, and nova, which removes and reattaches the interface but doesn't expect the bridge to be deleted in between. The race is that the neutron cleanup job must run exactly in the window between the removal and the reattach.

This also happens when rebooting an instance, see https://bugs.launchpad.net/neutron/+bug/1328546 which seems to be the same underlying issue.

tags: added: lb
Luo Gangyi (luogangyi)
Changed in nova:
status: Incomplete → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/149942

Changed in nova:
assignee: nobody → Luo Gangyi (luogangyi)
status: Confirmed → In Progress
Revision history for this message
Luo Gangyi (luogangyi) wrote :

Dear all, I have proposed a patch for this bug; please review it: https://review.openstack.org/149942

Changed in nova:
importance: Undecided → Low
Revision history for this message
Cristian Tomoiaga (ctomoiaga) wrote :

Please note that this affects the rescue process as well. Same as above: the instance is stopped and the bridge removed. Nova then tries to boot the new rescue instance and fails because it cannot get the MTU of the bridge, which the Linux bridge agent has already removed.

nova.compute.manager [instance: uuid] libvirtError: Cannot get interface MTU on 'brq800452d2-85': No such device

Revision history for this message
Jeffrey Zhang (jeffrey4l) wrote :

I wonder if the `linuxnet_interface_driver=nova.network.linux_net.NeutronLinuxBridgeInterfaceDriver` setting in nova.conf could fix this issue?

It ensures the bridge exists [0] when resuming the instance.

[0] https://github.com/openstack/nova/blob/master/nova/network/linux_net.py#L1882
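
For reference, the corresponding setting in nova.conf would be (assuming the option lives in the [DEFAULT] section):

[DEFAULT]
linuxnet_interface_driver = nova.network.linux_net.NeutronLinuxBridgeInterfaceDriver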

Revision history for this message
Luo Gangyi (luogangyi) wrote :

To Jeffrey Zhang,

I don't think this configuration would help, since the driver is only invoked by nova. If nova does not call the "plug" method, the code you pointed to will never be executed.

Revision history for this message
Luo Gangyi (luogangyi) wrote :

Dear all, I have proposed a new patch for this bug; please review it: https://review.openstack.org/149942

tags: added: juno-backport-potential
ustcdylan (ustcdylan)
Changed in nova:
status: In Progress → Confirmed
Changed in nova:
status: Confirmed → In Progress
tags: added: linuxbridge
removed: lb
no longer affects: openstack-ansible
Revision history for this message
Matt Riedemann (mriedem) wrote :

Per comment #3 and the reboot case, I see the error in a CI job:

http://logs.openstack.org/37/220737/10/check/gate-functional-neutron-dsvm-ec2api/fd85409/logs/screen-n-cpu.txt.gz#_2015-09-08_15_27_47_699

2015-09-08 15:27:47.699 ERROR oslo_messaging.rpc.dispatcher [req-577eb939-0df9-4a52-bc4c-0f1249ed4293 user-d08258b2 project-dd85f112] Exception during message handling: Cannot get interface MTU on 'qbr843e4493-d1': No such device
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher executor_callback))
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher executor_callback)
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 129, in _do_dispatch
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher result = func(ctxt, **new_args)
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/exception.py", line 89, in wrapped
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher payload)
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/exception.py", line 72, in wrapped
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher return f(self, context, *args, **kw)
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 349, in decorated_function
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher LOG.warning(msg, e, instance=instance)
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 322, in decorated_function
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 399, in decorated_function
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-09-08 15:27:47.699 557 ERROR oslo_messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/comput...


Revision history for this message
Matt Riedemann (mriedem) wrote :

This should be resolved now on the neutron side given Sean Collins removed the code in the neutron linuxbridge agent that removed empty bridges:

https://review.openstack.org/#/q/I4ccc96566a5770384eacbbdc492bf09a514f5b31,n,z

That's been backported to stable/juno so we should be good - I don't think we need the nova side changes now.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/149942
Reason: Fixed in neutron.
