Lost network connection during live migration

Bug #1880389 reported by tony
This bug affects 1 person
Affects: neutron
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========
I find that the VM loses its network connection during live migration.

I enabled the live_migration_wait_for_vif_plug and set vif_plugging_timeout=120.

My OpenStack version is Queens, and I use Linux bridge as my L2 plugin. The physical adapters of my hosts are bonded (bond0).

When the live migration completed, the ping to the VM broke. I used tcpdump to capture packets and found that the packets from the switch were still being sent to the source host.

Steps to reproduce
==================
1. Enable live_migration_wait_for_vif_plug = True and vif_plugging_timeout = 120 (remember to restart the linuxbridge-agent on both the source and destination hosts).
2. Create a new VM, then start pinging the VM created in step 2.
3. Perform the live migration. (Before migrating, make sure the VLAN sub-interface of the network the VM is attached to does not yet exist on the destination host.)
4. During the live migration, you will see that the ping to the VM breaks.
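The two settings in step 1 go into nova.conf on the compute nodes. A minimal sketch (the section placement follows the Queens nova configuration reference; verify against your deployment):

```ini
# nova.conf on both source and destination compute nodes
[DEFAULT]
# Seconds nova waits for the network-vif-plugged event before timing out
vif_plugging_timeout = 120

[compute]
# Wait for neutron to report the VIF as plugged on the destination
# before starting the live migration
live_migration_wait_for_vif_plug = True
```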

Expected result
===============
The ping should not break during live migration.

Actual result
=============
The ping breaks, and the ping packets from the physical switch are still sent to the source host.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   queens

2. Which hypervisor did you use?

   Libvirt + KVM

3. Which storage type did you use?
   ceph

4. Which networking type did you use?
   Neutron with LinuxBridge
   The network type of the VM is VLAN.

I also found that the 5 RARP packets sent by the VM during live migration come before the VLAN sub-interface is inserted into the Linux bridge. Maybe this will help locate the problem.

tags: added: live-migration network
Revision history for this message
sean mooney (sean-k-mooney) wrote :

I suspect this is related to https://bugs.launchpad.net/neutron/+bug/1815989

In the case of OVS, we know there is an unavoidable race between OVS, libvirt, and the Neutron L2 agent — or rather, a race that can only be solved by changing the L2 agent and removing libvirt from the network configuration of the OVS ports.

I did not believe this race existed for Linux bridge, but it's possible that it does. I will need to check the code, but could you provide the libvirt version you are using and, ideally, a sample of the libvirt XML that we generated? If you can provide the nova-compute logs for both the source and destination hosts, plus the Neutron Linux bridge L2 agent logs for the time period of the migration, that should contain all the info we need, provided they are running at debug log level.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Moving to Incomplete until logs are provided or I can confirm a code path that could lead to a race.

Changed in nova:
status: New → Incomplete
Revision history for this message
tony (yanhongchang5) wrote :

I have collected the logs for the live migration; see the attached files.

And the libvirt version is libvirt-python-4.5.0-1.el7.x86_64

The VM image is centos74-mini.

I found that the 'VM Resumed' event occurs before the 'brctl addif', so the RARP packets can only reach the tap device and cannot reach the virtual bridge. Please note this.

dest node nova_compute.log:

2020-05-26 14:56:05.303 6 INFO nova.compute.manager [req-8cd4842d-4783-42b3-9a8e-c4e757b8e6f0 - - - - -] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d] VM Resumed (Lifecycle Event)

dest node neutron_linuxbridge_agent.log:

2020-05-26 14:56:25.743 6 DEBUG neutron.agent.linux.utils [req-fcca2dcc-5578-4827-ae05-d10935d35223 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'brctl', 'addif', 'brq49b34298-a8', 'p2p2.64'] create_process /var/lib/kolla/venv/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
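The window between the two log lines above can be computed directly from their timestamps; a small sketch (timestamps copied from the logs quoted above):

```python
from datetime import datetime

# Timestamps copied from the dest-node logs quoted above
resumed = datetime.strptime("2020-05-26 14:56:05.303", "%Y-%m-%d %H:%M:%S.%f")
addif = datetime.strptime("2020-05-26 14:56:25.743", "%Y-%m-%d %H:%M:%S.%f")

# Window during which the VM is running but the VLAN sub-interface
# has not yet been added to the Linux bridge
gap = (addif - resumed).total_seconds()
print(gap)  # 20.44
```

That is a roughly 20-second window during which the RARP announcements cannot propagate beyond the tap device.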

Revision history for this message
tony (yanhongchang5) wrote :

source host nova_compute.log

Revision history for this message
tony (yanhongchang5) wrote :

source host neutron_linuxbridge_agent.log

Revision history for this message
tony (yanhongchang5) wrote :

dest host nova_compute.log

Revision history for this message
tony (yanhongchang5) wrote :

dest host neutron_linuxbridge_agent.log

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Thanks, I will try to review the logs shortly.

In the interim I have set up a two-node Linux bridge environment and tried to reproduce this with master.

Using master, so far I have not been able to measure more than a 4-second drop in connectivity:

[1590502142.222506] 64 bytes from 172.24.4.63: icmp_seq=280 ttl=63 time=0.479 ms
[1590502142.430451] 64 bytes from 172.24.4.63: icmp_seq=281 ttl=63 time=0.423 ms
[1590502142.638626] 64 bytes from 172.24.4.63: icmp_seq=282 ttl=63 time=0.493 ms
[1590502142.846505] 64 bytes from 172.24.4.63: icmp_seq=283 ttl=63 time=0.447 ms
[1590502146.945127] 64 bytes from 172.24.4.63: icmp_seq=303 ttl=63 time=1.75 ms
[1590502147.148957] 64 bytes from 172.24.4.63: icmp_seq=304 ttl=63 time=0.813 ms
[1590502147.350779] 64 bytes from 172.24.4.63: icmp_seq=305 ttl=63 time=0.682 ms

While I understand this is a race, at least on a lightly loaded host it does not appear to happen on master.

It's possible I am just not hitting the race, which is why race conditions are so annoying to reproduce. At least in my local testing the RARP packets appear to be received, but I have not done a traffic capture yet to see whether all of them are received or some are dropped.
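The bracketed numbers in the ping output above are `ping -D` epoch timestamps, so the length of the connectivity drop can be computed by parsing them; a quick sketch over the lines quoted above:

```python
import re

# ping -D output copied from the comment above
ping_output = """\
[1590502142.222506] 64 bytes from 172.24.4.63: icmp_seq=280 ttl=63 time=0.479 ms
[1590502142.430451] 64 bytes from 172.24.4.63: icmp_seq=281 ttl=63 time=0.423 ms
[1590502142.638626] 64 bytes from 172.24.4.63: icmp_seq=282 ttl=63 time=0.493 ms
[1590502142.846505] 64 bytes from 172.24.4.63: icmp_seq=283 ttl=63 time=0.447 ms
[1590502146.945127] 64 bytes from 172.24.4.63: icmp_seq=303 ttl=63 time=1.75 ms
[1590502147.148957] 64 bytes from 172.24.4.63: icmp_seq=304 ttl=63 time=0.813 ms
[1590502147.350779] 64 bytes from 172.24.4.63: icmp_seq=305 ttl=63 time=0.682 ms
"""

# Extract the [epoch] timestamps and find the largest inter-reply gap
times = [float(m) for m in re.findall(r"\[(\d+\.\d+)\]", ping_output)]
max_gap = max(b - a for a, b in zip(times, times[1:]))
print(round(max_gap, 1))  # 4.1
```

Note the jump from icmp_seq=283 to icmp_seq=303: about twenty replies were lost during that ~4.1-second gap.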

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Regarding the VM resuming before the 'brctl addif':

the libvirt XML we generate contains the name of the Linux bridge the tap should be added
to, so the Linux bridge agent does not actually need to run the brctl addif command.
The tap should already be a member of that bridge when the VM is resumed.
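For reference, the relevant part of the libvirt domain XML Nova generates looks roughly like this (a sketch: the MAC, bridge, and tap names are taken from the os-vif log line below; the rest is illustrative):

```xml
<interface type='bridge'>
  <mac address='fa:16:3e:e1:50:ac'/>
  <source bridge='brq49b34298-a8'/>
  <target dev='tapb3526533-dc'/>
  <model type='virtio'/>
</interface>
```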

It looks like the VM paused on the source compute node at
2020-05-26 14:56:05.246 7 INFO nova.compute.manager [req-fe4495ae-f1a7-4e93-871e-4d034098babd - - - - -] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d] VM Paused (Lifecycle Event)

and it resumed on the dest at
2020-05-26 14:56:05.303 6 INFO nova.compute.manager [req-8cd4842d-4783-42b3-9a8e-c4e757b8e6f0 - - - - -] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d] VM Resumed (Lifecycle Event)

previously at 2020-05-26 14:55:54.639 6 DEBUG nova.virt.libvirt.driver [req-28bba2f2-89d9-4cbb-8b6a-7a9690469c86 b114d7969c0e465fbd15c2911ca4bb23 28e6517b7d6d4064be1bc878b590c40c - default default] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d] Plugging VIFs before live migration. pre_live_migration /var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:7621

On the dest node we had started pre-plugging the network backend, which successfully completed at

2020-05-26 14:55:58.769 6 INFO os_vif [req-28bba2f2-89d9-4cbb-8b6a-7a9690469c86 b114d7969c0e465fbd15c2911ca4bb23 28e6517b7d6d4064be1bc878b590c40c - default default] Successfully plugged vif VIFBridge(active=True,address=fa:16:3e:e1:50:ac,bridge_name='brq49b34298-a8',has_traffic_filtering=True,id=b3526533-dc6a-4174-bd3b-c300e78eda62,network=Network(49b34298-a85a-42a9-b264-b3a9242fef8f),plugin='linux_bridge',port_profile=,preserve_on_delete=False,vif_name='tapb3526533-dc')

This happens before we call libvirt to migrate the instance, so at this point os-vif has ensured the Linux bridge "brq49b34298-a8" is created, so that when the VM starts the tap is created directly in the correct bridge.

at 2020-05-26 14:56:02.172 7 DEBUG nova.compute.manager [req-40866f62-6362-4e9a-910e-365c3452d29f 030ec97d13dd4d9698209595a7ac01c4 ef210c7d6b2146139a9c94ef790081d8 - default default] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d] Received event network-changed-b3526533-dc6a-4174-bd3b-c300e78eda62 external_instance_event /var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py:8050

we received a network-changed event,

and then at 2020-05-26 14:56:04.314 7 DEBUG nova.compute.manager [req-379d2410-5959-45ae-89ce-649dca3ed666 030ec97d13dd4d9698209595a7ac01c4 ef210c7d6b2146139a9c94ef790081d8 - default default] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d] Received event network-vif-plugged-b3526533-dc6a-4174-bd3b-c300e78eda62 external_instance_event /var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py:8050

we receive a network-vif-plugged event, which should ideally only be sent by the ML2 driver once the L2 agent has finished wiring up the networking on the destination node.

As you pointed out, the L2 agent does not finish adding the VLAN subport to the correct bridge until

2020-05-26 14:56:25.743 6 DEBUG neutron.agent.linux.utils [req-fcca2dcc-5578-4827-ae05-d10935d35223 - - - - -] Running command: ['sudo...
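Putting the timestamps from the log excerpts above in order makes the race visible; a small sketch (timestamps copied from the quoted logs) showing that the network-vif-plugged event arrives long before the L2 agent actually wires up the VLAN subport:

```python
from datetime import datetime

def t(s):
    """Parse a log timestamp into a datetime."""
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f")

# Timeline reconstructed from the log excerpts quoted above
events = [
    ("pre_live_migration plugs VIFs", t("2020-05-26 14:55:54.639")),
    ("os-vif plug completed", t("2020-05-26 14:55:58.769")),
    ("network-changed received", t("2020-05-26 14:56:02.172")),
    ("network-vif-plugged received", t("2020-05-26 14:56:04.314")),
    ("VM Resumed on dest", t("2020-05-26 14:56:05.303")),
    ("brctl addif of VLAN subport", t("2020-05-26 14:56:25.743")),
]

# The vif-plugged event precedes the addif by more than 21 seconds
vif_plugged = events[3][1]
addif = events[5][1]
print((addif - vif_plugged).total_seconds())  # 21.429
```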


Changed in nova:
status: Incomplete → New
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Hi,

Thx for the detailed analysis, Sean. I have one concern with it. You narrowed it down to changes related to multiple port bindings, and IMO in that case it could be a similar issue to the one we had with the OVN driver: https://launchpad.net/bugs/1834045
But that functionality was introduced in Rocky IIRC, so in Queens it probably has to be a different issue (or this issue, but not really caused by the multiple port bindings feature).

I will try to take a look at that code on Neutron's side and propose a fix for this issue. But I have pretty limited cycles, so I can't promise it will be really soon. If there is anyone else who would like to take a look at this, that would be great :)

Revision history for this message
tony (yanhongchang5) wrote :

Hi Sean,

Thanks for your analysis of the logs; I agree with you.

I have done some investigation into why Neutron sends the 'network-vif-plugged' signal to Nova.

@Slawek,

I don't think https://launchpad.net/bugs/1834045 is the same problem as mine: bug 1834045 does not trigger the 'network-vif-plugged' notification, whereas mine receives the unexpected signal.

Below is my analysis; I hope it will help you locate the issue.

I found that pre_live_migration

https://github.com/openstack/nova/blob/stable/queens/nova/compute/manager.py#L6299-L6300

posts a port update to neutron-server with profile {"migrating_to": "bj-sjhl-os-compute-test-21-55"} ("bj-sjhl-os-compute-test-21-55" is the source node), and during the port update neutron-server clears the binding levels here:

https://github.com/openstack/neutron/blob/stable/queens/neutron/plugins/ml2/plugin.py#L1428-L1429

and then changes the port status from ACTIVE to DOWN.

Then here:
https://github.com/openstack/neutron/blob/stable/queens/neutron/plugins/ml2/plugin.py#L1506-L1509
it attempts to bind the port again, and the port's status changes once more, from DOWN to ACTIVE.

The DOWN-to-ACTIVE status change triggers the 'network-vif-plugged' notification to Nova; nova-compute on the source host receives the signal and starts the actual live migration in the libvirt driver.

So IMO the port status should not change when the port is updated with profile {"migrating_to": XXXXX}.
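The sequence described above can be sketched as a toy state machine (hypothetical names; this only models the behaviour described in this comment, not the actual ML2 code): clearing the binding levels flips the port to DOWN, the immediate rebind flips it back to ACTIVE, and that DOWN-to-ACTIVE edge is what fires the premature notification.

```python
class Port:
    """Toy model of the port-status behaviour described above
    (hypothetical class, not the real ML2 plugin)."""

    def __init__(self):
        self.status = "ACTIVE"
        self.notifications = []

    def _set_status(self, new):
        # Neutron notifies nova on the DOWN -> ACTIVE transition
        if self.status == "DOWN" and new == "ACTIVE":
            self.notifications.append("network-vif-plugged")
        self.status = new

    def update_with_migrating_to(self, dest_host):
        # Port update with profile {"migrating_to": dest_host}:
        # clearing the binding levels sets the port DOWN ...
        self._set_status("DOWN")
        # ... and the immediate re-binding sets it ACTIVE again
        self._set_status("ACTIVE")

port = Port()
port.update_with_migrating_to("dest-host")
print(port.notifications)  # ['network-vif-plugged']
```

The fix tony suggests amounts to suppressing the status flip (or the notification) for updates that only carry the migrating_to hint.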

Revision history for this message
tony (yanhongchang5) wrote :

Hi Sean,

   I want to know what should trigger the 'network-vif-plugged' signal that the source node is waiting for, and where. I think the signal is triggered by the linuxbridge-agent on the destination node: after the VIF tap device is scanned by the destination's linuxbridge-agent, the agent can send 'network-vif-plugged' via notify_port_active_direct(). But how can the linuxbridge-agent scan the new tap device on the destination node? The source node has to receive the 'network-vif-plugged' signal before it runs guest.migration(), and only once the VM starts on the destination can the tap device be created. This looks like a circular dependency, so I think there is a problem in the live migration process. Thanks.
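The circular wait described above can be made explicit with a tiny dependency graph (illustrative labels only): each step waits on the next, and the chain loops back on itself.

```python
# Each step waits on the step it maps to (illustrative labels only)
waits_on = {
    "source starts migration": "network-vif-plugged received",
    "network-vif-plugged received": "agent scans tap on dest",
    "agent scans tap on dest": "tap created on dest",
    "tap created on dest": "source starts migration",
}

def find_cycle(start, graph):
    """Follow the wait chain from `start`; return the cycle once we loop."""
    seen, node = [], start
    while node not in seen:
        seen.append(node)
        node = graph[node]
    return seen[seen.index(node):]

cycle = find_cycle("source starts migration", waits_on)
print(len(cycle))  # 4
```

All four steps form one cycle, which is why waiting on the destination L2 agent before migrating cannot work without some other mechanism (such as pre-creating the bridge, as os-vif does) breaking the loop.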

Lee Yarwood (lyarwood)
no longer affects: nova