GARP not sent on provider network after live migration

Bug #1866139 reported by Arjun Baindur
This bug affects 8 people
Affects | Status | Importance | Assigned to | Milestone
neutron | Incomplete | Undecided | Unassigned |

Bug Description

Using Rocky with OVS. Live-migrated a VM on a regular VLAN-based provider network. Network connectivity stopped, and no GARP packets were observed in tcpdump. Things started working again after the VM initiated traffic, causing its MAC to be relearned.
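The relearning behavior described here can be illustrated with a toy MAC-learning table. This is only a sketch of generic L2 learning, not OVS or physical switch code; the `LearningSwitch` class, the MAC address, and the port names are all hypothetical.

```python
class LearningSwitch:
    """Toy L2 switch: remembers the last port each source MAC was seen on."""

    def __init__(self):
        self.mac_table = {}

    def receive(self, src_mac, in_port):
        # Any frame sent by the VM (GARP, RARP, NTP, ...) refreshes its entry.
        self.mac_table[src_mac] = in_port

    def out_port(self, dst_mac):
        # None means "unknown MAC"; a real switch would flood in that case.
        return self.mac_table.get(dst_mac)


switch = LearningSwitch()
vm_mac = "fa:16:3e:00:00:01"  # hypothetical VM MAC

# Before migration the VM is reachable through the source host's port.
switch.receive(vm_mac, "port-to-source-host")

# The VM is live-migrated but sends nothing: the entry is now stale,
# so traffic for the VM is still forwarded toward the old host.
stale = switch.out_port(vm_mac)

# The first frame from the VM on the destination host fixes the entry.
switch.receive(vm_mac, "port-to-dest-host")
fresh = switch.out_port(vm_mac)

print(stale, "->", fresh)
```

An announcement frame sent right after migration plays the role of the second `receive()` call, which is why its absence shows up as packet loss until the guest happens to transmit.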

Looking at the code, send_ip_addr_adv_notif() in ip_lib.py is responsible for using the arping utility to send out a GARP. But it is only referenced in the l3-agent code. This is a provider network: no routers, no floating IPs.
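For reference, a gratuitous ARP is an ordinary ARP frame, broadcast at L2, in which the sender IP and target IP are both the address being announced. The following is a minimal sketch of such a frame built by hand; it is not Neutron's code (which shells out to arping), `build_garp` is a hypothetical helper, and the MAC address is made up.

```python
import socket
import struct


def build_garp(mac, ip):
    """Return the bytes of a gratuitous ARP request (Ethernet + ARP)."""
    hw = bytes.fromhex(mac.replace(":", ""))
    addr = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: broadcast destination, VM MAC as source, EtherType ARP.
    eth = broadcast + hw + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), lengths 6 and 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += hw + addr          # sender MAC, sender IP
    arp += broadcast + addr   # target MAC (not meaningful here), target IP == sender IP
    return eth + arp


frame = build_garp("fa:16:3e:00:00:01", "10.128.138.119")
print(len(frame), frame.hex())
```

Actually transmitting the frame would need a raw socket bound to an interface on the right L2 segment, which is outside the scope of this sketch.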

I see this very old bug in OVN: https://bugs.launchpad.net/networking-ovn/+bug/1545897

But we are not using OVN, and that bug was fixed in the OVN code itself. This is OpenStack with the OVS agent.

How are live migration and GARP handled for fixed IPs?

Revision history for this message
Arjun Baindur (abaindur) wrote :

The more I think about it: in the floating IP/router case, the GARP is sent from the fip namespace because 1. the host has access to it and 2. the FIP is configured on its fip gateway port. And the DVR qrouter updates static ARP entries for east-west routing.

In a provider network, there is no IP configured on the host. It's a pure L2 port that passes through OVS (a pure L2 switch) and into the VM. OVS does not deal with IPs. The tap and qvo ports don't have an IP configured on them.

From what interface would the GARP even be sent out?

[root@arjunlive-10-128-242-125platform9 ~]# arping -U -I tapdb8f8ab2-1f -c 3 10.128.138.119
bind: Cannot assign requested address

Does this mean it's intended behavior, and live migration isn't possible without network loss for provider networks/isolated tenant networks?

Revision history for this message
Bence Romsics (bence-romsics) wrote :

Hi Arjun,

Isn't it qemu's responsibility to send a GARP when it starts the VM on the destination host? That's likely why you're not finding the relevant code in OpenStack: it's in qemu. But qemu does not know anything about floating IPs, so we have to do that from neutron.

So this may be a qemu bug (qemu not sending the garp), an openstack bug (openstack not configuring/using qemu correctly), or even a later network problem blocking the garp before it reaches the switches needing to learn the new port for the mac.

I hope this helps with debugging.

Changed in neutron:
status: New → Incomplete
Revision history for this message
ignazio (cassano) wrote :

Hello, I got the same issue after upgrading to Rocky and Stein.
So I think something has changed in the code.
Ignazio

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

The Neutron OVS agent doesn't send GARPs for VMs' IPs. As Bence said, it should probably be qemu's responsibility to send them.
However, in Rocky we introduced the new multiple port bindings feature. Maybe there is some race between Nova configuring and running the qemu process on the destination host and neutron-ovs-agent wiring this port.
Can you check whether such GARPs are visible on the tap<port_id> interface and maybe dropped somewhere in br-int/br-ex?
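One way to run that check is to capture ARP and RARP frames on the VM's tap device on the destination host while the migration is in progress. The sketch below only builds the tcpdump command line; it assumes the usual Neutron naming of "tap" plus the first 11 characters of the port UUID (consistent with tapdb8f8ab2-1f earlier in this report). The helper names and the full port UUID are hypothetical.

```python
import shlex


def tap_device_name(port_id):
    # Assumption: device name is "tap" + first 11 characters of the port UUID.
    return "tap" + port_id[:11]


def capture_command(device):
    # -n: no name resolution, -e: print link-level (MAC) headers,
    # filter: only ARP and RARP frames.
    return ["tcpdump", "-n", "-e", "-i", device, "arp or rarp"]


port_id = "db8f8ab2-1f00-4000-8000-000000000000"  # hypothetical port UUID
device = tap_device_name(port_id)
print(shlex.join(capture_command(device)))
```

Running the same filter on the physical interface behind the provider bridge would show whether a frame seen on the tap device also leaves the host.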

Revision history for this message
ignazio (cassano) wrote :

I am going crazy.
I am comparing what happens on Queens vs Rocky.
After a VM migrates on Queens, an ARP request is sent and an ARP reply is received when the VM is pinged.
On Rocky, an ARP request is sent but no ARP reply is received when the VM is pinged.

On Rocky the VM starts to respond once the VM itself starts a connection, for example when it polls the NTP server.
In fact, if I change the NTP polling interval to a few seconds, the live migration loses only a few packets; otherwise it loses 100 or more packets.

Ignazio

Revision history for this message
ignazio (cassano) wrote :

Hello, some updates.
The issue is not caused by qemu/libvirt.
Under Queens, with the same versions of qemu/libvirt that are installed on Rocky or Stein, provider network VM live migration works fine.
Ignazio

Revision history for this message
Oleg Bondarev (obondarev) wrote :

This had a long history some time ago https://bugs.launchpad.net/neutron/+bug/1414559

Revision history for this message
ignazio (cassano) wrote :

Hello Oleg, but that bug is very old.
Why does it work on Queens?
It has not worked since Rocky :-(
Ignazio

Revision history for this message
Oleg Bondarev (obondarev) wrote :

Hi Ignazio, actually that bug says it was fixed in Nova in Queens.
We need to figure out which changes in Rocky led to the regression.
I guess the problem is exactly the same: qemu sends packets while the port is not yet fully processed by the OVS agent.

Revision history for this message
ignazio (cassano) wrote :

Hi Oleg, this issue is also present in Stein.
I have testing environments on Queens, Rocky and Stein.
On Rocky and Stein the VM starts to respond only once it sends some network packet. For example, I changed the VM's chrony polling interval to a few seconds, and after it migrates it starts to respond very soon, but that is only for testing.
I am waiting for a patch before upgrading the production environment from Queens to Stein.
If I can help with testing, I am available.
Ignazio

Revision history for this message
sean mooney (sean-k-mooney) wrote :

To be clear, when using libvirt/qemu, a GARP is not expected on a provider network after live migration; qemu should send RARP packets instead. If you are using qemu 2.6.0 this is broken, as I mentioned on the mailing list: http://lists.openstack.org/pipermail/openstack-discuss/2020-April/014530.html
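For anyone looking for these packets in a capture, here is a rough sketch of what such a RARP announcement looks like on the wire. It assumes the layout commonly described for qemu's self-announce: EtherType 0x8035, opcode 3 (reverse request), the guest MAC as both sender and target hardware address, zeroed IP fields, padded to 60 bytes. It is not taken from the qemu source; `build_rarp_announce` is a hypothetical helper and the MAC is made up.

```python
import struct


def build_rarp_announce(mac):
    """Return the bytes of a RARP "reverse request" announcing `mac`."""
    hw = bytes.fromhex(mac.replace(":", ""))

    # Ethernet header: broadcast destination, guest MAC as source, EtherType RARP.
    eth = b"\xff" * 6 + hw + struct.pack("!H", 0x8035)

    # Same header layout as ARP, but opcode 3 (reverse request).
    rarp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
    rarp += hw + b"\x00" * 4  # sender MAC, sender IP (unset)
    rarp += hw + b"\x00" * 4  # target MAC, target IP (unset)

    # Pad to the 60-byte minimum Ethernet frame size (without FCS).
    return (eth + rarp).ljust(60, b"\x00")


announce = build_rarp_announce("fa:16:3e:00:00:01")
print(len(announce), announce.hex())
```

Because the frame carries no IP address, it only refreshes MAC learning on the switches; it does not update ARP caches the way a GARP would, which is sufficient when the MAC-to-IP mapping is unchanged by the migration.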

Neutron is not responsible for sending the RARP packets; that is done by qemu. But there is a known race when using the OVS firewall driver. It is tracked by two bugs,
specifically https://bugs.launchpad.net/neutron/+bug/1815989 and https://bugs.launchpad.net/neutron/+bug/1734320

This bug is just a duplicate of https://bugs.launchpad.net/neutron/+bug/1815989

To fix this we need to update https://review.opendev.org/#/c/640258 so that it works again and the OVS agent can correctly wire up the OVS port before the qemu tap device is created,
and then in nova we need to update https://review.opendev.org/#/c/602432/

I stopped working on this a few months ago because when I had a working version I could not get it merged, and when I finally started to get reviews I no longer had time left to work on it.

That said, it's coming up a lot now, as people seem to finally be deploying Rocky+.

If people have time to take those over, it would be great if someone from the neutron team could complete the neutron agent patch.

If not, I might try to make time to look at this again, but I will need to set up an env and figure out how to make the neutron patch work again, which will take time.

Revision history for this message
ignazio (cassano) wrote :

Hello Sean, I am facing this issue with the iptables_hybrid firewall as well, whereas, if I read your post correctly, it should only fail with the openvswitch firewall.

Ignazio
