OVS drops RARP packets sent by QEMU during live migration, causing up to a 40-second ping pause in Rocky

Bug #1815989 reported by Jing Zhang
This bug affects 34 people
Affects                    Status        Importance  Assigned to      Milestone
OpenStack Compute (nova)   Fix Released  Medium      sean mooney
  Train                    New           Undecided   Unassigned
  Ussuri                   New           Undecided   Unassigned
  Victoria                 New           Undecided   Unassigned
  Wallaby                  New           Undecided   Unassigned
neutron (status tracked in Ussuri)
  Train                    In Progress   Undecided   Unassigned
  Ussuri                   In Progress   Undecided   Rodolfo Alonso
  Victoria                 In Progress   Undecided   Unassigned
  Wallaby                  Fix Released  Undecided   Unassigned
os-vif                     Invalid       Undecided   Unassigned

Bug Description

This issue is well known, and there were previous attempts to fix it, like this one

https://bugs.launchpad.net/neutron/+bug/1414559

This issue still exists in Rocky and gets worse. In Rocky, nova compute, nova libvirt and neutron ovs agent all run inside containers.

So far the only simple fix I have is to increase the number of RARP packets QEMU sends after live migration from 5 to 10. For completeness: the (unmerged) nova change proposed in the bug mentioned above does not work.

I am creating this ticket hoping to get up-to-date (Rocky and onwards) expert advice on how to fix this in nova/neutron.

For the record, below are the timestamps in my test between the neutron OVS agent "activating" the VM port and the RARP packets seen by tcpdump on the compute node. 10 RARP packets are sent by the (recompiled) QEMU; 7 are seen by tcpdump, and the second-to-last packet barely made it through.

openvswitch-agent.log:

2019-02-14 19:00:13.568 73453 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Port 57d0c265-d971-404d-922d-963c8263e6eb updated. Details: {'profile': {}, 'network_qos_policy_id': None, 'qos_policy_id': None, 'allowed_address_pairs': [], 'admin_state_up': True, 'network_id': '1bf4b8e0-9299-485b-80b0-52e18e7b9b42', 'segmentation_id': 648, 'fixed_ips': [

{'subnet_id': 'b7c09e83-f16f-4d4e-a31a-e33a922c0bac', 'ip_address': '10.0.1.4'}
], 'device_owner': u'compute:nova', 'physical_network': u'physnet0', 'mac_address': 'fa:16:3e:de:af:47', 'device': u'57d0c265-d971-404d-922d-963c8263e6eb', 'port_security_enabled': True, 'port_id': '57d0c265-d971-404d-922d-963c8263e6eb', 'network_type': u'vlan', 'security_groups': [u'5f2175d7-c2c1-49fd-9d05-3a8de3846b9c']}
2019-02-14 19:00:13.568 73453 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Assigning 4 as local vlan for net-id=1bf4b8e0-9299-485b-80b0-52e18e7b9b42

tcpdump for rarp packets:

[root@overcloud-ovscompute-overcloud-0 nova]# tcpdump -i any rarp -nev
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes

19:00:10.788220 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
19:00:11.138216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
19:00:11.588216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
19:00:12.138217 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
19:00:12.788216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
19:00:13.538216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
19:00:14.388320 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46

tags: added: ovs
Revision history for this message
Brian Haley (brian-haley) wrote :

Just want to make sure I understand this correctly after reading the bug you referenced.

1. The instance is live-migrated
2. The neutron-ovs-agent on the target node configures the port, but only after libvirt has
   already started sending RARPs out
3. A couple of RARPs make it out if you set a config option high enough, for example, 10

I would have figured neutron would have notified nova, and then it would have completed things, triggering the RARPs to happen after that event, but this is not my area of expertise. I'll ask someone that's more familiar with live migration operations to take a look and give their perspective.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

This happens because os-vif is currently not plugging the OVS port; that is instead being done by libvirt.

I had to revert that change due to other neutron issues, which Slawek has since resolved:
https://review.openstack.org/#/c/636824
This nova change, https://review.openstack.org/#/c/602432/, tests delegating the port plugging to os-vif. It depends on this change to the requirements repo, https://review.openstack.org/#/c/636139/1, which forces the use of os-vif 1.13.1.

When I reverted https://review.openstack.org/#/c/631829/ it was purely to unblock the gate. The code was correct; however, I could not merge https://review.openstack.org/#/c/602432/ due to the unrelated neutron and cinder gate failures.

I expect that we will merge https://review.openstack.org/#/c/636061/ to os-vif soon. That will allow callers of os-vif to choose whether it should create the OVS interface, letting neutron wire up the port before QEMU starts and preventing the RARPs from being lost.

It should be noted that this race only happens when kernel OVS and the OVS conntrack security group driver are deployed. For OVS-DPDK, or kernel OVS with iptables, os-vif always plugs the interface (not libvirt), and we do not race between libvirt adding the interface, QEMU sending the RARPs, and neutron wiring up the interface on br-int.
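
For illustration, a minimal sketch of that delegated flow from a caller's point of view, assuming the 'create_port' knob that https://review.openstack.org/#/c/636061/ adds to the OVS port profile; the port ID, MAC and network ID are taken from the log excerpt above, while the instance UUID is made up:

import os_vif
from os_vif.objects import instance_info, network, vif

os_vif.initialize()

# Ask os-vif to create the OVS port itself (rather than leaving that to
# libvirt) so neutron can wire the port up before QEMU starts.
profile = vif.VIFPortProfileOpenVSwitch(
    interface_id='57d0c265-d971-404d-922d-963c8263e6eb',
    create_port=True)

ovs_vif = vif.VIFOpenVSwitch(
    id='57d0c265-d971-404d-922d-963c8263e6eb',
    address='fa:16:3e:de:af:47',
    vif_name='tap57d0c265-d9',
    plugin='ovs',
    network=network.Network(id='1bf4b8e0-9299-485b-80b0-52e18e7b9b42',
                            bridge='br-int'),
    port_profile=profile)

os_vif.plug(ovs_vif, instance_info.InstanceInfo(
    uuid='7c6b3c29-0000-0000-0000-000000000000',  # made-up instance UUID
    name='instance-00000001'))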

Revision history for this message
Brian Haley (brian-haley) wrote :

I've added the os-vif component based on Sean's comment.

Revision history for this message
Jing Zhang (jing.zhang.nokia) wrote :

Thanks, Sean, for detailing the history around this issue. Let us fix the known issue Sean is aware of first.

For the record:

(1) At this moment in Rocky, the only deployment scenario where I don't see the long ping pause is when the VM uses VXLAN and the native firewall. Both native OVS and OVS-DPDK were tested; I don't see different behavior between native OVS and OVS-DPDK.

Below are the details observed, consistent and reproducible:

> For a VM using VXLAN, the long ping delay occurs only with the legacy firewall, and it happens in the direction where the migrating VM goes to a compute that has had its network before. No issue with the native firewall.

> For a VM using VLAN, the long ping delay occurs with both the legacy and the native firewall, and it happens in the direction where the migrating VM goes to a compute that hasn't had its network before; the delay is consistently about 40s.

(2) I am also trying to increase the RARP count from 5 to 10 on the QEMU side, and was told the following patch (not yet merged) would allow changing the announce-rounds and the other announce timings at run time:

http://lists.gnu.org/archive/html/qemu-devel/2019-02/msg01486.html

Revision history for this message
Dr. David Alan Gilbert (dgilbert-h) wrote :

Please note the qemu series:
  http://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04128.html

which:
  a) Makes these announce timings and repetitions configurable
  b) Allows a set of announces to be triggered on command.

Hopefully they'll get merged soon.
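
Once that series lands (it is targeted at QEMU 4.0), the timings become adjustable at run time through the monitor. A rough sketch of driving this via virsh, assuming the QMP command and parameter names from the series; the domain name is hypothetical:

import json
import subprocess

DOMAIN = "instance-0000003b"  # hypothetical libvirt domain name

def qmp(command, arguments=None):
    """Send a QMP command to the domain's QEMU monitor through virsh."""
    payload = {"execute": command}
    if arguments:
        payload["arguments"] = arguments
    return subprocess.check_output(
        ["virsh", "qemu-monitor-command", DOMAIN, json.dumps(payload)])

# Stretch the announcement schedule, e.g. 10 rounds instead of the default 5.
qmp("migrate-set-parameters", {"announce-rounds": 10})

# Trigger a fresh set of self-announcements (RARPs) on demand.
qmp("announce-self")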

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/640258

Changed in neutron:
assignee: nobody → sean mooney (sean-k-mooney)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/640258
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
yao ning (mslovy11022) wrote :

so how can we deal with this issue? @sean mooney

Revision history for this message
sean mooney (sean-k-mooney) wrote :

I'm going to restore the patch I was working on, https://review.openstack.org/640258,
and try to get back to this over the next two weeks or so.

I had some other issues come up which meant I needed to put this on hold for a while.

I'll be at the PTG the week after next, so I'll try to have some updated patches ready to discuss with
people in person and see if we can get an agreed path forward.

Revision history for this message
Dr. David Alan Gilbert (dgilbert-h) wrote :

Note that the QEMU changes mentioned in #5 were merged into QEMU v4.0.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/661921

Changed in neutron:
assignee: sean mooney (sean-k-mooney) → Yang Li (yang-li)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron-lib (master)

Fix proposed to branch: master
Review: https://review.opendev.org/661938

Revision history for this message
Yang Li (yang-li) wrote :
Download full text (3.3 KiB)

I think there are two problems that cause broken connectivity during live migration:
1. When the VM is migrated to the destination and sends its RARP packets, the OpenFlow rules and VLAN tag may not have been configured in br-int yet (the migration completes too fast), so the RARP packets are dropped.

2. When the VM is migrated to the destination and the OpenFlow rules and tag have been configured, the VM's RARP packets are still dropped by table 71: none of the high-priority flows (65, 70, 80, 95) match RARP, only the low-priority flow (10) matches, and its action is drop, so the packets still cannot be sent out. The table 71 flows:
 cookie=0x6c5958fed07888a7, duration=428986.816s, table=71, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=110,ct_state=+trk actions=ct_clear,resubmit(,71)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=0, n_bytes=0, idle_age=24454, priority=95,icmp6,reg5=0x2d,in_port=45,icmp_type=130 actions=resubmit(,94)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=3, n_bytes=210, idle_age=24441, priority=95,icmp6,reg5=0x2d,in_port=45,icmp_type=133 actions=resubmit(,94)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=1, n_bytes=78, idle_age=24450, priority=95,icmp6,reg5=0x2d,in_port=45,icmp_type=135 actions=resubmit(,94)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=0, n_bytes=0, idle_age=24454, priority=95,icmp6,reg5=0x2d,in_port=45,icmp_type=136 actions=resubmit(,94)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=0, n_bytes=0, idle_age=24454, priority=70,icmp6,reg5=0x2d,in_port=45,icmp_type=134 actions=resubmit(,93)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=1, n_bytes=42, idle_age=24450, priority=95,arp,reg5=0x2d,in_port=45,dl_src=fa:16:3e:b4:db:09,arp_spa=192.168.100.141 actions=resubmit(,94)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=91, n_bytes=8818, idle_age=24446, priority=65,ip,reg5=0x2d,in_port=45,dl_src=fa:16:3e:b4:db:09,nw_src=192.168.100.141 actions=ct(table=72,zone=NXM_NX_REG6[0..15])
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=2, n_bytes=678, idle_age=24450, priority=80,udp,reg5=0x2d,in_port=45,tp_src=68,tp_dst=67 actions=resubmit(,73)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=0, n_bytes=0, idle_age=24454, priority=80,udp6,reg5=0x2d,in_port=45,tp_src=546,tp_dst=547 actions=resubmit(,73)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=0, n_bytes=0, idle_age=24454, priority=70,udp,reg5=0x2d,in_port=45,tp_src=67,tp_dst=68 actions=resubmit(,93)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=0, n_bytes=0, idle_age=24454, priority=70,udp6,reg5=0x2d,in_port=45,tp_src=547,tp_dst=546 actions=resubmit(,93)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=1, n_bytes=90, idle_age=24441, priority=65,ipv6,reg5=0x2d,in_port=45,dl_src=fa:16:3e:b4:db:09,ipv6_src=fe80::f816:3eff:feb4:db09 actions=ct(table=72,zone=NXM_NX_REG6[0..15])
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=1, n_bytes=90, idle_age=24450, priority=10,reg5=0x...

Read more...

Revision history for this message
Yang Li (yang-li) wrote :

I used some virsh commands to avoid the first situation: throttle the migration speed to 1 MiB/s, to make sure the flows and tag are configured in br-int before the live migration completes.
# virsh migrate-setspeed instance-0000003b --bandwidth 1
# virsh migrate-getspeed instance-0000003b
# virsh migrate --live --persistent --undefinesource instance-0000003b qemu+tcp://node-3/system

And added a new RARP flow to table=71:
priority=95,ct_state=-trk,rarp,reg5=0x13,in_port=19,dl_src=fa:16:3e:09:3d:10 actions=resubmit(,94)

After that, no more packets were dropped in several live-migration tests.
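
For anyone reproducing that experiment, a sketch of installing such a flow by hand; the register value, ofport and MAC below are the ones from the test above and differ per port:

import subprocess

# Let untracked RARP frames from this port skip the conntrack pipeline,
# mirroring the manual workaround flow quoted above.
FLOW = ("table=71,priority=95,ct_state=-trk,rarp,reg5=0x13,in_port=19,"
        "dl_src=fa:16:3e:09:3d:10,actions=resubmit(,94)")

subprocess.check_call(["ovs-ofctl", "add-flow", "br-int", FLOW])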

Changed in neutron:
assignee: Yang Li (yang-li) → sean mooney (sean-k-mooney)
Changed in os-vif:
status: New → Invalid
Changed in nova:
status: New → In Progress
assignee: nobody → sean mooney (sean-k-mooney)
importance: Undecided → Medium
Changed in neutron:
assignee: sean mooney (sean-k-mooney) → Rodolfo Alonso (rodolfo-alonso-hernandez)
Changed in neutron:
assignee: Rodolfo Alonso (rodolfo-alonso-hernandez) → Oleg Bondarev (obondarev)
Changed in neutron:
assignee: Oleg Bondarev (obondarev) → Rodolfo Alonso (rodolfo-alonso-hernandez)
Revision history for this message
yao ning (mslovy11022) wrote :

Hi, sean

The root cause of this issue seems related to the new neutron port binding API; see: https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neutron-new-port-binding-api.html

In the Rocky release, live migration activates the port using the port binding API in the post_live_migration step. However, libvirt starts the new instance, and it becomes alive before the port becomes active, so the RARP packets are lost.

We verified this by reverting the port binding API logic, after which the problem was solved, like below:
in neutron/agent/rpc.py
class CacheBackedPluginApi(PluginApi):

        if (port_obj.device_owner.startswith(
                constants.DEVICE_OWNER_COMPUTE_PREFIX) and
                binding[pb_ext.HOST] != host):
            LOG.debug("Device %s has no active binding in this host",
                      port_obj)
            return {'device': device,
                    n_const.NO_ACTIVE_BINDING: True}

Skip this if branch, so that the port is always treated as active.

We also need to skip the port binding API used by nova:
in nova/network/neutronv2/api.py
    def supports_port_binding_extension(self, context):
        """This is a simple check to see if the neutron "binding-extended"
        extension exists and is enabled.

        The "binding-extended" extension allows nova to bind a port to multiple
        hosts at the same time, like during live migration.

        :param context: the user request context
        :returns: True if the binding-extended API extension is available,
                  False otherwise
        """
        self._refresh_neutron_extensions_cache(context)
        return constants.PORT_BINDING_EXTENDED in self.extensions

We directly return False from supports_port_binding_extension, so nova will not call the port binding API during live migration; the legacy path is used.

We confirmed this because when we manually called the activate-port-binding API to activate the port on the destination during migration, before the VM became active on the destination, the problem also disappeared.

Since the neutron port binding API has its own advantages, how can we solve this thoroughly? Is it possible to activate the port binding before the VM is shut down on the source host and running on the destination host? @sean mooney

Changed in nova:
assignee: sean mooney (sean-k-mooney) → nobody
Changed in neutron:
assignee: Rodolfo Alonso (rodolfo-alonso-hernandez) → nobody
Revision history for this message
sean mooney (sean-k-mooney) wrote :

No, we can't activate the port binding before the VM is running on the destination unless you are using post-copy live migration.

If you are using post-copy live migration, we receive an event from libvirt when the VM is paused and we activate the port binding at that point, which is the first point where we can safely activate it.

When we activate the port on the destination, the port binding on the source is deactivated, and in theory the network backend should prevent any network traffic from being sent or received on the source node. Meaning: if the OVS agent were correctly isolating the interface when it was deactivated, you would get a network partition if we simply activated the binding before migrating, which is why we do not do that. We cannot start the VM on the destination before stopping it on the source node, as that is managed entirely outside of nova's control by libvirt.

If you are not using post-copy, we activate the binding in the same place we would previously have bound the port, in post_live_migration.

You are correct that we still have fallback code, for neutron versions that don't support multiple port bindings for live migration, which takes the old path. If hardcoding the check to False fixes the problem, it implies the neutron L2 agent is incorrectly wiring up ports on hosts that the port is not bound to, which is not correct behaviour.

https://review.opendev.org/#/c/640258 allows neutron to wire up the destination port before the VM is created on the destination. https://review.opendev.org/#/c/602432/ was intended to prevent the race between libvirt and the L2 agent.

That is the only way I know of to fix this currently.
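
For context, the post-copy path described above is opt-in in nova. A minimal nova.conf fragment for the compute nodes, using nova's stock [libvirt] option (post-copy additionally needs kernel userfaultfd and QEMU support on both hosts):

[libvirt]
# Allow nova to switch a live migration to post-copy; once switched, the
# VM runs on the destination and the port binding can be activated on
# the libvirt "paused" event, as described above.
live_migration_permit_post_copy = True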

Revision history for this message
sean mooney (sean-k-mooney) wrote :

I could add a workaround config option to nova to forcefully disable the multiple-port-binding flow for live migration, but that would really just be a hack, and it would increase network downtime in some cases, so it could not be enabled by default. It also has other issues, like the inability to migrate between OVS hosts with different firewall driver backends.
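
A sketch of what such a workaround flag could look like, config-gating the hardcoded hack from the earlier comment; the option name here is invented (Sean's actual proposal is https://review.opendev.org/#/c/724386/, referenced later in this thread):

from oslo_config import cfg

CONF = cfg.CONF
# Hypothetical option name; the real review may use a different one.
CONF.register_opts(
    [cfg.BoolOpt('disable_multiple_port_bindings', default=False)],
    group='workarounds')


def supports_port_binding_extension(self, context):
    # Pretend neutron has no binding-extended support, forcing the
    # legacy single-binding migration flow ('constants' and 'self' as in
    # the nova/network/neutronv2/api.py snippet quoted above).
    if CONF.workarounds.disable_multiple_port_bindings:
        return False
    self._refresh_neutron_extensions_cache(context)
    return constants.PORT_BINDING_EXTENDED in self.extensions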

Revision history for this message
ignazio (cassano) wrote :

Hello,
I did not understand whether I must patch only nova/network/neutronv2/api.py and nova/conf/workarounds.py,
or also neutron/agent/rpc.py as yao ning suggested.

Revision history for this message
Xing Zhang (xingzhang) wrote :

Adding to #15: we should also skip the check in neutron/agent/rpc.py for the workaround to disable port binding:

diff --git a/neutron/agent/rpc.py b/neutron/agent/rpc.py
index 130b18e..5ee396c 100644
--- a/neutron/agent/rpc.py
+++ b/neutron/agent/rpc.py
@@ -311,13 +311,6 @@ class CacheBackedPluginApi(PluginApi):
         binding = utils.get_port_binding_by_status_and_host(
             port_obj.bindings, constants.ACTIVE, raise_if_not_found=True,
             port_id=port_obj.id)
-        if (port_obj.device_owner.startswith(
-                constants.DEVICE_OWNER_COMPUTE_PREFIX) and
-                binding[pb_ext.HOST] != host):
-            LOG.debug("Device %s has no active binding in this host",
-                      port_obj)
-            return {'device': device,
-                    n_const.NO_ACTIVE_BINDING: True}
         net = self.remote_resource_cache.get_resource_by_id(
             resources.NETWORK, port_obj.network_id)
         net_qos_policy_id = net.qos_policy_id

diff --git a/nova/network/neutronv2/api.py b/nova/network/neutronv2/api.py
index f73fee4..088c3f0 100644
--- a/nova/network/neutronv2/api.py
+++ b/nova/network/neutronv2/api.py
@@ -1226,8 +1226,7 @@ class API(base_api.NetworkAPI):
         :returns: True if the binding-extended API extension is available,
                   False otherwise
         """
-        self._refresh_neutron_extensions_cache(context)
-        return constants.PORT_BINDING_EXTENDED in self.extensions
+        return False

     def bind_ports_to_host(self, context, instance, host,
                            vnic_type=None, profile=None):

Revision history for this message
ignazio (cassano) wrote :

Must the neutron agent be patched on the compute nodes,
and nova on the controllers?

Revision history for this message
ignazio (cassano) wrote :

The neutron agent's rpc.py on the compute nodes contains the code you suggested, so it does not need to be patched, right?

I patched nova/network/neutronv2/api.py as you suggested, and nova/conf/workarounds.py as suggested by Sean here:

https://review.opendev.org/#/c/724386/

I restarted all nova and neutron services.

In any case, it does not solve the problem on provider networks on Stein and Rocky.
To reproduce the issue I used an instance that does not generate traffic (I also disabled its chronyd daemon); after live migration it stops responding.
If I enable the VM's chronyd with a polling time of 2 seconds, it responds after live migration because it sends packets every 2 seconds.

Revision history for this message
Xing Zhang (xingzhang) wrote :

Both are needed.

* the nova-conductor and nova-compute services, with the workaround patch
* the neutron-openvswitch-agent service on the compute nodes, with the lines removed from neutron/agent/rpc.py as described in https://bugs.launchpad.net/neutron/+bug/1815989/comments/19

Restart the services and do live migration.

Revision history for this message
ignazio (cassano) wrote :

I am sorry, I had not understood the output of the diff command. Now it is clear. I am going to test it.

Revision history for this message
ignazio (cassano) wrote :

It works.
I changed nova/conf/workarounds.py and nova/network/neutronv2/api.py only on the controller nodes.
I changed neutron/agent/rpc.py on the compute nodes (deleting lines as you suggested).

Restarted services and now it works :-)

Thanks

Revision history for this message
yao ning (mslovy11022) wrote :

Can anyone help to get a workaround into neutron? @sean mooney

Revision history for this message
ignazio (cassano) wrote :

These patches seem to work fine on Stein but not on Rocky.

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Interestingly, we are having issues with this now after moving to Train and CentOS 8.

I can work around this by sending the new announce_self command (which generates RARP and/or GARP, or whatever the NIC driver itself implements) to the instance after it has been migrated to another compute node and stopped responding.

virsh qemu-monitor-command instance-000000e6 --hmp announce_self

I wonder if we can work around this somehow if there is no clear path in sight with the patches that have already been proposed. Since libvirt doesn't support it, we cannot set migration parameters to use the new announce parameters.

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Note that you need QEMU >= 4.0 to support announce_self
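
A quick way to verify that precondition against a running domain, sketched with QMP's query-version; the domain name is taken from the example above:

import json
import subprocess

def qemu_version(domain):
    """Return (major, minor) of the QEMU binary backing a running domain."""
    out = subprocess.check_output(
        ["virsh", "qemu-monitor-command", domain,
         '{"execute": "query-version"}'])
    ver = json.loads(out)["return"]["qemu"]
    return ver["major"], ver["minor"]

# announce_self only exists from QEMU 4.0 onwards.
assert qemu_version("instance-000000e6") >= (4, 0)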

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/741529

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Rodolfo, do you have any clear path on how we can work this bug?

Changed in nova:
assignee: nobody → sean mooney (sean-k-mooney)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/640258
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7fd2725cb14b81f442eb57a38755829270ff2c43
Submitter: Zuul
Branch: master

commit 7fd2725cb14b81f442eb57a38755829270ff2c43
Author: Sean Mooney <email address hidden>
Date: Fri Mar 1 04:43:20 2019 +0000

    Do not skip ports with ofport unset or invalid

    This change removes the "_check_ofport" function and its use from
    the ovs_lib.py file.

    By skipping ports without a unique ofport in the "get_vifs_by_ids"
    and "get_vifs_by_id" functions, the OVS agent incorrectly treated
    newly added ports with an ofport of -1 as removed ports in the
    "treat_devices_added_or_updated" function.

    Co-Authored-By: Rodolfo Alonso Hernandez <email address hidden>

    Change-Id: I79158baafbb99bee99a1d687039313eb454d3a9b
    Partial-Bug: #1734320
    Partial-Bug: #1815989

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/748296

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by sean mooney (<email address hidden>) on branch: master
Review: https://review.opendev.org/748296

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Sorry for asking again, do we have a clear path on how to solve this?

Revision history for this message
James Denton (james-denton) wrote :

Also hoping for guidance on how to address this. Thanks!

Revision history for this message
sean mooney (sean-k-mooney) wrote :

We figured out what was causing the issue for the neutron openvswitch firewall driver yesterday.
Rodolfo has proposed the bugfix patch https://review.opendev.org/#/c/753314/.
That should allow https://review.opendev.org/#/c/602432/20 to move forward and hopefully merge next week, once the neutron patch is merged. The nova patch also requires https://review.opendev.org/#/c/751642/5, which should also be merged shortly.

Once these are all merged on the master branch, we can start the backport process.

The main code for the fix has been available for some time, but there were a number of edge cases that had to be addressed. We started looking at these patches again about two weeks ago, and I think
the main edge cases have now been addressed.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/754475

Revision history for this message
James Denton (james-denton) wrote :

@sean-k-mooney - Thanks for the quick response. We're using the hybrid driver, but also experience the RARP issues. However, looking back through this bug, it seems like https://review.opendev.org/#/c/640258 might address our issue, especially since the issue appears to be due to the amount of time it takes to apply the VLAN tag to the OVS port.

Do you think we can backport https://review.opendev.org/#/c/640258 all the way back to (at least) Stein?

Revision history for this message
sean mooney (sean-k-mooney) wrote :

For the hybrid iptables driver, os-vif adds a veth pair between the OVS bridge and a Linux bridge, so the ofport will not be -1, and the change will therefore have no effect in that case.

This only changes the behavior when using the OVS firewall driver.

But yes, we will probably backport it to Queens.

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Just to chime in, we also have this issue using the hybrid iptables driver. But if https://review.opendev.org/#/c/640258 solves the issue and is backported, we are happy :)

tags: added: in-stable-victoria
Changed in nova:
status: In Progress → Fix Released
21 comments hidden.
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/797142

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/797144

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/797291

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/797316

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/797428

Revision history for this message
ignazio (cassano) wrote :

Hello, I get the same issue on kolla-ansible Wallaby with Ubuntu source images.
Ignazio

Revision history for this message
ignazio (cassano) wrote :

Hello, I did not understand whether this bug affects only openvswitch or also OVN.
Can anyone explain, please?

Revision history for this message
sean mooney (sean-k-mooney) wrote :

OVN is broken in a different way, but with a similar effect.
There is yet another vector for a race between libvirt and OVN in the OVN case.

OVN will not install OpenFlow rules until libvirt creates the tap device on the destination host.
As a result, there is an interval of time where the VM can be running on the destination but OVN has not yet installed the flows.

There was a bug in the nova fix which is resolved by
https://review.opendev.org/c/openstack/nova/+/797142/1
https://review.opendev.org/c/openstack/nova/+/797428

Basically, the nova fix was turned off due to an issue in the code related to how Python properties work.

That is corrected by those last two patches, and that is being squashed into the backport of the nova patch.

Now that we know OVN has different requirements that we must fulfil to mitigate the race condition in the OVN case, I am prototyping https://review.opendev.org/c/openstack/os-vif/+/798055
as a solution that will actually also fix the OVN variant of this.

Fundamentally, if you naively live migrate between two hosts with OVN, you will always have a race condition between libvirt and OVN, unless we can remove the dependency on the ofport ID that only gets assigned when the device is created on the datapath on the host.
To work around this we are going to create per-port OVS bridges and connect those to br-int.

This is similar to how hybrid plug works with ml2/ovs and iptables, or to how we implement trunk ports in ml2/ovs. Unlike hybrid plug, we will not be using Linux bridges, and we will not have the performance impact that they introduce.

The trunk bridge concept is not new in ml2/ovs, so the port bridge design is more or less well tested, since we are using the same technique, just for a different use case.

Ironically, the fix for OVN will also work for ml2/ovs, and would have avoided some of the changes we already made for ml2/ovs to address this in the first place, but it was initially dismissed as it results in a more complex bridge topology, which at the time we did not want to introduce since ml2/ovs did not need it.
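
To make the proposed topology concrete, a hand-rolled sketch of a per-port bridge patch-cabled into br-int, in the style of the existing ml2/ovs trunk bridges; all names are invented and this only illustrates the wiring, not the actual os-vif patch:

import subprocess

def vsctl(*args):
    subprocess.check_call(("ovs-vsctl",) + args)

# One dedicated bridge per port, joined to br-int by a patch-port pair,
# so flows on the per-port bridge exist independently of the tap device.
port_bridge = "pbr-57d0c265"
vsctl("add-br", port_bridge)
vsctl("add-port", port_bridge, "patch-to-int",
      "--", "set", "Interface", "patch-to-int",
      "type=patch", "options:peer=patch-from-int")
vsctl("add-port", "br-int", "patch-from-int",
      "--", "set", "Interface", "patch-from-int",
      "type=patch", "options:peer=patch-to-int")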

Revision history for this message
sean mooney (sean-k-mooney) wrote :

ignazio (cassano): https://review.opendev.org/c/openstack/nova/+/790447 is the Wallaby backport that will address this for ml2/ovs.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/797142
Committed: https://opendev.org/openstack/nova/commit/99cf5292c782828ea33a3448ae0e7da028d5d176
Submitter: "Zuul (22348)"
Branch: master

commit 99cf5292c782828ea33a3448ae0e7da028d5d176
Author: Stephen Finucane <email address hidden>
Date: Fri Jun 18 17:55:50 2021 +0100

    objects: Fix VIFMigrateData.supports_os_vif_delegation setter

    We're using a 'profile' getter/setter in the 'VIFMigrateData' object to
    ensure we transform the JSON encoded string we store in the database to
    an actual dictionary on load and vice versa on save. However, because
    the getter is returning a new object (constructed from 'json.loads')
    rather than a reference to the original data (which is a string),
    modifying this object doesn't actually change the underlying data in the
    object. We were relying on this broken behavior to set the
    'supports_os_vif_delegation' attribute of the 'VIFMigrateData' object
    and trigger the delegated port creation introduced in change
    I11fb5d3ada7f27b39c183157ea73c8b72b4e672e, which means that code isn't
    actually doing anything yet. Resolve this.

    Change-Id: I362deb1088c88cdcd8219922da9dc9a01b10a940
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-Bug: #1734320
    Related-Bug: #1815989

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/nova/+/797428
Committed: https://opendev.org/openstack/nova/commit/fa0fb2fe3d61de1cb871c48ee97053cf2fb5827a
Submitter: "Zuul (22348)"
Branch: master

commit fa0fb2fe3d61de1cb871c48ee97053cf2fb5827a
Author: Stephen Finucane <email address hidden>
Date: Tue Jun 22 11:37:22 2021 +0100

    libvirt: Always delegate OVS plug to os-vif

    In change I11fb5d3ada7f27b39c183157ea73c8b72b4e672e, we started
    delegating plugging of OVS ports to os-vif to work around a number of
    bugs. However, this was only introduced for live migration. Plugging is
    still handled by libvirt for spawn. This results in an odd situation,
    whereby an interface of type 'bridge' will be used when creating the
    instance initially, only for this to change to 'ethernet' on live
    migration. Resolve this by *always* delegating plugging to os-vif. This
    is achieved by consistently setting the 'delegate_create' attribute of
    'nova.network.model.VIF' to 'True', which will later get transformed to
    the 'create_port' attribute of the 'os_vif.objects.vif.VIFOpenVSwitch'
    object(s) created in 'nova.network.os_vif_util._nova_to_osvif_vif_ovs'
    and ultimately result in delegate port creation.

    Note that we don't need to worry about making the setting of
    'delegate_create' conditional on whether we're looking at an OVS port or
    not: this will be handled by '_nova_to_osvif_vif_ovs'. We also don't
    need to worry about unsetting this attribute before a live migration:
    the 'delegate_create' attribute is always overridden as part of
    'nova.objects.migrate_data.VIFMigrateData.get_dest_vif'.

    Change-Id: I014c5a81752f86c6b99d19d769c42f318e18e676
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-Bug: #1734320
    Related-Bug: #1815989

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/c/openstack/nova/+/770844
Committed: https://opendev.org/openstack/nova/commit/c0a36d917794fed77e75ba9ed853c01a77b540bd
Submitter: "Zuul (22348)"
Branch: stable/train

commit c0a36d917794fed77e75ba9ed853c01a77b540bd
Author: Sean Mooney <email address hidden>
Date: Wed Dec 16 13:12:13 2020 +0000

    only wait for plugtime events in pre-live-migration

    This change modifies _get_neutron_events_for_live_migration
    to filter the event to just the subset that will be sent
    at plug-time.

    Currently neutron has a bug whereby the dhcp agent
    sends a network-vif-plugged event during live migration after
    we update the port profile with "migrating-to:".
    This causes a network-vif-plugged event to be sent for
    configurations where vif_plugging in nova/os-vif is a noop.

    When that is corrected, the current logic in nova causes the migration
    to time out, as it is waiting for an event that will never arrive.

    This change filters the set of events we wait for to just the plug
    time events.

    Conflicts:
        nova/compute/manager.py
        nova/tests/unit/compute/test_compute_mgr.py

    Related-Bug: #1815989
    Closes-Bug: #1901707
    Change-Id: Id2d8d72d30075200d2b07b847c4e5568599b0d3b
    (cherry picked from commit 8b33ac064456482158b23c2a2d52f819ebb4c60e)
    (cherry picked from commit ef348c4eb3379189f290217c9351157b1ebf0adb)
    (cherry picked from commit d9c833d5a404dfa206e08c97543e80cb613b3f0b)

tags: added: in-stable-train
Revision history for this message
ignazio (cassano) wrote :

Hello All,
In the meantime that an official solution will be ready, as regards kolla wallaby, I have created a service that does the discovery of instances on each compute node and, finding a new one, announces it with the command:
docker exec nova_libvirt virsh qemu-monitor-command $ instance --hmp announce_self.

Using the above procedure I lose 2/3 packets during live migration.
Ignazio
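
A rough sketch of such a watcher loop, under the same assumptions as the procedure above (kolla's nova_libvirt container, QEMU >= 4.0 for announce_self):

import subprocess
import time

def virsh(*args):
    # Run virsh inside kolla's libvirt container, as in the command above.
    return subprocess.check_output(
        ["docker", "exec", "nova_libvirt", "virsh"] + list(args),
        universal_newlines=True)

seen = set()
while True:
    domains = set(virsh("list", "--name").split())
    for dom in domains - seen:
        # A domain we have not seen before: likely just migrated in.
        virsh("qemu-monitor-command", dom, "--hmp", "announce_self")
    seen = domains
    time.sleep(2)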

Revision history for this message
sean mooney (sean-k-mooney) wrote :

This is already fixed on master; we need to do the backport, but at this point that is all that is left. It is also fixed in neutron at this point; the neutron backport has already been done, I believe, back to Train. I have also fixed OVN for kernel OVS and submitted patches for vhost-user with OVN in recent weeks.

docker exec nova_libvirt virsh qemu-monitor-command $instance --hmp announce_self will be a semi-effective workaround when using ml2/ovs. It will have no effect for OVN, as that has a different underlying cause: specifically, it loses packets because the OVS flows are not installed, so sending RARP packets does not help with that.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.0.0.0rc1

This issue was fixed in the openstack/nova 24.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/c/openstack/nova/+/820682

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/741529
Committed: https://opendev.org/openstack/nova/commit/d44e24efe28e825fbfd2c75a032bf2d10109a439
Submitter: "Zuul (22348)"
Branch: master

commit d44e24efe28e825fbfd2c75a032bf2d10109a439
Author: Tobias Urdin <email address hidden>
Date: Thu Jul 16 21:29:32 2020 +0200

    libvirt: Add announce-self post live-migration workaround

    This patch adds a workaround that can be enabled
    to send an announce_self QEMU monitor command
    post live-migration to send out RARP frames
    that were lost due to port binding or flows not
    being installed.

    Please note that this marks the domain
    in libvirt as tainted.

    See previous information about this issue in
    the [1] bug.

    [1] https://bugs.launchpad.net/nova/+bug/1815989

    Change-Id: I7a6a6fe5f5b23e76948b59a85ca9be075a1c2d6d
    Related-Bug: 1815989
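
With that merged, enabling the workaround is a nova.conf toggle on the compute nodes; assuming the option name introduced by this change, it looks like:

[workarounds]
# Issue announce_self through the QEMU monitor after live migration.
# Note this marks the domain as tainted in libvirt.
enable_qemu_monitor_announce_self = True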

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/825064

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/825175

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/825176

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/825177

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/825178

melanie witt (melwitt)
no longer affects: nova/xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/828148

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/nova/+/825064
Committed: https://opendev.org/openstack/nova/commit/a8981422afdd09f8cfea053e592c15e771fbe969
Submitter: "Zuul (22348)"
Branch: stable/xena

commit a8981422afdd09f8cfea053e592c15e771fbe969
Author: Tobias Urdin <email address hidden>
Date: Thu Jul 16 21:29:32 2020 +0200

    libvirt: Add announce-self post live-migration workaround

    NOTE(melwitt): This is the combination of two commits, the workaround
    config option and a followup change to add a note that enabling the
    workaround will cause the guest domain to be considered tainted by
    libvirt.

    This patch adds a workaround that can be enabled
    to send an announce_self QEMU monitor command
    post live-migration to send out RARP frames
    that were lost due to port binding or flows not
    being installed.

    Please note that this marks the domain
    in libvirt as tainted.

    See previous information about this issue in
    the [1] bug.

    [1] https://bugs.launchpad.net/nova/+bug/1815989

    Related-Bug: 1815989

    Update announce self workaround opt description

    This updates the announce self workaround config opt
    description to include info about instance being set
    as tainted by libvirt.

    Change-Id: I8140c8fe592dd54fc09a9510723892806db49a56
    (cherry picked from commit 2aa1ed5810b67b9a8f18b2ec5e21004f93831168)

    Change-Id: I7a6a6fe5f5b23e76948b59a85ca9be075a1c2d6d
    (cherry picked from commit d44e24efe28e825fbfd2c75a032bf2d10109a439)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/825178
Committed: https://opendev.org/openstack/nova/commit/9609ae0bab30675e184d1fc63aec849c1de020d0
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 9609ae0bab30675e184d1fc63aec849c1de020d0
Author: Tobias Urdin <email address hidden>
Date: Thu Jul 16 21:29:32 2020 +0200

    libvirt: Add announce-self post live-migration workaround

    NOTE(melwitt): This is the combination of two commits, the workaround
    config option and a followup change to add a note that enabling the
    workaround will cause the guest domain to be considered tainted by
    libvirt.

    This patch adds a workaround that can be enabled
    to send an announce_self QEMU monitor command
    post live-migration to send out RARP frames
    that were lost due to port binding or flows not
    being installed.

    Please note that this marks the domain
    in libvirt as tainted.

    See previous information about this issue in
    the [1] bug.

    [1] https://bugs.launchpad.net/nova/+bug/1815989

    Related-Bug: 1815989

    Update announce self workaround opt description

    This updates the announce self workaround config opt
    description to include info about instance being set
    as tainted by libvirt.

    Change-Id: I8140c8fe592dd54fc09a9510723892806db49a56
    (cherry picked from commit 2aa1ed5810b67b9a8f18b2ec5e21004f93831168)

    Change-Id: I7a6a6fe5f5b23e76948b59a85ca9be075a1c2d6d
    (cherry picked from commit d44e24efe28e825fbfd2c75a032bf2d10109a439)
    (cherry picked from commit a8981422afdd09f8cfea053e592c15e771fbe969)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/train)

Change abandoned by "Stephen Finucane <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/797316

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ussuri)

Change abandoned by "Stephen Finucane <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/797291

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/victoria)

Change abandoned by "Stephen Finucane <email address hidden>" on branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/797144

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/wallaby)

Change abandoned by "Stephen Finucane <email address hidden>" on branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/790447

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/stein)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/stein
Review: https://review.opendev.org/c/openstack/nova/+/820682
Reason: This branch transitioned to End of Life for this project, open patches needs to be closed to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/ussuri)

Change abandoned by "Rodolfo Alonso <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/754475
Reason: If needed, please restore the patch and address the comments.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/wallaby)
Download full text (4.4 KiB)

Reviewed: https://review.opendev.org/c/openstack/nova/+/790447
Committed: https://opendev.org/openstack/nova/commit/23a4b27dc0c156ad6cbe5260d518da3fd62294b8
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 23a4b27dc0c156ad6cbe5260d518da3fd62294b8
Author: Stephen Finucane <email address hidden>
Date: Fri Apr 30 12:51:35 2021 +0100

    libvirt: Delegate OVS plug to os-vif

    os-vif 1.15.0 added the ability to create an OVS port during plugging
    by specifying the 'create_port' attribute in the 'port_profile' field.
    By delegating port creation to os-vif, we can rely on its 'isolate_vif'
    config option [1] that will temporarily configure the VLAN to 4095
    (0xfff), which is reserved for implementation use [2] and is used by
    neutron as a dead VLAN [3]. By doing this, we ensure VIFs are plugged
    securely, preventing guests from accessing other tenants' networks
    before the neutron OVS agent can wire up the port.

    This change requires a little dance as part of the live migration flow.
    Since we can't be certain the destination host has a version of os-vif
    that supports this feature, we need to use a sentinel to indicate when
    it does. Typically we would do so with a field in
    'LibvirtLiveMigrateData', such as the 'src_supports_numa_live_migration'
    and 'dst_supports_numa_live_migration' fields used to indicate support
    for NUMA-aware live migration. However, doing this prevents us
    backporting this important fix since o.vo changes are not backportable.
    Instead, we (somewhat evilly) rely on the free-form nature of the
    'VIFMigrateData.profile_json' string field, which stores JSON blobs and
    is included in 'LibvirtLiveMigrateData' via the 'vifs' attribute, to
    transport this sentinel. This is a hack but is necessary to work around
    the lack of a free-form "capabilities" style dict that would allow us do
    backportable fixes to live migration features.

    Note that this change has the knock on effect of modifying the XML
    generated for OVS ports: when hybrid plug is false, they will now be of type
    'ethernet' rather than 'bridge' as before. This explains the larger than
    expected test damage but should not affect users.

    Changes:
      lower-constraints.txt
      requirements.txt
      nova/network/os_vif_util.py
      nova/tests/unit/virt/libvirt/test_vif.py
      nova/tests/unit/virt/libvirt/test_driver.py
      nova/virt/libvirt/driver.py

    NOTE(stephenfin): Change I362deb1088c88cdcd8219922da9dc9a01b10a940
    ("objects: Fix VIFMigrateData.supports_os_vif_delegation setter") which
    contains an important fix for the original change, is squashed into this
    change. In addition, the os-vif version bump we introduced in the
    original version of this patch is not backportable and as a result, we
    must introduce two additional checks. Both checks ensure we have a
    suitable version of os-vif and skip the new code paths if not. The first
    check is in the libvirt driver's 'check_can_live_migrate_destination'
    function, which as the name suggests runs on the destination host early
    in the live migration process. If os-v...

Read more...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/train)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/825177
Reason: stable/train branch of nova projects' have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ussuri)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/825176
Reason: stable/ussuri branch of openstack/nova transitioned to End of Life and is about to be deleted. To be able to do that, all open patches need to be abandoned.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova wallaby-eom

This issue was fixed in the openstack/nova wallaby-eom release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/victoria)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/825175
Reason: stable/victoria branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/victoria if you want to further work on this patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/wallaby)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/828148
Reason: stable/wallaby branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/wallaby if you want to further work on this patch.

Revision history for this message
norman shen (jshen28) wrote (last edit ):

Curious to know if we can achieve 0 packet loss for live migration after applying the above fix?

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

I will assume you mean the announce-self patch. There is a time window between the VM having already been unpaused and the post-live-migration action running on the destination nova-compute, so it's not fully zero-impact/zero-packet-loss in that small window.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

0 packet loss is not the end goal, and it's not something we can ever commit to.
All we can do is try to minimise the packet loss, but the variance of technology stacks and network topologies means we can never provide a hard SLA.

Specifically, QEMU/libvirt don't provide any SLA with regard to packet loss, nor does OVS in general.
ml2/ovs and OVN themselves have been enhanced to duplicate packets and send them to both the source and destination, to try to cover the gap between unpause and network connectivity.
They use the sending of the RARP packets to trigger the disabling of the packet replication and fully activate the forwarding on the destination. I believe the window is now in the tens of milliseconds, but it's not 0, and we will likely never get to 0.

We just don't control enough of the moving parts.

Live migration is intended for host maintenance, and we always recommend minimising the traffic to the guest while doing it, if possible.

For non-NFV/VNF workloads the downtime should mostly not be noticeable, and TCP retransmission will paper over it.

Displaying the first 40 and last 40 of 101 comments.
