Live migration packet loss increasing as the number of security group rules increases

Bug #1970606 reported by Yusuf Güngör
This bug affects 4 people
Affects: neutron
Status: Won't Fix
Importance: High
Assigned to: Unassigned

Bug Description

Hi,

We lose too many packets during live migration (after post_live_migration starts).

After investigation we recognized that it is related to the number of security group rules applied to the instance.

We lose 26 pings if there are 90 security group rules applied to the instance. (The number of security groups does not matter: 1 group with 90 rules or 3 groups with 30 rules each.)

After detaching some rules so that the instance had only 4 security group rules, we tried to migrate again. In that case we lose only 3 pings.
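
For reference, the loss can be counted roughly like the minimal sketch below (the script name and the 10.0.0.5 address are placeholders; it just wraps the system ping and reports which sequence numbers never came back):

----------------- count_lost_pings.py -----------------
#!/usr/bin/env python3
# Ping the instance while the live migration runs and report which
# ICMP sequence numbers never came back.
import re
import subprocess

TARGET = "10.0.0.5"   # placeholder: instance fixed/floating IP
COUNT = 120           # one ping per second for two minutes

result = subprocess.run(
    ["ping", "-c", str(COUNT), "-i", "1", TARGET],
    capture_output=True, text=True,
)

# iputils ping prints "icmp_seq=N" for every reply that arrives.
seen = {int(seq) for seq in re.findall(r"icmp_seq=(\d+)", result.stdout)}
lost = [seq for seq in range(1, COUNT + 1) if seq not in seen]
print(f"lost {len(lost)} of {COUNT} pings: {lost}")
------------------------------------------------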

Do you have any idea? If this is caused by migrating the OVS flows, then is there any solution?

Environment Details:
 OpenStack Wallaby cluster installed via kolla-ansible on Ubuntu 20.04.2 LTS hosts (kernel: 5.4.0-90-generic).
 There are 5 controller+network nodes.
 The "neutron-openvswitch-agent", "neutron-l3-agent" and "neutron-server" version is "18.1.2.dev118".
 Open vSwitch is used in DVR mode with router HA configured (l3_ha = true).
 We are using a single centralized neutron router for connecting all tenant networks to the provider network.
 We are using bgp_dragent to announce unique tenant networks.
 Tenant network type: vxlan
 External network type: vlan

Tags: ovs-fw
Yusuf Güngör (yusuf2)
description: updated
Revision history for this message
Oleg Bondarev (obondarev) wrote :

Do you have "live_migration_wait_for_vif_plug" nova config set to True (default)?

Revision history for this message
Yusuf Güngör (yusuf2) wrote (last edit ):

Hi Oleg, thanks for the reply.

Yes, we have live_migration_wait_for_vif_plug set to True. Even though it is True by default, it is also written explicitly to the config file (under the [compute] section).

Revision history for this message
Yusuf Güngör (yusuf2) wrote (last edit ):

We have examined the bugs below:

https://bugs.launchpad.net/neutron/+bug/1901707
https://bugs.launchpad.net/neutron/+bug/1815989
https://bugs.launchpad.net/neutron/+bug/1414559
https://bugs.launchpad.net/neutron/+bug/1880389

We tried the configs below.

----------------- neutron.conf -----------------
[nova]
live_migration_events = True
------------------------------------------------

----------------- nova.conf -----------------
[DEFAULT]
vif_plugging_timeout = 600
vif_plugging_is_fatal = False
debug = True

[compute]
live_migration_wait_for_vif_plug = True

[workarounds]
enable_qemu_monitor_announce_self = True

[libvirt]
live_migration_permit_post_copy=true
live_migration_timeout_action=force_complete
live_migration_permit_auto_converge=true
------------------------------------------------

While migrating an instance, we have seen from the nova-compute logs that the ping loss starts when the source node prints the "Received event network-vif-unplugged" log. The ping loss continues until the "Received event network-vif-plugged" log is printed on the destination.
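
For reference, the gap between those two events can be estimated roughly with a sketch like the one below (the log file names, the placeholder port UUID and the default oslo.log timestamp format are assumptions; it also assumes the two hosts' clocks are in sync):

----------------- vif_event_gap.py -----------------
#!/usr/bin/env python3
# Estimate the dataplane gap from the nova-compute logs: the time between
# "network-vif-unplugged" on the source and "network-vif-plugged" on the
# destination for one port.
from datetime import datetime

PORT_ID = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"  # placeholder port UUID

def event_time(path, needle):
    with open(path) as log:
        for line in log:
            if needle in line and PORT_ID in line:
                # Default oslo.log lines start with "YYYY-MM-DD HH:MM:SS.mmm".
                stamp = " ".join(line.split()[:2])
                return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
    return None

unplugged = event_time("source-nova-compute.log", "network-vif-unplugged")
plugged = event_time("dest-nova-compute.log", "network-vif-plugged")
if unplugged and plugged:
    print(f"dataplane gap: {(plugged - unplugged).total_seconds():.1f} s")
------------------------------------------------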

So, somehow the "live_migration_wait_for_vif_plug" config parameter is not working?

Revision history for this message
Oleg Bondarev (obondarev) wrote :

Thanks for the update, Yusuf.

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
tags: added: ovs-fw
Revision history for this message
Yusuf Güngör (yusuf2) wrote :

Added source compute node - nova logs

Revision history for this message
Yusuf Güngör (yusuf2) wrote :

Added destination compute node - nova logs

Revision history for this message
Yusuf Güngör (yusuf2) wrote :

Added source compute node - neutron logs

Revision history for this message
Yusuf Güngör (yusuf2) wrote :

Added destination compute node - neutron logs

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

Let me first confirm that you are using Wallaby and the OVS backend.

In the OVS backend there are two types of plugs: native and hybrid. The native plug is used by default and works with the OVS native firewall. The hybrid plug is used when the iptables firewall driver is used.

When using the hybrid plug, the TAP port is created when Nova/os-vif creates the L1 port. This TAP port is connected to the Linux bridge where the iptables rules will be set. The Neutron OVS agent has time to set the OVS rules (fewer of them), and when the VM is unpaused on the destination host there is no disruption (or it is shorter). You can switch to the iptables firewall if the disruption time is critical for your operation, as sketched below.
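
For reference, selecting the iptables-based driver would look roughly like this (a sketch only; existing instances typically need their ports re-plugged, e.g. via a hard reboot or a migration, before the change takes effect):

----------------- openvswitch_agent.ini -----------------
[securitygroup]
firewall_driver = iptables_hybrid
------------------------------------------------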

When using the native plug, the port is created but not the TAP port. That means there is no ofport and the OVS OpenFlow rules can't be set. It is only at the very last moment, when the VM is unpaused, that libvirt creates the TAP port. At that point the OVS agent starts applying the OVS OpenFlow rules. The more rules you have, the bigger the time gap can be.

In Neutron Wallaby you can use "live_migration_events" [1] (removed in Zed, where it is now True by default). That needs the Nova patch [2], which was merged in this release; check first whether your Nova has it. That will reduce the live migration disruption, but won't remove it entirely.

In Neutron master you can use "openflow_processed_per_port" [3]. This option allows the OVS agent to write all OpenFlow rules related to a single port in a single transaction. That should reduce the disruption too.
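
For illustration, enabling it would look roughly like this (a sketch; it assumes the option is registered under the [OVS] group of the agent config on master, so check the configuration reference of your release):

----------------- openvswitch_agent.ini -----------------
[OVS]
openflow_processed_per_port = True
------------------------------------------------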

In any case, Neutron does not have an SLA for the live-migration network disruption time; we provide a best-effort promise but nothing else.

Regards.

[1]https://review.opendev.org/c/openstack/neutron/+/766277
[2]https://review.opendev.org/c/openstack/nova/+/767368
[3]https://review.opendev.org/c/openstack/neutron/+/806246

Changed in neutron:
status: Confirmed → Won't Fix
Revision history for this message
Yusuf Güngör (yusuf2) wrote (last edit ):

Hi Rodolfo, thank you for the detailed explanation.

Yes, we are using the Wallaby version, the OVS backend and the native openvswitch firewall driver.

We have tried "live_migration_events" but, as you mentioned, it only decreases the packet loss a bit. (We have patched Nova too.)

We are not able to test the new "openflow_processed_per_port" parameter for now, because it is only on master and not backported to older releases. We could not find a planned date for backporting this parameter to older versions like Wallaby.

It is OK that Neutron does not have an SLA for the live-migration network disruption time, but it is very hard to use as-is in production. Imagine a firmware upgrade operation on a bare metal host and live migrating the instances to another host: we are facing network disruption of up to 2 minutes when migrating each instance :(

We are using DVR and could not find an alternative to the openvswitch firewall; do you know of any?

We are also not able to explain this situation to business associates when migrating to OpenStack from other paid cloud services. People expect very little network disruption (a few lost pings) during live migration.

We really appreciate the great work of the Neutron team, and thanks for the best-effort promise. Please accept this as feedback. Thank you.

Revision history for this message
Sven Kieske (s-kieske) wrote :

Hi Rodolfo,

I understand this might be a difficult problem to solve, but could you maybe explain in a little bit more detail what you think the point of "live migration" is, when an arbitrary number of packets can be dropped during it?

Do you think any application running inside a guest VM should be able to cope with arbitrary network drops? If that's the case, I honestly don't see what the use case for "live" migration is, because then I could just cold migrate an instance; in the latter case I have a network timeout anyway.

I think the point of live migration is to have no network interruption during migration and to be able to keep the workload running.

Also: where is this documented, besides this bug tracker entry (which is not official information, is it)?

I can't even find this information in the "internal" live-migration docs:

https://docs.openstack.org/neutron/latest/contributor/internals/live_migration.html

So maybe at least update that, so users are not surprised by this behaviour?

Thanks
