OVS drops RARP packets by QEMU upon live-migration - VM temporarily disconnected

Bug #1414559 reported by Liaz Kamper
This bug affects 19 people
Affects                    Status         Importance  Assigned to      Milestone
OpenStack Compute (nova)   Fix Released   Medium      Sahid Orentino
  Queens                   Fix Committed  Medium      Matt Riedemann
neutron                    Fix Released   Medium      Oleg Bondarev

Bug Description

When live-migrating a VM, QEMU sends five RARP packets so that the network can re-learn the new location of the VM's MAC address.
However, the VIF creation scheme between nova-compute and neutron-ovs-agent drops these RARPs:
1. nova creates a port on OVS, but without the internal tagging.
2. At this stage all packets that come out of the VM, or the QEMU process it runs in, are dropped.
3. QEMU sends five RARP packets in order to allow MAC learning. These packets are dropped as described in #2.
4. In the meantime, neutron-ovs-agent loops every POLLING_INTERVAL and scans for new ports. Once it detects that a new port has been added, it reads the properties of the new port and assigns the correct internal tag, which allows the VM to connect.

The flow above suggests that:
1. RARP packets are dropped, so MAC learning takes much longer and depends on internal traffic and advertising by the VM.
2. The VM is disconnected from the network for a mean period of POLLING_INTERVAL/2.

It seems this could be solved by direct messages between the nova VIF driver and neutron-ovs-agent.
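
The timing argument above can be put into a small sketch. It is an illustration only, not nova or neutron code: the function names, the 2-second polling interval and the RARP send offsets are assumptions made up for the example, and it assumes the port appears at a uniformly random point in the agent's polling cycle.

```python
from typing import List


def mean_disconnect(polling_interval: float) -> float:
    """Mean wait until the next agent poll.

    Assumes the port is created at a uniformly random moment within one
    polling cycle, so the expected wait is half the interval.
    """
    return polling_interval / 2.0


def dropped_rarps(rarp_offsets: List[float], tag_delay: float) -> List[float]:
    """Return the RARP send offsets (seconds after port creation) that
    fall before the agent has tagged the port and are therefore dropped."""
    return [t for t in rarp_offsets if t < tag_delay]


if __name__ == "__main__":
    polling_interval = 2.0  # assumed value, not taken from the report
    offsets = [0.05, 0.2, 0.45, 0.8, 1.25]  # made-up RARP send times
    delay = mean_disconnect(polling_interval)
    print("mean disconnect:", delay)
    print("dropped RARPs:", dropped_rarps(offsets, tag_delay=delay))
```

With these made-up numbers, every RARP sent less than one second after port creation is lost, which is why the new location of the MAC address may only be learned from later guest traffic.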

Revision history for this message
Dongcan Ye (hellochosen) wrote :

Also encountered in my environment.
I use OVS + VLAN mode. After live migration the VM sends RARP packets, but the RARP packets do not carry a VLAN tag, so they cannot be sent out.

Changed in neutron:
assignee: nobody → sean mooney (sean-k-mooney)
status: New → Incomplete
Revision history for this message
sean mooney (sean-k-mooney) wrote :

Hi, I have been looking into this today and have not been able to reproduce it.

I have been testing with the head of master, and each time I live migrate I am receiving the RARP packets on the appropriate VLAN tag.

I have not yet found a specific commit that fixes this issue, but it appears to be resolved on the current master.

Can you revalidate or provide more information?

If not, I think this can be marked as invalid, as the behavior is no longer present.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

I will try to revalidate tomorrow with larger Fedora VMs running iperf, to generate more CPU, memory and network load in the VMs I am live-migrating and see whether that affects the result.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

I have tried to recreate this bug using both the cirros nano image and
a 4-CPU, 1 GB Ubuntu image, and have not been able to reproduce the behavior cited.

I used the cirros nano image to test live migrating a very small VM, which should migrate very quickly.
In the larger image I ran an iperf server and an iperf client connected over the loopback interface, to generate a lot of CPU and memory load and test a longer migration.

Bringing up the br-int interface and sniffing for RARP packets shows that the RARP packets are correctly VLAN tagged in both cases.

sudo tshark -V -i br-int rarp
...
Frame 11: 64 bytes on wire (512 bits), 64 bytes captured (512 bits) on interface 0
    Interface id: 0
    Encapsulation type: Ethernet (1)
    Arrival Time: Aug 26, 2015 11:36:39.533142000 IST
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1440585399.533142000 seconds
    [Time delta from previous captured frame: 0.350099000 seconds]
    [Time delta from previous displayed frame: 0.350099000 seconds]
    [Time since reference or first frame: 465.125834000 seconds]
    Frame Number: 11
    Frame Length: 64 bytes (512 bits)
    Capture Length: 64 bytes (512 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: eth:vlan:arp]
Ethernet II, Src: fa:16:3e:47:8b:b0 (fa:16:3e:47:8b:b0), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
    Destination: Broadcast (ff:ff:ff:ff:ff:ff)
        Address: Broadcast (ff:ff:ff:ff:ff:ff)
        .... ..1. .... .... .... .... = LG bit: Locally administered address (this is NOT the factory default)
        .... ...1 .... .... .... .... = IG bit: Group address (multicast/broadcast)
    Source: fa:16:3e:47:8b:b0 (fa:16:3e:47:8b:b0)
        Address: fa:16:3e:47:8b:b0 (fa:16:3e:47:8b:b0)
        .... ..1. .... .... .... .... = LG bit: Locally administered address (this is NOT the factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: 802.1Q Virtual LAN (0x8100)
802.1Q Virtual LAN, PRI: 0, CFI: 0, ID: 1
    000. .... .... .... = Priority: Best Effort (default) (0)
    ...0 .... .... .... = CFI: Canonical (0)
    .... 0000 0000 0001 = ID: 1
    Type: RARP (0x8035)
    Padding: 0000000000000000000000000000
    Trailer: 00000000
Address Resolution Protocol (reverse request)
    Hardware type: Ethernet (1)
    Protocol type: IP (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: reverse request (3)
    Sender MAC address: fa:16:3e:47:8b:b0 (fa:16:3e:47:8b:b0)
    Sender IP address: 0.0.0.0 (0.0.0.0)
    Target MAC address: fa:16:3e:47:8b:b0 (fa:16:3e:47:8b:b0)
    Target IP address: 0.0.0.0 (0.0.0.0)

Unless the original reporter can provide more information on how to reproduce,
I will mark this as invalid at the end of the week, as it appears to be fixed.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Marking as invalid, as the bug cannot be reproduced.
Please reopen if this is still an issue for you and you can provide more info on how to recreate it.

Changed in neutron:
status: Incomplete → Invalid
Changed in neutron:
assignee: sean mooney (sean-k-mooney) → Oleg Bondarev (obondarev)
Revision history for this message
Radomir Dopieralski (deshipu) wrote :

We have a bug reported from our customers about this, and we have been able to reproduce it in our testing environments without any problems -- there is a noticeable pause in network connectivity of the guest right after migration.

We have a bug for it: https://bugs.launchpad.net/nova/+bug/1511430

Revision history for this message
Radomir Dopieralski (deshipu) wrote :

One note: this is much more noticeable when running a guest with an older kernel -- it's several dozen seconds with a RHEL6 guest but only fractions of a second with a RHEL7 guest.

Changed in neutron:
status: Invalid → Triaged
importance: Undecided → Medium
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/246898

Changed in neutron:
status: Triaged → In Progress
Revision history for this message
Oleg Bondarev (obondarev) wrote :
Changed in nova:
assignee: nobody → Oleg Bondarev (obondarev)
status: New → In Progress
Changed in neutron:
assignee: Oleg Bondarev (obondarev) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Oleg Bondarev (obondarev)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/281137

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/297170

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/mitaka)

Change abandoned by Sergey Belous (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/297170

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/281137
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=62bcfc595dc49a7035b95daadc72e8744c48c8e7
Submitter: Jenkins
Branch: master

commit 62bcfc595dc49a7035b95daadc72e8744c48c8e7
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 16 19:10:03 2016 +0300

    Add ability to filter migrations by instance uuid

    This will be used by dependent patch.

    Partial-Bug: #1414559
    Change-Id: I20470487287fa2a7aa919507073f75181368c3c0

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/246898
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b7c303ee0a16a05c1fdb476dc7f4c7ca623a3f58
Submitter: Jenkins
Branch: master

commit b7c303ee0a16a05c1fdb476dc7f4c7ca623a3f58
Author: Oleg Bondarev <email address hidden>
Date: Wed Nov 18 12:15:09 2015 +0300

    Notify nova with network-vif-plugged in case of live migration

     - during live migration on pre migration step nova plugs instance
       vif device on the destination compute node;
     - L2 agent on destination host detects new device and requests device
       info from server;
     - server does not change port status since port is bound to another
       host (source host);
     - L2 agent processes device and sends update_device_up to server;
     - again server does not update status as port is bound to another host;

    Nova notifications are sent only in case port status change so in this case
    no notifications are sent.

    The fix is to explicitly notify nova if agent reports device up from a host
    other than port's current host.

    This is the fix on neutron side, the actual fix of the bug is on nova side:
    change-id Ib1cb9c2f6eb2f5ce6280c685ae44a691665b4e98

    Closes-Bug: #1414559
    Change-Id: Ifa919a9076a3cc2696688af3feadf8d7fa9e6fc2
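
The decision described in this commit message can be sketched roughly as follows. This is a simplified illustration, not the neutron implementation: the `Port` class, the `update_device_up` signature and the `notify_nova` callback are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Port:
    port_id: str
    bound_host: str  # host the port is currently bound to
    status: str      # "ACTIVE" or "DOWN"


def update_device_up(port: Port, reporting_host: str,
                     notify_nova: Callable[[str, str], None]) -> bool:
    """Handle an agent's 'device up' report; return True if status changed."""
    if reporting_host != port.bound_host:
        # Live migration: the port is still bound to the source host, so its
        # status is left alone. Without an explicit notification nova would
        # never hear that the VIF is plugged on the destination.
        notify_nova("network-vif-plugged", port.port_id)
        return False
    if port.status != "ACTIVE":
        # Normal case: the status change itself triggers the notification.
        port.status = "ACTIVE"
        notify_nova("network-vif-plugged", port.port_id)
        return True
    return False
```

The point of the sketch is the first branch: a report from a host other than the bound one produces an event even though the port status does not change.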

Changed in neutron:
status: In Progress → Fix Released
tags: added: neutron-proactive-backport-potential
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Let's look at backportability, I agree.

Revision history for this message
Gaudenz Steinlin (gaudenz-debian) wrote :

We are hitting the same issue with the neutron ml2 linuxbridge driver. As pointed out by Andreas Scheuring in https://review.openstack.org/#/c/246910/ nova just creates the bridge but does not plug the physical interface to it. If the VM migration succeeds in a short time, then the RARP packets are sent out before the neutron-linuxbridge-agent loop to detect new interfaces had a chance to detect the new interface and correctly set up the bridge's physical interface.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.0.0.0b3

This issue was fixed in the openstack/neutron 9.0.0.0b3 development milestone.

tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/413555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/415190

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/mitaka)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/415190

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/246910
Reason: This review is > 4 weeks without comment, and is not mergeable in its current state. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/413555
Reason: This review is > 4 weeks without comment, and is not mergeable in its current state. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/497457

Changed in nova:
assignee: Oleg Bondarev (obondarev) → sahid (sahid-ferdjaoui)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/506182
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=38b3d4e16ac76d97f64f264c68ef9b88d66e0324
Submitter: Jenkins
Branch: master

commit 38b3d4e16ac76d97f64f264c68ef9b88d66e0324
Author: Sahid Orentino Ferdjaoui <email address hidden>
Date: Thu Sep 21 15:42:54 2017 +0200

    ml2: fix update_device_up to send lm events with linux bridge

    In case of a live migration and with linux bridge the events are not
    sent to Nova, because the port UUID returned by _device_to_port_id may
    be a truncated UUID and the current plugin._get_port() can't find it.

    Related-Bug: #1414559
    Change-Id: Icb039ae2d465e3822ab07ae4f9bc405c1362afba
    Signed-off-by: Sahid Orentino Ferdjaoui <email address hidden>
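
To illustrate the truncated-UUID problem this commit message describes, here is a minimal sketch. It is not the neutron code: `device_to_port_prefix` and `find_port_id` are hypothetical helpers, the full UUIDs are made up, and the sketch assumes the device name is `tap` followed by the leading characters of the port UUID (as in a name like tap603f25db-55).

```python
from typing import Iterable, Optional

TAP_PREFIX = "tap"


def device_to_port_prefix(device: str) -> str:
    """Strip the 'tap' prefix; what remains is only the start of the UUID."""
    if device.startswith(TAP_PREFIX):
        return device[len(TAP_PREFIX):]
    return device


def find_port_id(device: str, known_port_ids: Iterable[str]) -> Optional[str]:
    """Resolve a device name to a full port ID by prefix match.

    An exact lookup on the truncated value finds nothing, so match on the
    prefix and accept the result only when it is unambiguous.
    """
    prefix = device_to_port_prefix(device)
    matches = [pid for pid in known_port_ids if pid.startswith(prefix)]
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    ports = [  # made-up UUIDs for illustration
        "603f25db-5512-4c1e-9d2a-0123456789ab",
        "7a1b2c3d-0000-4000-8000-000000000001",
    ]
    print(find_port_id("tap603f25db-55", ports))
```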

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/510013

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/497456
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=23446a9552b5be3b040278646149a0f481d0a005
Submitter: Zuul
Branch: master

commit 23446a9552b5be3b040278646149a0f481d0a005
Author: Sahid Orentino Ferdjaoui <email address hidden>
Date: Thu Aug 24 09:47:29 2017 -0400

    libvirt: add method to configure migration speed

    In this commit we are enhancing guest object to control the maximum bw
    to perform migration.

    Related-Bug: #1414559
    Change-Id: I35470773b8c467449ed71217fdb4b6c82f455e33
    Signed-off-by: Sahid Orentino Ferdjaoui <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/pike)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/510013
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in nova:
assignee: sahid (sahid-ferdjaoui) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/510013
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=50988f3f3665adb4dd481aff5515a81878c8e067
Submitter: Zuul
Branch: stable/pike

commit 50988f3f3665adb4dd481aff5515a81878c8e067
Author: Sahid Orentino Ferdjaoui <email address hidden>
Date: Thu Sep 21 15:42:54 2017 +0200

    ml2: fix update_device_up to send lm events with linux bridge

    In case of a live migration and with linux bridge the events are not
    sent to Nova, because the port UUID returned by _device_to_port_id may
    be a truncated UUID and the current plugin._get_port() can't find it.

    Related-Bug: #1414559
    Change-Id: Icb039ae2d465e3822ab07ae4f9bc405c1362afba
    Signed-off-by: Sahid Orentino Ferdjaoui <email address hidden>
    (cherry picked from commit 38b3d4e16ac76d97f64f264c68ef9b88d66e0324)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/497457
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8e6d5d404cf49e5b68b43c62e7f6d7db2771a1f4
Submitter: Zuul
Branch: master

commit 8e6d5d404cf49e5b68b43c62e7f6d7db2771a1f4
Author: Sahid Orentino Ferdjaoui <email address hidden>
Date: Thu Aug 24 09:13:09 2017 -0400

    libvirt: slow live-migration to ensure network is ready

    In Neutron, commit b7c303ee0a16a05c1fdb476dc7f4c7ca623a3f58 introduced
    events sent during a live migration when the VIFs are plugged on
    destination node.

    The Linux bridge agent mechanism driver is detecting new networks on
    the destination host only when the TAP devices are created, and these
    tap devices are only created when libvirt starts the migration. As a
    result, we must actually start the migration and then slow it as we
    wait for the neutron events.

    This change ensures we wait for these events.

    Depends-On: Icb039ae2d465e3822ab07ae4f9bc405c1362afba

    Closes-Bug: #1414559
    Signed-off-by: Sahid Orentino Ferdjaoui <email address hidden>
    Change-Id: I407034374fe17c4795762aa32575ba72d3a46fe8
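
The "start slow, wait for the events, then speed up" idea from this commit message can be sketched like this. It is only an illustration under stated assumptions, not nova's driver code: the function name, the callbacks, the speed values and the choice to restore full speed even on timeout are all hypothetical.

```python
import threading
from typing import Callable


def migrate_waiting_for_network(
    start_migration: Callable[[], None],
    set_max_speed: Callable[[int], None],
    vifs_plugged: threading.Event,
    timeout: float,
    slow_speed: int = 1,      # assumed throttle value
    full_speed: int = 1000,   # assumed normal bandwidth
) -> bool:
    """Throttle the migration until the network reports the VIFs plugged.

    Returns True if the event arrived within `timeout` seconds.
    """
    # Throttle first so the migration cannot finish before the network is up.
    set_max_speed(slow_speed)
    # Starting the migration is what creates the TAP devices on the
    # destination, which in turn lets the agent detect them.
    start_migration()
    ready = vifs_plugged.wait(timeout)
    # In this sketch the speed is restored whether or not the event arrived.
    set_max_speed(full_speed)
    return ready
```

In a real deployment the event would be set by whatever receives the network-vif-plugged notification; here it is just a `threading.Event` passed in by the caller.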

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/557930

Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → Medium
assignee: Matt Riedemann (mriedem) → sahid (sahid-ferdjaoui)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/559032

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/559034

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/561160

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/557930
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=841b0fbea6373ea2ae123f851fb90555faff12e2
Submitter: Zuul
Branch: stable/queens

commit 841b0fbea6373ea2ae123f851fb90555faff12e2
Author: Sahid Orentino Ferdjaoui <email address hidden>
Date: Thu Aug 24 09:13:09 2017 -0400

    libvirt: slow live-migration to ensure network is ready

    In Neutron, commit b7c303ee0a16a05c1fdb476dc7f4c7ca623a3f58 introduced
    events sent during a live migration when the VIFs are plugged on
    destination node.

    The Linux bridge agent mechanism driver is detecting new networks on
    the destination host only when the TAP devices are created, and these
    tap devices are only created when libvirt starts the migration. As a
    result, we must actually start the migration and then slow it as we
    wait for the neutron events.

    This change ensures we wait for these events.

    Depends-On: https://review.openstack.org/506182/

    Closes-Bug: #1414559
    Signed-off-by: Sahid Orentino Ferdjaoui <email address hidden>
    Change-Id: I407034374fe17c4795762aa32575ba72d3a46fe8
    (cherry picked from commit 8e6d5d404cf49e5b68b43c62e7f6d7db2771a1f4)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b1

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.3

This issue was fixed in the openstack/nova 17.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/561160
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d9c6610cb60f2da152f567aaa1b0cdb9edef2957
Submitter: Zuul
Branch: stable/ocata

commit d9c6610cb60f2da152f567aaa1b0cdb9edef2957
Author: Sahid Orentino Ferdjaoui <email address hidden>
Date: Thu Sep 21 15:42:54 2017 +0200

    ml2: fix update_device_up to send lm events with linux bridge

    In case of a live migration and with linux bridge the events are not
    sent to Nova, because the port UUID returned by _device_to_port_id may
    be a truncated UUID and the current plugin._get_port() can't find it.

    Conflicts:
            neutron/tests/unit/plugins/ml2/test_rpc.py

    NOTE(lyarwood): Test conflict introducing an additional mock for ml2_db,
    required prior to Pike and Ia15c63f94d2c67791da3b65546e59f6929c8c685.

    Related-Bug: #1414559
    Change-Id: Icb039ae2d465e3822ab07ae4f9bc405c1362afba
    Signed-off-by: Sahid Orentino Ferdjaoui <email address hidden>
    (cherry picked from commit 38b3d4e16ac76d97f64f264c68ef9b88d66e0324)
    (cherry picked from commit 50988f3f3665adb4dd481aff5515a81878c8e067)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/559032
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ff747792b8f5aefe1bebb01bdf49dacc01353348
Submitter: Zuul
Branch: stable/pike

commit ff747792b8f5aefe1bebb01bdf49dacc01353348
Author: Sahid Orentino Ferdjaoui <email address hidden>
Date: Thu Aug 24 09:13:09 2017 -0400

    libvirt: slow live-migration to ensure network is ready

    In Neutron, commit b7c303ee0a16a05c1fdb476dc7f4c7ca623a3f58 introduced
    events sent during a live migration when the VIFs are plugged on
    destination node.

    The Linux bridge agent mechanism driver is detecting new networks on
    the destination host only when the TAP devices are created, and these
    tap devices are only created when libvirt starts the migration. As a
    result, we must actually start the migration and then slow it as we
    wait for the neutron events.

    This change ensures we wait for these events.

    Depends-On: Icb039ae2d465e3822ab07ae4f9bc405c1362afba

    Conflicts:
            nova/tests/unit/virt/libvirt/test_driver.py
            nova/virt/libvirt/driver.py

    NOTE(lyarwood): The driver.py conflict is due to additional QEMU and
    LIBVIRT version constants being present in Queens. The test_driver.py
    conflict being the result of our need to assert byte strings in Pike as
    also seen by I9b545ca8aa6dd7b41ddea2d333190c9fbed19bc1 and resolved by
    I85cd9a903fba310b5ae7bedeed118ca4ea98dff6 in Queens.

    Closes-Bug: #1414559
    Signed-off-by: Sahid Orentino Ferdjaoui <email address hidden>
    Change-Id: I407034374fe17c4795762aa32575ba72d3a46fe8
    (cherry picked from commit 8e6d5d404cf49e5b68b43c62e7f6d7db2771a1f4)
    (cherry picked from commit 224ac25cb1cca60e051cf4dc821eb549e8b36ff2)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/586965

Revision history for this message
SharonBarak (sharonbarak) wrote :

Hi, I just reproduced the very same bug on the latest Queens.
OVS log:
2018-07-26T14:15:31.743Z|01441|bridge|INFO|bridge br-int: added interface tap603f25db-55 on port 42

Last RARP:
14:15:34.025948 B fa:16:3e:d2:a6:c8 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4

OVS flows:
2018-07-26T14:15:34.773Z|01442|connmgr|INFO|br-int<->unix#1615: 50 flow_mods in the last 0 s (50 adds)
2018-07-26T14:15:39.301Z|01443|connmgr|INFO|br-int<->unix#1617: 50 flow_mods in the last 0 s (50 adds)
2018-07-26T14:15:39.602Z|01444|connmgr|INFO|br-int<->unix#1619: 1 flow_mods in the last 0 s (1 deletes)

So the last RARP falls between the OVS port creation and the flow creation.
How can we (or you) increase the number of RARPs QEMU sends out?
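
For reference, the gap can be computed from the three timestamps quoted above. This is a small sketch that assumes the capture timestamp and the OVS log timestamps come from the same clock and time zone, as the comparison in the comment implies; the helper name is made up.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"


def seconds_between(earlier: str, later: str) -> float:
    """Difference in seconds between two HH:MM:SS.fraction timestamps."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return delta.total_seconds()


# Values copied from the log excerpts in the comment above.
PORT_ADDED = "14:15:31.743"
LAST_RARP = "14:15:34.025948"
FLOWS_ADDED = "14:15:34.773"

if __name__ == "__main__":
    print("last RARP after port add:", seconds_between(PORT_ADDED, LAST_RARP))
    print("flows added after last RARP:", seconds_between(LAST_RARP, FLOWS_ADDED))
```

Under that assumption, the last RARP was sent about 2.28 s after the port was added and about 0.75 s before the first batch of flows was installed.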

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/586965
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e0f1c2cbfd46c0dad72a7f0fb0a1c7fb91fd0c6b
Submitter: Zuul
Branch: stable/pike

commit e0f1c2cbfd46c0dad72a7f0fb0a1c7fb91fd0c6b
Author: Sahid Orentino Ferdjaoui <email address hidden>
Date: Thu Aug 24 09:47:29 2017 -0400

    libvirt: add method to configure migration speed

    In this commit we are enhancing guest object to control the maximum bw
    to perform migration.

    Closes-Bug: #1783635
    Related-Bug: #1414559
    Change-Id: I35470773b8c467449ed71217fdb4b6c82f455e33
    Signed-off-by: Sahid Orentino Ferdjaoui <email address hidden>
    (cherry picked from commit 23446a9552b5be3b040278646149a0f481d0a005)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ocata)

Change abandoned by Lee Yarwood (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/559034

Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

Sharon, your issue looks similar to [0]. Could you try with this patch?

[0] https://review.openstack.org/#/c/505731/

Revision history for this message
SharonBarak (sharonbarak) wrote :

Thanks sahid, but this (processutils.execute('brctl', 'setageing', bridge, 0)) is already implemented.

In my case we run neutron in containers, and also nova and libvirt.
Question: I saw your comment in driver.py, in _get_neutron_events_for_live_migration:
 # TODO(sahid): Currently we only use the mechanism of waiting
 # for neutron events during live-migration for linux-bridge.

We use OVS, so I assume your mechanism does not affect us?

Do you know where in QEMU we can trigger more RARPs? Nothing should break if more are triggered.

Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

Oh sorry, I did not notice you were talking about Queens. No, nothing should change for OVS. About QEMU, I remember a patch [0] that gives the ability to trigger the self-announce, but it has not been accepted. Perhaps another solution has been merged.

[0] https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03296.html

Revision history for this message
SharonBarak (sharonbarak) wrote :

Hi sahid, what do you mean by "nothing should change for OVS"? Does QEMU wait for neutron events for port and rule creation on the destination before starting or speeding up the migration?

The log I attached clearly shows no sync.
Am I missing something?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.5

This issue was fixed in the openstack/nova 16.1.5 release.

Revision history for this message
Yang Li (yang-li) wrote :

I think there are two problems that break connectivity during live migration:
1. The VM is migrated to the destination and sends the RARP packets, but this happens so fast that the OpenFlow flows and the tag have not yet been configured in br-int, so the RARP packets are dropped.

2. The VM is migrated to the destination and the OpenFlow flows and tag have been configured. The VM then sends RARP packets, but the table 71 flows drop them: the high-priority flows (65, 70, 80, 95) do not match the RARP packets, and only the low-priority flow (10) matches, whose action is drop. So the packets still cannot be sent out. The table 71 flows:
 cookie=0x6c5958fed07888a7, duration=428986.816s, table=71, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=110,ct_state=+trk actions=ct_clear,resubmit(,71)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=0, n_bytes=0, idle_age=24454, priority=95,icmp6,reg5=0x2d,in_port=45,icmp_type=130 actions=resubmit(,94)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=3, n_bytes=210, idle_age=24441, priority=95,icmp6,reg5=0x2d,in_port=45,icmp_type=133 actions=resubmit(,94)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=1, n_bytes=78, idle_age=24450, priority=95,icmp6,reg5=0x2d,in_port=45,icmp_type=135 actions=resubmit(,94)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=0, n_bytes=0, idle_age=24454, priority=95,icmp6,reg5=0x2d,in_port=45,icmp_type=136 actions=resubmit(,94)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=0, n_bytes=0, idle_age=24454, priority=70,icmp6,reg5=0x2d,in_port=45,icmp_type=134 actions=resubmit(,93)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=1, n_bytes=42, idle_age=24450, priority=95,arp,reg5=0x2d,in_port=45,dl_src=fa:16:3e:b4:db:09,arp_spa=192.168.100.141 actions=resubmit(,94)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=91, n_bytes=8818, idle_age=24446, priority=65,ip,reg5=0x2d,in_port=45,dl_src=fa:16:3e:b4:db:09,nw_src=192.168.100.141 actions=ct(table=72,zone=NXM_NX_REG6[0..15])
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=2, n_bytes=678, idle_age=24450, priority=80,udp,reg5=0x2d,in_port=45,tp_src=68,tp_dst=67 actions=resubmit(,73)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=0, n_bytes=0, idle_age=24454, priority=80,udp6,reg5=0x2d,in_port=45,tp_src=546,tp_dst=547 actions=resubmit(,73)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=0, n_bytes=0, idle_age=24454, priority=70,udp,reg5=0x2d,in_port=45,tp_src=67,tp_dst=68 actions=resubmit(,93)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=0, n_bytes=0, idle_age=24454, priority=70,udp6,reg5=0x2d,in_port=45,tp_src=547,tp_dst=546 actions=resubmit(,93)
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=1, n_bytes=90, idle_age=24441, priority=65,ipv6,reg5=0x2d,in_port=45,dl_src=fa:16:3e:b4:db:09,ipv6_src=fe80::f816:3eff:feb4:db09 actions=ct(table=72,zone=NXM_NX_REG6[0..15])
 cookie=0x6c5958fed07888a7, duration=24454.927s, table=71, n_packets=1, n_bytes=90, idle_age=24450, priority=10,reg5=0x...


Revision history for this message
Yang Li (yang-li) wrote :

I used some virsh commands to avoid the first situation: limit the migration speed to 1 MiB/s, to make sure the flows and tag are configured in br-int before the live migration completes.
# virsh migrate-setspeed instance-0000003b --bandwidth 1

Then add a new RARP flow to table=71:
priority=95,ct_state=-trk,rarp,reg5=0x13,in_port=19,dl_src=fa:16:3e:09:3d:10 actions=resubmit(,94)

After that, no more packets were dropped in several live-migration tests.
# virsh migrate --live --persistent --undefinesource instance-0000003b qemu+tcp://node-3/system
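
The RARP flow in the workaround above could be generated per port with a small helper like the one below. This is a sketch only: the function name and defaults are made up, and it assumes reg5 holds the same value as the OVS port number, which matches the flows quoted in this report (reg5=0x13 with in_port=19, reg5=0x2d with in_port=45) but is not verified against the firewall driver.

```python
def rarp_allow_flow(ofport: int, mac: str, table: int = 71,
                    priority: int = 95, next_table: int = 94) -> str:
    """Build a flow spec that lets untracked RARP from `mac` on `ofport`
    continue to `next_table` instead of hitting the low-priority drop."""
    return (
        f"table={table},priority={priority},ct_state=-trk,rarp,"
        f"reg5={ofport:#x},in_port={ofport},dl_src={mac} "
        f"actions=resubmit(,{next_table})"
    )


if __name__ == "__main__":
    # Port number and MAC taken from the flow quoted in the comment above.
    print(rarp_allow_flow(19, "fa:16:3e:09:3d:10"))
```

The resulting string is in the form accepted by ovs-ofctl add-flow for br-int. Note that a flow added by hand like this may be removed when the agent next resynchronizes its flows.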
