After VM migration, tunnels not getting removed with L2Pop ON, when using multiple api_workers in controller

Bug #1443421 reported by chandrasekaran natarajan
This bug affects 6 people
Affects            Status        Importance  Assigned to      Milestone
OpenStack-Ansible  Fix Released  High        Jesse Pretorius
Trunk              Fix Released  High        Jesse Pretorius
neutron            Fix Released  High        shihanzhang

Bug Description

With multiple api_workers, after a "nova live-migration" command:
a) tunnel flows and tunnel ports are always removed from the old host;
b) other hosts sometimes do not get the notification about the port delete from the old host. On those hosts, the tunnel ports and flood flows for the old host (everything except the unicast flow for the port) still remain.
The root cause and fix are explained in comments 12 and 13.

According to the bug reporter, this bug can also be reproduced as follows.
Setup: Neutron server HA (3 nodes).
Hypervisor: ESX with OVSvApp.
L2pop is on for the network node and off for OVSvApp.

Condition:
Enable L2pop on the OVS agent and set api_workers = 10 on the controller.

On the network node, the VXLAN tunnel to ESX2 is created, but the tunnel to ESX1 is not removed after migrating the VM from ESX1 to ESX2.

Attaching the logs of servers and agent logs.

stack@OSC-NS1:/opt/stack/logs/screen$ sudo ovs-vsctl show
662d03fb-c784-498e-927c-410aa6788455
    Bridge br-ex
        Port phy-br-ex
            Interface phy-br-ex
                type: patch
                options: {peer=int-br-ex}
        Port "eth2"
            Interface "eth2"
        Port br-ex
            Interface br-ex
                type: internal
    Bridge br-tun
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port "vxlan-6447007a"
            Interface "vxlan-6447007a"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="100.71.0.41", out_key=flow, remote_ip="100.71.0.122"} <<<<<<<<<<<< This should have been deleted after MIGRATION.
        Port "vxlan-64470082"
            Interface "vxlan-64470082"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="100.71.0.41", out_key=flow, remote_ip="100.71.0.130"}
        Port br-tun
            Interface br-tun
                type: internal
        Port "vxlan-6447002a"
            Interface "vxlan-6447002a"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="100.71.0.41", out_key=flow, remote_ip="100.71.0.42"}
    Bridge "br-eth1"
        Port "br-eth1"
            Interface "br-eth1"
                type: internal
        Port "phy-br-eth1"
            Interface "phy-br-eth1"
                type: patch
                options: {peer="int-br-eth1"}
    Bridge br-int
        fail_mode: secure
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port "int-br-eth1"
            Interface "int-br-eth1"
                type: patch
                options: {peer="phy-br-eth1"}
        Port br-int
            Interface br-int
                type: internal
        Port int-br-ex
            Interface int-br-ex
                type: patch
                options: {peer=phy-br-ex}
        Port "tap9515e5b3-ec"
            tag: 11
            Interface "tap9515e5b3-ec"
                type: internal
    ovs_version: "2.0.2"
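
A note for reading the output above: the hex suffix of each vxlan port name matches its remote_ip (6447007a is 100.71.0.122). Below is a minimal sketch, assuming that naming holds for all tunnel ports, for mapping a port name back to the remote host. The function name is hypothetical and not part of neutron or OVS.

```python
import ipaddress


def vxlan_port_to_ip(port_name):
    """Decode the hex suffix of a 'vxlan-XXXXXXXX' port name into an IPv4 address."""
    prefix, _, hex_ip = port_name.partition("-")
    if prefix != "vxlan" or len(hex_ip) != 8:
        raise ValueError("unexpected port name: %r" % port_name)
    # The 8 hex digits are read as one 32-bit IPv4 address.
    return str(ipaddress.IPv4Address(int(hex_ip, 16)))


# Port names taken from the ovs-vsctl output above.
for name in ("vxlan-6447007a", "vxlan-64470082", "vxlan-6447002a"):
    print(name, "->", vxlan_port_to_ip(name))
```

This makes it easier to spot which tunnel port belongs to the old host (here 100.71.0.122) when comparing output before and after a migration.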

Revision history for this message
chandrasekaran natarajan (chandrasekaran-natarajan) wrote :
summary: - After ESX VM migration, VXLAN tunnel port is not getting removed on
- network node when api_workers value is 10 in the controller.
+ After VM migration, tunnels not getting removed with L2Pop ON, when
+ using multiple api_workers in controller
Changed in neutron:
assignee: nobody → Vivekanandan Narasimhan (vivekanandan-narasimhan)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/175383

Changed in neutron:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Kyle Mestery (<email address hidden>) on branch: master
Review: https://review.openstack.org/175383
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Assaf Muller (amuller) wrote :

FYI I've already closed four duplicates that point to this bug. Setting to high priority.

Changed in neutron:
importance: Undecided → High
Revision history for this message
Vivekanandan Narasimhan (vivekanandan-narasimhan) wrote :

Thanks Assaf for finally upgrading this bug's priority. I will rebase the review and post a new patchset.

Assaf Muller (amuller)
tags: added: l2-pop
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/175383
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
venkata anil (anil-venkata) wrote :

@vivkenandan
Can I work on this bug?

Thanks
Anil

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Anil, go for it!

Changed in openstack-ansible:
status: New → Confirmed
importance: Undecided → High
milestone: none → mitaka-2
Assaf Muller (amuller)
Changed in neutron:
assignee: Vivekanandan Narasimhan (vivekanandan-narasimhan) → venkata anil (anil-venkata)
Revision history for this message
venkata anil (anil-venkata) wrote :

@chandrasekaran.natarajan
Can you please let me know which nova command you used for the migration?
Is it "nova migrate"?

Thanks
Anil

Changed in neutron:
status: In Progress → Incomplete
Revision history for this message
venkata anil (anil-venkata) wrote :

I tried with one controller (with api_workers = 10) and two compute nodes, with l2pop on all three nodes.

With both "nova migrate" and "nova live-migration", I see l2pop working properly (i.e. tunnels are removed from the old host after migration).

Note: "nova migrate" is a two-step process, so l2pop deletes the tunnels from the previous host only after "nova resize-confirm".
http://osdir.com/ml/openstack-cloud-computing/2013-01/msg00522.html

Changed in neutron:
status: Incomplete → Invalid
Revision history for this message
venkata anil (anil-venkata) wrote :

Looks like some corner case needs to be handled, investigating further.

Changed in neutron:
status: Invalid → In Progress
Revision history for this message
venkata anil (anil-venkata) wrote :

Behaviour of l2pop on VM migration:

update_port_postcommit is invoked twice for a migrated port.

The first time, update_port_postcommit is triggered by nova's invocation of update_port(), which tells neutron that the port is being made available on the new host. The port status stays ACTIVE, since nova's update_port invocation does not alter the port status. During this first update_port_postcommit invocation, the l2pop driver uses the opportunity to save the old host information that is available in the existing port binding (aka original_host in the port context).

The second time, update_port_postcommit is triggered as part of the invocation of get_device_details by the neutron agent on the new host. This invocation attempts to bind the port, which causes the migrated port to transition to the BUILD state. The l2pop driver uses this opportunity to tear down the tunnels from the old host, using the information saved in the prior update_port_postcommit.

The update_device_down from the old host never reaches update_port_postcommit, because the update_device_down call returns early when the port is no longer bound to the old host.

What this fix does:
VM migrations trigger update_port_postcommit calls. In an environment where multiple api_workers are configured for the neutron server, such calls can land on different API workers. In that situation, the in-memory object references maintained inside the l2pop mechdriver instance of one API worker are not visible to the mechdriver instances in other API workers. As a result, VM migrations do not handle tunnel cleanup correctly, and tunnels accumulate on nodes. The fix therefore persists the l2pop migrated-port state in the DB, so that the state is visible to all api/rpc_workers.
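
To illustrate the point about per-worker state, here is a toy sketch. It is not the neutron code: the class and method names are hypothetical, and a plain dict stands in for the DB table. It shows why the second update_port_postcommit finds nothing when it lands on a different worker, unless the saved state is shared.

```python
class L2popDriverSketch:
    """Toy stand-in for the l2pop mech driver instance inside one API worker."""

    def __init__(self, store=None):
        # store=None models private in-memory state per worker;
        # a shared dict models state persisted in the DB (an assumption).
        self.migrated_ports = {} if store is None else store

    def first_postcommit(self, port_id, original_host):
        # First update_port_postcommit: remember the old host.
        self.migrated_ports[port_id] = original_host

    def second_postcommit(self, port_id):
        # Second update_port_postcommit: return the old host to clean up,
        # or None if this worker has no saved state for the port.
        return self.migrated_ports.pop(port_id, None)


def migrate_across_workers(shared):
    """Run the two postcommit calls on two different workers."""
    store = {} if shared else None
    worker_a = L2popDriverSketch(store)
    worker_b = L2popDriverSketch(store)
    worker_a.first_postcommit("port-1", "ESX1")
    return worker_b.second_postcommit("port-1")


print(migrate_across_workers(shared=False))  # None: old host is lost
print(migrate_across_workers(shared=True))   # ESX1: tunnels can be torn down
```

With a single api_worker both calls hit the same instance, which would explain why the problem only shows up with multiple workers.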

description: updated
Revision history for this message
venkata anil (anil-venkata) wrote :

Nova follows the steps below during migration
(the nova computes on the src and dest hosts call each other over RPC during migration):
1) Nova compute on the src host calls pre_live_migration on the dest host. On the dest host, pre_live_migration calls vif_driver.plug (i.e. creates the tap iface).
2) If pre_live_migration on the dest host fails, the src host calls rollback_live_migration_at_destination.
3) If pre_live_migration on the dest host succeeds,
   a) src calls live migration and then post_live_migration
   b) post_live_migration happens on the src host. In post_live_migration, the src host first calls
      i) post_live_migration_at_source, which calls unplug_vifs (the tap iface is removed from the old host),
      ii) then post_live_migration_at_dest. post_live_migration_at_dest (on the dest host) calls update_port_binding_for_instance (i.e. port_update with the new host; the tap iface was already created in pre_live_migration).

There is no rollback if a failure happens in live_migration or post_live_migration.

To test failures, I stopped the ovs-agent on dest (the agent is not alive, confirmed through agent-list, alive = XXX) and then ran the migration. The migration still showed as a success (checked through nova migration-list).
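
The call sequence above can be summarised as a small trace function. This is only a sketch of the ordering as described in this comment, not nova code. The function name and the pre_ok flag are hypothetical.

```python
def live_migration_trace(pre_ok=True):
    """Return the ordered calls for one live migration, per the steps above."""
    steps = ["dest: pre_live_migration (vif_driver.plug, tap iface created)"]
    if not pre_ok:
        # Step 2: the only failure that triggers a rollback.
        steps.append("src: rollback_live_migration_at_destination")
        return steps
    steps.extend([
        "src: live migration",
        "src: post_live_migration_at_source (unplug_vifs, tap iface removed)",
        "dest: post_live_migration_at_dest (update_port_binding_for_instance)",
    ])
    return steps


for step in live_migration_trace():
    print(step)
```

Note that the port binding update towards neutron is the last call in the sequence, after the tap iface has already been removed from the old host.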

Revision history for this message
venkata anil (anil-venkata) wrote :

Steps followed in neutron during migration:
1) The old host calls update_port_status (with status=DOWN) through update_device_down.
2) Nova calls port_update with the old and new hosts, without changing the port status.
3) The new host calls update_port_status (with status=BUILD) through get_device_details.
4) Finally, the new host calls update_port_status (with status=ACTIVE) through update_device_up.

If step 2 happens before step 1:
a) update_device_down returns early without calling update_port_status in step 1 (so l2pop is not called), and other hosts can't be notified about the port removal from the old host. The port stays in the ACTIVE state (its previous state).
b) To notify about the port removal from the old host, l2pop stores the old host in step 2.
c) In step 3, l2pop notifies the other agents about the old host using this saved info.
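
The effect of the ordering of steps 1 and 2 can be shown with a toy model. This is a sketch under simplifying assumptions, not the neutron RPC code: the class, the function names and the notification tuples are all hypothetical.

```python
class PortSketch:
    """Minimal port model: bound host, status, and notifications sent."""

    def __init__(self, host):
        self.host = host
        self.status = "ACTIVE"
        self.notifications = []


def update_device_down(port, agent_host):
    # Step 1: return early if the port is not bound to the reporting host.
    if port.host != agent_host:
        return False
    port.status = "DOWN"
    port.notifications.append(("remove", agent_host))
    return True


def nova_port_update(port, new_host):
    # Step 2: the binding host changes, the status is left untouched.
    port.host = new_host


def run(order):
    """Apply steps 1 and 2 in the given order and report the outcome."""
    port = PortSketch("old-host")
    for step in order:
        if step == 1:
            update_device_down(port, "old-host")
        else:
            nova_port_update(port, "new-host")
    return port.status, port.notifications


print(run((1, 2)))  # ('DOWN', [('remove', 'old-host')])
print(run((2, 1)))  # ('ACTIVE', [])
```

In the second ordering nothing is sent for the old host, which is the case that l2pop has to cover in steps b and c.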

Two ways to fix this issue:
1) Instead of using in-memory object references to store the old host, store it in the DB
https://review.openstack.org/175383
2) Another suggested way is to notify about the port removal in step b itself, instead of "storing it and retrieving it later" (steps b and c). I suggested this in https://review.openstack.org/#/c/215467/7

If change https://review.openstack.org/#/c/215467/7 is modified according to the suggestions, then there is no need for https://review.openstack.org/175383 (we can abandon it), and this bug can be closed.
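
A rough sketch of what option 2 means, assuming the handler can see both the original and the new binding host. The function name and the notify callback are hypothetical stand-ins and are not the code in either review.

```python
def port_update_postcommit(original_host, new_host, notify):
    """Option 2: notify as soon as the binding host changes, store nothing."""
    if original_host and original_host != new_host:
        # notify stands in for telling the other agents that the port
        # has left original_host.
        notify(original_host)
        return True
    return False


sent = []
port_update_postcommit("old-host", "new-host", sent.append)
print(sent)  # ['old-host']
```

Since nothing is kept between calls, it no longer matters which API worker handles the later update_port_postcommit.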

Changed in neutron:
assignee: venkata anil (anil-venkata) → shihanzhang (shihanzhang)
no longer affects: openstack-ansible/liberty
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/215467
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c5fa665de3173f3ad82cc3e7624b5968bc52c08d
Submitter: Jenkins
Branch: master

commit c5fa665de3173f3ad82cc3e7624b5968bc52c08d
Author: shihanzhang <email address hidden>
Date: Fri Aug 21 09:51:59 2015 +0800

    ML2: update port's status to DOWN if its binding info has changed

    This fixes the problem that when two or more ports in a network
    are migrated to a host that did not previously have any ports in
    the same network, the new host is sometimes not told about the
    IP/MAC addresses of all the other ports in the network. In other
    words, initial L2population does not work, for the new host.

    This is because the l2pop mechanism driver only sends catch-up
    information to the host when it thinks it is dealing with the first
    active port on that host; and currently, when multiple ports are
    migrated to a new host, there is always more than one active port so
    the condition above is never triggered.

    The fix is for the ML2 plugin to set a port's status to DOWN when
    its binding info changes.

    This patch also fixes the bug when nova thinks it should not wait
    for any events from neutron because all ports are already active.

    Closes-bug: #1483601
    Closes-bug: #1443421
    Closes-Bug: #1522824
    Related-Bug: #1450604

    Change-Id: I342ad910360b21085316c25df2154854fd1001b2
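
For readers of this thread, the condition described in the commit message can be modelled roughly as below. This is a toy sketch, not the merged patch: ports are plain dicts, the function names and the set_down_on_rebind flag are hypothetical, and the intermediate BUILD state is left out.

```python
def active_ports_on_host(ports, host):
    """Count ACTIVE ports bound to the given host."""
    return sum(1 for p in ports if p["host"] == host and p["status"] == "ACTIVE")


def migrate(ports, names, new_host, set_down_on_rebind):
    """Rebind the named ports; optionally set them DOWN, as the fix does."""
    for port in ports:
        if port["name"] in names:
            port["host"] = new_host
            if set_down_on_rebind:
                port["status"] = "DOWN"


def port_up(ports, name):
    """Mark a port ACTIVE; return True if catch-up info would be sent."""
    port = next(p for p in ports if p["name"] == name)
    port["status"] = "ACTIVE"
    # Catch-up is only sent for the first active port on the host.
    return active_ports_on_host(ports, port["host"]) == 1


def catch_up_sent(set_down_on_rebind):
    ports = [
        {"name": "p1", "host": "old", "status": "ACTIVE"},
        {"name": "p2", "host": "old", "status": "ACTIVE"},
    ]
    migrate(ports, {"p1", "p2"}, "new", set_down_on_rebind)
    return [port_up(ports, "p1"), port_up(ports, "p2")]


print(catch_up_sent(set_down_on_rebind=False))  # [False, False]
print(catch_up_sent(set_down_on_rebind=True))   # [True, False]
```

Without the status change both migrated ports already count as active on the new host, so the first-port condition never fires.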

Changed in neutron:
status: In Progress → Fix Released
tags: added: kilo-backport-potential liberty-backport-potential
Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

As this patch has been included in Mitaka, I'm marking it as resolved for the OpenStack-Ansible 13.0.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/300539

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/300559

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/300539
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a38cb93dde1633005e9e66e6b7ecec9e726304bb
Submitter: Jenkins
Branch: stable/liberty

commit a38cb93dde1633005e9e66e6b7ecec9e726304bb
Author: venkata anil <email address hidden>
Date: Fri Apr 1 14:52:01 2016 +0000

    ML2: update port's status to DOWN if its binding info has changed

    This fixes the problem that when two or more ports in a network
    are migrated to a host that did not previously have any ports in
    the same network, the new host is sometimes not told about the
    IP/MAC addresses of all the other ports in the network. In other
    words, initial L2population does not work, for the new host.

    This is because the l2pop mechanism driver only sends catch-up
    information to the host when it thinks it is dealing with the first
    active port on that host; and currently, when multiple ports are
    migrated to a new host, there is always more than one active port so
    the condition above is never triggered.

    The fix is for the ML2 plugin to set a port's status to DOWN when
    its binding info changes.

    This patch also fixes the bug when nova thinks it should not wait
    for any events from neutron because all ports are already active.

    Closes-bug: #1483601
    Closes-bug: #1443421
    Closes-Bug: #1522824
    Related-Bug: #1450604
    (cherry picked from commit c5fa665de3173f3ad82cc3e7624b5968bc52c08d)

    Conflicts: neutron/plugins/ml2/drivers/l2pop/mech_driver.py

    Change-Id: I342ad910360b21085316c25df2154854fd1001b2

tags: added: in-stable-liberty
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/306300

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/kilo)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/300559

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Dave Walker (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/306300
Reason:
stable/kilo closed for 2015.1.4

This release is now pending its final release and no freeze exception has
been seen for this changeset. Therefore, I am now abandoning this change.

If this is not correct, please urgently raise a thread on openstack-dev.

More details at: https://wiki.openstack.org/wiki/StableBranch

Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 7.1.0

This issue was fixed in the openstack/neutron 7.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/175383
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

tags: removed: kilo-backport-potential liberty-backport-potential