HA routers interact badly with l2pop

Bug #1365476 reported by Assaf Muller on 2014-09-04
This bug affects 15 people
Affects   Status   Importance   Assigned to     Milestone
neutron            High         Mike Kolesnik
Kilo               Undecided    Unassigned

Bug Description

Internal HA router interfaces are created on more than one agent. This interacts badly with l2pop, which assumes that a Neutron port is located in exactly one place in the network. We'll need to notify l2pop when an HA router transitions to the active state, so that the port location is updated.
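
The mismatch described above can be illustrated with a toy model. This is only a sketch, not Neutron code: `FdbTable`, `port_up`, `lookup`, the MAC address, and the VTEP addresses are all hypothetical names and values chosen for illustration. It assumes a forwarding table that keeps exactly one location per port MAC, so whichever agent reports last wins, regardless of which router instance is actually active.

```python
from dataclasses import dataclass, field


@dataclass
class FdbTable:
    """Toy forwarding table: one tunnel endpoint (VTEP IP) per port MAC."""
    entries: dict = field(default_factory=dict)

    def port_up(self, mac, vtep_ip):
        # Single-location assumption: a new report overwrites the old one.
        self.entries[mac] = vtep_ip

    def lookup(self, mac):
        return self.entries.get(mac)


fdb = FdbTable()
router_mac = "fa:16:3e:00:00:01"   # hypothetical HA router interface MAC

# The same HA router interface exists on two network nodes.
fdb.port_up(router_mac, "192.0.2.11")   # node 1, keepalived master
fdb.port_up(router_mac, "192.0.2.12")   # node 2, keepalived backup, reports last

active_vtep = "192.0.2.11"
# True when traffic for the router would be tunnelled to the wrong node.
stale = fdb.lookup(router_mac) != active_vtep
print("fdb points at standby:", stale)
```

In this model the table ends up pointing at the backup node even though node 1 still holds the VIP, which matches the symptom reported in the comments on this bug.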

Patch is here:
https://review.openstack.org/#/c/141114/

Assaf Muller (amuller) on 2014-09-04
tags: added: l2-pop
Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
Changed in neutron:
assignee: nobody → Sylvain Afchain (sylvain-afchain)
Changed in neutron:
importance: Medium → High
Changed in neutron:
status: Confirmed → In Progress

Yes I saw that Mathieu :)

I'm working on a fix that is close to the DVR approach for the binding. That means a lot of redundant code, so I'll rebase it on top of the race condition fix, and we will see how to handle the refactoring part.

In the end, I have been unable to reproduce the issue. With the ARP responder set to False, and since regular MAC learning still has a higher priority than the rule pushed by l2pop, I do not see any issue here. Tested with DVR/L2Pop.

In my previous comment I wrote something about the ARP response, but it is not relevant here :)

Phil Hopkins (phil-hopkins-a) wrote :

My environment is Ubuntu 14.04, with a source install of OpenStack Juno. I am running Linuxbridge with VXLAN tunnels.

There is clearly a problem: HA router keepalived changes are not reflected in the fdbs on the various nodes. If the l3 node that keepalived shows as master fails and another l3 node becomes the new master, the fdb tables are never updated on the various nodes, causing a communication failure for the VMs.

Also, if the master network node (as decided by the keepalived processes) is rebooted, keepalived detects the failure and moves the VIP to a new node. Once the rebooted node comes back online and reports that it is healthy, neutron sets up the router namespace and the keepalived and conntrackd processes on this node. l2pop also sees the new node and sets the fdbs on all nodes to point to the node that just came alive. This causes the VMs to lose communication.

Li Ma (nick-ma-z) wrote :

Well, I think I also hit this problem. I'm using the same environment, Linuxbridge+VXLAN.

Changed in neutron:
status: In Progress → Confirmed
Mike Smith (michael-smith6) wrote :

I am currently hitting this issue consistently in my setup. I have a multi-node devstack setup with 1 controller, 2 network nodes, and 2 compute nodes. In my script I set up 1 router (HA), 2 networks (one subnet each), and 2 VMs (one on each subnet).

Pings to/from nova VMs fail because the packets are directed to the passive router instance instead of the active router instance. I assume this is from L2 pop since the router has two ports associated with it.

Once I turn off L2 pop on all nodes, my script (and pings) work fine.

Mike Smith (michael-smith6) wrote :

Sorry - to add to my setup above I am running with vxlan as well.

david martin (dmartls1) on 2014-11-25
Changed in neutron:
status: Confirmed → In Progress
david martin (dmartls1) wrote :

Sorry about that, I did not intend to change the status.

Changed in neutron:
status: In Progress → Confirmed

So, the current issue is not due to MAC learning but to the tunnels not being created to the correct host. It seems that the best/simplest way to fix this is to allow a port to be bound to several hosts.

Changed in neutron:
status: Confirmed → In Progress
Assaf Muller (amuller) wrote :

I know that Robert Kukura wanted to generalize this concept so that all ports could be bound to multiple hosts. This way distributed ports and single-bind ports are just sub-cases of a generalized concept.

Yes, I had a discussion with Robert about the refactoring work, especially the DB schema, multiple port bindings, and how we could leverage them.

Here is the refactoring bug https://bugs.launchpad.net/neutron/+bug/1367391

Mike Kolesnik (mkolesni) on 2014-12-10
Changed in neutron:
assignee: Sylvain Afchain (sylvain-afchain) → Mike Kolesnik (mkolesni)
Assaf Muller (amuller) wrote :

This could be backported, depending on how we end up solving it. If we do backport the fix, we could backport the fix for https://bugs.launchpad.net/neutron/+bug/1397209 as well.

tags: added: juno-backport-potential

Let's split the bug and dedicate this one to the OVS implementation.

The corresponding Linuxbridge bug is here:
https://bugs.launchpad.net/neutron/+bug/1411752

summary: - HA routers interact badly with l2pop
+ OVS : HA routers interact badly with l2pop
Kyle Mestery (mestery) on 2015-03-31
Changed in neutron:
milestone: none → liberty-1
Kyle Mestery (mestery) on 2015-03-31
Changed in neutron:
milestone: liberty-1 → kilo-rc1
Kyle Mestery (mestery) wrote :

After discussion with Carl, moving this one to Liberty. We can backport it to Kilo once it merges in Liberty.

Changed in neutron:
milestone: kilo-rc1 → liberty-1
Changed in neutron:
assignee: Mike Kolesnik (mkolesni) → Assaf Muller (amuller)
Changed in neutron:
assignee: Assaf Muller (amuller) → Mike Kolesnik (mkolesni)
Changed in neutron:
assignee: Mike Kolesnik (mkolesni) → Assaf Muller (amuller)
Assaf Muller (amuller) on 2015-04-13
tags: removed: juno-backport-potential
Assaf Muller (amuller) on 2015-06-02
summary: - OVS : HA routers interact badly with l2pop
+ HA routers interact badly with l2pop

Change abandoned by Kyle Mestery (<email address hidden>) on branch: master
Review: https://review.openstack.org/141114
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in neutron:
assignee: Assaf Muller (amuller) → Mike Kolesnik (mkolesni)
Thierry Carrez (ttx) on 2015-06-23
Changed in neutron:
milestone: liberty-1 → liberty-2
Assaf Muller (amuller) on 2015-06-25
Changed in neutron:
milestone: liberty-2 → none
Changed in neutron:
assignee: Mike Kolesnik (mkolesni) → Assaf Muller (amuller)
Changed in neutron:
assignee: Assaf Muller (amuller) → Mike Kolesnik (mkolesni)
Changed in neutron:
assignee: Mike Kolesnik (mkolesni) → Assaf Muller (amuller)
Changed in neutron:
assignee: Assaf Muller (amuller) → Mike Kolesnik (mkolesni)
Assaf Muller (amuller) on 2015-07-28
Changed in neutron:
milestone: none → liberty-3
description: updated

Reviewed: https://review.openstack.org/141114
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=91d3a0219a43a2c06ea4043cc9eeb518815df391
Submitter: Jenkins
Branch: master

commit 91d3a0219a43a2c06ea4043cc9eeb518815df391
Author: Mike Kolesnik <email address hidden>
Date: Tue Apr 7 19:54:09 2015 -0400

    Update port bindings for master router

    An HA port needs to point to the correct host (where the master router
    is running) in order for L2Population to work.

    Hence, this patch introduces two fixes:
    * When a port owned by an HA router is up we make sure it points to the
      right node where the master is running, or a random node if there is
      no master yet (This corner case is fixed by the 2nd bullet point).

    * When a L3 agent reports it's hosting a master, we need to update the
      port binding to the host the master is now running on. This fixes
      both routers with no elected master (Yet) and failovers.

    This patch also changes the L3 HA failover test to use l2pop.
    Note that the test does not pass when using l2pop without this patch.

    Closes-Bug: #1365476
    Co-Authored-By: Assaf Muller <email address hidden>
    Change-Id: I8475548947526d8ea736ed7aa754fd0ca475cae2
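
The two rules in the commit message above can be sketched as a small in-memory model. This is a hedged illustration under stated assumptions, not the code from the patch: `HaBindingModel`, `port_up`, `agent_reports_master`, and the host, router, and port names are hypothetical and do not correspond to Neutron's actual classes or RPC handlers.

```python
import random


class HaBindingModel:
    """Toy model of binding HA router ports to the host running the master."""

    def __init__(self, hosts_by_router):
        self.hosts_by_router = hosts_by_router  # router_id -> candidate hosts
        self.master = {}    # router_id -> host currently reported as master
        self.ports = {}     # port_id -> router_id
        self.binding = {}   # port_id -> host the port is bound to

    def port_up(self, port_id, router_id):
        # Rule 1: bind to the master's host, or to a random candidate host
        # if no master has been reported yet.
        self.ports[port_id] = router_id
        host = self.master.get(router_id)
        if host is None:
            host = random.choice(self.hosts_by_router[router_id])
        self.binding[port_id] = host

    def agent_reports_master(self, router_id, host):
        # Rule 2: when an agent reports that it hosts the master, rebind
        # every port of that router to the reporting host.
        self.master[router_id] = host
        for port_id, owner in self.ports.items():
            if owner == router_id:
                self.binding[port_id] = host


model = HaBindingModel({"router-1": ["net-node-1", "net-node-2"]})
model.port_up("port-a", "router-1")                     # no master yet
initial_host = model.binding["port-a"]
model.agent_reports_master("router-1", "net-node-2")    # master elected
after_election = model.binding["port-a"]
model.agent_reports_master("router-1", "net-node-1")    # failover
after_failover = model.binding["port-a"]
print(initial_host, after_election, after_failover)
```

The random initial choice may point at the wrong host, but in this sketch it is corrected as soon as any agent reports a master, which is the corner case the second bullet point is said to cover.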

Changed in neutron:
status: In Progress → Fix Committed
Assaf Muller (amuller) wrote :

@Liu: I removed the 'known issues' section. For what it's worth, anyone can edit that page.

Reviewed: https://review.openstack.org/211492
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a7b91632fc65ab9d2687298c68b1d715866d0356
Submitter: Jenkins
Branch: feature/pecan

commit 966203f89dee8fe61fb2dce654e36e510e80380f
Author: Sukhdev Kapur <email address hidden>
Date: Wed Jul 1 16:30:44 2015 -0700

    Neutron-Ironic integration patch

    This patch is in preparation for the integration
    of Ironic and Neutron. A new vnic_type is being
    added so that ML2 drivers can filter for all
    Ironic ports based upon match for 'baremetal'.
    Nova/Ironic will set this vnic_type when issuing
    port-create request to neutron.
    (e.g. binding:vnic_type = 'baremetal' )

    Change-Id: I25dc9472b31db052719db503a10c1fb1a55572ef
    Partial-Implements: blueprint neutron-ironic-integration

commit 236e408272bcb9b8e957524864e571b5afdc4623
Author: Oleg Bondarev <email address hidden>
Date: Tue Jul 7 12:02:58 2015 +0300

    DVR: fix router scheduling

    Fix scheduling of DVR routers to not stop scheduling once
    csnat portion was scheduled. See bug report for failing
    scenario.

    This partially reverts
    commit 3794b4a83e68041e24b715135f0ccf09a5631178
    and fixes bug 1374473 by moving csnat scheduling
    after general dvr router scheduling, so double binding does
    not happen.

    Closes-Bug: #1472163
    Related-Bug: #1374473
    Change-Id: I57c06e2be732e47b6cce7c724f6b255ea2d8fa32

commit e152f93878b9bb6af7cfedc9e045892fcf7d0615
Author: Assaf Muller <email address hidden>
Date: Sat Aug 8 21:15:03 2015 +0300

    TESTING.rst love

    Change-Id: I64b569048f8f87ea2fe63d861302b4020d36493d

commit 633c52cca1b383af2c900e1663c8682114acd177
Author: sridhargaddam <email address hidden>
Date: Wed Aug 5 10:49:33 2015 +0000

    Avoid dhcp_release for ipv6 addresses

    dhcp_release is only supported for IPv4 addresses [1] and not for
    IPv6 addresses [2]. There will be no effect when it is called with
    IPv6 address. This patch adds a corresponding note and avoids calling
    dhcp_release for IPv6 addresses.

    [1] http://manpages.ubuntu.com/manpages/trusty/man1/dhcp_release.1.html
    [2] http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2013q2/007084.html

    Change-Id: I8b8316c9d3d011c2a687a3a1e2a4da5cf1b5d604

commit 2de8fad17402f38bbc30204ee2f4f99cf21cb69d
Author: OpenStack Proposal Bot <email address hidden>
Date: Mon Aug 10 06:11:06 2015 +0000

    Imported Translations from Transifex

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I2b423e83a7d0ac8b23239f81fe33dd8382c6fff6

commit fef79dc7b9162e03c8891645494c115b52d4d014
Author: Henry Gessau <email address hidden>
Date: Mon Aug 3 23:30:34 2015 -0400

    Consistent layout and headings for devref

    The lack of convention for heading levels among the independently
    written devref documents was starting to make the Table of Contents
    look rather messy when rendered in HTML.

    This patch does not cover the "Neutron Internals" section since its
    layo...

tags: added: in-feature-pecan
Adolfo Duarte (adolfo-duarte) wrote :

Tested with a three-node setup by failing the HA network between the two snat nodes (ifconfig tap--xxxxx down).

I see the standby node take over and program the correct IP addresses; however, the ha_router_agent_port_bindings db is not updated, i.e. the old snat still says inactive, while the new snat says standby.
Therefore the "switch" in the datapath does not happen and the VMs cannot ping the new gateway (on the new snat).

Am I testing this incorrectly? Is doing ifconfig .... down not enough to trigger a failover in the database?

Adolfo Duarte (adolfo-duarte) wrote :

There appears to be no issue. I was not waiting long enough for failover to register. Everything seems to work as expected.

Mike Kolesnik (mkolesni) wrote :

Yes it could take a while due to the need to wait for the server to be notified who's actually master.
Thanks for testing this!

Reviewed: https://review.openstack.org/211166
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7c2727c4cdb791cc259b86c3dec128cfdb8cbd18
Submitter: Jenkins
Branch: stable/kilo

commit 7c2727c4cdb791cc259b86c3dec128cfdb8cbd18
Author: Mike Kolesnik <email address hidden>
Date: Tue Apr 7 19:54:09 2015 -0400

    Update port bindings for master router

    An HA port needs to point to the correct host (where the master router
    is running) in order for L2Population to work.

    Hence, this patch introduces two fixes:
    * When a port owned by an HA router is up we make sure it points to the
      right node where the master is running, or a random node if there is
      no master yet (This corner case is fixed by the 2nd bullet point).

    * When a L3 agent reports it's hosting a master, we need to update the
      port binding to the host the master is now running on. This fixes
      both routers with no elected master (Yet) and failovers.

    (cherry picked from commit 91d3a0219a43a2c06ea4043cc9eeb518815df391)

    Conflicts:
     neutron/api/rpc/handlers/l3_rpc.py

    Closes-Bug: #1365476
    Co-Authored-By: Assaf Muller <email address hidden>
    Change-Id: I8475548947526d8ea736ed7aa754fd0ca475cae2

tags: added: in-stable-kilo
Thierry Carrez (ttx) on 2015-09-03
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2015-10-15
Changed in neutron:
milestone: liberty-3 → 7.0.0