L3 HA integration with l2pop assumes control plane is operational for fail over

Bug #1522980 reported by Assaf Muller
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Carl Baldwin
Milestone: (none shown)

Bug Description

Note: This is a soft requirement for DVR + L3 HA.

L3 HA did not work with l2pop at all, and that was fixed here:
https://bugs.launchpad.net/neutron/+bug/1365476 via https://review.openstack.org/#/c/141114/.

However, the solution is suboptimal because it assumes the control plane is operational for fail over to work correctly.
Without l2pop, L3 HA can fail over successfully even if the database, messaging server, neutron-server and destination L3 agent are all dead. With l2pop, all four are needed. This is because, for fail over to work, the destination L3 agent must notice that a router has transitioned to master and notify neutron-server via RPC. At that point neutron-server updates the 'binding:host' value of all of the router's internal ports to point to the target node, and the l2pop code is executed in order to update the L2 agents.

Instead, I'd like fail over to rely solely on the data plane, regardless of whether l2pop is on or off. One such solution would be something similar to patch set 9 of the patch: https://review.openstack.org/#/c/141114/9//COMMIT_MSG. The idea is to tell l2pop to treat HA router ports as replicated ports (which they are), so that tunnel endpoints would be created against all nodes that host replicas of the router, and the destination MAC address of the port would not be learned via l2pop, but via the regular fallback MAC learning mechanism. This means that we lose some of the advantage of l2pop, but I think it is essential to the correct operation of L3 HA.
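
To illustrate why flood-and-learn lets fail over proceed without the control plane, here is a minimal toy model in Python. It is only a sketch: the LearningBridge class, the node names and the MAC address are hypothetical, and it does not reflect how br-tun flows are actually programmed. Tunnels to every node hosting a router replica exist up front, and the router MAC's location is learned purely from observed traffic (for example a gratuitous ARP from the new master).

```python
from typing import Dict, List, Set


class LearningBridge:
    """Toy model of flood-and-learn forwarding (hypothetical, not neutron code)."""

    def __init__(self, tunnel_peers: Set[str]) -> None:
        # Peers with pre-created tunnels: every node hosting a router replica.
        self.tunnel_peers = set(tunnel_peers)
        # MAC -> peer, filled in only by observing data-plane traffic.
        self.learned: Dict[str, str] = {}

    def receive(self, src_mac: str, from_peer: str) -> None:
        """Learn (or re-learn) which peer a source MAC currently lives behind."""
        self.learned[src_mac] = from_peer

    def forward(self, dst_mac: str) -> List[str]:
        """Return the peers a frame addressed to dst_mac is sent to."""
        if dst_mac in self.learned:
            return [self.learned[dst_mac]]
        # Unknown destination: flood to all replicas.
        return sorted(self.tunnel_peers)


ROUTER_MAC = "fa:16:3e:00:00:01"  # hypothetical router port MAC
bridge = LearningBridge({"net-node-1", "net-node-2"})

before_learning = bridge.forward(ROUTER_MAC)   # nothing learned yet: flood
bridge.receive(ROUTER_MAC, "net-node-1")       # traffic from the current master
while_node1_master = bridge.forward(ROUTER_MAC)
bridge.receive(ROUTER_MAC, "net-node-2")       # new master announces itself
after_failover = bridge.forward(ROUTER_MAC)
```

No step in this model involves the database, the messaging server or neutron-server, which is the property the proposal is after.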

Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/255237

Changed in neutron:
status: New → In Progress
Revision history for this message
venkata anil (anil-venkata) wrote :

For HA router ports, only flood flows are notified; no unicast addresses are notified.

When flood flows are notified, agents create tunnel ports to the HA router agents
and add flood flows for each HA router agent (for that network) in br-tun.
ARP reply entries for HA router ports are not added in br-tun, because
unicast addresses are not notified.

1) All agents hosting HA routers create tunnel ports (and flood flows) to all agents hosting the network.
Agents hosting HA routers also create tunnels among themselves if the step above has not already created them.
(This happens when an agent hosting the HA router has no other port in the network.)

2) All agents are notified about the HA router agent. In addition,
these agents are notified about the other HA router agents
if those HA router agents are not already hosting another port in that network.
The intention is that other agents with ports on the same network create tunnel ports to
all HA router agents (if no tunnel ports are already present).

Note: The "ml2_port_bindings" table shows the HA router port as bound to only one agent.
We will use the "ha_router_agent_port_bindings" and "routerports" tables to get all agents
hosting the HA router. Some of these HA router agents might already have tunnel flows
to other agents (as they may be hosting other ports on the network),
so we skip them and consider only HA router agents that host nothing but the HA router port.
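
The selection described in the note can be sketched roughly as below. This is an illustration under assumptions, not the code in the review: ha_only_hosts, flood_only_fdb, the host names and the tunnel IPs are all hypothetical, and FLOODING_ENTRY is modeled on l2pop's all-zero MAC/IP flooding entry.

```python
from typing import Dict, List, Set, Tuple

# Modeled on l2pop's flooding entry: an all-zero MAC/IP pair meaning
# "send broadcast and unknown-unicast traffic to this tunnel endpoint".
FLOODING_ENTRY: Tuple[str, str] = ("00:00:00:00:00:00", "0.0.0.0")


def ha_only_hosts(ha_router_hosts: Set[str],
                  hosts_with_other_ports: Set[str]) -> Set[str]:
    """Hosts that carry the HA router port and no other port on the network.

    Hosts that already have another port on the network are skipped,
    since tunnels and flood flows to them already exist.
    """
    return ha_router_hosts - hosts_with_other_ports


def flood_only_fdb(hosts: Set[str],
                   tunnel_ips: Dict[str, str]) -> Dict[str, List[Tuple[str, str]]]:
    """Build per-endpoint entries holding only the flooding entry.

    No unicast (MAC, IP) pair is included, so no ARP responder or
    unicast flow would be installed for the HA router port.
    """
    return {tunnel_ips[host]: [FLOODING_ENTRY] for host in sorted(hosts)}


# Hypothetical deployment: three network nodes host the HA router;
# net-1 also hosts another port (e.g. DHCP) on the same network.
ha_hosts = {"net-1", "net-2", "net-3"}
other_port_hosts = {"compute-1", "net-1"}
tunnel_ips = {
    "net-1": "192.0.2.11",
    "net-2": "192.0.2.12",
    "net-3": "192.0.2.13",
    "compute-1": "192.0.2.21",
}

extra_hosts = ha_only_hosts(ha_hosts, other_port_hosts)
fdb = flood_only_fdb(extra_hosts, tunnel_ips)
```

Here only net-2 and net-3 need to be advertised, each with nothing but the flooding entry.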

Assaf Muller (amuller)
description: updated
Revision history for this message
venkata anil (anil-venkata) wrote :

Change https://review.openstack.org/#/c/255237/ supports adding flood flow entries between slave nodes and other agents.

The l2pop driver is notified about an HA router port only when the port is bound on the master node (and also when a slave transitions to master).

But in the following scenarios, l2pop for HA router ports can fail.

1) When the l2 agent on a slave node is restarted, the slave l2 agent deletes its previous flood entries. But the plugin won't notify the l2pop driver about port up on the slave agent, so the l2pop driver can't notify the slave agent about network ports. Thus the flood entries are lost on the slave node.

2) When an HA router is added to a new (slave) agent, the plugin won't notify the l2pop driver about port up on the slave agent, so the l2pop driver can't process the HA port for this agent. Thus flood entries can't be created between this slave node and other agents.

3) When an HA router is removed from an existing slave agent, the plugin won't notify the l2pop driver about port down on the slave agent, so the l2pop driver can't notify other agents about the HA router's removal from this agent. Thus other agents still have flood entries to this agent.

To fix these issues, the plugin should be updated to notify the l2pop driver of port up/down on slave agents (as well as on the master).
This will be taken care of in change https://review.openstack.org/#/c/282874/.
Change https://review.openstack.org/#/c/282874/ will also update the l2pop driver to process HA router port up/down notifications from the plugin.

Revision history for this message
venkata anil (anil-venkata) wrote :

Change https://review.openstack.org/#/c/255237/ sets up flood flows between slave agents and other agents.

But when the l2 agent on a slave node is restarted:
1) Existing flows on the slave (created by change 255237) are deleted.
2) The l2 agent then calls update_device_up for the router interface port.
3) update_device_up won't call update_port_status, because
   port_bound_to_host fails (the port is bound to the master host, not this slave host).
4) Hence the l2pop driver is not called for the router interface port,
   so the l2pop driver can't notify the slave node about network ports.
5) Because of this, the slave node has no flood entries to other agents.
If neutron-server is down at this time and this slave becomes master
at the same time, the new master router can't communicate with
VMs through the router interface port.
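
The check in step 3, and the kind of relaxation needed to fix it, can be sketched as follows. This is a simplified illustration, not neutron's implementation: should_notify_l2pop, its parameters and the host names are hypothetical.

```python
from typing import Set


def should_notify_l2pop(agent_host: str,
                        bound_host: str,
                        ha_hosts: Set[str],
                        include_backups: bool) -> bool:
    """Decide whether a device-up report from agent_host reaches l2pop.

    bound_host is the single host recorded in the port binding (the master).
    ha_hosts is every host running an instance of the HA router.
    include_backups=False models the behaviour described above;
    include_backups=True models also accepting reports from slave hosts.
    """
    if agent_host == bound_host:
        return True
    return include_backups and agent_host in ha_hosts


ha_hosts = {"net-1", "net-2"}

# Behaviour described above: the restarted slave (net-2) is ignored.
slave_ignored = should_notify_l2pop("net-2", "net-1", ha_hosts, include_backups=False)

# With backups included, the slave's report is processed as well.
slave_notified = should_notify_l2pop("net-2", "net-1", ha_hosts, include_backups=True)

# A host that runs no instance of the router is rejected either way.
stranger = should_notify_l2pop("compute-9", "net-1", ha_hosts, include_backups=True)
```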

description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/323993

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/324302

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/323993
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=77bfd82c3cf766724b12629cf23902a1758fe94a
Submitter: Jenkins
Branch: master

commit 77bfd82c3cf766724b12629cf23902a1758fe94a
Author: venkata anil <email address hidden>
Date: Wed Jun 1 15:42:38 2016 +0000

    Rename ml2_dvr_port_bindings to make it generic

    Distributed port binding needs to be implemented for HA router ports
    to fix bug 1522980. HA ports can use the existing DVR implementation for
    multiple port binding. So we have to make the current DVR port binding
    implementation generic, so that all distributed ports (like DVR, HA)
    can use it.

    As part of making it generic, we rename the 'ml2_dvr_port_bindings' table
    to 'ml2_distributed_port_bindings', so that all distributed ports
    (DVR, HA ..) can use this table.

    Partial-Bug: #1595043
    Partial-Bug: #1522980
    Change-Id: I24650b7dee6305f801b457c4f21c8b16fb0eb6e0

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/282874
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

ping?

Changed in neutron:
status: In Progress → Incomplete
assignee: venkata anil (anil-venkata) → nobody
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Needs a new owner?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/339982

Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
status: Incomplete → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/340031

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/339982
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=89cd4d07d173d54f05aee1524a6376226ac1bd80
Submitter: Jenkins
Branch: master

commit 89cd4d07d173d54f05aee1524a6376226ac1bd80
Author: venkata anil <email address hidden>
Date: Sat Jul 9 08:20:03 2016 +0000

    Rename dvr portbinding functions

    As part of making DVR portbinding implementation generic, we rename
    dvr portbinding functions as distributed portbinding functions.
    In next patch we make dvr logic for port binding generic,
    to be useful for all distributed router ports(for example, HA).

    Partial-Bug: #1595043
    Partial-Bug: #1522980
    Change-Id: I402df76c64299156d4ed48ac92ede1e8e9f28f23

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by venkata anil (<email address hidden>) on branch: master
Review: https://review.openstack.org/340031
Reason: Preferring patch 255237 over this as backporting alembic migration may not be allowed

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by venkata anil (<email address hidden>) on branch: master
Review: https://review.openstack.org/324302
Reason: Preferring patch 255237 over this as backporting alembic migration may not be allowed

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by venkata anil (<email address hidden>) on branch: master
Review: https://review.openstack.org/323314
Reason: Preferring patch 255237 over this
1) Backporting alembic migration may not be allowed
2) To avoid special handling for TOR drivers

Note: We may revisit these patches (340031, 324302, 323314) later to solve 1488015.

Revision history for this message
venkata anil (anil-venkata) wrote :

Change https://review.openstack.org/#/c/255237/ is the complete fix for this bug.

description: updated
Assaf Muller (amuller)
Changed in neutron:
milestone: none → newton-3
importance: Medium → High
Changed in neutron:
milestone: newton-3 → newton-rc1
Changed in neutron:
assignee: venkata anil (anil-venkata) → Carl Baldwin (carl-baldwin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/255237
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=26d8702b9d7cc5a4293b97bc435fa85983be9f01
Submitter: Jenkins
Branch: master

commit 26d8702b9d7cc5a4293b97bc435fa85983be9f01
Author: venkata anil <email address hidden>
Date: Thu Aug 4 07:14:47 2016 +0000

    l2pop fdb flows for HA router ports

    This patch makes L3 HA failover independent of neutron components
    (during failover).

    All HA agents (active and backup) call update_device_up/down after wiring
    the ports. But the l2pop driver is called only for the active agent, as
    the port binding in the DB reflects the active agent. l2pop then creates
    unicast and multicast flows for the active agent.
    On failover, flows to the new active agent are created. For this to
    happen, the database, messaging server, neutron-server and destination
    L3 agent must all be active during failover. This creates two issues:
    1) When any of the above resources (e.g. neutron-server, ..) is dead,
       flows between the new master and other agents won't be created and
       L3 HA failover does not work. In the same scenario, L3 HA failover
       works if l2pop is disabled.
    2) Packet loss during failover is higher, as the above neutron resources
       interact multiple times and so take time to create the l2 flows.

    In this change, we allow the plugin to notify l2pop when
    update_device_up/down is called by backup agents as well. l2pop then
    creates flood flows to all HA agents (both active and slave). l2pop won't
    create a unicast flow for this port; instead, the unicast flow is created
    by the learning action of table 10 when keepalived sends a GARP after
    assigning the IP address to the master router's qr-xx port. As flood flows
    are already created and the unicast flow is added dynamically, L3 HA
    failover does not depend on l2pop.

    This solves two issues:
    1) With L3 HA + l2pop, failover will work even if any of the above agents
       or processes is dead.
    2) Failover time is reduced, as we do not depend on neutron to create
       flows during failover.
    We use the L3HARouterAgentPortBinding table to get all HA agents of a
    router port. The HA router port on a slave agent is also considered for
    l2pop's distributed_active_network_ports and agent_network_active_port_count.

    Closes-bug: #1522980
    Closes-bug: #1602614
    Change-Id: Ie1f5289390b3ff3f7f3ed7ffc8f6a8258ee8662e

Changed in neutron:
status: In Progress → Fix Released
tags: added: mitaka-backport-potential
Revision history for this message
Dongcan Ye (hellochosen) wrote :

Hello, is there a plan to backport this to stable/mitaka?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.0.0.0rc1

This issue was fixed in the openstack/neutron 9.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/382210

tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/382210
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c06ff65dbbc515ef5a70acc98dee4f5cb57dce71
Submitter: Jenkins
Branch: stable/mitaka

commit c06ff65dbbc515ef5a70acc98dee4f5cb57dce71
Author: venkata anil <email address hidden>
Date: Thu Aug 4 07:14:47 2016 +0000

    l2pop fdb flows for HA router ports

    This patch makes L3 HA failover independent of neutron components
    (during failover).

    All HA agents (active and backup) call update_device_up/down after wiring
    the ports. But the l2pop driver is called only for the active agent, as
    the port binding in the DB reflects the active agent. l2pop then creates
    unicast and multicast flows for the active agent.
    On failover, flows to the new active agent are created. For this to
    happen, the database, messaging server, neutron-server and destination
    L3 agent must all be active during failover. This creates two issues:
    1) When any of the above resources (e.g. neutron-server, ..) is dead,
       flows between the new master and other agents won't be created and
       L3 HA failover does not work. In the same scenario, L3 HA failover
       works if l2pop is disabled.
    2) Packet loss during failover is higher, as the above neutron resources
       interact multiple times and so take time to create the l2 flows.

    In this change, we allow the plugin to notify l2pop when
    update_device_up/down is called by backup agents as well. l2pop then
    creates flood flows to all HA agents (both active and slave). l2pop won't
    create a unicast flow for this port; instead, the unicast flow is created
    by the learning action of table 10 when keepalived sends a GARP after
    assigning the IP address to the master router's qr-xx port. As flood flows
    are already created and the unicast flow is added dynamically, L3 HA
    failover does not depend on l2pop.

    This solves two issues:
    1) With L3 HA + l2pop, failover will work even if any of the above agents
       or processes is dead.
    2) Failover time is reduced, as we do not depend on neutron to create
       flows during failover.
    We use the L3HARouterAgentPortBinding table to get all HA agents of a
    router port. The HA router port on a slave agent is also considered for
    l2pop's distributed_active_network_ports and agent_network_active_port_count.

    Conflicts:
            neutron/db/l3_hamode_db.py
            neutron/plugins/ml2/drivers/l2pop/db.py
            neutron/plugins/ml2/drivers/l2pop/mech_driver.py
            neutron/plugins/ml2/rpc.py
            neutron/tests/unit/plugins/ml2/drivers/l2pop/test_db.py
            neutron/tests/unit/plugins/ml2/drivers/l2pop/test_mech_driver.py
            neutron/tests/unit/plugins/ml2/test_rpc.py

    Closes-bug: #1522980
    Closes-bug: #1602614
    Change-Id: Ie1f5289390b3ff3f7f3ed7ffc8f6a8258ee8662e
    (cherry picked from commit 26d8702b9d7cc5a4293b97bc435fa85983be9f01)

tags: added: in-stable-mitaka
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 8.4.0

This issue was fixed in the openstack/neutron 8.4.0 release.
