DVR + L3 HA: packet loss during failover is higher than expected

Bug #1602614 reported by Ann Taraday on 2016-07-13
This bug affects 2 people
Affects: neutron
Importance: High
Assigned to: Carl Baldwin

Bug Description

Scale environment 3 controllers 45 compute nodes. Mitaka, DVR + L3 HA

When the active agent is stopped, connectivity takes longer to re-establish than it does for plain HA routers on the same environment.

Steps to reproduce:
1. Create 2 routers:
neutron router-create router(1,2) --ha True --distributed True
2. Create 2 internal networks; connect one to router1 and the other to router2
3. Boot an instance in each network:
nova boot --image <image_id> --flavor <flavor_id> --nic net_id=<private_net_id> vm(1,2)
4. Assign a floating IP to the VM in the second network
5. Log in to VM1 using ssh or the VNC console
6. Start pinging the floating IP of the second VM and check that packets are not lost
7. Check which agent is active for router1 with
neutron l3-agent-list-hosting-router <router_id>
8. Stop the active l3 agent
9. Wait until another agent becomes active in neutron l3-agent-list-hosting-router <router_id>
10. Start the stopped agent
11. Stop the ping and check the number of packets that were lost.
12. Increase the number of routers and repeat steps 5-10

Results for ha+dvr routers http://paste.openstack.org/show/531271/

Note that for plain HA routers the number of lost packets in the same scenario is ~3.
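For step 11 above, the loss count can be read straight off ping's summary line. A minimal helper, assuming the standard Linux (iputils) ping summary format:

```python
import re

def lost_packets(ping_summary: str) -> int:
    """Parse the summary line of Linux ping output and return the
    number of packets that never came back."""
    m = re.search(r"(\d+) packets transmitted, (\d+) received", ping_summary)
    if not m:
        raise ValueError("unrecognized ping summary line")
    transmitted, received = int(m.group(1)), int(m.group(2))
    return transmitted - received

# Example summary line as printed by ping after Ctrl-C:
summary = "120 packets transmitted, 117 received, 2.5% packet loss, time 119218ms"
print(lost_packets(summary))  # 3
```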

description: updated

Will ask adolfo to triage it.

Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
status: New → In Progress
venkata anil (anil-venkata) wrote :

Existing approach -

Steps that happen without neutron involvement during failover:
1) keepalived assigns the IP address to qr-xx and sends a GARP

Neutron involvement during failover:
2) keepalived-state-change-monitor writes to a unix domain socket about the address assignment
3) the l3 agent spawns the radvd and metadata proxy processes, then calls the neutron server's update_routers_states
4) while updating the router states, the server calls port update with the new master agent
5) the plugin's port update notifies the l2 agent
6) the l2 agent wires the port up again and calls
    a) port status update with BUILD as status
    b) then port status update again with ACTIVE as status
7) on the plugin side, the l2pop driver processes each port status update.
   When the status transitions from BUILD to ACTIVE, the l2pop driver notifies agents to add unicast and multicast entries for this port and agent.
8) the l2 agent's l2pop code creates flood flows (table 22) and unicast flows (table 20) on br-tun
With this, the wiring between agents is complete, and VMs across nodes can use the new HA master router.
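The chain above means flow creation on failover depends on every one of these components being alive at the same time. A toy illustration of that dependency (illustrative Python, not neutron code — component names are just labels):

```python
# Toy model of the pre-fix failover path: l2pop flows for the new master
# are only created if the whole chain of components is up during failover.

CHAIN = ["database", "messaging", "neutron-server", "l3-agent"]

def old_failover_creates_flows(components_alive: dict) -> bool:
    """Flows for the new master appear only if every component in the
    chain (DB, messaging, neutron-server, destination l3 agent) is up."""
    return all(components_alive.get(c, False) for c in CHAIN)

# All components up: failover wiring succeeds.
up = {c: True for c in CHAIN}
assert old_failover_creates_flows(up)

# Any single component down during failover: no flows, failover breaks.
assert not old_failover_creates_flows({**up, "neutron-server": False})
assert not old_failover_creates_flows({**up, "messaging": False})
```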

New approach (i.e - https://review.openstack.org/#/c/323314/)

During router interface add (and not during failover), the port is bound and wired up on all HA agents, so l2pop creates multicast flows (and no unicast flows) to all HA agents.

Steps that happen without neutron involvement during failover:
1) keepalived assigns the IP address to qr-xx and sends a GARP.
When the nodes receive the GARP, they add a unicast flow to the master router in table 20 because of the learning action in table 10.
All flows are therefore established without neutron involvement, and VMs can reach the router.

2) Neutron involvement during failover:
  none for wiring up or creating flows. VMs can reach the master HA router without any neutron involvement.
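The table-10 learning behaviour described above can be sketched with a toy model of br-tun (illustrative Python; the table roles follow the OVS agent's br-tun layout, but the class, tunnel names and MAC are hypothetical):

```python
# Toy model of br-tun under the new approach. Flood flows (table 22) to
# all HA agents are pre-installed by l2pop at router interface add; the
# unicast flow (table 20) appears only when a GARP from the new master
# hits the learning action in table 10. Purely illustrative.

class BrTun:
    def __init__(self, ha_agent_tunnels):
        # table 22: flood to every HA agent, installed up front by l2pop
        self.flood_outputs = set(ha_agent_tunnels)
        # table 20: MAC -> tunnel port, filled dynamically by the learn action
        self.unicast = {}

    def receive_garp(self, router_mac, src_tunnel):
        # learning action in table 10: remember which tunnel the
        # master router's qr-xx MAC arrived on
        self.unicast[router_mac] = src_tunnel

    def output_for(self, dst_mac):
        # unicast if learned, otherwise flood to all HA agents
        if dst_mac in self.unicast:
            return {self.unicast[dst_mac]}
        return self.flood_outputs

br = BrTun(ha_agent_tunnels={"vxlan-node1", "vxlan-node2", "vxlan-node3"})
qr_mac = "fa:16:3e:aa:bb:cc"

# Before any GARP: traffic to the router MAC is flooded to all HA nodes.
assert br.output_for(qr_mac) == {"vxlan-node1", "vxlan-node2", "vxlan-node3"}

# keepalived on node2 becomes master and sends a GARP.
br.receive_garp(qr_mac, "vxlan-node2")
assert br.output_for(qr_mac) == {"vxlan-node2"}

# Failover: node3 becomes master; its GARP re-learns the path with
# no neutron involvement at all.
br.receive_garp(qr_mac, "vxlan-node3")
assert br.output_for(qr_mac) == {"vxlan-node3"}
```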

Change abandoned by venkata anil (<email address hidden>) on branch: master
Review: https://review.openstack.org/323314
Reason: Preferring patch 255237 over this
1) Backporting alembic migration may not be allowed
2) To avoid special handling for TOR drivers

note: We may revisit these patches (340031, 324302, 323314) later to solve 1488015.

venkata anil (anil-venkata) wrote :

Change https://review.openstack.org/#/c/255237/ is the complete fix for this bug.

In this change, flood flows to all of an HA router's nodes are established during router interface add, through l2pop. After the transition to master, keepalived sends a GARP. When compute nodes receive this GARP, they add a unicast flow to the master HA node. So no interaction between neutron components (i.e. processing of port binding and update, or the l3 agent / server interaction for updating router states) is required to create flows during failover, which reduces OVS flow setup time and minimizes packet loss.

Assaf Muller (amuller) wrote :

This is essentially a duplicate of https://bugs.launchpad.net/neutron/+bug/1522980. I think that proper HA for DVR SNAT traffic is a high priority Newton fix.

Changed in neutron:
milestone: none → newton-3
importance: Undecided → High
Changed in neutron:
milestone: newton-3 → newton-rc1
Changed in neutron:
assignee: venkata anil (anil-venkata) → Carl Baldwin (carl-baldwin)

Reviewed: https://review.openstack.org/255237
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=26d8702b9d7cc5a4293b97bc435fa85983be9f01
Submitter: Jenkins
Branch: master

commit 26d8702b9d7cc5a4293b97bc435fa85983be9f01
Author: venkata anil <email address hidden>
Date: Thu Aug 4 07:14:47 2016 +0000

    l2pop fdb flows for HA router ports

    This patch makes L3 HA failover not dependent on neutron components
    (during failover).

    All HA agents (active and backup) call update_device_up/down after wiring
    the ports, but the l2pop driver is called only for the active agent, as
    the port binding in the DB reflects the active agent. l2pop then creates
    unicast and multicast flows for the active agent.
    On failover, flows to the new active agent are created. For this to
    happen, the database, messaging server, neutron-server and destination L3
    agent must all be active during failover. This creates two issues:
    1) When any of the above resources (i.e. neutron-server, ...) is dead,
       flows between the new master and other agents won't be created and
       L3 HA failover does not work. In the same scenario, L3 HA failover
       works if l2pop is disabled.
    2) Packet loss during failover is higher, as the above neutron resources
       interact multiple times, which takes time to create l2 flows.

    In this change, we allow the plugin to notify l2pop when
    update_device_up/down is called by backup agents as well. l2pop then
    creates flood flows to all HA agents (both active and slave). l2pop won't
    create a unicast flow for this port; instead, the unicast flow is created
    by the learning action of table 10 when keepalived sends a GARP after
    assigning the IP address to the master router's qr-xx port. As flood
    flows are already created and the unicast flow is dynamically added,
    L3 HA failover does not depend on l2pop.

    This solves two issues:
    1) with L3 HA + l2pop, failover works even if any of the above agents
       or processes is dead.
    2) failover time is reduced, as we no longer depend on neutron to create
       flows during failover.
    We use the L3HARouterAgentPortBinding table to get all HA agents of a
    router port. An HA router port on a slave agent is also considered for
    l2pop's distributed_active_network_ports and agent_network_active_port_count.

    Closes-bug: #1522980
    Closes-bug: #1602614
    Change-Id: Ie1f5289390b3ff3f7f3ed7ffc8f6a8258ee8662e

Changed in neutron:
status: In Progress → Fix Released

This issue was fixed in the openstack/neutron 9.0.0.0rc1 release candidate.


Reviewed: https://review.openstack.org/382210
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c06ff65dbbc515ef5a70acc98dee4f5cb57dce71
Submitter: Jenkins
Branch: stable/mitaka

commit c06ff65dbbc515ef5a70acc98dee4f5cb57dce71
Author: venkata anil <email address hidden>
Date: Thu Aug 4 07:14:47 2016 +0000

    l2pop fdb flows for HA router ports

    This patch makes L3 HA failover not dependent on neutron components
    (during failover).

    All HA agents (active and backup) call update_device_up/down after wiring
    the ports, but the l2pop driver is called only for the active agent, as
    the port binding in the DB reflects the active agent. l2pop then creates
    unicast and multicast flows for the active agent.
    On failover, flows to the new active agent are created. For this to
    happen, the database, messaging server, neutron-server and destination L3
    agent must all be active during failover. This creates two issues:
    1) When any of the above resources (i.e. neutron-server, ...) is dead,
       flows between the new master and other agents won't be created and
       L3 HA failover does not work. In the same scenario, L3 HA failover
       works if l2pop is disabled.
    2) Packet loss during failover is higher, as the above neutron resources
       interact multiple times, which takes time to create l2 flows.

    In this change, we allow the plugin to notify l2pop when
    update_device_up/down is called by backup agents as well. l2pop then
    creates flood flows to all HA agents (both active and slave). l2pop won't
    create a unicast flow for this port; instead, the unicast flow is created
    by the learning action of table 10 when keepalived sends a GARP after
    assigning the IP address to the master router's qr-xx port. As flood
    flows are already created and the unicast flow is dynamically added,
    L3 HA failover does not depend on l2pop.

    This solves two issues:
    1) with L3 HA + l2pop, failover works even if any of the above agents
       or processes is dead.
    2) failover time is reduced, as we no longer depend on neutron to create
       flows during failover.
    We use the L3HARouterAgentPortBinding table to get all HA agents of a
    router port. An HA router port on a slave agent is also considered for
    l2pop's distributed_active_network_ports and agent_network_active_port_count.

    Conflicts:
            neutron/db/l3_hamode_db.py
            neutron/plugins/ml2/drivers/l2pop/db.py
            neutron/plugins/ml2/drivers/l2pop/mech_driver.py
            neutron/plugins/ml2/rpc.py
            neutron/tests/unit/plugins/ml2/drivers/l2pop/test_db.py
            neutron/tests/unit/plugins/ml2/drivers/l2pop/test_mech_driver.py
            neutron/tests/unit/plugins/ml2/test_rpc.py

    Closes-bug: #1522980
    Closes-bug: #1602614
    Change-Id: Ie1f5289390b3ff3f7f3ed7ffc8f6a8258ee8662e
    (cherry picked from commit 26d8702b9d7cc5a4293b97bc435fa85983be9f01)

tags: added: in-stable-mitaka

This issue was fixed in the openstack/neutron 8.4.0 release.
