Update permanent ARP entries for allowed_address_pair IPs in DVR Routers

Bug #1774459 reported by Swaminathan Vasudevan
This bug affects 16 people
Affects: neutron
Status: In Progress
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

We have a long-standing issue with allowed_address_pair IPs that are associated with unbound ports and DVR routers.
The ARP entry for the allowed_address_pair IP does not change based on the GARP issued by any keepalived instance.

Since DVR updates the ARP table through the control plane, and does not allow any ARP requests to leave the node (to prevent the router IP/MAC from polluting the network), there has always been an issue with this.

A recent patch in master https://review.openstack.org/#/c/550676/ to address this issue was not successful.

This patch helped in updating the ARP entry dynamically from the GARP message, but the entry has to be temporary (NUD state 'reachable'). Only when it was set to 'reachable' were we able to update it on the fly from the GARP message, without using any external tools.

The problem is this: suppose we have VMs residing in two different subnets (Subnet A and Subnet B), and a VM in Subnet B, on a different isolated node, tries to ping the VRRP IP in Subnet A. The packet from the VM comes to the router namespace, where the ARP entry for the VRRP IP is available as reachable. While it is reachable the VM is able to send a couple of pings, but within about 15 seconds the pings time out.

The reason is that the router in turn tries to verify whether the IP/MAC combination for the VRRP IP is still valid, since the entry in the ARP table is "REACHABLE" and not "PERMANENT".
When it tries to re-ARP for the IP, the ARP requests are blocked by the DVR flow rules in br-tun, so the ARP times out and the ARP entry in the router namespace becomes incomplete.
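
The failure sequence described here can be modelled with a toy state machine. This is only an illustrative sketch: the state names are the ones used in this report, while `NeighEntry` and `revalidate` are hypothetical names and the transitions are heavily simplified compared to the real kernel neighbour table.

```python
from dataclasses import dataclass


@dataclass
class NeighEntry:
    ip: str
    mac: str
    state: str  # "PERMANENT", "REACHABLE" or "INCOMPLETE"


def revalidate(entry, probe_answered):
    """Simplified model of the router re-confirming a neighbour entry.

    PERMANENT entries are never probed. A REACHABLE entry is probed, and
    if the ARP request gets no reply (as when br-tun drops it), the entry
    is modelled as falling back to INCOMPLETE.
    """
    if entry.state == "PERMANENT":
        return entry
    if not probe_answered:
        return NeighEntry(entry.ip, "", "INCOMPLETE")
    return NeighEntry(entry.ip, entry.mac, "REACHABLE")
```

With a REACHABLE entry and `probe_answered=False` the model ends up INCOMPLETE, which corresponds to the pings timing out; a PERMANENT entry is left untouched.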

Option A:
So one way to address this situation is to use a GARP sniffer tool/utility running in the router namespace that sniffs for GARP packets with a specific IP as a filter. If that IP is seen in a GARP message, the tool/utility should in turn reset the ARP entry for the VRRP IP as permanent. This is very performance intensive, so I am not sure it would be helpful. We should probably make it configurable, so that people can use it if required.
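
To make the "GARP with a specific IP" check concrete, here is a minimal sketch of how a sniffer could recognise a gratuitous ARP in a raw Ethernet frame. It assumes untagged IPv4-over-Ethernet frames and treats any ARP whose sender and target protocol addresses are equal as gratuitous; `parse_garp` and `build_garp` are hypothetical helper names, and capturing the frames is left out entirely.

```python
import socket
import struct

ETH_P_ARP = 0x0806


def parse_garp(frame):
    """Return (ip, mac) if the frame is a gratuitous ARP, else None."""
    if len(frame) < 42:  # 14-byte Ethernet header + 28-byte ARP body
        return None
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != ETH_P_ARP:
        return None
    htype, ptype, hlen, plen, _oper = struct.unpack("!HHBBH", frame[14:22])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    sha, spa, _tha, tpa = struct.unpack("!6s4s6s4s", frame[22:42])
    if spa != tpa:  # a gratuitous ARP announces the sender's own address
        return None
    mac = ":".join("%02x" % b for b in sha)
    return socket.inet_ntoa(spa), mac


def build_garp(ip, mac):
    """Build an untagged gratuitous ARP request frame (for testing only)."""
    sha = bytes(int(part, 16) for part in mac.split(":"))
    spa = socket.inet_aton(ip)
    eth = b"\xff" * 6 + sha + struct.pack("!H", ETH_P_ARP)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1) + sha + spa + b"\x00" * 6 + spa
    return eth + arp
```

A sniffer would compare the returned IP against the configured allowed_address_pair IPs and only act on a match.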

Option B:
The other option is, instead of running it on all nodes and in all router namespaces, to run it only in the network node's router namespace, or on the network node host, and have it notify neutron that the IP/MAC mapping changed. Neutron would then tell all the hosts to do an ARP update for the given IP/MAC. (Just an idea; not sure how simple it is compared to the former.)

Any ideas or thoughts would be helpful.

Boden R (boden)
tags: added: rfe
Miguel Lavalle (minsel)
tags: added: rfe-triaged
removed: rfe
Changed in neutron:
importance: Undecided → Wishlist
Brian Haley (brian-haley) wrote :

I'm not a big fan of a process running that is snooping on traffic; it's most likely going to cause a performance issue.

Could this work like the keepalived_state_change code? It uses "ip monitor" to watch for events and trigger actions, and could be modified to look for "neigh" events.
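
As a rough illustration of watching "neigh" events, the sketch below parses one line of neighbour-event text. The line layout (`<ip> dev <device> lladdr <mac> <STATE>`, optionally prefixed with `Deleted`) is an assumption about what `ip monitor neigh` prints, and `parse_neigh_event` is a hypothetical helper, not part of keepalived_state_change.

```python
def parse_neigh_event(line):
    """Parse one neighbour event line into a dict, or return None.

    Assumed layout: "[Deleted ]<ip> dev <device> [lladdr <mac>] <STATE>".
    """
    tokens = line.split()
    if not tokens:
        return None
    deleted = tokens[0] == "Deleted"
    if deleted:
        tokens = tokens[1:]
    if len(tokens) < 2:
        return None
    event = {
        "ip": tokens[0],
        "deleted": deleted,
        "dev": None,
        "mac": None,
        "state": tokens[-1],  # the NUD state is assumed to be the last token
    }
    # Pick up the value following each known keyword, if present.
    for keyword, field in (("dev", "dev"), ("lladdr", "mac")):
        if keyword in tokens[:-1]:
            event[field] = tokens[tokens.index(keyword) + 1]
    return event
```

A monitor process could feed each output line through this function and react only when the IP is one of the allowed_address_pair IPs.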

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Hi Brian, yes, I can take a look at the keepalived_state_change code and see how it works and how it could be used in our case.
But I'm not sure it can be used as is, since keepalived in our case is running inside the VM and not in the namespace.

Changed in neutron:
status: New → Confirmed
Miguel Lavalle (minsel)
Changed in neutron:
importance: Wishlist → Critical
importance: Critical → High
tags: removed: rfe-triaged
Miguel Lavalle (minsel)
summary: - RFE: Update permanent ARP entries for allowed_address_pair IPs in DVR
- Routers
+ Update permanent ARP entries for allowed_address_pair IPs in DVR Routers
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

I don't think that we can use the IP-Monitor for our purpose.
We should definitely come up with a GARP-sniff tool similar to IP-Monitor and then use it for our purpose.
If we think that performance will be an issue, we should probably come up with a dedicated node that does the sniffing and reports back to the Neutron server.
That way the Neutron server can then do an RPC update to all agents to add a permanent entry.

Adolfo Duarte (adolfo-duarte) wrote :

Is there any other way to collect the same information that would be collected by garp-sniffing?
What info needs to be propagated?

Also what about putting a flow rule into the openflow tables that route certain packets to all other compute nodes?

For example, a rule could be added to forward GARP packets to all other "members" of dvr group?

And then you can hang a smarter process off each router namespace and process them however is required. Or make it part of the dvr code.

Adolfo Duarte (adolfo-duarte) wrote :

Also on the performance aspect of sniffing GARP packets: a system would suffer in performance if it is processing ALL packets. What if we filter down to only GARP packets *before* the monitoring tool gets them?
It's basically the same thing a host must do. All hosts have to listen to GARP packets, so I am not sure you would get any more traffic than you get already.
It's just that instead of dropping the packets on the floor, you pull them into user space for processing.
Perhaps the term "sniffing" is what we need to avoid.
Again, we could put in a flow rule that pulls specifically GARP packets from an IP address/MAC combo and gives those packets to a process in user space.
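
A flow rule of that kind could look roughly like the string built below. This is a sketch only: it uses the ovs-ofctl match fields `arp`, `arp_spa`, `arp_tpa` and `dl_src` with the `controller` action, but the table number and priority are placeholders rather than the values neutron actually uses, and `garp_to_controller_flow` is a hypothetical helper.

```python
def garp_to_controller_flow(ip, mac=None, table=0, priority=100):
    """Build an ovs-ofctl style flow that punts GARPs for `ip` to the controller.

    A GARP for a given address has sender IP == target IP, so both fields
    are matched against the same value. Table and priority are placeholders.
    """
    match = [
        "table=%d" % table,
        "priority=%d" % priority,
        "arp",
        "arp_spa=%s" % ip,
        "arp_tpa=%s" % ip,
    ]
    if mac:
        match.append("dl_src=%s" % mac)  # optionally pin the source MAC too
    return ",".join(match) + ",actions=controller"
```

Because the match is limited to one IP (and optionally one MAC), only the GARPs of interest would leave the fast path.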

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

One other simple solution would be to forward GARP packets for the MACs that are configured for allowed_address_pair to the Ryu controller. The controller can process the packet in the 'packet_in_handler' and then create an ARP response entry in the ARP Responder for the MAC and IP.

Before writing the entry, we should probably check the flows currently configured in the ARP Responder table; if there is a duplicate entry for the IP/MAC combination, delete it and rewrite the flows for the GARP packet.
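
The check-then-rewrite step could be sketched as follows, with the ARP Responder table modelled as a plain dict from IP to MAC. This is an assumption-laden illustration: `update_arp_responder` is a hypothetical name, and here an existing entry is only deleted when its MAC differs from the one announced in the GARP.

```python
def update_arp_responder(table, ip, mac):
    """Update a dict-based model of the ARP Responder table.

    Returns the list of flow operations that would be needed, as
    ("delete" | "add", ip, mac) tuples. An identical entry is left alone.
    """
    operations = []
    current = table.get(ip)
    if current == mac:
        return operations  # nothing changed, no flows to touch
    if current is not None:
        operations.append(("delete", ip, current))  # stale MAC for this IP
    table[ip] = mac
    operations.append(("add", ip, mac))
    return operations
```

After a VRRP failover the same IP is announced with a new MAC, so the function returns one delete followed by one add.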

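Today we don't have a packet_in_handler for the Ryu controller app. It would be something similar to this:
import array
from ryu.controller import handler, ofp_event
from ryu.lib.packet import packet

@handler.set_ev_cls(ofp_event.EventOFPPacketIn, handler.MAIN_DISPATCHER)
def packet_in_handler(self, ev):
    # Decode the raw packet-in payload and print each protocol layer.
    pkt = packet.Packet(array.array('B', ev.msg.data))
    for p in pkt:
        # Trailing padding is yielded as raw bytes without a protocol_name.
        print(getattr(p, 'protocol_name', 'payload'), p)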

My knowledge of L2 OpenFlow is pretty limited, so it would help if any L2 OpenFlow experts could comment on whether this would work or not.

Rossella Sblendido (rossella-o) wrote :

Swami, I am not sure this last proposal is better in terms of performance than sniffing GARP. I don't have all the details and I am new to this problem, but have you tried, instead of setting the NUD to reachable, changing the timeout so that ARP entries become stale pretty quickly and the GARP can update them? A combination of frequent GARPs and entries that go stale quickly might fix this.

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Rossella, thanks for your feedback. The issue is that we do see the GARP update the ARP entry. But when there is a ping to that IP, the router tries to re-ARP to confirm the IP, and that is where it fails.

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Not sure if we can do something similar to ARP Responder to add a dynamic ARP Responder rule to send an ARP reply for the GARP'd MAC.

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Adding a rule similar to ARP Responder may not be possible. So the best bet is to forward the GARP packets to an output port (tap port created for this purpose). Then a separate process can listen on the tap port and then parse the packet and provide info to the Network node ( neutron-server) to update the ARP entry on all nodes.

Please provide me your thoughts on this.

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

If we feel that this process running in the compute node will reduce the performance, then what we should do is probably have a new agent type running this process and then communicating to the Network node. That way we don't need to run this process on all compute nodes.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/601336

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Adding a rule dynamically to the ARP responder is only possible if we can forward the GARP packet to the OpenFlow controller, where the controller can process the packet-in messages and then decide whether to create a flow in the ARP responder table.
Is this possible in the neutron OpenFlow native drivers today? Could any L2 OpenFlow experts comment on it?
This would be the simplest approach; otherwise, we need to forward the packet to user space, process it there, and then send it to the Neutron server to take the necessary action.

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

After talking with Miguel Ajo and Daniel Alvarez, the plan is to intercept GARP packets and forward them to the local controller for processing.
Swami had a WIP patch that inserted an OF rule to intercept them; it would need an update to send to the controller:
https://review.openstack.org/#/c/601336/
For ryu look here:
https://ryu.readthedocs.io/en/latest/ofproto_v1_0_ref.html#packet-in-message
https://ryu.readthedocs.io/en/latest/writing_ryu_app.html?highlight=packet%20in
ovn-controller handles incoming packets (packet-in) as controller here:
https://github.com/openvswitch/ovs/blob/769e6223daf3d6e51963dc3ee938a01fdc71a0d0/ovn/controller/pinctrl.c#L1221
https://github.com/openvswitch/ovs/blob/769e6223daf3d6e51963dc3ee938a01fdc71a0d0/ovn/controller/pinctrl.c#L1121
ODL is tracking this on https://jira.opendaylight.org/browse/NETVIRT-1402
Basically flows will be programmed dynamically when a GARP is recognized by the controller.

Miguel Lavalle (minsel)
Changed in neutron:
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/616272

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/628027

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/651905

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Swaminathan Vasudevan (<email address hidden>) on branch: master
Review: https://review.openstack.org/628027
Reason: Abandoning this patch since I have a better one to address this issue.
https://review.openstack.org/#/c/651905/

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/651905
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=52b537ca22b2d7d81a84b2f75de577d8dffee94c
Submitter: Zuul
Branch: master

commit 52b537ca22b2d7d81a84b2f75de577d8dffee94c
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Apr 11 11:12:24 2019 -0700

    DVR: Modify DVR flows to allow ARP requests to hit ARP Responder table

    DVR does the ARP table update through the control plane, and does not
    allow any ARP requests to get out of the node.

    In order to address the allowed address pair VRRP IP issue with DVR,
    we need to add an ARP entry into the ARP Responder table for the
    allowed address pair IP ( which is taken care by the patch in [1])

    This patch adds a rule in the br-int to redirect the packet
    destinated to the router to the actual router-port and also moves
    the arp filtering rule to the tunnel or the physical port based on the
    configuration.

    By adding the above rule it allows the ARP requests to reach the
    ARP Responder table and filters the ARP requests before it reaches
    the physical network or the tunnel.

    [1] https://review.opendev.org/#/c/601336/
    Related-Bug: #1774459

    Change-Id: I3905ea56ca0ff35bdd96c818719e6d63a3eb5a72

Changed in neutron:
status: Confirmed → In Progress
tags: added: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/669938

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/685779

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/685779
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=09c4e0e970b02ed992aab762505d17f1decca551
Submitter: Zuul
Branch: stable/stein

commit 09c4e0e970b02ed992aab762505d17f1decca551
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Apr 11 11:12:24 2019 -0700

    DVR: Modify DVR flows to allow ARP requests to hit ARP Responder table

    DVR does the ARP table update through the control plane, and does not
    allow any ARP requests to get out of the node.

    In order to address the allowed address pair VRRP IP issue with DVR,
    we need to add an ARP entry into the ARP Responder table for the
    allowed address pair IP ( which is taken care by the patch in [1])

    This patch adds a rule in the br-int to redirect the packet
    destinated to the router to the actual router-port and also moves
    the arp filtering rule to the tunnel or the physical port based on the
    configuration.

    By adding the above rule it allows the ARP requests to reach the
    ARP Responder table and filters the ARP requests before it reaches
    the physical network or the tunnel.

    [1] https://review.opendev.org/#/c/601336/
    Related-Bug: #1774459

    Change-Id: I3905ea56ca0ff35bdd96c818719e6d63a3eb5a72
    (cherry picked from commit 52b537ca22b2d7d81a84b2f75de577d8dffee94c)

tags: added: in-stable-stein
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.opendev.org/616272
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/669938
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ea85e39660abaa224c159c42214fdb7042302bea
Submitter: Zuul
Branch: master

commit ea85e39660abaa224c159c42214fdb7042302bea
Author: Brian Haley <email address hidden>
Date: Tue Jul 9 15:50:35 2019 -0400

    Force arp_responder to True when DVR and tunneling enabled

    After [1] and [2], the ARP responder needs to be enabled
    if DVR and tunneling are enabled or ARP will not work. If
    it is False we will log a message and force it to True.

    [1] https://review.opendev.org/#/c/651905/
    [2] https://review.opendev.org/#/c/653883/

    Change-Id: I934062c970effe5194056b0786f84f3246850701
    Related-bug: #1774459
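
The behaviour described in this commit message can be sketched as a small validation function. This is a simplified illustration, not the merged code: `effective_arp_responder` and its parameter names are hypothetical stand-ins for the agent's configuration values.

```python
import logging

LOG = logging.getLogger(__name__)


def effective_arp_responder(arp_responder, enable_dvr, tunnel_types):
    """Return the arp_responder value the agent should actually use.

    When DVR and tunneling are both enabled, a False setting is
    overridden to True and a message is logged, as the commit describes.
    """
    if enable_dvr and tunnel_types and not arp_responder:
        LOG.warning("arp_responder forced to True because DVR and "
                    "tunneling are enabled")
        return True
    return arp_responder
```

For example, `effective_arp_responder(False, True, ["vxlan"])` returns True, while the setting is left alone when DVR is disabled or no tunnel types are configured.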

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/700805

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/700988

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/705200

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/705201

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/700988
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=dc8c1deeee7ad46958a771567516e91ca52cd156
Submitter: Zuul
Branch: stable/train

commit dc8c1deeee7ad46958a771567516e91ca52cd156
Author: Brian Haley <email address hidden>
Date: Tue Jul 9 15:50:35 2019 -0400

    Force arp_responder to True when DVR and tunneling enabled

    After [1] and [2], the ARP responder needs to be enabled
    if DVR and tunneling are enabled or ARP will not work. If
    it is False we will log a message and force it to True.

    [1] https://review.opendev.org/#/c/651905/
    [2] https://review.opendev.org/#/c/653883/

    Change-Id: I934062c970effe5194056b0786f84f3246850701
    Related-bug: #1774459
    (cherry picked from commit ea85e39660abaa224c159c42214fdb7042302bea)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/rocky)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/rocky
Review: https://review.opendev.org/705200

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/queens)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/705201

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/stein)

Change abandoned by Brian Haley (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/700805
Reason: Abandoning based on my last comment, since this would require additional code that isn't present in this branch.

wang (yunhua) wrote :

stay tuned

Changed in neutron:
assignee: Brian Haley (brian-haley) → Slawek Kaplonski (slaweq)
Jörg Frede (frede-r) wrote :

Is someone still working on this bug? It looks to me like there hasn't been any progress since February.

Slawek Kaplonski (slaweq) wrote :

Hi Jorg, Yes, I recently revived patch https://review.opendev.org/#/c/601336/ and I'm working on it currently.

Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Slawek Kaplonski (slaweq) → nobody
status: In Progress → New
tags: added: timeout-abandon
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/601336
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Yusuf Güngör (yusuf2) wrote :

Hi Slawek Kaplonski, is someone still working on this bug? Thanks.

Changed in neutron:
status: New → In Progress
Yusuf Güngör (yusuf2) wrote :

We are having trouble using Octavia LB with HA because of this situation. The ARP record for the Octavia keepalived VIP is not updated by gratuitous ARP. We asked the Octavia team for details, but there is nothing they can do about this. Octavia Issue: https://storyboard.openstack.org/#!/story/2009765

Darrick Horton (vmaccel) wrote :

This is also affecting us with Octavia LB on Xena. Is anyone actively working on this?

Alexandre Perreault (alexperreault) wrote :

We are also having problems with ARP when using octavia LB and DVR routers in Yoga.

Justin Alford (jlalford) wrote :

We are also running into this issue in both Queens and Victoria (both DVR on OVS). We have been able to work around the GARP issue using the neutron API during a failover, but it would save us a lot of API cycles if the GARP were able to trigger updates across the cluster.
