Gratuitous ARPs are not sent during master transition

Bug #1952907 reported by Damian Dąbrowski
This bug affects 5 people
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: LIU Yulong

Bug Description

* High level description:

When a router transitions to MASTER state, keepalived should send GARPs, but it fails because the qg-* interface is down (it comes up about 1 second later, so this looks like a race condition).
Keepalived should also send another round of GARPs after 60 seconds (garp_master_delay), but it doesn't (probably because the first ones fail, but I'm not 100% sure).

When I add a random port to this router to trigger a keepalived reload, all GARPs are sent properly (because the netns is already configured and the qg-* interface stays up the whole time).

* Pre-conditions:

Operating System: Ubuntu 20.04
Keepalived version: 2.0.19
Affected neutron releases:
  - my AIO env: Xena (master/106fa3e6d3f0b1c32ef28fe9dd6b125b9317e9cf # HEAD as of 29.09.2021)
  - my prod env: Victoria
  - (most likely all versions after this change https://review.opendev.org/c/openstack/neutron/+/707406)

* Step-by-step reproduction:

Simply perform a failover on an HA router.
The same result can also be achieved by removing all L3 agents from the router and then adding one back:

# openstack router create neutron-bug --ha
# openstack router set --external-gateway public neutron-bug
# neutron l3-agent-list-hosting-router neutron-bug
# (for all l3 agents): neutron l3-agent-router-remove L3_AGENT_ID neutron-bug
# (for a single l3 agent): neutron l3-agent-router-add L3_AGENT_ID neutron-bug
(GARPs are not sent)
# openstack router add port neutron-bug test-port
(GARPs are sent properly)

* Expected output:

Gratuitous ARPs should be sent from router's namespace during MASTER transition.

* Actual output:

Gratuitous ARPs are not sent.
Keepalived complains: Error 100 (Network is down) sending gratuitous ARP on qg-4a2f0239-5c for 172.29.249.194
The qg-* interface comes up about 1 second after keepalived tries to send the GARPs.
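
To check how often this happens across failovers, the error line above can be picked out of the keepalived log. A minimal sketch, assuming the message always has the shape quoted in this report (other keepalived versions may word it differently); `find_failed_garps` is a hypothetical helper name, not part of neutron or keepalived:

```python
import re

# Pattern modelled on the single log line quoted in this report.
GARP_ERROR = re.compile(
    r"Error (?P<errno>\d+) \((?P<reason>[^)]*)\) sending gratuitous ARP "
    r"on (?P<iface>\S+) for (?P<ip>\S+)"
)


def find_failed_garps(log_text):
    """Return (interface, ip, reason) for each failed GARP found in the log."""
    return [
        (m.group("iface"), m.group("ip"), m.group("reason"))
        for m in GARP_ERROR.finditer(log_text)
    ]


sample = (
    "Error 100 (Network is down) sending gratuitous ARP "
    "on qg-4a2f0239-5c for 172.29.249.194\n"
)
failed = find_failed_garps(sample)
print(failed)
```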

* Root cause

Currently neutron keeps the qg- interface down on BACKUP agents: https://review.opendev.org/c/openstack/neutron/+/707406
Keepalived's MASTER transition takes place before keepalived-state-change notifies neutron-l3-agent about the state change.
As a result, neutron-l3-agent brings the qg- interface up only after keepalived's MASTER transition, so keepalived can't send GARPs during the transition because the qg- interface is still down.
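
The ordering problem can be written down as a tiny model. This is only an illustration of the race described above, not neutron code: the function name and timestamps are made up, except for the roughly 1-second gap taken from the report. It covers only the first GARP burst, not the garp_master_delay one.

```python
def garps_can_be_sent(garp_sent_at, qg_up_at):
    """GARPs only go out if the qg- interface is already up when they are sent."""
    return qg_up_at <= garp_sent_at


# Failover: the l3-agent brings qg- up about 1 s after keepalived's transition.
failover = garps_can_be_sent(garp_sent_at=0.0, qg_up_at=1.0)

# Reload after adding a port: qg- has been up since long before the reload.
reload_after_port_add = garps_can_be_sent(garp_sent_at=100.0, qg_up_at=1.0)

print("failover:", failover)
print("reload after port add:", reload_after_port_add)
```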

* Proposed solutions

1. Revert https://review.opendev.org/c/openstack/neutron/+/707406 and always keep qg- interfaces up
I'm not sure, but maybe we don't need the above change anymore because the issue was fixed in keepalived itself: https://github.com/acassen/keepalived/commit/b10bbfc2a2b216487cea5a586c55765275e41253

2. Send delayed GARPs from keepalived_state_change.py
Change proposal: https://review.opendev.org/c/openstack/neutron/+/821433

3. Send GARPs for FIPs as well (as is done for non-HA routers by ./agent/l3/legacy_router.py)
Change proposal: https://review.opendev.org/c/openstack/neutron/+/821434

P.S. As solutions 2 and 3 only send GARPs, we may also need to fix IPv6 NDP. Besides ARPs, keepalived also fails to send unsolicited neighbor advertisements. I'm not sure about this though, as I don't know much about IPv6.
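
As a rough illustration of solutions 2 and 3, the GARPs for the gateway address and any floating IPs could be sent with arping from inside the router namespace once the qg- interface is up. This is a sketch under assumptions, not the code from the linked changes: the helper name, the namespace name, the second address and the count are placeholders, and it only builds the commands (using iputils arping's -U, -I and -c options) rather than running them.

```python
def build_garp_commands(namespace, iface, addresses, count=3):
    """Build one 'arping -U' command per address, to run inside the netns."""
    return [
        ["ip", "netns", "exec", namespace,
         "arping", "-U", "-I", iface, "-c", str(count), address]
        for address in addresses
    ]


# Placeholder values: the namespace and the second address are invented;
# the interface and first address are the ones from the keepalived error.
commands = build_garp_commands(
    namespace="qrouter-EXAMPLE",
    iface="qg-4a2f0239-5c",
    addresses=["172.29.249.194", "172.29.249.195"],
)
for cmd in commands:
    print(" ".join(cmd))
```

IPv6 addresses would need unsolicited neighbor advertisements instead, which this sketch does not cover.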

* Attachments:

Keepalived logs: https://paste.openstack.org/raw/811372/
Interfaces inside router's netns + tcpdump from master transition: https://paste.openstack.org/raw/811373/

Tags: l3-ha
description: updated
tags: added: l3-ha

LIU Yulong (dragon889) wrote :

Actually, I noticed this issue about 1.5 years ago. There is a patch [1] meant to deal with it, but it has not received enough attention upstream. Thank you for the bug report. Maybe you can try this fix.

[1] https://review.opendev.org/c/openstack/neutron/+/712474

Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
importance: Undecided → Medium
status: New → In Progress

Damian Dąbrowski (damiandabrowski) wrote :

Thanks for the reply!

I tested this fix, but unfortunately it didn't solve the issue. I've added my comments there and I'm still trying to find a proper solution.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/821433

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/821434

description: updated

Damian Dąbrowski (damiandabrowski) wrote :

Hey,

I spent a few additional hours on this problem and came up with some ideas.
I have added 'Root cause' and 'Proposed solutions' sections to the bug description.

Jean-Philippe Evrard (jean-philippe-evrard) wrote :

Hello Damian, did you try running only the revert to see how it behaves? I am particularly interested in whether it solves your issue AND doesn't trigger https://bugs.launchpad.net/neutron/+bug/1859832 again.

Side note: We might want to work out which branches this revert can be backported to, as older branches map to certain OS versions which might not have the right version of keepalived. (Could you check how far we can backport the revert too, please?)

Thank you in advance.

Magnus Bergman (magnusbe) wrote :

Just a general status update: discussion about the path forward for this issue was on the agenda for the 2022-02-25 neutron_drivers meeting, but due to lack of time it was postponed to the next meeting (2022-03-04).

The potential revert (or rework) of https://review.opendev.org/c/openstack/neutron/+/707406 as a solution to this issue will also most certainly be discussed during that meeting.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/836198

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hi Damian, Magnus:

I've pushed https://review.opendev.org/c/openstack/neutron/+/836198. This patch reverts [1] and creates an iptables DROP filter for all IPv6 packets before flushing the interface's IPv6 address. Then the DROP rule is deleted and the IPv6 address is added to keepalived. At this point the interface and its IPv6 link-local address are controlled by keepalived.

This patch should handle both this bug and LP#1859832.

During the drivers meeting you said you were able to test any new patch. If you have time to do so, that would be very helpful.

Regards.

Damian Dąbrowski (damiandabrowski) wrote :

Thanks a lot!
I'm going to test this patch next week

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/836581

Stefan Hoffmann (mr-hopeman) wrote :

Hi guys,

we faced a similar issue, and in addition failover takes a long time with many routers. So fixes for this bug may also fix our problem.

What do you think about letting keepalived take care of the iptables filters, or even of bringing the interface up/down? (In that case the GARPs would need to be sent again, because the notify_master/backup scripts seem to be executed after keepalived sends the GARPs.)

Or should we open a new bug for the failover issue instead?

In any case, we can test the provided fixes.

Regards

Stefan Hoffmann (mr-hopeman) wrote :

Regarding proposed solution 1:
[1] doesn't help us here, as we don't use VMAC for the qg interfaces. Also, keepalived only monitors the ha interface, not the qg interface, so I guess the fix doesn't do anything for this interface.

So we can only revert [2] if we have a nice way to handle the issues with MLDv2 and other discovery packets.

[1] https://github.com/acassen/keepalived/commit/b10bbfc2a2b216487cea5a586c55765275e41253
[2] https://review.opendev.org/c/openstack/neutron/+/707406

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/821433
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/821434
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/839671

Damian Dąbrowski (damiandabrowski) wrote :

I came up with another way to solve this problem: https://review.opendev.org/c/openstack/neutron/+/839671

Please have a look when you have a moment.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/839671
Committed: https://opendev.org/openstack/neutron/commit/5288593fafe6636fc14b8873465866d20de26935
Submitter: "Zuul (22348)"
Branch: master

commit 5288593fafe6636fc14b8873465866d20de26935
Author: Damian Dabrowski <email address hidden>
Date: Thu Apr 28 02:54:25 2022 +0200

    [L3-HA] Disable automatic link-local address assignment for HA routers

    In order to get both [1] and [2] fixed, we set
    `net.ipv6.conf.all.addr_gen_mode=1` in HA router namespace to
    prevent auto-assigning link-local address(lla) to the interfaces.
    We don't need lla auto-assignment as keepalived manages them.
    With this change, we will have link-local addresses only on active
    router, which will prevent 'dadfailed' and MLD packets will not be
    sent from standby router.

    Previously we also reverted [3] to always keep qg-* interface up on both
    active&standby router's instance, no matter if keepalived is started or
    not.
    Without link-local address assigned, backup router's instance won't
    send any packets, so I see no reason to keep qg-* interface down.

    [1] https://bugs.launchpad.net/neutron/+bug/1952907
    [2] https://bugs.launchpad.net/neutron/+bug/1859832
    [3] https://review.opendev.org/c/openstack/neutron/+/834162

    Closes-Bug: #1952907
    Related-Bug: #1859832
    Depends-On: https://review.opendev.org/c/openstack/neutron/+/834162
    Change-Id: I306f14aa6b7e8bb69a81f441be337bc1a584d3b2
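
For reference, the setting named in the commit message can be applied by hand in a router namespace with sysctl. The sketch below only builds the command; the helper name and namespace are placeholders, and how neutron itself applies the setting is not shown here. Per the kernel's ip-sysctl documentation, addr_gen_mode=1 means no link-local address is generated automatically.

```python
def addr_gen_mode_command(namespace, mode=1):
    """Build the sysctl call that sets addr_gen_mode inside a netns."""
    return [
        "ip", "netns", "exec", namespace,
        "sysctl", "-w", f"net.ipv6.conf.all.addr_gen_mode={mode}",
    ]


# "qrouter-EXAMPLE" is a placeholder namespace name.
cmd = addr_gen_mode_command("qrouter-EXAMPLE")
print(" ".join(cmd))
```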

Changed in neutron:
status: In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/846607

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/846607
Committed: https://opendev.org/openstack/neutron/commit/2365abfd007abd166fcc7cced2509da7763b3769
Submitter: "Zuul (22348)"
Branch: master

commit 2365abfd007abd166fcc7cced2509da7763b3769
Author: Damian Dabrowski <email address hidden>
Date: Mon Jun 20 13:42:20 2022 +0200

    Add a release note for 834162

    I forgot to write a release note when pushing change 834162 [1].
    It may be an important change for operators so it's good to have a
    release note about that.

    [1] https://review.opendev.org/c/openstack/neutron/+/834162

    Related-Bug: #1952907
    Change-Id: Ie707f461af11357d6eaa004bc98c7eb09a62202f

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/836198
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 21.0.0.0rc1

This issue was fixed in the openstack/neutron 21.0.0.0rc1 release candidate.
