HA router master instance in error state because qg-xx interface is down

Bug #1916024 reported by Rodolfo Alonso
Affects: neutron | Status: Fix Released | Importance: High | Assigned to: Rodolfo Alonso

Bug Description

BZ reference: https://bugzilla.redhat.com/show_bug.cgi?id=1929829

Sometimes a router is created with all of its instances in standby mode because the qg-xx interface is in DOWN state and there is no connectivity:

(overcloud) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router1
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+---------------------------+----------------+-------+----------+
| id                                   | host                      | admin_state_up | alive | ha_state |
+--------------------------------------+---------------------------+----------------+-------+----------+
| 3b93ec23-48fa-4847-bbb2-f8903e9865f9 | networker-1.redhat.local  | True           | :-)   | standby  |
| 41b8d1a8-4695-445a-916a-d12db523eb91 | controller-0.redhat.local | True           | :-)   | standby  |
| 4533bd88-d2d1-4320-9e39-6fcb2a5cc236 | networker-0.redhat.local  | True           | :-)   | standby  |
+--------------------------------------+---------------------------+----------------+-------+----------+
(overcloud) [stack@undercloud-0 ~]$

Steps to reproduce:
1. for i in $(seq 10); do ./create.sh $i; done
2. Check FIP connectivity to detect the error
3. for i in $(seq 10); do ./delete.sh $i; done

Scripts: http://paste.openstack.org/show/802777/
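The connectivity check in the reproduction steps can be scripted. A minimal sketch (a hypothetical helper, not taken from the paste scripts above) that probes a TCP port on each floating IP; routers whose qg- interface stayed DOWN show up as unreachable FIPs:

```python
import socket

def fip_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical floating IPs assigned by the create.sh scripts.
fips = ["10.0.0.101", "10.0.0.102"]
broken = [fip for fip in fips if not fip_reachable(fip, timeout=0.5)]
```

A FIP landing in `broken` is a candidate for checking the router's qg- interface state.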

This seems to be a race condition between the L3 agent and keepalived when configuring the qg-xxx interface:
- /var/log/messages: http://paste.openstack.org/show/802778/
- L3 agent logs: http://paste.openstack.org/show/802779/

When keepalived is setting the qg-xxx interface IP addresses, the interface disappears from udev and then reappears (I don't know why yet). The log in journalctl looks the same as when a new interface is created.

Since [1], the L3 agent controls the GW interface status (up or down). If the L3 agent does not set the interface link up, the router namespace won't be able to send or receive any traffic.

[1] https://review.opendev.org/q/I8dca2c1a2f8cb467cfb44420f0eea54ca0932b05
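The "wait for the device, then link it up" step the agent has to perform can be illustrated with a minimal sketch. This is not the actual neutron code; `device_exists` and `set_link_up` are hypothetical stand-ins for the agent's ip-lib helpers:

```python
import time

def ensure_link_up(device, device_exists, set_link_up,
                   timeout: float = 3.0, interval: float = 0.1) -> bool:
    """Wait up to `timeout` seconds for `device` to (re)appear, then set it UP.

    Returns False if the device never showed up -- the situation described in
    this bug, where the qg- interface vanished while keepalived was
    configuring it and the router namespace is left without connectivity.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if device_exists(device):
            set_link_up(device)
            return True
        time.sleep(interval)
    return False
```

If the interface reappears within the window, the race is absorbed; if not, the router stays broken and only an error can be logged, which is the residual problem discussed in the comments below.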

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
tags: added: l3-ha
Changed in neutron:
importance: Undecided → High
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

If accepted, this patch should be backported up to Queens. See https://review.opendev.org/q/I8dca2c1a2f8cb467cfb44420f0eea54ca0932b05.

tags: added: queens-backport-potential
Changed in neutron:
status: New → In Progress
tags: added: neutron-proactive-backport-potential
Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

Rodolfo, may I point you to https://bugs.launchpad.net/neutron/+bug/1927868, or in particular my comment at https://bugs.launchpad.net/neutron/+bug/1927868/comments/26.

Unfortunately, this issue is NOT fixed. Don't get me wrong, I am quite happy that error logging was added here. But every time this error is logged, a state is reached which will NOT be fixed without manual intervention (i.e. triggering a keepalived switchover).

So a real fix would be to either have neutron remove itself from being master for a vRouter when this issue is hit, or at least do some sort of reconciliation to attempt to reach a working state again. Otherwise one ends up with the keepalived master in a non-working state and two other nodes in backup state not taking over.
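One possible shape for such a reconciliation, sketched with hypothetical hooks (`is_keepalived_master`, `gw_interface_up`, `trigger_failover` are stand-ins, not real neutron APIs): if this node is VRRP master but the gateway port is DOWN, step down so a healthy backup can take over:

```python
def reconcile_router(router_id, is_keepalived_master, gw_interface_up,
                     trigger_failover) -> bool:
    """Demote a broken master so a healthy backup can take over.

    Returns True if a failover was triggered. All three callables are
    hypothetical hooks; a real implementation would query keepalived's VRRP
    state and the qg- port's link state inside the router namespace.
    """
    if is_keepalived_master(router_id) and not gw_interface_up(router_id):
        # e.g. lower the VRRP priority or restart keepalived in the namespace
        trigger_failover(router_id)
        return True
    return False
```

Run periodically per HA router, this would turn the deadlock into a transient failover instead of a permanent outage.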

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Christian:

Then you are hitting another bug. What the patch solves is the situation where the interface disappears and reappears while keepalived is configuring it.

If you still see this error message, it is because:
- keepalived didn't finish configuring the interface. If the interface doesn't appear within this time (3 seconds), you may have another problem.
- the interface was deleted.

What you need to identify is what is happening in your system and why the interface is not found in the kernel when it is about to be set UP.

Regards.
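One way to catch the disappearing interface in the act is to watch netlink events (e.g. `ip monitor link`) while reproducing. A minimal parser for output in that style; the line format assumed here, and the sample below, are illustrative rather than captured from this bug:

```python
import re

# Matches lines in the style of `ip monitor link` output, where deletions
# are prefixed with "Deleted" (assumed format, for illustration).
LINE_RE = re.compile(r"^(?P<deleted>Deleted )?\d+: (?P<dev>[^:@]+)[@:]")

def qg_events(lines):
    """Yield (event, device) pairs for qg- interfaces seen in the output."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("dev").startswith("qg-"):
            yield ("deleted" if m.group("deleted") else "added/changed",
                   m.group("dev"))
```

A "deleted" event immediately followed by an "added/changed" event for the same qg- device during router provisioning would confirm the disappear/reappear behaviour described above.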

Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

Rodolfo, I am sorry, but I believe you misunderstood my comment.

I was not implying that your change *caused* the issue, but rather that it shows even more clearly that the vRouter can reach a state (on the master) in which Neutron fails to reach a usable state (i.e. because the qg- interface is down), and then there is only an error logged about it (thanks to your recent addition).

I was simply saying that allowing a vRouter to reach such a deadlocked state, in which a node is master (keepalived master, that is) but not functioning, with ONLY an error logged, is not really a fix for this condition.

Certainly the first thing would be to find the root cause or race condition that occasionally triggers this. As I wrote before, we have been experiencing this occasionally on our Train installation, which was and is prior to your changes anyway.

My comment was just about the reported issue, "HA router master instance in error state because qg-xx interface is down", and that is NOT (always) fixed by your change.

TL;DR:

* you did not cause it
* you did not make it worse
* but, your change did not fix it for all cases

tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/c/openstack/neutron/+/801857

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/stein)

Change abandoned by "Edward Hope-Morley <email address hidden>" on branch: stable/stein
Review: https://review.opendev.org/c/openstack/neutron/+/801857

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi, should we set the bug status from Fix Released back to Confirmed? The fix was reverted, so the problem still exists.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

To add a bit more context on the current state of the issue described in this bug: the fix for this bug was subsequently reverted due to a regression it introduced (see bug 1927868), and therefore this issue can once again present itself. The reason why qg- ports are marked as DOWN is the patch landed as part of bug 1859832, and as I understand it there is now consensus upstream [1] to revert that patch as well, since it goes against the grain of what L3HA is doing and a better solution is needed for that particular issue. I have not found a bug open for the revert yet, so I plan to open one, as I have tried the revert myself and seen good results.

[1] https://meetings.opendev.org/meetings/neutron_drivers/2022/neutron_drivers.2022-03-04-14.03.log.txt
