HA router master instance in error state because qg-xx interface is down

Bug #1916024 reported by Rodolfo Alonso
Affects: neutron | Status: Fix Released | Importance: High | Assigned to: Rodolfo Alonso

Bug Description

BZ reference: https://bugzilla.redhat.com/show_bug.cgi?id=1929829

Sometimes a router is created with all of its instances in standby mode because the qg-xx interface is in DOWN state and there is no connectivity:

(overcloud) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router1
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+---------------------------+----------------+-------+----------+
| id                                   | host                      | admin_state_up | alive | ha_state |
+--------------------------------------+---------------------------+----------------+-------+----------+
| 3b93ec23-48fa-4847-bbb2-f8903e9865f9 | networker-1.redhat.local  | True           | :-)   | standby  |
| 41b8d1a8-4695-445a-916a-d12db523eb91 | controller-0.redhat.local | True           | :-)   | standby  |
| 4533bd88-d2d1-4320-9e39-6fcb2a5cc236 | networker-0.redhat.local  | True           | :-)   | standby  |
+--------------------------------------+---------------------------+----------------+-------+----------+
(overcloud) [stack@undercloud-0 ~]$

Steps to reproduce:
1. for i in $(seq 10); do ./create.sh $i; done
2. Check FIP connectivity to detect the error
3. for i in $(seq 10); do ./delete.sh $i; done

Scripts: http://paste.openstack.org/show/802777/
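The connectivity check in the reproduction steps can be scripted. A minimal sketch (a hypothetical helper, not taken from the paste scripts above) that probes a TCP port on each floating IP; routers whose qg- interface stayed DOWN show up as unreachable FIPs:

```python
import socket

def fip_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical floating IPs assigned by the create.sh scripts.
fips = ["10.0.0.101", "10.0.0.102"]
broken = [fip for fip in fips if not fip_reachable(fip, timeout=0.5)]
```

A FIP landing in `broken` is a candidate for checking the router's qg- interface state.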

This seems to be a race condition between the L3 agent and keepalived when configuring the qg-xxx interface:
- /var/log/messages: http://paste.openstack.org/show/802778/
- L3 agent logs: http://paste.openstack.org/show/802779/

When keepalived is setting the qg-xxx interface IP addresses, the interface disappears from udev and then reappears (I don't know why yet). The log in journalctl looks the same as when a new interface is created.

Since [1], the L3 agent controls the GW interface status (up or down). If the L3 agent does not set the interface link up, the router namespace won't be able to send or receive any traffic.

[1] https://review.opendev.org/q/I8dca2c1a2f8cb467cfb44420f0eea54ca0932b05
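The "wait for the device, then link it up" step the agent has to perform can be illustrated with a minimal sketch. This is not the actual neutron code; `device_exists` and `set_link_up` are hypothetical stand-ins for the agent's ip-lib helpers:

```python
import time

def ensure_link_up(device, device_exists, set_link_up,
                   timeout: float = 3.0, interval: float = 0.1) -> bool:
    """Wait up to `timeout` seconds for `device` to (re)appear, then set it UP.

    Returns False if the device never showed up -- the situation described in
    this bug, where the qg- interface vanished while keepalived was
    configuring it and the router namespace is left without connectivity.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if device_exists(device):
            set_link_up(device)
            return True
        time.sleep(interval)
    return False
```

If the interface reappears within the window, the race is absorbed; if not, the router stays broken and only an error can be logged, which is the residual problem discussed in the comments below.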

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
tags: added: l3-ha
Changed in neutron:
importance: Undecided → High
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

If accepted, this patch should be backported up to Queens. See https://review.opendev.org/q/I8dca2c1a2f8cb467cfb44420f0eea54ca0932b05.

tags: added: queens-backport-potential
Changed in neutron:
status: New → In Progress
tags: added: neutron-proactive-backport-potential
Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

Rodolfo, may I point you to https://bugs.launchpad.net/neutron/+bug/1927868, or in particular my comment at https://bugs.launchpad.net/neutron/+bug/1927868/comments/26.

Unfortunately, this issue is NOT fixed. Don't get me wrong, I am quite happy that error logging was added here. But every time this error is logged, a state is reached which will NOT be fixed without manual intervention (i.e. triggering a keepalived switchover).

So a real fix would be to either have neutron remove itself from being master for a vRouter when this issue is hit, or at least do some sort of reconciliation to attempt to reach a working state again. Otherwise one ends up with the keepalived master in a non-working state and two other nodes in backup state not taking over.
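One possible shape for such a reconciliation, sketched with hypothetical hooks (`is_keepalived_master`, `gw_interface_up`, `trigger_failover` are stand-ins, not real neutron APIs): if this node is VRRP master but the gateway port is DOWN, step down so a healthy backup can take over:

```python
def reconcile_router(router_id, is_keepalived_master, gw_interface_up,
                     trigger_failover) -> bool:
    """Demote a broken master so a healthy backup can take over.

    Returns True if a failover was triggered. All three callables are
    hypothetical hooks; a real implementation would query keepalived's VRRP
    state and the qg- port's link state inside the router namespace.
    """
    if is_keepalived_master(router_id) and not gw_interface_up(router_id):
        # e.g. lower the VRRP priority or restart keepalived in the namespace
        trigger_failover(router_id)
        return True
    return False
```

Run periodically per HA router, this would turn the deadlock into a transient failover instead of a permanent outage.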

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Christian:

Then you are hitting another bug. What the patch solves is the situation where the interface disappears and reappears while keepalived is configuring it.

If you still see this error message, it is because:
- keepalived didn't finish configuring the interface. If the interface doesn't appear within this time (3 seconds), you may have another problem.
- the interface was deleted.

What you need to identify is what is happening in your system and why the interface is not found in the kernel when it is about to be set UP.

Regards.
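One way to catch the disappearing interface in the act is to watch netlink events (e.g. `ip monitor link`) while reproducing. A minimal parser for output in that style; the line format assumed here, and the sample below, are illustrative rather than captured from this bug:

```python
import re

# Matches lines in the style of `ip monitor link` output, where deletions
# are prefixed with "Deleted" (assumed format, for illustration).
LINE_RE = re.compile(r"^(?P<deleted>Deleted )?\d+: (?P<dev>[^:@]+)[@:]")

def qg_events(lines):
    """Yield (event, device) pairs for qg- interfaces seen in the output."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("dev").startswith("qg-"):
            yield ("deleted" if m.group("deleted") else "added/changed",
                   m.group("dev"))
```

A "deleted" event immediately followed by an "added/changed" event for the same qg- device during router provisioning would confirm the disappear/reappear behaviour described above.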

Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

Rodolfo, I am sorry, but I believe you misunderstood my comment.

I was not implying that your change *caused* the issue, but rather that it shows even more clearly that the vRouter can reach a state (on the master) in which Neutron fails to reach a usable state (i.e. because the qg- interface is down), and then there is only an error logged about it (thanks to your recent addition).

I was simply saying that allowing a vRouter to reach such a deadlocked state, in which a node is master (keepalived master, that is) but not functioning, with ONLY an error logged, is not really a fix for this condition.

Certainly the first thing would be to find the root cause or race condition that occasionally triggers this. As I wrote before, we have been experiencing this occasionally on our Train installation, which was and is prior to your changes anyway.

My comment was just about the reported issue, "HA router master instance in error state because qg-xx interface is down", and that is NOT (always) fixed by your change.

TL;DR:

* you did not cause it
* you did not make it worse
* but, your change did not fix it for all cases

tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/c/openstack/neutron/+/801857

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/stein)

Change abandoned by "Edward Hope-Morley <email address hidden>" on branch: stable/stein
Review: https://review.opendev.org/c/openstack/neutron/+/801857

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi, should we set the bug status from Fix Released back to Confirmed? The fix was reverted, so the problem still exists.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

To add a bit more context on the current state of the issue described in this bug: the fix for this bug was subsequently reverted due to a regression it introduced (see bug 1927868), and therefore this issue can once again present itself. The reason why qg- ports are marked as DOWN is the patch landed as part of bug 1859832, and as I understand it there is now consensus upstream [1] to revert that patch as well, since it goes against the grain of what L3HA is doing and a better solution is needed for that particular issue. I have not found a bug open for the revert yet, so I plan to open one, as I have tried the revert myself and seen good results.

[1] https://meetings.opendev.org/meetings/neutron_drivers/2022/neutron_drivers.2022-03-04-14.03.log.txt
