Don't set HA ports down while L3 agent restart.

Bug #1959151 reported by Krzysztof Tomaszewski
This bug affects 1 person
Affects   Status         Importance   Assigned to             Milestone
neutron   Fix Released   High         Krzysztof Tomaszewski

Bug Description

Because of the fix for bug #1597461 [1], the L3 agent puts all its
HA ports down during the initialization phase. Unfortunately, this
operation can break already working L3 communication when
you restart the agent service (rewiring a port from the down state to
up can take a few seconds, and some VRRP packets could be lost,
so a router HA state change may be triggered).

This is the effect of calling:
self.plugin_rpc.update_all_ha_network_port_statuses
in neutron/agent/l3/agent.py#L393 during the L3 agent
initialization phase, in _check_ha_router_process_status.

Restarting the agent process should not affect an already working
configuration (customer traffic).

A possible workaround would be to put HA ports into the DOWN state
only on host restart, not on every L3 agent restart.

[1] https://bugs.launchpad.net/neutron/+bug/1597461

Changed in neutron:
assignee: nobody → Krzysztof Tomaszewski (labedz)
Changed in neutron:
status: New → In Progress
Krzysztof Tomaszewski (labedz) wrote :
Changed in neutron:
importance: Undecided → Medium
importance: Medium → High
Krzysztof Tomaszewski (labedz) wrote :

It seems that the problem is slightly different from what I thought.

The issue is that the condition in l3/agent.py L#388

388 if (not (vrrp_pcount / 2 >= self.ha_router_count and
389 vrrp_st_pcount >= self.ha_router_count)):

should check whether the keepalived and neutron-keepalived-state-change
processes are (not) running, as a precondition for resetting the HA-* ports.

Unfortunately, this condition is thrown off by wrong values returned by
linux_utils.get_process_count_by_name() in L#379 (at least on Ubuntu).

The vrrp_st_pcount variable will be wrong because the implementation
of linux_utils.get_process_count_by_name():

def get_process_count_by_name(name):
    """Find the process count by name."""
    return len([p for p in psutil.process_iter(['name']) if
                p.info['name'] == name])

compares only the process name and does not take into account that, for
neutron-keepalived-state-change, that name is the Python interpreter:

stack@devstack1:~$ python3
Python 3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import psutil
>>> for p in psutil.process_iter(['name']):
... if p.pid == 1443651: print(p.info['name'])
...
/usr/bin/python
>>>
stack@devstack1:~$
stack@devstack1:~$ cat /proc/1443651/cmdline
/usr/bin/python3.8 /usr/local/bin/neutron-keepalived-state-change --router_id=1a3 <CUT>
stack@devstack1:~$ cat /proc/1443651/comm
/usr/bin/python
stack@devstack1:~$

That makes the condition in l3/agent.py L#388 true (almost) every time,
which leads the agent to reset the HA-* port states on (almost) every restart.
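
The effect of the undercount can be reproduced in isolation. The sketch
below copies the condition quoted from l3/agent.py L#388 into a standalone
function. The function name should_reset_ha_ports and the sample counts are
assumptions made for illustration; they are not neutron code.

```python
def should_reset_ha_ports(vrrp_pcount, vrrp_st_pcount, ha_router_count):
    """Standalone copy of the quoted condition (hypothetical wrapper)."""
    return not (vrrp_pcount / 2 >= ha_router_count and
                vrrp_st_pcount >= ha_router_count)


# Hypothetical host: 2 HA routers, 4 keepalived processes found.
# State-change monitors counted correctly (2): no reset is requested.
print(should_reset_ha_ports(4, 2, 2))

# State-change monitors counted as 0, because the name lookup sees the
# Python interpreter instead of the script name: a reset is requested.
print(should_reset_ha_ports(4, 0, 2))
```

With a count of 0 for the state-change processes, the second clause can
never hold on a host that has at least one HA router, so the reset branch
is always taken.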

One suggestion would be to base that condition on the existence of
network namespaces, or to 'fix' the get_process_count_by_name() util.
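
As a rough illustration of the second option, the util could also look at
the command line, not only at the process name. This is a minimal sketch,
not the neutron implementation: the helper matches_process and the choice to
inspect only the first two command-line arguments are assumptions. It relies
on psutil.process_iter() accepting a list of attribute names, as the existing
util already does.

```python
import os


def matches_process(wanted, proc_name, cmdline):
    """Return True if the process looks like `wanted` (hypothetical helper).

    The reported process name is checked first. If that fails, the
    basenames of the first two command-line arguments are checked, so a
    script started as `python3 /path/to/script` is still recognised.
    """
    if proc_name == wanted:
        return True
    return any(os.path.basename(arg) == wanted
               for arg in (cmdline or [])[:2])


def get_process_count_by_name(wanted):
    """Count processes by name, also considering the command line."""
    # psutil is third-party; imported here so matches_process has no
    # dependency on it.
    import psutil
    return sum(
        1 for p in psutil.process_iter(['name', 'cmdline'])
        if matches_process(wanted, p.info['name'], p.info['cmdline'])
    )
```

With the values from the session above, matches_process() accepts a process
whose name is /usr/bin/python as long as its second argument ends in
neutron-keepalived-state-change.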

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/826545
Committed: https://opendev.org/openstack/neutron/commit/f430cd00725f8303f5313cb7784c9aed4b982e62
Submitter: "Zuul (22348)"
Branch: master

commit f430cd00725f8303f5313cb7784c9aed4b982e62
Author: labedz <email address hidden>
Date: Thu Jan 27 00:13:40 2022 +0100

    Don't set HA ports down while L3 agent restart.

    Because of the fix for bug [1] and an issue with linux_utils
    get_process_count_by_name(), the L3 agent puts all its HA ports down
    during the initialization phase. Unfortunately, such an operation can
    break already working L3 communication. Rewiring an ha-* port from the
    down state to up can take a few seconds, and some VRRP packets could be
    lost in the meantime. That triggers keepalived on the other node, so a
    router HA state change may be triggered.

    This change prevents putting HA ports down when, during the
    initialization phase, the L3 agent finds its own network namespaces
    already configured. The existence of such a namespace is good evidence
    that network configuration is already in place, so the host was not
    rebooted and this is most probably just an agent restart.

    [1] https://bugs.launchpad.net/neutron/+bug/1597461

    Closes-Bug: #1959151
    Change-Id: Id9c906b2d141c3bedd80fb5f868190f8a4b66f54
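
The namespace check described in the commit message can be sketched roughly
as follows. This is not the merged code. It assumes that named network
namespaces appear as entries under /run/netns (where `ip netns` keeps them)
and that router namespaces use the qrouter- prefix; the function names are
hypothetical.

```python
import os

# Prefix assumed for router namespaces in this sketch.
ROUTER_NS_PREFIX = 'qrouter-'


def has_router_namespaces(netns_dir='/run/netns'):
    """Return True if any router namespace entry exists under netns_dir."""
    try:
        names = os.listdir(netns_dir)
    except FileNotFoundError:
        # No namespace directory at all: nothing has been configured yet.
        return False
    return any(name.startswith(ROUTER_NS_PREFIX) for name in names)


def looks_like_host_reboot(netns_dir='/run/netns'):
    """Treat the absence of router namespaces as a sign of a fresh boot."""
    return not has_router_namespaces(netns_dir)
```

Under these assumptions, the HA ports would be set down only when
looks_like_host_reboot() returns True, so a plain agent restart on a host
that still has its router namespaces leaves the ports untouched.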

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.0.0.0rc1

This issue was fixed in the openstack/neutron 20.0.0.0rc1 release candidate.
