packet loss during active L3 HA agent restart

Bug #1846198 reported by enax on 2019-10-01
Affects             Status    Importance  Assigned to
neutron             Invalid   Undecided   Unassigned
openstack-ansible   New       Undecided   James Denton

Bug Description

Deployment:

OpenStack-Ansible 19.0.3 (Stein) with two dedicated network nodes (is_metal=True) + linuxbridge + vxlan.
Ubuntu 16.04.6, kernel 4.15.0-62-generic

neutron l3-agent-list-hosting-router R1
+--------------------------------------+---------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+---------------+----------------+-------+----------+
| 1b3b1b5d-08e7-48a1-ab8d-256d94099fb6 | test-network2 | True | :-) | standby |
| fa402ada-7716-4ad4-a004-7f8114fb1edf | test-network1 | True | :-) | active |
+--------------------------------------+---------------+----------------+-------+----------+

How to reproduce: Restart the active l3 agent. (systemctl restart neutron-l3-agent.service)
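The failover can be watched live from the router namespace while restarting the agent. A minimal sketch, using the namespace and HA interface names from the tcpdump capture in this report (these are deployment-specific and will differ on other installs; requires root):

```shell
# Watch VRRP advertisements on the router's HA interface. A priority-0
# advertisement from the active node means its keepalived is resigning
# mastership (as seen at 02:58:56.188558 in the capture below).
ip netns exec qrouter-24010932-a0a4-4454-9539-27c1535c5ed8 \
    tcpdump -lni ha-57528491-1b vrrp
```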

test-network1 server side events:

systemctl restart neutron-l3-agent: @02:58:56.135635630
ip monitor terminated (kill -9) @02:58:56.208922038
vip ips removed @02:58:56.268074480
keepalived terminated @02:58:57.318596743
l3-agent terminated @02:59:07.504366398
keepalived-state-change terminated @03:01:07.735281710

test-network1 journal:
  @02:58:56 test-network1 systemd[1]: Stopping neutron-l3-agent service...
  @02:58:56 test-network1 Keepalived_vrrp[24400]: VRRP_Instance(VR_217) sent 0 priority
  @02:58:56 test-network1 Keepalived_vrrp[24400]: VRRP_Instance(VR_217) removing protocol Virtual Routes
  @02:58:56 test-network1 Keepalived_vrrp[24400]: VRRP_Instance(VR_217) removing protocol VIPs.
  @02:58:56 test-network1 Keepalived_vrrp[24400]: VRRP_Instance(VR_217) removing protocol E-VIPs.
  @02:58:56 test-network1 Keepalived[24394]: Stopping
  @02:58:56 test-network1 neutron-keepalived-state-change[24278]: 2019-10-01 02:58:56.193 24278 DEBUG neutron.agent.linux.utils [-] enax_custom_log: pid: 24283, signal: 9 kill_process /openstack/venvs/neutron-19.0.4.dev1/lib/python2.7/site-packages/neutron/agent/linux/utils.py:243
  @02:58:56 test-network1 audit[24089]: USER_END pid=24089 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:session_close acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success'
  @02:58:56 test-network1 sudo[24089]: pam_unix(sudo:session): session closed for user root
  @02:58:56 test-network1 audit[24089]: CRED_DISP pid=24089 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:setcred acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success'
  @02:58:57 test-network1 Keepalived_vrrp[24400]: Stopped
  @02:58:57 test-network1 Keepalived[24394]: Stopped Keepalived v1.3.9 (10/21,2017)

TCPDUMP qrouter-24010932-a0a4-4454-9539-27c1535c5ed8 ha-57528491-1b:
  @02:58:53.130735 IP 169.254.195.168 > 224.0.0.18: VRRPv2, Advertisement, vrid 217, prio 50, authtype simple, intvl 2s, length 20
  @02:58:55.131926 IP 169.254.195.168 > 224.0.0.18: VRRPv2, Advertisement, vrid 217, prio 50, authtype simple, intvl 2s, length 20
  @02:58:56.188558 IP 169.254.195.168 > 224.0.0.18: VRRPv2, Advertisement, vrid 217, prio 0, authtype simple, intvl 2s, length 20
  @02:58:56.215889 IP 169.254.195.168 > 224.0.0.22: igmp v3 report, 1 group record(s)
  @02:58:56.539804 IP 169.254.195.168 > 224.0.0.22: igmp v3 report, 1 group record(s)
  @02:58:56.995386 IP 169.254.194.242 > 224.0.0.18: VRRPv2, Advertisement, vrid 217, prio 50, authtype simple, intvl 2s, length 20
  @02:58:58.998565 ARP, Request who-has 169.254.0.217 (ff:ff:ff:ff:ff:ff) tell 169.254.0.217, length 28
  @02:58:59.000138 ARP, Request who-has 169.254.0.217 (ff:ff:ff:ff:ff:ff) tell 169.254.0.217, length 28
  @02:58:59.001063 ARP, Request who-has 169.254.0.217 (ff:ff:ff:ff:ff:ff) tell 169.254.0.217, length 28
  @02:58:59.002173 ARP, Request who-has 169.254.0.217 (ff:ff:ff:ff:ff:ff) tell 169.254.0.217, length 28
  @02:58:59.003018 ARP, Request who-has 169.254.0.217 (ff:ff:ff:ff:ff:ff) tell 169.254.0.217, length 28
  @02:58:59.003860 IP 169.254.194.242 > 224.0.0.18: VRRPv2, Advertisement, vrid 217, prio 50, authtype simple, intvl 2s, length 20
  @02:59:01.004772 IP 169.254.194.242 > 224.0.0.18: VRRPv2, Advertisement, vrid 217, prio 50, authtype simple, intvl 2s, length 20

After the l3-agent restart:

neutron l3-agent-list-hosting-router R1
+--------------------------------------+---------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+---------------+----------------+-------+----------+
| 1b3b1b5d-08e7-48a1-ab8d-256d94099fb6 | test-network2 | True | :-) | active |
| fa402ada-7716-4ad4-a004-7f8114fb1edf | test-network1 | True | :-) | standby |
+--------------------------------------+---------------+----------------+-------+----------+

Logs and configs in the attachment.

Brian Haley (brian-haley) wrote :

How much packet loss are you noticing? I don't think there is any guarantee there will be zero loss on an L3-HA failover, it should just be short enough such that connections don't drop.

tags: added: l3-ha
Changed in neutron:
status: New → Incomplete
enax (enax1) wrote :

There shouldn't be any L3-HA failover from a regular l3-agent restart.
The packet loss depends on the number of VMs: in production with 200+ virtual machines it's around 160+ (a 2-3 minute outage, which is expected, since every IP address needs an arping).
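The per-IP arping mentioned here is roughly what happens after a failover: the new active node must announce the VIPs' MAC to upstream devices. As a hand-run illustration only (placeholder namespace, interface, and address, not taken from this deployment; requires root and iputils arping):

```shell
# Send 3 gratuitous ARPs for a floating/VIP address out of the router's
# external interface, so upstream switches relearn the MAC after failover.
ip netns exec qrouter-<uuid> arping -A -c 3 -I qg-<id> 203.0.113.10
```

With hundreds of addresses this per-IP announcement is what stretches the outage into minutes.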

James Denton (james-denton) wrote :

What version of keepalived are you using? Previous testing for us showed versions < ~1.2.16 resulted in longer-than-expected failover times. Keepalived v1.2.19, specifically, showed positive results.

enax (enax1) wrote :

keepalived:
  Installed: 1:1.3.9-1ubuntu0.18.04.2~cloud1
  Candidate: 1:1.3.9-1ubuntu0.18.04.2~cloud1
  Version table:
 *** 1:1.3.9-1ubuntu0.18.04.2~cloud1 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu xenial-updates/queens/main amd64 Packages

enax (enax1) wrote :

I already tried the newest one as well, compiled from source (2.0.18): no change, same behavior.

Brian Haley (brian-haley) wrote :

In this case the active router changed because keepalived was killed. I didn't see that in the l3-agent log; did I miss it?

enax (enax1) wrote :

It's in the journal log, nowhere else.

Brian Haley (brian-haley) wrote :

Can you check the systemd service to see if it's killing keepalived? I don't have an Ubuntu install here.

enax (enax1) wrote :

Systemd can't manage this; every namespace has its own keepalived instance.
The l3-agent or neutron-keepalived-state-change should manage it, probably the l3-agent.
I checked what you requested, just in case :) - there is no keepalived unit running under systemd.
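A quick way to confirm that keepalived runs per router namespace rather than as a systemd unit is to walk the namespaces and list their processes (a sketch; requires root, and namespace names are deployment-specific):

```shell
# keepalived is spawned by the l3-agent inside each qrouter namespace,
# so it never shows up in systemctl. List processes attached to each
# network namespace instead:
for ns in $(ip netns list | awk '{print $1}'); do
    echo "== $ns =="
    for pid in $(ip netns pids "$ns"); do
        ps -o pid=,comm= -p "$pid"
    done
done
```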

Slawek Kaplonski (slaweq) wrote :

Hi,

I just tested on devstack with the linuxbridge agent. When I restarted the l3 agent service, keepalived was still running, and the HA interface in the router's namespace was fine the whole time.
What version of Neutron are you using? Can you also check whether in your case the HA port in the router's namespace is removed and re-created during the L3 agent restart?

enax (enax1) wrote :

I'm not familiar with devstack; does it support L3 HA? I ask because without the HA functionality this works for me too.
Neutron version is 14.0.3.dev44, HEAD detached at 5424bdc from opendev.org/openstack/neutron.
The HA port is fine; just the VIP goes down immediately and comes up on the other node within 2-3 seconds.

Brian Haley (brian-haley) wrote :

There are a few changes in stable/stein after the one you mentioned related to L3-HA, specifically adac5d9b7a72 sticks out. If you can upgrade to at least that and retry it would be great.

Devstack is a tool used to quickly install openstack for testing, https://docs.openstack.org/devstack/latest/

Since you're using a distro release (there is no 14.0.3 tag in the neutron repo), you might need to ask them for an update.

Slawek Kaplonski (slaweq) wrote :

Yes, the patch Brian pointed to may well fix this issue.
I tried it on a single-node devstack, but with the router created as HA, so it spawned a keepalived process and created the HA interface just like in the "regular" HA case. The only difference was that there was no second host with keepalived in the same "cluster", so a failover couldn't happen. I did it like that because I wanted to check whether there is any flap of the HA interface or keepalived process during an agent restart.

enax (enax1) wrote :

Okay, newest version up and running.
neutron_git_repo: https://opendev.org/openstack/neutron
neutron_git_install_branch: add87accc5105124241042a5f0933b76e93648f4
neutron_git_project_group: neutron_all
neutron_git_track_branch: stable/stein

Still the same issue, except there is some neutron-keepalived-state-change stack trace in the journal log now.

Slawek Kaplonski, could you provide your Neutron-related configs from devstack? Maybe the problem is simpler than it looks.

enax (enax1) wrote :

I finally solved it.
The systemd default KillMode for services is control-group, which is why l3-agent restarts wiped out the ip monitor and keepalived processes.
With KillMode=process everything works fine.
This bug belongs to openstack-ansible.
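For reference, a minimal sketch of the workaround as a systemd drop-in (assuming the unit is named neutron-l3-agent.service, as in the journal above; the drop-in file name is illustrative):

```ini
# /etc/systemd/system/neutron-l3-agent.service.d/killmode.conf
# The default KillMode=control-group makes systemd kill every process in the
# service's cgroup on stop/restart, including the keepalived and "ip monitor"
# children the l3-agent spawned per router. KillMode=process kills only the
# main l3-agent process and leaves the children running.
[Service]
KillMode=process
```

Run `systemctl daemon-reload` after adding the drop-in for it to take effect on the next restart.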

Bernard Cafarelli (bcafarel) wrote :

Thanks for the confirmation, marking invalid for neutron then and adding OSA

Changed in neutron:
status: Incomplete → Invalid
affects: neutron → openstack-ansible
Changed in openstack-ansible:
status: Invalid → New
Changed in neutron:
status: New → Invalid
Changed in openstack-ansible:
assignee: nobody → James Denton (james-denton)