2/3 snat namespace transitions to master

Bug #1863110 reported by Marek Grudzinski
This bug affects 1 person
Affects: neutron
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

neutron version: 14.0.2
general deployment version: stein
deployment method: kolla-ansible
neutron configuration:
 - l3 = ha
 - agent_mode = dvr_snat
 - ovs
general info: multi node deployment, ca ~100 computes
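
For reference, these settings can be double-checked on the running nodes; a quick sketch, assuming kolla-ansible's default container names (neutron_server, neutron_l3_agent) and config paths:

# l3_ha and scheduling settings on the server side
docker exec neutron_server grep -E '^(l3_ha|max_l3_agents_per_router)' /etc/neutron/neutron.conf
# agent_mode on the controller/network nodes
docker exec neutron_l3_agent grep -E '^agent_mode' /etc/neutron/l3_agent.ini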

When spawning larger heat stacks with multiple instances (think k8s infrastructure), we sometimes (roughly 50% of the time) get a "split brain" on the snat namespaces.

The logs look like this on one of the three controller/network nodes.

11:53:43.402 Handling notification for router 2a218a31-2ef6-406a-a719-17965600e182, state master enqueue /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/l3/ha.py:50
11:53:43.403 Router 2a218a31-2ef6-406a-a719-17965600e182 transitioned to master

And then this happens on another of the three controller/network nodes.

11:53:57.582 Handling notification for router 2a218a31-2ef6-406a-a719-17965600e182, state master enqueue /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/l3/ha.py:50
11:53:57.583 Router 2a218a31-2ef6-406a-a719-17965600e182 transitioned to master

So neutron sets up the SNAT routes on both controller nodes and wreaks havoc on the sessions that instances are creating to the outside. Obviously, deleting the routes from the faulty namespace solves the issue.
I can't really find the reason for it being promoted to master, even when looking through the debug logs, and would greatly appreciate any helpful pointers.
The only thing I can think of is some kind of race condition, which is why everything in neutron looks fine.
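
A rough sketch of how we confirm the split brain and clear it on the faulty node (container name and paths assumed from the kolla defaults, <router_id> is the affected router, and this assumes the namespaces are visible from the host):

# Each node's view of the router state, as written by neutron-keepalived-state-change
docker exec neutron_l3_agent cat /var/lib/neutron/ha_confs/<router_id>/state
# Both "masters" still carry the addresses and routes in the snat namespace
ip netns exec snat-<router_id> ip -o addr show
ip netns exec snat-<router_id> ip route
# Workaround on the faulty node: remove the stale routes (example for the default route)
ip netns exec snat-<router_id> ip route del default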

Tags: l3-ha
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Hi,

It's the keepalived process that decides which node is master and which is backup. Can you check the keepalived logs - maybe there is some info about the reason for this problem.
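
For a kolla-based deployment, a rough idea of where to look, assuming keepalived logs via syslog with its default identifiers and that agent logs live under /var/log/kolla/neutron/:

# keepalived's VRRP process usually logs through syslog
journalctl -t Keepalived_vrrp --since "2020-02-18 11:50" | grep -i 2a218a31
grep -i keepalived /var/log/syslog
# neutron agent side of the transitions
grep -i "transitioned to" /var/log/kolla/neutron/neutron-l3-agent.log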

tags: added: l3-ha
Revision history for this message
Brian Haley (brian-haley) wrote :

I think we've seen this with an old version of keepalived, can you verify you have a new(er) version?
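
For what it's worth, a quick way to check the version bundled in the image (assuming the default kolla container name):

docker exec neutron_l3_agent keepalived --version
docker exec neutron_l3_agent dpkg -l keepalived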

Revision history for this message
Marek Grudzinski (ivve) wrote :

Hello Slawek & Brian,

Yes, but the logs say nothing of value and I don't understand why both are becoming master. I will paste the keepalived logs below.

The keepalived version is the one bundled with the kolla neutron l3 agent image; neutron is 14.0.2 and the keepalived version is shown below. I could try to repackage the container with a newer custom version. Do you have any recommendation on which versions to use?

ii keepalived 1:1.3.9-1ubuntu0.18.04.2 amd64 Failover and monitoring daemon for LVS clusters

Revision history for this message
Marek Grudzinski (ivve) wrote :

2020-02-18 08:21:27.455 3129610 INFO neutron.common.config [-] Logging enabled!
2020-02-18 08:21:27.455 3129610 INFO neutron.common.config [-] /var/lib/kolla/venv/bin/neutron-keepalived-state-change version 14.0.2
2020-02-18 08:21:27.456 3129610 DEBUG neutron.common.config [-] command line: /var/lib/kolla/venv/bin/neutron-keepalived-state-change --router_id=abfd6fde-a4b5-436e-9eb4-7da3ee926279 --namespace=snat-abfd6fde-a4b5-436e-9eb4-7da3ee926279
--conf_dir=/var/lib/neutron/ha_confs/abfd6fde-a4b5-436e-9eb4-7da3ee926279 --monitor_interface=ha-5952f8d5-dd --monitor_cidr=169.254.0.92/24 --pid_file=/var/lib/neutron/external/pids/abfd6fde-a4b5-436e-9eb4-7da3ee926279.monitor.pid.neutron-keepalived-state-change-monitor --state_path=/var/lib/neutron --user=42435 --group=42435 setup_logging /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/common/config.py:103
2020-02-18 08:21:27.463 3129726 DEBUG neutron.agent.common.async_process [-] Launching async process [ip netns exec snat-abfd6fde-a4b5-436e-9eb4-7da3ee926279 ip -o monitor address]. start /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/common/async_process.py:112
2020-02-18 08:21:27.464 3129726 DEBUG neutron.agent.linux.utils [-] Running command: ['ip', 'netns', 'exec', 'snat-abfd6fde-a4b5-436e-9eb4-7da3ee926279', 'ip', '-o', 'monitor', 'address'] create_process /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
2020-02-18 08:21:27.472 3129726 DEBUG neutron.agent.linux.utils [-] Found cmdline ['ip', 'netns', 'exec', 'snat-abfd6fde-a4b5-436e-9eb4-7da3ee926279', 'ip', '-o', 'monitor', 'address'] for process with PID 3129727. get_cmdline_from_pid /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/linux/utils.py:339
2020-02-18 08:21:28.473 3129726 DEBUG neutron.agent.linux.utils [-] Found cmdline ['ip', '-o', 'monitor', 'address'] for process with PID 3129727. get_cmdline_from_pid /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/linux/utils.py:339
Process runs with uid/gid: 42435/42435
Running privsep helper: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'privsep-helper', '--privsep_context', 'neutron.privileged.default', '--privsep_sock_path', '/tmp/tmpVR89Yr/privsep.sock']
Spawned new privsep daemon via rootwrap
Accepted privsep connection to /tmp/tmpVR89Yr/privsep.sock
privsep daemon starting
privsep process running with uid/gid: 0/0
privsep process running with capabilities (eff/prm/inh): CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_NET_ADMIN|CAP_SYS_ADMIN/CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_NET_ADMIN|CAP_SYS_ADMIN/none
privsep daemon running as pid 3129957
privsep log: /var/lib/kolla/venv/local/lib/python2.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
privsep log: """)
Initial status of router abfd6fde-a4b5-436e-9eb4-7da3ee926279 is backup
Wrote router abfd6fde-a4b5-436e-9eb4-7da3ee926279 state master
Notified agent router abfd6fde-a4b5...


Revision history for this message
LIU Yulong (dragon889) wrote :

So I guess maybe the VRRP heartbeats were dropped between the hosts.
Could you paste the default security group rules of these cluster hosts?
Or the port security or allowed address pair settings?
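
For reference, roughly what is being asked for could be collected like this (IDs are placeholders; the --device-owner filter assumes the HA ports use neutron's standard network:router_ha_interface owner):

openstack security group rule list <default-sg-id>
openstack port list --device-owner network:router_ha_interface
openstack port show <ha-port-id> -c port_security_enabled -c allowed_address_pairs -c security_group_ids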

Revision history for this message
Marek Grudzinski (ivve) wrote :

Hello Liu Yulong,

This issue concerns physical nodes acting as OpenStack controller nodes. They do not have any security group rules.
Besides, this happens roughly 50% of the time when a large stack is created with multiple instances using SNAT rather than floating IPs. Any firewall issue would result in consistent errors.
It seems that keepalived does not always respect the nopreempt option and releases master during setup of the three snat namespaces, even if it transitions to master first.
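
A sketch of how to confirm what keepalived was actually told, assuming the config path shown in the state-change log above:

# nopreempt, initial state and priority as rendered by neutron for this router
docker exec neutron_l3_agent grep -E 'nopreempt|state|priority' /var/lib/neutron/ha_confs/<router_id>/keepalived.conf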
