kolla

Docker L3 and VPN agent containers don't kill keepalived with a SIGTERM

Bug #1690252 reported by Gaëtan Trellu on 2017-05-11

This bug affects 4 people

Affects		Status	Importance	Assigned to	Milestone
	kolla	Fix Released	Undecided	Unassigned

Bug Description

Hi guys,

When Docker L3 or VPN agent containers are killed, dumb-init handles the first process perfectly (/usr/local/bin/dumb-init /bin/bash /usr/local/bin/kolla_start), but it does not take care of the other processes.

For example, neutron-vpn-agent creates a few processes, such as keepalived, that become independent of the parent.

The keepalived process has to be killed with SIGTERM so that it cleans up the HA VIP, floating IPs, etc. in the kernel namespace.
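To illustrate why the signal matters, here is a minimal sketch in Python. It does not use keepalived at all: a child process creates a marker file (a stand-in for the VIP) and removes it in its SIGTERM handler. The names CHILD and marker_survives are made up for this example. SIGKILL cannot be caught, so the handler never runs and the marker is left behind, which is analogous to the addresses staying configured in the namespace.

```python
import os
import signal
import subprocess
import sys
import tempfile
import time

# Child program: installs a SIGTERM handler, then creates the marker file.
# The marker is a stand-in for the VIP; this is not keepalived itself.
CHILD = r"""
import os, signal, sys, time
marker = sys.argv[1]
def cleanup(signum, frame):
    os.remove(marker)      # analogous to releasing the VIP on SIGTERM
    sys.exit(0)
signal.signal(signal.SIGTERM, cleanup)
open(marker, "w").close()  # created only after the handler is installed
while True:
    time.sleep(0.1)
"""


def marker_survives(sig):
    """Start the child, send it `sig`, and report whether the marker is left behind."""
    marker = os.path.join(tempfile.mkdtemp(), "vip")
    proc = subprocess.Popen([sys.executable, "-c", CHILD, marker])
    # Wait until the child has created the marker, so its handler is in place.
    deadline = time.time() + 10
    while not os.path.exists(marker) and time.time() < deadline:
        time.sleep(0.05)
    proc.send_signal(sig)
    proc.wait()
    return os.path.exists(marker)


if __name__ == "__main__":
    print("marker left after SIGTERM:", marker_survives(signal.SIGTERM))
    print("marker left after SIGKILL:", marker_survives(signal.SIGKILL))
```

With SIGTERM the child gets a chance to clean up; with SIGKILL it does not, whatever handler it has installed.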

Currently that is not the case. When we stop the neutron_vpnaas_agent container, all processes in the container are killed, so the two other controllers elect a new master. The problem is that the namespace on the controller where the container was stopped was active and stays configured, which results in an active/active qrouter.

The only way to avoid this issue is to run "pkill keepalived" in the container before stopping the neutron_vpnaas_agent container. It is not the best way to do it, but it works.

We are using Docker 17.03.1-ce.

Revision history for this message

Gaëtan Trellu (goldyfruit) wrote on 2017-06-07:

For now, the "workaround" we found is to add a task to neutron/do_reconfigure.yml to make sure keepalived is killed properly before restarting the container.

- name: Pkill keepalived in neutron_vpnaas_agent/neutron_l3_agent
  command:
    docker exec -u root neutron_vpnaas_agent pkill keepalived
  when: inventory_hostname in groups['neutron-vpnaas-agent'] or
        inventory_hostname in groups['neutron-l3-agent']
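For reference, the same sequence can be sketched outside Ansible. This is only an illustration under assumptions: the function names workaround_commands and run_workaround are hypothetical, and the explicit "docker stop" step is my reading of "before stopping/restarting the container", not something taken from the task above. pkill sends SIGTERM by default, which is what gives keepalived the chance to clean up.

```python
import subprocess


def workaround_commands(container):
    """Return the commands that terminate keepalived gracefully, then stop the container."""
    return [
        # pkill sends SIGTERM by default, so keepalived can release its addresses.
        ["docker", "exec", "-u", "root", container, "pkill", "keepalived"],
        # Assumed follow-up step: stop the container once keepalived is gone.
        ["docker", "stop", container],
    ]


def run_workaround(container, dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    for cmd in workaround_commands(container):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=False: pkill exits non-zero when no keepalived process
            # matched, which should not abort the sequence.
            result = subprocess.run(cmd, check=False)
            print(" ".join(cmd), "->", result.returncode)


if __name__ == "__main__":
    # Dry run for both containers named in the task above.
    for name in ("neutron_vpnaas_agent", "neutron_l3_agent"):
        run_workaround(name)
```

The dry run is the default so that the commands can be reviewed before anything is executed against a real network node.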

Michal Nasiadka (mnasiadka) wrote on 2018-12-13:

Should be fixed with: https://review.openstack.org/620238

Changed in kolla:
status:	New → Fix Committed
no longer affects:	kolla-ansible

Mark Goddard (mgoddard) on 2019-11-06

Changed in kolla:
status:	Fix Committed → Fix Released

Piotr Misiak (pmisiak) wrote on 2020-08-11:

Hi,

I've tested the Stein release with the suggested fix (https://review.openstack.org/620238), but unfortunately the issue is still there.

All keepalived processes are killed with SIGKILL, and they leave the active routers configured (with their IP addresses still in place). This leads to active/active routers, which is very painful in production.

When you restart the L3 agent, there is a fairly long time window (while the nscleanup script runs, before the L3 agent process starts) during which the IP addresses are still configured, resulting in active/active routers. Its length depends on how many routers were spawned on that particular network node. In extreme cases the namespace clean-up takes hours.

I think this solution has to be redesigned.

affects:	kolla → kolla-ansible
affects:	kolla-ansible → kolla

Piotr Misiak (pmisiak) wrote on 2020-08-11:

I'd like to add that Gaëtan Trellu's workaround also works for me. When you manually kill all keepalived processes with SIGTERM, they remove the IP addresses before they quit.
