Docker L3 and VPN agent containers don't kill keepalived with a SIGTERM

Bug #1690252 reported by Gaëtan Trellu
This bug affects 4 people
Affects: kolla
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi guys,

When the Docker L3 or VPN agent containers are killed, dumb-init handles the first process correctly (/usr/local/bin/dumb-init /bin/bash /usr/local/bin/kolla_start), but it does nothing for the other processes.

For example, neutron-vpn-agent creates a few processes, such as keepalived, that become independent of the parent.

The keepalived process has to be killed with SIGTERM so that it cleans up the HA VIP, floating IPs, etc. in the kernel namespace.
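
To illustrate why the signal matters, here is a minimal sketch (not keepalived itself): a stand-in child process installs a SIGTERM handler that performs its "cleanup" by writing a marker file. The names `CHILD` and `cleanup_ran_after` are hypothetical and only serve this example. SIGTERM can be caught, so the cleanup runs; SIGKILL cannot be caught, so the process dies without cleaning anything.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Stand-in for a daemon such as keepalived: on SIGTERM it "cleans up"
# by writing a marker file, then exits. This is an illustration only.
CHILD = r"""
import pathlib, signal, sys, time
marker = pathlib.Path(sys.argv[1])
def on_term(signum, frame):
    marker.write_text("cleaned")
    sys.exit(0)
signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)   # tell the parent the handler is installed
while True:
    time.sleep(0.1)
"""


def cleanup_ran_after(sig):
    """Start the stand-in child, send it `sig`, report whether cleanup ran."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "marker")
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, marker],
            stdout=subprocess.PIPE,
            text=True,
        )
        proc.stdout.readline()      # block until the child prints "ready"
        proc.send_signal(sig)
        proc.wait(timeout=10)
        proc.stdout.close()
        return os.path.exists(marker)


if __name__ == "__main__":
    print("SIGTERM cleanup ran:", cleanup_ran_after(signal.SIGTERM))
    print("SIGKILL cleanup ran:", cleanup_ran_after(signal.SIGKILL))
```

On Linux the first call should report that the cleanup ran and the second that it did not, which mirrors what happens to the addresses in the namespace.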

Currently that is not the case. When we stop the neutron_vpnaas_agent container, all processes in the container are killed, so the two other controllers elect a new master. The problem is that the namespace on the controller where the container was stopped was active and is not cleaned up, which results in an active/active qrouter.

The only way we found to avoid this issue is to run "pkill keepalived" in the container before stopping the neutron_vpnaas_agent container. Not the best way to do it, but it works.

We are using Docker 17.03.1-ce.

Tags: kolla
Gaëtan Trellu (goldyfruit) wrote :

For now the "workaround" we found is to add a task to neutron/do_reconfigure.yml to make sure keepalived is killed properly before restarting the container.

- name: Pkill keepalived in neutron_vpnaas_agent/neutron_l3_agent
  command:
    docker exec -u root neutron_vpnaas_agent pkill keepalived
  when: inventory_hostname in groups['neutron-vpnaas-agent'] or
        inventory_hostname in groups['neutron-l3-agent']
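
The ordering behind this task can be sketched outside Ansible as well. The helper names `graceful_stop_commands` and `run_graceful_stop` are hypothetical, and the `docker stop` step is an assumption based on the bug description (the task above only covers the `pkill` step, and kolla-ansible performs the restart). `pkill` sends SIGTERM by default, so keepalived gets the chance to clean up before the container goes away.

```python
import subprocess


def graceful_stop_commands(container, process="keepalived"):
    """Return the two commands in the order they should run."""
    return [
        # pkill sends SIGTERM by default, letting the process clean up.
        ["docker", "exec", "-u", "root", container, "pkill", process],
        # Assumed follow-up step: stop the container itself.
        ["docker", "stop", container],
    ]


def run_graceful_stop(container, runner=subprocess.run):
    """Run the commands in order; `runner` is injectable for dry runs."""
    for argv in graceful_stop_commands(container):
        # check=False: pkill exits non-zero when no process matched,
        # and the container should still be stopped in that case.
        runner(argv, check=False)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    run_graceful_stop(
        "neutron_vpnaas_agent",
        runner=lambda argv, check: print(" ".join(argv)),
    )
```

Passing the runner as a parameter keeps the sketch usable as a dry run on a machine without Docker.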

Michal Nasiadka (mnasiadka) wrote :
Changed in kolla:
status: New → Fix Committed
no longer affects: kolla-ansible
Mark Goddard (mgoddard)
Changed in kolla:
status: Fix Committed → Fix Released
Piotr Misiak (pmisiak) wrote :

Hi,

I've tested the Stein release with the suggested fix (https://review.openstack.org/620238), but unfortunately the issue is still there.

All keepalived processes are killed with SIGKILL, so they leave the active routers configured (with their IP addresses still assigned). This leads to active/active routers, which is very painful in production.

When you restart the L3 agent, there is quite a long time window (while the nscleanup script is running, before the L3 agent process starts) during which the IP addresses are still configured, resulting in active/active routers. The length of this window depends on how many routers were spawned on a particular network node. In extreme cases the namespace clean-up takes hours.

I think this solution has to be redesigned.

affects: kolla → kolla-ansible
affects: kolla-ansible → kolla
Piotr Misiak (pmisiak) wrote :

I'd like to add that Gaëtan Trellu's workaround also works for me. When you manually kill all keepalived processes with SIGTERM, they remove the IP addresses before they quit.
