L3 agent failed to respawn keepalived process

Bug #1511311 reported by Lan Qi song
This bug affects 1 person
Affects   Status          Importance   Assigned to     Milestone
neutron   Fix Released    Undecided    Hong Hui Xiao
Kilo      Fix Committed   Undecided    Unassigned

Bug Description

I enabled L3 HA in the neutron configuration, and I often see the following log in l3_agent.log:

2015-10-14 22:30:16.397 21460 ERROR neutron.agent.linux.external_process [-] default-service for router with uuid 59de181e-8f02-470d-80f6-cb9f0d46f78b not found. The process should not have died
2015-10-14 22:30:16.397 21460 ERROR neutron.agent.linux.external_process [-] respawning keepalived for uuid 59de181e-8f02-470d-80f6-cb9f0d46f78b
2015-10-14 22:30:16.397 21460 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/ha_confs/59de181e-8f02-470d-80f6-cb9f0d46f78b.pid get_value_from_file /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:222
2015-10-14 22:30:16.398 21460 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-59de181e-8f02-470d-80f6-cb9f0d46f78b', 'keepalived', '-P', '-f', '/var/lib/neutron/ha_confs/59de181e-8f02-470d-80f6-cb9f0d46f78b/keepalived.conf', '-p', '/var/lib/neutron/ha_confs/59de181e-8f02-470d-80f6-cb9f0d46f78b.pid', '-r', '/var/lib/neutron/ha_confs/59de181e-8f02-470d-80f6-cb9f0d46f78b.pid-vrrp'] create_process /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:84

I also noticed that there were usually more vrrp pid files than "pid" files:

root@neutron2:~# ls /var/lib/neutron/ha_confs/ | grep pid | grep -v vrrp | wc -l
664
root@neutron2:~# ls /var/lib/neutron/ha_confs/ | grep vrrp | wc -l
677
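
A minimal Python sketch of the same comparison, listing which routers have a vrrp pid file without a matching pid file. The function name and the demo file names are hypothetical; only the `<id>.pid` / `<id>.pid-vrrp` naming is taken from the paths in the log above.

```python
import os
import tempfile


def find_unpaired_vrrp_pidfiles(conf_dir):
    """Return router IDs that have '<id>.pid-vrrp' but no '<id>.pid'."""
    names = set(os.listdir(conf_dir))
    suffix = ".pid-vrrp"
    unpaired = []
    for name in sorted(names):
        if name.endswith(suffix):
            router_id = name[:-len(suffix)]
            if router_id + ".pid" not in names:
                unpaired.append(router_id)
    return unpaired


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a live agent host.
    with tempfile.TemporaryDirectory() as demo_dir:
        for name in ("r1.pid", "r1.pid-vrrp", "r2.pid-vrrp"):
            open(os.path.join(demo_dir, name), "w").close()
        print(find_unpaired_vrrp_pidfiles(demo_dir))  # prints ['r2']
```

On an agent host the function would be pointed at /var/lib/neutron/ha_confs/ instead of the temporary directory.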

It seems that if the ".pid-vrrp" file exists, we can't successfully respawn the keepalived process with this kind of command:
keepalived -P -f /var/lib/neutron/ha_confs/cb01b1de-fa6c-461e-ba39-4d506dfdfccb/keepalived.conf -p /var/lib/neutron/ha_confs/cb01b1de-fa6c-461e-ba39-4d506dfdfccb.pid -r /var/lib/neutron/ha_confs/cb01b1de-fa6c-461e-ba39-4d506dfdfccb.pid-vrrp

So in neutron, after we have checked that the pid is not active, could we check for the existence of the "pid" file and the "vrrp pid" file and remove them before respawning the keepalived process, to make sure the process can be started successfully?

https://github.com/openstack/neutron/blob/master/neutron/agent/linux/external_process.py#L91-L92
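
The check proposed above could look roughly like the following sketch. It is not neutron code: `read_pid`, `pid_alive` and `remove_stale_pidfiles` are hypothetical helpers, and the sketch assumes each pid file holds a single decimal PID. It only removes a file when the recorded process no longer exists.

```python
import os


def read_pid(path):
    """Return the integer PID stored in path, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def pid_alive(pid):
    """Probe a PID with signal 0, which tests existence without signalling."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def remove_stale_pidfiles(pid_path, vrrp_pid_path):
    """Delete each pid file whose recorded process is gone.

    Returns the list of paths that were removed.
    """
    removed = []
    for path in (pid_path, vrrp_pid_path):
        if os.path.exists(path) and not pid_alive(read_pid(path)):
            os.remove(path)
            removed.append(path)
    return removed
```

A pid file that still points at a live process is left alone, so this sketch alone does not help if the vrrp child is still running.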

Hong Hui Xiao (xiaohhui)
Changed in neutron:
assignee: nobody → Hong Hui Xiao (xiaohhui)
Hong Hui Xiao (xiaohhui) wrote :

I can't reproduce this bug by removing the .pid file and keeping the .pid-vrrp file. When I restart neutron-l3-agent, the keepalived process is respawned.
But looking into the keepalived code [1-3], the cause may be that the vrrp process is still alive while the main keepalived process is dead. In that case neutron can't detect a running keepalived process, and it can't respawn keepalived either.
[1] https://github.com/acassen/keepalived/blob/03da0d2d0393808bbb2feac7abc07aaf8d647855/keepalived/core/main.c#L236
[2] https://github.com/acassen/keepalived/blob/03da0d2d0393808bbb2feac7abc07aaf8d647855/keepalived/core/main.c#L291
[3] https://github.com/acassen/keepalived/blob/03da0d2d0393808bbb2feac7abc07aaf8d647855/keepalived/core/pidfile.c#L92

Hong Hui Xiao (xiaohhui) wrote :

The bug can be reproduced this way:
1) Create an HA router.
2) Run kill -9 on the main keepalived process on one l3-agent.
3) The vrrp process becomes an orphan.
4) The l3-agent detects that keepalived is not running, but can't respawn it, as comment #1 said.
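
The orphaned state produced by these steps (main keepalived process gone, vrrp child still alive) can be detected and cleared with a sketch like the one below. This is an illustration under assumptions, not the merged patch: `live_pid` and `kill_orphan_vrrp` are hypothetical names, and the pid files are assumed to hold a single decimal PID.

```python
import os
import signal


def live_pid(path):
    """Return the PID recorded in path if that process exists, else None."""
    pid = None
    try:
        with open(path) as f:
            pid = int(f.read().strip())
        if pid <= 0:
            return None  # never signal a process group by accident
        os.kill(pid, 0)  # signal 0 only tests for existence
    except PermissionError:
        return pid  # exists but is owned by another user (None if open failed)
    except (OSError, ValueError):
        return None  # missing file, garbage content, or no such process
    return pid


def kill_orphan_vrrp(pid_path, vrrp_pid_path):
    """Kill the vrrp child if it outlived the main keepalived process.

    Returns the PID that was killed, or None if nothing was orphaned.
    """
    if live_pid(pid_path) is not None:
        return None  # parent is still running, so the child is not an orphan
    vrrp_pid = live_pid(vrrp_pid_path)
    if vrrp_pid is None:
        return None
    os.kill(vrrp_pid, signal.SIGKILL)
    return vrrp_pid
```

A production version would also need to confirm that the PID really belongs to keepalived (for example by inspecting its command line) before killing it, since PIDs can be reused, and it would need sufficient privileges to signal the process.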

Hong Hui Xiao (xiaohhui) wrote :
tags: added: l3-ha
Hong Hui Xiao (xiaohhui) wrote :

The functional test fails when I run it locally:

tox -e dsvm-functional neutron.tests.functional.agent.linux.test_keepalived

The failure log is:
neutron.tests.functional.agent.linux.test_keepalived.KeepalivedManagerTestCase.test_keepalived_respawns
-------------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "neutron/tests/functional/agent/linux/test_keepalived.py", line 73, in test_keepalived_respawns
        exception=RuntimeError(_("Keepalived didn't respawn")))
      File "neutron/agent/linux/utils.py", line 339, in wait_until_true
        eventlet.sleep(sleep)
      File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 34, in sleep
        hub.switch()
      File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
        return self.greenlet.switch()
    RuntimeError: Keepalived didn't respawn

Captured pythonlogging:
~~~~~~~~~~~~~~~~~~~~~~~
    2015-11-03 20:58:13,321 WARNING [oslo_config.cfg] Option "verbose" from group "DEFAULT" is deprecated for removal. Its value may be silently ignored in the future.
    2015-11-03 20:58:14,357 ERROR [neutron.agent.linux.external_process] default-service for router with uuid router1 not found. The process should not have died
    2015-11-03 20:58:14,358 ERROR [neutron.agent.linux.external_process] Respawning keepalived for uuid router1
    2015-11-03 20:58:15,357 ERROR [neutron.agent.linux.external_process] default-service for router with uuid router1 not found. The process should not have died
    2015-11-03 20:58:15,358 ERROR [neutron.agent.linux.external_process] Respawning keepalived for uuid router1
    2015-11-03 20:58:16,358 ERROR [neutron.agent.linux.external_process] default-service for router with uuid router1 not found. The process should not have died
    2015-11-03 20:58:16,358 ERROR [neutron.agent.linux.external_process] Respawning keepalived for uuid router1
    2015-11-03 20:58:17,358 ERROR [neutron.agent.linux.external_process] default-service for router with uuid router1 not found. The process should not have died
    2015-11-03 20:58:17,358 ERROR [neutron.agent.linux.external_process] Respawning keepalived for uuid router1
    2015-11-03 20:58:18,358 ERROR [neutron.agent.linux.external_process] default-service for router with uuid router1 not found. The process should not have died
    2015-11-03 20:58:18,359 ERROR [neutron.agent.linux.external_process] Respawning keepalived for uuid router1

So the test failure is consistent with the bug: the agent keeps reporting a respawn, but keepalived never comes back.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/241517

Changed in neutron:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/251693

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/251693
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=34822ba31a62ee6f0f5b532a2c435f3c9b684605
Submitter: Jenkins
Branch: master

commit 34822ba31a62ee6f0f5b532a2c435f3c9b684605
Author: Arie Bregman <email address hidden>
Date: Tue Dec 1 09:47:55 2015 +0200

    Skip keepalived_respawns test

    keepalived fails to respawn after crash for > 1.2.11 version.

    When keepalived starts, it spawns vrrp thread to monitor vrrp forked
    process. It also creates a vrrp pid file. When the process is killed, and
    it's restarted, the new keepalived process runs with -P, so
    when we validate whether we are already running, we check vrrp pid file.
    Since we never clean up the file before starting the process, and the process
    dies without a chance to clean up the file as part of its signal
    handler, respawn never works.

    keepalived_respawns test should be skipped until bug is resolved.
    See also: https://bugzilla.redhat.com/show_bug.cgi?id=1286729

    Change-Id: Ic111573e0cd5ad5bfe70b0f38ec0203c10d52e34
    Related-Bug: #1511311

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/liberty)

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/258117

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/258117
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a0747e6bc3c3d19ecc9284e60834f11f7d2b5768
Submitter: Jenkins
Branch: stable/liberty

commit a0747e6bc3c3d19ecc9284e60834f11f7d2b5768
Author: Arie Bregman <email address hidden>
Date: Tue Dec 1 09:47:55 2015 +0200

    Skip keepalived_respawns test

    keepalived fails to respawn after crash for > 1.2.11 version.

    When keepalived starts, it spawns vrrp thread to monitor vrrp forked
    process. It also creates a vrrp pid file. When the process is killed, and
    it's restarted, the new keepalived process runs with -P, so
    when we validate whether we are already running, we check vrrp pid file.
    Since we never clean up the file before starting the process, and the process
    dies without a chance to clean up the file as part of its signal
    handler, respawn never works.

    keepalived_respawns test should be skipped until bug is resolved.
    See also: https://bugzilla.redhat.com/show_bug.cgi?id=1286729

    Change-Id: Ic111573e0cd5ad5bfe70b0f38ec0203c10d52e34
    Related-Bug: #1511311
    (cherry picked from commit 34822ba31a62ee6f0f5b532a2c435f3c9b684605)

tags: added: in-stable-liberty
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/259061

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/259070

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/259075

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/241517
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2d1b53bcfa6c4d6fa5bca2ba4da9aaca66245a5b
Submitter: Jenkins
Branch: master

commit 2d1b53bcfa6c4d6fa5bca2ba4da9aaca66245a5b
Author: Hong Hui Xiao <email address hidden>
Date: Wed Nov 4 01:44:43 2015 -0500

    Kill the vrrp orphan process when (re)spawn keepalived

    When keepalived crashed unexpectedly, the vrrp process that
    it associates with will be orphan process. This will make
    the VIP unable to migrate to the router in the same host.
    Also, neutron code is not able to respawn the keepalived
    process, because keepalived thinks itself is still running,
    according to [1-3]. As a result, neutron will report respawning
    keepalived all the time. Restart l3-agent will not help.

    This patch will check and delete the orphan vrrp process
    if there is any, in the processmonitor of l3 agent.

    More details can be found in the bug description and comments.

    [1] https://goo.gl/W3GL9I
    [2] https://goo.gl/F0Ixfb
    [3] https://goo.gl/dUqhTo

    Change-Id: Ia1759ed1365b845d404686a8cd25f882cce35caf
    Closes-Bug: #1511311

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/259070
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4344bce278005a3ebd72c612ec3b859344e0f181
Submitter: Jenkins
Branch: master

commit 4344bce278005a3ebd72c612ec3b859344e0f181
Author: Hong Hui Xiao <email address hidden>
Date: Thu Dec 17 10:05:50 2015 -0500

    Clean up code for bug1511311

    Due to the code change for the fix, the callback is no longer needed.

    Change-Id: Id603add6bdf98d848fb4afe4dd117552992f9ed1
    Related-Bug: #1511311

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/259061
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3a70a5b236514711d7ff924f657e41fa7d70d30a
Submitter: Jenkins
Branch: stable/liberty

commit 3a70a5b236514711d7ff924f657e41fa7d70d30a
Author: Hong Hui Xiao <email address hidden>
Date: Wed Nov 4 01:44:43 2015 -0500

    Kill the vrrp orphan process when (re)spawn keepalived

    When keepalived crashed unexpectedly, the vrrp process that
    it associates with will be orphan process. This will make
    the VIP unable to migrate to the router in the same host.
    Also, neutron code is not able to respawn the keepalived
    process, because keepalived thinks itself is still running,
    according to [1-3]. As a result, neutron will report respawning
    keepalived all the time. Restart l3-agent will not help.

    This patch will check and delete the orphan vrrp process
    if there is any, in the processmonitor of l3 agent.

    More details can be found in the bug description and comments.

    [1] https://goo.gl/W3GL9I
    [2] https://goo.gl/F0Ixfb
    [3] https://goo.gl/dUqhTo

    Change-Id: Ia1759ed1365b845d404686a8cd25f882cce35caf
    Closes-Bug: #1511311
    (cherry picked from commit 2d1b53bcfa6c4d6fa5bca2ba4da9aaca66245a5b)

Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b2

This issue was fixed in the openstack/neutron 8.0.0.0b2 development milestone.

Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 7.0.2

This issue was fixed in the openstack/neutron 7.0.2 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/kilo)

Reviewed: https://review.openstack.org/259075
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=49ac0c477cf74f225003b3f2734a156fe8bc445f
Submitter: Jenkins
Branch: stable/kilo

commit 49ac0c477cf74f225003b3f2734a156fe8bc445f
Author: Hong Hui Xiao <email address hidden>
Date: Wed Nov 4 01:44:43 2015 -0500

    Kill the vrrp orphan process when (re)spawn keepalived

    When keepalived crashed unexpectedly, the vrrp process that
    it associates with will be orphan process. This will make
    the VIP unable to migrate to the router in the same host.
    Also, neutron code is not able to respawn the keepalived
    process, because keepalived thinks itself is still running,
    according to [1-3]. As a result, neutron will report respawning
    keepalived all the time. Restart l3-agent will not help.

    This patch will check and delete the orphan vrrp process
    if there is any, in the processmonitor of l3 agent.

    More details can be found in the bug description and comments.

    [1] https://goo.gl/W3GL9I
    [2] https://goo.gl/F0Ixfb
    [3] https://goo.gl/dUqhTo

    Change-Id: Ia1759ed1365b845d404686a8cd25f882cce35caf
    Closes-Bug: #1511311

tags: added: in-stable-kilo
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 2015.1.4

This issue was fixed in the openstack/neutron 2015.1.4 release.

OpenStack Infra (hudson-openstack) wrote :

This issue was fixed in the openstack/neutron 2015.1.4 release.
