fullstack infrastructure tears down processes via kill -9

Bug #1487548 reported by Assaf Muller
This bug affects 1 person
Affects: neutron
Status: Won't Fix
Importance: Wishlist
Assigned to: Unassigned
Milestone: none listed

Bug Description

I can't imagine this has good implications. Distros typically kill neutron processes via kill -15, so this should definitely be doable here as well.

Tags: fullstack
Revision history for this message
Assaf Muller (amuller) wrote :

Initial research: I changed the signal to 15, and sometimes a process hangs. I've seen this happen once with the L3 agent and once with the OVS agent. It looks like the test teardown is not serial: a process's cleanUp is called, kill -15 is executed, and when we call wait() on the process, the teardown goes on to kill the next one. A process then doesn't manage to exit in time and we kill the rabbit server; at that point the process appears to be stuck in a loop trying to reconnect and never exits.

tags: added: fullstack
Assaf Muller (amuller) wrote :

To clarify my previous comment: We don't kill the rabbit server, but when a test tears down it deletes that test's rabbit vhost, user and password, so the stuck process loops with: http://paste.openstack.org/show/423556/, then this error: http://paste.openstack.org/show/423558/, and finally loops endlessly with this error (a credentials error, which makes perfect sense): http://paste.openstack.org/show/423561/. I'm just not sure why oslo messaging tries to reconnect endlessly and doesn't respect the SIGKILL signal.

Assaf Muller (amuller) wrote :

Sorry, SIGTERM.

Assaf Muller (amuller) wrote :

I think that my initial analysis was incorrect. It's not eventlet yielding on wait, killing rabbit, and causing the agent that refused to die to be stuck in a reconnection loop. Rather, wait() is simply timing out because, for some reason, both the OVS and L3 agents sometimes refuse to die. Fixtures then carries out the cleanUp of all of the other fixtures, including rabbit's, and then, looking at the agent's log, you can see the reconnect loop. The root cause, however, is that sometimes both the OVS and L3 agents just refuse to die when given signal 15. More investigation needed...

Assaf Muller (amuller) wrote :

I've found something very interesting in an instance where the OVS agent refused to die honorably:

2015-08-21 17:17:16.985 1110 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6a1e7391-2f47-44b2-8299-571b31a12df8 - - - - -] Agent rpc_loop - iteration:3 completed. Processed ports statistics: {'regular': {'removed': 1, 'updated': 0, 'added': 0}}. Elapsed:0.553 loop_count_and_wait /opt/openstack/neutron/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1496
2015-08-21 17:21:07.204 1110 INFO oslo_messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Socket closed # This is logged since rabbit was killed because the test code gave up on this agent and went on with further clean ups
2015-08-21 17:21:07.205 1110 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6a1e7391-2f47-44b2-8299-571b31a12df8 - - - - -] Agent caught SIGTERM, quitting daemon loop.

SIGTERM was sent to the agent during its 3rd loop.
Loop 3 finished at 17:17:16.985, with elapsed time being 0.553. It was then supposed to sleep for (2 seconds - 0.553), but instead it looks like it slept for around 3 minutes and 50 seconds, which is when the 'Agent caught SIGTERM' is logged, and the agent proceeds to quit immediately.
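
The gap described above can be checked directly from the two log timestamps. This is a minimal sketch: the variable names are illustrative, and the 2-second interval is taken from the comment above rather than read from the agent's configuration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the agent log lines quoted above.
loop_done = datetime.strptime("2015-08-21 17:17:16.985", FMT)
sigterm_logged = datetime.strptime("2015-08-21 17:21:07.205", FMT)

# Assumption: a 2-second loop interval, as stated in the comment.
polling_interval = 2.0
elapsed = 0.553

# How long the loop should have slept vs. how long it was actually silent.
expected_sleep = polling_interval - elapsed
actual_gap = (sigterm_logged - loop_done).total_seconds()

print(f"expected sleep: {expected_sleep:.3f}s")
print(f"actual gap:     {actual_gap:.3f}s")
```

The actual gap comes out at roughly 230 seconds, which matches the "around 3 minutes and 50 seconds" figure, against an expected sleep of under 2 seconds.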

IWAMOTO Toshihiro (iwamoto) wrote :

I made a patch against bug/1494363, which is probably the fix for this bug.

https://review.openstack.org/222556

IWAMOTO Toshihiro (iwamoto) wrote :

I've sent the above comment without reading the discussion much, sorry.

The patch first tries SIGTERM and then sends SIGKILL, so it won't make the situation worse but I'm not sure it is the fix for this problem.
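
The "SIGTERM first, SIGKILL as a fallback" approach can be sketched as follows. This is not the code from the review linked above: `stop_with_fallback` and its `timeout` parameter are hypothetical names, and the sketch assumes a POSIX system and a child started with `subprocess.Popen`.

```python
import signal
import subprocess
import sys


def stop_with_fallback(proc, timeout=5.0):
    """Ask `proc` to exit with SIGTERM; send SIGKILL if it does not exit in time.

    Hypothetical helper for illustration only. Returns the child's return
    code (on POSIX, a negative signal number if it was killed by a signal).
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The process did not honor SIGTERM within the timeout.
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("return code:", stop_with_fallback(child))
```

As noted in the comment above, a fallback like this does not explain why an agent ignores SIGTERM; it only bounds how long teardown waits for it.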

Jakub Libosvar (libosvar) wrote :

I don't think the patch above is the solution. Why would AsyncProcess behave that way? IMHO that is outside the scope of that class's capabilities, and which signal is sent to terminate the process should be decided from outside. That patch seems more like a workaround for the fact that we have a race when quitting services. I'd rather solve the races inside the agents and have an option to choose how we want to kill the process.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/224736

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/224736
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ecbc2e3ed36964ed8944b3a128cde6850e250dd5
Submitter: Jenkins
Branch: master

commit ecbc2e3ed36964ed8944b3a128cde6850e250dd5
Author: Jakub Libosvar <email address hidden>
Date: Thu Sep 17 13:26:05 2015 +0000

    Introduce kill_signal parameter to AsynProcess.stop()

    All stop() calls of instances of AsyncProcess class were sending
    hardcoded SIGKILL signal to its process. This patch leaves the default
    behavior to SIGKILL but offers any number to be sent to kill command.

    Note: Internal private methods also got a new parameter which is not
          appended. Given that those methods are private and thus not used
          outside of the class, we can afford it.

    Change-Id: Ib7b0273c134d59c6a50173d4c2eb35761fcd3d62
    Related-Bug: #1487548
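
The idea in this commit message, a stop method that defaults to SIGKILL but lets the caller pick the signal, can be sketched as below. `SimpleProcess` is a hypothetical stand-in written for illustration; it is not neutron's AsyncProcess, and the `timeout` parameter is an assumption of this sketch. POSIX only.

```python
import signal
import subprocess
import sys


class SimpleProcess:
    """Hypothetical wrapper around a child process (not neutron's AsyncProcess)."""

    def __init__(self, cmd):
        self._proc = subprocess.Popen(cmd)

    def stop(self, kill_signal=signal.SIGKILL, timeout=10.0):
        """Send `kill_signal` to the child and wait for it to exit.

        The default stays SIGKILL, so existing callers keep their behavior;
        a caller that wants a graceful shutdown passes signal.SIGTERM.
        """
        self._proc.send_signal(kill_signal)
        return self._proc.wait(timeout=timeout)


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(SimpleProcess(sleeper).stop())                            # default signal
    print(SimpleProcess(sleeper).stop(kill_signal=signal.SIGTERM))  # caller's choice
```

Keeping the choice of signal with the caller is the design point made earlier in the thread: the process wrapper stays simple, and each test fixture decides how its process should be terminated.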

Changed in neutron:
importance: Undecided → Low
Assaf Muller (amuller) wrote :
Assaf Muller (amuller) wrote :
Assaf Muller (amuller) wrote :
John Schwarz (jschwarz) wrote :

The patches Assaf posted are all linked to this issue: in one way or another, they solved issues with terminating processes gracefully using SIGTERM. Following https://review.openstack.org/#/c/278501/, we should probably give this bug report another go and see what breaks next.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/290277

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/290277
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ce7e26d41fce660367c741aff09368578c37b8b1
Submitter: Jenkins
Branch: master

commit ce7e26d41fce660367c741aff09368578c37b8b1
Author: IWAMOTO Toshihiro <email address hidden>
Date: Tue Oct 6 16:06:49 2015 +0900

    fullstack: use SIGTERM when stopping ovs agents

    ovs agents should be killed with SIGTERM, otherwise orphaned
    ovsdb-clients remain.
    Related-bug: 1487548

    Change-Id: Ibf840f2a50ff4078b6828cdc25e0ac61f98e1fd3

IWAMOTO Toshihiro (iwamoto) wrote :

neutron-server is now terminated with SIGTERM as of Jakub's commit a5a7b89.
Can we close this?

Assaf Muller (amuller) wrote :

The L3 and LB agents still use signal 9.

Hareesh (hareesh54)
Changed in neutron:
assignee: nobody → Hareesh (hareesh54)
Hareesh (hareesh54)
Changed in neutron:
assignee: Hareesh (hareesh54) → nobody
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/499803

Changed in neutron:
status: New → Confirmed
Changed in neutron:
assignee: nobody → Ihar Hrachyshka (ihar-hrachyshka)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: master
Review: https://review.openstack.org/499803

Changed in neutron:
importance: Low → Wishlist
Changed in neutron:
assignee: Ihar Hrachyshka (ihar-hrachyshka) → nobody
Lajos Katona (lajos-katona) wrote :

No activity for ~6 years on this bug report, so I'm closing it now.

Changed in neutron:
status: In Progress → Won't Fix