neutron

fullstack infrastructure tears down processes via kill -9

Bug #1487548 reported by Assaf Muller on 2015-08-21

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	neutron	Won't Fix	Wishlist	Unassigned

Bug Description

I can't imagine this has good implications. Distros typically kill neutron processes via kill -15, so this should definitely be doable here as well.

Tags:

Assaf Muller (amuller) wrote on 2015-08-21:

Initial research: I changed the signal to 15, and sometimes a process hangs. I've seen this happen once with the L3 agent and once with the OVS agent. It looks like the test teardown is not serial: a process's cleanUp is called, kill -15 is executed, and when we call wait() on the process, teardown goes on to kill the next one. A process then doesn't manage to exit in time and we kill the rabbit server; at that point it looks like the process is stuck in a loop trying to reconnect and never exits.

tags:

added: fullstack

Assaf Muller (amuller) wrote on 2015-08-21:

To clarify my previous comment: we don't kill the rabbit server, but when a test tears down it deletes that test's rabbit vhost, user and password. The stuck process then loops with: http://paste.openstack.org/show/423556/, then hits this error: http://paste.openstack.org/show/423558/, and finally loops endlessly with this error (a credentials error, which makes perfect sense): http://paste.openstack.org/show/423561/. I'm just not sure why oslo messaging tries to reconnect endlessly and doesn't respect the SIGKILL signal.

Assaf Muller (amuller) wrote on 2015-08-21:

Sorry, SIGTERM.

Assaf Muller (amuller) wrote on 2015-08-21:

I think that my initial analysis was incorrect. It's not that eventlet yields on wait, rabbit gets killed, and the agent that refused to die ends up stuck in a reconnection loop. Rather, wait() is simply timing out because, for some reason, both the OVS and L3 agents sometimes refuse to die. Fixtures then carries out the cleanUp of all of the other fixtures, including rabbit's, and then, looking at the agent's log, you can see the reconnect loop. The root cause, however, is that sometimes the OVS and L3 agents just refuse to die when given signal 15. More investigation needed...

Assaf Muller (amuller) wrote on 2015-08-21:

I've found something very interesting in an instance where the OVS agent refused to die honorably:

2015-08-21 17:17:16.985 1110 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6a1e7391-2f47-44b2-8299-571b31a12df8 - - - - -] Agent rpc_loop - iteration:3 completed. Processed ports statistics: {'regular': {'removed': 1, 'updated': 0, 'added': 0}}. Elapsed:0.553 loop_count_and_wait /opt/openstack/neutron/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1496
2015-08-21 17:21:07.204 1110 INFO oslo_messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Socket closed # This is logged since rabbit was killed because the test code gave up on this agent and went on with further clean ups
2015-08-21 17:21:07.205 1110 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6a1e7391-2f47-44b2-8299-571b31a12df8 - - - - -] Agent caught SIGTERM, quitting daemon loop.

SIGTERM was sent to the agent during its 3rd loop.
Loop 3 finished at 17:17:16.985, with an elapsed time of 0.553 seconds. It was then supposed to sleep for (2 seconds - 0.553), but instead it looks like it slept for around 3 minutes and 50 seconds, which is when 'Agent caught SIGTERM' is logged, and the agent then proceeds to quit immediately.
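
The timing gap can be checked directly from the two log timestamps. This is only a small sketch of the arithmetic: the 2-second interval is the value mentioned in the comment above, and the names (POLLING_INTERVAL, seconds_between) are made up for illustration, not taken from the agent code.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
POLLING_INTERVAL = 2.0  # seconds; assumed from the "2 seconds" in the comment


def seconds_between(start, end):
    """Return the number of seconds between two log timestamps."""
    start_dt = datetime.strptime(start, LOG_TIME_FORMAT)
    end_dt = datetime.strptime(end, LOG_TIME_FORMAT)
    return (end_dt - start_dt).total_seconds()


# Timestamps copied from the log excerpt above.
loop_end = "2015-08-21 17:17:16.985"
sigterm_logged = "2015-08-21 17:21:07.205"
loop_elapsed = 0.553

# How long the loop should have slept before its next iteration.
expected_sleep = POLLING_INTERVAL - loop_elapsed
# How long the agent was actually silent in the log.
actual_gap = seconds_between(loop_end, sigterm_logged)

print("expected sleep: %.3f s" % expected_sleep)
print("actual gap:     %.2f s" % actual_gap)
```

So the agent was expected to wake up after roughly a second and a half, but the log shows nothing for close to four minutes.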

IWAMOTO Toshihiro (iwamoto) wrote on 2015-09-14:

I made a patch against bug/1494363, which is probably the fix for this bug.

https://review.openstack.org/222556

IWAMOTO Toshihiro (iwamoto) wrote on 2015-09-14:

I sent the above comment without reading the discussion closely, sorry.

The patch first tries SIGTERM and then sends SIGKILL, so it won't make the situation worse but I'm not sure it is the fix for this problem.
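
For readers who have not looked at the review, the general "SIGTERM first, SIGKILL as a fallback" idea can be sketched with the standard library. This is not the code from the patch: stop_process and its timeout parameter are hypothetical names, and it uses plain subprocess rather than neutron's process helpers.

```python
import subprocess
import sys


def stop_process(proc, timeout=5.0):
    """Ask proc to exit with SIGTERM; fall back to SIGKILL on timeout.

    Returns the exit code of the process. On POSIX this is the negative
    signal number when the process was terminated by a signal.
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The process did not exit in time, so force it.
        proc.kill()  # sends SIGKILL on POSIX
        return proc.wait()


if __name__ == "__main__":
    # A child that just sleeps; it has no SIGTERM handler, so it exits
    # as soon as the signal is delivered.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", stop_process(child))
```

A process that hangs on SIGTERM, like the agents described earlier in this report, would only hit the SIGKILL branch after the timeout, which is why this kind of fallback hides the hang rather than explaining it.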

Jakub Libosvar (libosvar) wrote on 2015-09-15:

I don't think the patch above is the solution. Why would AsyncProcess behave that way? IMHO that is out of scope for that class, and which signal is sent to terminate the process should be decided from outside. The patch seems more like a workaround for the fact that we have a race when quitting services. I'd rather solve the races inside the agents and have an option for how we want to kill the process.

OpenStack Infra (hudson-openstack) wrote on 2015-09-17: Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/224736

OpenStack Infra (hudson-openstack) wrote on 2015-09-18: Related fix merged to neutron (master)

#10

Reviewed: https://review.openstack.org/224736
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ecbc2e3ed36964ed8944b3a128cde6850e250dd5
Submitter: Jenkins
Branch: master

commit ecbc2e3ed36964ed8944b3a128cde6850e250dd5
Author: Jakub Libosvar <email address hidden>
Date: Thu Sep 17 13:26:05 2015 +0000

Introduce kill_signal parameter to AsynProcess.stop()

    All stop() calls of instances of AsyncProcess class were sending
    hardcoded SIGKILL signal to its process. This patch leaves the default
    behavior to SIGKILL but offers any number to be sent to kill command.

    Note: Internal private methods also got a new parameter which is not
          appended. Given that those methods are private and thus not used
          outside of the class, we can afford it.

Change-Id: Ib7b0273c134d59c6a50173d4c2eb35761fcd3d62
Related-Bug: #1487548
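
As a rough illustration of the design described in the commit message (the caller picks the signal, SIGKILL stays the default), here is a minimal sketch. ManagedProcess is a hypothetical stand-in written against subprocess; it is not neutron's AsyncProcess, and the timeout argument is an addition for the example. It assumes a POSIX system, since signal.SIGKILL is not available elsewhere.

```python
import signal
import subprocess
import sys


class ManagedProcess:
    """Hypothetical process wrapper; not neutron's AsyncProcess."""

    def __init__(self, cmd):
        self._cmd = cmd
        self._process = None

    def start(self):
        self._process = subprocess.Popen(self._cmd)

    def stop(self, kill_signal=signal.SIGKILL, timeout=None):
        """Send kill_signal to the child and wait for it to exit.

        The default mirrors the old hardcoded behavior (SIGKILL);
        callers that want a graceful shutdown pass signal.SIGTERM.
        """
        if self._process is None:
            raise RuntimeError("process was never started")
        self._process.send_signal(kill_signal)
        return self._process.wait(timeout=timeout)


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    proc = ManagedProcess(sleeper)
    proc.start()
    # The test code, not the wrapper, decides how the process dies.
    print("exit code:", proc.stop(kill_signal=signal.SIGTERM, timeout=10))
```

Keeping the signal as a parameter matches the earlier objection in this thread: the wrapper stays simple, and each caller chooses the signal for its own process.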

Armando Migliaccio (armando-migliaccio) on 2015-10-05

Changed in neutron:
importance:	Undecided → Low

Assaf Muller (amuller) wrote on 2016-02-16:

#11

Related patch: https://review.openstack.org/#/c/257204

Assaf Muller (amuller) wrote on 2016-02-16:

#12

Related patch: https://review.openstack.org/#/c/234770/

Assaf Muller (amuller) wrote on 2016-02-16:

#13

Related patch: https://review.openstack.org/#/c/278501/

John Schwarz (jschwarz) wrote on 2016-02-17:

#14

The patches Assaf posted are all linked to this issue: each of them solved, in one way or another, issues with terminating processes gracefully using SIGTERM. Following https://review.openstack.org/#/c/278501/, we should probably give this bug report another go and see what breaks next.

OpenStack Infra (hudson-openstack) wrote on 2016-03-09: Related fix proposed to neutron (master)

#15

Related fix proposed to branch: master
Review: https://review.openstack.org/290277

OpenStack Infra (hudson-openstack) wrote on 2016-03-18: Related fix merged to neutron (master)

#16

Reviewed: https://review.openstack.org/290277
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ce7e26d41fce660367c741aff09368578c37b8b1
Submitter: Jenkins
Branch: master

commit ce7e26d41fce660367c741aff09368578c37b8b1
Author: IWAMOTO Toshihiro <email address hidden>
Date: Tue Oct 6 16:06:49 2015 +0900

fullstack: use SIGTERM when stopping ovs agents

    ovs agents should be killed with SIGTERM, otherwise orphaned
    ovsdb-clients remain.
    Related-bug: 1487548

Change-Id: Ibf840f2a50ff4078b6828cdc25e0ac61f98e1fd3

IWAMOTO Toshihiro (iwamoto) wrote on 2016-07-06:

#17

neutron-server is now terminated with SIGTERM as of Jakub's commit a5a7b89.
Can we close this?

Assaf Muller (amuller) wrote on 2016-07-06:

#18

The L3 and LB agents still use signal 9.

Hareesh (hareesh54) on 2016-07-14

Changed in neutron:
assignee:	nobody → Hareesh (hareesh54)

Hareesh (hareesh54) on 2016-07-19

Changed in neutron:
assignee:	Hareesh (hareesh54) → nobody

OpenStack Infra (hudson-openstack) wrote on 2017-08-31: Related fix proposed to neutron (master)

#19

Related fix proposed to branch: master
Review: https://review.openstack.org/499803

Ihar Hrachyshka (ihar-hrachyshka) on 2017-08-31

Changed in neutron:
status:	New → Confirmed

OpenStack Infra (hudson-openstack) on 2017-09-11

Changed in neutron:
assignee:	nobody → Ihar Hrachyshka (ihar-hrachyshka)
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2018-01-05: Change abandoned on neutron (master)

#20

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: master
Review: https://review.openstack.org/499803

Ihar Hrachyshka (ihar-hrachyshka) on 2018-03-20

Changed in neutron:
importance:	Low → Wishlist

Ihar Hrachyshka (ihar-hrachyshka) on 2018-04-05

Changed in neutron:
assignee:	Ihar Hrachyshka (ihar-hrachyshka) → nobody

Lajos Katona (lajos-katona) wrote on 2022-11-08:

#21

No activity for ~6 years on this bug report, so I am closing it now.

Changed in neutron:
status:	In Progress → Won't Fix
