Sending SIGHUP to neutron-server process causes it to hang

Bug #1780139 reported by Jan Gutter
This bug affects 3 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Bernard Cafarelli
tripleo
Won't Fix
Critical
Emilien Macchi

Bug Description

* High level description

When sending SIGHUP to the neutron-server process in a neutron_api container, it looks like the main process locks up in a tight loop. strace output shows that it's waiting for a process that doesn't exist:
wait4(0, 0x7ffe97e025a4, WNOHANG, NULL) = -1 ECHILD (No child processes)
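
For illustration only (a minimal sketch, not the actual oslo.service code), a non-blocking child-reaping loop of roughly this shape produces exactly that strace pattern once the last child is gone:

import errno
import os

def reap_children_forever():
    # Sketch of the observed failure mode: keep polling for exited children
    # with WNOHANG; once no children are left, every waitpid() call fails
    # with ECHILD and the loop spins at 100% CPU instead of stopping.
    while True:
        try:
            pid, status = os.waitpid(0, os.WNOHANG)  # strace: wait4(0, ..., WNOHANG)
        except OSError as err:
            if err.errno != errno.ECHILD:
                raise
            continue  # no children exist at all, yet we never break out
        if pid == 0:
            continue  # children exist but none have exited yet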

This is problematic because logrotate uses SIGHUP in the containerized environment. It doesn't always happen: it might take one or two signals, reasonably interspersed, to trigger this.

* Pre-conditions:

I'm using CentOS 7 + Queens RDO
"rdo_version": "c9fd24040454913b4a325741094285676fb7e7bc_a0a28280"

I first noticed the issue when the neutron_api container would stop working on the control nodes; eventually it was traced back to the logrotate_crond container sending SIGHUP to all the processes owning log files in /var/log/containers. This doesn't happen every time, but it's pretty easy to trigger on my system.

* Step-by-step reproduction steps:

# Start with a clean container

docker restart neutron_api

# Identify the neutron-server PID (613782 in this case) and send SIGHUP

kill -HUP 613782

# the relevant log files generally look clean the first time:

2018-07-04 16:50:34.730 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
2018-07-04 16:50:34.739 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:50:34.740 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:50:34.761 33 INFO neutron.wsgi [-] (33) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 27 INFO neutron.wsgi [-] (27) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 28 INFO neutron.wsgi [-] (28) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 30 INFO neutron.wsgi [-] (30) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
2018-07-04 16:50:34.761 32 INFO neutron.wsgi [-] (32) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 34 INFO neutron.wsgi [-] (34) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 29 INFO neutron.wsgi [-] (29) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 31 INFO neutron.wsgi [-] (31) wsgi exited, is_accepting=True
2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:50:34.807 7 INFO oslo_service.service [-] Child 27 exited with status 0
2018-07-04 16:50:34.807 7 WARNING oslo_service.service [-] pid 27 not in child list
2018-07-04 16:50:35.761 7 INFO oslo_service.service [-] Child 28 exited with status 0
2018-07-04 16:50:35.764 7 INFO oslo_service.service [-] Child 29 exited with status 0
2018-07-04 16:50:35.767 7 INFO oslo_service.service [-] Child 30 exited with status 0
2018-07-04 16:50:35.768 78 INFO neutron.wsgi [-] (78) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.771 79 INFO neutron.wsgi [-] (79) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.770 7 INFO oslo_service.service [-] Child 31 exited with status 0
2018-07-04 16:50:35.773 7 INFO oslo_service.service [-] Child 32 exited with status 0
2018-07-04 16:50:35.774 80 INFO neutron.wsgi [-] (80) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.776 7 INFO oslo_service.service [-] Child 33 exited with status 0
2018-07-04 16:50:35.777 81 INFO neutron.wsgi [-] (81) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.780 82 INFO neutron.wsgi [-] (82) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.779 7 INFO oslo_service.service [-] Child 34 exited with status 0
2018-07-04 16:50:35.782 7 INFO oslo_service.service [-] Child 43 exited with status 0
2018-07-04 16:50:35.783 83 INFO neutron.wsgi [-] (83) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.783 7 WARNING oslo_service.service [-] pid 43 not in child list
2018-07-04 16:50:35.786 84 INFO neutron.wsgi [-] (84) wsgi starting up on http://10.0.105.101:9696

# But on the second SIGHUP, the following happened:

2018-07-04 16:52:08.821 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
2018-07-04 16:52:08.830 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:52:08.831 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:52:08.847 7 INFO oslo_service.service [-] Wait called after thread killed. Cleaning up.
2018-07-04 16:52:08.847 79 INFO neutron.wsgi [-] (79) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 82 INFO neutron.wsgi [-] (82) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 78 INFO neutron.wsgi [-] (78) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 84 INFO neutron.wsgi [-] (84) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 81 INFO neutron.wsgi [-] (81) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 80 INFO neutron.wsgi [-] (80) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 83 INFO neutron.wsgi [-] (83) wsgi exited, is_accepting=True
2018-07-04 16:52:08.848 7 INFO oslo_service.service [-] Waiting on 10 children to exit
2018-07-04 16:52:08.852 7 INFO oslo_service.service [-] Child 78 exited with status 0
2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 78 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 79 exited with status 0
2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 79 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 44 killed by signal 15
2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 80 exited with status 0
2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 80 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 81 exited with status 0
2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 81 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 82 exited with status 0
2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 82 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 83 exited with status 0
2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 83 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 84 exited with status 0
2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 84 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:15.039 7 INFO oslo_service.service [-] Child 98 exited with status 0
2018-07-04 16:52:30.025 7 INFO oslo_service.service [-] Child 97 exited with status 0
2018-07-04 16:52:38.017 7 INFO oslo_service.service [-] Child 96 exited with status 0

* Version

I'm running CentOS 7 RDO TripleO (commit tag c9fd24040454913b4a325741094285676fb7e7bc_a0a28280), the OVS ML2 plugin, and 3 controllers in an HA setup. TripleO templates and deployment config are available on request.

This corresponds to OpenStack Queens.

I have not fully investigated all other services, but they seem to be fine with SIGHUP. In particular nova-api-metadata seems to be OK and doing what's expected.

* Perceived severity

This blocks our private production stack from running: the neutron API breaking has a big knock-on effect, unfortunately.

zhaobo (zhaobo6)
Changed in neutron:
status: New → Triaged
Revision history for this message
Jan Gutter (jangutter) wrote :

Adding a small update for context in the tripleo scenario.

Originally the logrotate config looked like this:

/var/log/neutron/*.log {
    rotate 14
    size 10M
    missingok
    compress
    copytruncate
}

In the tripleo containers deployment, logrotate looks like this:

https://github.com/openstack/puppet-tripleo/blob/master/templates/logrotate/containers_logrotate.conf.erb

Brent Eagles (beagles)
Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
milestone: none → rocky-3
Revision history for this message
Brent Eagles (beagles) wrote :

This can be demonstrated by simply sending a SIGHUP to the container or the root neutron-server process. Apparently fixing this is non-trivial in neutron. I've added tripleo to the bug, as the more likely near-term solution would be to alter the logrotate behavior for neutron.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/582661

Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Revision history for this message
yao ning (mslovy11022) wrote :

Is it possible to fix this by moving neutron-server under Apache httpd using WSGI? Would that fix the issue?

Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

Yao ning, yeah, seems to me like this is something where using httpd would help.

Revision history for this message
Bernard Cafarelli (bcafarel) wrote :

Testing on devstack master (stacked around July 31), neutron-server does not react well to SIGHUP, with similar-looking log messages, such as:
WARNING oslo_service.service [-] pid 17122 not in child list

After SIGHUP, commands like "openstack network list" hang

Note that when issuing a restart with "systemctl restart devstack@q-svc", the oslo_service warnings do not appear on shutdown.

Revision history for this message
Bernard Cafarelli (bcafarel) wrote :

On a fresh restack (CentOS 7.5), I managed to reproduce the wait4/ECHILD loop issue on devstack.

So this is not container-specific. On Ubuntu 16.04, reloading the neutron-server service (i.e. sending SIGHUP) seems to work fine.

Changed in neutron:
assignee: nobody → Bernard Cafarelli (bcafarel)
Changed in tripleo:
status: Triaged → In Progress
assignee: nobody → Bogdan Dobrelya (bogdando)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/596274

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.openstack.org/596275

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/596277

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (master)

Change abandoned by Brent Eagles (<email address hidden>) on branch: master
Review: https://review.openstack.org/582661
Reason: in favor of https://review.openstack.org/#/c/596275/

Changed in tripleo:
milestone: rocky-rc1 → rocky-rc2
Changed in tripleo:
milestone: rocky-rc2 → stein-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/601544

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/601555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/601558

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/596277
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=305445bac6ccf2cb491f451412035f08aa2928ba
Submitter: Zuul
Branch: master

commit 305445bac6ccf2cb491f451412035f08aa2928ba
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Aug 24 12:42:49 2018 +0200

    Add container runtime packages for cron image

    Logrotate-crond needs docker binary (from package) and docker
    socket to restart some non-SIGHUP friendly containers.

    Related-Bug: #1715374
    Related-Bug: #1780139

    Change-Id: I73e7bb6e5ba4a79a3158a1878bf8b26a539e3e2f
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/604006

Revision history for this message
Bernard Cafarelli (bcafarel) wrote :

Relevant issue in tripleo periodic jobs: https://bugs.launchpad.net/tripleo/+bug/1793482

Revision history for this message
Boden R (boden) wrote :

Appears to have been duplicated in neutron with https://bugs.launchpad.net/neutron/+bug/1793482

tags: added: promotion-blocker
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/rocky)

Reviewed: https://review.openstack.org/604006
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=fc31a1c649cca06d0340a15a55339757ff002775
Submitter: Zuul
Branch: stable/rocky

commit fc31a1c649cca06d0340a15a55339757ff002775
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Aug 24 12:42:49 2018 +0200

    Add container runtime packages for cron image

    Logrotate-crond needs docker binary (from package) and docker
    socket to restart some non-SIGHUP friendly containers.

    Related-Bug: #1715374
    Related-Bug: #1780139

    Change-Id: I73e7bb6e5ba4a79a3158a1878bf8b26a539e3e2f
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 305445bac6ccf2cb491f451412035f08aa2928ba)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/604708

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Juan Antonio Osorio Robles (<email address hidden>) on branch: master
Review: https://review.openstack.org/596274
Reason: Purging the gate to free up resources and address the timeout issues

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/596274
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=2b1afc0483f849046b404099944653015fc11a04
Submitter: Zuul
Branch: master

commit 2b1afc0483f849046b404099944653015fc11a04
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Aug 24 12:12:41 2018 +0200

    Allow a containerized logrotate to access docker

    This is required for a containerized logrotate to restart neutron
    API container as it cannot process signals.

    Related-Bug: #1780139

    Change-Id: I27663ffc0adb34a61f2f97fef913e9e36592a585
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/605348

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/605349

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/rocky)

Reviewed: https://review.openstack.org/605349
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=78e0205b936ed069a6e1866165c17d5a3d4ee1be
Submitter: Zuul
Branch: stable/rocky

commit 78e0205b936ed069a6e1866165c17d5a3d4ee1be
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Aug 24 12:12:41 2018 +0200

    Allow a containerized logrotate to access docker

    This is required for a containerized logrotate to restart neutron
    API container as it cannot process signals.

    Related-Bug: #1780139

    Change-Id: I27663ffc0adb34a61f2f97fef913e9e36592a585
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 2b1afc0483f849046b404099944653015fc11a04)

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

+Bogdan, shall we close this LP for the tripleo part?

Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Emilien Macchi (emilienm)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/596275
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=d37c74d6382690a05ba3ca10edd7f2acb0fbbb2e
Submitter: Zuul
Branch: master

commit d37c74d6382690a05ba3ca10edd7f2acb0fbbb2e
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Aug 24 12:15:51 2018 +0200

    Fix postrotate to notify holders of rotated logs

    Lsof +L1 locates unlinked and open files and does not work for
    logrotate, neither with copytruncate nor w/o that option.

    Instead, find *.X (X - number) files held and notify the processes
    owning those to take appropriate actions and reopen new log files to
    stop writing to the rotated files.

    The actions to be taken by such processes are:

    * For httpd processes, use USR1 to gracefully reload
    * For neutron-server, restart the container as it cannot process
      HUP signal well (LP bug #1276694, LP bug #1780139).
    * For nova-compute, restart the container as it cannot process
      HUP signal well (LP bug #1276694, LP bug #1715374).
    * For other processes, use HUP to reload

    This also fixes the filter to match logfiles ending with *err,
    like rabbitmq startup errors log.

    Closes-Bug: #1780139
    Closes-Bug: #1785659
    Closes-Bug: #1715374

    Change-Id: I5110426aa26e5fce7ebb4d80d8a2082cbf80519c
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/606864

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/606908

Revision history for this message
Bernard Cafarelli (bcafarel) wrote :

Recapping some other investigations (downstream https://bugzilla.redhat.com/show_bug.cgi?id=1631335 has some details too), in general neutron-server does not handle SIGHUP properly.
While it may still reply to API requests after SIGHUP, the service is degraded (or not available). Two typical situations after some time (allowing children time to exit):
* No child process, main process in a 100% CPU loop "wait4(0, 0x7ffd52e322a4, WNOHANG, NULL) = -1 ECHILD (No child processes)"
* wsgi workers are killed but not restarted (RPC workers restart properly)

Some traces captured on different setups:
Ubuntu 16.04, python 3.5 (missing wsgi workers)

bcafarel@defiant-ubuntu:~$ sudo systemctl restart devstack@q-svc
bcafarel@defiant-ubuntu:~$ ps fauxw|grep [n]eutron-server
bcafarel 27704 74.2 0.7 288252 122248 ? Ss 10:40 0:02 /usr/bin/python3.5 /usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini
bcafarel 27712 0.0 0.6 288392 111364 ? S 10:40 0:00 \_ /usr/bin/python3.5 /usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini
bcafarel 27713 0.0 0.6 288392 111364 ? S 10:40 0:00 \_ /usr/bin/python3.5 /usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini
bcafarel 27714 14.0 0.6 291700 114824 ? S 10:40 0:00 \_ /usr/bin/python3.5 /usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini
bcafarel 27715 98.0 0.7 302396 126144 ? R 10:40 0:00 \_ /usr/bin/python3.5 /usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini

bcafarel@defiant-ubuntu:~$ sudo systemctl reload devstack@q-svc
bcafarel@defiant-ubuntu:~$ journalctl -afu devstack@q-svc
[...]
août 09 10:39:28 defiant-ubuntu neutron-server[27603]: INFO oslo_service.service [-] Child 27616 exited with status 0
août 09 10:39:28 defiant-ubuntu neutron-server[27603]: WARNING oslo_service.service [-] pid 27616 not in child list

# After workers exit
bcafarel@defiant-ubuntu:~$ ps fauxw|grep [n]eutron-server
bcafarel 27704 6.3 0.7 290424 124280 ? Ss 10:40 0:03 /usr/bin/python3.5 /usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini
bcafarel 27738 0.0 0.6 290424 113424 ? S 10:40 0:00 \_ /usr/bin/python3.5 /usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini

# New api requests like "openstack network list" are answered by 27738, no workers around

Centos 7.5, python 2.7 (example of 100% CPU loop)

# Can get in that state sometime after reload:
[stack@defiant-devstack ~]$ ps fuaxw|grep [n]eutron-ser
stack 20575 55.7 0.7 468188 118948 ? Rs août08 792:44 /usr/bin/python /usr/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutr...

Read more...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (stable/pike)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/604708
Reason: not needed any more as we fallback to copytruncate

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (stable/queens)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/601558
Reason: not needed any more as we fallback to copytruncate

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/pike)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/605348
Reason: not needed any more as we fallback to copytruncate

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/queens)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/601555
Reason: not needed any more as we fallback to copytruncate

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (stable/queens)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/601544
Reason: implementation falls back to copytruncate making this not needed any more

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (stable/pike)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/606908
Reason: implementation falls back to copytruncate making this not needed any more

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (stable/rocky)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/rocky
Review: https://review.openstack.org/606864
Reason: implementation falls back to copytruncate making this not needed any more

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 10.0.0

This issue was fixed in the openstack/puppet-tripleo 10.0.0 release.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

We can't fix this for tripleo as it requires neutron-server fixes for SIGHUP processing, so it has been marked Won't Fix. The logrotate configs have been switched from sending SIGHUP to using copytruncate, so tripleo is not affected by this anymore.

Changed in tripleo:
status: Fix Released → Won't Fix
Revision history for this message
Bernard Cafarelli (bcafarel) wrote :

On the Neutron side, I plan to work on this again, as I found that once killed, an RPC worker is not respawned (current devstack master).

Killing one of the RPC processes:
Jan 23 12:40:29 bionic neutron-server[28154]: INFO oslo_service.service [-] Child 28170 killed by signal 9
Jan 23 12:40:29 bionic neutron-server[28154]: WARNING oslo_service.service [-] pid 28170 not in child list

No new process is spawned.
After that kill, stopping the devstack@q-svc service, we see the wsgi processes quitting properly:
Jan 23 12:41:37 bionic neutron-server[28154]: INFO oslo_service.service [-] Waiting on 2 children to exit
Jan 23 12:41:37 bionic neutron-server[28154]: INFO neutron.wsgi [-] (28169) wsgi exited, is_accepting=True
Jan 23 12:41:37 bionic neutron-server[28154]: INFO neutron.wsgi [-] (28168) wsgi exited, is_accepting=True
Jan 23 12:41:37 defiant-bionic neutron-server[28154]: INFO oslo_service.service [-] Child 28169 exited with status 0
Jan 23 12:41:37 defiant-bionic neutron-server[28154]: INFO oslo_service.service [-] Child 28168 exited with status 0
Jan 23 12:41:37 bionic neutron-server[28154]: INFO oslo_service.service [-] Wait called after thread killed. Cleaning up.
Jan 23 12:41:37 bionic neutron-server[28154]: DEBUG neutron.service [-] calling RpcWorker stop() {{(pid=28154) stop /opt/stack/neutron/neutron/service.py:138}}
Jan 23 12:41:37 bionic neutron-server[28154]: DEBUG oslo_service.service [-] Killing children. {{(pid=28154) stop /usr/local/lib/python3.6/dist-packages/oslo_service/service.py:704}}
Jan 23 12:41:37 defiant-bionic neutron-server[28154]: INFO oslo_service.service [-] Waiting on 2 children to exit

The other, "surviving" RPC process quits properly, though we get an alarm status because of the previously killed one:
Jan 23 12:41:43 bionic neutron-server[28154]: INFO oslo_service.service [-] Child 28171 exited with status 0
Jan 23 12:42:37 bionic systemd[1]: <email address hidden>: Main process exited, code=killed, status=14/ALRM
Jan 23 12:42:37 bionic systemd[1]: <email address hidden>: Failed with result 'signal'.

Revision history for this message
Bernard Cafarelli (bcafarel) wrote :

Confirming: the ProcessLauncher children array in oslo.service only contains the WSGI PIDs.

Testing with API workers disabled in neutron.conf ("api_workers = 0"), the children array this time has the RPC worker PIDs, and killing one works properly.

So the neutron service code handling RPC workers looks OK by itself; the wsgi process code seems to come in and override it.

Changed in neutron:
importance: Undecided → High
status: Triaged → In Progress
Revision history for this message
Bernard Cafarelli (bcafarel) wrote :

Finally found the root cause here!

First, this is most probably not an issue when running the neutron API under WSGI.

For "standard" setups, the relevant parts of neutron-server startup are in server/wsgi_eventlet.py: start the wsgi/API server, then start all workers.
For wsgi, eventlet_wsgi_server() eventually results in a wsgi.Server.start() call. With multiple workers, that call instantiates an oslo.service ProcessLauncher (stored in self._server).

For the other workers, service.start_all_workers() calls _start_workers(), which creates a worker launcher (if there are process workers); that is another oslo.service ProcessLauncher instance.

oslo.service strongly recommends not using multiple instances of ProcessLauncher in the same process (https://docs.openstack.org/oslo.service/latest/user/usage.html#launchers), probably because at creation it performs a few global actions, especially on signal handlers.

So what happens is that signals for the RPC processes end up on the wsgi ProcessLauncher, which, as shown in the previous comment, does not know about them. This also explains why RPC processes are handled correctly with api_workers=0 (where there is only one ProcessLauncher).
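
A minimal standalone sketch of that effect (toy code under my own assumptions, not the neutron or oslo.service implementation): both launchers reap with waitpid(0, ...), which collects any exited child of the process, so one launcher can reap a child it never started ("pid X not in child list") while the launcher that actually owns it never sees the exit and never respawns it.

import os
import sys
import time

class ToyLauncher:
    """Toy stand-in for a process launcher; not oslo.service code."""

    def __init__(self, name):
        self.name = name
        self.children = set()

    def launch_worker(self):
        pid = os.fork()
        if pid == 0:
            time.sleep(1)  # child "worker" does some work, then exits
            sys.exit(0)
        self.children.add(pid)

    def reap_once(self):
        # waitpid(0, ...) reaps ANY exited child of this process,
        # not just the ones this launcher started.
        pid, _ = os.waitpid(0, 0)
        if pid in self.children:
            self.children.discard(pid)
            print("%s: child %d exited, respawning" % (self.name, pid))
            self.launch_worker()
        else:
            print("%s: pid %d not in child list" % (self.name, pid))

wsgi_launcher = ToyLauncher("wsgi")
rpc_launcher = ToyLauncher("rpc")
rpc_launcher.launch_worker()  # the RPC worker
wsgi_launcher.reap_once()     # wsgi launcher reaps it: "pid ... not in child list"
# rpc_launcher never learns that its worker exited, so it is never respawned.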

The planned fix is to reuse the neutron API ProcessLauncher instance if it is set.
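
In concept (a hedged sketch of that fix direction with an illustrative helper name, not the actual neutron change), the RPC worker startup would prefer an already-existing launcher and only create its own when the process has none, e.g. with api_workers = 0:

from oslo_config import cfg
from oslo_service import service as common_service

def get_process_launcher(existing_launcher=None):
    # Illustrative only: reuse the ProcessLauncher already created for the
    # WSGI/API workers instead of instantiating a second one in the same
    # process.
    if existing_launcher is not None:
        return existing_launcher
    return common_service.ProcessLauncher(cfg.CONF, wait_interval=1.0)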

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/706514

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/706514
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=13aa00026f768642f385259ba7ccddcfdad3c75f
Submitter: Zuul
Branch: master

commit 13aa00026f768642f385259ba7ccddcfdad3c75f
Author: Bernard Cafarelli <email address hidden>
Date: Fri Feb 7 14:21:09 2020 +0100

    Re-use existing ProcessLauncher from wsgi in RPC workers

    If both are run under the same process, and api_workers >= 2, the server
    process will instantiate two oslo_service.ProcessLauncher instances

    This should be avoided [0], and indeed causes issues on subprocess and
    signal handling: killed RPC workers not respawning, SIGHUP on master
    process leading to unresponsive server, signal not properly sent to all
    child processes, ...

    To avoid this, use the wsgi ProcessLauncher instance if it exists

    [0] https://docs.openstack.org/oslo.service/latest/user/usage.html#launchers

    Change-Id: Ic821f8ca84add9c8137ef712031afb43e491591c
    Closes-Bug: #1780139

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/708570

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/708571

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/708572

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/708576

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.0.0.0b1

This issue was fixed in the openstack/neutron 16.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/708571
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4916bbc5f3c3727a7f0ad822ec6c140b54e51271
Submitter: Zuul
Branch: stable/stein

commit 4916bbc5f3c3727a7f0ad822ec6c140b54e51271
Author: Bernard Cafarelli <email address hidden>
Date: Fri Feb 7 14:21:09 2020 +0100

    Re-use existing ProcessLauncher from wsgi in RPC workers

    If both are run under the same process, and api_workers >= 2, the server
    process will instantiate two oslo_service.ProcessLauncher instances

    This should be avoided [0], and indeed causes issues on subprocess and
    signal handling: killed RPC workers not respawning, SIGHUP on master
    process leading to unresponsive server, signal not properly sent to all
    child processes, ...

    To avoid this, use the wsgi ProcessLauncher instance if it exists

    [0] https://docs.openstack.org/oslo.service/latest/user/usage.html#launchers

    Change-Id: Ic821f8ca84add9c8137ef712031afb43e491591c
    Closes-Bug: #1780139
    (cherry picked from commit 13aa00026f768642f385259ba7ccddcfdad3c75f)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/708572
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=dc68b69fc04283746518692886f386f034afc8f0
Submitter: Zuul
Branch: stable/rocky

commit dc68b69fc04283746518692886f386f034afc8f0
Author: Bernard Cafarelli <email address hidden>
Date: Fri Feb 7 14:21:09 2020 +0100

    Re-use existing ProcessLauncher from wsgi in RPC workers

    If both are run under the same process, and api_workers >= 2, the server
    process will instantiate two oslo_service.ProcessLauncher instances

    This should be avoided [0], and indeed causes issues on subprocess and
    signal handling: killed RPC workers not respawning, SIGHUP on master
    process leading to unresponsive server, signal not properly sent to all
    child processes, ...

    To avoid this, use the wsgi ProcessLauncher instance if it exists

    [0] https://docs.openstack.org/oslo.service/latest/user/usage.html#launchers

    Change-Id: Ic821f8ca84add9c8137ef712031afb43e491591c
    Closes-Bug: #1780139
    (cherry picked from commit 13aa00026f768642f385259ba7ccddcfdad3c75f)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/708576
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=66d45b8e91bf1263c82e636bd21498bf6de552bc
Submitter: Zuul
Branch: stable/queens

commit 66d45b8e91bf1263c82e636bd21498bf6de552bc
Author: Bernard Cafarelli <email address hidden>
Date: Fri Feb 7 14:21:09 2020 +0100

    Re-use existing ProcessLauncher from wsgi in RPC workers

    If both are run under the same process, and api_workers >= 2, the server
    process will instantiate two oslo_service.ProcessLauncher instances

    This should be avoided [0], and indeed causes issues on subprocess and
    signal handling: killed RPC workers not respawning, SIGHUP on master
    process leading to unresponsive server, signal not properly sent to all
    child processes, ...

    To avoid this, use the wsgi ProcessLauncher instance if it exists

    [0] https://docs.openstack.org/oslo.service/latest/user/usage.html#launchers

    Conflicts:
     neutron/service.py

    Depends-On: https://review.opendev.org/708825/
    Change-Id: Ic821f8ca84add9c8137ef712031afb43e491591c
    Closes-Bug: #1780139
    (cherry picked from commit 13aa00026f768642f385259ba7ccddcfdad3c75f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.7

This issue was fixed in the openstack/neutron 13.0.7 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/708570
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=13f2e1344a697c8c7adf795100e3e0ee92bdc43f
Submitter: Zuul
Branch: stable/train

commit 13f2e1344a697c8c7adf795100e3e0ee92bdc43f
Author: Bernard Cafarelli <email address hidden>
Date: Fri Feb 7 14:21:09 2020 +0100

    Re-use existing ProcessLauncher from wsgi in RPC workers

    If both are run under the same process, and api_workers >= 2, the server
    process will instantiate two oslo_service.ProcessLauncher instances

    This should be avoided [0], and indeed causes issues on subprocess and
    signal handling: killed RPC workers not respawning, SIGHUP on master
    process leading to unresponsive server, signal not properly sent to all
    child processes, ...

    To avoid this, use the wsgi ProcessLauncher instance if it exists

    [0] https://docs.openstack.org/oslo.service/latest/user/usage.html#launchers

    Change-Id: Ic821f8ca84add9c8137ef712031afb43e491591c
    Closes-Bug: #1780139
    (cherry picked from commit 13aa00026f768642f385259ba7ccddcfdad3c75f)

tags: added: in-stable-train
tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron queens-eol

This issue was fixed in the openstack/neutron queens-eol release.
