Sending SIGHUP to neutron-server process causes it to hang
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
neutron |
Fix Released
|
High
|
Bernard Cafarelli | ||
tripleo |
Won't Fix
|
Critical
|
Emilien Macchi |
Bug Description
* High level description
When sending SIGHUP to the neutron-server process in a neutron_api container, it looks like the main process locks up in a tight loop. strace output shows that it's waiting for a process that doesn't exist:
wait4(0, 0x7ffe97e025a4, WNOHANG, NULL) = -1 ECHILD (No child processes)
This is problematic, because logrotate uses SIGHUP in the containerized environment. It doesn't happen always: it might take one or two signals reasonably interspersed, to trigger this.
* Pre-conditions:
I'm using CentOS 7 + Queens RDO
"rdo_version": "c9fd2404045491
I first noticed the issue when the neutron_api docker would stop working on the control nodes, eventually it was traced back to the logrotate_crond container sending SIGHUP to all the processes owning log files in /var/log/
* Step-by-step reproduction steps:
# Start with a clean container
docker restart neutron_api
# Identify the neutron-server PID: (613782 in this case) and send SIGHUP
kill -HUP 613782
# the relevant log files generally look clean the first time:
2018-07-04 16:50:34.730 7 INFO oslo_service.
2018-07-04 16:50:34.739 7 INFO neutron.
2018-07-04 16:50:34.740 7 INFO neutron.
2018-07-04 16:50:34.761 33 INFO neutron.wsgi [-] (33) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 27 INFO neutron.wsgi [-] (27) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 28 INFO neutron.wsgi [-] (28) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 30 INFO neutron.wsgi [-] (30) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 7 INFO oslo_service.
2018-07-04 16:50:34.761 32 INFO neutron.wsgi [-] (32) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 34 INFO neutron.wsgi [-] (34) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 29 INFO neutron.wsgi [-] (29) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 31 INFO neutron.wsgi [-] (31) wsgi exited, is_accepting=True
2018-07-04 16:50:34.771 7 INFO neutron.
2018-07-04 16:50:34.771 7 INFO neutron.
2018-07-04 16:50:34.792 7 INFO neutron.
2018-07-04 16:50:34.792 7 INFO neutron.
2018-07-04 16:50:34.807 7 INFO oslo_service.
2018-07-04 16:50:34.807 7 WARNING oslo_service.
2018-07-04 16:50:35.761 7 INFO oslo_service.
2018-07-04 16:50:35.764 7 INFO oslo_service.
2018-07-04 16:50:35.767 7 INFO oslo_service.
2018-07-04 16:50:35.768 78 INFO neutron.wsgi [-] (78) wsgi starting up on http://
2018-07-04 16:50:35.771 79 INFO neutron.wsgi [-] (79) wsgi starting up on http://
2018-07-04 16:50:35.770 7 INFO oslo_service.
2018-07-04 16:50:35.773 7 INFO oslo_service.
2018-07-04 16:50:35.774 80 INFO neutron.wsgi [-] (80) wsgi starting up on http://
2018-07-04 16:50:35.776 7 INFO oslo_service.
2018-07-04 16:50:35.777 81 INFO neutron.wsgi [-] (81) wsgi starting up on http://
2018-07-04 16:50:35.780 82 INFO neutron.wsgi [-] (82) wsgi starting up on http://
2018-07-04 16:50:35.779 7 INFO oslo_service.
2018-07-04 16:50:35.782 7 INFO oslo_service.
2018-07-04 16:50:35.783 83 INFO neutron.wsgi [-] (83) wsgi starting up on http://
2018-07-04 16:50:35.783 7 WARNING oslo_service.
2018-07-04 16:50:35.786 84 INFO neutron.wsgi [-] (84) wsgi starting up on http://
# But on the second SIGHUP, the following happened:
2018-07-04 16:52:08.821 7 INFO oslo_service.
2018-07-04 16:52:08.830 7 INFO neutron.
2018-07-04 16:52:08.831 7 INFO neutron.
2018-07-04 16:52:08.847 7 INFO oslo_service.
2018-07-04 16:52:08.847 79 INFO neutron.wsgi [-] (79) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 82 INFO neutron.wsgi [-] (82) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 78 INFO neutron.wsgi [-] (78) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 84 INFO neutron.wsgi [-] (84) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 81 INFO neutron.wsgi [-] (81) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 80 INFO neutron.wsgi [-] (80) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 83 INFO neutron.wsgi [-] (83) wsgi exited, is_accepting=True
2018-07-04 16:52:08.848 7 INFO oslo_service.
2018-07-04 16:52:08.852 7 INFO oslo_service.
2018-07-04 16:52:08.853 7 WARNING oslo_service.
2018-07-04 16:52:08.853 7 INFO oslo_service.
2018-07-04 16:52:08.853 7 WARNING oslo_service.
2018-07-04 16:52:08.853 7 INFO oslo_service.
2018-07-04 16:52:08.854 7 INFO oslo_service.
2018-07-04 16:52:08.854 7 WARNING oslo_service.
2018-07-04 16:52:08.854 7 INFO oslo_service.
2018-07-04 16:52:08.854 7 WARNING oslo_service.
2018-07-04 16:52:08.854 7 INFO oslo_service.
2018-07-04 16:52:08.855 7 WARNING oslo_service.
2018-07-04 16:52:08.855 7 INFO oslo_service.
2018-07-04 16:52:08.855 7 WARNING oslo_service.
2018-07-04 16:52:08.855 7 INFO oslo_service.
2018-07-04 16:52:08.855 7 WARNING oslo_service.
2018-07-04 16:52:15.039 7 INFO oslo_service.
2018-07-04 16:52:30.025 7 INFO oslo_service.
2018-07-04 16:52:38.017 7 INFO oslo_service.
* Version
I'm running Centos7 RDO TripleO (commit tag c9fd24040454913
available on request.
This corresponds to OpenStack Queens.
I have not fully investigated all other services, but they seem to be fine with SIGHUP. In particular nova-api-metadata seems to be OK and doing what's expected.
* Perceived severity
This blocks our private production stack from running: neutron api breaking has a big knock-on effect, unfortunately.
Changed in neutron: | |
status: | New → Triaged |
Changed in tripleo: | |
status: | New → Triaged |
importance: | Undecided → Critical |
milestone: | none → rocky-3 |
Changed in tripleo: | |
milestone: | rocky-3 → rocky-rc1 |
Changed in tripleo: | |
status: | Triaged → In Progress |
assignee: | nobody → Bogdan Dobrelya (bogdando) |
Changed in tripleo: | |
milestone: | rocky-rc1 → rocky-rc2 |
Changed in tripleo: | |
milestone: | rocky-rc2 → stein-1 |
tags: | added: promotion-blocker |
Changed in tripleo: | |
assignee: | Bogdan Dobrelya (bogdando) → Emilien Macchi (emilienm) |
Changed in neutron: | |
importance: | Undecided → High |
status: | Triaged → In Progress |
tags: | added: neutron-proactive-backport-potential |
tags: | removed: neutron-proactive-backport-potential |
Adding a small update for context in the tripleo scenario.
Originally the logrotate config looked like this:
/var/log/ neutron/ *.log {
rotate 14
size 10M
missingok
compress
copytruncate
}
In the tripleo containers deployment, logrotate looks like this:
https:/ /github. com/openstack/ puppet- tripleo/ blob/master/ templates/ logrotate/ containers_ logrotate. conf.erb