tripleo

upstart service unreliable after introducion of pipe to logger

Bug #1385346 reported by Michael Kerrin on 2014-10-24

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	tripleo	Fix Released	Critical	Gregory Haynes

Bug Description

upstart scripts for the openstack services now pipe the output to logger for example neutron-server the start script is:

The | introduces a fork which can in some circumtances leave a process behind after shutting down. For example after two runs of o-r-c I see

Oct 23 12:24:59 overcloud-controller1-vgsj3lyoqjix os-collect-config: + os-svc-enable -n neutron-server
Oct 23 12:24:59 overcloud-controller1-vgsj3lyoqjix os-collect-config: + os-svc-restart -n neutron-server
Oct 23 12:24:59 overcloud-controller1-vgsj3lyoqjix os-collect-config: stop: Unknown instance:
Oct 23 12:25:00 overcloud-controller1-vgsj3lyoqjix os-collect-config: neutron-server start/running, process 8126

And then

Oct 23 12:27:07 overcloud-controller1-vgsj3lyoqjix os-collect-config: + os-svc-enable -n neutron-server
Oct 23 12:27:07 overcloud-controller1-vgsj3lyoqjix os-collect-config: + os-svc-restart -n neutron-server
Oct 23 12:27:08 overcloud-controller1-vgsj3lyoqjix os-collect-config: neutron-server stop/waiting
Oct 23 12:27:09 overcloud-controller1-vgsj3lyoqjix os-collect-config: neutron-server start/running, process 12301

But if I look at the system I see

root@overcloud-controller1-vgsj3lyoqjix:/var/log/neutron# ps auxww | grep neutron-server
neutron 8127 62.0 0.9 232516 74088 ? R 12:24 38:00 /opt/stack/venvs/openstack/bin/python /opt/stack/venvs/openstack/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/rootwrap.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-file /etc/neutron/server.ini
root 11370 0.0 0.0 4312 616 ? Ss 13:25 0:00 /bin/sh -e -c exec start-stop-daemon --start -c neutron --exec /opt/stack/venvs/openstack/bin/neutron-server – --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/rootwrap.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-file /etc/neutron/server.ini 2>&1 | logger -t neutron-server /bin/sh
neutron 11371 14.5 1.5 286736 128432 ? S 13:25 0:04 /opt/stack/venvs/openstack/bin/python /opt/stack/venvs/openstack/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/rootwrap.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-file /etc/neutron/server.ini
root 11372 0.0 0.0 4212 340 ? S 13:25 0:00 logger -t neutron-server
root 11759 0.0 0.0 10464 864 pts/0 S+ 13:26 0:00 grep neutron-server

Now the process 8127 holds the neutron api server port 9696 so the second instance of neutron-server (11371) can bind to it and so goes into a infinite loop trying to start up.

The pipe was introduced in the review: https://review.openstack.org/#/c/104619/

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-10-24: Fix proposed to tripleo-image-elements (master)

Fix proposed to branch: master
Review: https://review.openstack.org/130830

Changed in tripleo:
assignee:	nobody → Michael Kerrin (michael-kerrin-w)
status:	New → In Progress

Revision history for this message

Ben Nemec (bnemec) wrote on 2014-10-24:

This basically means it's impossible to restart a service on an Upstart-based distro right? Seems like a serious problem to me.

Changed in tripleo:
importance:	Undecided → Critical

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-11-13: Fix merged to tripleo-image-elements (master)

Reviewed: https://review.openstack.org/130830
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=615c6bb95b818551934e652e70cd7833f965ccd6
Submitter: Jenkins
Branch: master

commit 615c6bb95b818551934e652e70cd7833f965ccd6
Author: Michael Kerrin <email address hidden>
Date: Fri Oct 24 16:42:12 2014 +0100

We need to tell upstart to expect the logger fork

Otherwise upstart doesn't get the correct process id for the service
and in some causes can fail to kill the process during a servcie stop

Change-Id: I138dd2206ed1e511fbabaf43bd3cb1d163c76db4
Closes-bug: #1385346

Changed in tripleo:
status:	In Progress → Fix Committed

Revision history for this message

Gregory Haynes (greghaynes) wrote on 2014-12-02:

So there is an issue here that upstart is not tracking the correct pid, but I dont believe it is a result of the piping we are doing. Upstart should be tracking the PID of the first process which will not be the logger. I think what is more likely is we are running into https://bugs.launchpad.net/upstart/+bug/406397 which is not a fun one to fix...

Revision history for this message

Gregory Haynes (greghaynes) wrote on 2014-12-02:

ok, I was a bit off, what is actually going on is upstart is killing the /bin/sh process and neutron-server isnt exiting as a result. Expect fork does fix this in a much cleaner fasion.

Revision history for this message

Clint Byrum (clint-fewbar) wrote on 2014-12-16:

This wasn't a complete solution.. Greg has a cleaner way.

Changed in tripleo:
status:	Fix Committed → In Progress
assignee:	Michael Kerrin (michael-kerrin-w) → Gregory Haynes (greghaynes)

Revision history for this message

Marios Andreou (marios-b) wrote on 2014-12-29:

Don't know why this hasn't shown up here:

Log directly to syslog from os services https://review.openstack.org/#/c/138844/

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-01-05:

Reviewed: https://review.openstack.org/138844
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=5fc2040aa5e8bb146b9b220ded376ca49a1995f9
Submitter: Jenkins
Branch: master

commit 5fc2040aa5e8bb146b9b220ded376ca49a1995f9
Author: Gregory Haynes <email address hidden>
Date: Wed Dec 3 12:13:39 2014 -0800

Log directly to syslog from os services

    We currently have a horrible hack for upstart where we run our services
    with stdout and stderr piped into logger. There are lots of issues with
    this, such as upstart has a hard time knowing which PID to track.
    Conveniently, python logging knows how to syslog, so why not just use
    that!

This changes the default behaviour of os-svc-daemon to create a python
logging.conf so services log directly to syslog.

    This also changes the default upstart respawn limiting to be 2 fails in
    10 secs. This is becauce making upstart monitor the actual service
    daemon was causing some services to respawn loop due to time it took
    them to fail.

Closes-Bug: #1385346

Change-Id: Ibd491881d79160ad1457526d51512f5fc1cf9ea5

Changed in tripleo:
status:	In Progress → Fix Committed

Ben Nemec (bnemec) on 2015-01-28

Changed in tripleo:
status:	Fix Committed → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.