Launchpad itself

race condition when rotating logs

Bug #481512 reported by Gary Poster on 2009-11-12

This bug affects 4 people

Affects		Status	Importance	Assigned to	Milestone
	Launchpad itself	Fix Released	Critical	Unassigned
	ZConfig	Fix Released	Medium	aboubacar

Bug Description

lpnet 13 reported a segfault. Diagnostic details are here:

- https://pastebin.canonical.com/24640/
- https://pastebin.canonical.com/24724/

As Michael describes in the first comment below, this appears to be caused by a race condition: writing logs while rotating them. (See thread 6 and thread 1 in second pastebin.)

The next challenge is to figure out what can be done about this, barring the removal of in-app logrotation.

See original description

Tags:

Related branches

lp:~henninge/zconfig/bug-481512-reopen-logs

lp:~henninge/launchpad/bug-481512-reopen-logs

Merged into lp:launchpad at revision 13544

Gary Poster (community): Approve on 2011-07-28

lp:~lifeless/zconfig/bug-481512

Ready for review for merging into lp:~fdrake/zconfig/trunk

Curtis Hovey (community): Approve (code) on 2012-09-12

Fred Drake: Pending requested 2012-09-04

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2009-11-12:

gdb-ing suggests very strongly that this is a fclose/fwrite race at logrotate time: thread 6 is waiting to reacquire the GIL having just closed something, the python stacktrace very much suggesting a log file and thread 1 is writing to something, the python stacktrace very much suggesting a log file.

Gary Poster (gary) on 2009-11-13

summary:	- segfault on lpnet13 + race condition when rotating logs
description:	updated
Changed in launchpad-foundations:
status:	New → Triaged
importance:	Undecided → Low
importance:	Low → High

Gary Poster (gary) on 2009-11-13

description:

updated

Tom Haddon (mthaddon) on 2010-05-28

tags:

added: canonical-losa-lp

Revision history for this message

Haw Loeung (hloeung) wrote on 2011-07-06:

Hi Guys,

lpnet123, lpnet125, lpnet126, lpnet129, and lpnet132 died on soybean. It's around the same time as log rotating so I believe it is related to this bug.

In the event that it happens again, what can I do to help collect more information?

Regards,

Haw

Revision history for this message

Robert Collins (lifeless) wrote on 2011-07-06:

a corefile + the stdout/stderr logs for an appserver that was affected.

Revision history for this message

Steve McInerney (spm) wrote on 2011-07-12:

lpnets 122, 124, 125, 130, 132, 134, 135 all died on soybean shortly after the scheduled log rotation.

Revision history for this message

Steve McInerney (spm) wrote on 2011-07-15:

Have two major events today, both post coinciding shortly after the scheduled start for logrotations.

Two core files created:
https://devpad.canonical.com/~spm/soybean-core.2011-07-15.gdb-out
https://devpad.canonical.com/~spm/wampee-core.2011-07-15.gdb-out

logs are all on devpad.

Revision history for this message

Steve McInerney (spm) wrote on 2011-07-18:

Upping to critical. "for" loops for service restoration are undesirable. :-(

Changed in launchpad:
importance:	High → Critical

Revision history for this message

Francis J. Lacoste (flacoste) wrote on 2011-07-20:

Escalated by Tom Haddon

tags:

added: escalated

Henning Eggers (henninge) on 2011-07-28

Changed in launchpad:
status:	Triaged → In Progress
assignee:	nobody → Henning Eggers (henninge)

Henning Eggers (henninge) on 2011-07-28

Changed in zconfig:
status:	New → In Progress
assignee:	nobody → Henning Eggers (henninge)

Henning Eggers (henninge) on 2011-07-28

tags:

added: no-qa

Revision history for this message

Steve McInerney (spm) wrote on 2011-07-29:

managed to get a core in the latest batch of crashings:
https://devpad.canonical.com/~spm/core.2011-07-29.out

Revision history for this message

Henning Eggers (henninge) wrote on 2011-07-29:

Hm, the cores don't show anything that would help me. I went by the thread listing from the second paste in the current bug description. That shows "reopen" and "emit" colliding and I have added locking to prevent that. (emit already did it but reopen didn't). That fix will be in the next deploy and we should watch if it helps. Otherwise we may have been looking in the wrong direction.

Changed in launchpad:
status:	In Progress → Fix Committed
tags:	added: qa-untestable removed: no-qa

Henning Eggers (henninge) on 2011-08-02

Changed in launchpad:
status:	Fix Committed → Fix Released

Revision history for this message

Fred Drake (fdrake) wrote on 2012-02-11:

#10

Fix committed to ZConfig trunk in r124374.

Changed in zconfig:
status:	In Progress → Fix Committed
importance:	Undecided → Medium
assignee:	Henning Eggers (henninge) → Fred Drake (fdrake)

Revision history for this message

Fred Drake (fdrake) wrote on 2012-02-11:

#11

Fix released in ZConfig 2.9.1.

Changed in zconfig:
status:	Fix Committed → Fix Released

Revision history for this message

Robert Collins (lifeless) wrote on 2012-02-11:

#12

Hi, sorry to be the bearer of bad news but the fix Henning did introduces another issue recorded in bug 861742 (now a duplicate of this)

So we're going to have to examine other ways of achieving the goal - e.g.
- fixing CPython to not segfault on concurrent operations on the filehandle
- using locking but not doing the rotation in the signal handler, instead doing it on the next lock operation to notice the rotation request

Changed in launchpad:
status:	Fix Released → Triaged
assignee:	Henning Eggers (henninge) → nobody

Francis J. Lacoste (flacoste) on 2012-02-24

tags:

removed: escalated

Revision history for this message

Stuart Bishop (stub) wrote on 2012-06-08:

#13

To make this class of problems go away for Launchpad, we might want to stop sending signals to this rather large and complex beast. Noticing when the log file has been moved and reopening, or logging to a stream and having a helper daemon handle rotation could be much less frustrating long term.

Revision history for this message

Robert Collins (lifeless) wrote on 2012-09-04:

#14

I've pushed up a candidate fix. Fred - let me know what you think?

William Grant (wgrant) on 2012-10-22

tags:

added: webapp-infrastructure

Airkm (airkm) on 2019-02-24

information type:

Public → Private

William Grant (wgrant) on 2019-02-25

information type:

Private → Public

aboubacar (aboubacar12) on 2022-01-28

Changed in zconfig:
assignee:	Fred Drake (fdrake) → aboubacar (aboubacar12)

Revision history for this message

Colin Watson (cjwatson) wrote on 2023-06-13:

#15

We switched from zope.server to gunicorn a while back, and the entire way that the individual appserver workers run is entirely different now; also, appserver logs are handled via the systemd journal and rsyslog rather than in the application. I'm pretty sure all that has made this bug obsolete, one way or another.

Changed in launchpad:
status:	Triaged → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

Bug #861742

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.