race condition when rotating logs

Bug #481512 reported by Gary Poster on 2009-11-12
20
This bug affects 4 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Critical
Unassigned
ZConfig
Medium
Fred Drake

Bug Description

lpnet 13 reported a segfault. Diagnostic details are here:

- https://pastebin.canonical.com/24640/
- https://pastebin.canonical.com/24724/

As Michael describes in the first comment below, this appears to be caused by a race condition: writing logs while rotating them. (See thread 6 and thread 1 in second pastebin.)

The next challenge is to figure out what can be done about this, barring the removal of in-app logrotation.

Related branches

Michael Hudson-Doyle (mwhudson) wrote :

gdb-ing suggests very strongly that this is a fclose/fwrite race at logrotate time: thread 6 is waiting to reacquire the GIL having just closed something, the python stacktrace very much suggesting a log file and thread 1 is writing to something, the python stacktrace very much suggesting a log file.

Gary Poster (gary) on 2009-11-13
summary: - segfault on lpnet13
+ race condition when rotating logs
description: updated
Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → Low
importance: Low → High
Gary Poster (gary) on 2009-11-13
description: updated
Tom Haddon (mthaddon) on 2010-05-28
tags: added: canonical-losa-lp
Haw Loeung (hloeung) wrote :

Hi Guys,

lpnet123, lpnet125, lpnet126, lpnet129, and lpnet132 died on soybean. It's around the same time as log rotating so I believe it is related to this bug.

In the event that it happens again, what can I do to help collect more information?

Regards,

Haw

Robert Collins (lifeless) wrote :

a corefile + the stdout/stderr logs for an appserver that was affected.

Steve McInerney (spm) wrote :

lpnets 122, 124, 125, 130, 132, 134, 135 all died on soybean shortly after the scheduled log rotation.

Steve McInerney (spm) wrote :

Have two major events today, both post coinciding shortly after the scheduled start for logrotations.

Two core files created:
https://devpad.canonical.com/~spm/soybean-core.2011-07-15.gdb-out
https://devpad.canonical.com/~spm/wampee-core.2011-07-15.gdb-out

logs are all on devpad.

Steve McInerney (spm) wrote :

Upping to critical. "for" loops for service restoration are undesirable. :-(

Changed in launchpad:
importance: High → Critical
Francis J. Lacoste (flacoste) wrote :

Escalated by Tom Haddon

tags: added: escalated
Changed in launchpad:
status: Triaged → In Progress
assignee: nobody → Henning Eggers (henninge)
Changed in zconfig:
status: New → In Progress
assignee: nobody → Henning Eggers (henninge)
tags: added: no-qa
Steve McInerney (spm) wrote :

managed to get a core in the latest batch of crashings:
https://devpad.canonical.com/~spm/core.2011-07-29.out

Henning Eggers (henninge) wrote :

Hm, the cores don't show anything that would help me. I went by the thread listing from the second paste in the current bug description. That shows "reopen" and "emit" colliding and I have added locking to prevent that. (emit already did it but reopen didn't). That fix will be in the next deploy and we should watch if it helps. Otherwise we may have been looking in the wrong direction.

Changed in launchpad:
status: In Progress → Fix Committed
tags: added: qa-untestable
removed: no-qa
Changed in launchpad:
status: Fix Committed → Fix Released
Fred Drake (fdrake) wrote :

Fix committed to ZConfig trunk in r124374.

Changed in zconfig:
status: In Progress → Fix Committed
importance: Undecided → Medium
assignee: Henning Eggers (henninge) → Fred Drake (fdrake)
Fred Drake (fdrake) wrote :

Fix released in ZConfig 2.9.1.

Changed in zconfig:
status: Fix Committed → Fix Released
Robert Collins (lifeless) wrote :

Hi, sorry to be the bearer of bad news but the fix Henning did introduces another issue recorded in bug 861742 (now a duplicate of this)

So we're going to have to examine other ways of achieving the goal - e.g.
 - fixing CPython to not segfault on concurrent operations on the filehandle
 - using locking but not doing the rotation in the signal handler, instead doing it on the next lock operation to notice the rotation request

Changed in launchpad:
status: Fix Released → Triaged
assignee: Henning Eggers (henninge) → nobody
tags: removed: escalated
Stuart Bishop (stub) wrote :

To make this class of problems go away for Launchpad, we might want to stop sending signals to this rather large and complex beast. Noticing when the log file has been moved and reopening, or logging to a stream and having a helper daemon handle rotation could be much less frustrating long term.

Robert Collins (lifeless) wrote :

I've pushed up a candidate fix. Fred - let me know what you think?

William Grant (wgrant) on 2012-10-22
tags: added: webapp-infrastructure
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers