race condition when rotating logs

Bug #481512 reported by Gary Poster
20
This bug affects 4 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Triaged
Critical
Unassigned
ZConfig
Fix Released
Medium
aboubacar

Bug Description

lpnet 13 reported a segfault. Diagnostic details are here:

- https://pastebin.canonical.com/24640/
- https://pastebin.canonical.com/24724/

As Michael describes in the first comment below, this appears to be caused by a race condition: writing logs while rotating them. (See thread 6 and thread 1 in second pastebin.)

The next challenge is to figure out what can be done about this, barring the removal of in-app logrotation.

Related branches

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

gdb-ing suggests very strongly that this is a fclose/fwrite race at logrotate time: thread 6 is waiting to reacquire the GIL having just closed something, the python stacktrace very much suggesting a log file and thread 1 is writing to something, the python stacktrace very much suggesting a log file.

Gary Poster (gary)
summary: - segfault on lpnet13
+ race condition when rotating logs
description: updated
Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → Low
importance: Low → High
Gary Poster (gary)
description: updated
Tom Haddon (mthaddon)
tags: added: canonical-losa-lp
Revision history for this message
Haw Loeung (hloeung) wrote :

Hi Guys,

lpnet123, lpnet125, lpnet126, lpnet129, and lpnet132 died on soybean. It's around the same time as log rotating so I believe it is related to this bug.

In the event that it happens again, what can I do to help collect more information?

Regards,

Haw

Revision history for this message
Robert Collins (lifeless) wrote :

a corefile + the stdout/stderr logs for an appserver that was affected.

Revision history for this message
Steve McInerney (spm) wrote :

lpnets 122, 124, 125, 130, 132, 134, 135 all died on soybean shortly after the scheduled log rotation.

Revision history for this message
Steve McInerney (spm) wrote :

Have two major events today, both post coinciding shortly after the scheduled start for logrotations.

Two core files created:
https://devpad.canonical.com/~spm/soybean-core.2011-07-15.gdb-out
https://devpad.canonical.com/~spm/wampee-core.2011-07-15.gdb-out

logs are all on devpad.

Revision history for this message
Steve McInerney (spm) wrote :

Upping to critical. "for" loops for service restoration are undesirable. :-(

Changed in launchpad:
importance: High → Critical
Revision history for this message
Francis J. Lacoste (flacoste) wrote :

Escalated by Tom Haddon

tags: added: escalated
Changed in launchpad:
status: Triaged → In Progress
assignee: nobody → Henning Eggers (henninge)
Changed in zconfig:
status: New → In Progress
assignee: nobody → Henning Eggers (henninge)
tags: added: no-qa
Revision history for this message
Steve McInerney (spm) wrote :

managed to get a core in the latest batch of crashings:
https://devpad.canonical.com/~spm/core.2011-07-29.out

Revision history for this message
Henning Eggers (henninge) wrote :

Hm, the cores don't show anything that would help me. I went by the thread listing from the second paste in the current bug description. That shows "reopen" and "emit" colliding and I have added locking to prevent that. (emit already did it but reopen didn't). That fix will be in the next deploy and we should watch if it helps. Otherwise we may have been looking in the wrong direction.

Changed in launchpad:
status: In Progress → Fix Committed
tags: added: qa-untestable
removed: no-qa
Changed in launchpad:
status: Fix Committed → Fix Released
Revision history for this message
Fred Drake (fdrake) wrote :

Fix committed to ZConfig trunk in r124374.

Changed in zconfig:
status: In Progress → Fix Committed
importance: Undecided → Medium
assignee: Henning Eggers (henninge) → Fred Drake (fdrake)
Revision history for this message
Fred Drake (fdrake) wrote :

Fix released in ZConfig 2.9.1.

Changed in zconfig:
status: Fix Committed → Fix Released
Revision history for this message
Robert Collins (lifeless) wrote :

Hi, sorry to be the bearer of bad news but the fix Henning did introduces another issue recorded in bug 861742 (now a duplicate of this)

So we're going to have to examine other ways of achieving the goal - e.g.
 - fixing CPython to not segfault on concurrent operations on the filehandle
 - using locking but not doing the rotation in the signal handler, instead doing it on the next lock operation to notice the rotation request

Changed in launchpad:
status: Fix Released → Triaged
assignee: Henning Eggers (henninge) → nobody
tags: removed: escalated
Revision history for this message
Stuart Bishop (stub) wrote :

To make this class of problems go away for Launchpad, we might want to stop sending signals to this rather large and complex beast. Noticing when the log file has been moved and reopening, or logging to a stream and having a helper daemon handle rotation could be much less frustrating long term.

Revision history for this message
Robert Collins (lifeless) wrote :

I've pushed up a candidate fix. Fred - let me know what you think?

William Grant (wgrant)
tags: added: webapp-infrastructure
Airkm (airkm)
information type: Public → Private
William Grant (wgrant)
information type: Private → Public
aboubacar (aboubacar12)
Changed in zconfig:
assignee: Fred Drake (fdrake) → aboubacar (aboubacar12)
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers