mailman doesn't shut down cleanly

Bug #753306 reported by Tom Haddon on 2011-04-07
This bug affects 2 people
Affects: Launchpad itself
Importance: Critical
Assigned to: Curtis Hovey

Bug Description

During the last few deployments (possibly longer) mailman hasn't shut down cleanly. We run the initscript stop - https://pastebin.canonical.com/45791/ - and the mailmanctl process is still running and needs to be manually killed. This makes mailman a tough target for a "nodowntime" deployment as it would block on failing to shut down.

Tom Haddon (mthaddon) wrote :

Per Rob Collins, marking as high priority.

tags: added: canonical-losa-lp
Changed in launchpad:
importance: Undecided → High
j.c.sackett (jcsackett) on 2011-04-07
Changed in launchpad:
status: New → Triaged
Tom Haddon (mthaddon) wrote :

I'm going to have to remove it from the "nodowntime" deployment target. See https://pastebin.canonical.com/45912/ and https://pastebin.canonical.com/45913/ (taken from after the first paste completed). I had to kill -9 the processes to get them to go away.

Tom Haddon (mthaddon) wrote :

Fwiw, looking at https://pastebin.canonical.com/45915/ it looks like the shutdown script is targeting "/usr/bin/python2.6 -S bin/run -i production-mailman" for shutdown rather than "/usr/bin/python2.6 ./mailmanctl -s start". I'm not sure if this is correct or not, but it explains why the mailmanctl process is still around after a shutdown.

I'll move these notes to the bug.

Robert Collins (lifeless) wrote :

On Tue, Apr 12, 2011 at 7:43 AM, Robert Collins
<email address hidden> wrote:
> I'll move these notes to the bug.

<FAIL> I thought this was the RT; my bad.

Barry Warsaw (barry) wrote :

It's been ages since I looked at this stuff, and IIRC we had to play funny games to get Mailman's startup/shutdown procedure hooked in, but Mailman itself really wants to be shut down with `mailmanctl stop`, so that's what you eventually need to ensure gets called.

Tom Haddon (mthaddon) wrote :

This happened again at the rollout for 11.06

Barry Warsaw (barry) wrote :

Ping me if you need some help looking into this.

Changed in launchpad:
importance: High → Critical
Gary Poster (gary) wrote :

I wonder if this is tied in with my diagnosis of bug 791492: mailman's xmlrpc client to LP does not have a socket timeout set, so if LP is brought down while Mailman is trying to talk to it, then maybe it's the socket that's not letting the process die. I'll only claim to try and fix the other bug now, but if this problem goes away because of adding a timeout...wouldn't that be nice!
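
For illustration only, here is a minimal sketch of what giving an XML-RPC client a socket timeout can look like. It uses Python 3's `xmlrpc.client` rather than the Python 2.6 `xmlrpclib` seen in the process listings, and the class name `TimeoutTransport`, the helper `make_proxy`, and the 5 second default are assumptions made for the example, not the code from the fix for bug 791492.

```python
import xmlrpc.client


class TimeoutTransport(xmlrpc.client.Transport):
    """Hypothetical transport that applies a socket timeout to its connection."""

    def __init__(self, timeout=5.0, **kwargs):
        super().__init__(**kwargs)
        self.timeout = timeout

    def make_connection(self, host):
        # The base class builds an http.client.HTTPConnection that has not
        # connected yet; setting its timeout here applies to the later connect
        # and to every blocking read or write on the socket.
        connection = super().make_connection(host)
        connection.timeout = self.timeout
        return connection


def make_proxy(url, timeout=5.0):
    """Build a ServerProxy whose calls fail after `timeout` seconds of silence."""
    return xmlrpc.client.ServerProxy(url, transport=TimeoutTransport(timeout))
```

With a transport like this, a call against a server that has gone away raises a socket timeout error instead of blocking indefinitely, which gives the caller a chance to notice a pending shutdown.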

Barry Warsaw (barry) wrote :

On Jul 26, 2011, at 06:12 PM, Gary Poster wrote:

>I wonder if this is tied in with my diagnosis of bug 791492: mailman's
>xmlrpc client to LP does not have a socket timeout set, so if LP is
>brought down while Mailman is trying to talk to it, then maybe it's the
>socket that's not letting the process die. I'll only claim to try and
>fix the other bug now, but if this problem goes away because of adding a
>timeout...wouldn't that be nice!

One thing to look at. When Mailman doesn't shut down, which of the qrunners
is still running? All of them, or only XMLRPCRunner?

If it's all of them, then there may be a problem propagating signals from the
master to the child qrunners. If it's just the XMLRPCRunner, then your
scenario could indeed be happening. Note that if any of the child qrunners
don't exit, the master won't exit either since it's waiting on the pids of all
its children.
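
To make the master/child relationship concrete, here is a small sketch, assuming a POSIX system, of a master that forwards SIGTERM to its children and then waits on each of them. The function names and the sleeping child processes are hypothetical stand-ins; this is not mailmanctl's actual implementation.

```python
import signal
import subprocess
import sys


def start_children(count):
    """Start `count` placeholder children; a real qrunner would process a queue."""
    command = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(command) for _ in range(count)]


def stop_children(children, timeout=5.0):
    """Send SIGTERM to every child and return the pids that did not exit in time."""
    for child in children:
        child.send_signal(signal.SIGTERM)
    still_running = []
    for child in children:
        try:
            # The master blocks here once per child, so one stuck child is
            # enough to keep the master itself from exiting.
            child.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            still_running.append(child.pid)
    return still_running
```

An empty return value means every child exited; any pid in the list identifies a child that would keep a waiting master alive.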

Gary Poster (gary) wrote :

Thanks, Barry. How can the LOSAs determine which qrunners are still running in this circumstance? The ps listing that Tom gave earlier does not show anything clearly that I see. https://pastebin.canonical.com/45913/

Curtis Hovey (sinzui) wrote :

The pastebin does show that XMLRPCRunner is running. It is the only qrunner in fact.

\_ /usr/bin/python2.6 /.../lib/mailman/bin/qrunner --runner=XMLRPCRunner:0:1 -s
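
As a rough aid for reading such listings, the sketch below pulls the runner names out of ps output by looking for the `--runner=` argument shown in the line above. The function name `running_qrunners` and the assumption that every qrunner carries a `--runner=Name:slice:count` argument are mine, based only on that one line.

```python
import re

# Matches arguments such as --runner=XMLRPCRunner:0:1
RUNNER_PATTERN = re.compile(r"--runner=(\w+):(\d+):(\d+)")


def running_qrunners(ps_output):
    """Return the runner names found in a ps listing, in order of appearance."""
    names = []
    for line in ps_output.splitlines():
        match = RUNNER_PATTERN.search(line)
        if match:
            names.append(match.group(1))
    return names
```

Lines without a `--runner=` argument, such as the mailmanctl master itself, are skipped, so the result lists only the qrunners that are still alive.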

Curtis Hovey (sinzui) wrote :

Oh, and compare https://pastebin.canonical.com/45915/ from 2011-04-11 with https://pastebin.canonical.com/45913/ from 2011-07-28. It is clear that all are running in the former, but only XMLRPCRunner is up in the latter.

Gary Poster (gary) wrote :

Ah, thanks Curtis! My screen wasn't wide enough.

I'm going to claim that my fix for 791492 is also a fix for this bug then, because it is the only clear problem I see. If I am wrong, we can reopen when we discover that.

Changed in launchpad:
assignee: nobody → Gary Poster (gary)
status: Triaged → In Progress
Changed in launchpad:
status: In Progress → Fix Released
Haw Loeung (hloeung) wrote :

It seems this is still a problem - https://pastebin.canonical.com/50568/

I have removed it from the nodowntime set for the time being.

Changed in launchpad:
status: Fix Released → Triaged
Robert Collins (lifeless) wrote :

Barry suggested getting a list of the running processes, so we need to
gather that.

Can you please do a deploy to mailman but capture the processes that
keep running?

Haw Loeung (hloeung) wrote :

Process list when trying to stop:

https://pastebin.canonical.com/50569/

Process list after the deployment failed:

https://pastebin.canonical.com/50570/

Gary Poster (gary) wrote :

Haw's process list shows what Tom's did from before my change: the XMLRPCRunner seems to be the one that is hanging.

One interesting thing is that it is also the only runner going *before* it hangs.

I still wonder if this is tied to the XMLRPCRunner starting to talk to LP before it goes down, but simply providing a socket timeout is apparently not sufficient, unfortunately.

I suspect that, for the next hang, we should get a Python gdb analysis of the Mailman processes with backtrace.py, a la what we do for LP (https://dev.launchpad.net/Debugging/GDB).

Gary Poster (gary) on 2012-01-06
Changed in launchpad:
assignee: Gary Poster (gary) → nobody
Tom Haddon (mthaddon) wrote :

Bitten by this again today.

William Grant (wgrant) on 2012-10-19
tags: added: mailing-lists
Curtis Hovey (sinzui) wrote :

Mailman shuts down each queue runner when it completes a loop. The xmlrpc runner takes 15 minutes to accomplish oneloop() on the current hardware. On the older server that loop would have been closer to 30 minutes. Shutting down mailman will always take as much time as needed to complete the loops.

The two slowest queues are the xmlrpc runner and the archive runner. For the former, we might add checks to exit the loop early between the steps in oneloop(). The archive runner is a separate problem because the runner waits for mhonarc to complete the regeneration of the archive indexes, which can take 3 minutes for the large archives. Since the pastebins only show the xmlrpc runner, I propose just fixing it to exit early.

Curtis Hovey (sinzui) wrote :

mailman's Runner class provides shortcircuit() which can be used by oneloop() to return early. This is not used by the XMLRPCRunner because it ignores slices. Since we know that many minutes pass for each call to oneloop(), the method can check shortcircuit() between the many atomic steps. The method has 3 calls in it which can be guarded with shortcircuit(). The call to get_subscriptions() can be very long. get_subscriptions() has a looping strategy to ensure it works in small batches which allow for shortcircuit() to be checked for an early exit.
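
The idea can be sketched roughly as follows. `SketchRunner`, its `stop()` method, and `process_in_batches` are hypothetical names for illustration; this is not Mailman's Runner class nor the change that landed, only the shape of guarding each step and each batch with a stop check.

```python
import threading


class SketchRunner:
    """Hypothetical stand-in for a queue runner with a guarded oneloop()."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, callable) pairs run in order
        self.completed = []
        self._stop = threading.Event()

    def stop(self):
        # In a real runner this would be set from the shutdown signal handler.
        self._stop.set()

    def shortcircuit(self):
        return self._stop.is_set()

    def oneloop(self):
        """Run each step, returning False early if a stop was requested."""
        for name, step in self.steps:
            if self.shortcircuit():
                return False
            step()
            self.completed.append(name)
        return True


def process_in_batches(items, batch_size, handle_batch, should_stop):
    """Handle items in small batches, checking for a stop between batches."""
    handled = 0
    for start in range(0, len(items), batch_size):
        if should_stop():
            break
        batch = items[start:start + batch_size]
        handle_batch(batch)
        handled += len(batch)
    return handled
```

With checks placed like this, the worst-case shutdown delay is bounded by the longest single step or batch rather than by the whole loop.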

Curtis Hovey (sinzui) wrote :

This bug is a symptom of Bug #889326, but the proposed changes will separate the two issues.

Curtis Hovey (sinzui) on 2013-01-04
Changed in launchpad:
assignee: nobody → Curtis Hovey (sinzui)
status: Triaged → In Progress
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
Curtis Hovey (sinzui) wrote :

We confirmed staging's mailman shut down and restarted correctly with the fix.

tags: added: qa-ok
removed: qa-needstesting
Steve Kowalik (stevenk) on 2013-01-09
Changed in launchpad:
status: Fix Committed → Fix Released