REST server crashes on "reopen"
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
| GNU Mailman |
High
|
Barry Warsaw |
Bug Description
Mailman's REST server crashes on the "mailman reopen" command. Here are the only lines I get in the logs:
May 26 12:29:01 2013 (30830) Master started
May 26 12:29:05 2013 (30844) RESTRunner runner started.
May 26 12:29:05 2013 (30844) Starting REST server
May 26 12:29:05 2013 (30837) DigestRunner runner started.
May 26 12:29:05 2013 (30843) RetryRunner runner started.
May 26 12:29:05 2013 (30848) CommandRunner runner started.
May 26 12:29:05 2013 (30845) BounceRunner runner started.
May 26 12:29:05 2013 (30846) LMTPRunner runner started.
May 26 12:29:05 2013 (30850) NNTPRunner runner started.
May 26 12:29:05 2013 (30838) PipelineRunner runner started.
May 26 12:29:06 2013 (30839) ArchiveRunner runner started.
May 26 12:29:06 2013 (30849) VirginRunner runner started.
May 26 12:29:06 2013 (30836) IncomingRunner runner started.
May 26 12:29:06 2013 (30840) OutgoingRunner runner started.
May 26 12:29:33 2013 (30830) Master watcher caught SIGHUP. Re-opening log files.
May 26 12:29:33 2013 (30848) CommandRunner runner caught SIGHUP. Reopening logs.
May 26 12:29:33 2013 (30843) RetryRunner runner caught SIGHUP. Reopening logs.
May 26 12:29:33 2013 (30838) PipelineRunner runner caught SIGHUP. Reopening logs.
May 26 12:29:33 2013 (30837) DigestRunner runner caught SIGHUP. Reopening logs.
May 26 12:29:33 2013 (30839) ArchiveRunner runner caught SIGHUP. Reopening logs.
May 26 12:29:33 2013 (30836) IncomingRunner runner caught SIGHUP. Reopening logs.
May 26 12:29:33 2013 (30846) LMTPRunner runner caught SIGHUP. Reopening logs.
May 26 12:29:33 2013 (30849) VirginRunner runner caught SIGHUP. Reopening logs.
May 26 12:29:33 2013 (30845) BounceRunner runner caught SIGHUP. Reopening logs.
May 26 12:29:33 2013 (30850) NNTPRunner runner caught SIGHUP. Reopening logs.
May 26 12:29:33 2013 (30840) OutgoingRunner runner caught SIGHUP. Reopening logs.
As you can see in the "caught SIGHUP" lines, there is nothing about the REST server, and indeed there is no "runner --runner=rest:0:1" process running anymore. I have to restart mailman to get the REST server back online.
Aurélien Bompard (abompard) wrote : | #1 |
Barry Warsaw (barry) wrote : | #2 |
To be honest, I don't remember why I added that. Note that the REST server is the only one that doesn't intercept signals. I tried commenting out that line (since the default is True) and cli testing seemed just fine. The test suite passes too. I wonder if it was due to a problem in Python 2.6 which perhaps is fixed now in 2.7?
Try commenting out line 43 in rest.py and see if things work better for you. If so, then perhaps we can just remove this mis-feature. I sure wish I remembered why I added it though.
Aurélien Bompard (abompard) wrote : | #3 |
I've been running without this line for a few days already, and it seems to work fine.
Barry Warsaw (barry) wrote : | #4 |
Okay, I've commented out the line in RESTRunner, but not removed the basic functionality to override interception of signals. That way, if I remember why I added that in the first place, it will be easier to re-enable. OTOH, if nothing bad happens, then we should eventually remove the flag.
Changed in mailman: | |
milestone: | none → 3.0.0b4 |
assignee: | nobody → Barry Warsaw (barry) |
importance: | Undecided → High |
status: | New → Fix Committed |
Aurélien Bompard (abompard) wrote : | #5 |
OK, now I know why you added this flag. Removing it does not work, the REST server still exits on reopen (I don't know how I missed this, it used to work). But it's an exit, not a crash. In the logs I get "REST server exiting", because the server gets an interrupted system call error. This is very probably linked to http://
In python upstream they suggested to simply restart the select call, so I'll prepare a patch to restart the serve_forever() call.
Can you think of a reason where we would really want the REST server to exit on an interrupted system call ?
Changed in mailman: | |
status: | Fix Committed → In Progress |
Aurélien Bompard (abompard) wrote : | #6 |
Here's the patch to restart the REST server on an interrupted system call. I tried to make use of the existing runner infrastructure so it's a short patch, but it may not be a good idea (I don't want to hijack something not designed for this purpose).
Barry Warsaw (barry) wrote : | #7 |
I generally like that you're using the runner infrastructure for this. Is there any chance you can come up with a unittest case for this? (I'm still working on one but haven't gotten it working yet.)
Aurélien Bompard (abompard) wrote : | #8 |
Here's a testcase that I finally managed to get working. Download the attached file to src/mailman/
Barry Warsaw (barry) wrote : | #9 |
This turned out to be a more extensive branch than anticipated because of weirdness in three competing constraints. To properly shutdown the WSGI server, the .shutdown() method must be called in a separate thread from the server. But because of SQLite constraints, the server must run in the main thread, and because of Python signal handling semantics, the signal handlers must run in the main thread. I solved this, by running a subthread that just waits on an Event and then calls the .shutdown(). The signal handler fires the Event.
I also greatly simplified the bin/runner implementation and removed some command line options that weren't really being used.
Thanks for the test code! I did a little bit of refactoring, but it was very helpful.
Changed in mailman: | |
status: | In Progress → Fix Committed |
Changed in mailman: | |
status: | Fix Committed → Fix Released |
After looking at the source code, it seems to come from the fact that the REST runner has the intercept_signals attribute set to False. Any reason for that ?