init script / mailmanctl fails to stop mailman 2, reports success
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
GNU Mailman | New | Undecided | Unassigned |
Bug Description
# systemctl restart mailman
Jun 13 11:43:27 lists.gnu.org systemd[1]: Stopping LSB: Mailman Master Queue Runner...
Jun 13 11:43:27 lists.gnu.org mailman[31096]: * Stopping Mailman master qrunner mailmanctl
Jun 13 11:43:27 lists.gnu.org systemd[1]: Stopped LSB: Mailman Master Queue Runner.
Jun 13 11:43:28 lists.gnu.org mailman[31096]: ...done.
Jun 13 11:43:27 lists.gnu.org systemd[1]: Starting LSB: Mailman Master Queue Runner...
Jun 13 11:43:31 lists.gnu.org mailman[31153]: * Starting Mailman master qrunner mailmanctl
Jun 13 11:43:31 lists.gnu.org mailman[31153]: The master qrunner lock could not be acquired because it appears as if another
Jun 13 11:43:31 lists.gnu.org mailman[31153]: master qrunner is already running.
Jun 13 11:43:31 lists.gnu.org mailman[31153]: ...done.
At this point, ps -ef | grep mailman shows 4 mailman processes remain:
/usr/bin/python /usr/lib/
and 3 qrunners, like this
/usr/bin/python /var/lib/
The qrunner log does show all the pids getting the TERM signal from mailmanctl:
Jun 13 11:43:27 2019 (21946) OutgoingRunner qrunner caught SIGTERM. Stopping.
But only 1 actually stopped. I manually send the qrunners kill signals over and over and
wait; about 5 minutes later they finally terminate, and mailmanctl with them.
Then I run systemctl restart mailman again, and it really starts this time:
Jun 13 11:48:51 lists.gnu.org systemd[1]: Stopping LSB: Mailman Master Queue Runner...
Jun 13 11:48:51 lists.gnu.org mailman[10762]: * Stopping Mailman master qrunner mailmanctl
Jun 13 11:48:51 lists.gnu.org mailman[10762]: PID unreadable in: /var/run/
Jun 13 11:48:51 lists.gnu.org mailman[10762]: [Errno 2] No such file or directory: '/var/run/
Jun 13 11:48:51 lists.gnu.org mailman[10762]: Is qrunner even running?
Jun 13 11:48:51 lists.gnu.org mailman[10762]: ...done.
Jun 13 11:48:51 lists.gnu.org systemd[1]: Stopped LSB: Mailman Master Queue Runner.
Jun 13 11:48:51 lists.gnu.org systemd[1]: Starting LSB: Mailman Master Queue Runner...
Jun 13 11:48:55 lists.gnu.org mailman[10775]: * Starting Mailman master qrunner mailmanctl
Jun 13 11:48:55 lists.gnu.org mailman[10775]: ...done.
Jun 13 11:48:55 lists.gnu.org systemd[1]: Started LSB: Mailman Master Queue Runner
I'm using mailman 2.1.23-
I really need to find a fix or workaround for this bug. Waiting 5 minutes to
restart mailman is no good: I run a lot of very active lists on lists.gnu.org.
Can I kill -9? Can I start mailman while the old qrunners are still exiting?
How can I help debug this to find a fix?
Based on what I see above, you have four OutgoingRunner processes each processing a quarter of the out queue space.
When you stop Mailman, three of these processes take some time to stop. I think that's because they won't stop in the middle of processing a message. The messages they are processing probably have a large number of recipients, and delivery to the MTA is possibly slow, so it takes some time for them to exit gracefully.
You could probably SIGKILL them, but upon restart the interrupted messages would be redelivered to all their recipients, duplicating the copies that were sent before the process was killed.
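If duplicates are not acceptable, an alternative is to wait for the runners to exit on their own before starting Mailman again. A minimal sketch, assuming a standalone Python 3 helper run as root or the mailman user, and assuming you already have the qrunner pids (for example from ps). The names `pid_alive` and `wait_for_exit` are hypothetical and not part of Mailman:

```python
import errno
import os
import time


def pid_alive(pid):
    """Return True if a process with this pid still exists.

    Signal 0 performs the existence/permission check without
    delivering any signal to the process.
    """
    try:
        os.kill(pid, 0)
    except OSError as e:
        # EPERM: the process exists but belongs to another user.
        # ESRCH (or anything else): treat it as gone.
        return e.errno == errno.EPERM
    return True


def wait_for_exit(pids, timeout=600.0, poll=1.0):
    """Poll until every pid has exited or the timeout elapses.

    Returns the list of pids that are still alive at the end
    (empty list means it is safe to start again).
    """
    deadline = time.monotonic() + timeout
    remaining = [p for p in pids if pid_alive(p)]
    while remaining and time.monotonic() < deadline:
        time.sleep(poll)
        remaining = [p for p in remaining if pid_alive(p)]
    return remaining
```

An empty return value means all the old runners are gone. A non-empty one lists the pids that outlived the timeout, at which point you can decide whether to keep waiting or accept the duplicate risk of SIGKILL. Note that a zombie process still counts as alive for this check.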
The real question is why is it taking 5+ minutes for the delivery of one message to all recipients?
What does Mailman's SMTP log show? I.e., what are the recipient counts and completion times in entries like:
Jun 12 20:53:14 2019 (2142) <message-id> smtp to listname for 265 recips, completed in 11.730 seconds
In particular, what do the last entries before the restart show?
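To pull those numbers out of a busy log, something like the following may help. It is only a sketch: the regular expression is derived from the single sample entry above, so other entry shapes in the log are simply skipped, and the function names and the 60-second threshold are assumptions chosen for illustration:

```python
import re

# Pattern modelled on the sample entry:
# "Jun 12 20:53:14 2019 (2142) <message-id> smtp to listname for 265 recips, completed in 11.730 seconds"
SMTP_LINE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ [\d:]{8} \d{4}) \((?P<pid>\d+)\) "
    r"(?P<msgid><[^>]*>) smtp to (?P<listname>\S+) "
    r"for (?P<recips>\d+) recips, completed in (?P<secs>[\d.]+) seconds"
)


def parse_smtp_line(line):
    """Return a dict of fields for a matching log line, else None."""
    m = SMTP_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "stamp": m.group("stamp"),
        "pid": int(m.group("pid")),
        "msgid": m.group("msgid"),
        "listname": m.group("listname"),
        "recips": int(m.group("recips")),
        "seconds": float(m.group("secs")),
    }


def slow_deliveries(lines, min_seconds=60.0):
    """Return parsed entries that took at least min_seconds, slowest first."""
    parsed = (parse_smtp_line(line) for line in lines)
    slow = [r for r in parsed if r is not None and r["seconds"] >= min_seconds]
    return sorted(slow, key=lambda r: r["seconds"], reverse=True)
```

Passing the open SMTP log file to `slow_deliveries` should list the deliveries that would hold up a graceful stop, together with their recipient counts.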