init script / mailmanctl fails to stop mailman 2, reports success

Bug #1832740 reported by Ian Kelling
This bug affects 1 person
Affects        Status   Importance   Assigned to   Milestone
GNU Mailman    New      Undecided    Unassigned

Bug Description

# systemctl restart mailman

Jun 13 11:43:27 lists.gnu.org systemd[1]: Stopping LSB: Mailman Master Queue Runner...
Jun 13 11:43:27 lists.gnu.org mailman[31096]: * Stopping Mailman master qrunner mailmanctl
Jun 13 11:43:27 lists.gnu.org systemd[1]: Stopped LSB: Mailman Master Queue Runner.
Jun 13 11:43:28 lists.gnu.org mailman[31096]: ...done.
Jun 13 11:43:27 lists.gnu.org systemd[1]: Starting LSB: Mailman Master Queue Runner...
Jun 13 11:43:31 lists.gnu.org mailman[31153]: * Starting Mailman master qrunner mailmanctl
Jun 13 11:43:31 lists.gnu.org mailman[31153]: The master qrunner lock could not be acquired because it appears as if another
Jun 13 11:43:31 lists.gnu.org mailman[31153]: master qrunner is already running.
Jun 13 11:43:31 lists.gnu.org mailman[31153]: ...done.

At this point, ps -ef | grep mailman shows 4 mailman processes remain:

/usr/bin/python /usr/lib/mailman/bin/mailmanctl -s -q start
and 3 qrunners, like this
/usr/bin/python /var/lib/mailman/bin/qrunner --runner=OutgoingRunner:1:4 -s

The qrunner log does show all the pids getting the TERM signal from mailmanctl:
Jun 13 11:43:27 2019 (21946) OutgoingRunner qrunner caught SIGTERM. Stopping.

But only 1 actually stopped. I manually sent the remaining qrunners kill signals
over and over; about 5 minutes later they finally terminated, and mailmanctl with them.
Then I ran systemctl restart mailman again, and this time it really started:

Jun 13 11:48:51 lists.gnu.org systemd[1]: Stopping LSB: Mailman Master Queue Runner...
Jun 13 11:48:51 lists.gnu.org mailman[10762]: * Stopping Mailman master qrunner mailmanctl
Jun 13 11:48:51 lists.gnu.org mailman[10762]: PID unreadable in: /var/run/mailman/mailman.pid
Jun 13 11:48:51 lists.gnu.org mailman[10762]: [Errno 2] No such file or directory: '/var/run/mailman/mailman.pid'
Jun 13 11:48:51 lists.gnu.org mailman[10762]: Is qrunner even running?
Jun 13 11:48:51 lists.gnu.org mailman[10762]: ...done.
Jun 13 11:48:51 lists.gnu.org systemd[1]: Stopped LSB: Mailman Master Queue Runner.
Jun 13 11:48:51 lists.gnu.org systemd[1]: Starting LSB: Mailman Master Queue Runner...
Jun 13 11:48:55 lists.gnu.org mailman[10775]: * Starting Mailman master qrunner mailmanctl
Jun 13 11:48:55 lists.gnu.org mailman[10775]: ...done.
Jun 13 11:48:55 lists.gnu.org systemd[1]: Started LSB: Mailman Master Queue Runner

I'm using mailman 2.1.23-1+deb9u4+8.0trisquel1 on trisquel 8, which has Python 2.7.12.

I really need to figure out a fix or workaround for this bug. Waiting 5 minutes to
restart mailman is no good; I run a lot of very active lists on lists.gnu.org.
Can I kill -9? Can I start mailman while the old qrunners are still exiting?
How can I help debug this to find a fix?

Mark Sapiro (msapiro) wrote :

Based on what I see above, you have four OutgoingRunner processes each processing a quarter of the out queue space.

When you stop Mailman, three of these processes take some time to stop. I think that's probably because they won't stop in the middle of processing a message. The messages they are processing probably have a large number of recipients, and delivery to the MTA is possibly slow, so it takes some time for them to exit gracefully.

You could probably SIGKILL them, but upon restart the interrupted messages would be sent again to all of their recipients, duplicating the copies that were sent before the process was killed.

The real question is why is it taking 5+ minutes for the delivery of one message to all recipients?

What does Mailman's SMTP log show? I.e., what are the recipient counts and times in messages like:

Jun 12 20:53:14 2019 (2142) <message-id> smtp to listname for 265 recips, completed in 11.730 seconds

In particular, the last entries before the restart?
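
For pulling those numbers out of a large log, a minimal Python sketch follows. It assumes only the line format quoted above; the helper names (parse_smtp_line, slow_deliveries) and the 60-second threshold are made up for illustration and are not part of Mailman.

```python
import re

# Matches SMTP log lines of the form quoted in this thread, e.g.
# "Jun 12 20:53:14 2019 (2142) <message-id> smtp to listname for 265 recips,
#  completed in 11.730 seconds"
SMTP_LINE = re.compile(
    r"^(?P<when>\w{3}\s+\d+ \d\d:\d\d:\d\d \d{4}) \((?P<pid>\d+)\) "
    r"(?P<msgid>.*?) smtp to (?P<listname>\S+) for (?P<recips>\d+) recips, "
    r"completed in (?P<seconds>[\d.]+) seconds"
)


def parse_smtp_line(line):
    """Return a dict of the fields in one SMTP log line, or None if it does not match."""
    match = SMTP_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "when": match.group("when"),
        "pid": int(match.group("pid")),
        "listname": match.group("listname"),
        "recips": int(match.group("recips")),
        "seconds": float(match.group("seconds")),
    }


def slow_deliveries(lines, threshold=60.0):
    """Return parsed records whose delivery took at least `threshold` seconds.

    The default threshold is an arbitrary illustrative value.
    """
    records = []
    for line in lines:
        record = parse_smtp_line(line)
        if record is not None and record["seconds"] >= threshold:
            records.append(record)
    return records
```

Passing an open SMTP log file to slow_deliveries() lists the deliveries that would hold up a stop for a minute or more; the location of that log depends on the installation.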

Ian Kelling (iank) wrote :

Sorry, I didn't get around to replying earlier. You are right, smtp is
going slow and it seems that's causing the problem.

Sep 26 06:33:30 2019 (22952) <email address hidden> smtp to qemu-devel for 1053 recips, completed in 328.854 seconds

It looks like messages go through much, much faster when exim isn't
already busy with a lot of other messages. But generally, I can't expect
to get exim to reliably accept 1000 messages much faster. So, I think
the init script/settings need to wait longer and not report success
when it has actually failed.
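
To illustrate what "wait longer and don't report success" could look like, here is a minimal Python sketch. It is not the logic of the actual init script: the helper names (read_pidfile, pid_alive, wait_for_exit) are hypothetical, and the caller would have to pick a timeout longer than the slowest delivery seen in the SMTP log. The pid file path in the comment is the one shown in the log output above.

```python
import errno
import os
import time


def read_pidfile(path):
    """Return the PID stored in `path`, or None if it is missing or unreadable.

    In this thread the master qrunner's pid file is
    /var/run/mailman/mailman.pid; other installations may differ.
    """
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (IOError, OSError, ValueError):
        return None


def pid_alive(pid):
    """Return True if a process with this PID exists.

    Signal 0 performs the existence and permission checks without
    delivering a signal.
    """
    try:
        os.kill(pid, 0)
    except OSError as exc:
        # EPERM: the process exists but belongs to another user.
        return exc.errno == errno.EPERM
    return True


def wait_for_exit(pids, timeout, interval=1.0):
    """Poll until none of `pids` is alive or `timeout` seconds have passed.

    Returns the PIDs that are still alive, so an empty list means the
    stop really completed and anything else should be reported as failure.
    """
    deadline = time.time() + timeout
    remaining = [pid for pid in pids if pid_alive(pid)]
    while remaining and time.time() < deadline:
        time.sleep(interval)
        remaining = [pid for pid in remaining if pid_alive(pid)]
    return remaining
```

A stop wrapper built on this would read the master PID before asking mailmanctl to stop, call wait_for_exit() on it, and exit non-zero if the returned list is not empty, so that the following start is not attempted while the old master still holds the lock.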

Mark Sapiro (msapiro) wrote :

A workaround for your issue is to run (as root or the mailman user)

/usr/lib/mailman/bin/mailmanctl restart

instead of

systemctl restart mailman

systemctl restart actually does a stop followed by a start, as you observed, whereas mailmanctl restart will just signal the qrunners to restart.

You might also be able to adjust your systemd mailman service script to do mailmanctl restart instead of stop/start.
