bounce mail processed multiple times -> oom crash of BounceRunner

Bug #1859011 reported by Michael Menge on 2020-01-09
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
GNU Mailman
Medium
Mark Sapiro

Bug Description

System: RHEL 7.7
Mailman: mailman-2.1.15-26.el7_4.1.x86_64

Description: We have 2 Mailinglist with > 7000 Members which trigger a spike in processed bounces (more than 3000 in an sinle run), followed by an Out of Memory situation in the BounceRunner and an > 20 GB bounce-event-XXXX.pck
file.

We tried to mitigate the problem by increasing the system memory, running the BoucneRunner
every minute and limiting the number of mails delivered at ounce by postfix.

But it happened again:

Dec 30 19:53:29 2019 (13392) <BounceRunner at 140395473885088> processing 4134 queued bounces

...
Dec 30 19:53:59 mx09 kernel: [13392] 41 13392 2755797 1874474 5337 825695 0 python
...
Dec 30 19:53:59 mx09 kernel: Out of memory: Kill process 13392 (python) score 896 or sacrifice child
Dec 30 19:53:59 mx09 kernel: Killed process 13392 (python), UID 41, total-vm:11023188kB, anon-rss:7497896kB, file-rss:0kB, shmem-rss:0kB

We analyzed the bounce-event file, extracting data with "stings".
This time we extracted the postfix mail queue ids from the received headers
with our listserver. We found the following:

cat /tmp/bounce-20191230.txt | sed 's/;//' | sort | uniq -c | sort -n

      1 01A7DE9314
      1 10F6AE9319
      1 18456E930E
      1 27D0BAC960
      1 3B51CE9316
      1 57C2DAC992
      1 5D3B2E9310
      1 5EF11E9311
      1 63054E9312
      1 69377E9313
      1 ED636E930F
      2 29884E9315
      2 49ECEAC98D
      2 99A16A9DA7
      2 (Postfix)
      3 59EE1AC995
      3 CEB61AC996
    192 C12E9D3B48
   3929 F2BEEE9318
   4122 CC58AAC993
   4134 6EFD1A9DA7

As the bounces the last 2 qids are from the original mail send to the list (6EFD1A9DA7),
and one of mails send by mailman to 500 members of that list (CC58AAC993).

(F2BEEE9318 and C12E9D3B48) are both bounces from members that where deliverd only once
to /usr/lib/mailman/mail/mailman bounces NAME-OF-HUGE-LIST

So from me it seems that somehow some few bounce where "multiplied" so that ~20 real
bounce produced 4134 virtual bounce.

I see the potential of a deny of service attack, as it could be used
to fill up the disk where the bounce-event files get dumped
But I don't know if this would warrant marking "This bug is a security vulnerability".

Related branches

Mark Sapiro (msapiro) wrote :

We need more information to know what's happening.

You say the MTA log shows only one delivery for, e.g., F2BEEE9318, to /usr/lib/mailman/mail/mailman bounces NAME-OF-HUGE-LIST yet strings on the bounce event file finds F2BEEE9318 a total of 3929 times. Can you examine the bounce-events-ppppp.pck file with Mailman's bin/dumpdb and see what the actual bounce messages are and where they all come from?

Changed in mailman:
status: New → Incomplete
Mark Sapiro (msapiro) wrote :

If it does appear that Mailman is somehow generating these multiple bounce-events entries, you might compare your BounceRunner.py to the 2.1.15 base at https://bazaar.launchpad.net/~mailman-coders/mailman/2.1/view/1357/Mailman/Queue/BounceRunner.py to see if there might be some corruption.

Michael Menge (mmenge) wrote :

Thanks for the hint to the dumpdb command.

though the command was killed by the system, it extracted sufficient data,
and I was able to extract some useful informations.
I did split the output into files for each object (csplit '/^<----- start object/' '{*}')

1. one bounced mail (qid F2BEEE9318) was multiple times in the bonce-event-13392.pck

2. the dumped objects for the duplicated bounce differ only in the 1st and 3rd line,
   first line beeing '<----- start object' and second line been the mailinglist

3. I guess the detected mail-address should be in the third line. For the first
   instance of the duplicated mail this was '\x00', for the other objects it
   seems to contain "random" data e.g.

   '\x00',
   '\x16\xc9\xeaNq\xf8\xfbU\xe9\xa6\xbd\xbc\x92\x19\xaf\xee.o.\xa3M\x82i\x98g\xaep=\xb3\\\x94S_[\x01',
   'y\x9d\xf5O\x88\x05O\x88\x05O\xb0r/+\xbfQ\xd4<\xb2\xd7\xd2\xec\xf7E\xda\x1f\x86\xa7e\xb0\xa7e\xb0\xa52\xd8\xd3\xf6\xe6\x16{s\x8b\x0c\xb6\xd4\xfe|\xd4\xfe\xdcb\x7f',
   "\x87\xe5\xc7\xe3\xaa\xd2\xc5\xff\x8f\xb6\xbe\x15\xe2\x9b\xb8b\x8a\xa0\x842\xa8\x87\t0\xe5a\xd0\xb1\x9bqY\x1d\xc6\x9e\xddYY\x19\x93\xa7|\x14\x83\x02\xd8\xd9\xb3\x10\x83\xe6\xbb\xd9\x1f'\x13tVk\xb9\xb3Dq\x85\xd8`\xaa-W^!L\x81\xf2\xdew\xde~\x1a\x1f\xcf\xc7\x96\xf9\x9eg\xbe\xb7{^\xe91\xf6",

4. The envelope-from of the bounce mail did contain '=' 2 times
   <email address hidden> while the from address is
   <email address hidden>

5. the bounced mail included the complete original mail with pdf attachment ~5 MB

There is no change in my BounceRunner.py compared to the linked version.

Mark Sapiro (msapiro) wrote :

Each object in the bounce events is a 4-tuple like (list-name, bouncing address, bounce date, bounce message). The bounce message is a Mailman.Message.Message object, but should be and apparently is rendered as the text of the message. So yes, the third line is the bouncing addresses, and it is clearly corrupted, but I suspect that is a symptom, not a cause.

I still don't really understand what's happening. Given that the message contained a large PDF attachment, I suspect this isn't a real bounce, but rather is spam sent to the list-bounces address or possibly a list post mis-sent to the list-bounces address. The size of it may have something to do with the corruption that occurred, although I still don't understand how the duplication comes about.

What's in Mailman's bounce log from the time this happened. I'm particularly interested in any 'bounce message w/no discernable addresses' or other messages just before the "Dec 30 19:53:29 2019 (13392) <BounceRunner at 140395473885088> processing 4134 queued bounces" message.

Also are there any "VERP_PROBE_REGEXP doesn't yield the right match groups:" messages in Mailman's error log from this time.

Michael Menge (mmenge) wrote :

To clarify some points:

1. the bounce mail is genuine. My logs show that a ~5MB mail was send to the
list by the owner of the list, it was distributed to the members of the list,
and caused some bounce mails by some recipients. The mail that caused the
problem was normal non delivery notification with the characteristic that
the complete original mail was included as message/rfc822.

2. I don't know if the size of the bounced mail has an impact on the number
of duplications, but the size combined with the duplications resulted
in the bounce process requesting using ~20 GB of ram

3. My logs indicate some cases with >2000 processed bounces that didn't crash
the BounceRunner, most likely there was also a duplication, but the mail
was smaller

At the moment I try to create a striped down version that I can share
that reproduces the problem

Michael Menge (mmenge) wrote :

I was able to create a striped down version of the mail.

I also found that the size of the attachment has an influence
on the number of dulicates

I have created 2 examples

Michael Menge (mmenge) wrote :
Mark Sapiro (msapiro) wrote :

Thanks for your work on this. I will use your examples to try to recreate the issue and diagnose further.

Changed in mailman:
assignee: nobody → Mark Sapiro (msapiro)
importance: Undecided → Medium
milestone: none → 2.1.30
status: Incomplete → Triaged
Mark Sapiro (msapiro) wrote :

I know what's going on. The bounce message itself is unusual. It consists of a simple text/plain body containing the text

----------------------------------------------------------------------------

This message was created automatically by mail delivery software.

A message that you sent could not be delivered to one or more of its recipients.

One or more of the recipients are not a valid user of the system. Please check to be sure the email address is correct.

----------------------------------------------------------------------------

followed by a message/rfc822 part containing the entire original message.

Bounce processing has a number of recognizers that examine the message and if they recognize it, return a list of the bouncing addresses. One of these recognizers matches RFC compliant DSNs, but there are many non-compliant DSNs that we have seen that we try to recognize. Another 'heuristic' recognizer is SimpleMatch. It looks at every line of the message until it finds a line matching a start pattern. Tnen it continues to look for lines matching an address pattern until it finds a line matching a stop pattern.

In this case, the start pattern that matches is

'A message that you( have)? sent could not be delivered'

It then looks for addresses matching

'<(?P<addr>[^>]*)>'

until it finds a line matching

'---'

It never finds the stop pattern so it goes through the entire base64 decoded PDF line by line looking for any string between '<' and '>', and it finds a bunch of them, each of which is returned as a bouncing address even though they don't look at all like addresses, and in fact, the actual bouncing address is nowhere at all in the bounce message. Then each 'address' is recorded as a bounce in the bounce_events file.

I will change the address patterns so the string inside the angle brackets has to at least resemble an email address. That should fix this.

Changed in mailman:
status: Triaged → In Progress
Mark Sapiro (msapiro) on 2020-01-17
Changed in mailman:
status: In Progress → Fix Committed
Mark Sapiro (msapiro) wrote :

Actually, https://bazaar.launchpad.net/~mailman-coders/mailman/2.1/revision/1834 will fix your issue, but it is not completely correct and https://bazaar.launchpad.net/~mailman-coders/mailman/2.1/revision/1835 fixes that, so yes, install both.

Mark Sapiro (msapiro) on 2020-04-13
Changed in mailman:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers