MM changes content-transfer-encoding when adding footer/scrubbing

Bug #373083 reported by Petr Hroudný
This bug affects 2 people

Affects | Status | Importance | Assigned to | Milestone
GNU Mailman | Confirmed | Undecided | Mark Sapiro |

Bug Description

When MM is adding header/footer or scrubbing, as a side effect it also changes content-transfer-encoding of the message from 8bit to e.g. base64.

This is highly undesirable in some cases. For instance, a mailing list might be used to distribute trouble tickets or other content which is expected to be easily parsable by automated text-based utilities. At the same time, with base64, emails grow in size by about 33% and get much higher spam scores, since base64 is typically used by spammers to obfuscate the payload.
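
The size and parsing cost can be illustrated with a short Python sketch. The ticket text is made up for illustration and only the standard `base64` module is used; real mail adds line breaks to the encoded body, so the growth there is slightly higher.

```python
import base64

# Hypothetical 8bit message body (UTF-8 text with non-ASCII characters).
body = ("Ticket #1234: příliš žluťoučký kůň\n" * 100).encode("utf-8")

# Base64 maps every 3 input bytes to 4 output characters.
encoded = base64.b64encode(body)
growth = len(encoded) / len(body)

# A plain text search that works on the 8bit body fails on the encoded one.
found_in_8bit = b"Ticket #1234" in body
found_in_base64 = b"Ticket #1234" in encoded

print(f"growth factor: {growth:.2f}")
print(found_in_8bit, found_in_base64)
```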

MM does not change content-transfer-encoding when no header/footer is configured.

The attached patches try to fix the problem by preserving the original Content-Transfer-Encoding even when adding header/footer or scrubbing. I believe adding a footer should be as non-intrusive as possible, so keeping the original Content-Transfer-Encoding is as important as keeping e.g. format=flowed, which is being done already.

Petr Hroudný (petr-hroudny) wrote :
Mark Sapiro (msapiro) wrote :

I note that the patch to Decorate.py changes the order of trying to encode the message from trying list's preferred_language charset first to trying the original message's charset first. Do you have any actual examples of the former strategy causing a problem?

I think I can understand your thinking in doing this, but I'm concerned because the original ordering was done by Tokio Kikuchi who, unlike me, has much experience with Japanese language lists, and he may have had good reasons for doing it as he did.

Changed in mailman:
assignee: nobody → Mark Sapiro (msapiro)
status: New → Confirmed
Petr Hroudný (petr-hroudny) wrote :

The motivation here is exactly the same - to preserve the original message format as much as possible.

Changing charset in the path between the sender and the recipient breaks automated processing as described above. Moreover, several MUAs make an assumption that it's best to reply in the same charset in which the original message arrived. When the listserver changes charset, this assumption becomes invalid.

Examples:

- sender generates email in UTF-8 to ensure uniform encoding in all cases. However, mailman configured for language with iso-8859-* charset sometimes "downgrades" that to iso-8859-*, sometimes keeps UTF-8 - depending on email content.

- sender generates email in iso-8859-1 since it can't display anything else properly. However MM configured for language with UTF-8 charset always changes that to UTF-8 and replies will use that.

The latter will become more visible after you change all languages to UTF-8. To be precise, I'm all in favour of changing all languages to UTF-8 as soon as possible for obvious reasons, but I believe you need to apply this patch beforehand otherwise some people with legacy environment will start complaining about getting all emails in UTF-8. With all languages in UTF-8, the new order will in fact mean: message charset first, then UTF-8 - which I think makes a lot of sense.

If Japanese need the reverse order for some reasons, an exception for euc_jp might be a decent solution, but I can't comment whether this is really needed.

Tokio Kikuchi (tkikuchi) wrote :

Because there is a high possibility that the header/footer is written in the list's preferred language and they are used to it, it may not be desirable to change the default behaviour of Decorate.py. Maybe we should make the behaviour configurable. Please find an alternative patch:

Petr Hroudný (petr-hroudny) wrote :

Yes, the header/footer might be written in the list's preferred language, but the emails sent to the list are supposed to be in that language as well, aren't they? So I see no advantage in trying the LCSET first.

There are however many opposite situations where having MCSET first would be a clear advantage. Many international lists are using English as default language and this means us-ascii. Anyone posting to this list in any variant of iso-8859-*, windows-125* or utf-8 will have a charset which is capable of absorbing all ASCII chars from the header/footer, but trying LCSET first almost always fails just because the poster happens to have some accented character in his name or in his signature.
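
The effect of the candidate order can be sketched in a few lines of Python. This is only an illustration of the idea under discussion, not the code in Decorate.py; `first_fitting_charset`, the footer text and the sample bodies are hypothetical.

```python
def first_fitting_charset(text, candidates):
    """Return the first charset in `candidates` that can encode `text`."""
    for name in candidates:
        try:
            text.encode(name)
        except (UnicodeError, LookupError):
            continue
        return name
    return None


ascii_footer = "-- \nlist@example.org\n"

# English (us-ascii) list, poster writes in iso-8859-2 with an accented name:
# the LCSET attempt fails and the next candidate is used.
body = "Regards,\nJiří\n"
lcset_first = first_fitting_charset(
    body + ascii_footer, ["us-ascii", "iso-8859-2", "utf-8"])

# German (iso-8859-1) list, poster writes in UTF-8:
# LCSET first re-labels the message, MCSET first keeps it as sent.
german = "Viele Grüße\n"
downgraded = first_fitting_charset(german + ascii_footer, ["iso-8859-1", "utf-8"])
preserved = first_fitting_charset(german + ascii_footer, ["utf-8", "iso-8859-1"])

print(lcset_first, downgraded, preserved)
```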

Last but not least, I believe the listserver is supposed to distribute mail 'as is' without any unnecessary conversions. I mentioned two examples above where an unneeded conversion causes trouble. Thus I believe addition of header/footer should be as non-intrusive as possible by default.

Note that "downgrading" of email body from utf-8 to any iso-8859-* variant by default is a serious problem, and should be avoided by all means - since instead of fully unambiguous representation for any character you're going to receive body which couldn't be properly interpreted without additional information (the charset designation). There are 22 languages using iso-8859-* charsets in mailman 2.1.12, which are hit by this.

Mark Sapiro (msapiro) wrote :

I see a potential problem with both Petr's original patch and Tokio's suggested replacement. Namely, suppose a list's preferred_language has character set UTF-8 or some asian character set or even an iso-8859-n character set, and the list's msg_footer contains non-ascii characters from that character set. Further suppose a message is posted to the list with charset=us-ascii and no encoding. The patch will 'upgrade' the character set of the message to that of the list's preferred language because of the non-ascii characters in msg_footer, but it will not encode the body for transfer because the original body was not encoded. This unencoded 8bit transfer may not be appropriate for the msg_footer.

Note also, that I am currently considering changing the Mailman character set for English and the current iso-8859-* languages to UTF-8 for Mailman 2.2.

Tokio Kikuchi (tkikuchi) wrote :

Mark, I attach a new patch to solve the problem. It only keeps CTE for the message-charset.
Petr, mailman's behaviour for adding header/footer has been the same for more than 4 years. No one cared about this until I needed this charset adjustment feature for languages other than Japanese. Changing the default behaviour may cause unwanted trouble. So, let's keep the default charset order at least in mailman-2.1 and consider changing it in 2.2.

Note that the order is configurable in mm_cfg.py with this patch.

Petr Hroudný (petr-hroudny) wrote :

Tokio, I tried your patch, but unfortunately it does not solve our problem. Try setting language to e.g. German and feed mailman with UTF-8/8bit email containing German text. What comes out is ISO-8859-1/QP, i.e. both charset and encoding are different...

You seem to have introduced the 'LCSET first' rule because of Japanese lists. Please note however, that the situation in Japan is very specific - instead of using 8bit/Base64/QP with a standard charset you're converting all email into the special iso-2022-jp charset which is 7bit already and doesn't need any form of content-transfer-encoding in principle. In this situation I understand your preference to convert anything into iso-2022-jp in the first place, but that's clearly not appropriate as a general solution.
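
The 7bit property of iso-2022-jp is easy to check with Python's standard codecs. This is a small illustrative sketch; the sample string and the `is_7bit` helper are hypothetical.

```python
def is_7bit(data):
    """True if no byte has the high bit set."""
    return all(b < 0x80 for b in data)


text = "こんにちは"

iso2022 = text.encode("iso-2022-jp")  # escape sequences plus 7bit bytes
eucjp = text.encode("euc-jp")         # uses bytes >= 0x80
utf8 = text.encode("utf-8")           # uses bytes >= 0x80

print(is_7bit(iso2022), is_7bit(eucjp), is_7bit(utf8))
```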

Therefore I propose to try LCSET first only if the LCSET is euc-jp. That way you will keep your specific requirements and others won't be negatively affected by the side-effects.

Now to Mark's concern: my patch uses None in the situation you described. This will call the encode_7or8bit function from email/encoders.py which will do the right thing - i.e. it will set content-transfer-encoding to 7bit if the body is ascii or iso-2022-* based, but will set it to 8bit if some accented character is present in the footer. This is perfectly valid according to RFC2045 and is heavily used in the field - for example popular MUAs like Thunderbird or Mutt are doing this by default, also e.g. bugzilla or subversion send all emails this way. Mailman passes such messages just fine if no footer is configured.
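
A minimal sketch of that decision, using `encode_7or8bit` from the `email` package shipped with Python 3 (Mailman 2.1 itself runs on Python 2, so this shows the behaviour rather than the patched code). The `chosen_cte` helper and the sample bodies are hypothetical.

```python
from email.encoders import encode_7or8bit
from email.message import Message


def chosen_cte(raw_body):
    """Return the Content-Transfer-Encoding encode_7or8bit picks for raw bytes."""
    msg = Message()
    # The email package keeps undecoded 8bit bytes as surrogate escapes.
    msg.set_payload(raw_body.decode("ascii", "surrogateescape"))
    encode_7or8bit(msg)
    return msg["Content-Transfer-Encoding"]


plain = chosen_cte(b"Hello list\n-- \nfooter\n")
accented = chosen_cte("Hello list\n-- \nPozdrav, Jiří\n".encode("utf-8"))

print(plain, accented)
```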

Thus I don't think mailman should base64-encode (=make unreadable) us-ascii emails just because it adds a footer. If you're concerned about generating 8bit CTE, then the second best solution is to add footer as MIME multipart (wrap=True).

I'm glad to hear that you're considering changing to UTF-8 for Mailman 2.2! BTW, you can also change Russian (koi8-r) as they have been using UTF-8 in e.g. Squirrelmail for quite a long time already.

Tokio Kikuchi (tkikuchi) wrote :

Have you set

DECORATE_CHARSETS = [DECORATE_MCSET, DECORATE_LCSET, 'utf-8']

in mm_cfg.py and executed bin/mailmanctl restart?

Petr Hroudný (petr-hroudny) wrote :

It doesn't solve the problem no matter what the settings are, since your patch only preserves CTE when charset is not changed.

Mark Sapiro (msapiro) wrote :

Petr Wrote:
Tokio, I tried your patch, but unfortunately it does not solve our problem. Try setting language to e.g. German and feed mailman with UTF-8/8bit email containing German text. What comes out is ISO-8859-1/QP, i.e. both charset and encoding are different...

Tokio asked:
Have you set

DECORATE_CHARSETS = [DECORATE_MCSET, DECORATE_LCSET, 'utf-8']

in mm_cfg.py and executed bin/mailmanctl restart?

Petr replied:
It doesn't solve the problem no matter what the settings are, since your patch only preserves CTE when charset is not changed.

Mark asks:
Why is charset changed in your example? With Tokio's patch, if DECORATE_MCSET is first, the outgoing mail should be in the charset and encoding of the original assuming the footer can be encoded in that charset which is true if it's UTF-8.

Petr also wrote:
This will call the encode_7or8bit function from email/encoders.py which will do the right thing - i.e. it will set content-transfer-encoding to 7bit if the body is ascii or iso-2022-* based, but will set it to 8bit if some accented character is present in the footer. This is perfectly valid according to RFC2045 and is heavily used in the field - for example popular MUAs like Thunderbird or Mutt are doing this by default, also e.g. bugzilla or subversion send all emails this way.

Mark replies:
This assumes all MTAs in the delivery path support 8BITMIME. Granted, if a receiving MTA doesn't support 8BITMIME, the sender MAY convert to another encoding (RFC1652), which will work as long as the MTA which doesn't support it is not the initial MTA that Mailman is sending to. In short, delivery of messages containing unencoded bytes >x7F is not guaranteed.

These are hard choices in general, but I am reluctant to make changes that would potentially break Mailman altogether, even if only in rare cases.

Petr Hroudný (petr-hroudny) wrote :

Mark asks:
Why is charset changed in your example? With Tokio's patch, if DECORATE_MCSET is first...

Petr replies:
I tried the default configuration first. Since it changed the encoding from 8bit to QP, I didn't test any other settings since this alone is a no go. Moreover, Tokio's config option again applies to all languages, which is suboptimal, as the non-standard sequence is apparently only needed for Japanese. Thus I believe it should be restricted to Japanese and not extended to all languages by default. And if most locales will be UTF-8, putting LCSET first renders all the following options useless (never reached).

Mark also wrote:
In short, delivery of messages containing unencoded bytes >x7F is not guaranteed.
These are hard choices in general, but I am reluctant to make changes that would potentially break Mailman altogether, even if only in rare cases.

Petr replies:
If you ultimately want to avoid 7bit->8bit upgrading in Mailman, then there are two alternatives:
1) add the footer as MIME multipart and leave the original message 7bit, or
2) apply my patch with a small modification - instead of

+ else:
+     newcset.body_encoding = None

use

+ elif cte == '8bit':
+     newcset.body_encoding = None

Which one do you prefer?
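
The effect of the second alternative can be sketched with the Python 3 `email` package. This is an illustration under assumptions, not the patch itself: only the `elif cte == '8bit'` branch quoted above is modelled, and `rebuild_body` and the sample text are hypothetical.

```python
from email.charset import Charset
from email.message import Message


def rebuild_body(text, charset_name, original_cte):
    """Re-create a text part after decoration, keeping 8bit if it was 8bit."""
    charset = Charset(charset_name)
    if original_cte == "8bit":
        # No body encoding: the email package then falls back to
        # encode_7or8bit and labels the part 7bit or 8bit.
        charset.body_encoding = None
    msg = Message()
    msg.set_payload(text, charset)
    return msg


decorated = "Viele Grüße\n-- \nlist footer\n"

kept = rebuild_body(decorated, "utf-8", "8bit")     # original was 8bit
default = rebuild_body(decorated, "utf-8", "7bit")  # no 7bit -> 8bit upgrade

print(kept["Content-Transfer-Encoding"], default["Content-Transfer-Encoding"])
```

With the plain `else:` variant, the second call would also produce an 8bit part, which is the upgrade Mark is concerned about.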

Tokio Kikuchi (tkikuchi) wrote :

> Petr replied:
> It doesn't solve the problem no matter what the settings are, since your patch only preserves CTE when charset is not changed.

> Mark asks:
> Why is charset changed in your example? With Tokio's patch, if DECORATE_MCSET is first...

> Petr replies:
> I tried the default configuration first. Since it changed the encoding from 8bit to QP, I didn't test any other settings since this alone is a no go.

So, you didn't try all the features of my patch. If you insist on changing the default behaviour, you ask too much. Just try configuring it in mm_cfg.py. I believe it is a rare case that both German and Japanese lists are on the same list server. Or do you want to configure this on a per-list basis? (Well, it is not impossible. It just needs more of my spare time.)

Petr Hroudný (petr-hroudny) wrote :

Please read again what I wrote.

1) Your patch changed transfer-encoding from 8bit to QP for no reason. This is a no go.

2) You introduced a workaround for Japanese lists and now you insist on keeping it for the whole world although I clearly demonstrated it's causing problems at least here in Europe. I also explained that it's unusable with UTF-8 languages, and there are 10 of them already in Mailman 2.1.12. Please be so kind and keep Japanese workarounds local to your language.

Tokio Kikuchi (tkikuchi) wrote :

> 1) Your patch changed transfer-encoding from 8bit to QP for no reason. This is a no go.

Because you didn't try mm_cfg configuration (or, failed to restart mailman).

Petr Hroudný (petr-hroudny) wrote :

No, it's because your patch keeps content-transfer-encoding only for MCSET, but not for other cases. This is inappropriate.

Mark's concern was about upgrading from 7bit to 8bit, but what you implemented rejects also valid cases i.e. 8bit -> 8bit.

Please see above the proper fix for Mark's concern (elif cte=='8bit':)

Tokio Kikuchi (tkikuchi) wrote :

If you apply my decorate2.patch, configure mm_cfg.py, restart mailman, and send UTF-8/8bit mail to an ISO-8859-1 list, then Decorate.py first checks the order of mcset/lcset, finds you have set mcset first, tries to decorate the message with header/footer in mcset (UTF-8) and, because cs==mcset, checks the cte and sets cs.body_encoding=None. You will happily get a UTF-8/8bit message from the list.

Am I wrong ?

Petr Hroudný (petr-hroudny) wrote :

In this particular case, i.e. when mcset is UTF-8, it will work. It won't work in many other situations:

- when you send iso-8859-1/8bit mail to iso-8859-2 list
- when you send iso-8859-1/8bit mail to utf-8 list

In all those cases the mail will leave as utf-8/base64 which is clearly wrong.
CTE=8bit should be preserved, that's what this bug is all about.

Tokio Kikuchi (tkikuchi) wrote :

These are not what you have asked before.

In these cases, the new message is highly dependent on the header/footer contents and should not keep its content-transfer-encoding, I believe. A possible workaround is to use ASCII only in header/footer.

Also, if you want to process the delivered message automatically, your program should be prepared for MIME encoded message. This, we have done for automated processing of -request messages.

Petr Hroudný (petr-hroudny) wrote :

Please see my initial posting in this bug and also my patch. There's no single word stating CTE=8bit should only be preserved in some cases. Automated processing was just one of the examples, high spam scores for base64 and unnecessary mail growing were others. That list is not exhaustive of course.

CTE=8bit states the part is 8bit. No less, no more. There's absolutely no reason to replace it with QP or Base64 when you change charset.

Once more, I posted yesterday the proper fix for Mark's concern (elif cte=='8bit':) so I really fail to understand why you're still trying to push suboptimal solutions which don't fully resolve the problem.

Tokio Kikuchi (tkikuchi) wrote :
Petr Hroudný (petr-hroudny) wrote :

I'm uploading a new patch. Corrections from Tokio's decorate3:

- Japanese workarounds should not be enforced for other languages, LCSET can't be first for UTF-8 languages
- better exception catching synced with Scrubber.py
- preserve 8bit when it was seen on input

Petr Hroudný (petr-hroudny) wrote :
Barry Warsaw (barry) wrote :

I'm sorry, I'm finally catching up on this thread and I may be missing something important. This is only for the case where header and footer is appended to the main body, and not attached via MIME, right? When attached via MIME, it's always going to do the right thing, correct?

If so, then I think we're going about this the wrong way, and I'm quite concerned about changing this behavior in MM2.1.

Most mailing lists won't care, which I think is evidenced by the fact that this behavior has been working this way, largely successfully for a long time. Some mailing lists care a lot though, so I would rather see an option which tells the mailing list to /always/ attach via MIME any headers and footers. The only reason it isn't done in this way in the first place is because it confuses some people in some MUAs which won't show those attachments inline. Those people will not be able to see the header and footers and won't know to expand the attachments to see them. But I think for the few lists that really care about this, that's an okay trade-off because the list owner can inform people about what's going on.

ISTM a much simpler patch for MM2.1 would be to add "always attach" as a list option. Then, if and when the default charsets are changed in MM2.2 (and probably also 3.0), we can readdress whether "always attach" is the right default and whether we really need "add by append" at all.

Petr Hroudný (petr-hroudny) wrote :

Yes, when the footer is attached via MIME, the original message is kept 'as is' - i.e. both charset and CTE are correctly preserved.

Anyway, it would be a pity to always do this for every list, since there are a lot of situations where the footer could be safely added without any changes to the message properties. These are:

- message and footer are in similar languages/charset (iso-8859-1 covers most of western Europe)
- footer is in pure ASCII (fits into any message, a typical case for global mailinglists)
- message is in UTF-8 (any footer fits)

Thus I'd say mailman should first try to append the footer and encode the result into the message charset.
I also fully agree with Barry that if this fails, it's far better to attach the footer via MIME than to change the charset/CTE of the message - and this will also solve Mark's concern about changing CTE from 7bit into 8bit.

So my final proposal (at least for MM2.2 and above) is:

1) try only MCSET, if this succeeds, preserve CTE according to my patch
2) if the above fails, attach footer via MIME

The above will preserve message charset and CTE at all times, and will avoid problems with MUAs not showing attachments in most cases.
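
The two steps above can be sketched with the Python 3 `email` package. This is an illustration under assumptions, not a patch: `text_part` and `add_footer` are hypothetical helpers, the attached footer is assumed to be sent as UTF-8, and a 7bit original is assumed to fall back to MIME rather than being upgraded to 8bit.

```python
from email.charset import Charset
from email.message import Message
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def text_part(text, charset_name, original_cte):
    """Build a text part in the given charset, keeping a 7bit/8bit CTE."""
    charset = Charset(charset_name)
    if original_cte in ("7bit", "8bit"):
        charset.body_encoding = None  # let encode_7or8bit label the part
    part = Message()
    part.set_payload(text, charset)
    return part


def add_footer(body, mcset, original_cte, footer):
    combined = body + footer
    try:
        encoded = combined.encode(mcset)
        fits = True
    except UnicodeError:
        fits = False
    # Assumption: never turn a 7bit original into an 8bit message.
    if fits and original_cte == "7bit" and any(b >= 0x80 for b in encoded):
        fits = False
    if fits:
        # Step 1: footer fits into MCSET, append it and keep the CTE.
        return text_part(combined, mcset, original_cte)
    # Step 2: leave the original part alone and attach the footer via MIME.
    outer = MIMEMultipart("mixed")
    outer.attach(text_part(body, mcset, original_cte))
    outer.attach(MIMEText(footer, "plain", "utf-8"))
    return outer


footer = "-- \nČeský footer\n"
appended = add_footer("Viele Grüße\n", "utf-8", "8bit", footer)
attached = add_footer("Hello\n", "us-ascii", "7bit", footer)

print(appended["Content-Transfer-Encoding"], attached.is_multipart())
```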

Now the question is what to do for MM2.1. I still believe that trying LCSET first is just plain wrong. An MTA/listserver is not supposed to modify message charset during transfer, as this breaks the assumption that the sender is always able to read messages in the charset it uses for sending. Due to this, people in the iso-8859-2 region are often getting garbled messages from iso-8859-1 lists or vice versa - a typical example is having e.g. "Č" replaced by "C(" or "?".

I'm quite surprised that no one has complained by now, but this probably confirms your point that most mailing lists simply don't care. If that's really the case, then changing this back to MCSET shouldn't be a problem, unless someone actually "abuses" the footer to do things it was never meant for. Of course I have no problem keeping an exception for Japanese, which needs LCSET first for very specific reasons not applicable to any other language.
