subject encoding bug

Bug #1582819 reported by carlos
This bug affects 1 person
Affects: GNU Mailman
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

When Mailman receives a message whose header contains =?ISO-8859-2?Q? twice, the second =?ISO-8859-2?Q? gets encoded and looks like this, for example:
=3F=3D=3D=3FI?==?iso-8859-2?q?SO-8859-2=3FQ=3F
The newly inserted encoding marker lands between the "I" and "SO" of the original marker (which Mailman does not recognize).

Revision history for this message
Mark Sapiro (msapiro) wrote :

I am unable to duplicate exactly what you describe, but I do see an issue in that this:

Subject: =?ISO-8859-2?Q?Part_=31_?==?ISO-8859-2?Q?Part_=32_?=

gets changed to

Subject: [List1] =?iso-8859-2?q?Part_1_=3F=3D=3D=3FISO-8859-2=3FQ=3FPart_2?=

However, the incoming Subject: is non-compliant. RFC 2047 section 5(1) says in part:

Ordinary ASCII text and 'encoded-word's may appear together in the same header field. However, an 'encoded-word' that appears in a header field defined as '*text' MUST be separated from any adjacent 'encoded-word' or 'text' by 'linear-white-space'.

If I add a space as in

Subject: =?ISO-8859-2?Q?Part_=31_?= =?ISO-8859-2?Q?Part_=32_?=

the result is

Subject: [List1] =?iso-8859-2?q?Part_1_Part_2?=

which is correct.

If you see this issue with an RFC 2047 compliant encoded Subject: or other header, please provide the exact header that causes the issue.
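
One quick way to check a candidate header before reporting it is to run it through Python's standard email.header module. This is only a sketch, assuming a current Python 3; decode_subject is a hypothetical helper name, and results on malformed input may differ between versions of the email package, so only the compliant form is exercised here.

```python
from email.header import decode_header, make_header


def decode_subject(value):
    """Decode the RFC 2047 encoded words in a header value to a str."""
    return str(make_header(decode_header(value)))


# The compliant form from the comment above: two encoded words
# separated by linear white space.
compliant = "=?ISO-8859-2?Q?Part_=31_?= =?ISO-8859-2?Q?Part_=32_?="
print(decode_subject(compliant))
```

The white space that only separates two encoded words is dropped during decoding, so the two words are joined.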

Changed in mailman:
status: New → Incomplete
Revision history for this message
carlos (carlo7) wrote :

Fixed the sender software's encoding issues...

Changed in mailman:
status: Incomplete → Invalid
Revision history for this message
carlos (carlo7) wrote :

New, fixed subject example sent (from the sent file directly):
=?ISO-8859-2?Q?PIVOT_kutat=E1sfinan?= =?ISO-8859-2?Q?sz=EDroz=E1si_adatb=E1zi?=s
Received back from mailman (from the mailfile directly):
[Xxx-yyyyyyyy] =?iso-8859-2?q?PIVOT_kutat=E1sfinan_=3D=3FISO-8859?=
        =?iso-8859-2?q?-2=3FQ=3Fsz=3DEDroz=3DE1si=5Fadatb=3DE1zi=3F=3Ds?=

So something is not correct :(

Changed in mailman:
status: Invalid → New
Revision history for this message
Mark Sapiro (msapiro) wrote :

=?ISO-8859-2?Q?PIVOT_kutat=E1sfinan?= =?ISO-8859-2?Q?sz=EDroz=E1si_adatb=E1zi?=s

is still non-compliant.

It is too long: RFC2047 section 2 says in part

   While there is no limit to the length of a multiple-line header
   field, each line of a header field that contains one or more
   'encoded-word's is limited to 76 characters.

but that is not the issue here. The issue is the same as before. The final 's' is not separated from the preceding encoded word by linear white space. If the header were

Subject: =?ISO-8859-2?Q?PIVOT_kutat=E1sfinan?=
 =?ISO-8859-2?Q?sz=EDroz=E1si_adatb=E1zi?= s

Mailman would produce

Subject: [List1] =?iso-8859-2?q?PIVOT_kutat=E1sfinansz=EDroz=E1si_adatb=E1?=
 =?iso-8859-2?q?zi_s?=

which is not quite what you want as it decodes to

Subject: [List1] PIVOT kutatásfinanszírozási adatbázi s

and you presumably want 'adatbázis', not 'adatbázi s'. There are always whitespace issues when mixing encoded words and plain text in one header. The proper encoding of that header is

Subject: =?ISO-8859-2?Q?PIVOT_kutat=E1sfinan?=
 =?ISO-8859-2?Q?sz=EDroz=E1si_adatb=E1zis?=
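
For senders that build headers in Python, one way to avoid mixing encoded words and plain text is to let email.header.Header encode the whole subject. A minimal sketch, assuming Python 3 and that the subject is representable in ISO-8859-2; the variable names are illustrative only.

```python
from email.header import Header, decode_header, make_header

subject = "PIVOT kutatásfinanszírozási adatbázis"

# Encode the entire subject, trailing "s" included, so no plain text
# ends up glued to the closing "?=" of an encoded word.
encoded = Header(subject, charset="iso-8859-2").encode()
print(encoded)

# Round trip: decoding should give back the original subject.
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```

Because every character goes through the encoder, the question of how much white space belongs to the text never comes up.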

Granted, various MUAs are more forgiving of the defects, but the Mailman issue is really in the underlying Python email package which insists that the terminating ?= be followed by white space in order that the encoded word be recognized as such.
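
The white space requirement can be illustrated with a small regular expression. STRICT_ENCODED_WORD and find_encoded_words are hypothetical names, and the pattern is an assumption written to match the behavior described above, not a copy of the email package's code.

```python
import re

# Assumed pattern: an encoded word only counts if its closing "?=" is
# followed by a space, a tab, or the end of the string.
STRICT_ENCODED_WORD = re.compile(
    r"=\?([^?]*?)\?([qQbB])\?(.*?)\?=(?=[ \t]|$)"
)


def find_encoded_words(value):
    """Return the encoded-text part of each token the strict pattern accepts."""
    return [m.group(3) for m in STRICT_ENCODED_WORD.finditer(value)]


spaced = "=?ISO-8859-2?Q?Part_=31_?= =?ISO-8859-2?Q?Part_=32_?="
glued = "=?ISO-8859-2?Q?Part_=31_?==?ISO-8859-2?Q?Part_=32_?="
trailing = ("=?ISO-8859-2?Q?PIVOT_kutat=E1sfinan?= "
            "=?ISO-8859-2?Q?sz=EDroz=E1si_adatb=E1zi?=s")

print(find_encoded_words(spaced))    # two separate words
print(find_encoded_words(glued))     # one word that swallows the second marker
print(find_encoded_words(trailing))  # only the first word is recognized
```

Under this pattern the glued form is read as a single encoded word whose text contains the second "=?ISO-8859-2?Q?" marker literally, which is consistent with that marker being re-encoded in the output shown earlier.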

Changed in mailman:
status: New → Invalid
Revision history for this message
Harka Győző (carlos-gamma) wrote : Re: [Bug 1582819] Re: subject encoding bug

> and you presumably want 'adatbázis', not 'adatbázi s'. There are always
> whitespace issues when mixing encoded words and plain text in one
> header. The proper encoding of that header is
>
> Subject: PIVOT kutatásfinanszírozási adatbázis
>
> Granted, various MUAs are more forgiving of the defects, but the Mailman
> issue is really in the underlying Python email package which insists
> that the terminating ?= be followed by white space in order that the
> encoded word be recognized as such.
>

Just to be clear: if I have a Q-encoded word as in RFC 2047, followed by a \n
and a space, Mailman puts an extra space in place.
The only solution is to encode every "word", even where that is not needed.
RFC 2047 allows encoded words to be mixed with non-encoded text.

"Ordinary ASCII text and 'encoded-word's may appear together in the
    same header field. However, an 'encoded-word' that appears in a
    header field defined as '*text' MUST be separated from any adjacent
    'encoded-word' or 'text' by 'linear-white-space'"

So this is a BUG in the underlying Python package.

Revision history for this message
Mark Sapiro (msapiro) wrote :

There are, and probably always will be, discrepancies in the way MUAs handle header folding/unfolding and RFC 2047 decoding. There is a possible ambiguity in the part of RFC 2047 you quote. 'linear-white-space' is any number of white space characters, so if part of the header is an encoded word followed by some white space followed by ordinary ASCII text, how much of the white space is the separating linear-white-space and how much is leading white space in the text field? Thus the only unambiguous way to represent this is to never begin a text field with white space.

This is further complicated by header folding and unfolding. RFC5322 sec 2.2.3 is clear on how headers should be folded and unfolded. Folding is done per "The general rule is that wherever this specification allows for folding white space (not simply WSP characters), a CRLF may be inserted before any WSP." and unfolding per "Unfolding is accomplished by simply removing any CRLF that is immediately followed by WSP." This is clear, but contradicts the original RFC822 sec 3.1 which says "The general rule is that wherever there may be linear-white-space (NOT simply LWSP-chars), a CRLF immediately followed by AT LEAST one LWSP-char may instead be inserted." and "Unfolding is accomplished by regarding CRLF immediately followed by a LWSP-char as equivalent to the LWSP-char." This effectively says that when folding you can insert multiple whitespace characters, but when unfolding, you don't remove any.
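
The RFC 5322 unfolding rule quoted above is simple enough to sketch directly. The function name unfold is hypothetical, and the sketch assumes the raw header still uses CRLF line endings.

```python
import re


def unfold(raw_header):
    """Remove any CRLF that is immediately followed by a space or tab."""
    return re.sub(r"\r\n(?=[ \t])", "", raw_header)


folded = ("Subject: =?ISO-8859-2?Q?PIVOT_kutat=E1sfinan?=\r\n"
          " =?ISO-8859-2?Q?sz=EDroz=E1si_adatb=E1zis?=")
print(unfold(folded))
```

Only the CRLF is removed; the white space that followed it stays in the unfolded value.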

Thus the standards have been buggy, and even the best-intentioned MUAs have to guess about what to do.
