i18n: Header Filter Rules (& fix) - rules don't match if header characters aren't representable in cset of list's preferred language.

Bug #558155 reported by hatukanezumi
Affects: GNU Mailman
Importance: Low
Assigned to: Mark Sapiro

Bug Description

- Decode headers to be matched.

- Normalize header & pattern so that compatibility characters
  (fullwidth forms of ASCII, Compatibility Ideographs etc.)
  will be matched. Normalization Form KC (NFKC) is used.
  Note: This feature is available on Python >= 2.3.

- Fix: Ignore empty lines in pattern, since an empty pattern
  would otherwise match every string.
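
The three points above can be sketched roughly as follows. This is an illustration written for Python 3, not the attached patch and not Mailman's code (Mailman 2.1 runs on Python 2); the function names `decode_header_value` and `header_matches` are hypothetical.

```python
import re
import unicodedata
from email.header import decode_header


def decode_header_value(raw):
    """Decode RFC 2047 encoded-words in a header value to a str."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            try:
                chunk = chunk.decode(charset or "ascii", "replace")
            except LookupError:
                # Unknown charset label: fall back rather than fail.
                chunk = chunk.decode("ascii", "replace")
        parts.append(chunk)
    return "".join(parts)


def header_matches(raw_header, pattern_text):
    """Return True if any non-empty pattern line matches the header."""
    header = unicodedata.normalize("NFKC", decode_header_value(raw_header))
    for line in pattern_text.splitlines():
        if not line.strip():
            # An empty regex matches every string, so skip blank lines.
            continue
        pattern = unicodedata.normalize("NFKC", line)
        if re.search(pattern, header, re.IGNORECASE):
            return True
    return False
```

With this sketch, a Subject written in fullwidth letters (U+FF33 U+FF21 U+FF2C U+FF25) is folded to "SALE" by NFKC, so a plain ASCII pattern such as `sale` matches it, while a pattern consisting only of blank lines matches nothing.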

hatukanezumi (hatukanezumi-users-sf) wrote :


Error handling has been added.

hatukanezumi (hatukanezumi-users-sf) wrote :

The file mailman-2.1.5-unicode_headermatch.patch was added: for 2.1.5-release

Mark Sapiro (msapiro)
summary: - i18n: Header Filter Rules (& fix)
+ i18n: Header Filter Rules (& fix) - rules don't match if header
+ characters aren't representable in cset of list's preferred language.
Mark Sapiro (msapiro) wrote :

Portions of this patch, but not the Unicode normalization, have been applied or otherwise addressed in MM versions 2.1.6 through 2.1.15.

I intend to deal with the spirit of the rest by converting the headers to the cset of the list's preferred language using encode(errors='backslashreplace') instead of encode(errors='replace'). In this way, these characters will be converted to '\uxxxx' escapes rather than '?', and header_filter_rules patterns can be constructed to match them.
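
To illustrate the difference between the two error handlers, here is a small Python 3 sketch (Mailman 2.1 itself runs on Python 2, and this is not the committed code). The charset `us-ascii`, the sample subject, and the pattern `ESCAPE_RE` are assumptions chosen only for the example.

```python
import re

# Two Chinese characters followed by ASCII text.
subject = "\u4e2d\u6587 spam"
# Assumed charset of the list's preferred language.
charset = "us-ascii"

# errors='replace' turns every unencodable character into '?'.
lossy = subject.encode(charset, errors="replace").decode(charset)

# errors='backslashreplace' keeps the code point as a literal \uxxxx escape.
escaped = subject.encode(charset, errors="backslashreplace").decode(charset)

# Hypothetical filter pattern: a literal backslash, 'u', and four hex digits.
ESCAPE_RE = re.compile(r"\\u[0-9a-f]{4}", re.IGNORECASE)

print(lossy)
print(escaped)
print(bool(ESCAPE_RE.search(escaped)), bool(ESCAPE_RE.search(lossy)))
```

The pattern can find the escaped form but has nothing to match in the lossy form, where the original characters are indistinguishable from literal question marks.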

Changed in mailman:
assignee: nobody → Mark Sapiro (msapiro)
importance: Undecided → Low
milestone: none → 2.1.23
status: New → In Progress
Mark Sapiro (msapiro) wrote :

The committed fix together with prior changes implements a few of the things in this patch. It does not do the Unicode normalization portion of this patch. I was mostly trying to address the issue of trying to recognize Chinese spam by detecting Chinese characters in message headers.

I understand that the normalization can be important for actually matching specific things in subjects in, say, Japanese on a Japanese-language list. If that is still desired, please submit a new patch against the current code base.

Changed in mailman:
status: In Progress → Fix Committed
Mark Sapiro (msapiro) wrote :

I have committed another change at http://bazaar.launchpad.net/~mailman-coders/mailman/2.1/revision/1664 which does the conversion to Unicode and the Unicode normalization, so everything in this patch has now been committed, albeit in a somewhat different way. Refer to the NEWS item in rev 1664 for more details.

Mark Sapiro (msapiro)
Changed in mailman:
status: Fix Committed → Fix Released