edithtml.py saves en templates using html entity reference with raw iso-8859-1 character

Bug #1779445 reported by Yasuhito FUTATSUKI at POEM
Affects: GNU Mailman
Status: Fix Released
Importance: Medium
Assigned to: Mark Sapiro

Bug Description

In Mailman's web administrative interface, the edithtml page saves en language templates with raw iso-8859-1 characters if the template uses HTML entity references like "&nbsp;".

For example, if "General list information page" (templates/en/listinfo.html), which contains "&nbsp;", is saved without modification from the web UI, the list's template en/listinfo.html will contain a raw '\xa0' character. If you add "<!-- &copy;&reg; -->" in the text area and submit the changes twice, it will turn into "<!-- \xa9\xae -->".

I'm not sure the attached patch is a good way to fix this, because I don't know whether these entity references always map to ISO-8859-1 characters, but I attach it for reference.
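
The round trip in the example above can be illustrated with a small simulation. This is only a sketch in Python 3, not Mailman's code: `browser_round_trip` is a hypothetical helper that assumes the browser decodes character references inside a textarea before showing and submitting the text, and that the form is submitted as iso-8859-1.

```python
import html


def browser_round_trip(textarea_markup, charset="iso-8859-1"):
    """Hypothetical model of what a browser submits for textarea content."""
    # The browser decodes character references in the textarea markup,
    # so the text the user sees (and submits) holds the real characters.
    displayed = html.unescape(textarea_markup)
    return displayed.encode(charset)


template = "a&nbsp;b <!-- &copy;&reg; -->"

# Template text placed in the textarea without escaping "&":
# the entity references come back as raw iso-8859-1 bytes.
saved_unescaped = browser_round_trip(template)

# Template text escaped first ("&" becomes "&amp;"):
# the entity references survive the round trip unchanged.
saved_escaped = browser_round_trip(html.escape(template, quote=False))
```

Under these assumptions, `saved_unescaped` contains the bytes `\xa0`, `\xa9` and `\xae`, while `saved_escaped` is byte-for-byte the original template text.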

Yasuhito FUTATSUKI at POEM (futatuki) wrote :
Mark Sapiro (msapiro) wrote :

Actually, this behavior was caused by rev. 1188. Unfortunately, I don't recall specifically why I made that change. I will attach a patch of what I have so far. Because the call to websafe comes from htmlformat.TextArea(), I need more testing to see if the other uses of TextArea are adversely impacted.

Changed in mailman:
assignee: nobody → Mark Sapiro (msapiro)
importance: Undecided → Medium
milestone: none → 2.1.28
status: New → In Progress
Mark Sapiro (msapiro) wrote :

Revised possible fix patch. I think the main reason for not double escaping HTML entities was to make HTML text displayed in the admindb interface more readable. This patch will avoid double escaping only in readonly TextArea.
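
The idea of skipping double escaping only for read-only text areas might look roughly like the following. This is a minimal Python 3 sketch, not the actual patch: `escape_text` and `render_textarea` are hypothetical names and do not reproduce Mailman's `websafe` or `htmlformat.TextArea` signatures.

```python
import html
import re

# Matches an entity reference whose "&" has already been escaped to "&amp;".
_ESCAPED_ENTITY = re.compile(r"&amp;(#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def escape_text(text, keep_entities=False):
    """Escape &, < and >; optionally leave existing entity references alone."""
    escaped = html.escape(text, quote=False)
    if keep_entities:
        # Undo the double escaping so "&nbsp;" stays "&nbsp;" for display.
        escaped = _ESCAPED_ENTITY.sub(r"&\1;", escaped)
    return escaped


def render_textarea(name, text, readonly=False):
    """Render a textarea; only read-only areas skip double escaping."""
    body = escape_text(text, keep_entities=readonly)
    attrs = ' name="%s"' % html.escape(name, quote=True)
    if readonly:
        attrs += " readonly"
    return "<textarea%s>%s</textarea>" % (attrs, body)
```

An editable area rendered this way round-trips entity references intact, while a read-only area shows them decoded, which is the readability goal mentioned for the admindb interface.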

Yasuhito FUTATSUKI at POEM (futatuki) wrote :

I understand that your fix preserves character entity references in the text of a TextArea through the POST method, and I confirmed that it has been fixed in Rev 1788. Thank you.

I think there is one more problem, concerning the charset of query strings from Text or TextArea fields, whose content is not restricted to ASCII text for any language. If the text contains a raw non-ASCII character, its charset depends on the browser's implementation, even though the HTML 4.01 specification says the default is "UNKNOWN", which means "User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element." (https://www.w3.org/TR/html401/interact/forms.html)

It seems this is not a problem in most cases with current browsers that respect the specification, but it is still a problem in some cases. At least, when I put a non-breaking space ('\xa0' in iso-8859-1) into a Text field in a us-ascii form using Firefox 61 on FreeBSD, it was encoded as '%A0' in the query string, although Unicode characters are otherwise encoded as numeric character references. The code that handles this special case for 'us-ascii' is found in Utils.canonstr(), so it may need to be used in some places, including the TextArea in edithtml.py. (Though using non-ASCII characters in a us-ascii form is irregular, of course.)
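
The two encodings observed above can be contrasted in a short Python 3 sketch. Both helpers are hypothetical and are not Mailman's `Utils.canonstr()`: `encode_field` models a browser that replaces unencodable characters with numeric character references, and `decode_field` assumes a server that falls back to iso-8859-1 when the submitted bytes are not valid in the form's charset.

```python
import html
from urllib.parse import quote, unquote_to_bytes


def encode_field(text, charset):
    """Model a browser that sends unencodable characters as &#NNN; references."""
    raw = text.encode(charset, errors="xmlcharrefreplace")
    return quote(raw, safe="")


def decode_field(value, charset, fallback="iso-8859-1"):
    """Decode a query-string value, tolerating raw high bytes such as %A0."""
    raw = unquote_to_bytes(value)
    try:
        text = raw.decode(charset)
    except UnicodeDecodeError:
        # Assumption: bytes outside the declared charset are iso-8859-1.
        text = raw.decode(fallback)
    # Resolve numeric (and named) character references to characters.
    return html.unescape(text)


by_the_model = encode_field("\xa0", "us-ascii")   # reference form
as_observed = "%A0"                               # what the report saw
```

With these assumptions, `decode_field` maps both `by_the_model` and `as_observed` to the same non-breaking space. One caveat of this sketch is that a user who literally typed `&#160;` cannot be distinguished from a browser-generated reference.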

Mark Sapiro (msapiro) wrote :

I think the issue in the original description is fixed and that described in comment #5 is a different issue. If you think this is a significant issue that needs to be fixed, please open a new bug for it.

Yasuhito FUTATSUKI at POEM (futatuki) wrote :

I don't think it is significant, as I mentioned in the parenthetical last sentence of comment #5, so I won't open a bug for it. I'm sorry to bother you.

Mark Sapiro (msapiro)
Changed in mailman:
status: In Progress → Fix Released